Onto the second talk of the day. Steve Capper is going to tell us about the good bits of Java. They do exist. [Audience] Could this have been a lightning talk? [Audience laughter] Believe it or not, we've got some good stuff here. I was as skeptical as you guys when I first looked. First, many apologies for not attending the mini-conf last year; I was unfortunately ill on the day I was due to give this talk. Let me figure out how to use a computer. Sorry about this. There we go; it's because I've not woken up.

Last year I worked at Linaro in the Enterprise group and we performed analysis on so-called 'Big Data' application sets. As many of you know, quite a lot of these big data applications are written in Java. I'm from ARM and we were very interested in 64-bit ARM support. So this is mainly AArch64 examples for things like assembler, but most of the messages are pertinent for any architecture. These good bits are shared between most if not all the architectures.

Whilst trying to optimise a lot of these big data applications I stumbled across quite a few things in the JVM and I thought 'actually, that's really clever; that's really cool'. So I thought that would make a good basis for an interesting talk. This talk is essentially some of the clever things I found in the Java Virtual Machine; these optimisations are in OpenJDK. The source is all there, readily available and in play now. I'm going to finish with some of the optimisation work we did with Java.

People who know me will know I'm not a Java zealot. I don't particularly believe in programming in one language over another, so to make it clear from the outset: I'm not attempting to convert anyone into a Java programmer. I'm just going to highlight a few salient things in the Java Virtual Machine which I found to be quite clever and interesting, and I'll try and talk through them with my understanding of them. Let's jump straight in and start with an example.
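The program on the slide is along these lines. This is a reconstructed sketch of my own, not the speaker's exact code; the chunked mapping is my addition, since `FileChannel.map` is limited to 2 GiB per mapping and the talk uses a 20 TB sparse file.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;

public class Sha1Sum {
    static String sha1Hex(String path) throws Exception {
        // Instantiate a SHA-1 message digest from the crypto providers.
        MessageDigest md = MessageDigest.getInstance("SHA-1");

        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            long size = channel.size();
            long pos = 0;
            while (pos < size) {
                // The Java equivalent of an mmap: map read-only, at most
                // 1 GiB at a time (a single mapping is capped at 2 GiB).
                long chunk = Math.min(size - pos, 1 << 30);
                MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, pos, chunk);
                // Feed the mapped data straight into the crypto engine.
                md.update(buf);
                pos += chunk;
            }
        }

        // Spit out the SHA-1 hash as hex.
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha1Hex(args[0]));
    }
}
```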
This is a minimal example for computing a SHA1 sum of a file. I've omitted some of the checking at the beginning of the function, such as command-line parsing and that sort of thing. I've highlighted the salient points in red. Essentially we instantiate a SHA1 crypto message service digest, and we do the equivalent in Java of an mmap to get it all in memory. Then we just put this data straight into the crypto engine, and at the end of the program we spit out the SHA1 hash. It's a very simple program: it's basically mmap, SHA1, output the hash afterwards.

In order to concentrate on the CPU aspect rather than worry about IO, I decided to cheat a little bit when setting this up: I decided to use a sparse file. As many of you know, a sparse file is a file where not all the contents are necessarily stored on disc; the assumption is that the bits that aren't stored are zero. For instance, on Linux you can create a 20TB sparse file on a 10MB file system and use it as normal. Just don't write too much to it, otherwise you're going to run out of space. The idea behind using a sparse file is that I'm just focusing on the computational aspects of the SHA1 sum; I'm not worried about the file system or anything like that. I don't want to worry about the IO. I just want to focus on the actual compute. In order to set up a sparse file I used the following runes. The important points are that you seek, and that you set a count, otherwise you'll fill your disc up.

I ran this against, firstly, the native sha1sum command that's built into Linux, and normalised those results to 1.0. With an older version of OpenJDK the Java program was 1.09 times slower than the reference command. That's quite good. Then I used the new OpenJDK, which is now the current JDK as this is a year on, and we got 0.21. It's significantly faster. And I've stressed that I've done nothing surreptitious in the Java program.
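For reference, the sparse-file setup can also be done from Java rather than shell. This is my own sketch, not from the talk (the talk used shell runes with a seek and a count); the key is the same seek-past-the-end trick, and it relies on the file system supporting holes.

```java
import java.io.RandomAccessFile;

public class MakeSparse {
    // Create a file of the given logical size whose contents are a hole:
    // seek past the end and write a single byte, so on file systems that
    // support sparse files almost no disk blocks are actually allocated.
    static void makeSparse(String path, long size) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.seek(size - 1);
            raf.write(0);
        }
    }

    public static void main(String[] args) throws Exception {
        // 20 GiB logical size, in the spirit of the talk's setup.
        makeSparse(args[0], 20L * 1024 * 1024 * 1024);
    }
}
```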
It is mmap, compute, spit result out. But the OpenJDK has essentially got some more context information; I'll talk about that as we go through.

Before this work I had a very simplistic view of Java. Traditionally Java is taught as a virtual machine that runs bytecode. When you compile a Java program it compiles into bytecode. The older versions of the Java Virtual Machine would interpret this bytecode and run through it. Newer versions employ a just-in-time engine and try to compile this bytecode into native machine code. But that is not the only thing that goes on when you run a Java program; there are some extra optimisations as well. So this alone would not account for the newer version of the SHA1 sum being significantly faster than the distro-supplied one.

Java knows about context. It has a class library, and these class libraries have reasonably well-defined purposes. We have classes that provide crypto services. We have sun.misc.Unsafe, which every single project seems to pull in when they're not supposed to. These have well-defined meanings. They do not necessarily have to be written in Java; they come supplied as Java classes, but most JVMs now have a notion of a virtual machine intrinsic. And the virtual machine intrinsic says: please do a SHA1 in the best possible way that your implementation allows. This is something done automatically by the JVM. You don't ask for it. If the JVM knows what it's running on and it's reasonably recent, this will just happen for you for free. And there are quite a few classes that do this. There are quite a few clever things with atomics, there's crypto, and there are mathematical routines as well. Most of these routines in the class library have a well-defined notion of a virtual machine intrinsic and they do run reasonably optimally. They are a subject of continuous optimisation as well. We've got some runes that are presented on the slides here.
These are quite useful if you are interested in how these intrinsics are made. You can ask the JVM to print out a lot of the just-in-time compiled code, and you can ask it to print out the native methods as well as these intrinsics. In this particular case, after sifting through about 5MB of text, I came across this particular SHA1 sum implementation. This is AArch64, and it's employing the cryptographic extensions in the architecture. So it's essentially using the CPU instructions, which would explain why it's faster. But again, it's done all this automatically. This did not require any specific runes or anything to activate. We'll see a bit later on how you can more easily find the hot spots rather than sifting through a lot of assembler. I've mentioned that the cryptographic engine is employed, and this routine was generated at run time as well. This is one of the important things about execution environments like Java: you don't have to know everything at compile time. You know a lot more information at run time, and you can use that, in theory, to optimise.

You can switch off these clever routines. For instance, I've got a deactivation here and we get back to the slower performance we expected. Again, this particular set of routines is present in OpenJDK, I think for all the architectures that support it. We get this optimisation for free on x86 and others as well. It works quite well. That was one surprise I came across: the intrinsics.

One thing I thought would be quite good would be to go through a slightly more complicated example, and use it to explain a lot of other things that happen in the JVM as well. I will spend a bit of time going through this example and explain roughly what it's supposed to be doing. This is an imaginary method that I've contrived to demonstrate a lot of points in the fewest possible lines of code. I'll start with what it's meant to do.
This is meant to be a routine that gets a reference to something and lets you know whether or not it's an image in a hypothetical cache. I'll start with the important thing here: the weak reference. In Java and other garbage-collected languages we have the notion of references. Most of the time when you are running a Java program you have something like a variable name, and within the current execution context that is referred to as a strong reference to the object. In other words: I can see it, I am using it, please don't get rid of it; bad things will happen if you do. So the garbage collector knows not to get rid of it.

In Java and other languages you also have the notion of a weak reference. This is essentially the programmer saying to the virtual machine: "Look, I kind of care about this, but just a little bit. If you want to get rid of it, feel free to, but please let me know." This is why it's used for a CacheClass. For instance, the JVM in this particular case could decide that it's running quite low on memory, this particular xMB image has not been used for a while, and it can garbage collect it. The important thing is how we go about expressing this in the language. We can't just have a reference to the object, because that's a strong reference and the JVM will know it can't get rid of it while the program can actively see it. So we have a level of indirection, which is known as a weak reference.

We have this hypothetical CacheClass that I've devised. At this point it is a weak reference. Then we get it; this is calling the weak reference routine. Now it becomes a strong reference, so it's not going to be garbage collected. When we get to the return path it becomes a weak reference again, because our strong reference has disappeared. The salient points in this example are: we're employing a method to get a reference, and we're checking an item to see if it's null. So let's say that the JVM decided to garbage collect this before we executed the method.
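A sketch of the contrived method, with invented names; the talk shows this on a slide, so this is my reconstruction rather than the original code:

```java
import java.lang.ref.WeakReference;

// Hypothetical classes invented to match the talk's contrived example.
class CachedItem {
    boolean isImage() { return false; } // virtual method, overridable
}

class CacheClass {
    private final WeakReference<CachedItem> ref;

    CacheClass(CachedItem item) {
        // Hold the cached object only weakly: the garbage collector may
        // reclaim it whenever memory gets tight.
        this.ref = new WeakReference<>(item);
    }

    // The getter: WeakReference.get() hands back a strong reference,
    // or null if the GC has already reclaimed the object.
    CachedItem getItem() { return ref.get(); }
}

class Cache {
    static boolean isImageInCache(CacheClass entry) {
        CachedItem item = entry.getItem(); // generics mean a checkcast in bytecode
        if (item == null) {
            return false;      // too late: the GC beat us to it
        }
        return item.isImage(); // invokevirtual; the strong reference is
                               // held until we return, then disappears
    }
}
```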
The weak reference class is still valid, because we've got a strong reference to it, but the actual object behind it is gone. If we're too late and the garbage collector has killed it, it will be null and we return. So it's a level of indirection: does this still exist? If so, can I please have it? Then operate on it as normal, and on return it becomes a weak reference again. This example program is quite useful when we look at how it's implemented in the JVM, and we'll go through a few things now.

First off we'll go through the bytecode. The only point of this slide is to show it's roughly the same as the source. We get our variable. We use our getter. This bit, the checkcast, is extra. The reason it's extra is because we're using the equivalent of a template in Java, and the way that's implemented is that everything is basically cast to an object, so it requires extra compiler information, and this is the extra check. For the rest of it: we load the reference, we check to see if it is null, and if it's not null we invoke a virtual function - is it the image? - and we return as normal. Essentially the point I'm trying to make is that when we compile this to bytecode, this execution happens, this null check happens, this execution happens, and we return. In the actual Java class files we've not lost anything.

This is what it looks like when it's been JIT'd. Now we've lost lots of things. The JIT has done quite a few clever things which I'll talk about. First off, if we look down here there's a single branch, and it's only taken if our checkcast failed. We've got comments on the right-hand side. Our get method has been inlined, so we're no longer calling it. We seem to have lost our null check; that's just gone. And we've got a getfield as well: that's no longer a method call, that's been inlined too. We've also got some other cute things. Those more familiar with AArch64 will notice that the pointers we're using are 32-bit, not 64-bit.
What we're doing is getting a pointer, shifting it left 3 and widening it to a 64-bit pointer. So we've got 32-bit pointers on a 64-bit system, and that saves a reasonable amount of memory and cache.

To summarise: we don't have any branches or function calls, and we've got a lot of inlining. We did have function calls in the class file, so it's the JVM - the JIT - that has done this. We've got no null checks either, and I'm going to talk through that now. Null check elimination is quite a clever feature in Java and other programs. The idea behind null check elimination is that most of the time this object is not going to be null. If this object is null, the operating system knows quite quickly: if you try to dereference a null pointer you'll get either a SIGSEGV or a SIGBUS, depending on a few circumstances. That goes straight back to the JVM, and the JVM knows where the null exception took place. Because it knows where it took place, it can look this up and unwind it as part of an exception. Those null checks just go. Completely gone. Most of the time this works and you are saving a reasonable amount of execution; I'll talk about when it doesn't work in a second. That's reasonably clever. We have similar programming techniques in other places. Even the Linux kernel, for instance, when it copies data to and from user space, does pretty much the identical thing: it has an exception unwind table, and it knows that if it catches a page fault on this particular program counter it can deal with it, because it knows the program counter and it knows conceptually what it was doing. In a similar way, the JIT knows what it's doing to a reasonable degree, so it can handle the null check elimination.

I mentioned the sneaky one: we've got essentially 32-bit pointers on a 64-bit system. Most of the time in Java, people typically specify a heap size smaller than 32GB, which is perfect if you want to use 32-bit pointers and a left shift of 3.
Because that gives you 32GB of addressable memory. That's a significant memory saving, because otherwise a lot of things would double up; there are a significant number of pointers in Java.

The one that should make people jump out of their seat is the fact that most methods in Java are actually virtual. So what the JVM has actually done is inlined a virtual function. A virtual function is essentially a function where you don't know where you're going until run time. You can have several different classes that share the same virtual function in the base class, and depending on which specific class you're running, different virtual functions will get executed. In C++ that would be a read from a vtable, and then you know where to go. The JVM has inlined it. We've saved a memory load, and we've saved a branch as well. The reason the JVM can inline it is because the JVM knows every single class that has been loaded. So it knows that although this looks polymorphic to the casual programmer, it actually is monomorphic. The JVM knows this, and because it knows this it can be clever. And this is really clever; that's a significant cost saving.

This is all great, but there are limits. I've already mentioned the null check elimination: we're taking a signal, and as most of you know, if we do that a lot it's going to be slow - jumping into kernel, into user, bouncing around. The JVM has a notion of 'OK, I've been a bit too clever now; I need to back off a bit'. Also, there's nothing stopping the user loading more classes and rendering the monomorphic assumption invalid. So the JVM needs a notion of backpedalling: 'OK, I've gone too far and need to deoptimise'. The JVM has the ability to deoptimise. In other words, it essentially knows that for certain code paths everything's OK, but for certain new objects it can't get away with these tricks, so by the time the new objects are executed they are going to be safe. There are ramifications to this.
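The monomorphic-inlining situation can be sketched like this; the names are invented and this is a hypothetical illustration, not the talk's code. While Square is the only loaded implementation, the call site below only ever goes one place, so the JIT can inline Square.area() with no vtable load; loading a second implementation later invalidates that assumption and forces a deoptimisation.

```java
// A call that looks polymorphic but is, so far, monomorphic.
abstract class Shape {
    abstract double area();
}

class Square extends Shape {
    final double side;
    Square(double side) { this.side = side; }
    @Override double area() { return side * side; }
}

class Demo {
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            // Looks like a virtual dispatch, but if Square is the only
            // subclass the JVM has ever loaded, the JIT can speculatively
            // inline Square.area() here - and deoptimise if, say, a
            // Circle class gets loaded later.
            total += s.area();
        }
        return total;
    }
}
```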
This is the important thing to consider with something like Java and other languages and virtual machines. If you're trying to profile this, there is a very significant ramification: you can have the same class and method JIT'd multiple ways and executed at the same time. So if you're trying to find a hot spot, the program counter's nodding off, because you can refer to the same thing in several different ways. This is quite common, as deoptimisation does take place. That's something to bear in mind with the JVM and similar runtime environments.

You can get a notion of what the JVM's trying to do. You can ask it nicely, with the PrintCompilation option, and it will tell you what it's doing. This is reasonably verbose. Typically what happens is the JVM gets excited, JIT'ing everything and optimising everything, then it settles down - until you load something new and it gets excited again. There's a lot of logs. This is mainly useful for debugging, but it gives you an appreciation that it's doing a lot of work. You can go even further with the LogCompilation option. That produces a lot of XML, and that is useful for people debugging the JVM as well. It's quite handy to get an idea of what's going on.

If that is not enough information, you also have the ability to go even further. This is beyond the limit of my understanding; I've gone into it a little bit just to show you what can be done. There are release builds of OpenJDK and there are debug builds of OpenJDK. The release builds by default turn off a lot of the diagnostic options, but you can switch them back on again. When you do, you can also gain insight into the compiler itself, colloquially referred to as the C2 JIT. You can see, for instance, objects in timelines and visualise them as they're being optimised at various stages. This is based on a master's thesis by Thomas Würthinger. It's something you can play with as well, to see how far the optimiser goes.
And it's also good for people hacking on the JVM.

I'll move onto some stuff we did. Last year we were working on big data on a relatively new architecture: ARM64, which is called AArch64 in OpenJDK land but ARM64 in Debian land. We were a bit concerned because everything was all shiny and new. Had it been optimised correctly? Were there any obvious things we needed to optimise? And we were also interested because everything was so shiny and new in the whole system - not just the JVM but glibc and the kernel as well. So how do we get a view of all of this? I gave a quick talk at the Debian mini-conf before last [2014] about perf, so we decided to try and do some clever things with Linux perf and see if we could get some actually useful debugging information out.

We have the flame graphs, which are quite well known. We also have some previous work: Johannes had a special perf-map-agent that could basically hook into perf, and it would give you a nice way of running perf-top, for want of a better expression, and viewing the top Java function names. This is really good work and it's really good for a particular use case: if you just want to take a quick snapshot once and see, in that snapshot, where the hotspots were. For a prolonged workload, with all the functions being JIT'd multiple ways, with the optimisation going on and everything moving around, it requires a little bit more information to be captured. So I decided to do a little bit of work on something very similar to perf-map-agent, but an agent that would capture this over a prolonged period of time.

Here's an example flame graph; these are all over the internet. This is the SHA1 computation example that I gave at the beginning. As expected, the VM intrinsic SHA1 is the top one. Not expected by me was this quite significant chunk of CPU execution time: a significant amount of time was being spent copying memory from the mmapped memory region into the heap, and then that was passed to the crypto engine.
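The pattern the flame graph exposed looks roughly like this. This is a reconstruction of my understanding of what happens under the hood, not the actual class-library code: the mapped (off-heap) region gets pulled chunk by chunk into a safe heap array before the crypto engine ever sees it.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;

public class CopyCost {
    // Map the file, then copy it into heap memory before digesting.
    // (Mapping the whole file at once for brevity; very large files
    // would need chunked mappings.)
    static byte[] digestViaHeapCopy(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (RandomAccessFile f = new RandomAccessFile(path, "r");
             FileChannel ch = f.getChannel()) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] heapChunk = new byte[64 * 1024];
            while (buf.hasRemaining()) {
                int n = Math.min(buf.remaining(), heapChunk.length);
                buf.get(heapChunk, 0, n);   // the copy: off-heap to heap
                md.update(heapChunk, 0, n); // only now does the engine run
            }
        }
        return md.digest();
    }
}
```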
So we're doing a ton of memory copies for no good reason. That essentially highlighted an assumption I'd made about Java to begin with, which was: if you do the equivalent of mmap, it should just work like mmap, right? You should just be able to address the memory. That is not the case. If you've got a file mapping object and you try to address it, the data has to be copied into safe heap memory first, and that is what was slowing down the program. If that copy were omitted, you could make the SHA1 computation even quicker. So that would be the logical target to optimise.

I wanted to extend Johannes' work with a Java Virtual Machine Tools Interface profiling agent. This is part of the Java Virtual Machine standard: you can make a special library and hook it into the JVM, and the JVM can expose quite a few things to the library; it exposes a reasonable amount of information. Perf also has the ability to look at map files natively. If you are profiling JavaScript, or something similar, the Google V8 JavaScript engine will write out a special map file that says these program counter addresses correspond to these function names. I decided to use that in a similar way to what Johannes did for the extended profiling agent, but I also decided to capture some more information. I decided to capture the disassembly, so when we run perf annotate we can see the actual JVM bytecode in our annotation. We can see how it was JIT'd at the time it was JIT'd. We can see where the hotspots were. And that's good, but we can go even better: we can run an annotated trace that contains the Java class, the Java method and the bytecode all in one place at the same time. You can see everything from the JVM in the same place. This works reasonably well because the perf interface is extremely extensible. And again, we can do entire-system optimisation: the bits in red here are the Linux kernel, then we get into libraries.
And then we get into Java and more libraries as well. So we can see everything from top to bottom in one fell swoop.

This is just a quick slide showing the mechanisms employed. Essentially we have this agent, which is a shared object file, and it spits out useful files here in a standard way. Linux perf basically just records the perf data dump file as normal, so we have two sets of recording going on. For reporting, it's very easy to do normal reporting with the PID map; this is just out of the box, and works with the Google V8 engine as well. If you want to do very clever annotations, perf has the ability to have Python scripts passed to it. So you can craft quite a dodgy Python script, and that can interface with the perf annotation output. That's how I was able to get the extra Java information into the same annotation. And this is really easy to do; it's quite easy to knock the script up. Again, the only thing we do for this profiling is hook in the profiling agent, which dumps out various things. We preserve the frame pointer, because that makes things considerably easier when unwinding; this will affect performance a little bit. And when we're reporting we just hook in a Python script. It's really easy to hook everything in and get it working.

At the moment we have a JVMTI agent; it's actually on http://git.linaro.org now. Since I gave this talk, Google have extended perf anyway, so it will do quite a lot of similar things out of the box. It's worth having a look at the latest perf. The techniques in this slide deck can obviously be used in other JITs quite easily, and the fact that perf is so easy to extend with scripts can be useful for other things. And OpenJDK has a significant amount of cleverness associated with it that I found very surprising and good. So that's what I covered in the talk. These are basically references to things like command line arguments and the flame graphs and so on.
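For the curious, the map-file convention mentioned earlier - the /tmp/perf-&lt;pid&gt;.map files that perf picks up for JIT'd code - is simple enough to sketch in a few lines. The address and size below are hypothetical, purely for illustration; a real agent would obtain them from JVMTI's CompiledMethodLoad events in native code, not from Java.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class PerfMapWriter {
    // One line per JIT-compiled code region: "START SIZE name",
    // with START and SIZE in hex. This is the simple map-file
    // convention perf understands for JITs without native symbols.
    static void writeEntry(Writer w, long start, long size, String name)
            throws IOException {
        w.write(String.format("%x %x %s\n", start, size, name));
    }

    public static void main(String[] args) throws IOException {
        // perf looks for /tmp/perf-<pid>.map when it finds samples in
        // anonymous executable mappings of this process.
        long pid = ProcessHandle.current().pid();
        try (Writer w = new FileWriter("/tmp/perf-" + pid + ".map")) {
            // Hypothetical code address and size, for illustration only.
            writeEntry(w, 0x7f3d4c001000L, 0x200,
                       "Ljava/lang/String;::hashCode");
        }
    }
}
```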
If anyone is interested in playing with OpenJDK on ARM64, I'd suggest going to http://openjdk.linaro.org where the most recent builds are. Obviously fixes are going in upstream and into distributions as well, and they're included in OpenJDK, so it should be good there too. I've run through quite a few fundamental things reasonably quickly. I'd be happy to accept any questions or comments, and if you want to talk to me privately about Java afterwards, feel free to when no-one's looking. [Applause]

[Audience] It's not really a question so much as a comment. At the last mini-DebConf we had a talk about using the JVM with other languages. And it seems to me that all this would apply even if you hate the Java programming language and want to write in, I don't know, Lisp or something instead, if you've got a Lisp system that can generate JVM bytecode.

Yeah, totally. And the other big data language we looked at was Scala. It uses the JVM back end but a completely different language on the front. Cheers guys.