WEBVTT

00:00:01.753 --> 00:00:10.762
Onto the second talk of the day.

00:00:10.762 --> 00:00:15.511
Steve Capper is going to tell us about the good bits of Java.

00:00:15.511 --> 00:00:18.248
They do exist.

00:00:18.248 --> 00:00:23.417
[Audience] Could this have been a lightning talk? [Audience laughter]

00:00:23.417 --> 00:00:26.204
Believe it or not, we've got some good stuff here.

00:00:27.293 --> 00:00:30.534
I was as skeptical as you guys when I first looked.

00:00:30.534 --> 00:00:34.911
First, many apologies for not attending the mini-conf last year.

00:00:34.911 --> 00:00:40.471
I was unfortunately ill on the day I was due to give this talk.

00:00:43.619 --> 00:00:45.742
Let me figure out how to use a computer.

00:01:00.894 --> 00:01:02.160
Sorry about this.

00:01:12.795 --> 00:01:16.034
There we go; it's because I've not woken up.

00:01:16.034 --> 00:01:27.981
Last year I worked at Linaro in the Enterprise group, and we performed analysis

00:01:27.981 --> 00:01:32.206
on so-called 'Big Data' application sets.

00:01:32.206 --> 00:01:38.441
As many of you know, quite a lot of these big data applications are written in Java.

00:01:38.441 --> 00:01:43.039
I'm from ARM, and we were very interested in 64-bit ARM support.

00:01:43.039 --> 00:01:47.613
So this talk mainly has AArch64 examples for things like assembler,

00:01:47.613 --> 00:01:53.708
but most of the messages are pertinent for any architecture.

00:01:53.708 --> 00:01:58.956
These good bits are shared between most, if not all, of the architectures.

00:01:58.956 --> 00:02:03.008
Whilst trying to optimise a lot of these big data applications

00:02:03.008 --> 00:02:06.409
I stumbled across quite a few things in the JVM and I thought

00:02:06.409 --> 00:02:11.425
'actually, that's really clever; that's really cool'.

00:02:11.425 --> 00:02:16.673
So I thought that would make a good basis for an interesting talk.

00:02:16.673 --> 00:02:20.330
This talk is essentially some of the clever things I found in the

00:02:20.330 --> 00:02:25.322
Java Virtual Machine; these optimisations are in OpenJDK.

00:02:25.322 --> 00:02:33.554
The source is available; it's all there, readily available and in play now.

00:02:33.554 --> 00:02:38.349
I'm going to finish with some of the optimisation work we did with Java.

00:02:38.349 --> 00:02:42.818
People who know me will know I'm not a Java zealot.

00:02:42.818 --> 00:02:48.809
I don't particularly believe in programming in one language over another,

00:02:48.809 --> 00:02:51.561
so to make it clear from the outset: I'm not attempting to convert

00:02:51.561 --> 00:02:54.556
anyone into Java programmers.

00:02:54.556 --> 00:02:57.424
I'm just going to highlight a few salient things in the Java Virtual Machine

00:02:57.424 --> 00:02:59.791
which I found to be quite clever and interesting,

00:02:59.791 --> 00:03:05.365
and I'll try and talk through them with my understanding of them.

00:03:05.365 --> 00:03:10.241
Let's jump straight in and start with an example.

00:03:10.241 --> 00:03:15.199
This is a minimal example for computing a SHA1 sum of a file.

00:03:15.199 --> 00:03:19.819
I've omitted some of the checking at the beginning of the function,

00:03:19.819 --> 00:03:22.397
the command line parsing and that sort of thing.

00:03:22.397 --> 00:03:25.160
I've highlighted the salient points in red.

00:03:25.160 --> 00:03:31.093
Essentially we instantiate a SHA1 crypto message digest.

00:03:31.093 --> 00:03:36.108
And we do the equivalent in Java of an mmap.

00:03:36.108 --> 00:03:38.279
Get it all in memory.
00:03:38.279 --> 00:03:42.784
And then we just put this data straight into the crypto engine.

00:03:42.784 --> 00:03:46.952
And eventually, at the end of the program, we'll spit out the SHA1 hash.

00:03:46.952 --> 00:03:48.682
It's a very simple program.

00:03:48.682 --> 00:03:56.356
It's basically mmap, SHA1, output the hash afterwards.

00:03:56.356 --> 00:04:04.077
In order to concentrate on the CPU aspect rather than worry about IO,

00:04:04.077 --> 00:04:08.465
I decided to cheat a little bit by setting this up.

00:04:08.465 --> 00:04:15.176
I decided to use a sparse file. As many of you know, a sparse file is a file where not

00:04:15.176 --> 00:04:19.529
all the contents are necessarily stored on disc. The assumption is that the bits

00:04:19.529 --> 00:04:26.379
that aren't stored are zero. For instance, on Linux you can create a 20TB sparse file

00:04:26.379 --> 00:04:31.162
on a 10MB file system and use it as normal.

00:04:31.162 --> 00:04:34.611
Just don't write too much to it, otherwise you're going to run out of space.

00:04:34.611 --> 00:04:40.915
The idea behind using a sparse file is that I'm just focusing on the computational aspects

00:04:40.915 --> 00:04:44.990
of the SHA1 sum. I'm not worried about the file system or anything like that.

00:04:44.990 --> 00:04:49.690
I don't want to worry about the IO. I just want to focus on the actual compute.

00:04:49.690 --> 00:04:53.106
In order to set up a sparse file I used the following runes.

00:04:53.106 --> 00:04:56.877
The important point is that you seek, and the other important point

00:04:56.877 --> 00:05:03.206
is that you set a count, otherwise you'll fill your disc up.

00:05:03.206 --> 00:05:09.568
I decided to run this against, firstly, the native SHA1 sum command

00:05:09.568 --> 00:05:15.188
that's built into Linux; let's normalise these results and say that's 1.0.

00:05:15.188 --> 00:05:21.794
I used an older version of the OpenJDK and ran the Java program,

00:05:21.794 --> 00:05:30.339
and that's 1.09 times slower than the reference command. That's quite good.

00:05:30.339 --> 00:05:39.174
Then I used the new OpenJDK; this is now the current JDK, as this is a year on.

00:05:39.174 --> 00:05:46.732
And that one took 0.21. It's significantly faster.

00:05:46.732 --> 00:05:50.993
I've stressed that I've done nothing surreptitious in the Java program.

00:05:50.993 --> 00:05:56.171
It is mmap, compute, spit the result out.

00:05:56.171 --> 00:06:01.570
But the OpenJDK has essentially got some more context information.

00:06:01.570 --> 00:06:03.625
I'll talk about that as we go through.

00:06:06.202 --> 00:06:11.194
Before I started on Java I had a very simplistic view of it.

00:06:11.194 --> 00:06:17.416
Traditionally, Java is taught as a virtual machine that runs bytecode.

00:06:17.416 --> 00:06:21.481
Now, when you compile a Java program it compiles into bytecode.

00:06:21.481 --> 00:06:25.428
The older versions of the Java Virtual Machine would interpret this bytecode

00:06:25.428 --> 00:06:33.056
and then run through it. Newer versions would employ a just-in-time engine

00:06:33.056 --> 00:06:39.627
and try and compile this bytecode into native machine code.

00:06:39.627 --> 00:06:43.191
That is not the only thing that goes on when you run a Java program.

00:06:43.191 --> 00:06:47.568
There are some extra optimisations as well. So this alone would not account for

00:06:47.568 --> 00:06:52.315
the newer version of the SHA1 sum being significantly faster

00:06:52.315 --> 00:06:55.568
than the distro-supplied one.
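NOTE
A minimal sketch of the kind of program described above (the talk's own slide code isn't in the transcript, so the class name, file size and hex formatting here are illustrative). It also sets up the sparse file from Java rather than with dd-style seek/count runes, since extending a file without writing to it leaves a hole on most Linux filesystems:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.security.MessageDigest;

    public class Sha1Sum {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile(args[0], "rw")) {
                // Sparse file: all holes, reads back as zeroes.
                file.setLength(1L << 30);
                // Instantiate the SHA1 message digest...
                MessageDigest md = MessageDigest.getInstance("SHA-1");
                // ...do the Java equivalent of an mmap to get it all in memory...
                FileChannel ch = file.getChannel();
                MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // ...put the data straight into the crypto engine...
                md.update(buf);
                // ...and spit out the SHA1 hash at the end.
                StringBuilder hex = new StringBuilder();
                for (byte b : md.digest()) hex.append(String.format("%02x", b));
                System.out.println(hex);
            }
        }
    }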
00:06:55.568 --> 00:07:00.850
Java knows about context. It has a class library, and these class libraries

00:07:00.850 --> 00:07:04.624
have reasonably well defined purposes.

00:07:04.624 --> 00:07:07.746
We have classes that provide crypto services.

00:07:07.746 --> 00:07:11.067
We have sun.misc.Unsafe, which every single project seems to pull into their

00:07:11.067 --> 00:07:13.493
project when they're not supposed to.

00:07:13.493 --> 00:07:17.673
These have well defined meanings.

00:07:17.673 --> 00:07:21.075
These do not necessarily have to be written in Java.

00:07:21.075 --> 00:07:24.569
They come as Java classes; they come supplied.

00:07:24.569 --> 00:07:28.911
But most JVMs now have a notion of a virtual machine intrinsic.

00:07:28.911 --> 00:07:35.227
And the virtual machine intrinsic says: OK, please do a SHA1 in the best possible way

00:07:35.227 --> 00:07:39.500
that your implementation allows. This is something done automatically by the JVM.

00:07:39.500 --> 00:07:43.482
You don't ask for it. If the JVM knows what it's running on, and it's reasonably

00:07:43.482 --> 00:07:47.824
recent, this will just happen for you, for free.

00:07:47.824 --> 00:07:50.123
And there are quite a few classes that do this.

00:07:50.123 --> 00:07:53.650
There are quite a few clever things with atomics, there's crypto,

00:07:53.650 --> 00:07:58.378
and there are mathematical routines as well. Most of these routines in the

00:07:58.378 --> 00:08:03.136
class library have a well defined notion of a virtual machine intrinsic,

00:08:03.136 --> 00:08:06.992
and they do run reasonably optimally.

00:08:06.992 --> 00:08:11.984
They are a subject of continuous optimisation as well.

00:08:11.984 --> 00:08:17.312
We've got some runes that are presented on the slides here.

00:08:17.312 --> 00:08:21.342
These are quite useful if you are interested in

00:08:21.342 --> 00:08:24.696
how these intrinsics are made.

00:08:24.696 --> 00:08:29.248
You can ask the JVM to print out a lot of the just-in-time compiled code.

00:08:29.248 --> 00:08:35.239
You can ask the JVM to print out the native methods as well as these intrinsics,

00:08:35.239 --> 00:08:40.731
and in this particular case, after sifting through about 5MB of text,

00:08:40.731 --> 00:08:45.235
I came across this particular SHA1 sum implementation.

00:08:45.235 --> 00:08:53.420
This is AArch64. This is employing the cryptographic extensions

00:08:53.420 --> 00:08:57.263
in the architecture.

00:08:57.263 --> 00:08:59.353
So it's essentially using the CPU instructions, which would explain why

00:08:59.353 --> 00:09:01.884
it's faster. But again, it's done all this automatically.

00:09:01.884 --> 00:09:05.948
This did not require any specific runes or anything to activate.

00:09:05.948 --> 00:09:12.229
We'll see a bit later on how you can more easily find the hot spots,

00:09:12.229 --> 00:09:15.128
rather than sifting through a lot of assembler.

00:09:15.128 --> 00:09:18.658
I've mentioned that the cryptographic engine is employed, and again,

00:09:18.658 --> 00:09:23.374
this routine was generated at run time as well.

00:09:23.374 --> 00:09:28.390
This is one of the important things about execution environments like Java.

00:09:28.390 --> 00:09:31.439
You don't have to know everything at compile time.

00:09:31.439 --> 00:09:35.019
You know a lot more information at run time, and you can use that,

00:09:35.019 --> 00:09:37.838
in theory, to optimise.

00:09:37.838 --> 00:09:40.394
You can switch off these clever routines.
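NOTE
The transcript doesn't capture the exact runes on the slide; these are the standard HotSpot spellings I'd expect (MyProgram is a placeholder, and PrintAssembly additionally needs the hsdis disassembler plugin installed):

    # print the just-in-time compiled code
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyProgram

    # print the generated stub routines, which is roughly where
    # intrinsics like the SHA1 one live
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode MyProgram

    # switch the clever routine back off, falling back to pure Java
    java -XX:-UseSHA1Intrinsics MyProgram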
00:09:40.394 --> 00:09:43.459
For instance, I've deactivated it here, and we get back to the

00:09:43.459 --> 00:09:46.699
slower performance we expected.

00:09:46.699 --> 00:09:52.968
Again, this particular set of routines is present in OpenJDK,

00:09:52.968 --> 00:09:57.391
I think, for all the architectures that support it.

00:09:57.391 --> 00:10:00.758
We get this optimisation for free on x86 and others as well.

00:10:00.758 --> 00:10:03.301
It works quite well.

00:10:03.301 --> 00:10:08.514
That was one surprise I came across: the intrinsics.

00:10:08.514 --> 00:10:13.216
One thing I thought would be quite good to do would be to go through

00:10:13.216 --> 00:10:18.104
a slightly more complicated example, and use this example to explain

00:10:18.104 --> 00:10:21.865
a lot of other things that happen in the JVM as well.

00:10:21.865 --> 00:10:24.652
I will spend a bit of time going through this example

00:10:24.652 --> 00:10:32.512
and explain roughly the notion of what it's supposed to be doing.

00:10:32.512 --> 00:10:39.605
This is an imaginary method that I've contrived to demonstrate a lot of points

00:10:39.605 --> 00:10:43.146
in the fewest possible lines of code.

00:10:43.146 --> 00:10:43.828
I'll start with what it's meant to do.

00:10:43.828 --> 00:10:50.886
This is meant to be a routine that gets a reference to something and lets you know

00:10:50.886 --> 00:10:58.692
whether or not it's an image in a hypothetical cache.

00:10:58.692 --> 00:11:02.512
I'll start with the important thing here: the weak reference.

00:11:02.512 --> 00:11:09.443
In Java, and other garbage collected languages, we have the notion of references.

00:11:09.443 --> 00:11:13.588
Most of the time when you are running a Java program you have something like a

00:11:13.588 --> 00:11:19.451
variable name, and that, in the current execution context, is referred to as a

00:11:19.451 --> 00:11:24.176
strong reference to the object. In other words: I can see it. I am using it.

00:11:24.176 --> 00:11:27.659
Please don't get rid of it. Bad things will happen if you do.

00:11:27.659 --> 00:11:30.561
So the garbage collector knows not to get rid of it.

00:11:30.561 --> 00:11:36.146
In Java and other languages you also have the notion of a weak reference.

00:11:36.146 --> 00:11:40.244
This is essentially the programmer saying to the virtual machine:

00:11:40.244 --> 00:11:43.936
"Look, I kinda care about this, but just a little bit."

00:11:43.936 --> 00:11:49.973
"If you want to get rid of it, feel free to, but please let me know."

00:11:49.973 --> 00:11:54.142
This is why this is for a CacheClass. For instance, the JVM in this particular

00:11:54.142 --> 00:12:01.421
case could decide that it's running quite low on memory, this particular xMB image

00:12:01.421 --> 00:12:04.416
has not been used for a while, and it can garbage collect it.

00:12:04.416 --> 00:12:08.770
The important thing is how we go about expressing this in the language.

00:12:08.770 --> 00:12:13.530
We can't just have a reference to the object, because that's a strong reference,

00:12:13.530 --> 00:12:17.768
and the JVM will know it can't get rid of this because the program

00:12:17.768 --> 00:12:19.521
can see it actively.

00:12:19.521 --> 00:12:25.152
So we have a level of indirection, which is known as a weak reference.

00:12:25.152 --> 00:12:28.621
We have this hypothetical CacheClass that I've devised.

00:12:28.621 --> 00:12:32.280
At this point it is a weak reference.

00:12:32.280 --> 00:12:36.204
Then we get it.
This is calling the weak reference routine.
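NOTE
A sketch of the sort of contrived method being described (the slide's actual code isn't in the transcript; CacheClass, cached, hasImage and isImage are illustrative names):

    import java.lang.ref.WeakReference;

    class CacheClass {
        boolean isImage() { return true; }          // illustrative
    }

    class Cache {
        private final WeakReference<CacheClass> cached =
            new WeakReference<>(new CacheClass());

        boolean hasImage() {
            // get() hands back a strong reference, so from here on
            // the object cannot be garbage collected...
            CacheClass c = cached.get();
            if (c == null) {
                return false;   // too late: the GC already took it
            }
            return c.isImage(); // virtual call on the strong reference
            // ...and on return the strong reference disappears,
            // leaving only the weak one.
        }
    }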
00:12:36.204 --> 00:12:41.139
Now it becomes a strong reference, so it's not going to be garbage collected.

00:12:41.139 --> 00:12:45.202
When we get to the return path it becomes a weak reference again,

00:12:45.202 --> 00:12:48.929
because our strong reference has disappeared.

00:12:48.929 --> 00:12:50.856
The salient points in this example are:

00:12:50.856 --> 00:12:54.293
we're employing a method to get a reference,

00:12:54.293 --> 00:12:57.079
and we're checking an item to see if it's null.

00:12:57.079 --> 00:13:01.050
So let's say that the JVM decided to garbage collect this

00:13:01.050 --> 00:13:04.544
before we executed the method.

00:13:04.544 --> 00:13:08.885
The weak reference object is still valid, because we've got a strong reference to it,

00:13:08.885 --> 00:13:11.926
but the actual object behind it is gone.

00:13:11.926 --> 00:13:14.970
If we're too late and the garbage collector has killed it,

00:13:14.970 --> 00:13:17.919
it will be null, and we return.

00:13:17.919 --> 00:13:22.331
So it's a level of indirection, to see: does this still exist?

00:13:22.331 --> 00:13:27.960
If so, can I please have it? Then we operate on it as normal,

00:13:27.960 --> 00:13:30.794
and on return it becomes a weak reference again.

00:13:30.794 --> 00:13:37.888
This example program is quite useful when we look at how it's implemented in the JVM,

00:13:37.888 --> 00:13:40.384
and we'll go through a few things now.

00:13:40.384 --> 00:13:43.644
First off, we'll go through the bytecode.

00:13:43.644 --> 00:13:49.312
The only point of this slide is to show it's roughly

00:13:49.312 --> 00:13:54.152
the same as this.

00:13:54.152 --> 00:13:56.371
We get our variable.

00:13:56.371 --> 00:13:58.981
We use our getter.

00:13:58.981 --> 00:14:03.836
This bit is extra: this checkcast. The reason that bit is extra is

00:14:03.836 --> 00:14:15.110
because we're using the equivalent of a template in Java.

00:14:15.110 --> 00:14:19.173
And the way that's implemented in Java is that it basically casts everything to an

00:14:19.173 --> 00:14:23.167
object, so that requires extra compiler information.

00:14:23.167 --> 00:14:26.035
And this is the extra check.

00:14:26.035 --> 00:14:30.795
For the rest of this: we load the reference, we check to see if it is null,

00:14:30.795 --> 00:14:35.149
and if it's not null we invoke a virtual function - is it the image? -

00:14:35.149 --> 00:14:37.981
and we return as normal.

00:14:37.981 --> 00:14:43.357
Essentially the point I'm trying to make is: when we compile this to bytecode,

00:14:43.357 --> 00:14:45.365
this execution happens.

00:14:45.365 --> 00:14:46.758
This null check happens.

00:14:46.758 --> 00:14:48.604
This execution happens.

00:14:48.604 --> 00:14:50.450
And we return.

00:14:50.450 --> 00:14:55.118
In the actual Java class files we've not lost anything.

00:14:55.118 --> 00:14:58.356
This is what it looks like when it's been JIT'd.

00:14:58.356 --> 00:15:01.108
Now we've lost lots of things.

00:15:01.108 --> 00:15:05.578
The JIT has done quite a few clever things, which I'll talk about.

00:15:05.578 --> 00:15:10.905
First off, if we look down here, there's a single branch here.

00:15:10.905 --> 00:15:16.167
And that's only taken if our checkcast failed.

00:15:16.167 --> 00:15:20.381
We've got comments on the right hand side.

00:15:20.381 --> 00:15:27.126
Our get method has been inlined, so we're no longer calling it.

00:15:27.126 --> 00:15:32.699
We seem to have lost our null check; that's just gone.
00:15:32.699 --> 00:15:35.938
And again, we've got a getfield as well.

00:15:35.938 --> 00:15:40.153
That's no longer a method call; that's been inlined as well.

00:15:40.153 --> 00:15:42.579
We've also got some other cute things.

00:15:42.579 --> 00:15:46.062
Those more familiar with AArch64 will understand

00:15:46.062 --> 00:15:49.870
that the pointers we're using are 32-bit, not 64-bit.

00:15:49.870 --> 00:15:54.015
What we're doing is getting a pointer, shifting it left 3,

00:15:54.015 --> 00:15:56.476
and widening it to a 64-bit pointer.

00:15:56.476 --> 00:16:02.245
So we've got 32-bit pointers on a 64-bit system.

00:16:02.245 --> 00:16:06.148
That's saving a reasonable amount of memory and cache.

00:16:06.148 --> 00:16:09.712
To summarise: we don't have any branches or function calls,

00:16:09.712 --> 00:16:12.742
and we've got a lot of inlining.

00:16:12.742 --> 00:16:16.061
We did have function calls in the class file, so it's the JVM,

00:16:16.061 --> 00:16:18.326
it's the JIT, that has done this.

00:16:18.326 --> 00:16:23.330
We've got no null checks either, and I'm going to talk through this now.

00:16:23.330 --> 00:16:29.996
Null check elimination is quite a clever feature in Java and other runtimes.

00:16:29.996 --> 00:16:33.605
The idea behind null check elimination is:

00:16:33.605 --> 00:16:37.494
most of the time this object is not going to be null.

00:16:37.494 --> 00:16:43.599
If this object is null, the operating system knows this quite quickly.

00:16:43.599 --> 00:16:47.943
So if you try to dereference a null pointer, you'll get either a SIGSEGV or

00:16:47.943 --> 00:16:51.275
a SIGBUS, depending on a few circumstances.

00:16:51.275 --> 00:16:53.655
That goes straight back to the JVM,

00:16:53.655 --> 00:16:58.810
and the JVM knows where the null exception took place.

00:16:58.810 --> 00:17:01.759
Because it knows where it took place, it can look this up

00:17:01.759 --> 00:17:05.774
and unwind it as part of an exception.

00:17:05.774 --> 00:17:09.909
Those null checks just go. Completely gone.

00:17:09.909 --> 00:17:15.447
Most of the time this works, and you are saving a reasonable amount of execution.

00:17:15.447 --> 00:17:19.836
I'll talk about when it doesn't work in a second.

00:17:19.836 --> 00:17:24.143
That's reasonably clever. We have similar programming techniques in other places;

00:17:24.143 --> 00:17:27.800
even the Linux kernel, for instance, when you copy data to and from user space,

00:17:27.800 --> 00:17:30.912
does pretty much exactly the same thing.

00:17:30.912 --> 00:17:36.636
It has an exception unwind table, and it knows that if it catches a page fault on

00:17:36.636 --> 00:17:40.130
this particular program counter it can deal with it, because it knows

00:17:40.130 --> 00:17:43.729
the program counter and it knows, conceptually, what it was doing.

00:17:43.729 --> 00:17:47.630
In a similar way, the JIT knows what it's doing to a reasonable degree,

00:17:47.630 --> 00:17:53.459
so it can handle the null check elimination.

00:17:53.459 --> 00:17:57.325
I mentioned the sneaky one: we've got essentially 32-bit pointers

00:17:57.325 --> 00:17:59.658
on a 64-bit system.

00:17:59.658 --> 00:18:04.534
Most of the time in Java, people typically specify a heap size smaller than 32GB.

00:18:04.534 --> 00:18:10.444
Which is perfect if you want to use 32-bit pointers and left shift 3,

00:18:10.444 --> 00:18:12.940
because that gives you 32GB of addressable memory.
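NOTE
The pointer arithmetic being described, as a sketch (this is the gist of HotSpot's compressed pointers; the method and parameter names are illustrative):

    class NarrowPointers {
        // Objects are 8-byte aligned, so the low 3 bits of an object
        // address are always zero. The heap stores 32-bit "narrow"
        // pointers, and the JIT widens each one by shifting it left 3
        // and (optionally) adding a heap base.
        static long decode(int narrowPointer, long heapBase) {
            return heapBase + ((narrowPointer & 0xFFFFFFFFL) << 3);
        }
        // 2^32 slots x 8 bytes of alignment = 32GB of addressable heap,
        // which is why heaps under 32GB can get away with 32-bit pointers.
    }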
00:18:12.940 --> 00:18:19.337
That's a significant memory saving, because otherwise a lot of things would double up.

00:18:19.337 --> 00:18:23.041
There are a significant number of pointers in Java.

00:18:23.041 --> 00:18:29.055
The one that should make people jump out of their seats is

00:18:29.055 --> 00:18:32.224
the fact that most methods in Java are actually virtual.

00:18:32.224 --> 00:18:37.147
So what the JVM has actually done is inline a virtual function.

00:18:37.147 --> 00:18:41.698
A virtual function is essentially a function where you don't know where

00:18:41.698 --> 00:18:43.439
you're going until run time.

00:18:43.439 --> 00:18:47.503
You can have several different classes that share the same virtual function

00:18:47.503 --> 00:18:51.439
in the base class, and depending upon which specific class you're running,

00:18:51.439 --> 00:18:54.364
different virtual functions will get executed.

00:18:54.364 --> 00:19:01.307
In C++ that would be a read from a vtable, and then you know where to go.

00:19:01.307 --> 00:19:03.606
The JVM has inlined it.

00:19:03.606 --> 00:19:05.255
We've saved a memory load.

00:19:05.255 --> 00:19:07.925
We've saved a branch as well.

00:19:07.925 --> 00:19:12.023
The reason the JVM can inline it is because the JVM knows

00:19:12.023 --> 00:19:14.485
every single class that has been loaded.

00:19:14.485 --> 00:19:19.767
So it knows that although this looks polymorphic to the casual programmer,

00:19:19.767 --> 00:19:26.420
it actually is monomorphic. The JVM knows this.

00:19:26.420 --> 00:19:30.982
Because it knows this, it can be clever. And this is really clever.

00:19:30.982 --> 00:19:35.660
That's a significant cost saving.

00:19:35.660 --> 00:19:41.664
This is all great. I've already mentioned the null check elimination:

00:19:41.664 --> 00:19:47.410
we're taking a signal, and as most of you know, if we do that a lot it's going to be slow.

00:19:47.410 --> 00:19:51.137
Jumping into kernel, into user, bouncing around.

00:19:51.137 --> 00:19:55.735
The JVM also has a notion of 'OK, I've been a bit too clever now;

00:19:55.735 --> 00:19:57.906
I need to back off a bit'.

00:19:57.906 --> 00:20:02.782
Also, there's nothing stopping the user loading more classes

00:20:02.782 --> 00:20:06.799
and rendering the monomorphic assumption invalid.

00:20:06.799 --> 00:20:10.166
So the JVM needs to have a notion of backpedalling, of going

00:20:10.166 --> 00:20:14.264
'OK, I've gone too far and need to deoptimise'.

00:20:14.264 --> 00:20:17.051
The JVM has the ability to deoptimise.

00:20:17.051 --> 00:20:22.865
In other words, it essentially knows that for certain code paths everything's OK,

00:20:22.865 --> 00:20:27.140
but for certain new objects it can't get away with these tricks.

00:20:27.140 --> 00:20:32.678
By the time the new objects are executed, they are going to be safe.

00:20:32.678 --> 00:20:36.823
There are ramifications to this. This is the important thing to consider

00:20:36.823 --> 00:20:39.818
with something like Java, and other languages and other virtual machines.

00:20:39.818 --> 00:20:46.351
If you're trying to profile this, there is a very significant ramification:

00:20:46.351 --> 00:20:52.252
you can have the same class and method JIT'd multiple ways

00:20:52.252 --> 00:20:54.969
and executed at the same time.

00:20:54.969 --> 00:21:00.832
So if you're trying to find a hot spot, the program counter on its own isn't enough,

00:21:00.832 --> 00:21:03.885
because you can refer to the same thing in several different ways.
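NOTE
A sketch of the monomorphic trick being described (illustrative classes, not the talk's slide code):

    abstract class Shape {
        abstract int sides();               // a virtual call in bytecode
    }

    class Square extends Shape {
        @Override int sides() { return 4; }
    }

    class Demo {
        static int count(Shape s) {
            // Looks polymorphic, but while Square is the only subclass
            // the JVM has loaded, the JIT knows this call site is
            // monomorphic and can inline it: no vtable load, no branch.
            // Loading another subclass later invalidates the assumption,
            // and the JVM deoptimises back to a real virtual dispatch.
            return s.sides();
        }
    }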
00:21:03.885 --> 00:21:07.914
This is quite common, as deoptimisation does take place.

00:21:07.914 --> 00:21:16.215
That's something to bear in mind with the JVM and similar runtime environments.

00:21:16.215 --> 00:21:19.814
You can get a notion of what the JVM's trying to do.

00:21:19.814 --> 00:21:22.612
You can ask it nicely and add a print-compilation option,

00:21:22.612 --> 00:21:24.667
and it will tell you what it's doing.

00:21:24.667 --> 00:21:26.885
This is reasonably verbose.

00:21:26.885 --> 00:21:30.008
Typically what happens is the JVM gets excited, JIT'ing everything

00:21:30.008 --> 00:21:32.643
and optimising everything, then it settles down.

00:21:32.643 --> 00:21:35.430
Until you load something new and it gets excited again.

00:21:35.430 --> 00:21:38.588
There are a lot of logs. This is mainly useful for debugging, but

00:21:38.588 --> 00:21:42.361
it gives you an appreciation that it's doing a lot of work.

00:21:42.361 --> 00:21:45.635
You can go even further with a log-compilation option.

00:21:45.635 --> 00:21:50.732
That produces a lot of XML, and that is useful for people debugging the JVM as well.

00:21:50.732 --> 00:21:56.270
It's quite handy for getting an idea of what's going on.

00:21:56.270 --> 00:22:04.025
If that is not enough information, you have the ability to go even further.

00:22:04.025 --> 00:22:07.334
This is beyond the limit of my understanding.

00:22:07.334 --> 00:22:10.945
I've gone into this a little bit just to show you what can be done.

00:22:10.945 --> 00:22:17.342
You have release builds of OpenJDK, and you have debug builds of OpenJDK.

00:22:17.342 --> 00:22:24.877
The release builds will, by default, turn off a lot of the diagnostic options.

00:22:24.877 --> 00:22:27.950
You can switch them back on again.

00:22:27.950 --> 00:22:33.619
When you do, you can also gain insight into the compiler itself, which is

00:22:33.619 --> 00:22:37.624
colloquially referred to as the C2 JIT.

00:22:37.624 --> 00:22:41.966
You can see, for instance, objects in timelines, and visualise them

00:22:41.966 --> 00:22:45.693
as they're being optimised at various stages.

00:22:45.693 --> 00:22:52.938
This is based on a master's thesis by Thomas Würthinger.

00:22:52.938 --> 00:23:00.566
This is something you can play with as well, to see how far the optimiser goes.

00:23:00.566 --> 00:23:05.163
And it's also good for people hacking on the JVM.

00:23:05.163 --> 00:23:08.275
I'll move on to some stuff we did.

00:23:08.275 --> 00:23:15.740
Last year we were working on big data, on a relatively new architecture:

00:23:15.740 --> 00:23:23.937
ARM64. It's called AArch64 in OpenJDK land, but ARM64 in Debian land.

00:23:23.937 --> 00:23:26.769
We were a bit concerned, because everything was all shiny and new.

00:23:26.769 --> 00:23:28.926
Has it been optimised correctly?

00:23:28.926 --> 00:23:31.680
Are there any obvious things we need to optimise?

00:23:31.680 --> 00:23:34.060
And we were also interested, because everything was so shiny and new,

00:23:34.060 --> 00:23:35.291
in the whole system.

00:23:35.291 --> 00:23:39.389
Not just the JVM, but the glibc and the kernel as well.

00:23:39.389 --> 00:23:42.060
So how do we get a view of all of this?

00:23:42.060 --> 00:23:49.827
I gave a quick talk at the Debian mini-conf before last [2014] about perf,

00:23:49.827 --> 00:23:53.020
so we decided we could try and do some clever things with Linux perf

00:23:53.020 --> 00:23:57.965
and see if we could get some actual useful debugging information out.
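NOTE
The print-compilation and log-compilation options mentioned above are standard HotSpot flags; typical invocations look roughly like this (MyProgram is a placeholder):

    # one line per method as it gets JIT'd (or deoptimised)
    java -XX:+PrintCompilation MyProgram

    # much more detail, as XML, for people debugging the JVM itself
    java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation MyProgram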
00:23:57.965 --> 00:24:02.598
We have the flame graphs that are quite well known.

00:24:02.598 --> 00:24:08.612
We also have some previous work: Johannes had a special perf-map-agent that

00:24:08.612 --> 00:24:13.500
could basically hook into perf, and it would give you a nice way of running

00:24:13.500 --> 00:24:20.431
perf top, for want of a better expression, and viewing the top Java function names.

00:24:20.431 --> 00:24:25.130
This is really good work, and it's really good for a particular use case:

00:24:25.130 --> 00:24:29.405
if you just want to take a quick snapshot once and see in that snapshot

00:24:29.405 --> 00:24:32.284
where the hotspots were.

00:24:32.284 --> 00:24:38.670
For a prolonged workload, with all the functions being JIT'd multiple ways,

00:24:38.670 --> 00:24:42.165
with the optimisation going on and everything moving around,

00:24:42.165 --> 00:24:46.762
it requires a little bit more information to be captured.

00:24:46.762 --> 00:24:50.872
I decided to do a little bit of work on a very similar thing to perf-map-agent,

00:24:50.872 --> 00:24:56.155
but an agent that would capture this over a prolonged period of time.

00:24:56.155 --> 00:24:59.394
Here's an example flame graph; these are all over the internet.

00:24:59.394 --> 00:25:05.257
This is the SHA1 computation example that I gave at the beginning.

00:25:05.257 --> 00:25:10.654
As expected, the VM intrinsic SHA1 is the top one.

00:25:10.654 --> 00:25:17.494
Not expected by me was this quite significant chunk of CPU execution time:

00:25:17.494 --> 00:25:21.498
there was a significant amount of time being spent copying memory

00:25:21.498 --> 00:25:28.430
from the mmapped memory region into the heap,

00:25:28.430 --> 00:25:31.575
and then that was passed to the crypto engine.

00:25:31.575 --> 00:25:35.687
So we were doing a ton of memory copies for no good reason.

00:25:35.687 --> 00:25:38.972
That essentially highlighted an example.

00:25:38.972 --> 00:25:42.664
It was an assumption I'd made about Java to begin with: if you do

00:25:42.664 --> 00:25:45.334
the equivalent of mmap, it should just work like mmap, right?

00:25:45.334 --> 00:25:48.398
You should just be able to address the memory. That is not the case.

00:25:48.398 --> 00:25:54.146
If you've got a file mapping object and you try to address it, it has to be copied

00:25:54.146 --> 00:25:59.510
into safe heap memory first. And that is what was slowing down the program.

00:25:59.510 --> 00:26:04.886
If that were omitted, you could make the SHA1 computation even quicker.

00:26:04.886 --> 00:26:08.982
So that would be the logical target you would want to optimise.

00:26:08.982 --> 00:26:12.014
I wanted to extend Johannes' work with something called a

00:26:12.014 --> 00:26:15.741
Java Virtual Machine Tools Interface profiling agent.

00:26:15.741 --> 00:26:22.649
This is part of the Java Virtual Machine standard: you can make a special library

00:26:22.649 --> 00:26:24.597
and then hook this into the JVM,

00:26:24.597 --> 00:26:28.349
and the JVM can expose quite a few things to the library.

00:26:28.349 --> 00:26:32.413
It exposes a reasonable amount of information as well.

00:26:32.413 --> 00:26:40.237
Perf also has the ability to look at map files natively.

00:26:40.237 --> 00:26:44.557
If you are profiling JavaScript, or something similar, I think the

00:26:44.557 --> 00:26:48.493
Google V8 JavaScript engine will write out a special map file that says:

00:26:48.493 --> 00:26:52.846
these program counter addresses correspond to these function names.
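NOTE
The map files perf understands have a very simple format, conventionally written to /tmp/perf-<pid>.map: one JIT'd symbol per line, as a start address and a size in hex followed by a free-form name. An invented illustration (these addresses and names are made up):

    7f8a1c000000 400 Ljava/lang/String;::hashCode
    7f8a1c000400 1c0 LCache;::hasImage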
00:26:52.846 --> 00:26:57.026
I decided to use that in a similar way to what Johannes did, for the extended

00:26:57.026 --> 00:27:03.621
profiling agent, but I also decided to capture some more information.

00:27:03.621 --> 00:27:10.110
I decided to capture the disassembly, so when we run perf annotate

00:27:10.110 --> 00:27:13.675
we can see the actual JVM bytecode in our annotation.

00:27:13.675 --> 00:27:17.437
We can see how it was JIT'd, at the time it was JIT'd.

00:27:17.437 --> 00:27:19.909
We can see where the hotspots were.

00:27:19.909 --> 00:27:23.067
And that's good. But we can go even better.

00:27:23.067 --> 00:27:29.174
We can run an annotated trace that contains the Java class,

00:27:29.174 --> 00:27:34.190
the Java method and the bytecode, all in one place at the same time.

00:27:34.190 --> 00:27:38.950
You can see everything from the JVM in the same place.

00:27:38.950 --> 00:27:43.940
This works reasonably well, because the perf interface is extremely extensible.

00:27:43.940 --> 00:27:47.889
And again, we can do entire-system optimisation.

00:27:47.889 --> 00:27:52.046
The bits in red here are the Linux kernel.

00:27:52.046 --> 00:27:55.111
Then we get into libraries.

00:27:55.111 --> 00:27:58.350
And then we get into Java, and more libraries as well.

00:27:58.350 --> 00:28:03.110
So we can see everything from top to bottom in one fell swoop.

00:28:03.110 --> 00:28:07.789
This is just a quick slide showing the mechanisms employed.

00:28:07.789 --> 00:28:12.340
Essentially we have this agent, which is a shared object file,

00:28:12.340 --> 00:28:16.740
and this will spit out useful files here in a standard way.

00:28:16.740 --> 00:28:27.445
And Linux perf basically just records the perf data dump file as normal.

00:28:27.445 --> 00:28:29.755
We have two sets of recording going on.

00:28:29.755 --> 00:28:35.559
To report it, it's very easy to do normal reporting with the PID map.

00:28:35.559 --> 00:28:40.819
This just works out of the box; it works with the Google V8 engine as well.

00:28:40.819 --> 00:28:45.127
If you want to do very clever annotations, perf has the ability to have

00:28:45.127 --> 00:28:47.913
Python scripts passed to it.

00:28:47.913 --> 00:28:53.521
So you can craft quite a dodgy Python script, and that can interface

00:28:53.521 --> 00:28:55.529
with the perf annotation output.

00:28:55.529 --> 00:29:00.614
That's how I was able to get the extra Java information into the same annotation.

00:29:00.614 --> 00:29:05.247
And this is really easy to do; it's quite easy to knock the script up.

00:29:05.247 --> 00:29:10.216
And again, the only thing we do for this profiling is hook in the profiling

00:29:10.216 --> 00:29:13.440
agent, which dumps out various things.

00:29:13.440 --> 00:29:18.470
We preserve the frame pointer, because that makes things considerably easier

00:29:18.470 --> 00:29:21.547
when unwinding. This will affect performance a little bit.

00:29:21.547 --> 00:29:26.133
And again, when we're reporting, we just hook in a Python script.

00:29:26.133 --> 00:29:29.697
It's really easy to hook everything in and get it working.

00:29:31.462 --> 00:29:38.602
At the moment we have a JVMTI agent. It's actually on http://git.linaro.org now.

00:29:38.602 --> 00:29:42.364
Since I gave this talk, Google have extended perf, so it will do

00:29:42.364 --> 00:29:45.765
quite a lot of similar things out of the box.

00:29:45.765 --> 00:29:49.748
It's worth having a look at the latest perf.
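NOTE
A rough sketch of how the pieces hook together (the agent path and script name are placeholders, not the actual Linaro agent; the perf sub-commands are standard, and in current OpenJDK the frame pointer rune is -XX:+PreserveFramePointer):

    # run the JVM with frame pointers kept and the JVMTI agent loaded
    java -XX:+PreserveFramePointer -agentpath:/path/to/libprofagent.so MyProgram

    # record with call graphs as normal, in a second terminal
    perf record -g -p $(pgrep -f MyProgram)

    # a plain report works out of the box via the PID map file;
    # a Python script can fold the extra Java info into the annotation
    perf report
    perf script -s java-annotate.py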
00:29:49.748 --> 00:29:54.554
The techniques in this slide deck can obviously be used in other JITs quite easily.

00:29:54.554 --> 00:29:59.628
The fact that perf is so easy to extend with scripts can be useful

00:29:59.628 --> 00:30:01.345
for other things too.

00:30:01.345 --> 00:30:05.769
And OpenJDK has a significant amount of cleverness associated with it that

00:30:05.769 --> 00:30:11.180
I thought was very surprising and good. So that's what I covered in the talk.

00:30:11.180 --> 00:30:17.832
These are basically references to things like the command line arguments,

00:30:17.832 --> 00:30:20.851
the flame graphs, and stuff like that.

00:30:20.851 --> 00:30:26.308
If anyone is interested in playing with OpenJDK on ARM64, I'd suggest going here:

00:30:26.308 --> 00:30:31.332
http://openjdk.linaro.org - that's where the most recent builds are.

00:30:31.332 --> 00:30:36.002
Obviously fixes are going in upstream, and they're going into distributions as well.

00:30:36.002 --> 00:30:41.586
They're included in OpenJDK, so it should be good there as well.

00:30:41.586 --> 00:30:45.034
I've run through quite a few fundamental things reasonably quickly.

00:30:45.034 --> 00:30:48.274
I'd be happy to accept any questions or comments.

00:30:54.671 --> 00:30:57.817
And if you want to talk to me privately about Java afterwards, feel free to,

00:30:57.817 --> 00:30:59.129
when no-one's looking.

00:31:06.896 --> 00:31:12.698
[Applause]

00:31:13.315 --> 00:31:19.028
[Audience] It's not really a question so much as a comment.

00:31:19.028 --> 00:31:31.881
At the last mini-DebConf we had a talk about using the JVM with other languages.

00:31:31.881 --> 00:31:36.025
And it seems to me that all this would apply even if you hate the Java programming

00:31:36.025 --> 00:31:39.671
language and want to write in, I don't know, Lisp or something instead,

00:31:39.671 --> 00:31:42.794
if you've got a Lisp system that can generate JVM bytecode.

00:31:42.794 --> 00:31:48.982
[Presenter] Yeah, totally. And the other big data language we looked at was Scala.

00:31:48.982 --> 00:31:53.870
It uses the JVM back end but a completely different language on the front.

00:32:03.875 --> 00:32:08.104
Cheers, guys.