1 00:00:01,695 --> 00:00:03,483 Onto the second talk of the day. 2 00:00:10,562 --> 00:00:14,451 Steve Capper is going to tell us about the good bits of Java 3 00:00:15,101 --> 00:00:17,198 They do exist 4 00:00:17,198 --> 00:00:21,067 [Audience] Could this have been a lightning talk? [Audience laughter] 5 00:00:23,258 --> 00:00:26,019 Believe it or not we've got some good stuff here. 6 00:00:27,325 --> 00:00:30,459 I was as skeptical as you guys when I first looked. 7 00:00:30,534 --> 00:00:34,756 First, many apologies for not attending the mini-conf last year; 8 00:00:34,756 --> 00:00:40,031 I was unfortunately ill on the day I was due to give this talk. 9 00:00:43,664 --> 00:00:45,588 Let me figure out how to use a computer. 10 00:01:00,997 --> 00:01:03,020 Sorry about this. 11 00:01:12,607 --> 00:01:15,567 There we go; it's because I've not woken up. 12 00:01:19,952 --> 00:01:26,945 Last year I worked at Linaro in the Enterprise group and we performed analysis 13 00:01:28,013 --> 00:01:31,863 on so-called 'Big Data' application sets. 14 00:01:32,210 --> 00:01:37,423 As many of you know quite a lot of these big data applications are written in Java. 15 00:01:38,456 --> 00:01:42,550 I'm from ARM and we were very interested in 64-bit ARM support. 16 00:01:42,931 --> 00:01:47,261 So this is mainly AArch64 examples for things like assembler 17 00:01:47,679 --> 00:01:52,680 but most of the messages are pertinent for any architecture. 18 00:01:53,505 --> 00:01:58,217 These good bits are shared between most, if not all, of the architectures. 19 00:01:58,964 --> 00:02:02,783 Whilst trying to optimise a lot of these big data applications 20 00:02:03,027 --> 00:02:06,409 I stumbled across quite a few things in the JVM and I thought 21 00:02:06,412 --> 00:02:11,475 'actually that's really clever; that's really cool' 22 00:02:12,609 --> 00:02:16,656 So I thought that would make a good basis for an interesting talk.
23 00:02:16,673 --> 00:02:20,313 This talk is essentially some of the clever things I found in the 24 00:02:20,330 --> 00:02:25,322 Java Virtual Machine; these optimisations are in OpenJDK. 25 00:02:26,238 --> 00:02:31,602 The source is all there, readily available and in play now. 26 00:02:32,992 --> 00:02:37,512 I'm going to finish with some of the optimisation work we did with Java. 27 00:02:38,371 --> 00:02:42,573 People who know me will know I'm not a Java zealot. 28 00:02:42,574 --> 00:02:47,860 I don't particularly believe in programming in one language over another, 29 00:02:48,445 --> 00:02:51,366 so to make it clear from the outset, I'm not attempting to convert 30 00:02:51,366 --> 00:02:53,617 anyone into Java programmers. 31 00:02:53,617 --> 00:02:57,221 I'm just going to highlight a few salient things in the Java Virtual Machine 32 00:02:57,221 --> 00:02:59,571 which I found to be quite clever and interesting 33 00:02:59,577 --> 00:03:03,827 and I'll try and talk through them with my understanding of them. 34 00:03:04,349 --> 00:03:08,867 Let's jump straight in and start with an example. 35 00:03:10,060 --> 00:03:14,498 This is a minimal example for computing the SHA1 sum of a file. 36 00:03:15,208 --> 00:03:19,717 I've omitted some of the checking at the beginning of the function, such as 37 00:03:19,717 --> 00:03:22,196 command line parsing and that sort of thing. 38 00:03:22,196 --> 00:03:25,040 I've highlighted the salient points in red. 39 00:03:25,040 --> 00:03:29,653 Essentially we instantiate a SHA1 crypto message service digest. 40 00:03:30,116 --> 00:03:35,422 And we do the equivalent in Java of an mmap. 41 00:03:35,907 --> 00:03:37,992 Get it all in memory. 42 00:03:37,992 --> 00:03:42,388 And then we just put this data straight into the crypto engine. 43 00:03:42,494 --> 00:03:46,632 And eventually at the end of the program we'll spit out the SHA1 hash. 44 00:03:46,632 --> 00:03:48,682 It's a very simple program.
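The slide with the code isn't captured in the subtitles, so here is a minimal sketch of the kind of program being described (my reconstruction, not the speaker's exact code; the class name is made up):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;

public class Sha1Sum {
    // Map the file into memory and feed it straight into the digest engine.
    static String sha1Hex(String path) throws Exception {
        // Instantiate a SHA-1 message digest from the crypto services.
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            // The Java equivalent of an mmap: get it all in memory.
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            md.update(buf); // put the data straight into the crypto engine
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Spit out the SHA-1 hash at the end of the program.
        System.out.println(sha1Hex(args[0]));
    }
}
```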
45 00:03:48,682 --> 00:03:53,466 It's basically mmap, SHA1, output the hash afterwards. 46 00:03:55,857 --> 00:04:02,832 In order to concentrate on the CPU aspect rather than worry about IO 47 00:04:03,822 --> 00:04:07,432 I decided to cheat a little bit by setting this up. 48 00:04:08,129 --> 00:04:14,659 I decided to use a sparse file. As many of you know a sparse file is a file whose 49 00:04:14,659 --> 00:04:19,529 contents are not necessarily all stored on disc. The assumption is that the bits 50 00:04:19,529 --> 00:04:26,379 that aren't stored are zero. For instance on Linux you can create a 20TB sparse file 51 00:04:26,379 --> 00:04:30,802 on a 10MB file system and use it as normal. 52 00:04:30,802 --> 00:04:34,401 Just don't write too much to it, otherwise you're going to run out of space. 53 00:04:34,401 --> 00:04:40,675 The idea behind using a sparse file is that I'm just focusing on the computational 54 00:04:40,675 --> 00:04:44,730 aspects of the SHA1 sum. I'm not worried about the file system or anything like that. 55 00:04:44,734 --> 00:04:48,910 I don't want to worry about the IO. I just want to focus on the actual compute. 56 00:04:48,910 --> 00:04:52,856 In order to set up a sparse file I used the following runes. 57 00:04:52,856 --> 00:04:56,727 The important point is that you seek, and the other important point 58 00:04:56,728 --> 00:05:01,046 is that you set a count, otherwise you'll fill your disc up. 59 00:05:02,556 --> 00:05:09,218 I decided to run this against, firstly, the native SHA1 sum command 60 00:05:09,218 --> 00:05:15,188 that's built into Linux, and let's normalise these results and say that's 1.0. 61 00:05:17,469 --> 00:05:21,454 I used an older version of the OpenJDK and ran the Java program, 62 00:05:21,454 --> 00:05:28,499 and that's 1.09 times slower than the reference command. That's quite good. 63 00:05:30,009 --> 00:05:38,944 Then I used the new OpenJDK; this is now the current JDK, as this is a year on.
64 00:05:38,944 --> 00:05:44,662 And that took 0.21. It's significantly faster. 65 00:05:45,712 --> 00:05:50,693 I've stressed that I've done nothing surreptitious in the Java program. 66 00:05:50,693 --> 00:05:54,481 It is mmap, compute, spit result out. 67 00:05:55,771 --> 00:06:00,840 But the OpenJDK has essentially got some more context information. 68 00:06:00,840 --> 00:06:03,625 I'll talk about that as we go through. 69 00:06:06,202 --> 00:06:11,014 Before, when I started with Java, I had a very simplistic view of Java. 70 00:06:11,014 --> 00:06:16,816 Traditionally Java is taught as a virtual machine that runs bytecode. 71 00:06:16,816 --> 00:06:21,421 Now when you compile a Java program it compiles into bytecode. 72 00:06:21,421 --> 00:06:25,428 The older versions of the Java Virtual Machine would interpret this bytecode 73 00:06:25,428 --> 00:06:31,936 and then run through. Newer versions would employ a just-in-time engine 74 00:06:32,436 --> 00:06:38,167 and try and compile this bytecode into native machine code. 75 00:06:39,327 --> 00:06:42,841 That is not the only thing that goes on when you run a Java program. 76 00:06:42,841 --> 00:06:47,318 There are some extra optimisations as well. So this alone would not account for 77 00:06:47,318 --> 00:06:51,975 the newer version of the SHA1 sum being significantly faster 78 00:06:51,975 --> 00:06:55,568 than the distro-supplied one. 79 00:06:55,568 --> 00:07:00,700 Java knows about context. It has a class library, and these class libraries 80 00:07:00,700 --> 00:07:04,184 have reasonably well defined purposes. 81 00:07:04,184 --> 00:07:07,596 We have classes that provide crypto services. 82 00:07:07,596 --> 00:07:10,947 We have sun.misc.Unsafe, which every single project seems to pull into their 83 00:07:10,947 --> 00:07:13,493 project when they're not supposed to. 84 00:07:13,493 --> 00:07:17,303 These have well defined meanings. 85 00:07:17,303 --> 00:07:20,785 These do not necessarily have to be written in Java.
86 00:07:20,785 --> 00:07:24,289 They come as Java classes, they come supplied. 87 00:07:24,289 --> 00:07:28,591 But most JVMs now have a notion of a virtual machine intrinsic. 88 00:07:28,591 --> 00:07:34,787 And the virtual machine intrinsic says: OK, please do a SHA1 in the best possible way 89 00:07:34,787 --> 00:07:39,170 that your implementation allows. This is something done automatically by the JVM. 90 00:07:39,170 --> 00:07:43,482 You don't ask for it. If the JVM knows what it's running on and it's reasonably 91 00:07:43,482 --> 00:07:47,564 recent, this will just happen for you for free. 92 00:07:47,564 --> 00:07:50,053 And there are quite a few classes that do this. 93 00:07:50,053 --> 00:07:53,520 There are quite a few clever things with atomics, there's crypto, 94 00:07:53,520 --> 00:07:58,208 there are mathematical routines as well. Most of these routines in the 95 00:07:58,208 --> 00:08:02,876 class library have a well defined notion of a virtual machine intrinsic 96 00:08:02,876 --> 00:08:06,992 and they do run reasonably optimally. 97 00:08:06,992 --> 00:08:11,044 They are the subject of continuous optimisation as well. 98 00:08:11,684 --> 00:08:15,952 We've got some runes that are presented on the slides here. 99 00:08:17,102 --> 00:08:21,342 These are quite useful if you are interested in 100 00:08:21,342 --> 00:08:24,326 how these intrinsics are made. 101 00:08:24,326 --> 00:08:28,798 You can ask the JVM to print out a lot of the just-in-time compiled code. 102 00:08:28,798 --> 00:08:34,899 You can ask the JVM to print out the native methods as well as these intrinsics, 103 00:08:34,899 --> 00:08:40,371 and in this particular case, after sifting through about 5MB of text, 104 00:08:40,371 --> 00:08:45,035 I came across this particular SHA1 sum implementation. 105 00:08:45,035 --> 00:08:52,480 This is AArch64. This is employing the cryptographic extensions 106 00:08:52,480 --> 00:08:54,093 in the architecture.
107 00:08:54,093 --> 00:08:57,333 So it's essentially using the CPU instructions, which would explain why 108 00:08:57,333 --> 00:09:00,464 it's faster. But again, it's done all this automatically. 109 00:09:00,464 --> 00:09:05,948 This did not require any specific runes or anything to activate. 110 00:09:07,618 --> 00:09:12,009 We'll see a bit later on how you can more easily find the hot spots 111 00:09:12,009 --> 00:09:14,778 rather than sifting through a lot of assembler. 112 00:09:14,778 --> 00:09:18,658 I've mentioned that the cryptographic engine is employed, and again 113 00:09:18,658 --> 00:09:23,094 this routine was generated at run time as well. 114 00:09:23,094 --> 00:09:28,070 This is one of the important things about execution environments like Java. 115 00:09:28,070 --> 00:09:31,439 You don't have to know everything at compile time. 116 00:09:31,439 --> 00:09:35,019 You know a lot more information at run time and you can use that, 117 00:09:35,019 --> 00:09:37,408 in theory, to optimise. 118 00:09:37,408 --> 00:09:40,014 You can switch off these clever routines. 119 00:09:40,014 --> 00:09:43,179 For instance I've got a deactivate here, and we get back to the 120 00:09:43,179 --> 00:09:46,699 slower performance we expected. 121 00:09:46,699 --> 00:09:52,968 Again, this particular set of routines is present in OpenJDK, 122 00:09:52,968 --> 00:09:57,001 I think for all the architectures that support it. 123 00:09:57,001 --> 00:10:00,758 We get this optimisation for free on x86 and others as well. 124 00:10:00,758 --> 00:10:03,071 It works quite well. 125 00:10:03,071 --> 00:10:07,914 That was one surprise I came across: the intrinsics. 126 00:10:07,914 --> 00:10:13,096 One thing I thought it would be quite good to do would be to go through 127 00:10:13,096 --> 00:10:17,874 a slightly more complicated example, and use this example to explain 128 00:10:17,874 --> 00:10:21,495 a lot of other things that happen in the JVM as well.
129 00:10:21,495 --> 00:10:24,392 I will spend a bit of time going through this example 130 00:10:24,392 --> 00:10:29,772 and explain roughly the notion of what it's supposed to be doing. 131 00:10:32,526 --> 00:10:39,305 This is an imaginary method that I've contrived to demonstrate a lot of points 132 00:10:39,305 --> 00:10:42,756 in the fewest possible lines of code. 133 00:10:42,756 --> 00:10:45,238 I'll start with what it's meant to do. 134 00:10:45,238 --> 00:10:50,606 This is meant to be a routine that gets a reference to something and lets you know 135 00:10:50,606 --> 00:10:55,952 whether or not it's an image in a hypothetical cache. 136 00:10:58,155 --> 00:11:02,252 I'll start with the important thing here: the weak reference. 137 00:11:02,252 --> 00:11:09,023 In Java and other garbage collected languages we have the notion of references. 138 00:11:09,023 --> 00:11:13,428 Most of the time when you are running a Java program you have something like a 139 00:11:13,428 --> 00:11:19,351 variable name in the current execution context, and that is referred to as a 140 00:11:19,351 --> 00:11:24,176 strong reference to the object. In other words: I can see it. I am using it. 141 00:11:24,176 --> 00:11:27,289 Please don't get rid of it. Bad things will happen if you do. 142 00:11:27,289 --> 00:11:30,561 So the garbage collector knows not to get rid of it. 143 00:11:30,561 --> 00:11:35,766 In Java and other languages you also have the notion of a weak reference. 144 00:11:35,766 --> 00:11:40,044 This is essentially the programmer saying to the virtual machine 145 00:11:40,044 --> 00:11:43,936 "Look, I kinda care about this, but just a little bit." 146 00:11:44,216 --> 00:11:49,053 "If you want to get rid of it feel free to, but please let me know." 147 00:11:49,493 --> 00:11:53,952 This is why this is for a CacheClass.
For instance, the JVM in this particular 148 00:11:53,952 --> 00:12:01,201 case could decide that it's running quite low on memory; this particular xMB image 149 00:12:01,201 --> 00:12:04,416 has not been used for a while, so it can garbage collect it. 150 00:12:04,416 --> 00:12:08,530 The important thing is how we go about expressing this in the language. 151 00:12:08,530 --> 00:12:12,980 We can't just have a reference to the object, because that's a strong reference, 152 00:12:12,980 --> 00:12:17,628 and the JVM will know it can't get rid of this because the program 153 00:12:17,628 --> 00:12:19,341 can see it actively. 154 00:12:19,341 --> 00:12:23,712 So we have a level of indirection, which is known as a weak reference. 155 00:12:24,652 --> 00:12:28,621 We have this hypothetical CacheClass that I've devised. 156 00:12:28,621 --> 00:12:32,130 At this point it is a weak reference. 157 00:12:32,130 --> 00:12:35,974 Then we get it. This is calling the weak reference routine. 158 00:12:35,974 --> 00:12:40,759 Now it becomes a strong reference, so it's not going to be garbage collected. 159 00:12:40,759 --> 00:12:44,722 When we get to the return path it becomes a weak reference again, 160 00:12:44,722 --> 00:12:48,439 because our strong reference has disappeared. 161 00:12:48,439 --> 00:12:50,856 The salient points in this example are: 162 00:12:50,856 --> 00:12:54,293 we're employing a method to get a reference, 163 00:12:54,293 --> 00:12:56,759 and we're checking an item to see if it's null. 164 00:12:56,759 --> 00:13:01,050 So let's say that the JVM decided to garbage collect this 165 00:13:01,970 --> 00:13:04,354 before we executed the method. 166 00:13:04,354 --> 00:13:08,715 The weak reference class is still valid, because we've got a strong reference to it, 167 00:13:08,715 --> 00:13:11,786 but the actual object behind this is gone.
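A sketch of the contrived method as I understand the description (the names CacheClass, isImageInCache and isImage are my assumptions, since the slide itself isn't in the transcript):

```java
import java.lang.ref.WeakReference;

class CacheClass {
    boolean isImage() { return true; }
}

class Cache {
    // The cache only holds the object weakly: the GC may reclaim it at any time.
    static boolean isImageInCache(WeakReference<CacheClass> ref) {
        CacheClass c = ref.get(); // promote to a strong reference
        if (c == null) {
            return false;         // too late: the GC has already collected it
        }
        return c.isImage();       // strong reference ends when we return
    }
}
```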
168 00:13:11,786 --> 00:13:14,970 If we're too late and the garbage collector has killed it, 169 00:13:14,970 --> 00:13:17,789 it will be null and we return. 170 00:13:17,789 --> 00:13:22,131 So it's a level of indirection, to see: does this still exist? 171 00:13:22,131 --> 00:13:27,680 If so, can I please have it, and then operate on it as normal, 172 00:13:27,680 --> 00:13:31,254 and then on return it becomes a weak reference again. 173 00:13:31,254 --> 00:13:37,438 This example program is quite useful when we look at how it's implemented in the JVM, 174 00:13:37,438 --> 00:13:40,214 and we'll go through a few things now. 175 00:13:40,214 --> 00:13:43,504 First off we'll go through the bytecode. 176 00:13:43,504 --> 00:13:48,862 The only point of this slide is to show it's roughly 177 00:13:48,862 --> 00:13:53,882 the same as this. 178 00:13:53,882 --> 00:13:56,371 We get our variable. 179 00:13:56,371 --> 00:13:58,821 We use our getter. 180 00:13:58,821 --> 00:14:03,776 This bit is extra: this checkcast. The reason that bit is extra is 181 00:14:03,776 --> 00:14:14,950 because we're using the equivalent of a template in Java. 182 00:14:14,950 --> 00:14:18,983 And the way that's implemented in Java is that it just basically casts everything to an 183 00:14:18,983 --> 00:14:23,337 object, so that requires extra compiler information. 184 00:14:23,337 --> 00:14:25,385 And this is the extra check. 185 00:14:25,385 --> 00:14:30,795 The rest of this: we load the reference, we check to see if it is null, 186 00:14:30,795 --> 00:14:35,149 if it's not null we invoke a virtual function - is it the image? - 187 00:14:35,149 --> 00:14:37,831 and we return as normal. 188 00:14:37,831 --> 00:14:43,357 Essentially the point I'm trying to make is that when we compile this to bytecode 189 00:14:43,357 --> 00:14:45,255 this execution happens. 190 00:14:45,255 --> 00:14:46,758 This null check happens. 191 00:14:46,758 --> 00:14:48,364 This execution happens. 192 00:14:48,364 --> 00:14:50,000 And we return.
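The extra checkcast comes from type erasure. A tiny illustration (my example, not the slide's code):

```java
// Java generics are erased at compile time: at the bytecode level get()
// returns a plain Object, and the compiler re-inserts a cast (the checkcast
// instruction) at each use site - the extra check described above.
class Box<T> {
    private T value;
    void set(T v) { value = v; }
    T get() { return value; }
}
```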
193 00:14:50,000 --> 00:14:54,608 In the actual Java class files we've not lost anything. 194 00:14:54,848 --> 00:14:58,166 This is what it looks like when it's been JIT'd. 195 00:14:58,166 --> 00:15:01,108 Now we've lost lots of things. 196 00:15:01,108 --> 00:15:05,578 The JIT has done quite a few clever things which I'll talk about. 197 00:15:05,578 --> 00:15:10,635 First off, if we look down here, there's a single branch here. 198 00:15:10,635 --> 00:15:15,077 And this is only if our checkcast failed. 199 00:15:17,037 --> 00:15:19,991 We've got comments on the right hand side. 200 00:15:19,991 --> 00:15:25,516 Our get method has been inlined, so we're no longer calling. 201 00:15:27,006 --> 00:15:31,359 We seem to have lost our null check; that's just gone. 202 00:15:32,179 --> 00:15:35,598 And again, we've got a getfield as well. 203 00:15:35,598 --> 00:15:39,803 That's no longer a method; that's been inlined as well. 204 00:15:39,803 --> 00:15:42,319 We've also got some other cute things. 205 00:15:42,319 --> 00:15:45,742 Those more familiar with AArch64 will understand 206 00:15:45,742 --> 00:15:49,590 that the pointers we're using are 32-bit, not 64-bit. 207 00:15:49,590 --> 00:15:53,795 What we're doing is getting a pointer and shifting it left by 3, 208 00:15:53,795 --> 00:15:56,826 widening it to a 64-bit pointer. 209 00:15:56,996 --> 00:16:01,655 We've got 32-bit pointers on a 64-bit system. 210 00:16:02,035 --> 00:16:05,988 So that's saving a reasonable amount of memory and cache. 211 00:16:05,988 --> 00:16:09,582 To summarise: we don't have any branches or function calls, 212 00:16:09,582 --> 00:16:12,632 and we've got a lot of inlining. 213 00:16:12,632 --> 00:16:15,911 We did have function calls in the class file, so it's the JVM - 214 00:16:15,911 --> 00:16:18,236 it's the JIT - that has done this. 215 00:16:18,236 --> 00:16:22,330 We've got no null checks either, and I'm going to talk through this now.
216 00:16:24,070 --> 00:16:29,426 The null check elimination is quite a clever feature in Java and other programs. 217 00:16:30,446 --> 00:16:33,345 The idea behind null check elimination is that 218 00:16:33,345 --> 00:16:37,494 most of the time this object is not going to be null. 219 00:16:37,724 --> 00:16:42,749 If this object is null, the operating system knows this quite quickly. 220 00:16:43,179 --> 00:16:47,793 So if you try to dereference a null pointer you'll get either a SIGSEGV or 221 00:16:47,793 --> 00:16:50,915 a SIGBUS depending on a few circumstances. 222 00:16:50,915 --> 00:16:53,455 That goes straight back to the JVM, 223 00:16:53,455 --> 00:16:57,970 and the JVM knows where the null exception took place. 224 00:16:58,470 --> 00:17:01,759 Because it knows where it took place, it can look this up 225 00:17:01,759 --> 00:17:05,364 and unwind it as part of an exception. 226 00:17:05,364 --> 00:17:09,709 Those null checks just go. Completely gone. 227 00:17:09,709 --> 00:17:15,447 Most of the time this works and you are saving a reasonable amount of execution. 228 00:17:16,417 --> 00:17:19,836 I'll talk about when it doesn't work in a second. 229 00:17:19,836 --> 00:17:23,913 That's reasonably clever. We have similar programming techniques in other places; 230 00:17:23,913 --> 00:17:27,680 even the Linux kernel, for instance, when you copy data to and from user space, 231 00:17:27,680 --> 00:17:30,912 does pretty much the same thing. 232 00:17:30,912 --> 00:17:36,456 It has an exception unwind table, and it knows if it catches a page fault on 233 00:17:36,456 --> 00:17:40,130 this particular program counter it can deal with it, because it knows 234 00:17:40,130 --> 00:17:43,729 the program counter and it knows conceptually what it was doing. 235 00:17:43,729 --> 00:17:47,630 In a similar way the JIT knows what it's doing to a reasonable degree. 236 00:17:47,630 --> 00:17:51,869 It can handle the null check elimination.
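Nothing observable changes on the Java side: a null dereference still surfaces as a NullPointerException; the JIT has only swapped an explicit compare-and-branch for a signal plus unwind. A small sketch of the preserved semantics (my example):

```java
class NullCheckDemo {
    // The JIT can compile the dereference below with no explicit null check:
    // if 's' is null the access faults, the JVM catches the resulting signal,
    // looks up the faulting program counter in its tables, and raises the
    // exception exactly as if a branch had been emitted.
    static int lengthOf(String s) {
        try {
            return s.length();
        } catch (NullPointerException e) {
            return -1;
        }
    }
}
```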
237 00:17:53,059 --> 00:17:57,325 I mentioned the sneaky one. We've got essentially 32-bit pointers 238 00:17:57,325 --> 00:17:59,478 on a 64-bit system. 239 00:17:59,478 --> 00:18:04,534 Most of the time in Java people typically specify a heap size smaller than 32GB. 240 00:18:05,404 --> 00:18:10,004 Which is perfect if you want to use 32-bit pointers and a left shift of 3. 241 00:18:10,004 --> 00:18:12,800 Because 2^32 times 8 bytes gives you 32GB of addressable memory. 242 00:18:12,800 --> 00:18:18,837 That's a significant memory saving, because otherwise a lot of things would double up. 243 00:18:18,847 --> 00:18:22,861 There's a significant number of pointers in Java. 244 00:18:22,861 --> 00:18:28,845 The one that should make people jump out of their seat is 245 00:18:28,845 --> 00:18:32,224 the fact that most methods in Java are actually virtual. 246 00:18:32,224 --> 00:18:36,917 So what the JVM has actually done is inlined a virtual function. 247 00:18:36,917 --> 00:18:41,578 A virtual function is essentially a function where you don't know where 248 00:18:41,578 --> 00:18:43,369 you're going until run time. 249 00:18:43,369 --> 00:18:47,333 You can have several different classes that share the same virtual function 250 00:18:47,333 --> 00:18:51,439 in the base class, and dependent upon which specific class you're running, 251 00:18:51,439 --> 00:18:54,204 different virtual functions will get executed. 252 00:18:54,204 --> 00:18:59,517 In C++ that would be a read from a vtable, and then you know where to go. 253 00:19:01,047 --> 00:19:03,316 The JVM's inlined it. 254 00:19:03,316 --> 00:19:05,105 We've saved a memory load. 255 00:19:05,105 --> 00:19:07,925 We've saved a branch as well. 256 00:19:07,925 --> 00:19:11,803 The reason the JVM can inline it is because the JVM knows 257 00:19:11,803 --> 00:19:14,285 every single class that has been loaded.
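As an illustration of that point (my example, not the speaker's): a call that looks polymorphic in the source can be proven monomorphic if only one implementing class has ever been loaded, and then inlined with no vtable-style lookup:

```java
abstract class Shape {
    abstract double area();
}

// If Circle is the only Shape subclass the JVM has ever loaded, a call to
// area() through a Shape reference can only dispatch here, so the JIT is
// free to inline it and skip the indirect lookup entirely.
class Circle extends Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override
    double area() { return Math.PI * radius * radius; }
}
```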
258 00:19:14,285 --> 00:19:19,517 So it knows that although this looks polymorphic to the casual programmer, 259 00:19:19,517 --> 00:19:26,010 it actually is monomorphic. The JVM knows this. 260 00:19:26,010 --> 00:19:30,982 Because it knows this it can be clever. And this is really clever. 261 00:19:30,982 --> 00:19:34,940 That's a significant cost saving. 262 00:19:35,240 --> 00:19:40,554 This is all great. I've already mentioned the null check elimination. 263 00:19:41,364 --> 00:19:46,950 We're taking a signal and, as most of you know, if we do that a lot it's going to be slow. 264 00:19:46,950 --> 00:19:51,137 Jumping into kernel, into user, bouncing around. 265 00:19:51,137 --> 00:19:55,525 The JVM also has a notion of 'OK, I've been a bit too clever now; 266 00:19:55,525 --> 00:19:57,806 I need to back off a bit'. 267 00:19:57,806 --> 00:20:02,382 Also, there's nothing stopping the user loading more classes 268 00:20:02,382 --> 00:20:06,799 and rendering the monomorphic assumption invalid. 269 00:20:06,799 --> 00:20:09,936 So the JVM needs to have a notion of backpedalling and going 270 00:20:09,936 --> 00:20:14,104 'OK, I've gone too far and need to deoptimise'. 271 00:20:14,104 --> 00:20:16,881 The JVM has the ability to deoptimise. 272 00:20:16,881 --> 00:20:22,595 In other words, it essentially knows that for certain code paths everything's OK, 273 00:20:22,595 --> 00:20:26,870 but for certain new objects it can't get away with these tricks. 274 00:20:26,870 --> 00:20:32,148 By the time the new objects are executed, they are going to be safe. 275 00:20:32,268 --> 00:20:35,363 There are ramifications for this. This is the important thing to consider 276 00:20:35,363 --> 00:20:39,818 with something like Java and other languages and other virtual machines. 277 00:20:39,818 --> 00:20:45,551 If you're trying to profile this, there is a very significant ramification.
278 00:20:46,181 --> 00:20:51,472 You can have the same class and method JIT'd multiple ways 279 00:20:52,052 --> 00:20:54,829 and executed at the same time. 280 00:20:54,829 --> 00:20:59,952 So if you're trying to find a hot spot, the program counter alone is not enough, 281 00:21:00,582 --> 00:21:03,625 because you can refer to the same thing in several different ways. 282 00:21:03,625 --> 00:21:07,914 This is quite common as well, as deoptimisation does take place. 283 00:21:08,944 --> 00:21:13,715 That's something to bear in mind with the JVM and similar runtime environments. 284 00:21:15,845 --> 00:21:18,984 You can get a notion of what the JVM's trying to do. 285 00:21:18,984 --> 00:21:22,262 You can ask it nicely and add a print compilation option, 286 00:21:22,262 --> 00:21:24,667 and it will tell you what it's doing. 287 00:21:24,667 --> 00:21:26,755 This is reasonably verbose. 288 00:21:26,755 --> 00:21:29,718 Typically what happens is the JVM gets excited, JIT'ing everything 289 00:21:29,718 --> 00:21:32,413 and optimising everything, then it settles down. 290 00:21:32,413 --> 00:21:35,150 Until you load something new and it gets excited again. 291 00:21:35,150 --> 00:21:38,278 There's a lot of logs. This is mainly useful for debugging, but 292 00:21:38,278 --> 00:21:42,121 it gives you an appreciation that it's doing a lot of work. 293 00:21:42,121 --> 00:21:45,285 You can go even further with a log compilation option. 294 00:21:45,285 --> 00:21:50,492 That produces a lot of XML, and that is useful for people debugging the JVM as well. 295 00:21:51,182 --> 00:21:54,430 It's quite handy to get an idea of what's going on. 296 00:21:56,810 --> 00:22:03,145 If that is not enough information, you also have the ability to go even further. 297 00:22:04,735 --> 00:22:07,334 This is beyond the limit of my understanding. 298 00:22:07,334 --> 00:22:10,945 I've gone into this a little bit just to show you what can be done.
299 00:22:11,295 --> 00:22:17,342 There are release builds of OpenJDK and there are debug builds of OpenJDK. 300 00:22:17,342 --> 00:22:23,737 The release builds will by default turn off a lot of the diagnostic options. 301 00:22:24,637 --> 00:22:27,660 You can switch them back on again. 302 00:22:27,660 --> 00:22:33,299 When you do, you can also gain insight into the compiler there - it's colloquially 303 00:22:33,299 --> 00:22:37,344 referred to as the C2 JIT. 304 00:22:37,344 --> 00:22:41,706 You can see, for instance, objects in timelines and visualize them 305 00:22:41,706 --> 00:22:45,493 as they're being optimised at various stages and various things. 306 00:22:45,493 --> 00:22:51,978 This is based on a master's thesis by Thomas Würthinger. 307 00:22:53,578 --> 00:22:58,446 This is something you can play with as well and see how far the optimiser goes. 308 00:22:59,966 --> 00:23:03,373 And it's also good for people hacking on the JVM. 309 00:23:04,813 --> 00:23:08,275 I'll move on to some stuff we did. 310 00:23:09,605 --> 00:23:15,520 Last year we were working on big data, on a relatively new architecture: 311 00:23:17,160 --> 00:23:22,457 ARM64 - it's called AArch64 in OpenJDK land but ARM64 in Debian land. 312 00:23:23,787 --> 00:23:26,769 We were a bit concerned because everything was all shiny and new. 313 00:23:26,769 --> 00:23:28,726 Has it been optimised correctly? 314 00:23:28,726 --> 00:23:31,440 Are there any obvious things we need to optimise? 315 00:23:31,440 --> 00:23:33,820 And we were also interested because everything was so shiny and new 316 00:23:33,820 --> 00:23:35,201 in the whole system. 317 00:23:35,201 --> 00:23:38,969 Not just the JVM but the glibc and the kernel as well. 318 00:23:38,969 --> 00:23:41,890 So how do we get a view of all of this?
319 00:23:41,890 --> 00:23:48,657 I gave a quick talk at the Debian mini-conf before last [2014] about perf, 320 00:23:49,537 --> 00:23:52,650 so I decided we could try and do some clever things with Linux perf 321 00:23:52,650 --> 00:23:57,655 and see if we could get some actually useful debugging information out. 322 00:23:57,655 --> 00:24:02,128 We have the flame graphs that are quite well known. 323 00:24:02,128 --> 00:24:08,192 We also have some previous work: Johannes had a special perf-map-agent that 324 00:24:08,192 --> 00:24:13,350 could basically hook into perf, and it would give you a nice way of running 325 00:24:13,350 --> 00:24:20,001 perf-top, for want of a better expression, and viewing the top Java function names. 326 00:24:21,581 --> 00:24:24,940 This is really good work, and it's really good for a particular use case: 327 00:24:24,940 --> 00:24:29,405 if you just want to do a quick snapshot once and see in that snapshot 328 00:24:29,635 --> 00:24:32,104 where the hotspots were. 329 00:24:32,104 --> 00:24:38,490 For a prolonged workload, with all the functions being JIT'd multiple ways, 330 00:24:38,490 --> 00:24:42,165 with the optimisation going on and everything moving around, 331 00:24:42,165 --> 00:24:46,762 it requires a little bit more information to be captured. 332 00:24:46,762 --> 00:24:50,872 I decided to do a little bit of work on a very similar thing to perf-map-agent, 333 00:24:51,412 --> 00:24:55,725 but an agent that would capture it over a prolonged period of time. 334 00:24:55,725 --> 00:24:59,234 Here's an example flame graph; these are all over the internet. 335 00:24:59,234 --> 00:25:04,707 This is the SHA1 computation example that I gave at the beginning. 336 00:25:04,707 --> 00:25:10,394 As expected, the VM intrinsic SHA1 is the top one. 337 00:25:10,394 --> 00:25:17,494 Not expected by me was this quite significant chunk of CPU execution time.
338 00:25:17,494 --> 00:25:21,408 And there was a significant amount of time being spent copying memory 339 00:25:21,408 --> 00:25:28,080 from the mmapped memory region into the heap, 340 00:25:28,080 --> 00:25:31,315 and then that was passed to the crypto engine. 341 00:25:31,315 --> 00:25:35,487 So we're doing a ton of memory copies for no good reason. 342 00:25:35,487 --> 00:25:38,662 That essentially highlighted a problem. 343 00:25:38,662 --> 00:25:42,494 That was an assumption I made about Java to begin with, which was: if you do 344 00:25:42,494 --> 00:25:45,114 the equivalent of mmap, it should just work like mmap, right? 345 00:25:45,114 --> 00:25:48,118 You should just be able to address the memory. That is not the case. 346 00:25:48,118 --> 00:25:53,836 If you've got a file mapping object and you try to address it, it has to be copied 347 00:25:53,836 --> 00:25:59,200 into safe heap memory first. And that is what was slowing down the program. 348 00:25:59,200 --> 00:26:04,576 If that was omitted you could make the SHA1 computation even quicker. 349 00:26:04,576 --> 00:26:09,022 So that would be the logical target you would want to optimise. 350 00:26:09,022 --> 00:26:12,014 I wanted to extend Johannes' work with something called a 351 00:26:12,014 --> 00:26:15,741 Java Virtual Machine Tool Interface profiling agent. 352 00:26:16,871 --> 00:26:22,649 This is part of the Java Virtual Machine standard: you can make a special library 353 00:26:22,649 --> 00:26:24,507 and then hook this into the JVM. 354 00:26:24,507 --> 00:26:28,199 And the JVM can expose quite a few things to the library. 355 00:26:28,199 --> 00:26:32,413 It exposes a reasonable amount of information as well. 356 00:26:32,413 --> 00:26:39,167 Perf as well has the ability to look at map files natively.
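The map-file format perf reads natively is very simple: each line of /tmp/perf-&lt;pid&gt;.map gives a code region's start address and size in hex, followed by a symbol name. A sketch of emitting one entry (my illustration; perf-map-agent and the JVMTI agent do the real work of walking the JIT'd methods):

```java
class PerfMap {
    // One line of a perf map file: "START SIZE symbol", addresses in hex.
    // perf resolves samples that land in JIT'd code against these entries.
    static String mapLine(long start, long size, String symbol) {
        return String.format("%x %x %s", start, size, symbol);
    }
}
```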
357 00:26:39,837 --> 00:26:44,357 If you are profiling JavaScript, or something similar, I think the 358 00:26:44,357 --> 00:26:48,403 Google V8 JavaScript engine will write out a special map file that says 359 00:26:48,403 --> 00:26:52,716 these program counter addresses correspond to these function names. 360 00:26:52,716 --> 00:26:56,986 I decided to use that in a similar way to what Johannes did for the extended 361 00:26:56,986 --> 00:27:02,481 profiling agent, but I also decided to capture some more information. 362 00:27:05,261 --> 00:27:09,970 I decided to capture the disassembly, so when we run perf annotate 363 00:27:09,970 --> 00:27:13,395 we can see the actual JVM bytecode in our annotation. 364 00:27:13,395 --> 00:27:17,257 We can see how it was JIT'd at the time it was JIT'd. 365 00:27:17,257 --> 00:27:19,659 We can see where the hotspots were. 366 00:27:19,659 --> 00:27:22,817 And that's good. But we can do even better. 367 00:27:22,817 --> 00:27:29,174 We can run an annotated trace that contains the Java class, 368 00:27:29,174 --> 00:27:33,960 the Java method and the bytecode all in one place at the same time. 369 00:27:33,960 --> 00:27:38,950 You can see everything from the JVM in the same place. 370 00:27:38,950 --> 00:27:43,940 This works reasonably well because the perf interface is extremely extensible. 371 00:27:43,940 --> 00:27:47,689 And again, we can do entire-system optimisation. 372 00:27:47,689 --> 00:27:51,776 The bits in red here are the Linux kernel. 373 00:27:51,776 --> 00:27:54,991 Then we get into libraries. 374 00:27:54,991 --> 00:27:57,940 And then we get into Java and more libraries as well. 375 00:27:57,940 --> 00:28:02,340 So we can see everything from top to bottom in one fell swoop. 376 00:28:04,120 --> 00:28:07,609 This is just a quick slide showing the mechanisms employed. 377 00:28:07,609 --> 00:28:11,670 Essentially we have this agent, which is a shared object file.
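[Editor's note: for reference, the map-file convention perf reads natively is a plain-text file named /tmp/perf-<pid>.map, where each line is "START SIZE symbolname" with the start address and size in hex. A hypothetical sketch of emitting one such entry from Java follows; the address, size, and symbol name are invented purely to show the format.]

```java
import java.io.FileWriter;
import java.lang.management.ManagementFactory;

public class PerfMapSketch {
    public static void main(String[] args) throws Exception {
        // perf looks for /tmp/perf-<pid>.map. On HotSpot the RuntimeMXBean
        // name is conventionally "<pid>@<hostname>", so split out the pid.
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];

        // One symbol per line: START SIZE NAME (address and size in hex).
        // These values are made up; a real agent would use the addresses
        // the JIT reports for each compiled method.
        String line = String.format("%x %x %s%n",
                0x7f0000001000L, 0x200L, "Ljava/lang/String;::hashCode");

        try (FileWriter w = new FileWriter("/tmp/perf-" + pid + ".map")) {
            w.write(line);
        }
        System.out.print(line);
    }
}
```

[A real agent, whether perf-map-agent or the JVMTI agent described in the talk, emits one such line for every method the JIT compiles, refreshing entries as code is recompiled and moved.]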
378 00:28:11,670 --> 00:28:16,390 And this will spit out useful files in a standard way. 379 00:28:16,390 --> 00:28:25,625 And Linux perf basically just records the perf.data dump file as normal. 380 00:28:27,155 --> 00:28:29,585 We have two sets of recording going on. 381 00:28:29,585 --> 00:28:35,229 To report it, it's very easy to do normal reporting with the PID map. 382 00:28:35,229 --> 00:28:40,819 This just works out of the box, with the Google V8 engine as well. 383 00:28:40,819 --> 00:28:45,077 If you want to do very clever annotations, perf has the ability to have 384 00:28:45,077 --> 00:28:47,693 Python scripts passed to it. 385 00:28:47,693 --> 00:28:53,521 So you can craft quite a dodgy Python script, and that can interface 386 00:28:53,521 --> 00:28:55,369 with the perf annotation output. 387 00:28:55,369 --> 00:29:00,244 That's how I was able to get the extra Java information into the same annotation. 388 00:29:00,244 --> 00:29:05,047 And this is really easy to do; it's quite easy to knock the script up. 389 00:29:05,047 --> 00:29:10,216 And again, the only thing we do for this profiling is hook in the profiling 390 00:29:10,216 --> 00:29:13,440 agent, which dumps out various things. 391 00:29:13,440 --> 00:29:18,330 We preserve the frame pointer because that makes things considerably easier 392 00:29:18,330 --> 00:29:21,367 when unwinding. This will affect performance a little bit. 393 00:29:21,367 --> 00:29:25,913 And again, when we're reporting we just hook in a Python script. 394 00:29:25,913 --> 00:29:29,697 It's really easy to hook everything in and get it working. 395 00:29:32,692 --> 00:29:37,322 At the moment we have a JVMTI agent. It's actually on http://git.linaro.org now. 396 00:29:38,252 --> 00:29:42,154 Since I gave this talk, Google have extended perf, so it will do 397 00:29:42,154 --> 00:29:45,405 quite a lot of similar things out of the box. 398 00:29:45,405 --> 00:29:49,748 It's worth having a look at the latest perf.
399 00:29:49,748 --> 00:29:54,284 The techniques in this slide deck can obviously be used in other JITs quite easily. 400 00:29:54,284 --> 00:29:59,308 The fact that perf is so easy to extend with scripts can be useful 401 00:29:59,308 --> 00:30:01,215 for other things. 402 00:30:01,215 --> 00:30:05,769 And OpenJDK has a significant amount of cleverness associated with it that 403 00:30:05,769 --> 00:30:10,050 I thought was very surprising and good. So that's what I covered in the talk. 404 00:30:13,000 --> 00:30:17,852 These are basically references to things like command line arguments 405 00:30:17,852 --> 00:30:20,491 and the flame graphs and stuff like that. 406 00:30:20,491 --> 00:30:26,308 If anyone is interested in playing with OpenJDK on ARM64, I'd suggest going here: 407 00:30:26,308 --> 00:30:31,012 http://openjdk.linaro.org, where the most recent builds are. 408 00:30:31,012 --> 00:30:36,002 Obviously fixes are going in upstream, and they're going into distributions as well. 409 00:30:36,002 --> 00:30:40,106 They're included in OpenJDK, so it should be good as well. 410 00:30:41,326 --> 00:30:44,874 I've run through quite a few fundamental things reasonably quickly. 411 00:30:44,874 --> 00:30:48,274 I'd be happy to accept any questions or comments. 412 00:30:54,161 --> 00:30:57,297 And if you want to talk to me privately about Java afterwards, feel free to, 413 00:30:57,297 --> 00:30:59,129 when no-one's looking. 414 00:31:06,896 --> 00:31:12,698 [Applause] 415 00:31:13,315 --> 00:31:19,028 [Audience] It's not really a question so much as a comment. 416 00:31:19,028 --> 00:31:26,331 Last mini-DebConf we had a talk about 417 00:31:27,061 --> 00:31:31,881 using the JVM with other languages.
418 00:31:31,881 --> 00:31:35,845 And it seems to me that all this would apply even if you hate the Java programming 419 00:31:35,845 --> 00:31:39,361 language and want to write in, I don't know, Lisp or something instead, 420 00:31:39,361 --> 00:31:42,424 if you've got a Lisp system that can generate JVM bytecode. 421 00:31:42,424 --> 00:31:48,132 [Presenter] Yeah, totally. And the other big data language we looked at was Scala. 422 00:31:49,162 --> 00:31:53,420 It uses the JVM back end but a completely different language on the front. 423 00:32:03,875 --> 00:32:08,104 Cheers guys.