0:00:01.695,0:00:03.483 Onto the second talk of the day. 0:00:10.562,0:00:14.451 Steve Capper is going to tell us[br]about the good bits of Java 0:00:15.101,0:00:17.198 They do exist 0:00:17.198,0:00:21.067 [Audience] Could this have been a [br]lightning talk? [Audience laughter] 0:00:23.258,0:00:26.019 Believe it or not we've got some [br]good stuff here. 0:00:27.325,0:00:30.459 I was as skeptical as you guys [br]when I first looked. 0:00:30.534,0:00:34.756 First, many apologies for not attending [br]the mini-conf last year 0:00:34.756,0:00:40.031 I was unfortunately ill on the day [br]I was due to give this talk. 0:00:43.664,0:00:45.588 Let me figure out how to use a computer. 0:01:00.997,0:01:03.020 Sorry about this. 0:01:12.607,0:01:15.567 There we go; it's because [br]I've not woken up. 0:01:19.952,0:01:26.945 Last year I worked at Linaro in the [br]Enterprise group and we performed analysis 0:01:28.013,0:01:31.863 on so-called 'Big Data' application sets. 0:01:32.210,0:01:37.423 As many of you know quite a lot of these [br]big data applications are written in Java. 0:01:38.456,0:01:42.550 I'm from ARM and we were very interested[br]in 64bit ARM support. 0:01:42.931,0:01:47.261 So this is mainly AArch64 examples [br]for things like assembler 0:01:47.679,0:01:52.680 but most of the messages are [br]pertinent for any architecture. 0:01:53.505,0:01:58.217 These good bits are shared between [br]most if not all the architectures. 0:01:58.964,0:02:02.783 Whilst trying to optimise a lot of [br]these big data applications 0:02:03.027,0:02:06.409 I stumbled across quite a few things in [br]the JVM and I thought 0:02:06.412,0:02:11.475 'actually that's really clever; [br]that's really cool' 0:02:12.609,0:02:16.656 So I thought that would make a good [br]basis for an interesting talk. 0:02:16.673,0:02:20.313 This talk is essentially some of the [br]clever things I found in the 0:02:20.330,0:02:25.322 Java Virtual Machine; these [br]optimisations are in OpenJDK. 0:02:26.238,0:02:31.602 The source is available, it's all there, [br]readily available and in play now. 0:02:32.992,0:02:37.512 I'm going to finish with some of the [br]optimisation work we did with Java. 0:02:38.371,0:02:42.573 People who know me will know [br]I'm not a Java zealot. 0:02:42.574,0:02:47.860 I don't particularly believe in [br]programming in one language over another 0:02:48.445,0:02:51.366 So to make it clear from the outset [br]I'm not attempting to convert 0:02:51.366,0:02:53.617 anyone into Java programmers. 0:02:53.617,0:02:57.221 I'm just going to highlight a few salient [br]things in the Java Virtual Machine 0:02:57.221,0:02:59.571 which I found to be quite clever and [br]interesting 0:02:59.577,0:03:03.827 and I'll try and talk through them [br]with my understanding of them. 0:03:04.349,0:03:08.867 Let's jump straight in and let's [br]start with an example. 0:03:10.060,0:03:14.498 This is a minimal example for [br]computing a SHA1 sum of a file. 0:03:15.208,0:03:19.717 I've omitted some of the checking at the [br]beginning of the function, the 0:03:19.717,0:03:22.196 command line parsing and that sort of [br]thing. 0:03:22.196,0:03:25.040 I've highlighted the salient [br]points in red. 0:03:25.040,0:03:29.653 Essentially we instantiate a SHA1 [br]crypto message digest service. 0:03:30.116,0:03:35.422 And we do the equivalent in [br]Java of an mmap. 0:03:35.907,0:03:37.992 Get it all in memory. 0:03:37.992,0:03:42.388 And then we just put this data straight [br]into the crypto engine.
0:03:42.494,0:03:46.632 And eventually at the end of the [br]program we'll spit out the SHA1 hash. 0:03:46.632,0:03:48.682 It's a very simple program. 0:03:48.682,0:03:53.466 It's basically mmap, SHA1, output [br]the hash afterwards. 0:03:55.857,0:04:02.832 In order to concentrate on the CPU [br]aspect rather than worry about IO 0:04:03.822,0:04:07.432 I decided to cheat a little bit by [br]setting this up. 0:04:08.129,0:04:14.659 I decided to use a sparse file. As many of[br]you know a sparse file is a file where not 0:04:14.659,0:04:19.529 all the contents are necessarily stored [br]on disc. The assumption is that the bits 0:04:19.529,0:04:26.379 that aren't stored are zero. For instance[br]on Linux you can create a 20TB sparse file 0:04:26.379,0:04:30.802 on a 10MB file system and use it as [br]normal. 0:04:30.802,0:04:34.401 Just don't write too much to it otherwise [br]you're going to run out of space. 0:04:34.401,0:04:40.675 The idea behind using a sparse file is I'm[br]just focusing on the computational aspects 0:04:40.675,0:04:44.730 of the SHA1 sum. I'm not worried about [br]the file system or anything like that. 0:04:44.734,0:04:48.910 I don't want to worry about the IO. I [br]just want to focus on the actual compute. 0:04:48.910,0:04:52.856 In order to set up a sparse file I used [br]the following runes. 0:04:52.856,0:04:56.727 The important point is that you seek[br]and the other important point 0:04:56.728,0:05:01.046 is you set a count otherwise you'll [br]fill your disc up. 0:05:02.556,0:05:09.218 I decided to run this against, firstly, [br]the native SHA1 sum command 0:05:09.218,0:05:15.188 that's built into Linux; let's [br]normalise these results and say that's 1.0 0:05:17.469,0:05:21.454 I used an older version of the OpenJDK [br]and ran the Java program 0:05:21.454,0:05:28.499 and that's 1.09 times slower than the [br]reference command. That's quite good. 0:05:30.009,0:05:38.944 Then I used the new OpenJDK, this is now[br]the current JDK as this is a year on. 0:05:38.944,0:05:44.662 And it took 0.21. It's significantly faster. 0:05:45.712,0:05:50.693 I've stressed that I've done nothing [br]surreptitious in the Java program. 0:05:50.693,0:05:54.481 It is mmap, compute, spit the result out. 0:05:55.771,0:06:00.840 But the OpenJDK has essentially got [br]some more context information. 0:06:00.840,0:06:03.625 I'll talk about that as we go through. 0:06:06.202,0:06:11.014 Before I started with Java I had a very [br]simplistic view of Java. 0:06:11.014,0:06:16.816 Traditionally Java is taught as a virtual [br]machine that runs bytecode. 0:06:16.816,0:06:21.421 Now when you compile a Java program it [br]compiles into bytecode. 0:06:21.421,0:06:25.428 The older versions of the Java Virtual [br]Machine would interpret this bytecode 0:06:25.428,0:06:31.936 and then run through. Newer versions [br]would employ a just-in-time engine 0:06:32.436,0:06:38.167 and try and compile this bytecode [br]into native machine code. 0:06:39.327,0:06:42.841 That is not the only thing that goes on[br]when you run a Java program. 0:06:42.841,0:06:47.318 There are some extra optimisations as well.[br]So this alone would not account for 0:06:47.318,0:06:51.975 the newer version of the SHA1 [br]sum being significantly faster 0:06:51.975,0:06:55.568 than the distro supplied one. 0:06:55.568,0:07:00.700 Java knows about context. It has a class [br]library and these class libraries 0:07:00.700,0:07:04.184 have reasonably well defined purposes.
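[Editor's note: a minimal sketch of the kind of program being described, assuming the standard MessageDigest and FileChannel APIs; the class name and structure are illustrative, not the speaker's actual code.]

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.security.MessageDigest;

    public class Sha1File {
        public static void main(String[] args) throws Exception {
            // Instantiate a SHA1 crypto message digest service.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            try (RandomAccessFile f = new RandomAccessFile(args[0], "r");
                 FileChannel ch = f.getChannel()) {
                // The equivalent in Java of an mmap: get it all in memory.
                // (A single map() call is limited to 2GB; a very large file
                // would have to be mapped and digested in chunks.)
                MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // Put this data straight into the crypto engine.
                md.update(buf);
            }
            // Spit out the SHA1 hash at the end.
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) sb.append(String.format("%02x", b));
            System.out.println(sb);
        }
    }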
0:07:04.184,0:07:07.596 We have classes that provide [br]crypto services. 0:07:07.596,0:07:10.947 We have sun.misc.Unsafe, which every [br]single project seems to pull into their 0:07:10.947,0:07:13.493 project when they're not supposed to. 0:07:13.493,0:07:17.303 These have well defined meanings. 0:07:17.303,0:07:20.785 These do not necessarily have to be [br]written in Java. 0:07:20.785,0:07:24.289 They come as Java classes, [br]they come supplied. 0:07:24.289,0:07:28.591 But most JVMs now have a notion [br]of a virtual machine intrinsic. 0:07:28.591,0:07:34.787 And the virtual machine intrinsic says ok [br]please do a SHA1 in the best possible way 0:07:34.787,0:07:39.170 that your implementation allows. This is [br]something done automatically by the JVM. 0:07:39.170,0:07:43.482 You don't ask for it. If the JVM knows[br]what it's running on and it's reasonably 0:07:43.482,0:07:47.564 recent this will just happen [br]for you for free. 0:07:47.564,0:07:50.053 And there's quite a few classes [br]that do this. 0:07:50.053,0:07:53.520 There's quite a few clever things with [br]atomics, there's crypto, 0:07:53.520,0:07:58.208 there's mathematical routines as well. [br]Most of these routines in the 0:07:58.208,0:08:02.876 class library have a well defined notion [br]of a virtual machine intrinsic 0:08:02.876,0:08:06.992 and they do run reasonably optimally. 0:08:06.992,0:08:11.044 They are a subject of continuous [br]optimisation as well. 0:08:11.684,0:08:15.952 We've got some runes that are [br]presented on the slides here. 0:08:17.102,0:08:21.342 These are quite useful if you [br]are interested in 0:08:21.342,0:08:24.326 how these intrinsics are made. 0:08:24.326,0:08:28.798 You can ask the JVM to print out a lot of[br]the just-in-time compiled code. 0:08:28.798,0:08:34.899 You can ask the JVM to print out the [br]native methods as well as these intrinsics 0:08:34.899,0:08:40.371 and in this particular case after sifting [br]through about 5MB of text 0:08:40.371,0:08:45.035 I've come across this particular SHA1 sum[br]implementation. 0:08:45.035,0:08:52.480 This is AArch64. This is employing the [br]cryptographic extensions 0:08:52.480,0:08:54.093 in the architecture. 0:08:54.093,0:08:57.333 So it's essentially using the CPU [br]instructions, which would explain why 0:08:57.333,0:09:00.464 it's faster. But again it's done [br]all this automatically. 0:09:00.464,0:09:05.948 This did not require any specific runes [br]or anything to activate. 0:09:07.618,0:09:12.009 We'll see a bit later on how you can [br]more easily find the hot spots 0:09:12.009,0:09:14.778 rather than sifting through a lot [br]of assembler. 0:09:14.778,0:09:18.658 I've mentioned that the cryptographic [br]engine is employed and again 0:09:18.658,0:09:23.094 this routine was generated at run [br]time as well. 0:09:23.094,0:09:28.070 This is one of the important things about [br]execution environments like Java. 0:09:28.070,0:09:31.439 You don't have to know everything at [br]compile time. 0:09:31.439,0:09:35.019 You know a lot more information at [br]run time and you can use that 0:09:35.019,0:09:37.408 in theory to optimise. 0:09:37.408,0:09:40.014 You can switch off these clever routines. 0:09:40.014,0:09:43.179 For instance I've deactivated them [br]here and we get back to the 0:09:43.179,0:09:46.699 slower performance we expected. 0:09:46.699,0:09:52.968 Again, this particular set of routines is [br]present in OpenJDK, 0:09:52.968,0:09:57.001 I think for all the architectures that [br]support it.
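[Editor's note: a small sketch of how one might observe the intrinsic at work; the class name and iteration counts are made up. On HotSpot builds that support it, running this with and without -XX:-UseSHA1Intrinsics shows the difference being described, and adding -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (given a hsdis disassembler plugin) dumps the JIT'd code the speaker sifted through.]

    import java.security.MessageDigest;

    public class Sha1Bench {
        public static void main(String[] args) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] block = new byte[1 << 20]; // 1MB of zeros, like the sparse file
            long t0 = System.nanoTime();
            for (int i = 0; i < 4096; i++) {
                md.update(block); // feeds 4GB in total through the digest
            }
            md.digest();
            System.out.printf("%.2fs%n", (System.nanoTime() - t0) / 1e9);
        }
    }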
0:09:57.001,0:10:00.758 We get this optimisation for free on X86 [br]and others as well. 0:10:00.758,0:10:03.071 It works quite well. 0:10:03.071,0:10:07.914 That was one surprise I came across: [br]the intrinsics. 0:10:07.914,0:10:13.096 One thing I thought it would be quite [br]good to do would be to go through 0:10:13.096,0:10:17.874 a slightly more complicated example. [br]And use this example to explain 0:10:17.874,0:10:21.495 a lot of other things that happen [br]in the JVM as well. 0:10:21.495,0:10:24.392 I will spend a bit of time going through [br]this example 0:10:24.392,0:10:29.772 and explain roughly the notion of what [br]it's supposed to be doing. 0:10:32.526,0:10:39.305 This is an imaginary method that I've [br]contrived to demonstrate a lot of points 0:10:39.305,0:10:42.756 in the fewest possible lines of code. 0:10:42.756,0:10:45.238 I'll start with what it's meant to do. 0:10:45.238,0:10:50.606 This is meant to be a routine that gets a[br]reference to something and lets you know 0:10:50.606,0:10:55.952 whether or not it's an image in a [br]hypothetical cache. 0:10:58.155,0:11:02.252 I'll start with the important thing [br]here: the weak reference. 0:11:02.252,0:11:09.023 In Java and other garbage collected [br]languages we have the notion of references 0:11:09.023,0:11:13.428 Most of the time when you are running a [br]Java program you have something like a 0:11:13.428,0:11:19.351 variable name in the current [br]execution context, and that is referred to as a 0:11:19.351,0:11:24.176 strong reference to the object. In other [br]words I can see it. I am using it. 0:11:24.176,0:11:27.289 Please don't get rid of it. [br]Bad things will happen if you do. 0:11:27.289,0:11:30.561 So the garbage collector knows [br]not to get rid of it. 0:11:30.561,0:11:35.766 In Java and other languages you also [br]have the notion of a weak reference. 0:11:35.766,0:11:40.044 This is essentially the programmer saying[br]to the virtual machine 0:11:40.044,0:11:43.936 "Look I kinda care about this but [br]just a little bit." 0:11:44.216,0:11:49.053 "If you want to get rid of it feel free [br]to but please let me know." 0:11:49.493,0:11:53.952 This is why this is for a CacheClass. [br]For instance the JVM in this particular 0:11:53.952,0:12:01.201 case could decide that it's running quite [br]low on memory, this particular x MB image 0:12:01.201,0:12:04.416 has not been used for a while, and it can [br]garbage collect it. 0:12:04.416,0:12:08.530 The important thing is how we go about [br]expressing this in the language. 0:12:08.530,0:12:12.980 We can't just have a reference to the [br]object because that's a strong reference 0:12:12.980,0:12:17.628 and the JVM will know it can't get [br]rid of this because the program 0:12:17.628,0:12:19.341 can see it actively. 0:12:19.341,0:12:23.712 So we have a level of indirection which [br]is known as a weak reference. 0:12:24.652,0:12:28.621 We have this hypothetical CacheClass [br]that I've devised. 0:12:28.621,0:12:32.130 At this point it is a weak reference. 0:12:32.130,0:12:35.974 Then we get it. This is calling the weak [br]reference's get routine. 0:12:35.974,0:12:40.759 Now it becomes a strong reference so [br]it's not going to be garbage collected. 0:12:40.759,0:12:44.722 When we get to the return path it becomes [br]a weak reference again 0:12:44.722,0:12:48.439 because our strong reference [br]has disappeared. 0:12:48.439,0:12:50.856 The salient points in this example are: 0:12:50.856,0:12:54.293 We're employing a method to get [br]a reference.
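[Editor's note: a rough sketch of the kind of contrived method being described, assuming java.lang.ref.WeakReference; the class and method names are invented for illustration.]

    import java.lang.ref.WeakReference;

    class CacheClass {
        boolean isImage() { return true; } // virtual, like most Java methods
    }

    class Cache {
        // The level of indirection: the GC is free to reclaim the CacheClass.
        private final WeakReference<CacheClass> ref;

        Cache(CacheClass c) { ref = new WeakReference<>(c); }

        boolean cachedImage() {
            CacheClass c = ref.get(); // promote to a strong reference
            if (c == null) {
                return false;         // too late: the GC has collected it
            }
            return c.isImage();       // operate on it as normal
        }                             // strong reference goes out of scope here
    }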
0:12:54.293,0:12:56.759 We're checking an item to see if [br]it's null. 0:12:56.759,0:13:01.050 So let's say that the JVM decided to [br]garbage collect this 0:13:01.970,0:13:04.354 before we executed the method. 0:13:04.354,0:13:08.715 The weak reference class is still valid [br]because we've got a strong reference to it 0:13:08.715,0:13:11.786 but the actual object behind this is gone. 0:13:11.786,0:13:14.970 If we're too late and the garbage [br]collector has killed it 0:13:14.970,0:13:17.789 it will be null and we return. 0:13:17.789,0:13:22.131 So it's a level of indirection to see [br]does this still exist 0:13:22.131,0:13:27.680 if so can I please have it and then [br]operate on it as normal 0:13:27.680,0:13:31.254 and then on return it becomes a weak [br]reference again. 0:13:31.254,0:13:37.438 This example program is quite useful when[br]we look at how it's implemented in the JVM 0:13:37.438,0:13:40.214 and we'll go through a few things now. 0:13:40.214,0:13:43.504 First off we'll go through the bytecode. 0:13:43.504,0:13:48.862 The only point of this slide is to [br]show it's roughly 0:13:48.862,0:13:53.882 the same as this. 0:13:53.882,0:13:56.371 We get our variable. 0:13:56.371,0:13:58.821 We use our getter. 0:13:58.821,0:14:03.776 This bit is extra: this checkcast. [br]The reason that bit is extra is 0:14:03.776,0:14:14.950 because we're using the equivalent of [br]a template in Java. 0:14:14.950,0:14:18.983 And the way that's implemented in Java is [br]it just basically casts everything to an 0:14:18.983,0:14:23.337 object so that requires extra [br]compiler information. 0:14:23.337,0:14:25.385 And this is the extra check. 0:14:25.385,0:14:30.795 For the rest of this we load the reference, [br]we check to see if it is null, 0:14:30.795,0:14:35.149 if it's not null we invoke a virtual [br]function - is it the image? - 0:14:35.149,0:14:37.831 and we return as normal. 0:14:37.831,0:14:43.357 Essentially the point I'm trying to make [br]is when we compile this to bytecode 0:14:43.357,0:14:45.255 this execution happens. 0:14:45.255,0:14:46.758 This null check happens. 0:14:46.758,0:14:48.364 This execution happens. 0:14:48.364,0:14:50.000 And we return. 0:14:50.000,0:14:54.608 In the actual Java class files we've not [br]lost anything. 0:14:54.848,0:14:58.166 This is what it looks like when it's [br]been JIT'd. 0:14:58.166,0:15:01.108 Now we've lost lots of things. 0:15:01.108,0:15:05.578 The JIT has done quite a few clever things[br]which I'll talk about. 0:15:05.578,0:15:10.635 First off if we look down here there's [br]a single branch here. 0:15:10.635,0:15:15.077 And this is only if our checkcast failed. 0:15:17.037,0:15:19.991 We've got comments on the [br]right hand side. 0:15:19.991,0:15:25.516 Our get method has been inlined so [br]we're no longer calling it. 0:15:27.006,0:15:31.359 We seem to have lost our null check,[br]that's just gone. 0:15:32.179,0:15:35.598 And again we've got a get field as well. 0:15:35.598,0:15:39.803 That's no longer a method, [br]that's been inlined as well. 0:15:39.803,0:15:42.319 We've also got some other cute things. 0:15:42.319,0:15:45.742 Those more familiar with AArch64 [br]will understand 0:15:45.742,0:15:49.590 that the pointers we're using [br]are 32bit not 64bit. 0:15:49.590,0:15:53.795 What we're doing is getting a pointer [br]and shifting it left 3 0:15:53.795,0:15:56.826 and widening it to a 64bit pointer. 0:15:56.996,0:16:01.655 So we've got 32bit pointers on a [br]64bit system.
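[Editor's note: a sketch of the compressed pointer arithmetic being described, assuming HotSpot's "compressed oops"; the method here is illustrative, not JVM code. A 32bit reference shifted left by 3 covers 2^35 bytes, which is where the 32GB figure mentioned below comes from.]

    public class CompressedOops {
        // Decode a 32bit compressed reference into a 64bit address:
        // widen it, shift left 3 (objects are 8-byte aligned),
        // and add the heap base.
        static long decode(long heapBase, int compressed) {
            return heapBase + ((compressed & 0xFFFFFFFFL) << 3);
        }

        public static void main(String[] args) {
            // 2^32 possible values * 8-byte alignment = 2^35 bytes = 32GB.
            System.out.println((1L << 35) / (1L << 30) + "GB");
        }
    }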
0:16:02.035,0:16:05.988 So that's saving a reasonable amount [br]of memory and cache. 0:16:05.988,0:16:09.582 To summarise: we don't have any [br]branches or function calls 0:16:09.582,0:16:12.632 and we've got a lot of inlining. 0:16:12.632,0:16:15.911 We did have function calls in the [br]class file so it's the JVM; 0:16:15.911,0:16:18.236 it's the JIT that has done this. 0:16:18.236,0:16:22.330 We've got no null checks either and I'm [br]going to talk through this now. 0:16:24.070,0:16:29.426 The null check elimination is quite a [br]clever feature in Java and other programs. 0:16:30.446,0:16:33.345 The idea behind null check elimination is 0:16:33.345,0:16:37.494 most of the time this object is not [br]going to be null. 0:16:37.724,0:16:42.749 If this object is null the operating [br]system knows this quite quickly. 0:16:43.179,0:16:47.793 So if you try to dereference a null [br]pointer you'll get either a SIGSEGV or 0:16:47.793,0:16:50.915 a SIGBUS depending on a [br]few circumstances. 0:16:50.915,0:16:53.455 That goes straight back to the JVM 0:16:53.455,0:16:57.970 and the JVM knows where the null [br]exception took place. 0:16:58.470,0:17:01.759 Because it knows where it took [br]place it can look this up 0:17:01.759,0:17:05.364 and unwind it as part of an exception. 0:17:05.364,0:17:09.709 Those null checks just go.[br]Completely gone. 0:17:09.709,0:17:15.447 Most of the time this works and you are [br]saving a reasonable amount of execution time. 0:17:16.417,0:17:19.836 I'll talk about when it doesn't work [br]in a second. 0:17:19.836,0:17:23.913 That's reasonably clever. We have similar [br]programming techniques in other places; 0:17:23.913,0:17:27.680 even the Linux kernel for instance when [br]you copy data to and from user space 0:17:27.680,0:17:30.912 it does pretty much exactly [br]the same thing. 0:17:30.912,0:17:36.456 It has an exception unwind table and it [br]knows if it catches a page fault on 0:17:36.456,0:17:40.130 this particular program counter[br]it can deal with it because it knows 0:17:40.130,0:17:43.729 the program counter and it knows[br]conceptually what it was doing. 0:17:43.729,0:17:47.630 In a similar way the JIT knows what it's [br]doing to a reasonable degree. 0:17:47.630,0:17:51.869 It can handle the null check elimination. 0:17:53.059,0:17:57.325 I mentioned the sneaky one. We've got[br]essentially 32bit pointers 0:17:57.325,0:17:59.478 on a 64bit system. 0:17:59.478,0:18:04.534 Most of the time in Java people typically [br]specify a heap size smaller than 32GB. 0:18:05.404,0:18:10.004 Which is perfect if you want to use 32bit [br]pointers and left shift 3. 0:18:10.004,0:18:12.800 Because that gives you 32GB of [br]addressable memory. 0:18:12.800,0:18:18.837 That's a significant memory saving because[br]otherwise a lot of things would double up. 0:18:18.847,0:18:22.861 There's a significant number of pointers [br]in Java. 0:18:22.861,0:18:28.845 The one that should make people [br]jump out of their seats is 0:18:28.845,0:18:32.224 the fact that most methods in Java are [br]actually virtual. 0:18:32.224,0:18:36.917 So what the JVM has actually done is [br]inlined a virtual function. 0:18:36.917,0:18:41.578 A virtual function is essentially a [br]function where you don't know where 0:18:41.578,0:18:43.369 you're going until run time.
0:18:43.369,0:18:47.333 You can have several different classes [br]and they share the same virtual function 0:18:47.333,0:18:51.439 in the base class and depending upon [br]which specific class you're running 0:18:51.439,0:18:54.204 different virtual functions will [br]get executed. 0:18:54.204,0:18:59.517 In C++ that will be a read from a vtable[br]and then you know where to go. 0:19:01.047,0:19:03.316 The JVM's inlined it. 0:19:03.316,0:19:05.105 We've saved a memory load. 0:19:05.105,0:19:07.925 We've saved a branch as well. 0:19:07.925,0:19:11.803 The reason the JVM can inline it is [br]because the JVM knows 0:19:11.803,0:19:14.285 every single class that has been loaded. 0:19:14.285,0:19:19.517 So it knows that although this looks [br]polymorphic to the casual programmer 0:19:19.517,0:19:26.010 it actually is monomorphic.[br]The JVM knows this. 0:19:26.010,0:19:30.982 Because it knows this it can be clever.[br]And this is really clever. 0:19:30.982,0:19:34.940 That's a significant cost saving. 0:19:35.240,0:19:40.554 This is all great. I've already mentioned [br]the null check elimination. 0:19:41.364,0:19:46.950 We're taking a signal and as most of you know [br]if we do that a lot it's going to be slow. 0:19:46.950,0:19:51.137 Jumping into the kernel, into user space, [br]bouncing around. 0:19:51.137,0:19:55.525 The JVM also has a notion of [br]'OK I've been a bit too clever now; 0:19:55.525,0:19:57.806 I need to back off a bit' 0:19:57.806,0:20:02.382 Also there's nothing stopping the user [br]loading more classes 0:20:02.382,0:20:06.799 and rendering the monomorphic [br]assumption invalid. 0:20:06.799,0:20:09.936 So the JVM needs to have a notion of [br]backpedalling and go 0:20:09.936,0:20:14.104 'OK I've gone too far and need to [br]deoptimise' 0:20:14.104,0:20:16.881 The JVM has the ability to deoptimise. 0:20:16.881,0:20:22.595 In other words it essentially knows that [br]for certain code paths everything's OK. 0:20:22.595,0:20:26.870 But for certain new objects it can't get [br]away with these tricks. 0:20:26.870,0:20:32.148 By the time the new objects are executed [br]they are going to be safe. 0:20:32.268,0:20:35.363 There are ramifications for this.[br]This is the important thing to consider 0:20:35.363,0:20:39.818 with something like Java and other [br]languages and other virtual machines. 0:20:39.818,0:20:45.551 If you're trying to profile this it means [br]there is a very significant ramification. 0:20:46.181,0:20:51.472 You can have the same class and [br]method JIT'd multiple ways 0:20:52.052,0:20:54.829 and executed at the same time. 0:20:54.829,0:20:59.952 So if you're trying to find a hot spot [br]the program counter's not enough. 0:21:00.582,0:21:03.625 Because you can refer to the same thing [br]in several different ways. 0:21:03.625,0:21:07.914 This is quite common as well, as [br]deoptimisation does take place. 0:21:08.944,0:21:13.715 That's something to bear in mind with the JVM [br]and similar runtime environments. 0:21:15.845,0:21:18.984 You can get a notion of what the JVM's [br]trying to do. 0:21:18.984,0:21:22.262 You can ask it nicely and add a print [br]compilation option 0:21:22.262,0:21:24.667 and it will tell you what it's doing. 0:21:24.667,0:21:26.755 This is reasonably verbose. 0:21:26.755,0:21:29.718 Typically what happens is the JVM gets [br]excited JIT'ing everything 0:21:29.718,0:21:32.413 and optimising everything, then [br]it settles down. 0:21:32.413,0:21:35.150 Until you load something new [br]and it gets excited again.
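[Editor's note: a sketch of the monomorphic inlining and deoptimisation just described; the class names are invented. While Square is the only loaded subclass the JIT can treat s.area() as monomorphic and inline it; loading another subclass later invalidates the assumption and forces a deoptimisation, which the print compilation option mentioned above would show as recompilation.]

    abstract class Shape {
        abstract double area(); // virtual: the target depends on the class
    }

    class Square extends Shape {
        final double side;
        Square(double side) { this.side = side; }
        double area() { return side * side; }
    }

    public class Devirt {
        public static void main(String[] args) {
            Shape s = new Square(3.0);
            double total = 0;
            // Looks polymorphic to the casual programmer, but while Square
            // is the only Shape ever loaded it is actually monomorphic, so
            // the JIT can inline it: no vtable load, no branch.
            for (int i = 0; i < 10_000_000; i++) {
                total += s.area();
            }
            System.out.println(total);
            // Loading and using a second Shape subclass here would render
            // the assumption invalid and the JVM would deoptimise.
        }
    }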
0:21:35.150,0:21:38.278 There's a lot of logs. This is mainly [br]useful for debugging but 0:21:38.278,0:21:42.121 it gives you an appreciation that it's [br]doing a lot of work. 0:21:42.121,0:21:45.285 You can go even further with a log [br]compilation option. 0:21:45.285,0:21:50.492 That produces a lot of XML and that is [br]useful for people debugging the JVM as well. 0:21:51.182,0:21:54.430 It's quite handy to get an idea of [br]what's going on. 0:21:56.810,0:22:03.145 If that is not enough information you [br]also have the ability to go even further. 0:22:04.735,0:22:07.334 This is beyond the limit of my [br]understanding. 0:22:07.334,0:22:10.945 I've gone into this a little bit just to [br]show you what can be done. 0:22:11.295,0:22:17.342 You have release builds of OpenJDK [br]and you have debug builds of OpenJDK. 0:22:17.342,0:22:23.737 The release builds will by default turn [br]off a lot of the diagnostic options. 0:22:24.637,0:22:27.660 You can switch them back on again. 0:22:27.660,0:22:33.299 When you do you can also gain insight [br]into the actual compiler there; it's 0:22:33.299,0:22:37.344 colloquially referred to as [br]the C2 JIT. 0:22:37.344,0:22:41.706 You can see, for instance, objects in [br]timelines and visualize them 0:22:41.706,0:22:45.493 as they're being optimised at various [br]stages and various things. 0:22:45.493,0:22:51.978 So this is based on a master's thesis [br]by Thomas Würthinger. 0:22:53.578,0:22:58.446 This is something you can play with as [br]well and see how far the optimiser goes. 0:22:59.966,0:23:03.373 And it's also good for people hacking [br]on the JVM. 0:23:04.813,0:23:08.275 I'll move onto some stuff we did. 0:23:09.605,0:23:15.520 Last year we were working on big data [br]on a relatively new architecture: 0:23:17.160,0:23:22.457 ARM64; it's called AArch64 in OpenJDK [br]land but ARM64 in Debian land. 0:23:23.787,0:23:26.769 We were a bit concerned because [br]everything's all shiny and new. 0:23:26.769,0:23:28.726 Has it been optimised correctly? 0:23:28.726,0:23:31.440 Are there any obvious things[br]we need to optimise? 0:23:31.440,0:23:33.820 And we're also interested because[br]everything was so shiny and new 0:23:33.820,0:23:35.201 in the whole system. 0:23:35.201,0:23:38.969 Not just the JVM but the glibc and [br]the kernel as well. 0:23:38.969,0:23:41.890 So how do we get a view of all of this? 0:23:41.890,0:23:48.657 I gave a quick talk at the Debian [br]mini-conf before last [2014] about perf 0:23:49.537,0:23:52.650 so we decided we could try and do some [br]clever things with Linux perf 0:23:52.650,0:23:57.655 and see if we could get some actual useful[br]debugging information out. 0:23:57.655,0:24:02.128 We have the flame graphs that are quite [br]well known. 0:24:02.128,0:24:08.192 We also have some previous work; Johannes [br]had a special perf map agent that 0:24:08.192,0:24:13.350 could basically hook into perf and it [br]would give you a nice way of running 0:24:13.350,0:24:20.001 perf-top, for want of a better expression, [br]and viewing the top Java function names. 0:24:21.581,0:24:24.940 This is really good work and it's really [br]good for a particular use case: 0:24:24.940,0:24:29.405 if you just want to do a quick snapshot [br]once and see in that snapshot 0:24:29.635,0:24:32.104 where the hotspots were.
0:24:32.104,0:24:38.490 For a prolonged workload with all [br]the functions being JIT'd multiple ways, 0:24:38.490,0:24:42.165 with the optimisation going on and [br]everything moving around, 0:24:42.165,0:24:46.762 it requires a little bit more information [br]to be captured. 0:24:46.762,0:24:50.872 I decided to do a little bit of work on a [br]very similar thing to perf-map-agent 0:24:51.412,0:24:55.725 but an agent that would capture it over [br]a prolonged period of time. 0:24:55.725,0:24:59.234 Here's an example flame graph; these are [br]all over the internet. 0:24:59.234,0:25:04.707 This is the SHA1 computation example that [br]I gave at the beginning. 0:25:04.707,0:25:10.394 As expected the VM intrinsic SHA1 is the [br]top one. 0:25:10.394,0:25:17.494 Not expected by me was this quite [br]significant chunk of CPU execution time. 0:25:17.494,0:25:21.408 And there was a significant amount of [br]time being spent copying memory 0:25:21.408,0:25:28.080 from the mmapped memory [br]region into the heap 0:25:28.080,0:25:31.315 and then that was passed to [br]the crypto engine. 0:25:31.315,0:25:35.487 So we're doing a ton of memory copies for [br]no good reason. 0:25:35.487,0:25:38.662 That essentially highlighted an example. 0:25:38.662,0:25:42.494 That was an assumption I made about Java [br]to begin with, which was if you do 0:25:42.494,0:25:45.114 the equivalent of mmap it should just [br]work like mmap, right? 0:25:45.114,0:25:48.118 You should just be able to address the [br]memory. That is not the case.[br] 0:25:48.118,0:25:53.836 If you've got a file mapping object and [br]you try to address it it has to be copied 0:25:53.836,0:25:59.200 into safe heap memory first. And that is [br]what was slowing down the program. 0:25:59.200,0:26:04.576 If that was omitted you could make [br]the SHA1 computation even quicker. 0:26:04.576,0:26:09.022 So that would be the logical target you [br]would want to optimise. 0:26:09.022,0:26:12.014 I wanted to extend Johannes' work [br]with something called a 0:26:12.014,0:26:15.741 Java Virtual Machine Tools Interface [br]profiling agent. 0:26:16.871,0:26:22.649 This is part of the Java Virtual Machine [br]standard: you can make a special library 0:26:22.649,0:26:24.507 and then hook this into the JVM. 0:26:24.507,0:26:28.199 And the JVM can expose quite a few [br]things to the library. 0:26:28.199,0:26:32.413 It exposes a reasonable amount of [br]information as well. 0:26:32.413,0:26:39.167 Perf as well has the ability to look [br]at map files natively. 0:26:39.837,0:26:44.357 If you are profiling JavaScript, or [br]something similar, I think the 0:26:44.357,0:26:48.403 Google V8 JavaScript engine will write [br]out a special map file that says 0:26:48.403,0:26:52.716 these program counter addresses correspond[br]to these function names. 0:26:52.716,0:26:56.986 I decided to use that in a similar way to [br]what Johannes did for the extended 0:26:56.986,0:27:02.481 profiling agent but I also decided to [br]capture some more information as well. 0:27:05.261,0:27:09.970 I decided to capture the disassembly [br]so when we run perf annotate 0:27:09.970,0:27:13.395 we can see the actual JVM bytecode [br]in our annotation. 0:27:13.395,0:27:17.257 We can see how it was JIT'd at the [br]time when it was JIT'd. 0:27:17.257,0:27:19.659 We can see where the hotspots were. 0:27:19.659,0:27:22.817 And that's good. But we can go [br]even better.
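[Editor's note: the map files referred to here follow perf's /tmp/perf-<pid>.map convention, one "START SIZE symbolname" line per JIT'd method with the addresses in hex. A real agent obtains the addresses from JVMTI events such as CompiledMethodLoad; this sketch only illustrates the file format, with made-up values.]

    import java.io.FileWriter;
    import java.lang.management.ManagementFactory;

    public class PerfMapSketch {
        public static void main(String[] args) throws Exception {
            // Old-style way of finding our own pid ("12345@hostname").
            String pid =
                ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
            try (FileWriter w = new FileWriter("/tmp/perf-" + pid + ".map")) {
                // START SIZE SYMBOL - these values are purely illustrative.
                w.write(String.format("%x %x %s%n",
                        0x7f0000001000L, 0x400L, "Cache::cachedImage"));
            }
        }
    }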
0:27:22.817,0:27:29.174 We can run an annotated trace that [br]contains the Java class, 0:27:29.174,0:27:33.960 the Java method and the bytecode all in [br]one place at the same time. 0:27:33.960,0:27:38.950 You can see everything from the JVM [br]in the same place. 0:27:38.950,0:27:43.940 This works reasonably well because the [br]perf interface is extremely extensible. 0:27:43.940,0:27:47.689 And again we can do entire [br]system optimisation. 0:27:47.689,0:27:51.776 The bits in red here are the Linux kernel. 0:27:51.776,0:27:54.991 Then we get into libraries. 0:27:54.991,0:27:57.940 And then we get into Java and more [br]libraries as well. 0:27:57.940,0:28:02.340 So we can see everything from top to [br]bottom in one fell swoop. 0:28:04.120,0:28:07.609 This is just a quick slide showing the [br]mechanisms employed. 0:28:07.609,0:28:11.670 Essentially we have this agent, which is [br]a shared object file. 0:28:11.670,0:28:16.390 And this will spit out useful files here [br]in a standard way. 0:28:16.390,0:28:25.625 And Linux perf basically just records [br]the perf data dump file as normal. 0:28:27.155,0:28:29.585 We have two sets of recording going on. 0:28:29.585,0:28:35.229 To report it, it's very easy to do [br]normal reporting with the PID map. 0:28:35.229,0:28:40.819 This is just out of the box; it works with [br]the Google V8 engine as well. 0:28:40.819,0:28:45.077 If you want to do very clever annotations [br]perf has the ability to have 0:28:45.077,0:28:47.693 Python scripts passed to it. 0:28:47.693,0:28:53.521 So you can craft quite a dodgy Python [br]script and that can interface 0:28:53.521,0:28:55.369 with the perf annotation output. 0:28:55.369,0:29:00.244 That's how I was able to get the extra [br]Java information in the same annotation. 0:29:00.244,0:29:05.047 And this is really easy to do; it's quite [br]easy to knock the script up. 0:29:05.047,0:29:10.216 And again the only thing we do for this [br]profiling is we hook in the profiling 0:29:10.216,0:29:13.440 agent which dumps out various things. 0:29:13.440,0:29:18.330 We preserve the frame pointer because [br]that makes things considerably easier 0:29:18.330,0:29:21.367 when unwinding. This will affect [br]performance a little bit. 0:29:21.367,0:29:25.913 And again when we're reporting we just [br]hook in a Python script. 0:29:25.913,0:29:29.697 It's really easy to hook everything in [br]and get it working. 0:29:32.692,0:29:37.322 At the moment we have a JVMTI agent. It's [br]actually on http://git.linaro.org now. 0:29:38.252,0:29:42.154 Since I gave this talk Google have [br]extended perf so it will do 0:29:42.154,0:29:45.405 quite a lot of similar things out of the [br]box. 0:29:45.405,0:29:49.748 It's worth having a look at the [br]latest perf. 0:29:49.748,0:29:54.284 The techniques in this slide deck can[br]obviously be used in other JITs quite easily. 0:29:54.284,0:29:59.308 The fact that perf is so easy to extend [br]with scripts can be useful 0:29:59.308,0:30:01.215 for other things. 0:30:01.215,0:30:05.769 And OpenJDK has a significant amount of [br]cleverness associated with it that 0:30:05.769,0:30:10.050 I thought was very surprising and good. [br]So that's what I covered in the talk. 0:30:13.000,0:30:17.852 These are basically references to things [br]like command line arguments 0:30:17.852,0:30:20.491 and the Flame graphs and stuff like that.
0:30:20.491,0:30:26.308 If anyone is interested in playing with [br]OpenJDK on ARM64 I'd suggest going here: 0:30:26.308,0:30:31.012 http://openjdk.linaro.org[br]where the most recent builds are. 0:30:31.012,0:30:36.002 Obviously fixes are going in upstream and [br]they're going into distributions as well. 0:30:36.002,0:30:40.106 They're included in OpenJDK so it should [br]be good as well. 0:30:41.326,0:30:44.874 I've run through quite a few fundamental [br]things reasonably quickly. 0:30:44.874,0:30:48.274 I'd be happy to accept any questions [br]or comments 0:30:54.161,0:30:57.297 And if you want to talk to me privately [br]about Java afterwards feel free to 0:30:57.297,0:30:59.129 when no-one's looking. 0:31:06.896,0:31:12.698 [Applause] 0:31:13.315,0:31:19.028 [Audience] It's not really a question so [br]much as a comment. 0:31:19.028,0:31:31.881 Last mini-DebConf we had a talk about [br]using the JVM with other languages. 0:31:31.881,0:31:35.845 And it seems to me that all this would [br]apply even if you hate the Java programming 0:31:35.845,0:31:39.361 language and want to write in, I don't [br]know, Lisp or something instead, 0:31:39.361,0:31:42.424 if you've got a Lisp system that can [br]generate JVM bytecode. 0:31:42.424,0:31:48.132 [Presenter] Yeah, totally. And the other [br]big data language we looked at was Scala. 0:31:49.162,0:31:53.420 It uses the JVM back end but a completely [br]different language on the front. 0:32:03.875,0:32:08.104 Cheers guys.