0:00:01.753,0:00:10.762 Onto the second talk of the day.
0:00:10.762,0:00:15.511 Steve Capper is going to tell us about the good bits of Java.
0:00:15.511,0:00:18.248 They do exist.
0:00:18.248,0:00:23.417 [Audience] Could this have been a lightning talk? [Audience laughter]
0:00:23.417,0:00:26.204 Believe it or not, we've got some good stuff here.
0:00:27.293,0:00:30.534 I was as skeptical as you guys when I first looked.
0:00:30.534,0:00:34.911 First, many apologies for not attending the mini-conf last year.
0:00:34.911,0:00:40.471 I was unfortunately ill on the day I was due to give this talk.
0:00:43.619,0:00:45.742 Let me figure out how to use a computer.
0:01:00.894,0:01:02.160 Sorry about this.
0:01:12.795,0:01:16.034 There we go; it's because I've not woken up.
0:01:16.034,0:01:27.981 Last year I worked at Linaro in the Enterprise group and we performed analysis
0:01:27.981,0:01:32.206 on so-called 'Big Data' application sets.
0:01:32.206,0:01:38.441 As many of you know, quite a lot of these big data applications are written in Java.
0:01:38.441,0:01:43.039 I'm from ARM and we were very interested in 64-bit ARM support.
0:01:43.039,0:01:47.613 So this is mainly AArch64 examples for things like assembler,
0:01:47.613,0:01:53.708 but most of the messages are pertinent for any architecture.
0:01:53.708,0:01:58.956 These good bits are shared between most if not all the architectures.
0:01:58.956,0:02:03.008 Whilst trying to optimise a lot of these big data applications
0:02:03.008,0:02:06.409 I stumbled across quite a few things in the JVM and I thought
0:02:06.409,0:02:11.425 'actually, that's really clever; that's really cool'.
0:02:11.425,0:02:16.673 So I thought that would make a good basis for an interesting talk.
0:02:16.673,0:02:20.330 This talk is essentially some of the clever things I found in the
0:02:20.330,0:02:25.322 Java Virtual Machine; these optimisations are in OpenJDK.
0:02:25.322,0:02:33.554 The source is available; it's all there, readily available and in play now.
0:02:33.554,0:02:38.349 I'm going to finish with some of the optimisation work we did with Java.
0:02:38.349,0:02:42.818 People who know me will know I'm not a Java zealot.
0:02:42.818,0:02:48.809 I don't particularly believe in one programming language over another,
0:02:48.809,0:02:51.561 so to make it clear from the outset, I'm not attempting to convert
0:02:51.561,0:02:54.556 anyone into Java programmers.
0:02:54.556,0:02:57.424 I'm just going to highlight a few salient things in the Java Virtual Machine
0:02:57.424,0:02:59.791 which I found to be quite clever and interesting,
0:02:59.791,0:03:05.365 and I'll try and talk through them with my understanding of them.
0:03:05.365,0:03:10.241 Let's jump straight in and start with an example.
0:03:10.241,0:03:15.199 This is a minimal example for computing a SHA1 sum of a file.
0:03:15.199,0:03:19.819 I've omitted some of the checking at the beginning of the function,
0:03:19.819,0:03:22.397 such as command line parsing and that sort of thing.
0:03:22.397,0:03:25.160 I've highlighted the salient points in red.
0:03:25.160,0:03:31.093 Essentially we instantiate a SHA1 crypto message service digest.
0:03:31.093,0:03:36.108 And we do the equivalent in Java of an mmap.
0:03:36.108,0:03:38.279 Get it all in memory.
0:03:38.279,0:03:42.784 And then we just put this data straight into the crypto engine.
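The slide itself isn't in the transcript, but the shape of the program being described is roughly this: a minimal sketch using the standard java.security and java.nio APIs, with the class name and the omitted argument checks being my own reconstruction.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.security.MessageDigest;

    public class Sha1Sum {
        public static void main(String[] args) throws Exception {
            // Instantiate a SHA-1 message digest service.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            try (RandomAccessFile file = new RandomAccessFile(args[0], "r");
                 FileChannel channel = file.getChannel()) {
                // The Java equivalent of an mmap: get it all in memory.
                MappedByteBuffer buf =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                // Put the data straight into the crypto engine.
                md.update(buf);
            }
            // Spit out the SHA-1 hash at the end.
            for (byte b : md.digest()) System.out.printf("%02x", b);
            System.out.println();
        }
    }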
0:03:42.784,0:03:46.952 And eventually at the end of the program we'll spit out the SHA1 hash.
0:03:46.952,0:03:48.682 It's a very simple program.
0:03:48.682,0:03:56.356 It's basically mmap, SHA1, output the hash afterwards.
0:03:56.356,0:04:04.077 In order to concentrate on the CPU aspect rather than worry about IO,
0:04:04.077,0:04:08.465 I decided to cheat a little bit by setting this up.
0:04:08.465,0:04:15.176 I decided to use a sparse file. As many of you know, a sparse file is a file where not
0:04:15.176,0:04:19.529 all the contents are necessarily stored on disc. The assumption is that the bits
0:04:19.529,0:04:26.379 that aren't stored are zero. For instance, on Linux you can create a 20TB sparse file
0:04:26.379,0:04:31.162 on a 10MB file system and use it as normal.
0:04:31.162,0:04:34.611 Just don't write too much to it, otherwise you're going to run out of space.
0:04:34.611,0:04:40.915 The idea behind using a sparse file is that I'm just focusing on the computational aspects
0:04:40.915,0:04:44.990 of the SHA1 sum. I'm not worried about the file system or anything like that.
0:04:44.990,0:04:49.690 I don't want to worry about the IO. I just want to focus on the actual compute.
0:04:49.690,0:04:53.106 In order to set up a sparse file I used the following runes.
0:04:53.106,0:04:56.877 The important point is that you seek, and the other important point
0:04:56.877,0:05:03.206 is you set a count, otherwise you'll fill your disc up.
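The exact invocation from the slides isn't in the transcript, but runes of this sort would do the job; the sizes here are arbitrary. The seek skips ahead without writing anything, and the count keeps the data actually written tiny:

    dd if=/dev/zero of=sparse.img bs=1M count=1 seek=20479

That yields a file that reads back as roughly 20GB of zeros while occupying only about 1MB on disc.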
0:05:03.206,0:05:09.568 I decided to run this against, firstly, the native SHA1 sum command
0:05:09.568,0:05:15.188 that's built into Linux; let's normalise these results and say that's 1.0.
0:05:15.188,0:05:21.794 I used an older version of the OpenJDK and ran the Java program,
0:05:21.794,0:05:30.339 and that's 1.09 times slower than the reference command. That's quite good.
0:05:30.339,0:05:39.174 Then I used the new OpenJDK; this is now the current JDK, as this is a year on.
0:05:39.174,0:05:46.732 And that took 0.21. It's significantly faster.
0:05:46.732,0:05:50.993 I've stressed that I've done nothing surreptitious in the Java program.
0:05:50.993,0:05:56.171 It is mmap, compute, spit result out.
0:05:56.171,0:06:01.570 But the OpenJDK has essentially got some more context information.
0:06:01.570,0:06:03.625 I'll talk about that as we go through.
0:06:06.202,0:06:11.194 Before I started with Java I had a very simplistic view of it.
0:06:11.194,0:06:17.416 Traditionally, Java is taught as a virtual machine that runs bytecode.
0:06:17.416,0:06:21.481 When you compile a Java program it compiles into bytecode.
0:06:21.481,0:06:25.428 The older versions of the Java Virtual Machine would interpret this bytecode
0:06:25.428,0:06:33.056 and then run through. Newer versions would employ a just-in-time engine
0:06:33.056,0:06:39.627 and try to compile this bytecode into native machine code.
0:06:39.627,0:06:43.191 That is not the only thing that goes on when you run a Java program.
0:06:43.191,0:06:47.568 There are some extra optimisations as well. So this alone would not account for
0:06:47.568,0:06:52.315 the newer version of the SHA1 sum being significantly faster
0:06:52.315,0:06:55.568 than the distro-supplied one.
0:06:55.568,0:07:00.850 Java knows about context. It has a class library, and these class libraries
0:07:00.850,0:07:04.624 have reasonably well defined purposes.
0:07:04.624,0:07:07.746 We have classes that provide crypto services.
0:07:07.746,0:07:11.067 We have sun.misc.Unsafe, which every single project seems to pull into their
0:07:11.067,0:07:13.493 project when they're not supposed to.
0:07:13.493,0:07:17.673 These have well defined meanings.
0:07:17.673,0:07:21.075 These do not necessarily have to be written in Java.
0:07:21.075,0:07:24.569 They come as Java classes; they come supplied.
0:07:24.569,0:07:28.911 But most JVMs now have a notion of a virtual machine intrinsic.
0:07:28.911,0:07:35.227 And the virtual machine intrinsic says: OK, please do a SHA1 in the best possible way
0:07:35.227,0:07:39.500 that your implementation allows. This is something done automatically by the JVM.
0:07:39.500,0:07:43.482 You don't ask for it. If the JVM knows what it's running on and it's reasonably
0:07:43.482,0:07:47.824 recent, this will just happen for you for free.
0:07:47.824,0:07:50.123 And there are quite a few classes that do this.
0:07:50.123,0:07:53.650 There are quite a few clever things with atomics, there's crypto,
0:07:53.650,0:07:58.378 there are mathematical routines as well. Most of these routines in the
0:07:58.378,0:08:03.136 class library have a well defined notion of a virtual machine intrinsic
0:08:03.136,0:08:06.992 and they do run reasonably optimally.
0:08:06.992,0:08:11.984 They are a subject of continuous optimisation as well.
0:08:11.984,0:08:17.312 We've got some runes that are presented on the slides here.
0:08:17.312,0:08:21.342 These are quite useful if you are interested in
0:08:21.342,0:08:24.696 how these intrinsics are made.
0:08:24.696,0:08:29.248 You can ask the JVM to print out a lot of the just-in-time compiled code.
0:08:29.248,0:08:35.239 You can ask the JVM to print out the native methods as well as these intrinsics,
0:08:35.239,0:08:40.731 and in this particular case, after sifting through about 5MB of text,
0:08:40.731,0:08:45.235 I came across this particular SHA1 sum implementation.
0:08:45.235,0:08:53.420 This is AArch64. This is employing the cryptographic extensions
0:08:53.420,0:08:57.263 in the architecture.
0:08:57.263,0:08:59.353 So it's essentially using the CPU instructions, which would explain why
0:08:59.353,0:09:01.884 it's faster. But again, it's done all this automatically.
0:09:01.884,0:09:05.948 This did not require any specific runes or anything to activate.
0:09:05.948,0:09:12.229 We'll see a bit later on how you can more easily find the hot spots
0:09:12.229,0:09:15.128 rather than sifting through a lot of assembler.
0:09:15.128,0:09:18.658 I've mentioned that the cryptographic engine is employed, and again
0:09:18.658,0:09:23.374 this routine was generated at run time as well.
0:09:23.374,0:09:28.390 This is one of the important things about execution environments like Java.
0:09:28.390,0:09:31.439 You don't have to know everything at compile time.
0:09:31.439,0:09:35.019 You know a lot more information at run time and you can use that,
0:09:35.019,0:09:37.838 in theory, to optimise.
0:09:37.838,0:09:40.394 You can switch off these clever routines.
0:09:40.394,0:09:43.459 For instance, I've got a deactivation here and we get back to the
0:09:43.459,0:09:46.699 slower performance we expected.
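For reference, runes along these lines will show and toggle this behaviour on a HotSpot-based OpenJDK; flag availability varies by version, and -XX:+PrintAssembly additionally needs the hsdis disassembler plugin installed ("MyApp" is a placeholder):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics MyApp
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyApp
    java -XX:+UnlockDiagnosticVMOptions -XX:-UseSHA1Intrinsics MyApp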
0:09:46.699,0:09:52.968 Again, this particular set of routines is present in OpenJDK,
0:09:52.968,0:09:57.391 I think for all the architectures that support it.
0:09:57.391,0:10:00.758 We get this optimisation for free on x86 and others as well.
0:10:00.758,0:10:03.301 It works quite well.
0:10:03.301,0:10:08.514 That was one surprise I came across: the intrinsics.
0:10:08.514,0:10:13.216 One thing I thought would be quite good to do would be to go through
0:10:13.216,0:10:18.104 a slightly more complicated example, and use this example to explain
0:10:18.104,0:10:21.865 a lot of other things that happen in the JVM as well.
0:10:21.865,0:10:24.652 I will spend a bit of time going through this example
0:10:24.652,0:10:32.512 and explain roughly the notion of what it's supposed to be doing.
0:10:32.512,0:10:39.605 This is an imaginary method that I've contrived to demonstrate a lot of points
0:10:39.605,0:10:43.146 in the fewest possible lines of code.
0:10:43.146,0:10:43.828 I'll start with what it's meant to do.
0:10:43.828,0:10:50.886 This is meant to be a routine that gets a reference to something and lets you know
0:10:50.886,0:10:58.692 whether or not it's an image in a hypothetical cache.
0:10:58.692,0:11:02.512 I'll start with the important thing here: the weak reference.
0:11:02.512,0:11:09.443 In Java and other garbage collected languages we have the notion of references.
0:11:09.443,0:11:13.588 Most of the time when you are running a Java program you have something like a
0:11:13.588,0:11:19.451 variable name, and within the current execution context that is referred to as a
0:11:19.451,0:11:24.176 strong reference to the object. In other words: I can see it. I am using it.
0:11:24.176,0:11:27.659 Please don't get rid of it. Bad things will happen if you do.
0:11:27.659,0:11:30.561 So the garbage collector knows not to get rid of it.
0:11:30.561,0:11:36.146 In Java and other languages you also have the notion of a weak reference.
0:11:36.146,0:11:40.244 This is essentially the programmer saying to the virtual machine,
0:11:40.244,0:11:43.936 "Look, I kind of care about this, but just a little bit."
0:11:43.936,0:11:49.973 "If you want to get rid of it, feel free to, but please let me know."
0:11:49.973,0:11:54.142 This is why this is for a CacheClass. For instance, the JVM in this particular
0:11:54.142,0:12:01.421 case could decide that it's running quite low on memory, this particular xMB image
0:12:01.421,0:12:04.416 has not been used for a while, and it can garbage collect it.
0:12:04.416,0:12:08.770 The important thing is how we go about expressing this in the language.
0:12:08.770,0:12:13.530 We can't just have a reference to the object, because that's a strong reference
0:12:13.530,0:12:17.768 and the JVM will know it can't get rid of it because the program
0:12:17.768,0:12:19.521 can see it actively.
0:12:19.521,0:12:25.152 So we have a level of indirection, which is known as a weak reference.
0:12:25.152,0:12:28.621 We have this hypothetical CacheClass that I've devised.
0:12:28.621,0:12:32.280 At this point it is a weak reference.
0:12:32.280,0:12:36.204 Then we get it. This is calling the weak reference routine.
0:12:36.204,0:12:41.139 Now it becomes a strong reference, so it's not going to be garbage collected.
0:12:41.139,0:12:45.202 When we get to the return path it becomes a weak reference again,
0:12:45.202,0:12:48.929 because our strong reference has disappeared.
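A reconstruction of the sort of method being described; this is my sketch, not the slide's exact code, and CacheClass with its image check is hypothetical:

    import java.lang.ref.WeakReference;

    class CacheClass {
        boolean isImage() { return true; }  // hypothetical check
    }

    class Cache {
        // The level of indirection: a weak reference to the cached object.
        private final WeakReference<CacheClass> cached =
                new WeakReference<>(new CacheClass());

        boolean isImageInCache() {
            // get() hands back a strong reference, if the object still exists.
            CacheClass c = cached.get();
            if (c == null) {
                return false;  // too late: the garbage collector killed it
            }
            return c.isImage();
            // On return, our strong reference disappears again.
        }
    }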
0:12:48.929,0:12:50.856 The salient points in this example are:
0:12:50.856,0:12:54.293 we're employing a method to get a reference,
0:12:54.293,0:12:57.079 and we're checking an item to see if it's null.
0:12:57.079,0:13:01.050 So let's say that the JVM decided to garbage collect this
0:13:01.050,0:13:04.544 before we executed the method.
0:13:04.544,0:13:08.885 The weak reference class is still valid because we've got a strong reference to it,
0:13:08.885,0:13:11.926 but the actual object behind it is gone.
0:13:11.926,0:13:14.970 If we're too late and the garbage collector has killed it,
0:13:14.970,0:13:17.919 it will be null and we return.
0:13:17.919,0:13:22.331 So it's a level of indirection to see: does this still exist?
0:13:22.331,0:13:27.960 If so, can I please have it? Then we operate on it as normal,
0:13:27.960,0:13:30.794 and on return it becomes a weak reference again.
0:13:30.794,0:13:37.888 This example program is quite useful when we look at how it's implemented in the JVM,
0:13:37.888,0:13:40.384 and we'll go through a few things now.
0:13:40.384,0:13:43.644 First off we'll go through the bytecode.
0:13:43.644,0:13:49.312 The only point of this slide is to show it's roughly
0:13:49.312,0:13:54.152 the same as this.
0:13:54.152,0:13:56.371 We get our variable.
0:13:56.371,0:13:58.981 We use our getter.
0:13:58.981,0:14:03.836 This bit is extra: this checkcast. The reason that bit is extra is
0:14:03.836,0:14:15.110 because we're using the equivalent of a template in Java (generics).
0:14:15.110,0:14:19.173 And the way that's implemented in Java is it just basically casts everything to an
0:14:19.173,0:14:23.167 object, so that requires extra compiler information.
0:14:23.167,0:14:26.035 And this is the extra check.
0:14:26.035,0:14:30.795 For the rest of this: we load the reference, we check to see if it is null,
0:14:30.795,0:14:35.149 if it's not null we invoke a virtual function (is it the image?)
0:14:35.149,0:14:37.981 and we return as normal.
0:14:37.981,0:14:43.357 Essentially the point I'm trying to make is that when we compile this to bytecode,
0:14:43.357,0:14:45.365 this execution happens.
0:14:45.365,0:14:46.758 This null check happens.
0:14:46.758,0:14:48.604 This execution happens.
0:14:48.604,0:14:50.450 And we return.
0:14:50.450,0:14:55.118 In the actual Java class files we've not lost anything.
0:14:55.118,0:14:58.356 This is what it looks like when it's been JIT'd.
0:14:58.356,0:15:01.108 Now we've lost lots of things.
0:15:01.108,0:15:05.578 The JIT has done quite a few clever things which I'll talk about.
0:15:05.578,0:15:10.905 First off, if we look down here, there's a single branch here,
0:15:10.905,0:15:16.167 and this is only if our checkcast failed.
0:15:16.167,0:15:20.381 We've got comments on the right hand side.
0:15:20.381,0:15:27.126 Our get method has been inlined, so we're no longer calling.
0:15:27.126,0:15:32.699 We seem to have lost our null check; that's just gone.
0:15:32.699,0:15:35.938 And again, we've got a getfield as well.
0:15:35.938,0:15:40.153 That's no longer a method; that's been inlined as well.
0:15:40.153,0:15:42.579 We've also got some other cute things.
0:15:42.579,0:15:46.062 Those more familiar with AArch64 will understand
0:15:46.062,0:15:49.870 that the pointers we're using are 32-bit, not 64-bit.
0:15:49.870,0:15:54.015 What we're doing is getting a pointer, shifting it left 3
0:15:54.015,0:15:56.476 and widening it to a 64-bit pointer.
0:15:56.476,0:16:02.245 So we've got 32-bit pointers on a 64-bit system.
0:16:02.245,0:16:06.148 That's saving a reasonable amount of memory and cache.
0:16:06.148,0:16:09.712 To summarise: we don't have any branches or function calls,
0:16:09.712,0:16:12.742 and we've got a lot of inlining.
0:16:12.742,0:16:16.061 We did have function calls in the class file, so it's the JVM,
0:16:16.061,0:16:18.326 it's the JIT, that has done this.
0:16:18.326,0:16:23.330 We've got no null checks either, and I'm going to talk through this now.
0:16:23.330,0:16:29.996 Null check elimination is quite a clever feature in Java and other programs.
0:16:29.996,0:16:33.605 The idea behind null check elimination is that
0:16:33.605,0:16:37.494 most of the time this object is not going to be null.
0:16:37.494,0:16:43.599 If this object is null, the operating system knows this quite quickly.
0:16:43.599,0:16:47.943 So if you try to dereference a null pointer you'll get either a SIGSEGV or
0:16:47.943,0:16:51.275 a SIGBUS, depending on a few circumstances.
0:16:51.275,0:16:53.655 That goes straight back to the JVM,
0:16:53.655,0:16:58.810 and the JVM knows where the null exception took place.
0:16:58.810,0:17:01.759 Because it knows where it took place, it can look this up
0:17:01.759,0:17:05.774 and unwind it as part of an exception.
0:17:05.774,0:17:09.909 Those null checks just go. Completely gone.
0:17:09.909,0:17:15.447 Most of the time this works and you are saving a reasonable amount of execution.
0:17:15.447,0:17:19.836 I'll talk about when it doesn't work in a second.
0:17:19.836,0:17:24.143 That's reasonably clever. We have similar programming techniques in other places;
0:17:24.143,0:17:27.800 even the Linux kernel, for instance, when you copy data to and from user space,
0:17:27.800,0:17:30.912 does pretty much exactly the same thing.
0:17:30.912,0:17:36.636 It has an exception unwind table, and it knows that if it catches a page fault on
0:17:36.636,0:17:40.130 this particular program counter it can deal with it, because it knows
0:17:40.130,0:17:43.729 the program counter and it knows conceptually what it was doing.
0:17:43.729,0:17:47.630 In a similar way, the JIT knows what it's doing to a reasonable degree,
0:17:47.630,0:17:53.459 so it can handle the null check elimination.
0:17:53.459,0:17:57.325 I mentioned the sneaky one: we've got essentially 32-bit pointers
0:17:57.325,0:17:59.658 on a 64-bit system.
0:17:59.658,0:18:04.534 Most of the time in Java, people typically specify a heap size smaller than 32GB,
0:18:04.534,0:18:10.444 which is perfect if you want to use 32-bit pointers and left shift 3,
0:18:10.444,0:18:12.940 because that gives you 32GB of addressable memory.
0:18:12.940,0:18:19.337 That's a significant memory saving, because otherwise a lot of things would double up.
0:18:19.337,0:18:23.041 There's a significant number of pointers in Java.
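This is HotSpot's "compressed oops" feature, and the arithmetic is worth spelling out: with objects aligned to 8 bytes, the low 3 bits of every object address are zero, so a 32-bit reference shifted left by 3 covers 2^32 × 8 bytes = 32GB of heap. The flag below is the standard name in recent OpenJDK, and it defaults to on for small enough heaps ("MyApp" is a placeholder):

    java -Xmx31g -XX:+UseCompressedOops MyApp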
0:18:23.041,0:18:29.055 The one that should make people jump out of their seats is
0:18:29.055,0:18:32.224 the fact that most methods in Java are actually virtual.
0:18:32.224,0:18:37.147 So what the JVM has actually done is inlined a virtual function.
0:18:37.147,0:18:41.698 A virtual function is essentially a function where you don't know where
0:18:41.698,0:18:43.439 you're going until run time.
0:18:43.439,0:18:47.503 You can have several different classes and they share the same virtual function
0:18:47.503,0:18:51.439 in the base class, and depending upon which specific class you're running,
0:18:51.439,0:18:54.364 different virtual functions will get executed.
0:18:54.364,0:19:01.307 In C++ that would be a read from a vtable, and then you know where to go.
0:19:01.307,0:19:03.606 The JVM's inlined it.
0:19:03.606,0:19:05.255 We've saved a memory load.
0:19:05.255,0:19:07.925 We've saved a branch as well.
0:19:07.925,0:19:12.023 The reason the JVM can inline it is because the JVM knows
0:19:12.023,0:19:14.485 every single class that has been loaded.
0:19:14.485,0:19:19.767 So it knows that although this looks polymorphic to the casual programmer,
0:19:19.767,0:19:26.420 it actually is monomorphic. The JVM knows this.
0:19:26.420,0:19:30.982 Because it knows this it can be clever. And this is really clever.
0:19:30.982,0:19:35.660 That's a significant cost saving.
0:19:35.660,0:19:41.664 This is all great. I've already mentioned the null check elimination.
0:19:41.664,0:19:47.410 We're taking a signal, and as most of you know, if we do that a lot it's going to be slow:
0:19:47.410,0:19:51.137 jumping into kernel, into user, bouncing around.
0:19:51.137,0:19:55.735 The JVM also has a notion of 'OK, I've been a bit too clever now;
0:19:55.735,0:19:57.906 I need to back off a bit'.
0:19:57.906,0:20:02.782 Also, there's nothing stopping the user loading more classes
0:20:02.782,0:20:06.799 and rendering the monomorphic assumption invalid.
0:20:06.799,0:20:10.166 So the JVM needs to have a notion of backpedalling and going,
0:20:10.166,0:20:14.264 'OK, I've gone too far and need to deoptimise'.
0:20:14.264,0:20:17.051 The JVM has the ability to deoptimise.
0:20:17.051,0:20:22.865 In other words, it essentially knows that for certain code paths everything's OK,
0:20:22.865,0:20:27.140 but for certain new objects it can't get away with these tricks.
0:20:27.140,0:20:32.678 By the time the new objects are executed they are going to be safe.
0:20:32.678,0:20:36.823 There are ramifications for this. This is the important thing to consider
0:20:36.823,0:20:39.818 with something like Java and other languages and other virtual machines.
0:20:39.818,0:20:46.351 If you're trying to profile this, it means there is a very significant ramification.
0:20:46.351,0:20:52.252 You can have the same class and method JIT'd multiple ways
0:20:52.252,0:20:54.969 and executed at the same time.
0:20:54.969,0:21:00.832 So if you're trying to find a hot spot, the program counter alone isn't enough,
0:21:00.832,0:21:03.885 because you can refer to the same thing in several different ways.
0:21:03.885,0:21:07.914 This is quite common as well, as deoptimisation does take place.
0:21:07.914,0:21:16.215 That's something to bear in mind with the JVM and similar runtime environments.
0:21:16.215,0:21:19.814 You can get a notion of what the JVM's trying to do.
0:21:19.814,0:21:22.612 You can ask it nicely and add a print compilation option,
0:21:22.612,0:21:24.667 and it will tell you what it's doing.
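That option is HotSpot's print-compilation flag, assuming a reasonably recent OpenJDK ("MyApp" is a placeholder):

    java -XX:+PrintCompilation MyApp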
0:21:24.667,0:21:26.885 This is reasonably verbose.
0:21:26.885,0:21:30.008 Typically what happens is the JVM gets excited, JIT'ing everything
0:21:30.008,0:21:32.643 and optimising everything, then it settles down,
0:21:32.643,0:21:35.430 until you load something new and it gets excited again.
0:21:35.430,0:21:38.588 There are a lot of logs. This is mainly useful for debugging, but
0:21:38.588,0:21:42.361 it gives you an appreciation that it's doing a lot of work.
0:21:42.361,0:21:45.635 You can go even further with a log compilation option.
0:21:45.635,0:21:50.732 That produces a lot of XML, and that is useful for people debugging the JVM as well.
0:21:50.732,0:21:56.270 It's quite handy to get an idea of what's going on.
0:21:56.270,0:22:04.025 If that is not enough information, you also have the ability to go even further.
0:22:04.025,0:22:07.334 This is beyond the limit of my understanding.
0:22:07.334,0:22:10.945 I've gone into this a little bit just to show you what can be done.
0:22:10.945,0:22:17.342 You have release builds of OpenJDK and debug builds of OpenJDK.
0:22:17.342,0:22:24.877 The release builds will by default turn off a lot of the diagnostic options.
0:22:24.877,0:22:27.950 You can switch them back on again.
0:22:27.950,0:22:33.619 When you do, you can also gain insight into the compiler there, which is colloquially
0:22:33.619,0:22:37.624 referred to as the C2 JIT.
0:22:37.624,0:22:41.966 You can see, for instance, objects in timelines and visualise them
0:22:41.966,0:22:45.693 as they're being optimised at various stages.
0:22:45.693,0:22:52.938 This is based on a master's thesis by Thomas Würthinger.
0:22:52.938,0:23:00.566 This is something you can play with as well and see how far the optimiser goes.
0:23:00.566,0:23:05.163 And it's also good for people hacking on the JVM.
0:23:05.163,0:23:08.275 I'll move on to some stuff we did.
0:23:08.275,0:23:15.740 Last year we were working on the big data stuff on a relatively new architecture:
0:23:15.740,0:23:23.937 ARM64. It's called AArch64 in OpenJDK land, but ARM64 in Debian land.
0:23:23.937,0:23:26.769 We were a bit concerned because everything was all shiny and new.
0:23:26.769,0:23:28.926 Had it been optimised correctly?
0:23:28.926,0:23:31.680 Were there any obvious things we needed to optimise?
0:23:31.680,0:23:34.060 And we were also interested because everything was so shiny and new
0:23:34.060,0:23:35.291 in the whole system.
0:23:35.291,0:23:39.389 Not just the JVM, but the glibc and the kernel as well.
0:23:39.389,0:23:42.060 So how do we get a view of all of this?
0:23:42.060,0:23:49.827 I gave a quick talk at the Debian mini-conf before last [2014] about perf,
0:23:49.827,0:23:53.020 so we decided we could try and do some clever things with Linux perf
0:23:53.020,0:23:57.965 and see if we could get some actually useful debugging information out.
0:23:57.965,0:24:02.598 We have the flame graphs that are quite well known.
0:24:02.598,0:24:08.612 We also have some previous work: Johannes had a special perf-map-agent that
0:24:08.612,0:24:13.500 could basically hook into perf, and it would give you a nice way of running
0:24:13.500,0:24:20.431 perf top, for want of a better expression, and viewing the top Java function names.
0:24:20.431,0:24:25.130 This is really good work and it's really good for a particular use case:
0:24:25.130,0:24:29.405 if you just want to take a quick snapshot once and see in that snapshot
0:24:29.405,0:24:32.284 where the hotspots were.
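For reference, the usual flame graph pipeline over perf data looks something like this; the stackcollapse-perf.pl and flamegraph.pl scripts are from Brendan Gregg's FlameGraph repository, and the pid and duration here are arbitrary:

    perf record -g -p <java-pid> -- sleep 30
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flames.svg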
0:24:32.284,0:24:38.670 For a prolonged workload, with all the functions being JIT'd multiple ways,
0:24:38.670,0:24:42.165 with the optimisation going on and everything moving around,
0:24:42.165,0:24:46.762 it requires a little bit more information to be captured.
0:24:46.762,0:24:50.872 I decided to do a little bit of work on a very similar thing to perf-map-agent,
0:24:50.872,0:24:56.155 but an agent that would capture it over a prolonged period of time.
0:24:56.155,0:24:59.394 Here's an example flame graph; these are all over the internet.
0:24:59.394,0:25:05.257 This is the SHA1 computation example that I gave at the beginning.
0:25:05.257,0:25:10.654 As expected, the VM intrinsic SHA1 is the top one.
0:25:10.654,0:25:17.494 Not expected by me was this quite significant chunk of CPU execution time:
0:25:17.494,0:25:21.498 there was a significant amount of time being spent copying memory
0:25:21.498,0:25:28.430 from the mmapped memory region into the heap,
0:25:28.430,0:25:31.575 and then that was passed to the crypto engine.
0:25:31.575,0:25:35.687 So we were doing a ton of memory copies for no good reason.
0:25:35.687,0:25:38.972 That essentially highlighted an example.
0:25:38.972,0:25:42.664 That was an assumption I made about Java to begin with, which was: if you do
0:25:42.664,0:25:45.334 the equivalent of mmap, it should just work like mmap, right?
0:25:45.334,0:25:48.398 You should just be able to address the memory. That is not the case.
0:25:48.398,0:25:54.146 If you've got a file mapping object and you try to address it, it has to be copied
0:25:54.146,0:25:59.510 into safe heap memory first. And that is what was slowing down the program.
0:25:59.510,0:26:04.886 If that were omitted you could make the SHA1 computation even quicker.
0:26:04.886,0:26:08.982 So that would be the logical target you would want to optimise.
0:26:08.982,0:26:12.014 I wanted to extend Johannes' work with something called a
0:26:12.014,0:26:15.741 Java Virtual Machine Tools Interface profiling agent.
0:26:15.741,0:26:22.649 This is part of the Java Virtual Machine standard: you can make a special library
0:26:22.649,0:26:24.597 and then hook this into the JVM.
0:26:24.597,0:26:28.349 And the JVM can expose quite a few things to the library.
0:26:28.349,0:26:32.413 It exposes a reasonable amount of information as well.
0:26:32.413,0:26:40.237 Perf also has the ability to look at map files natively.
0:26:40.237,0:26:44.557 If you are profiling JavaScript or something similar, I think the
0:26:44.557,0:26:48.493 Google V8 JavaScript engine will write out a special map file that says
0:26:48.493,0:26:52.846 these program counter addresses correspond to these function names.
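The convention is simple: perf picks up /tmp/perf-<pid>.map, where each line gives a start address, a code size and a symbol name, with the numbers in hex. The entries below are made-up illustrations, and the exact symbol spelling varies by agent:

    7f5c04001000 80 Ljava/lang/String;::hashCode
    7f5c04002000 1a0 Ljava/security/MessageDigest;::update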
0:26:52.846,0:26:57.026 I decided to use that in a similar way to what Johannes did for the extended
0:26:57.026,0:27:03.621 profiling agent, but I also decided to capture some more information as well.
0:27:03.621,0:27:10.110 I decided to capture the disassembly, so when we run perf annotate
0:27:10.110,0:27:13.675 we can see the actual JVM bytecode in our annotation.
0:27:13.675,0:27:17.437 We can see how it was JIT'd at the time when it was JIT'd.
0:27:17.437,0:27:19.909 We can see where the hotspots were.
0:27:19.909,0:27:23.067 And that's good. But we can go even better.
0:27:23.067,0:27:29.174 We can run an annotated trace that contains the Java class,
0:27:29.174,0:27:34.190 the Java method and the bytecode, all in one place at the same time.
0:27:34.190,0:27:38.950 You can see everything from the JVM in the same place.
0:27:38.950,0:27:43.940 This works reasonably well because the perf interface is extremely extensible.
0:27:43.940,0:27:47.889 And again, we can do entire system optimisation.
0:27:47.889,0:27:52.046 The bits in red here are the Linux kernel.
0:27:52.046,0:27:55.111 Then we get into libraries.
0:27:55.111,0:27:58.350 And then we get into Java and more libraries as well.
0:27:58.350,0:28:03.110 So we can see everything from top to bottom in one fell swoop.
0:28:03.110,0:28:07.789 This is just a quick slide showing the mechanisms employed.
0:28:07.789,0:28:12.340 Essentially we have this agent, which is a shared object file,
0:28:12.340,0:28:16.740 and this will spit out useful files here in a standard way.
0:28:16.740,0:28:27.445 And Linux perf basically just records the perf data dump file as normal.
0:28:27.445,0:28:29.755 We have two sets of recording going on.
0:28:29.755,0:28:35.559 To report it, it's very easy to do normal reporting with the PID map.
0:28:35.559,0:28:40.819 This just works out of the box, with the Google V8 engine as well.
0:28:40.819,0:28:45.127 If you want to do very clever annotations, perf has the ability to have
0:28:45.127,0:28:47.913 Python scripts passed to it.
0:28:47.913,0:28:53.521 So you can craft quite a dodgy Python script, and that can interface
0:28:53.521,0:28:55.529 with the perf annotation output.
0:28:55.529,0:29:00.614 That's how I was able to get the extra Java information into the same annotation.
0:29:00.614,0:29:05.247 And this is really easy to do; it's quite easy to knock the script up.
0:29:05.247,0:29:10.216 And again, the only thing we do for this profiling is hook in the profiling
0:29:10.216,0:29:13.440 agent, which dumps out various things.
0:29:13.440,0:29:18.470 We preserve the frame pointer, because that makes things considerably easier
0:29:18.470,0:29:21.547 when unwinding. This will affect performance a little bit.
0:29:21.547,0:29:26.133 And again, when we're reporting we just hook in a Python script.
0:29:26.133,0:29:29.697 It's really easy to hook everything in and get it working.
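One plausible shape for the runes involved; the agent and script names here are mine, -XX:+PreserveFramePointer needs a reasonably recent JDK, and perf script -s needs a perf built with Python support:

    java -XX:+PreserveFramePointer -agentpath:/path/to/libprofagent.so MyApp
    perf record -g -p <java-pid> -- sleep 30
    perf script -s annotate-java.py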
0:29:31.462,0:29:38.602 At the moment we have a JVMTI agent. It's actually on http://git.linaro.org now.
0:29:38.602,0:29:42.364 Since I gave this talk, Google have extended perf anyway, so it will do
0:29:42.364,0:29:45.765 quite a lot of similar things out of the box.
0:29:45.765,0:29:49.748 It's worth having a look at the latest perf.
0:29:49.748,0:29:54.554 The techniques in this slide deck can obviously be used in other JITs quite easily.
0:29:54.554,0:29:59.628 The fact that perf is so easy to extend with scripts can be useful
0:29:59.628,0:30:01.345 for other things.
0:30:01.345,0:30:05.769 And OpenJDK has a significant amount of cleverness associated with it that
0:30:05.769,0:30:11.180 I thought was very surprising and good. So that's what I covered in the talk.
0:30:11.180,0:30:17.832 These are basically references to things like command line arguments
0:30:17.832,0:30:20.851 and the flame graphs and stuff like that.
0:30:20.851,0:30:26.308 If anyone is interested in playing with OpenJDK on ARM64, I'd suggest going here:
0:30:26.308,0:30:31.332 http://openjdk.linaro.org where the most recent builds are.
0:30:31.332,0:30:36.002 Obviously fixes are going in upstream, and they're going into distributions as well.
0:30:36.002,0:30:41.586 They're included in OpenJDK, so it should be good as well.
0:30:41.586,0:30:45.034 I've run through quite a few fundamental things reasonably quickly.
0:30:45.034,0:30:48.274 I'd be happy to accept any questions or comments.
0:30:54.671,0:30:57.817 And if you want to talk to me privately about Java afterwards, feel free to,
0:30:57.817,0:30:59.129 when no one's looking.
0:31:06.896,0:31:12.698 [Audience] Applause
0:31:13.315,0:31:19.028 [Audience] It's not really a question so much as a comment.
0:31:19.028,0:31:31.881 At the last mini-DebConf we had a talk about using the JVM with other languages.
0:31:31.881,0:31:36.025 And it seems to me that all this would apply even if you hate the Java programming
0:31:36.025,0:31:39.671 language and want to write in, I don't know, Lisp or something instead,
0:31:39.671,0:31:42.794 if you've got a Lisp system that can generate JVM bytecode.
0:31:42.794,0:31:48.982 [Presenter] Yeah, totally. And the other big data language we looked at was Scala.
0:31:48.982,0:31:53.870 It uses the JVM back end but a completely different language on the front.
0:32:03.875,0:32:08.104 Cheers, guys.