WEBVTT

00:00:01.753 --> 00:00:10.762
Onto the second talk of the day.

00:00:10.762 --> 00:00:15.511
Steve Capper is going to tell us about the good bits of Java.

00:00:15.511 --> 00:00:18.248
They do exist.

00:00:18.248 --> 00:00:23.417
[Audience] Could this have been a lightning talk? [Audience laughter]

00:00:23.417 --> 00:00:26.204
Believe it or not, we've got some good stuff here.

00:00:27.293 --> 00:00:30.534
I was as skeptical as you guys when I first looked.

00:00:30.534 --> 00:00:34.911
First, many apologies for not attending the mini-conf last year.

00:00:34.911 --> 00:00:40.471
I was unfortunately ill on the day I was due to give this talk.

00:00:43.619 --> 00:00:45.742
Let me figure out how to use a computer.

00:01:00.894 --> 00:01:02.160
Sorry about this.

00:01:12.795 --> 00:01:16.034
There we go; it's because I've not woken up.

00:01:16.034 --> 00:01:27.981
Last year I worked at Linaro in the Enterprise group, and we performed analysis

00:01:27.981 --> 00:01:32.206
on so-called 'Big Data' application sets.

00:01:32.206 --> 00:01:38.441
As many of you know, quite a lot of these big data applications are written in Java.

00:01:38.441 --> 00:01:43.039
I'm from ARM, and we were very interested in 64-bit ARM support.

00:01:43.039 --> 00:01:47.613
So this talk mainly has AArch64 examples for things like assembler,

00:01:47.613 --> 00:01:53.708
but most of the messages are pertinent for any architecture.

00:01:53.708 --> 00:01:58.956
These good bits are shared between most, if not all, of the architectures.

00:01:58.956 --> 00:02:03.008
Whilst trying to optimise a lot of these big data applications

00:02:03.008 --> 00:02:06.409
I stumbled across quite a few things in the JVM and I thought

00:02:06.409 --> 00:02:11.425
'actually, that's really clever; that's really cool'.

00:02:11.425 --> 00:02:16.673
So I thought that would make a good basis for an interesting talk.

00:02:16.673 --> 00:02:20.330
This talk is essentially some of the clever things I found in the

00:02:20.330 --> 00:02:25.322
Java Virtual Machine; these optimisations are in OpenJDK.

00:02:25.322 --> 00:02:33.554
The source is available; it's all there, readily available and in play now.

00:02:33.554 --> 00:02:38.349
I'm going to finish with some of the optimisation work we did with Java.

00:02:38.349 --> 00:02:42.818
People who know me will know I'm not a Java zealot.

00:02:42.818 --> 00:02:48.809
I don't particularly believe in programming in one language over another,

00:02:48.809 --> 00:02:51.561
so to make it clear from the outset: I'm not attempting to convert

00:02:51.561 --> 00:02:54.556
anyone into Java programmers.

00:02:54.556 --> 00:02:57.424
I'm just going to highlight a few salient things in the Java Virtual Machine

00:02:57.424 --> 00:02:59.791
which I found to be quite clever and interesting,

00:02:59.791 --> 00:03:05.365
and I'll try and talk through them with my understanding of them.

00:03:05.365 --> 00:03:10.241
Let's jump straight in and start with an example.

00:03:10.241 --> 00:03:15.199
This is a minimal example for computing a SHA1 sum of a file.

00:03:15.199 --> 00:03:19.819
I've omitted some of the checking at the beginning of the function,

00:03:19.819 --> 00:03:22.397
the command line parsing and that sort of thing.

00:03:22.397 --> 00:03:25.160
I've highlighted the salient points in red.

00:03:25.160 --> 00:03:31.093
Essentially we instantiate a SHA1 crypto message digest.

00:03:31.093 --> 00:03:36.108
And we do the equivalent in Java of an mmap.

00:03:36.108 --> 00:03:38.279
Get it all in memory.
00:03:38.279 --> 00:03:42.784
And then we just put this data straight into the crypto engine.

00:03:42.784 --> 00:03:46.952
And eventually, at the end of the program, we'll spit out the SHA1 hash.

00:03:46.952 --> 00:03:48.682
It's a very simple program.

00:03:48.682 --> 00:03:56.356
It's basically mmap, SHA1, output the hash afterwards.

00:03:56.356 --> 00:04:04.077
In order to concentrate on the CPU aspect rather than worry about IO,

00:04:04.077 --> 00:04:08.465
I decided to cheat a little bit by setting this up.

00:04:08.465 --> 00:04:15.176
I decided to use a sparse file. As many of you know, a sparse file is a file where not

00:04:15.176 --> 00:04:19.529
all the contents are necessarily stored on disc. The assumption is that the bits

00:04:19.529 --> 00:04:26.379
that aren't stored are zero. For instance, on Linux you can create a 20TB sparse file

00:04:26.379 --> 00:04:31.162
on a 10MB file system and use it as normal.

00:04:31.162 --> 00:04:34.611
Just don't write too much to it, otherwise you're going to run out of space.

00:04:34.611 --> 00:04:40.915
The idea behind using a sparse file is that I'm just focusing on the computational aspects

00:04:40.915 --> 00:04:44.990
of the SHA1 sum. I'm not worried about the file system or anything like that.

00:04:44.990 --> 00:04:49.690
I don't want to worry about the IO. I just want to focus on the actual compute.

00:04:49.690 --> 00:04:53.106
In order to set up a sparse file I used the following runes.

00:04:53.106 --> 00:04:56.877
The important point is that you seek, and the other important point

00:04:56.877 --> 00:05:03.206
is that you set a count, otherwise you'll fill your disc up.

00:05:03.206 --> 00:05:09.568
I decided to run this against, firstly, the native SHA1 sum command

00:05:09.568 --> 00:05:15.188
that's built into Linux; let's normalise these results and say that's 1.0.

00:05:15.188 --> 00:05:21.794
I used an older version of the OpenJDK and ran the Java program,

00:05:21.794 --> 00:05:30.339
and that's 1.09 times slower than the reference command. That's quite good.

00:05:30.339 --> 00:05:39.174
Then I used the new OpenJDK; this is now the current JDK, as this is a year on.

00:05:39.174 --> 00:05:46.732
And that one took 0.21. It's significantly faster.

00:05:46.732 --> 00:05:50.993
I've stressed that I've done nothing surreptitious in the Java program.

00:05:50.993 --> 00:05:56.171
It is mmap, compute, spit the result out.

00:05:56.171 --> 00:06:01.570
But the OpenJDK has essentially got some more context information.

00:06:01.570 --> 00:06:03.625
I'll talk about that as we go through.

00:06:06.202 --> 00:06:11.194
Before I started on Java I had a very simplistic view of it.

00:06:11.194 --> 00:06:17.416
Traditionally, Java is taught as a virtual machine that runs bytecode.

00:06:17.416 --> 00:06:21.481
Now, when you compile a Java program it compiles into bytecode.

00:06:21.481 --> 00:06:25.428
The older versions of the Java Virtual Machine would interpret this bytecode

00:06:25.428 --> 00:06:33.056
and then run through it. Newer versions would employ a just-in-time engine

00:06:33.056 --> 00:06:39.627
and try and compile this bytecode into native machine code.

00:06:39.627 --> 00:06:43.191
That is not the only thing that goes on when you run a Java program.

00:06:43.191 --> 00:06:47.568
There are some extra optimisations as well. So this alone would not account for

00:06:47.568 --> 00:06:52.315
the newer version of the SHA1 sum being significantly faster

00:06:52.315 --> 00:06:55.568
than the distro-supplied one.
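NOTE
A minimal sketch of the kind of program described above (the talk's own slide code isn't in the transcript, so the class name, file size and hex formatting here are illustrative). It also sets up the sparse file from Java rather than with dd-style seek/count runes, since extending a file without writing to it leaves a hole on most Linux filesystems:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.security.MessageDigest;

    public class Sha1Sum {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile(args[0], "rw")) {
                // Sparse file: all holes, reads back as zeroes.
                file.setLength(1L << 30);
                // Instantiate the SHA1 message digest...
                MessageDigest md = MessageDigest.getInstance("SHA-1");
                // ...do the Java equivalent of an mmap to get it all in memory...
                FileChannel ch = file.getChannel();
                MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // ...put the data straight into the crypto engine...
                md.update(buf);
                // ...and spit out the SHA1 hash at the end.
                StringBuilder hex = new StringBuilder();
                for (byte b : md.digest()) hex.append(String.format("%02x", b));
                System.out.println(hex);
            }
        }
    }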
00:06:55.568 --> 00:07:00.850
Java knows about context. It has a class library, and these class libraries

00:07:00.850 --> 00:07:04.624
have reasonably well defined purposes.

00:07:04.624 --> 00:07:07.746
We have classes that provide crypto services.

00:07:07.746 --> 00:07:11.067
We have sun.misc.Unsafe, which every single project seems to pull into their

00:07:11.067 --> 00:07:13.493
project when they're not supposed to.

00:07:13.493 --> 00:07:17.673
These have well defined meanings.

00:07:17.673 --> 00:07:21.075
These do not necessarily have to be written in Java.

00:07:21.075 --> 00:07:24.569
They come as Java classes; they come supplied.

00:07:24.569 --> 00:07:28.911
But most JVMs now have a notion of a virtual machine intrinsic.

00:07:28.911 --> 00:07:35.227
And the virtual machine intrinsic says: OK, please do a SHA1 in the best possible way

00:07:35.227 --> 00:07:39.500
that your implementation allows. This is something done automatically by the JVM.

00:07:39.500 --> 00:07:43.482
You don't ask for it. If the JVM knows what it's running on, and it's reasonably

00:07:43.482 --> 00:07:47.824
recent, this will just happen for you, for free.

00:07:47.824 --> 00:07:50.123
And there are quite a few classes that do this.

00:07:50.123 --> 00:07:53.650
There are quite a few clever things with atomics, there's crypto,

00:07:53.650 --> 00:07:58.378
and there are mathematical routines as well. Most of these routines in the

00:07:58.378 --> 00:08:03.136
class library have a well defined notion of a virtual machine intrinsic,

00:08:03.136 --> 00:08:06.992
and they do run reasonably optimally.

00:08:06.992 --> 00:08:11.984
They are a subject of continuous optimisation as well.

00:08:11.984 --> 00:08:17.312
We've got some runes that are presented on the slides here.

00:08:17.312 --> 00:08:21.342
These are quite useful if you are interested in

00:08:21.342 --> 00:08:24.696
how these intrinsics are made.

00:08:24.696 --> 00:08:29.248
You can ask the JVM to print out a lot of the just-in-time compiled code.

00:08:29.248 --> 00:08:35.239
You can ask the JVM to print out the native methods as well as these intrinsics,

00:08:35.239 --> 00:08:40.731
and in this particular case, after sifting through about 5MB of text,

00:08:40.731 --> 00:08:45.235
I came across this particular SHA1 sum implementation.

00:08:45.235 --> 00:08:53.420
This is AArch64. This is employing the cryptographic extensions

00:08:53.420 --> 00:08:57.263
in the architecture.

00:08:57.263 --> 00:08:59.353
So it's essentially using the CPU instructions, which would explain why

00:08:59.353 --> 00:09:01.884
it's faster. But again, it's done all this automatically.

00:09:01.884 --> 00:09:05.948
This did not require any specific runes or anything to activate.

00:09:05.948 --> 00:09:12.229
We'll see a bit later on how you can more easily find the hot spots,

00:09:12.229 --> 00:09:15.128
rather than sifting through a lot of assembler.

00:09:15.128 --> 00:09:18.658
I've mentioned that the cryptographic engine is employed, and again,

00:09:18.658 --> 00:09:23.374
this routine was generated at run time as well.

00:09:23.374 --> 00:09:28.390
This is one of the important things about execution environments like Java.

00:09:28.390 --> 00:09:31.439
You don't have to know everything at compile time.

00:09:31.439 --> 00:09:35.019
You know a lot more information at run time, and you can use that,

00:09:35.019 --> 00:09:37.838
in theory, to optimise.

00:09:37.838 --> 00:09:40.394
You can switch off these clever routines.
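NOTE
The transcript doesn't capture the exact runes on the slide; these are the standard HotSpot spellings I'd expect (MyProgram is a placeholder, and PrintAssembly additionally needs the hsdis disassembler plugin installed):

    # print the just-in-time compiled code
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyProgram

    # print the generated stub routines, which is roughly where
    # intrinsics like the SHA1 one live
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode MyProgram

    # switch the clever routine back off, falling back to pure Java
    java -XX:-UseSHA1Intrinsics MyProgram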
00:09:40.394 --> 00:09:43.459
For instance, I've deactivated it here, and we get back to the

00:09:43.459 --> 00:09:46.699
slower performance we expected.

00:09:46.699 --> 00:09:52.968
Again, this particular set of routines is present in OpenJDK,

00:09:52.968 --> 00:09:57.391
I think, for all the architectures that support it.

00:09:57.391 --> 00:10:00.758
We get this optimisation for free on x86 and others as well.

00:10:00.758 --> 00:10:03.301
It works quite well.

00:10:03.301 --> 00:10:08.514
That was one surprise I came across: the intrinsics.

00:10:08.514 --> 00:10:13.216
One thing I thought would be quite good to do would be to go through

00:10:13.216 --> 00:10:18.104
a slightly more complicated example, and use this example to explain

00:10:18.104 --> 00:10:21.865
a lot of other things that happen in the JVM as well.

00:10:21.865 --> 00:10:24.652
I will spend a bit of time going through this example

00:10:24.652 --> 00:10:32.512
and explain roughly the notion of what it's supposed to be doing.

00:10:32.512 --> 00:10:39.605
This is an imaginary method that I've contrived to demonstrate a lot of points

00:10:39.605 --> 00:10:43.146
in the fewest possible lines of code.

00:10:43.146 --> 00:10:43.828
I'll start with what it's meant to do.

00:10:43.828 --> 00:10:50.886
This is meant to be a routine that gets a reference to something and lets you know

00:10:50.886 --> 00:10:58.692
whether or not it's an image in a hypothetical cache.

00:10:58.692 --> 00:11:02.512
I'll start with the important thing here: the weak reference.

00:11:02.512 --> 00:11:09.443
In Java, and other garbage collected languages, we have the notion of references.

00:11:09.443 --> 00:11:13.588
Most of the time when you are running a Java program you have something like a

00:11:13.588 --> 00:11:19.451
variable name, and that, in the current execution context, is referred to as a

00:11:19.451 --> 00:11:24.176
strong reference to the object. In other words: I can see it. I am using it.

00:11:24.176 --> 00:11:27.659
Please don't get rid of it. Bad things will happen if you do.

00:11:27.659 --> 00:11:30.561
So the garbage collector knows not to get rid of it.

00:11:30.561 --> 00:11:36.146
In Java and other languages you also have the notion of a weak reference.

00:11:36.146 --> 00:11:40.244
This is essentially the programmer saying to the virtual machine:

00:11:40.244 --> 00:11:43.936
"Look, I kinda care about this, but just a little bit."

00:11:43.936 --> 00:11:49.973
"If you want to get rid of it, feel free to, but please let me know."

00:11:49.973 --> 00:11:54.142
This is why this is for a CacheClass. For instance, the JVM in this particular

00:11:54.142 --> 00:12:01.421
case could decide that it's running quite low on memory, this particular xMB image

00:12:01.421 --> 00:12:04.416
has not been used for a while, and it can garbage collect it.

00:12:04.416 --> 00:12:08.770
The important thing is how we go about expressing this in the language.

00:12:08.770 --> 00:12:13.530
We can't just have a reference to the object, because that's a strong reference,

00:12:13.530 --> 00:12:17.768
and the JVM will know it can't get rid of this because the program

00:12:17.768 --> 00:12:19.521
can see it actively.

00:12:19.521 --> 00:12:25.152
So we have a level of indirection, which is known as a weak reference.

00:12:25.152 --> 00:12:28.621
We have this hypothetical CacheClass that I've devised.

00:12:28.621 --> 00:12:32.280
At this point it is a weak reference.

00:12:32.280 --> 00:12:36.204
Then we get it.
This is calling the weak reference routine.
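NOTE
A sketch of the sort of contrived method being described (the slide's actual code isn't in the transcript; CacheClass, cached, hasImage and isImage are illustrative names):

    import java.lang.ref.WeakReference;

    class CacheClass {
        boolean isImage() { return true; }          // illustrative
    }

    class Cache {
        private final WeakReference<CacheClass> cached =
            new WeakReference<>(new CacheClass());

        boolean hasImage() {
            // get() hands back a strong reference, so from here on
            // the object cannot be garbage collected...
            CacheClass c = cached.get();
            if (c == null) {
                return false;   // too late: the GC already took it
            }
            return c.isImage(); // virtual call on the strong reference
            // ...and on return the strong reference disappears,
            // leaving only the weak one.
        }
    }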
00:12:36.204 --> 00:12:41.139
Now it becomes a strong reference, so it's not going to be garbage collected.

00:12:41.139 --> 00:12:45.202
When we get to the return path it becomes a weak reference again,

00:12:45.202 --> 00:12:48.929
because our strong reference has disappeared.

00:12:48.929 --> 00:12:50.856
The salient points in this example are:

00:12:50.856 --> 00:12:54.293
we're employing a method to get a reference,

00:12:54.293 --> 00:12:57.079
and we're checking an item to see if it's null.

00:12:57.079 --> 00:13:01.050
So let's say that the JVM decided to garbage collect this

00:13:01.050 --> 00:13:04.544
before we executed the method.

00:13:04.544 --> 00:13:08.885
The weak reference object is still valid, because we've got a strong reference to it,

00:13:08.885 --> 00:13:11.926
but the actual object behind it is gone.

00:13:11.926 --> 00:13:14.970
If we're too late and the garbage collector has killed it,

00:13:14.970 --> 00:13:17.919
it will be null, and we return.

00:13:17.919 --> 00:13:22.331
So it's a level of indirection, to see: does this still exist?

00:13:22.331 --> 00:13:27.960
If so, can I please have it? Then we operate on it as normal,

00:13:27.960 --> 00:13:30.794
and on return it becomes a weak reference again.

00:13:30.794 --> 00:13:37.888
This example program is quite useful when we look at how it's implemented in the JVM,

00:13:37.888 --> 00:13:40.384
and we'll go through a few things now.

00:13:40.384 --> 00:13:43.644
First off, we'll go through the bytecode.

00:13:43.644 --> 00:13:49.312
The only point of this slide is to show it's roughly

00:13:49.312 --> 00:13:54.152
the same as this.

00:13:54.152 --> 00:13:56.371
We get our variable.

00:13:56.371 --> 00:13:58.981
We use our getter.

00:13:58.981 --> 00:14:03.836
This bit is extra: this checkcast. The reason that bit is extra is

00:14:03.836 --> 00:14:15.110
because we're using the equivalent of a template in Java.

00:14:15.110 --> 00:14:19.173
And the way that's implemented in Java is that it basically casts everything to an

00:14:19.173 --> 00:14:23.167
object, so that requires extra compiler information.

00:14:23.167 --> 00:14:26.035
And this is the extra check.

00:14:26.035 --> 00:14:30.795
For the rest of this: we load the reference, we check to see if it is null,

00:14:30.795 --> 00:14:35.149
and if it's not null we invoke a virtual function - is it the image? -

00:14:35.149 --> 00:14:37.981
and we return as normal.

00:14:37.981 --> 00:14:43.357
Essentially the point I'm trying to make is: when we compile this to bytecode,

00:14:43.357 --> 00:14:45.365
this execution happens.

00:14:45.365 --> 00:14:46.758
This null check happens.

00:14:46.758 --> 00:14:48.604
This execution happens.

00:14:48.604 --> 00:14:50.450
And we return.

00:14:50.450 --> 00:14:55.118
In the actual Java class files we've not lost anything.

00:14:55.118 --> 00:14:58.356
This is what it looks like when it's been JIT'd.

00:14:58.356 --> 00:15:01.108
Now we've lost lots of things.

00:15:01.108 --> 00:15:05.578
The JIT has done quite a few clever things, which I'll talk about.

00:15:05.578 --> 00:15:10.905
First off, if we look down here, there's a single branch here.

00:15:10.905 --> 00:15:16.167
And that's only taken if our checkcast failed.

00:15:16.167 --> 00:15:20.381
We've got comments on the right hand side.

00:15:20.381 --> 00:15:27.126
Our get method has been inlined, so we're no longer calling it.

00:15:27.126 --> 00:15:32.699
We seem to have lost our null check; that's just gone.
00:15:32.699 --> 00:15:35.938
And again, we've got a getfield as well.

00:15:35.938 --> 00:15:40.153
That's no longer a method call; that's been inlined as well.

00:15:40.153 --> 00:15:42.579
We've also got some other cute things.

00:15:42.579 --> 00:15:46.062
Those more familiar with AArch64 will understand

00:15:46.062 --> 00:15:49.870
that the pointers we're using are 32-bit, not 64-bit.

00:15:49.870 --> 00:15:54.015
What we're doing is getting a pointer, shifting it left 3,

00:15:54.015 --> 00:15:56.476
and widening it to a 64-bit pointer.

00:15:56.476 --> 00:16:02.245
So we've got 32-bit pointers on a 64-bit system.

00:16:02.245 --> 00:16:06.148
That's saving a reasonable amount of memory and cache.

00:16:06.148 --> 00:16:09.712
To summarise: we don't have any branches or function calls,

00:16:09.712 --> 00:16:12.742
and we've got a lot of inlining.

00:16:12.742 --> 00:16:16.061
We did have function calls in the class file, so it's the JVM,

00:16:16.061 --> 00:16:18.326
it's the JIT, that has done this.

00:16:18.326 --> 00:16:23.330
We've got no null checks either, and I'm going to talk through this now.

00:16:23.330 --> 00:16:29.996
Null check elimination is quite a clever feature in Java and other runtimes.

00:16:29.996 --> 00:16:33.605
The idea behind null check elimination is:

00:16:33.605 --> 00:16:37.494
most of the time this object is not going to be null.

00:16:37.494 --> 00:16:43.599
If this object is null, the operating system knows this quite quickly.

00:16:43.599 --> 00:16:47.943
So if you try to dereference a null pointer, you'll get either a SIGSEGV or

00:16:47.943 --> 00:16:51.275
a SIGBUS, depending on a few circumstances.

00:16:51.275 --> 00:16:53.655
That goes straight back to the JVM,

00:16:53.655 --> 00:16:58.810
and the JVM knows where the null exception took place.

00:16:58.810 --> 00:17:01.759
Because it knows where it took place, it can look this up

00:17:01.759 --> 00:17:05.774
and unwind it as part of an exception.

00:17:05.774 --> 00:17:09.909
Those null checks just go. Completely gone.

00:17:09.909 --> 00:17:15.447
Most of the time this works, and you are saving a reasonable amount of execution.

00:17:15.447 --> 00:17:19.836
I'll talk about when it doesn't work in a second.

00:17:19.836 --> 00:17:24.143
That's reasonably clever. We have similar programming techniques in other places;

00:17:24.143 --> 00:17:27.800
even the Linux kernel, for instance, when you copy data to and from user space,

00:17:27.800 --> 00:17:30.912
does pretty much exactly the same thing.

00:17:30.912 --> 00:17:36.636
It has an exception unwind table, and it knows that if it catches a page fault on

00:17:36.636 --> 00:17:40.130
this particular program counter it can deal with it, because it knows

00:17:40.130 --> 00:17:43.729
the program counter and it knows, conceptually, what it was doing.

00:17:43.729 --> 00:17:47.630
In a similar way, the JIT knows what it's doing to a reasonable degree,

00:17:47.630 --> 00:17:53.459
so it can handle the null check elimination.

00:17:53.459 --> 00:17:57.325
I mentioned the sneaky one: we've got essentially 32-bit pointers

00:17:57.325 --> 00:17:59.658
on a 64-bit system.

00:17:59.658 --> 00:18:04.534
Most of the time in Java, people typically specify a heap size smaller than 32GB.

00:18:04.534 --> 00:18:10.444
Which is perfect if you want to use 32-bit pointers and left shift 3,

00:18:10.444 --> 00:18:12.940
because that gives you 32GB of addressable memory.
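NOTE
The pointer arithmetic being described, as a sketch (this is the gist of HotSpot's compressed pointers; the method and parameter names are illustrative):

    class NarrowPointers {
        // Objects are 8-byte aligned, so the low 3 bits of an object
        // address are always zero. The heap stores 32-bit "narrow"
        // pointers, and the JIT widens each one by shifting it left 3
        // and (optionally) adding a heap base.
        static long decode(int narrowPointer, long heapBase) {
            return heapBase + ((narrowPointer & 0xFFFFFFFFL) << 3);
        }
        // 2^32 slots x 8 bytes of alignment = 32GB of addressable heap,
        // which is why heaps under 32GB can get away with 32-bit pointers.
    }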
00:18:12.940 --> 00:18:19.337
That's a significant memory saving, because otherwise a lot of things would double up.

00:18:19.337 --> 00:18:23.041
There are a significant number of pointers in Java.

00:18:23.041 --> 00:18:29.055
The one that should make people jump out of their seats is

00:18:29.055 --> 00:18:32.224
the fact that most methods in Java are actually virtual.

00:18:32.224 --> 00:18:37.147
So what the JVM has actually done is inline a virtual function.

00:18:37.147 --> 00:18:41.698
A virtual function is essentially a function where you don't know where

00:18:41.698 --> 00:18:43.439
you're going until run time.

00:18:43.439 --> 00:18:47.503
You can have several different classes that share the same virtual function

00:18:47.503 --> 00:18:51.439
in the base class, and depending upon which specific class you're running,

00:18:51.439 --> 00:18:54.364
different virtual functions will get executed.

00:18:54.364 --> 00:19:01.307
In C++ that would be a read from a vtable, and then you know where to go.

00:19:01.307 --> 00:19:03.606
The JVM has inlined it.

00:19:03.606 --> 00:19:05.255
We've saved a memory load.

00:19:05.255 --> 00:19:07.925
We've saved a branch as well.

00:19:07.925 --> 00:19:12.023
The reason the JVM can inline it is because the JVM knows

00:19:12.023 --> 00:19:14.485
every single class that has been loaded.

00:19:14.485 --> 00:19:19.767
So it knows that although this looks polymorphic to the casual programmer,

00:19:19.767 --> 00:19:26.420
it actually is monomorphic. The JVM knows this.

00:19:26.420 --> 00:19:30.982
Because it knows this, it can be clever. And this is really clever.

00:19:30.982 --> 00:19:35.660
That's a significant cost saving.

00:19:35.660 --> 00:19:41.664
This is all great. I've already mentioned the null check elimination:

00:19:41.664 --> 00:19:47.410
we're taking a signal, and as most of you know, if we do that a lot it's going to be slow.

00:19:47.410 --> 00:19:51.137
Jumping into kernel, into user, bouncing around.

00:19:51.137 --> 00:19:55.735
The JVM also has a notion of 'OK, I've been a bit too clever now;

00:19:55.735 --> 00:19:57.906
I need to back off a bit'.

00:19:57.906 --> 00:20:02.782
Also, there's nothing stopping the user loading more classes

00:20:02.782 --> 00:20:06.799
and rendering the monomorphic assumption invalid.

00:20:06.799 --> 00:20:10.166
So the JVM needs to have a notion of backpedalling, of going

00:20:10.166 --> 00:20:14.264
'OK, I've gone too far and need to deoptimise'.

00:20:14.264 --> 00:20:17.051
The JVM has the ability to deoptimise.

00:20:17.051 --> 00:20:22.865
In other words, it essentially knows that for certain code paths everything's OK,

00:20:22.865 --> 00:20:27.140
but for certain new objects it can't get away with these tricks.

00:20:27.140 --> 00:20:32.678
By the time the new objects are executed, they are going to be safe.

00:20:32.678 --> 00:20:36.823
There are ramifications to this. This is the important thing to consider

00:20:36.823 --> 00:20:39.818
with something like Java, and other languages and other virtual machines.

00:20:39.818 --> 00:20:46.351
If you're trying to profile this, there is a very significant ramification:

00:20:46.351 --> 00:20:52.252
you can have the same class and method JIT'd multiple ways

00:20:52.252 --> 00:20:54.969
and executed at the same time.

00:20:54.969 --> 00:21:00.832
So if you're trying to find a hot spot, the program counter on its own isn't enough,

00:21:00.832 --> 00:21:03.885
because you can refer to the same thing in several different ways.
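NOTE
A sketch of the monomorphic trick being described (illustrative classes, not the talk's slide code):

    abstract class Shape {
        abstract int sides();               // a virtual call in bytecode
    }

    class Square extends Shape {
        @Override int sides() { return 4; }
    }

    class Demo {
        static int count(Shape s) {
            // Looks polymorphic, but while Square is the only subclass
            // the JVM has loaded, the JIT knows this call site is
            // monomorphic and can inline it: no vtable load, no branch.
            // Loading another subclass later invalidates the assumption,
            // and the JVM deoptimises back to a real virtual dispatch.
            return s.sides();
        }
    }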
00:21:03.885 --> 00:21:07.914
This is quite common, as deoptimisation does take place.

00:21:07.914 --> 00:21:16.215
That's something to bear in mind with the JVM and similar runtime environments.

00:21:16.215 --> 00:21:19.814
You can get a notion of what the JVM's trying to do.

00:21:19.814 --> 00:21:22.612
You can ask it nicely and add a print-compilation option,

00:21:22.612 --> 00:21:24.667
and it will tell you what it's doing.

00:21:24.667 --> 00:21:26.885
This is reasonably verbose.

00:21:26.885 --> 00:21:30.008
Typically what happens is the JVM gets excited, JIT'ing everything

00:21:30.008 --> 00:21:32.643
and optimising everything, then it settles down.

00:21:32.643 --> 00:21:35.430
Until you load something new and it gets excited again.

00:21:35.430 --> 00:21:38.588
There are a lot of logs. This is mainly useful for debugging, but

00:21:38.588 --> 00:21:42.361
it gives you an appreciation that it's doing a lot of work.

00:21:42.361 --> 00:21:45.635
You can go even further with a log-compilation option.

00:21:45.635 --> 00:21:50.732
That produces a lot of XML, and that is useful for people debugging the JVM as well.

00:21:50.732 --> 00:21:56.270
It's quite handy for getting an idea of what's going on.

00:21:56.270 --> 00:22:04.025
If that is not enough information, you have the ability to go even further.

00:22:04.025 --> 00:22:07.334
This is beyond the limit of my understanding.

00:22:07.334 --> 00:22:10.945
I've gone into this a little bit just to show you what can be done.

00:22:10.945 --> 00:22:17.342
You have release builds of OpenJDK, and you have debug builds of OpenJDK.

00:22:17.342 --> 00:22:24.877
The release builds will, by default, turn off a lot of the diagnostic options.

00:22:24.877 --> 00:22:27.950
You can switch them back on again.

00:22:27.950 --> 00:22:33.619
When you do, you can also gain insight into the compiler itself, which is

00:22:33.619 --> 00:22:37.624
colloquially referred to as the C2 JIT.

00:22:37.624 --> 00:22:41.966
You can see, for instance, objects in timelines, and visualise them

00:22:41.966 --> 00:22:45.693
as they're being optimised at various stages.

00:22:45.693 --> 00:22:52.938
This is based on a master's thesis by Thomas Würthinger.

00:22:52.938 --> 00:23:00.566
This is something you can play with as well, to see how far the optimiser goes.

00:23:00.566 --> 00:23:05.163
And it's also good for people hacking on the JVM.

00:23:05.163 --> 00:23:08.275
I'll move on to some stuff we did.

00:23:08.275 --> 00:23:15.740
Last year we were working on big data, on a relatively new architecture:

00:23:15.740 --> 00:23:23.937
ARM64. It's called AArch64 in OpenJDK land, but ARM64 in Debian land.

00:23:23.937 --> 00:23:26.769
We were a bit concerned, because everything was all shiny and new.

00:23:26.769 --> 00:23:28.926
Has it been optimised correctly?

00:23:28.926 --> 00:23:31.680
Are there any obvious things we need to optimise?

00:23:31.680 --> 00:23:34.060
And we were also interested, because everything was so shiny and new,

00:23:34.060 --> 00:23:35.291
in the whole system.

00:23:35.291 --> 00:23:39.389
Not just the JVM, but the glibc and the kernel as well.

00:23:39.389 --> 00:23:42.060
So how do we get a view of all of this?

00:23:42.060 --> 00:23:49.827
I gave a quick talk at the Debian mini-conf before last [2014] about perf,

00:23:49.827 --> 00:23:53.020
so we decided we could try and do some clever things with Linux perf

00:23:53.020 --> 00:23:57.965
and see if we could get some actual useful debugging information out.
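NOTE
The print-compilation and log-compilation options mentioned above are standard HotSpot flags; typical invocations look roughly like this (MyProgram is a placeholder):

    # one line per method as it gets JIT'd (or deoptimised)
    java -XX:+PrintCompilation MyProgram

    # much more detail, as XML, for people debugging the JVM itself
    java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation MyProgram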
00:23:57.965 --> 00:24:02.598
We have the flame graphs that are quite well known.

00:24:02.598 --> 00:24:08.612
We also have some previous work: Johannes had a special perf-map-agent that

00:24:08.612 --> 00:24:13.500
could basically hook into perf, and it would give you a nice way of running

00:24:13.500 --> 00:24:20.431
perf top, for want of a better expression, and viewing the top Java function names.

00:24:20.431 --> 00:24:25.130
This is really good work, and it's really good for a particular use case:

00:24:25.130 --> 00:24:29.405
if you just want to take a quick snapshot once and see in that snapshot

00:24:29.405 --> 00:24:32.284
where the hotspots were.

00:24:32.284 --> 00:24:38.670
For a prolonged workload, with all the functions being JIT'd multiple ways,

00:24:38.670 --> 00:24:42.165
with the optimisation going on and everything moving around,

00:24:42.165 --> 00:24:46.762
it requires a little bit more information to be captured.

00:24:46.762 --> 00:24:50.872
I decided to do a little bit of work on a very similar thing to perf-map-agent,

00:24:50.872 --> 00:24:56.155
but an agent that would capture this over a prolonged period of time.

00:24:56.155 --> 00:24:59.394
Here's an example flame graph; these are all over the internet.

00:24:59.394 --> 00:25:05.257
This is the SHA1 computation example that I gave at the beginning.

00:25:05.257 --> 00:25:10.654
As expected, the VM intrinsic SHA1 is the top one.

00:25:10.654 --> 00:25:17.494
Not expected by me was this quite significant chunk of CPU execution time:

00:25:17.494 --> 00:25:21.498
there was a significant amount of time being spent copying memory

00:25:21.498 --> 00:25:28.430
from the mmapped memory region into the heap,

00:25:28.430 --> 00:25:31.575
and then that was passed to the crypto engine.

00:25:31.575 --> 00:25:35.687
So we were doing a ton of memory copies for no good reason.

00:25:35.687 --> 00:25:38.972
That essentially highlighted an example.

00:25:38.972 --> 00:25:42.664
It was an assumption I'd made about Java to begin with: if you do

00:25:42.664 --> 00:25:45.334
the equivalent of mmap, it should just work like mmap, right?

00:25:45.334 --> 00:25:48.398
You should just be able to address the memory. That is not the case.

00:25:48.398 --> 00:25:54.146
If you've got a file mapping object and you try to address it, it has to be copied

00:25:54.146 --> 00:25:59.510
into safe heap memory first. And that is what was slowing down the program.

00:25:59.510 --> 00:26:04.886
If that were omitted, you could make the SHA1 computation even quicker.

00:26:04.886 --> 00:26:08.982
So that would be the logical target you would want to optimise.

00:26:08.982 --> 00:26:12.014
I wanted to extend Johannes' work with something called a

00:26:12.014 --> 00:26:15.741
Java Virtual Machine Tools Interface profiling agent.

00:26:15.741 --> 00:26:22.649
This is part of the Java Virtual Machine standard: you can make a special library

00:26:22.649 --> 00:26:24.597
and then hook this into the JVM,

00:26:24.597 --> 00:26:28.349
and the JVM can expose quite a few things to the library.

00:26:28.349 --> 00:26:32.413
It exposes a reasonable amount of information as well.

00:26:32.413 --> 00:26:40.237
Perf also has the ability to look at map files natively.

00:26:40.237 --> 00:26:44.557
If you are profiling JavaScript, or something similar, I think the

00:26:44.557 --> 00:26:48.493
Google V8 JavaScript engine will write out a special map file that says:

00:26:48.493 --> 00:26:52.846
these program counter addresses correspond to these function names.
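NOTE
The map files perf understands have a very simple format, conventionally written to /tmp/perf-<pid>.map: one JIT'd symbol per line, as a start address and a size in hex followed by a free-form name. An invented illustration (these addresses and names are made up):

    7f8a1c000000 400 Ljava/lang/String;::hashCode
    7f8a1c000400 1c0 LCache;::hasImage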
00:26:52.846 --> 00:26:57.026
I decided to use that in a similar way to what Johannes did, for the extended

00:26:57.026 --> 00:27:03.621
profiling agent, but I also decided to capture some more information.

00:27:03.621 --> 00:27:10.110
I decided to capture the disassembly, so when we run perf annotate

00:27:10.110 --> 00:27:13.675
we can see the actual JVM bytecode in our annotation.

00:27:13.675 --> 00:27:17.437
We can see how it was JIT'd, at the time it was JIT'd.

00:27:17.437 --> 00:27:19.909
We can see where the hotspots were.

00:27:19.909 --> 00:27:23.067
And that's good. But we can go even better.

00:27:23.067 --> 00:27:29.174
We can run an annotated trace that contains the Java class,

00:27:29.174 --> 00:27:34.190
the Java method and the bytecode, all in one place at the same time.

00:27:34.190 --> 00:27:38.950
You can see everything from the JVM in the same place.

00:27:38.950 --> 00:27:43.940
This works reasonably well, because the perf interface is extremely extensible.

00:27:43.940 --> 00:27:47.889
And again, we can do entire-system optimisation.

00:27:47.889 --> 00:27:52.046
The bits in red here are the Linux kernel.

00:27:52.046 --> 00:27:55.111
Then we get into libraries.

00:27:55.111 --> 00:27:58.350
And then we get into Java, and more libraries as well.

00:27:58.350 --> 00:28:03.110
So we can see everything from top to bottom in one fell swoop.

00:28:03.110 --> 00:28:07.789
This is just a quick slide showing the mechanisms employed.

00:28:07.789 --> 00:28:12.340
Essentially we have this agent, which is a shared object file,

00:28:12.340 --> 00:28:16.740
and this will spit out useful files here in a standard way.

00:28:16.740 --> 00:28:27.445
And Linux perf basically just records the perf data dump file as normal.

00:28:27.445 --> 00:28:29.755
We have two sets of recording going on.

00:28:29.755 --> 00:28:35.559
To report it, it's very easy to do normal reporting with the PID map.

00:28:35.559 --> 00:28:40.819
This just works out of the box; it works with the Google V8 engine as well.

00:28:40.819 --> 00:28:45.127
If you want to do very clever annotations, perf has the ability to have

00:28:45.127 --> 00:28:47.913
Python scripts passed to it.

00:28:47.913 --> 00:28:53.521
So you can craft quite a dodgy Python script, and that can interface

00:28:53.521 --> 00:28:55.529
with the perf annotation output.

00:28:55.529 --> 00:29:00.614
That's how I was able to get the extra Java information into the same annotation.

00:29:00.614 --> 00:29:05.247
And this is really easy to do; it's quite easy to knock the script up.

00:29:05.247 --> 00:29:10.216
And again, the only thing we do for this profiling is hook in the profiling

00:29:10.216 --> 00:29:13.440
agent, which dumps out various things.

00:29:13.440 --> 00:29:18.470
We preserve the frame pointer, because that makes things considerably easier

00:29:18.470 --> 00:29:21.547
when unwinding. This will affect performance a little bit.

00:29:21.547 --> 00:29:26.133
And again, when we're reporting, we just hook in a Python script.

00:29:26.133 --> 00:29:29.697
It's really easy to hook everything in and get it working.

00:29:31.462 --> 00:29:38.602
At the moment we have a JVMTI agent. It's actually on http://git.linaro.org now.

00:29:38.602 --> 00:29:42.364
Since I gave this talk, Google have extended perf, so it will do

00:29:42.364 --> 00:29:45.765
quite a lot of similar things out of the box.

00:29:45.765 --> 00:29:49.748
It's worth having a look at the latest perf.
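NOTE
A rough sketch of how the pieces hook together (the agent path and script name are placeholders, not the actual Linaro agent; the perf sub-commands are standard, and in current OpenJDK the frame pointer rune is -XX:+PreserveFramePointer):

    # run the JVM with frame pointers kept and the JVMTI agent loaded
    java -XX:+PreserveFramePointer -agentpath:/path/to/libprofagent.so MyProgram

    # record with call graphs as normal, in a second terminal
    perf record -g -p $(pgrep -f MyProgram)

    # a plain report works out of the box via the PID map file;
    # a Python script can fold the extra Java info into the annotation
    perf report
    perf script -s java-annotate.py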
00:29:49.748 --> 00:29:54.554
The techniques in this slide deck can obviously be used in other JITs quite easily.

00:29:54.554 --> 00:29:59.628
The fact that perf is so easy to extend with scripts can be useful

00:29:59.628 --> 00:30:01.345
for other things too.

00:30:01.345 --> 00:30:05.769
And OpenJDK has a significant amount of cleverness associated with it that

00:30:05.769 --> 00:30:11.180
I thought was very surprising and good. So that's what I covered in the talk.

00:30:11.180 --> 00:30:17.832
These are basically references to things like the command line arguments,

00:30:17.832 --> 00:30:20.851
the flame graphs, and stuff like that.

00:30:20.851 --> 00:30:26.308
If anyone is interested in playing with OpenJDK on ARM64, I'd suggest going here:

00:30:26.308 --> 00:30:31.332
http://openjdk.linaro.org - that's where the most recent builds are.

00:30:31.332 --> 00:30:36.002
Obviously fixes are going in upstream, and they're going into distributions as well.

00:30:36.002 --> 00:30:41.586
They're included in OpenJDK, so it should be good there as well.

00:30:41.586 --> 00:30:45.034
I've run through quite a few fundamental things reasonably quickly.

00:30:45.034 --> 00:30:48.274
I'd be happy to accept any questions or comments.

00:30:54.671 --> 00:30:57.817
And if you want to talk to me privately about Java afterwards, feel free to,

00:30:57.817 --> 00:30:59.129
when no-one's looking.

00:31:06.896 --> 00:31:12.698
[Applause]

00:31:13.315 --> 00:31:19.028
[Audience] It's not really a question so much as a comment.

00:31:19.028 --> 00:31:31.881
At the last mini-DebConf we had a talk about using the JVM with other languages.

00:31:31.881 --> 00:31:36.025
And it seems to me that all this would apply even if you hate the Java programming

00:31:36.025 --> 00:31:39.671
language and want to write in, I don't know, Lisp or something instead,

00:31:39.671 --> 00:31:42.794
if you've got a Lisp system that can generate JVM bytecode.

00:31:42.794 --> 00:31:48.982
[Presenter] Yeah, totally. And the other big data language we looked at was Scala.

00:31:48.982 --> 00:31:53.870
It uses the JVM back end but a completely different language on the front.

00:32:03.875 --> 00:32:08.104
Cheers, guys.