Onto the second talk

Steve Capper is going to tell us
about the good bits of Java

They do exist

[Audience] Could this have been a 
lightning talk? [Audience laughter]

Believe it or not we've got some 
good stuff here.

I was as skeptical as you guys 
when I first looked.

First apologies for not attending this
mini-conf last year

I was unfortunately ill on the day 
I was due to give this talk.

Let me figure out how to use a computer.

Sorry about this.

There we go; it's because 
I've not woken up.

Last year I worked at Linaro in the 
Enterprise group and we performed analysis

on 'Big Data' applications sets.

As many of you know quite a lot of these 
big data applications are written in Java.

I'm from ARM and we were very interested
in 64bit ARM support.

So this is mainly AArch64 examples 
for things like assembler

but most of the messages are 
pertinent for any architecture.

These good bits are shared between 
most if not all the architectures.

Whilst trying to optimise a lot of 
these big data applications

I stumbled across quite a few things in 
the JVM and I thought

'actually that's really clever; 
that's really cool'

So I thought that would make a good 
basis for a talk.

This talk is essentially some of the 
clever things I found in the

Java Virtual Machine; these 
optimisations are in OpenJDK.

The source is all there, 
readily available and in play now.

I'm going to finish with some of the 
optimisation work we did with Java.

People who know me will know 
I'm not a Java zealot.

I don't particularly believe in 
programming in a language over another one

So to make it clear from the outset 
I'm not attempting to convert

anyone into a Java programmer.

I'm just going to highlight a few salient 
things in the Java Virtual Machine

which I found to be quite clever and 
interesting

and I'll try and talk through them 
with my understanding of them.

Let's jump straight in and let's 
start with an example.

This is a minimal example for 
computing a SHA1 sum of a file.

I've elided some of the checking at the 
beginning of the function, such as

command line parsing and that sort of 
thing.

I've highlighted the salient points in red.

Essentially we instantiate a SHA1 
crypto message service digest.

And we do the equivalent in 
Java of an mmap.

Get it all in memory.

And then we just put this data straight 
into the crypto engine.

And eventually at the end of the 
program we'll spit out the SHA1 hash.

It's a very simple program.

It's basically mmap, SHA1 output 
the hash afterwards.
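
The slide with the program is not part of this transcript. As a rough sketch of the kind of program described (map the file, feed it to a SHA-1 digest, print the hash), something like the following would do; the class name `Sha1Sum` and the helper `sha1Hex` are assumptions for illustration, not the speaker's code.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Sum {
    // Hypothetical helper: maps the file read-only and feeds it to a SHA-1 digest.
    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            // A single mapping is limited to Integer.MAX_VALUE bytes, so map in chunks.
            while (pos < size) {
                long len = Math.min(Integer.MAX_VALUE, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                md.update(buf);
                pos += len;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha1Hex(Paths.get(args[0])));
    }
}
```

Nothing in this code asks for a hardware-accelerated SHA-1; whether one is used is up to the JVM.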

In order to concentrate on the CPU 
aspect rather than worry about IO

I decided to cheat a little by 
setting this up.

I decided to use a sparse file. As many of
you know a sparse file is a file where not

all the contents are necessarily stored 
on disc. The assumption is that the bits

that aren't stored are zero. For instance
on Linux you can create a 20TB sparse file

on a 10MB file system and use it as 
normal.

Just don't write too much to it otherwise 
you're going to run out of space.

The idea behind using a sparse file is I'm
just focusing on the computational aspects

of the SHA1 sum. I'm not worried about 
the file system or anything like that.

I don't want to worry about the IO. I 
just want to focus on the actual compute.

In order to set up a sparse file I used 
the following runes.

The important point is that you seek
and the other important point

is you set a count otherwise you'll fill your disc up.
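
The runes themselves are on the slide and not in this transcript. As a loose Java analogue of the same idea (extend the file to a given length without writing any data), one could sketch the following; `SparseFile` and `createSparse` are hypothetical names, and whether the result is actually stored sparsely is an assumption that depends on the operating system and file system.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class SparseFile {
    // Hypothetical helper: sets the file length without writing any data blocks.
    // On typical Linux file systems the unwritten range is a hole that reads back as zeros.
    static void createSparse(Path file, long size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.setLength(size);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sparse", ".bin");
        createSparse(tmp, 16L * 1024 * 1024); // 16 MiB apparent size
        System.out.println(tmp + " has apparent size " + Files.size(tmp));
        Files.delete(tmp);
    }
}
```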

I decided to run this first against 
the native sha1sum command

that's built into Linux, and let's normalise the results and say that's 1.0.

I used an older version of OpenJDK 
and ran the Java program,

and that took 1.09 times as long as the 
reference command. That's quite good.

Then I used the new OpenJDK; this is now
the current JDK as this is a year on.

And it took 0.21 of the time. It's significantly faster.

I should stress that I've done nothing 
surreptitious in the Java program.

It is mmap, compute, spit result out.

But OpenJDK has essentially got 
some more context information.

I'll talk about that as we go through.

When I started with Java I had a very 
simplistic view of Java.

Traditionally Java is taught as a virtual 
machine that runs byte code.

Now when you compile a Java program it 
compiles into byte code.

The older versions of the Java Virtual 
Machine would interpret this byte code

and then run through. Newer versions would
employ a just-in-time engine and try and

compile this byte code into native machine code.

That is not the only thing that goes on
when you run a Java program.

There are some extra optimisations as well.
So this alone would not account for

the newer version of the SHA1 
sum being significantly faster

than the distro-supplied one.

Java knows about context. It has a class 
library and these class libraries

have reasonably well defined purposes.

We have classes that provide 
crypto services.

We have sun.misc.Unsafe, which every 
single project seems to pull into their

project when they're not supposed to.

These have well defined meanings.

These do not necessarily have to be 
written in Java.

They come as Java classes, 
they come supplied.

But most JVMs now have a notion 
of a virtual machine intrinsic

And the virtual machine intrinsic says ok 
please do a SHA1 in the best possible way

that your implementation allows. This is 
something done automatically by the JVM.

You don't ask for it. If the JVM knows
what it's running on and it's reasonably

recent this will just happen 
for you for free.

And there's quite a few classes 
that do this.

There's quite a few clever things with 
atomics, there's crypto,

there's mathematical routines as well. 
Most of these routines in the

class library have a well defined notion 
of a virtual machine intrinsic

and they do run reasonably optimally.

They are a subject of continuous 
optimisation as well.

We've got some runes that are 
presented on the slides here.

These are quite useful if you 
are interested in

how these intrinsics are made.

You can ask the JVM to print out a lot of
the just-in-time compiled code.

You can ask the JVM to print out the 
native methods as well as these intrinsics

and in this particular case after sifting 
through about 5MB of text

I've come across this particular SHA1 sum
implementation.

This is AArch64. This is employing the 
cryptographic extensions

in the architecture. So it's essentially 
using the CPU instructions which

would explain why it's faster. But again 
it's done all this automatically.

This did not require any specific runes 
or anything to activate.

We'll see a bit later on how you can 
more easily find the hot spots

rather than sifting through a lot 
of assembler.

I've mentioned that the cryptographic 
engine is employed and again

this routine was generated at run 
time as well.

This is one of the important things about 
certain execution environments like Java.

You don't have to know everything at 
compile time.

You know a lot more information at 
run time and you can use that

in theory to optimise.

You can switch off these clever routines.

For instance I've got a deactivate 
here and we get back to the

slower performance we expected.
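
The exact runes on the slides are not in this transcript. In HotSpot-based OpenJDK builds the SHA-1 intrinsic is governed by the `UseSHA1Intrinsics` flag, so `-XX:-UseSHA1Intrinsics` is one plausible form of the deactivation mentioned here. As a hedged sketch, the current setting can be read from inside a running program through the HotSpot diagnostic bean; `IntrinsicFlags` and `flagValue` are hypothetical names, and the code assumes a HotSpot JVM (it reports "unknown" otherwise).

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class IntrinsicFlags {
    // Hypothetical helper: returns the value of a HotSpot flag as a string,
    // or "unknown" if this JVM does not expose the flag or the bean.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"UseSHA", "UseSHA1Intrinsics"}) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

Running this once normally and once with `-XX:-UseSHA1Intrinsics` should show the flag flipping on JVMs that support it.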

Again, this particular set of routines is 
present in OpenJDK,

I think for all the architectures that support it.

We get this optimisation for free on X86 
and others as well.

It works quite well.

That was one surprise I came across: 
the intrinsics.

One thing I thought it would be quite 
good to do would be to go through

a slightly more complicated example. 
And use this example to explain

a lot of other things that happen 
in the JVM as well.

I will spend a bit of time going through 
this example

and explain roughly the notion of what 
it's supposed to be doing.

This is an imaginary method that I've 
contrived to demonstrate a lot of points

in the fewest possible lines of code.

I'll start with what it's meant to do.

This is meant to be a routine that gets a
reference to something and lets you know

whether or not it's an image and in a 
hypothetical cache.

I'll start with the important thing 
here the weak reference.

In Java and other garbage collected 
languages we have the notion of references.

Most of the time when you are running a 
Java program you have something like a

variable name and that is in the current 
execution context that is referred to as a

strong reference to the object. In other 
words I can see it. I am using it.

Please don't get rid of it. 
Bad things will happen if you do.

So the garbage collector knows 
not to get rid of it.

In Java and other languages you also 
have the notion of a weak reference.

This is essentially the programmer saying
to the virtual machine

"Look I kinda care about this but 
just a little bit."

"If you want to get rid of it feel free 
to but please let me know."

This is why this is for a cache class. 
For instance the JVM in this particular

case could decide that it's running quite 
low on memory, and since this particular xMB image

has not been used for a while, it can 
garbage collect it.

The important thing is how we go about 
expressing this in the language.

We can't just have a reference to the 
object because that's a strong reference

and the JVM will know it can't get 
rid of this because the program

can see it actively.

So we have a level of indirection which is 
known as a weak reference.

We have this hypothetical CacheClass 
that I've devised.

At this point it is a weak reference.

Then we get it. This is calling the weak 
reference routine.

Now it becomes a strong reference so 
it's not going to be garbage collected.

When we get to the return path it becomes 
a weak reference again

because our strong reference 
has disappeared.

The salient points in this example are:

We're employing a method to get 
a reference.

We're checking an item to see if 
it's null.

So let's say that the JVM decided to 
garbage collect this

before we executed the method.

The weak reference class is still valid 
because we've got a strong reference to it

but the actual object behind this is gone.

If we're too late and the garbage 
collector has killed it

it will be null and we return.

So it's a level of indirection to ask: 
does this still exist?

If so, can I please have it, and then 
operate on it as normal,

and on return it becomes a weak 
reference again.
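
The method on the slide is not reproduced in this transcript. A minimal sketch of the pattern being described, assuming hypothetical names throughout (`CacheSketch`, `Item`, `ImageItem`, `CacheClass`, `isCachedImage`), might look like this:

```java
import java.lang.ref.WeakReference;

public class CacheSketch {
    // Hypothetical cached item type with a virtual "is it an image?" method.
    interface Item {
        boolean isImage();
    }

    static class ImageItem implements Item {
        public boolean isImage() {
            return true;
        }
    }

    // Hypothetical cache entry: it only holds the item weakly.
    static class CacheClass {
        private final WeakReference<Item> ref;

        CacheClass(Item item) {
            this.ref = new WeakReference<>(item);
        }

        boolean isCachedImage() {
            // get() hands back a strong reference, or null if the referent was collected.
            // Because WeakReference is generic, the compiler inserts a checkcast here.
            Item item = ref.get();
            if (item == null) {
                return false; // too late, the garbage collector got there first
            }
            return item.isImage(); // virtual call on the now strongly held item
        }
        // On return the local strong reference disappears and only the weak one remains.
    }

    public static void main(String[] args) {
        Item image = new ImageItem();              // strong reference keeps it alive
        CacheClass entry = new CacheClass(image);
        System.out.println(entry.isCachedImage());
    }
}
```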

This example program is quite useful when
we look at how it's implemented in the JVM

and we'll go through a few things now.

First off we'll go through the byte code.

The only point of this slide is to 
show it's roughly

the same as this.

We get our variable.

We use our getter.

This bit is extra this checkcast. 
The reason that bit is extra is

because we're using the equivalent of 
a template in Java.

And the way that's implemented in Java is 
it just basically casts everything to an

object so that requires extra 
compiler information.

And this is the extra check.

The rest of this we load the reference, 
we check to see if it is null,

If it's not null we invoke a virtual 
function - is it the image?

and we return as normal.

Essentially the point I'm trying to make 
is when we compile this to byte code

this execution happens.

This null check happens.

This execution happens.

And we return.

In the actual Java class files we've not 
lost anything.

This is what it looks like when it's 
been JIT'd.

Now we've lost lots of things.

The JIT has done quite a few clever things
which I'll talk about.

First off if we look down here there's 
a single branch here.

And this is only if our check cast failed

We've got comments on the 
right hand side.

Our get method has been in-lined so 
we're no longer calling.

We seem to have lost our null check,
that's just gone.

And again we've got a get field as well.

That's no longer a method, 
that's been in-lined as well

We've also got some other cute things.

Those more familiar with AArch64 will 
understand that the pointers we're using

are 32bit not 64bit.

What we're doing is getting a pointer 
and shifting it left 3

and widening it to a 64bit pointer.

We've also got 32bit pointers on a 
64bit system as well.

So that's saving a reasonable amount 
of memory and cache.

To summarise. We don't have any 
branches or function calls

and we've got a lot of in-lining.

We did have function calls in the 
class file so it's the JVM

it's the JIT that has done this.

We've got no null checks either and I'm 
going to talk through this now.

The null check elimination is quite a 
clever feature in Java and other programs.

The idea behind null check elimination is

most of the time this object is not 
going to be null.

If this object is null the operating 
system knows this quite quickly.

So if you try to de-reference a null 
pointer you'll get either a SIGSEGV or

a SIGBUS depending on a 
few circumstances.

That goes straight back to the JVM

and the JVM knows where the null 
exception took place.

Because it knows where the exception took 
place it can look this up

and unwind it as part of an exception.

Those null checks just go.
Completely gone.

Most of the time this works and you are 
saving a reasonable amount of execution.

I'll talk about when it doesn't work 
in a second.

That's reasonably clever. We have similar 
programming techniques in other places

even the Linux kernel for instance when 
you copy data to and from user space

it does pretty much the identical 
thing. It has an exception unwind table

and it knows if it catches a page fault on
this particular program counter

it can deal with it because it knows 
the program counter and it knows

conceptually what it was doing.

In a similar way the JIT knows what it's 
doing to a reasonable degree.

It can handle the null check elimination.
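
From the Java side the contract looks the same whether or not the JIT has removed the explicit check. The sketch below, with hypothetical names `ImplicitNullCheck`, `lengthOf` and `safeLength`, has no null test in the source, yet a null argument still surfaces as an ordinary `NullPointerException`. Whether a given JVM implements this with a trapping load and a signal handler is an assumption that cannot be observed from this code.

```java
public class ImplicitNullCheck {
    // No explicit null test: dereferencing null must still raise NullPointerException.
    static int lengthOf(String s) {
        return s.length();
    }

    // The rare null case is handled through the normal exception path.
    static int safeLength(String s) {
        try {
            return lengthOf(s);
        } catch (NullPointerException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(safeLength("abc"));
        System.out.println(safeLength(null));
    }
}
```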

I mentioned the sneaky one. We've got
essentially 32bit pointers

on a 64bit system.

Most of the time in Java people typically 
specify heap size smaller than 32GB.

Which is perfect if you want to use 32bit 
pointers and left shift 3.

Because that gives you 32GB of 
addressable memory.

That's a significant memory saving because
otherwise a lot of things would double up.

There's a significant number of pointers 
in Java.
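
The arithmetic behind the 32GB figure can be sketched as follows. This assumes the simplest case of a zero heap base and 8-byte object alignment; a real JVM may also add a base address. The names `CompressedPointers`, `decode` and `maxAddressableBytes` are hypothetical.

```java
public class CompressedPointers {
    static final int SHIFT = 3; // objects assumed to be 8-byte aligned

    // Widen a 32bit compressed pointer to 64 bits and shift it left by 3.
    static long decode(int narrow) {
        return (narrow & 0xFFFFFFFFL) << SHIFT;
    }

    // 2^32 distinct compressed values, each addressing an 8-byte slot.
    static long maxAddressableBytes() {
        return (1L << 32) << SHIFT;
    }

    public static void main(String[] args) {
        System.out.println(maxAddressableBytes() / (1L << 30) + " GiB addressable");
        System.out.println(Long.toHexString(decode(0x1000)));
    }
}
```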

The one that should make people 
jump out of their seat is

the fact that most methods in Java are 
actually virtual.

So what the JVM has actually done is 
in-lined a virtual function.

A virtual function is essentially a 
function where you don't know where

you're going until run time.

You can have several different classes 
and they share the same virtual function

in the base class and dependent upon 
which specific class you're running

different virtual functions will 
get executed.

In C++ that will be a read from a vtable
and then you know where to go.

The JVM's in-lined it.

We've saved a memory load.

We've saved a branch as well

The reason the JVM can in-line it is 
because the JVM knows

every single class that has been loaded.

So it knows that although this looks 
polymorphic to the casual programmer

It is actually monomorphic.
The JVM knows this.
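
As an illustration of a call that looks polymorphic in the source but has only one possible target at run time, consider the sketch below. The names `Monomorphic`, `Shape`, `Triangle` and `totalSides` are hypothetical, and whether a particular JVM actually in-lines the call is an assumption that depends on the JIT and on which classes have been loaded.

```java
public class Monomorphic {
    static abstract class Shape {
        abstract int sides(); // virtual: the target is chosen by the run-time class
    }

    // The only Shape subclass ever loaded in this program.
    static class Triangle extends Shape {
        int sides() {
            return 3;
        }
    }

    // The call to sides() looks polymorphic here, but with a single loaded
    // implementation the JVM can treat it as monomorphic.
    static int totalSides(Shape[] shapes) {
        int total = 0;
        for (Shape s : shapes) {
            total += s.sides();
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Triangle(), new Triangle(), new Triangle(), new Triangle() };
        System.out.println(totalSides(shapes));
    }
}
```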