-
Onto the second talk of the day.
-
Steve Capper is going to tell us
about the good bits of Java
-
They do exist
-
[Audience] Could this have been a
lightning talk? [Audience laughter]
-
Believe it or not we've got some
good stuff here.
-
I was as skeptical as you guys
when I first looked.
-
First many apologies for not attending
the mini-conf last year
-
I was unfortunately ill on the day
I was due to give this talk.
-
Let me figure out how to use a computer.
-
Sorry about this.
-
There we go; it's because
I've not woken up.
-
Last year I worked at Linaro in the
Enterprise group and we performed analysis
-
on so-called 'Big Data' application sets.
-
As many of you know quite a lot of these
big data applications are written in Java.
-
I'm from ARM and we were very interested
in 64bit ARM support.
-
So this is mainly AArch64 examples
for things like assembler
-
but most of the messages are
pertinent for any architecture.
-
These good bits are shared between
most if not all the architectures.
-
Whilst trying to optimise a lot of
these big data applications
-
I stumbled across quite a few things in
the JVM and I thought
-
'actually that's really clever;
that's really cool'
-
So I thought that would make a good
basis for an interesting talk.
-
This talk is essentially some of the
clever things I found in the
-
Java Virtual Machine; these
optimisations are in OpenJDK.
-
The source is all there, readily
available and in play now.
-
I'm going to finish with some of the
optimisation work we did with Java.
-
People who know me will know
I'm not a Java zealot.
-
I don't particularly believe in
programming in one language over another.
-
So to make it clear from the outset,
I'm not attempting to convert
-
anyone into Java programmers.
-
I'm just going to highlight a few salient
things in the Java Virtual Machine
-
which I found to be quite clever and
interesting
-
and I'll try and talk through them
with my understanding of them.
-
Let's jump straight in and let's
start with an example.
-
This is a minimal example for
computing a SHA1 sum of a file.
-
I've omitted some of the checking at the
beginning of the function,
-
command-line parsing and that sort of
thing.
-
I've highlighted the salient
points in red.
-
Essentially we instantiate a SHA1
crypto message digest service.
-
And we do the equivalent in
Java of an mmap.
-
Get it all in memory.
-
And then we just put this data straight
into the crypto engine.
-
And eventually at the end of the
program we'll spit out the SHA1 hash.
-
It's a very simple program.
-
It's basically mmap, SHA1, output
the hash afterwards.
-
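A minimal sketch of such a program (class and variable names are mine, not from the slide; the APIs are the standard MessageDigest and FileChannel ones):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.security.MessageDigest;

    public class Sha1Sum {
        public static void main(String[] args) throws Exception {
            // Instantiate the SHA1 message digest service.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            try (RandomAccessFile f = new RandomAccessFile(args[0], "r");
                 FileChannel ch = f.getChannel()) {
                // The Java equivalent of mmap; a single mapping is
                // limited to 2GB, so map the file in chunks.
                long pos = 0, len = ch.size();
                while (pos < len) {
                    long n = Math.min(len - pos, Integer.MAX_VALUE);
                    MappedByteBuffer buf =
                        ch.map(FileChannel.MapMode.READ_ONLY, pos, n);
                    md.update(buf);  // straight into the crypto engine
                    pos += n;
                }
            }
            // Spit out the SHA1 hash.
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest())
                sb.append(String.format("%02x", b));
            System.out.println(sb);
        }
    }
-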
In order to concentrate on the CPU
aspect rather than worry about IO
-
I decided to cheat a little bit by
setting this up.
-
I decided to use a sparse file. As many of
you know, a sparse file is a file whose
-
contents are not necessarily all stored
on disc. The assumption is that the bits
-
that aren't stored are zero. For instance
on Linux you can create a 20TB sparse file
-
on a 10MB file system and use it as
normal.
-
Just don't write too much to it otherwise
you're going to run out of space.
-
The idea behind using a sparse file is I'm
just focusing on the computational aspects
-
of the SHA1 sum. I'm not worried about
the file system or anything like that.
-
I don't want to worry about the IO. I
just want to focus on the actual compute.
-
In order to set up a sparse file I used
the following runes.
-
The important point is that you seek,
and the other important point
-
is that you set a count, otherwise you'll
fill your disc up.
-
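The exact rune from the slide isn't in the transcript, but a typical invocation looks like this (file name and size are mine): seek past the region you want and write a single small block, so almost nothing actually lands on disc:

    dd if=/dev/zero of=test.dat bs=1M count=1 seek=20479
-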
I decided to run this as follows: firstly,
let's take the native sha1sum command
-
that's built into Linux and let's
normalise its result and say that's 1.0.
-
I used an older version of the OpenJDK
and ran the Java program
-
and that's 1.09 times slower than the
reference command. That's quite good.
-
Then I used the new OpenJDK, which is now
the current JDK, as this is a year on.
-
And that took 0.21. It's significantly faster.
-
I've stressed that I've done nothing
surreptitious in the Java program.
-
It is mmap, compute, spit result out.
-
But the OpenJDK has essentially got
some more context information.
-
I'll talk about that as we go through.
-
Before I started with Java I had a very
simplistic view of it.
-
Traditionally Java is taught as a virtual
machine that runs bytecode.
-
Now when you compile a Java program it
compiles into bytecode.
-
The older versions of the Java Virtual
Machine would interpret this bytecode
-
and then run through. Newer versions
would employ a just-in-time engine
-
and try and compile this bytecode
into native machine code.
-
That is not the only thing that goes on
when you run a Java program.
-
There are some extra optimisations as well.
So this alone would not account for
-
the newer version of the SHA1
sum being significantly faster
-
than the distro supplied one.
-
Java knows about context. It has a class
library and these class libraries
-
have reasonably well defined purposes.
-
We have classes that provide
crypto services.
-
We have sun.misc.Unsafe, which every
single project seems to pull into their
-
own code when they're not supposed to.
-
These have well defined meanings.
-
These do not necessarily have to be
written in Java.
-
They come as Java classes,
they come supplied.
-
But most JVMs now have a notion
of a virtual machine intrinsic.
-
And the virtual machine intrinsic says ok
please do a SHA1 in the best possible way
-
that your implementation allows. This is
something done automatically by the JVM.
-
You don't ask for it. If the JVM knows
what it's running on and it's reasonably
-
recent this will just happen
for you for free.
-
And there's quite a few classes
that do this.
-
There's quite a few clever things with
atomics, there's crypto,
-
there's mathematical routines as well.
Most of these routines in the
-
class library have a well defined notion
of a virtual machine intrinsic
-
and they do run reasonably optimally.
-
They are a subject of continuous
optimisation as well.
-
We've got some runes that are
presented on the slides here.
-
These are quite useful if you
are interested in
-
how these intrinsics are made.
-
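The exact runes from the slide aren't in the transcript; the standard OpenJDK options are along these lines (-XX:+PrintAssembly additionally needs the hsdis disassembler plugin installed):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics MyApp
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyApp
-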
You can ask the JVM to print out a lot of
the just-in-time compiled code.
-
You can ask the JVM to print out the
native methods as well as these intrinsics
-
and in this particular case after sifting
through about 5MB of text
-
I've come across this particular SHA1 sum
implementation.
-
This is AArch64. This is employing the
cryptographic extensions
-
in the architecture.
-
So it's essentially using the CPU
instructions which would explain why
-
it's faster. But again it's done
all this automatically.
-
This did not require any specific runes
or anything to activate.
-
We'll see a bit later on how you can
more easily find the hot spots
-
rather than sifting through a lot
of assembler.
-
I've mentioned that the cryptographic
engine is employed and again
-
this routine was generated at run
time as well.
-
This is one of the important things about
execution environments like Java's.
-
You don't have to know everything at
compile time.
-
You know a lot more information at
run time and you can use that
-
in theory to optimise.
-
You can switch off these clever routines.
-
For instance I've got a deactivate
here and we get back to the
-
slower performance we expected.
-
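The transcript doesn't show the exact switch, but in OpenJDK the SHA intrinsics are controlled by flags like these:

    java -XX:-UseSHA1Intrinsics MyApp    # disable just the SHA1 intrinsic
    java -XX:-UseSHA MyApp               # disable all the SHA intrinsics
-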
Again, this particular set of routines is
present in OpenJDK,
-
I think for all the architectures that
support it.
-
We get this optimisation for free on X86
and others as well.
-
It works quite well.
-
That was one surprise I came
across: the intrinsics.
-
One thing I thought it would be quite
good to do would be to go through
-
a slightly more complicated example.
And use this example to explain
-
a lot of other things that happen
in the JVM as well.
-
I will spend a bit of time going through
this example
-
and explain roughly the notion of what
it's supposed to be doing.
-
This is an imaginary method that I've
contrived to demonstrate a lot of points
-
in the fewest possible lines of code.
-
I'll start with what it's meant to do.
-
This is meant to be a routine that gets a
reference to something and lets you know
-
whether or not it's an image in a
hypothetical cache.
-
I'll start with the important thing
here: the weak reference.
-
In Java and other garbage collected
languages we have the notion of references.
-
Most of the time when you are running a
Java program you have something like a
-
variable name, and in the current
execution context that is referred to as a
-
strong reference to the object. In other
words: I can see it. I am using it.
-
Please don't get rid of it.
Bad things will happen if you do.
-
So the garbage collector knows
not to get rid of it.
-
In Java and other languages you also
have the notion of a weak reference.
-
This is essentially the programmer saying
to the virtual machine
-
"Look I kinda care about this but
just a little bit."
-
"If you want to get rid of it feel free
to but please let me know."
-
This is why this is for a CacheClass.
For instance the JVM in this particular
-
case could decide that it's running quite
low on memory, this particular xMB image
-
has not been used for a while, and it can
garbage collect it.
-
The important thing is how we go about
expressing this in the language.
-
We can't just have a reference to the
object because that's a strong reference
-
and the JVM will know it can't get
rid of this because the program
-
can see it actively.
-
So we have a level of indirection which
is known as a weak reference.
-
We have this hypothetical CacheClass
that I've devised.
-
At this point it is a weak reference.
-
Then we get it. This is calling the weak
reference's get() method.
-
Now it becomes a strong reference so
it's not going to be garbage collected.
-
When we get to the return path it becomes
a weak reference again
-
because our strong reference
has disappeared.
-
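A sketch of the contrived method being described (CacheClass is the hypothetical class from the slide; the other names are mine):

    import java.lang.ref.WeakReference;

    class CacheClass {
        boolean isImage() { return true; }   // stand-in implementation
    }

    class CacheCheck {
        static boolean isImageInCache(WeakReference<CacheClass> ref) {
            // get() hands us a strong reference, if the object survives.
            CacheClass cached = ref.get();
            if (cached == null)
                return false;              // too late: already collected
            return cached.isImage();       // operate on it as normal
            // on return the strong reference disappears again
        }
    }
-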
The salient points in this example are:
-
We're employing a method to get
a reference.
-
We're checking an item to see if
it's null.
-
So let's say that the JVM decided to
garbage collect this
-
before we executed the method.
-
The weak reference class is still valid
because we've got a strong reference to it
-
but the actual object behind this is gone.
-
If we're too late and the garbage
collector has killed it
-
it will be null and we return.
-
So it's a level of indirection to see:
does this still exist?
-
If so, can I please have it? Then we
operate on it as normal,
-
and on return it becomes a weak
reference again.
-
This example program is quite useful when
we look at how it's implemented in the JVM
-
and we'll go through a few things now.
-
First off we'll go through the bytecode.
-
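The slide's listing isn't in the transcript; javap on a method like the sketch above gives roughly this (comments mine):

    aload_0                                      // push the weak reference
    invokevirtual java/lang/ref/WeakReference.get  // our getter
    checkcast     CacheClass                     // the extra generics cast
    astore_1
    aload_1
    ifnull        L_return_false                 // the explicit null check
    aload_1
    invokevirtual CacheClass.isImage             // virtual call
    ireturn
-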
The only point of this slide is to
show it's roughly
-
the same as this.
-
We get our variable.
-
We use our getter.
-
This bit, the checkcast, is extra.
The reason that bit is extra is
-
because we're using the equivalent of
a template in Java (generics).
-
And the way that's implemented in Java is
it just basically casts everything to an
-
Object, so that requires extra
compiler information.
-
And this is the extra check.
-
The rest of this we load the reference,
we check to see if it is null,
-
If it's not null we invoke a virtual
function - is it the image?
-
and we return as normal.
-
Essentially the point I'm trying to make
is when we compile this to bytecode
-
this execution happens.
-
This null check happens.
-
This execution happens.
-
And we return.
-
In the actual Java class files we've not
lost anything.
-
This is what it looks like when it's
been JIT'd.
-
Now we've lost lots of things.
-
The JIT has done quite a few clever things
which I'll talk about.
-
First off if we look down here there's
a single branch here.
-
And this is only if our checkcast failed.
-
We've got comments on the
right hand side.
-
Our get method has been inlined so
we're no longer calling.
-
We seem to have lost our null check,
that's just gone.
-
And again we've got a get field as well.
-
That's no longer a method,
that's been inlined as well.
-
We've also got some other cute things.
-
Those more familiar with AArch64
will understand
-
that the pointers we're using
are 32bit not 64bit.
-
What we're doing is getting a pointer
and shifting it left 3
-
and widening it to a 64bit pointer.
-
We've also got 32bit pointers on a
64bit system as well.
-
So that's saving a reasonable amount
of memory and cache.
-
To summarise: we don't have any
branches or function calls
-
and we've got a lot of inlining.
-
We did have function calls in the
class file so it's the JVM;
-
it's the JIT that has done this.
-
We've got no null checks either and I'm
going to talk through this now.
-
The null check elimination is quite a
clever feature in Java and other runtimes.
-
The idea behind null check elimination is
-
most of the time this object is not
going to be null.
-
If this object is null the operating
system knows this quite quickly.
-
So if you try to dereference a null
pointer you'll get either a SIGSEGV or
-
a SIGBUS depending on a
few circumstances.
-
That goes straight back to the JVM
-
and the JVM knows where the null
exception took place.
-
Because it knows where it took
place it can look this up
-
and unwind it as part of an exception.
-
Those null checks just go.
Completely gone.
-
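A sketch of the idea (comments only, not the actual emitted code):

    // The bytecode implies an explicit test:
    //     if (cached == null) return false;
    //     cached.isImage();
    // The JIT instead emits just the memory access, with no test at
    // all. If 'cached' really is null the access faults, the JVM's
    // signal handler looks up the faulting program counter in its
    // tables, and the exception path is entered from there.
-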
Most of the time this works and you are
saving a reasonable amount of execution time.
-
I'll talk about when it doesn't work
in a second.
-
That's reasonably clever. We have similar
programming techniques in other places,
-
even in the Linux kernel: for instance when
you copy data to and from user space
-
it does pretty much exactly
the same thing.
-
It has an exception unwind table and it
knows if it catches a page fault on
-
this particular program counter
it can deal with it because it knows
-
the program counter and it knows
conceptually what it was doing.
-
In a similar way the JIT knows what it's
doing to a reasonable degree.
-
It can handle the null check elimination.
-
I mentioned the sneaky one. We've got
essentially 32bit pointers
-
on a 64bit system.
-
Most of the time in Java people typically
specify a heap size smaller than 32GB,
-
which is perfect if you want to use 32bit
pointers and a left shift of 3.
-
Because that gives you 32GB of
addressable memory.
-
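A sketch of that decode step (names are mine; this is the 'compressed oops' scheme with the default 8-byte object alignment):

    class CompressedOops {
        static final long HEAP_BASE = 0;   // assume a zero-based heap

        // Widen a 32-bit compressed reference to a 64-bit address.
        // 2^32 slots x 8-byte alignment = 32GB of addressable heap.
        static long decode(int compressedOop) {
            return HEAP_BASE + ((compressedOop & 0xFFFFFFFFL) << 3);
        }
    }
-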
That's a significant memory saving because
otherwise a lot of things would double up.
-
There's a significant number of pointers
in Java.
-
The one that should make people
jump out of their seat is
-
the fact that most methods in Java are
actually virtual.
-
So what the JVM has actually done is
inlined a virtual function.
-
A virtual function is essentially a
function where you don't know where
-
you're going until run time.
-
You can have several different classes
and they share the same virtual function
-
in the base class and dependent upon
which specific class you're running
-
different virtual functions will
get executed.
-
In C++ that will be a read from a vtable
and then you know where to go.
-
The JVM's inlined it.
-
We've saved a memory load.
-
We've saved a branch as well
-
The reason the JVM can inline it is
because the JVM knows
-
every single class that has been loaded.
-
So it knows that although this looks
polymorphic to the casual programmer
-
It actually is monomorphic.
The JVM knows this.
-
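A contrived sketch (all the names here are mine):

    abstract class Cache {
        abstract boolean isImage();      // a virtual method
    }

    class ImageCache extends Cache {
        @Override boolean isImage() { return true; }
    }

    class Caller {
        static boolean check(Cache c) {
            // This looks polymorphic, but if ImageCache is the only
            // subclass the JVM has ever loaded, the call has exactly
            // one possible target: the JVM can inline it and record a
            // dependency so it can deoptimise if that ever changes.
            return c.isImage();
        }
    }
-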
Because it knows this it can be clever.
And this is really clever.
-
That's a significant cost saving.
-
This is all great. I've already mentioned
the null check elimination.
-
We're taking a signal, and as most of you
know, if we do that a lot it's going to be slow.
-
Jumping into kernel space, into user space,
bouncing around.
-
The JVM also has a notion of
'OK I've been a bit too clever now;
-
I need to back off a bit'
-
Also there's nothing stopping the user
loading more classes
-
and rendering the monomorphic
assumption invalid.
-
So the JVM needs to have a notion of
backpedalling and going
-
'OK, I've gone too far and I need to
deoptimise'.
-
The JVM has the ability to deoptimise.
-
In other words it essentially knows that
for certain code paths everything's OK.
-
But for certain new objects it can't get
away with these tricks.
-
By the time the new objects are executed
they are going to be safe.
-
There are ramifications for this.
This is the important thing to consider
-
with something like Java and other
languages and other virtual machines.
-
If you're trying to profile this it means
there is a very significant ramification.
-
You can have the same class and
method JIT'd multiple ways
-
and executed at the same time.
-
So if you're trying to find a hot spot
the program counter alone is not enough.
-
Because you can refer to the same thing
in several different ways.
-
This is quite common, as
deoptimisation does take place.
-
That's something to bear in mind with JVM
and similar runtime environments.
-
You can get a notion of what the JVM's
trying to do.
-
You can ask it nicely and add a print
compilation option
-
and it will tell you what it's doing.
-
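That option is -XX:+PrintCompilation; its output lines look roughly like this (these example entries are illustrative, not from the talk):

    java -XX:+PrintCompilation MyApp
        123   25       3       java.lang.String::hashCode (55 bytes)
        130   26       4       CacheCheck::isImageInCache (20 bytes)
        245   25       3       java.lang.String::hashCode (55 bytes)   made not entrant
-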
This is reasonably verbose.
-
Typically what happens is the JVM gets
excited JIT'ing everything
-
and optimising everything then
it settles down.
-
Until you load something new
and it gets excited again.
-
There's a lot of logs. This is mainly
useful for debugging but
-
it gives you an appreciation that it's
doing a lot of work.
-
You can go even further with a log
compilation option.
-
That produces a lot of XML and that is
useful for people debugging the JVM as well.
-
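That option is -XX:+LogCompilation. It's a diagnostic option, so it has to be unlocked first, and it typically writes an XML log named hotspot_pid<pid>.log:

    java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation MyApp
-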
It's quite handy to get an idea of
what's going on.
-
If that is not enough information you
also have the ability to go even further.
-
This is beyond the limit of my
understanding.
-
I've gone into this little bit just to
show you what can be done.
-
There are release builds of OpenJDK
and there are debug builds of OpenJDK.
-
The release builds will by default turn
off a lot of the diagnostic options.
-
You can switch them back on again.
-
When you do you can also gain insight
into the actual compiler there,
-
which is colloquially referred to as
the C2 JIT.
-
You can see, for instance, objects in
timelines and visualize them
-
as they're being optimised at various
stages and various things.
-
So this is based on a masters thesis
by Thomas Würthinger.
-
This is something you can play with as
well and see how far the optimiser goes.
-
And it's also good for people hacking
with the JVM.
-
I'll move onto some stuff we did.
-
Last year we were working on big data on a
relatively new architecture:
-
ARM64. It's called AArch64 in OpenJDK
land but arm64 in Debian land.
-
We were a bit concerned because
everything's all shiny and new.
-
Has it been optimised correctly?
-
Are there any obvious things
we need to optimise?
-
And we're also interested because
everything was so shiny and new
-
in the whole system.
-
Not just the JVM but the glibc and
the kernel as well.
-
So how do we get a view of all of this?
-
I gave a quick talk before at the Debian
mini-conf before last [2014] about perf
-
so I decided we could try and do some
clever things with Linux perf
-
and see if we could get some actual useful
debugging information out.
-
We have the flame graphs that are quite
well known.
-
We also have some previous work: Johannes
had a special perf-map-agent that
-
could basically hook into perf and it
would give you a nice way of running
-
perf-top for want of a better expression
and viewing the top Java function names.
-
This is really good work and it's really
good for a particular use case
-
if you just want to do a quick snapshot
once and see in that snapshot
-
where the hotspots were.
-
For a prolonged workload, with all
the functions being JIT'd multiple ways,
-
with the optimisation going on and
everything moving around,
-
it requires a little bit more information
to be captured.
-
I decided to do a little bit of work on a
very similar thing to perf-map-agent
-
but an agent that would capture it over
a prolonged period of time.
-
Here's an example Flame graph, these are
all over the internet.
-
This is the SHA1 computation example that
I gave at the beginning.
-
As expected the VM intrinsic SHA1 is the
top one.
-
Not expected by me was this quite
significant chunk of CPU execution time.
-
And there was a significant amount of
time being spent copying memory
-
from the mmapped memory
region into a heap
-
and then that was passed to
the crypto engine.
-
So we're doing a ton of memory copies for
no good reason.
-
That essentially highlighted an example.
-
That was an assumption I made about Java
to begin with which was if you do
-
the equivalent of mmap it should just
work like mmap right?
-
You should just be able to address the
memory. That is not the case.
-
If you've got a file mapping object and
you try to address it it has to be copied
-
into safe heap memory first. And that is
what was slowing down the programs.
-
If that was omitted you could make
the SHA1 computation even quicker.
-
So that would be the logical target you
would want to optimise.
-
I wanted to extend Johannes' work
with something called a
-
Java Virtual Machine Tools Interface
profiling agent.
-
This is part of the Java Virtual Machine
standard: you can make a special library
-
and then hook it into the JVM.
-
And the JVM can expose quite a few
things to the library.
-
It exposes a reasonable amount of
information as well.
-
Perf as well has the ability to look
at map files natively.
-
If you are profiling JavaScript, or
something similar, I think the
-
Google V8 JavaScript engine will write
out a special map file that says
-
these program counter addresses correspond
to these function names.
-
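The map file convention (this is the real format; the symbols shown are made up) is one line per JIT'd code region in /tmp/perf-<pid>.map: start address and size in hex, then the symbol name:

    7f6f2c41b000 340 Ljava/lang/String;::hashCode
    7f6f2c41c800 1a0 LCacheCheck;::isImageInCache
-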
I decided to use that in a similar way to
what Johannes did for the extended
-
profiling agent but I also decided to
capture some more information as well.
-
I decided to capture the disassembly
so when we run perf annotate
-
we can see the actual JVM bytecode
in our annotation.
-
We can see how it was JIT'd at the
time when it was JIT'd.
-
We can see where the hotspots were.
-
And that's good. But we can do
even better.
-
We can run an annotated trace that
contains the Java class,
-
the Java method and the bytecode all in
one place at the same time.
-
You can see everything from the JVM
at the same place.
-
This works reasonably well because the
perf interface is extremely extensible.
-
And again we can do entire
system optimisation.
-
The bits in red here are the Linux kernel.
-
Then we got into libraries.
-
And then we got into Java and more
libraries as well.
-
So we can see everything from top to
bottom in one fell swoop.
-
This is just a quick slide showing the
mechanisms employed.
-
Essentially we have this agent which is
a shared object file.
-
And this will spit out useful files here
in a standard way.
-
And the Linux perf basically just records
the perf data dump file as normal.
-
We have 2 sets of recording going on.
-
To report it it's very easy to do
normal reporting with the PID map.
-
This is just out of the box, works with
the Google V8 engine as well.
-
If you want to do very clever annotations
perf has the ability to have
-
Python scripts passed to it.
-
So you can craft quite a dodgy Python
script and that can interface
-
with the perf annotation output.
-
That's how I was able to get the extra
Java information in the same annotation.
-
And this is really easy to do; it's quite
easy to knock the script up.
-
And again the only thing we do for this
profiling is we hook in the profiling
-
agent which dumps out various things.
-
We preserve the frame pointer because
that makes things considerably easier
-
when unwinding. This will affect
performance a little bit.
-
And again when we're reporting we just
hook in a Python script.
-
It's really easy to hook everything in
and get it working.
-
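Putting it together, a rough recipe (the agent path and script name are placeholders; the perf options and the -XX:+PreserveFramePointer flag are real):

    # record with call graphs; keep frame pointers so perf can unwind
    perf record -g java -XX:+PreserveFramePointer \
        -agentpath:/path/to/libperfagent.so MyBenchmark

    # plain report, symbolised via /tmp/perf-<pid>.map
    perf report

    # richer, Java-aware annotation via a custom Python script
    perf script -s annotate-java.py
-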
At the moment we have a JVMTI agent. It's
actually on http://git.linaro.org now.
-
Since I gave this talk Google have
extended perf, so it will do
-
quite a lot of similar things out of the
box anyway.
-
It's worth having a look at the
latest perf.
-
These techniques in this slide deck can be
used obviously in other JITs quite easily.
-
The fact that perf is so easy to extend
with scripts can be useful
-
for other things.
-
And OpenJDK has a significant amount of
cleverness associated with it that
-
I thought was very surprising and good.
So that's what I covered in the talk.
-
These are basically references to things
like command line arguments
-
and the Flame graphs and stuff like that.
-
If anyone is interested in playing with
OpenJDK on ARM64 I'd suggest going here:
-
http://openjdk.linaro.org
Where the most recent builds are.
-
Obviously fixes are going in upstream and
they're going into distributions as well.
-
They're included in OpenJDK so it should
be good as well.
-
I've run through quite a few fundamental
things reasonably quickly.
-
I'd be happy to accept any questions
or comments
-
And if you want to talk to me privately
about Java afterwards feel free to
-
when no-one's looking.
-
[Applause]
-
[Audience] It's not really a question so
much as a comment.
-
Last mini-DebConf we had a talk about
-
using the JVM with other languages.
-
And it seems to me that all this would
apply even if you hate Java programming
-
language and want to write in, I don't
know, lisp or something instead
-
if you've got a lisp system that can
generate JVM bytecode.
-
[Presenter] Yeah, totally. And the other
big data language we looked at was Scala.
-
It uses the JVM back end but a completely
different language on the front.
-
Cheers guys.