Java_the_good_bits.webm

  • 0:02 - 0:03
    Onto the second talk of the day.
  • 0:11 - 0:14
    Steve Capper is going to tell us
    about the good bits of Java
  • 0:15 - 0:17
    They do exist
  • 0:17 - 0:21
    [Audience] Could this have been a
    lightning talk? [Audience laughter]
  • 0:23 - 0:26
    Believe it or not we've got some
    good stuff here.
  • 0:27 - 0:30
    I was as skeptical as you guys
    when I first looked.
  • 0:31 - 0:35
    First many apologies for not attending
    the mini-conf last year
  • 0:35 - 0:40
    I was unfortunately ill on the day
    I was due to give this talk.
  • 0:44 - 0:46
    Let me figure out how to use a computer.
  • 1:01 - 1:03
    Sorry about this.
  • 1:13 - 1:16
    There we go; it's because
    I've not woken up.
  • 1:20 - 1:27
    Last year I worked at Linaro in the
    Enterprise group and we performed analysis
  • 1:28 - 1:32
    on so called 'Big Data' application sets.
  • 1:32 - 1:37
    As many of you know quite a lot of these
    big data applications are written in Java.
  • 1:38 - 1:43
    I'm from ARM and we were very interested
    in 64bit ARM support.
  • 1:43 - 1:47
    So this is mainly AArch64 examples
    for things like assembler
  • 1:48 - 1:53
    but most of the messages are
    pertinent for any architecture.
  • 1:54 - 1:58
    These good bits are shared between
    most if not all the architectures.
  • 1:59 - 2:03
    Whilst trying to optimise a lot of
    these big data applications
  • 2:03 - 2:06
    I stumbled across quite a few things in
    the JVM and I thought
  • 2:06 - 2:11
    'actually that's really clever;
    that's really cool'
  • 2:13 - 2:17
    So I thought that would make a good
    basis for an interesting talk.
  • 2:17 - 2:20
    This talk is essentially some of the
    clever things I found in the
  • 2:20 - 2:25
    Java Virtual Machine; these
    optimisations are in OpenJDK.
  • 2:26 - 2:32
    The source is available; it's all there,
    readily available and in play now.
  • 2:33 - 2:38
    I'm going to finish with some of the
    optimisation work we did with Java.
  • 2:38 - 2:43
    People who know me will know
    I'm not a Java zealot.
  • 2:43 - 2:48
    I don't particularly believe in
    programming in a language over another one
  • 2:48 - 2:51
    So to make it clear from the outset
    I'm not attempting to convert
  • 2:51 - 2:54
    anyone to Java programmers.
  • 2:54 - 2:57
    I'm just going to highlight a few salient
    things in the Java Virtual Machine
  • 2:57 - 3:00
    which I found to be quite clever and
    interesting
  • 3:00 - 3:04
    and I'll try and talk through them
    with my understanding of them.
  • 3:04 - 3:09
    Let's jump straight in and let's
    start with an example.
  • 3:10 - 3:14
    This is a minimal example for
    computing a SHA1 sum of a file.
  • 3:15 - 3:20
    I've omitted some of the checking at the
    beginning of the function, such as
  • 3:20 - 3:22
    command line parsing and that sort of
    thing.
  • 3:22 - 3:25
    I've highlighted the salient
    points in red.
  • 3:25 - 3:30
    Essentially we instantiate a SHA1
    crypto message service digest.
  • 3:30 - 3:35
    And we do the equivalent in
    Java of an mmap.
  • 3:36 - 3:38
    Get it all in memory.
  • 3:38 - 3:42
    And then we just pass this data straight
    into the crypto engine.
  • 3:42 - 3:47
    And eventually at the end of the
    program we'll spit out the SHA1 hash.
  • 3:47 - 3:49
    It's a very simple program.
  • 3:49 - 3:53
    It's basically mmap, SHA1, output
    the hash afterwards.
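The slide code is not part of the transcript, so the following is only a minimal sketch of the program as described: map the file, feed it to a SHA-1 `MessageDigest`, print the hash. The class name `Sha1OfFile` and the 1 GiB window size are assumptions made for illustration; the windowing is there because a single `FileChannel.map` call cannot cover more than `Integer.MAX_VALUE` bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical reconstruction of the "mmap, SHA1, output the hash" program.
final class Sha1OfFile {
    // Assumed window size; one mapping is limited to Integer.MAX_VALUE bytes.
    private static final long WINDOW = 1L << 30;

    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                // The Java equivalent of mmap for this window of the file.
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                // Hand the mapped bytes straight to the digest.
                digest.update(window);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha1Hex(Paths.get(args[0])));
    }
}
```

Run as `java Sha1OfFile <file>`; the output should match what `sha1sum` prints for the same file.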
  • 3:56 - 4:03
    In order to concentrate on the CPU
    aspect rather than worry about IO
  • 4:04 - 4:07
    I decided to cheat a little bit by
    setting this up.
  • 4:08 - 4:15
    I decided to use a sparse file. As many of
    you know a sparse file is a file that not
  • 4:15 - 4:20
    all the contents are stored necessarily
    on disc. The assumption is that the bits
  • 4:20 - 4:26
    that aren't stored are zero. For instance
    on Linux you can create a 20TB sparse file
  • 4:26 - 4:31
    on a 10MB file system and use it as
    normal.
  • 4:31 - 4:34
    Just don't write too much to it otherwise
    you're going to run out of space.
  • 4:34 - 4:41
    The idea behind using a sparse file is I'm
    just focusing on the computational aspects
  • 4:41 - 4:45
    of the SHA1 sum. I'm not worried about
    the file system or anything like that.
  • 4:45 - 4:49
    I don't want to worry about the IO. I
    just want to focus on the actual compute.
  • 4:49 - 4:53
    In order to set up a sparse file I used
    the following runes.
  • 4:53 - 4:57
    The important point is that you seek
    and the other important point
  • 4:57 - 5:01
    is you set a count otherwise you'll
    fill your disc up.
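The commands on the slide are not captured in the transcript. As a rough Java analogue of the same idea (seek far into the file, then write a strictly limited amount), the sketch below writes a single zero byte at the last offset. `SparseFile` and `create` are hypothetical names, and `StandardOpenOption.SPARSE` is only a hint that the file system may ignore.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: creates a file whose logical size is large
// but which only stores a single written byte.
final class SparseFile {
    static void create(Path file, long logicalSize) throws IOException {
        if (logicalSize < 1) {
            throw new IllegalArgumentException("logicalSize must be at least 1");
        }
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE_NEW,
                StandardOpenOption.WRITE,
                StandardOpenOption.SPARSE)) {
            // "Seek" to the final offset and write exactly one byte there.
            // On file systems with sparse file support the gap before it
            // is not stored on disc.
            channel.write(ByteBuffer.wrap(new byte[] {0}), logicalSize - 1);
        }
    }
}
```

Whether the gap really occupies no disc space depends on the file system; comparing `ls -l` with `du` on the result is one way to check.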
  • 5:03 - 5:09
    I decided to run this. Firstly,
    let's take the native sha1sum command
  • 5:09 - 5:15
    that's built into Linux and let's
    normalise these results and say that's 1.0
  • 5:17 - 5:21
    I used an older version of the OpenJDK
    and ran the Java program
  • 5:21 - 5:28
    and that's 1.09 times slower than the
    reference command. That's quite good.
  • 5:30 - 5:39
    Then I used the new OpenJDK, this is now
    the current JDK as this is a year on.
  • 5:39 - 5:45
    And it took 0.21. It's significantly faster.
  • 5:46 - 5:51
    I've stressed that I've done nothing
    surreptitious in the Java program.
  • 5:51 - 5:54
    It is mmap, compute, spit result out.
  • 5:56 - 6:01
    But the OpenJDK has essentially got
    some more context information.
  • 6:01 - 6:04
    I'll talk about that as we go through.
  • 6:06 - 6:11
    Before when I started Java I had a very
    simplistic view of Java.
  • 6:11 - 6:17
    Traditionally Java is taught as a virtual
    machine that runs bytecode.
  • 6:17 - 6:21
    Now when you compile a Java program it
    compiles into bytecode.
  • 6:21 - 6:25
    The older versions of the Java Virtual
    Machine would interpret this bytecode
  • 6:25 - 6:32
    and then run through. Newer versions
    would employ a just-in-time engine
  • 6:32 - 6:38
    and try and compile this bytecode
    into native machine code.
  • 6:39 - 6:43
    That is not the only thing that goes on
    when you run a Java program.
  • 6:43 - 6:47
    There are some extra optimisations as well.
    So this alone would not account for
  • 6:47 - 6:52
    the newer version of the SHA1
    sum being significantly faster
  • 6:52 - 6:56
    than the distro supplied one.
  • 6:56 - 7:01
    Java knows about context. It has a class
    library and these class libraries
  • 7:01 - 7:04
    have reasonably well defined purposes.
  • 7:04 - 7:08
    We have classes that provide
    crypto services.
  • 7:08 - 7:11
    We have sun.misc.Unsafe, which every
    single project seems to pull into their
  • 7:11 - 7:13
    project when they're not supposed to.
  • 7:13 - 7:17
    These have well defined meanings.
  • 7:17 - 7:21
    These do not necessarily have to be
    written in Java.
  • 7:21 - 7:24
    They come as Java classes,
    they come supplied.
  • 7:24 - 7:29
    But most JVMs now have a notion
    of a virtual machine intrinsic.
  • 7:29 - 7:35
    And the virtual machine intrinsic says ok
    please do a SHA1 in the best possible way
  • 7:35 - 7:39
    that your implementation allows. This is
    something done automatically by the JVM.
  • 7:39 - 7:43
    You don't ask for it. If the JVM knows
    what it's running on and it's reasonably
  • 7:43 - 7:48
    recent this will just happen
    for you for free.
  • 7:48 - 7:50
    And there's quite a few classes
    that do this.
  • 7:50 - 7:54
    There's quite a few clever things with
    atomics, there's crypto,
  • 7:54 - 7:58
    there's mathematical routines as well.
    Most of these routines in the
  • 7:58 - 8:03
    class library have a well defined notion
    of a virtual machine intrinsic
  • 8:03 - 8:07
    and they do run reasonably optimally.
  • 8:07 - 8:11
    They are a subject of continuous
    optimisation as well.
  • 8:12 - 8:16
    We've got some runes that are
    presented on the slides here.
  • 8:17 - 8:21
    These are quite useful if you
    are interested in
  • 8:21 - 8:24
    how these intrinsics are made.
  • 8:24 - 8:29
    You can ask the JVM to print out a lot of
    the just-in-time compiled code.
  • 8:29 - 8:35
    You can ask the JVM to print out the
    native methods as well as these intrinsics
  • 8:35 - 8:40
    and in this particular case after sifting
    through about 5MB of text
  • 8:40 - 8:45
    I've come across this particular SHA1 sum
    implementation.
  • 8:45 - 8:52
    This is AArch64. This is employing the
    cryptographic extensions
  • 8:52 - 8:54
    in the architecture.
  • 8:54 - 8:57
    So it's essentially using the CPU
    instructions which would explain why
  • 8:57 - 9:00
    it's faster. But again it's done
    all this automatically.
  • 9:00 - 9:06
    This did not require any specific runes
    or anything to activate.
  • 9:08 - 9:12
    We'll see a bit later on how you can
    more easily find the hot spots
  • 9:12 - 9:15
    rather than sifting through a lot
    of assembler.
  • 9:15 - 9:19
    I've mentioned that the cryptographic
    engine is employed and again
  • 9:19 - 9:23
    this routine was generated at run
    time as well.
  • 9:23 - 9:28
    This is one of the important things about
    certain execution environments like Java.
  • 9:28 - 9:31
    You don't have to know everything at
    compile time.
  • 9:31 - 9:35
    You know a lot more information at
    run time and you can use that
  • 9:35 - 9:37
    in theory to optimise.
  • 9:37 - 9:40
    You can switch off these clever routines.
  • 9:40 - 9:43
    For instance I've got a deactivate
    here and we get back to the
  • 9:43 - 9:47
    slower performance we expected.
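The exact switch shown on the slide is not in the transcript. On HotSpot builds that have it, the SHA-1 intrinsic is controlled by the `UseSHA1Intrinsics` flag (so `-XX:-UseSHA1Intrinsics` on the command line would turn it off); treating that as the flag meant here is an assumption. The sketch below only reads the flag's current value from inside a running program, and returns nothing when the JVM does not expose it. `IntrinsicFlag` and `flagValue` are illustrative names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

// Hypothetical helper: asks a HotSpot JVM for the current value of a VM flag.
final class IntrinsicFlag {
    static Optional<String> flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty(); // not a HotSpot-style JVM
            }
            return Optional.of(bean.getVMOption(name).getValue());
        } catch (IllegalArgumentException e) {
            // Thrown when the flag (or the bean) is unknown to this JVM.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("UseSHA1Intrinsics = "
                + flagValue("UseSHA1Intrinsics").orElse("not exposed by this JVM"));
    }
}
```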
  • 9:47 - 9:53
    Again, this particular set of routines is
    present in OpenJDK,
  • 9:53 - 9:57
    I think for all the architectures that
    support it.
  • 9:57 - 10:01
    We get this optimisation for free on X86
    and others as well.
  • 10:01 - 10:03
    It works quite well.
  • 10:03 - 10:08
    That was one surprise I came across:
    the intrinsics.
  • 10:08 - 10:13
    One thing I thought it would be quite
    good to do would be to go through
  • 10:13 - 10:18
    a slightly more complicated example.
    And use this example to explain
  • 10:18 - 10:21
    a lot of other things that happen
    in the JVM as well.
  • 10:21 - 10:24
    I will spend a bit of time going through
    this example
  • 10:24 - 10:30
    and explain roughly the notion of what
    it's supposed to be doing.
  • 10:33 - 10:39
    This is an imaginary method that I've
    contrived to demonstrate a lot of points
  • 10:39 - 10:43
    in the fewest possible lines of code.
  • 10:43 - 10:45
    I'll start with what it's meant to do.
  • 10:45 - 10:51
    This is meant to be a routine that gets a
    reference to something and lets you know
  • 10:51 - 10:56
    whether or not it's an image and in a
    hypothetical cache.
  • 10:58 - 11:02
    I'll start with the important thing
    here the weak reference.
  • 11:02 - 11:09
    In Java and other garbage collected
    languages we have the notion of references
  • 11:09 - 11:13
    Most of the time when you are running a
    Java program you have something like a
  • 11:13 - 11:19
    variable name and that is in the current
    execution context that is referred to as a
  • 11:19 - 11:24
    strong reference to the object. In other
    words I can see it. I am using it.
  • 11:24 - 11:27
    Please don't get rid of it.
    Bad things will happen if you do.
  • 11:27 - 11:31
    So the garbage collector knows
    not to get rid of it.
  • 11:31 - 11:36
    In Java and other languages you also
    have the notion of a weak reference.
  • 11:36 - 11:40
    This is essentially the programmer saying
    to the virtual machine
  • 11:40 - 11:44
    "Look I kinda care about this but
    just a little bit."
  • 11:44 - 11:49
    "If you want to get rid of it feel free
    to but please let me know."
  • 11:49 - 11:54
    This is why this is for a CacheClass.
    For instance the JVM in this particular
  • 11:54 - 12:01
    case could decide that it's running quite
    low on memory and, as this particular xMB
  • 12:01 - 12:04
    image has not been used for a while,
    it can garbage collect it.
    The important thing is how we go about
    expressing this in the language.
  • 12:09 - 12:13
    We can't just have a reference to the
    object because that's a strong reference
  • 12:13 - 12:18
    and the JVM will know it can't get
    rid of this because the program
  • 12:18 - 12:19
    can see it actively.
  • 12:19 - 12:24
    So we have a level of indirection which
    is known as a weak reference.
  • 12:25 - 12:29
    We have this hypothetical CacheClass
    that I've devised.
  • 12:29 - 12:32
    At this point it is a weak reference.
  • 12:32 - 12:36
    Then we get it. This is calling the weak
    reference routine.
  • 12:36 - 12:41
    Now it becomes a strong reference so
    it's not going to be garbage collected.
  • 12:41 - 12:45
    When we get to the return path it becomes
    a weak reference again
  • 12:45 - 12:48
    because our strong reference
    has disappeared.
  • 12:48 - 12:51
    The salient points in this example are:
  • 12:51 - 12:54
    We're employing a method to get
    a reference.
  • 12:54 - 12:57
    We're checking an item to see if
    it's null.
  • 12:57 - 13:01
    So let's say that the JVM decided to
    garbage collect this
  • 13:02 - 13:04
    before we executed the method.
  • 13:04 - 13:09
    The weak reference class is still valid
    because we've got a strong reference to it
  • 13:09 - 13:12
    but the actual object behind this is gone.
  • 13:12 - 13:15
    If we're too late and the garbage
    collector has killed it
  • 13:15 - 13:18
    it will be null and we return.
  • 13:18 - 13:22
    So it's a level of indirection to see
    does this still exist
  • 13:22 - 13:28
    if so can I please have it and then
    operate on it as normal
  • 13:28 - 13:31
    and then on return it becomes a weak
    reference again.
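The method on the slide is not reproduced in the transcript, so this is a hedged reconstruction of its shape from the description above: get the referent, null-check it, call a virtual method, return. `Item`, `isCachedImage` and `evict` are hypothetical names; only `CacheClass` and the use of a weak reference come from the talk.

```java
import java.lang.ref.WeakReference;

// Hypothetical stand-in for whatever type the cache holds.
interface Item {
    boolean isImage();
}

final class CacheClass {
    // The cache itself is strongly held; the cached item only weakly.
    private final WeakReference<Item> ref;

    CacheClass(Item item) {
        this.ref = new WeakReference<>(item);
    }

    boolean isCachedImage() {
        Item item = ref.get();   // now a strong reference for this method
        if (item == null) {      // too late: the collector already cleared it
            return false;
        }
        return item.isImage();   // virtual call on the still-live object
    }                            // strong reference dropped on return

    // Added for the sketch only: simulates the collector clearing the referent.
    void evict() {
        ref.clear();
    }
}
```

Calling `isCachedImage()` while something else still holds the item returns whatever `isImage()` says; after `evict()` (or a real collection) it returns `false`.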
  • 13:31 - 13:37
    This example program is quite useful when
    we look at how it's implemented in the JVM
  • 13:37 - 13:40
    and we'll go through a few things now.
  • 13:40 - 13:44
    First off we'll go through the bytecode.
  • 13:44 - 13:49
    The only point of this slide is to
    show it's roughly
  • 13:49 - 13:54
    the same as this.
  • 13:54 - 13:56
    We get our variable.
  • 13:56 - 13:59
    We use our getter.
  • 13:59 - 14:04
    This bit is extra, this checkcast.
    The reason that bit is extra is
  • 14:04 - 14:15
    because we're using the equivalent of
    a template in Java.
  • 14:15 - 14:19
    And the way that's implemented in Java is
    it just basically casts everything to an
  • 14:19 - 14:23
    object so that requires extra
    compiler information.
  • 14:23 - 14:25
    And this is the extra check.
  • 14:25 - 14:31
    The rest of this we load the reference,
    we check to see if it is null,
  • 14:31 - 14:35
    If it's not null we invoke a virtual
    function - is it the image?
  • 14:35 - 14:38
    and we return as normal.
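One way to see the compiler-inserted checkcast from plain Java is to defeat the generic type with a raw reference. This is an illustrative sketch, not the slide's code: the list really holds `Object`s at run time, so the bad element goes in unnoticed, and the failure only appears at the hidden cast where the element is read back as a `String`.

```java
import java.util.ArrayList;
import java.util.List;

final class ErasureCheckcast {
    @SuppressWarnings({"unchecked", "rawtypes"})
    static boolean castFailsAtUseSite() {
        List<String> strings = new ArrayList<>();
        List raw = strings;            // raw view of the same list
        raw.add(Integer.valueOf(42));  // accepted: the list stores plain Objects
        try {
            // The compiler emits a checkcast to String after get(); it fails here.
            String s = strings.get(0);
            return s == null;          // not reached
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```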
  • 14:38 - 14:43
    Essentially the point I'm trying to make
    is when we compile this to bytecode
  • 14:43 - 14:45
    this execution happens.
  • 14:45 - 14:47
    This null check happens.
  • 14:47 - 14:48
    This execution happens.
  • 14:48 - 14:50
    And we return.
  • 14:50 - 14:55
    In the actual Java class files we've not
    lost anything.
  • 14:55 - 14:58
    This is what it looks like when it's
    been JIT'd.
  • 14:58 - 15:01
    Now we've lost lots of things.
  • 15:01 - 15:06
    The JIT has done quite a few clever things
    which I'll talk about.
  • 15:06 - 15:11
    First off if we look down here there's
    a single branch here.
  • 15:11 - 15:15
    And this is only if our check cast failed
  • 15:17 - 15:20
    We've got comments on the
    right hand side.
  • 15:20 - 15:26
    Our get method has been inlined so
    we're no longer calling.
  • 15:27 - 15:31
    We seem to have lost our null check,
    that's just gone.
  • 15:32 - 15:36
    And again we've got a get field as well.
  • 15:36 - 15:40
    That's no longer a method,
    that's been inlined as well.
  • 15:40 - 15:42
    We've also got some other cute things.
  • 15:42 - 15:46
    Those more familiar with AArch64
    will understand
  • 15:46 - 15:50
    that the pointers we're using
    are 32bit not 64bit.
  • 15:50 - 15:54
    What we're doing is getting a pointer
    and shifting it left 3
  • 15:54 - 15:57
    and widening it to a 64bit pointer.
  • 15:57 - 16:02
    We've also got 32bit pointers on a
    64bit system as well.
  • 16:02 - 16:06
    So that's saving a reasonable amount
    of memory and cache.
  • 16:06 - 16:10
    To summarise. We don't have any
    branches or function calls
  • 16:10 - 16:13
    and we've got a lot of inlining.
  • 16:13 - 16:16
    We did have function calls in the
    class file so it's the JVM;
  • 16:16 - 16:18
    it's the JIT that has done this.
  • 16:18 - 16:22
    We've got no null checks either and I'm
    going to talk through this now.
  • 16:24 - 16:29
    The null check elimination is quite a
    clever feature in Java and other programs.
  • 16:30 - 16:33
    The idea behind null check elimination is
  • 16:33 - 16:37
    most of the time this object is not
    going to be null.
  • 16:38 - 16:43
    If this object is null the operating
    system knows this quite quickly.
  • 16:43 - 16:48
    So if you try to dereference a null
    pointer you'll get either a SIGSEGV or
  • 16:48 - 16:51
    a SIGBUS depending on a
    few circumstances.
  • 16:51 - 16:53
    That goes straight back to the JVM
  • 16:53 - 16:58
    and the JVM knows where the null
    exception took place.
  • 16:58 - 17:02
    Because it knows where it took
    place it can look this up
  • 17:02 - 17:05
    and unwind it as part of an exception.
  • 17:05 - 17:10
    Those null checks just go.
    Completely gone.
  • 17:10 - 17:15
    Most of the time this works and you are
    saving a reasonable amount of execution.
  • 17:16 - 17:20
    I'll talk about when it doesn't work
    in a second.
  • 17:20 - 17:24
    That's reasonably clever. We have similar
    programming techniques in other places
  • 17:24 - 17:28
    even the Linux kernel for instance when
    you copy data to and from user space
  • 17:28 - 17:31
    it does pretty much the identical
    thing.
  • 17:31 - 17:36
    It has an exception unwind table and it
    knows if it catches a page fault on
  • 17:36 - 17:40
    this particular program counter
    it can deal with it because it knows
  • 17:40 - 17:44
    the program counter and it knows
    conceptually what it was doing.
  • 17:44 - 17:48
    In a similar way the JIT knows what it's
    doing to a reasonable degree.
  • 17:48 - 17:52
    It can handle the null check elimination.
  • 17:53 - 17:57
    I mentioned the sneaky one. We've got
    essentially 32bit pointers
  • 17:57 - 17:59
    on a 64bit system.
  • 17:59 - 18:05
    Most of the time in Java people typically
    specify heap size smaller than 32GB.
  • 18:05 - 18:10
    Which is perfect if you want to use 32bit
    pointers and left shift 3.
  • 18:10 - 18:13
    Because that gives you 32GB of
    addressable memory.
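The arithmetic behind that figure can be checked in a few lines. This is a sketch of the idea only: the `heapBase` parameter is an assumption added for illustration, while the 32-bit value and the left shift by 3 are from the talk. Shifting by 3 works because objects are taken to be aligned to 8 bytes, so the low three address bits are always zero.

```java
// Illustrative model of widening a 32-bit compressed reference.
final class CompressedPointer {
    static final int SHIFT = 3; // assumes 8-byte object alignment

    // Treat the 32-bit value as unsigned, shift left 3, add the assumed base.
    static long decode(long heapBase, int narrow) {
        return heapBase + ((narrow & 0xFFFF_FFFFL) << SHIFT);
    }

    // 2^32 distinct values, each naming an 8-byte slot: 2^35 bytes = 32 GiB.
    static long addressableBytes() {
        return 1L << (32 + SHIFT);
    }
}
```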
  • 18:13 - 18:19
    That's a significant memory saving because
    otherwise a lot of things would double up.
  • 18:19 - 18:23
    There's a significant number of pointers
    in Java.
  • 18:23 - 18:29
    The one that should make people
    jump out of their seat is
  • 18:29 - 18:32
    the fact that most methods in Java are
    actually virtual.
  • 18:32 - 18:37
    So what the JVM has actually done is
    inlined a virtual function.
  • 18:37 - 18:42
    A virtual function is essentially a
    function where you don't know where
  • 18:42 - 18:43
    you're going until run time.
  • 18:43 - 18:47
    You can have several different classes
    and they share the same virtual function
  • 18:47 - 18:51
    in the base class and dependent upon
    which specific class you're running
  • 18:51 - 18:54
    different virtual functions will
    get executed.
  • 18:54 - 19:00
    In C++ that will be a read from a V table
    and then you know where to go.
  • 19:01 - 19:03
    The JVM's inlined it.
  • 19:03 - 19:05
    We've saved a memory load.
  • 19:05 - 19:08
    We've saved a branch as well
  • 19:08 - 19:12
    The reason the JVM can inline it is
    because the JVM knows
  • 19:12 - 19:14
    every single class that has been loaded.
  • 19:14 - 19:20
    So it knows that although this looks
    polymorphic to the casual programmer
  • 19:20 - 19:26
    It actually is monomorphic.
    The JVM knows this.
  • 19:26 - 19:31
    Because it knows this it can be clever.
    And this is really clever.
  • 19:31 - 19:35
    That's a significant cost saving.
  • 19:35 - 19:41
    This is all great. I've already mentioned
    the null check elimination.
  • 19:41 - 19:47
    We're taking a signal as most of you know
    if we do that a lot it's going to be slow.
  • 19:47 - 19:51
    Jumping into kernel, into user,
    bouncing around.
  • 19:51 - 19:56
    The JVM also has a notion of
    'OK I've been a bit too clever now;
  • 19:56 - 19:58
    I need to back off a bit'
  • 19:58 - 20:02
    Also there's nothing stopping the user
    loading more classes
  • 20:02 - 20:07
    and rendering the monomorphic
    assumption invalid.
  • 20:07 - 20:10
    So the JVM needs to have a notion of
    backpedalling and go
  • 20:10 - 20:14
    'OK I've gone too far and need to
    deoptimise'
  • 20:14 - 20:17
    The JVM has the ability to deoptimise.
  • 20:17 - 20:23
    In other words it essentially knows that
    for certain code paths everything's OK.
  • 20:23 - 20:27
    But for certain new objects it can't get
    away with these tricks.
  • 20:27 - 20:32
    By the time the new objects are executed
    they are going to be safe.
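A small program can set up exactly this situation. The sketch below is illustrative (`Shape`, `Triangle`, `Square` and `CallSiteDemo` are made-up names): the call site sees only one implementation while it warms up, and a second implementation appears afterwards. Whether the JIT actually inlines and later deoptimises depends on the JVM and its settings; on HotSpot, running with `-XX:+PrintCompilation` may show the method being marked "made not entrant". Either way the results must stay correct, which is what the code checks.

```java
import java.util.Arrays;

interface Shape {
    int sides();
}

final class Triangle implements Shape {
    public int sides() { return 3; }
}

final class Square implements Shape {
    public int sides() { return 4; }
}

final class CallSiteDemo {
    static int totalSides(Shape[] shapes) {
        int total = 0;
        for (Shape s : shapes) {
            total += s.sides(); // the virtual call site in question
        }
        return total;
    }

    static int run() {
        // Warm-up: only Triangle has been seen, so the call looks monomorphic.
        Shape[] onlyTriangles = new Shape[1000];
        Arrays.fill(onlyTriangles, new Triangle());
        int warm = 0;
        for (int i = 0; i < 20_000; i++) {
            warm = totalSides(onlyTriangles);
        }
        // A second implementation shows up; any "only Triangle" assumption
        // baked into compiled code has to be abandoned safely.
        Shape[] mixed = { new Triangle(), new Square() };
        return warm + totalSides(mixed);
    }
}
```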
  • 20:32 - 20:35
    There are ramifications for this.
    This is the important thing to consider
  • 20:35 - 20:40
    with something like Java and other
    languages and other virtual machines.
  • 20:40 - 20:46
    If you're trying to profile this it means
    there is a very significant ramification.
  • 20:46 - 20:51
    You can have the same class and
    method JIT'd multiple ways
  • 20:52 - 20:55
    and executed at the same time.
  • 20:55 - 21:00
    So if you're trying to find a hot spot
    the program counter's not enough.
  • 21:01 - 21:04
    Because you can refer to the same thing
    in several different ways.
  • 21:04 - 21:08
    This is quite common as well as
    deoptimisation does take place.
  • 21:09 - 21:14
    That's something to bear in mind with JVM
    and similar runtime environments.
  • 21:16 - 21:19
    You can get a notion of what the JVM's
    trying to do.
  • 21:19 - 21:22
    You can ask it nicely and add a print
    compilation option
  • 21:22 - 21:25
    and it will tell you what it's doing.
  • 21:25 - 21:27
    This is reasonably verbose.
  • 21:27 - 21:30
    Typically what happens is the JVM gets
    excited JIT'ing everything
  • 21:30 - 21:32
    and optimising everything then
    it settles down.
  • 21:32 - 21:35
    Until you load something new
    and it gets excited again.
  • 21:35 - 21:38
    There's a lot of logs. This is mainly
    useful for debugging but
  • 21:38 - 21:42
    it gives you an appreciation that it's
    doing a lot of work.
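The option names are on the slides rather than in the transcript; on HotSpot the print compilation option is `-XX:+PrintCompilation`. For a much coarser view from inside a program, the standard `CompilationMXBean` reports how much time the JIT has spent so far. The sketch below is an illustration of that API, not something from the talk, and `JitActivity` is a made-up name.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

final class JitActivity {
    // Accumulated JIT compilation time in milliseconds,
    // or -1 when this JVM does not report it.
    static long jitMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }

    public static void main(String[] args) {
        long before = jitMillis();
        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) {
            sum += Integer.bitCount(i); // hot loop to give the JIT something to do
        }
        long after = jitMillis();
        System.out.println("sum=" + sum
                + ", JIT ms spent during loop: " + (after - before));
    }
}
```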
  • 21:42 - 21:45
    You can go even further with a log
    compilation option.
  • 21:45 - 21:50
    That produces a lot of XML and that is
    useful for people debugging the JVM as well.
  • 21:51 - 21:54
    It's quite handy to get an idea of
    what's going on.
  • 21:57 - 22:03
    If that is not enough information you
    also have the ability to go even further.
  • 22:05 - 22:07
    This is beyond the limit of my
    understanding.
  • 22:07 - 22:11
    I've gone into this little bit just to
    show you what can be done.
  • 22:11 - 22:17
    You have release builds of OpenJDK
    and they have debug builds of OpenJDK.
  • 22:17 - 22:24
    The release builds will by default turn
    off a lot of the diagnostic options.
  • 22:25 - 22:28
    You can switch them back on again.
  • 22:28 - 22:33
    When you do you can also gain insight
    into the actual, it's colloquially
  • 22:33 - 22:37
    referred to as the C2 JIT,
    the compiler there.
  • 22:37 - 22:42
    You can see, for instance, objects in
    timelines and visualize them
  • 22:42 - 22:45
    as they're being optimised at various
    stages and various things.
  • 22:45 - 22:52
    So this is based on a master's thesis
    by Thomas Würthinger.
  • 22:54 - 22:58
    This is something you can play with as
    well and see how far the optimiser goes.
  • 23:00 - 23:03
    And it's also good for people hacking
    with the JVM.
  • 23:05 - 23:08
    I'll move onto some stuff we did.
  • 23:10 - 23:16
    Last year we were working on the
    big data. Relatively new architecture
  • 23:17 - 23:22
    ARM64, it's called AArch64 in OpenJDK
    land but ARM64 in Debian land.
  • 23:24 - 23:27
    We were a bit concerned because
    everything's all shiny and new.
  • 23:27 - 23:29
    Has it been optimised correctly?
  • 23:29 - 23:31
    Are there any obvious things
    we need to optimise?
  • 23:31 - 23:34
    And we're also interested because
    everything was so shiny and new
  • 23:34 - 23:35
    in the whole system.
  • 23:35 - 23:39
    Not just the JVM but the glibc and
    the kernel as well.
  • 23:39 - 23:42
    So how do we get a view of all of this?
  • 23:42 - 23:49
    I gave a quick talk before at the Debian
    mini-conf before last [2014] about perf
  • 23:50 - 23:53
    so decided we could try and do some
    clever things with Linux perf
  • 23:53 - 23:58
    and see if we could get some actual useful
    debugging information out.
  • 23:58 - 24:02
    We have the flame graphs that are quite
    well known.
  • 24:02 - 24:08
    We also have some previous work, Johannes
    had a special perf map agent that
  • 24:08 - 24:13
    could basically hook into perf and it
    would give you a nice way of running
  • 24:13 - 24:20
    perf-top for want of a better expression
    and viewing the top Java function names.
  • 24:22 - 24:25
    This is really good work and it's really
    good for a particular use case
  • 24:25 - 24:29
    if you just want to do a quick snap shot
    once and see in that snap shot
  • 24:30 - 24:32
    where the hotspots were.
  • 24:32 - 24:38
    For a prolonged work load with all
    the functions being JIT'd multiple ways
  • 24:38 - 24:42
    with the optimisation going on and
    everything moving around
  • 24:42 - 24:47
    it requires a little bit more information
    to be captured.
  • 24:47 - 24:51
    I decided to do a little bit of work on a
    very similar thing to perf-map-agent
  • 24:51 - 24:56
    but an agent that would capture it over
    a prolonged period of time.
  • 24:56 - 24:59
    Here's an example Flame graph, these are
    all over the internet.
  • 24:59 - 25:05
    This is the SHA1 computation example that
    I gave at the beginning.
  • 25:05 - 25:10
    As expected the VM intrinsic SHA1 is the
    top one.
  • 25:10 - 25:17
    Not expected by me was this quite
    significant chunk of CPU execution time.
  • 25:17 - 25:21
    And there was a significant amount of
    time being spent copying memory
  • 25:21 - 25:28
    from the mmapped memory
    region into a heap
  • 25:28 - 25:31
    and then that was passed to
    the crypto engine.
  • 25:31 - 25:35
    So we're doing a ton of memory copies for
    no good reason.
  • 25:35 - 25:39
    That essentially highlighted an example.
  • 25:39 - 25:42
    That was an assumption I made about Java
    to begin with which was if you do
  • 25:42 - 25:45
    the equivalent of mmap it should just
    work like mmap right?
  • 25:45 - 25:48
    You should just be able to address the
    memory. That is not the case.
  • 25:48 - 25:54
    If you've got a file mapping object and
    you try to address it it has to be copied
  • 25:54 - 25:59
    into safe heap memory first. And that is
    what was slowing down the programs.
  • 25:59 - 26:05
    If that was omitted you could make
    the SHA1 computation even quicker.
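Part of why a copy is needed can be seen from the buffer API itself. A mapped buffer is a direct buffer, and a direct buffer does not expose a Java `byte[]`, so code that wants a heap array has to copy the bytes out first. The sketch below only demonstrates that property of `ByteBuffer`; that the digest code takes the copying path for such buffers is what the flame graph in the talk indicated, not something this snippet proves. `BufferBacking` is an illustrative name.

```java
import java.nio.ByteBuffer;

final class BufferBacking {
    // Reports where a buffer lives and whether its bytes are
    // reachable as an ordinary Java array.
    static String describe(ByteBuffer buffer) {
        return (buffer.isDirect() ? "direct" : "heap")
                + (buffer.hasArray() ? ", exposes a byte[]" : ", no accessible byte[]");
    }

    public static void main(String[] args) {
        System.out.println(describe(ByteBuffer.allocate(8)));
        System.out.println(describe(ByteBuffer.allocateDirect(8)));
    }
}
```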
  • 26:05 - 26:09
    So that would be the logical target you
    would want to optimise.
  • 26:09 - 26:12
    I wanted to extend Johannes' work
    with something called a
  • 26:12 - 26:16
    Java Virtual Machine Tool Interface
    profiling agent.
  • 26:17 - 26:23
    This is part of the Java Virtual Machine
    standard as you can make a special library
  • 26:23 - 26:25
    and then hook this into the JVM.
  • 26:25 - 26:28
    And the JVM can expose quite a few
    things to the library.
  • 26:28 - 26:32
    It exposes a reasonable amount of
    information as well.
  • 26:32 - 26:39
    Perf as well has the ability to look
    at map files natively.
  • 26:40 - 26:44
    If you are profiling JavaScript, or
    something similar, I think the
  • 26:44 - 26:48
    Google V8 JavaScript engine will write
    out a special map file that says
  • 26:48 - 26:53
    these program counter addresses correspond
    to these function names.
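For reference, the map file perf looks for is a plain text file named `/tmp/perf-<pid>.map`, with one line per symbol: start address and size in hexadecimal, then the symbol name. The sketch below just formats such entries in Java to show the shape of the data; a real agent would be a native JVMTI library, and `PerfMapEntry` is a made-up name.

```java
// Illustrative formatter for perf map file entries.
final class PerfMapEntry {
    // perf picks the file up by process id.
    static String fileName(long pid) {
        return "/tmp/perf-" + pid + ".map";
    }

    // One line per JIT'd region: "<start hex> <size hex> <symbol name>".
    static String line(long start, long size, String symbol) {
        return Long.toHexString(start) + " " + Long.toHexString(size) + " " + symbol;
    }
}
```

Because a method can be JIT'd more than once and moved around, a long-running capture ends up with several entries for the same name at different addresses, which is the case the extended agent was written for.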
  • 26:53 - 26:57
    I decided to use that in a similar way to
    what Johannes did for the extended
  • 26:57 - 27:02
    profiling agent but I also decided to
    capture some more information as well.
  • 27:05 - 27:10
    I decided to capture the disassembly
    so when we run perf annotate
  • 27:10 - 27:13
    we can see the actual JVM bytecode
    in our annotation.
  • 27:13 - 27:17
    We can see how it was JIT'd at the
    time when it was JIT'd.
  • 27:17 - 27:20
    We can see where the hotspots were.
  • 27:20 - 27:23
    And that's good. But we can go
    even better.
  • 27:23 - 27:29
    We can run an annotated trace that
    contains the Java class,
  • 27:29 - 27:34
    the Java method and the bytecode all in
    one place at the same time.
  • 27:34 - 27:39
    You can see everything from the JVM
    at the same place.
  • 27:39 - 27:44
    This works reasonably well because the
    perf interface is extremely extensible.
  • 27:44 - 27:48
    And again we can do entire
    system optimisation.
  • 27:48 - 27:52
    The bits in red here are the Linux kernel.
  • 27:52 - 27:55
    Then we got into libraries.
  • 27:55 - 27:58
    And then we got into Java and more
    libraries as well.
  • 27:58 - 28:02
    So we can see everything from top to
    bottom in one fell swoop.
  • 28:04 - 28:08
    This is just a quick slide showing the
    mechanisms employed.
  • 28:08 - 28:12
    Essentially we have this agent which is
    a shared object file.
  • 28:12 - 28:16
    And this will spit out useful files here
    in a standard way.
  • 28:16 - 28:26
    And the Linux perf basically just records
    the perf data dump file as normal.
  • 28:27 - 28:30
    We have 2 sets of recording going on.
  • 28:30 - 28:35
    To report it it's very easy to do
    normal reporting with the PID map.
  • 28:35 - 28:41
    This is just out of the box, works with
    the Google V8 engine as well.
  • 28:41 - 28:45
    If you want to do very clever annotations
    perf has the ability to have
  • 28:45 - 28:48
    Python scripts passed to it.
  • 28:48 - 28:54
    So you can craft quite a dodgy Python
    script and that can interface
  • 28:54 - 28:55
    with the perf annotation output.
  • 28:55 - 29:00
    That's how I was able to get the extra
    Java information in the same annotation.
  • 29:00 - 29:05
    And this is really easy to do; it's quite
    easy to knock the script up.
  • 29:05 - 29:10
    And again the only thing we do for this
    profiling is we hook in the profiling
  • 29:10 - 29:13
    agent which dumps out various things.
  • 29:13 - 29:18
    We preserve the frame pointer because
    that makes things considerably easier
  • 29:18 - 29:21
    on unwinding. This will affect
    performance a little bit.
  • 29:21 - 29:26
    And again when we're reporting we just
    hook in a Python script.
  • 29:26 - 29:30
    It's really easy to hook everything in
    and get it working.
  • 29:33 - 29:37
    At the moment we have a JVMTI agent. It's
    actually on http://git.linaro.org now.
  • 29:38 - 29:42
    Since I gave this talk Google have
    extended perf anyway so it will do
  • 29:42 - 29:45
    quite a lot of similar things out of the
    box anyway.
  • 29:45 - 29:50
    It's worth having a look at the
    latest perf.
  • 29:50 - 29:54
    These techniques in this slide deck can be
    used obviously in other JITs quite easily.
  • 29:54 - 29:59
    The fact that perf is so easy to extend
    with scripts can be useful
  • 29:59 - 30:01
    for other things.
  • 30:01 - 30:06
    And OpenJDK has a significant amount of
    cleverness associated with it that
  • 30:06 - 30:10
    I thought was very surprising and good.
    So that's what I covered in the talk.
  • 30:13 - 30:18
    These are basically references to things
    like command line arguments
  • 30:18 - 30:20
    and the Flame graphs and stuff like that.
  • 30:20 - 30:26
    If anyone is interested in playing with
    OpenJDK on ARM64 I'd suggest going here:
  • 30:26 - 30:31
    http://openjdk.linaro.org
    Where the most recent builds are.
  • 30:31 - 30:36
    Obviously fixes are going in upstream and
    they're going into distributions as well.
  • 30:36 - 30:40
    They're included in OpenJDK so it should
    be good as well.
  • 30:41 - 30:45
    I've run through quite a few fundamental
    things reasonably quickly.
  • 30:45 - 30:48
    I'd be happy to accept any questions
    or comments
  • 30:54 - 30:57
    And if you want to talk to me privately
    about Java afterwards feel free to
  • 30:57 - 30:59
    when no-one's looking.
  • 31:07 - 31:13
    [Audience] Applause
  • 31:13 - 31:19
    [Audience] It's not really a question so
    much as a comment.
  • 31:19 - 31:26
    Last mini-Deb conf we had a talk about
  • 31:27 - 31:32
    using the JVM with other languages.
  • 31:32 - 31:36
    And it seems to me that all this would
    apply even if you hate Java programming
  • 31:36 - 31:39
    language and want to write in, I don't
    know, lisp or something instead
  • 31:39 - 31:42
    if you've got a lisp system that can
    generate JVM bytecode.
  • 31:42 - 31:48
    [Presenter] Yeah, totally. And the other
    big data language we looked at was Scala.
  • 31:49 - 31:53
    It uses the JVM back end but a completely
    different language on the front.
  • 32:04 - 32:08
    Cheers guys.
Title:
Java_the_good_bits.webm
Video Language:
English
Team:
Debconf
Project:
2016_miniconf-cambridge16
Duration:
32:13
