English subtitles

← Java_the_good_bits.webm

Showing Revision 10 created 11/28/2016 by Jeffity.

  1. Onto the second talk of the day.
  2. Steve Capper is going to tell us
    about the good bits of Java
  3. They do exist
  4. [Audience] Could this have been a
    lightning talk? [Audience laughter]
  5. Believe it or not we've got some
    good stuff here.
  6. I was as skeptical as you guys
    when I first looked.
  7. First many apologies for not attending
    the mini-conf last year
  8. I was unfortunately ill on the day
    I was due to give this talk.
  9. Let me figure out how to use a computer.
  10. Sorry about this.
  11. There we go; it's because
    I've not woken up.
  12. Last year I worked at Linaro in the
    Enterprise group and we performed analysis
    on so-called 'Big Data' application sets.
  14. As many of you know quite a lot of these
    big data applications are written in Java.
  15. I'm from ARM and we were very interested
    in 64bit ARM support.
  16. So this is mainly AArch64 examples
    for things like assembler
  17. but most of the messages are
    pertinent for any architecture.
  18. These good bits are shared between
    most if not all the architectures.
  19. Whilst trying to optimise a lot of
    these big data applications
  20. I stumbled across quite a few things in
    the JVM and I thought
  21. 'actually that's really clever;
    that's really cool'
  22. So I thought that would make a good
    basis for an interesting talk.
  23. This talk is essentially some of the
    clever things I found in the
  24. Java Virtual Machine; these
    optimisations are in OpenJDK.
  25. The source is all there, readily
    available and in play now.
  26. I'm going to finish with some of the
    optimisation work we did with Java.
  27. People who know me will know
    I'm not a Java zealot.
  28. I don't particularly believe in
    programming in one language over another
  29. So to make it clear from the outset
    I'm not attempting to convert
  30. anyone into a Java programmer.
  31. I'm just going to highlight a few salient
    things in the Java Virtual Machine
  32. which I found to be quite clever and
    interesting
  33. and I'll try and talk through them
    with my understanding of them.
  34. Let's jump straight in and let's
    start with an example.
  35. This is a minimal example for
    computing a SHA1 sum of a file.
  36. I've omitted some of the checking at the
    beginning of the function,
  37. command line parsing and that sort of
    thing.
  38. I've highlighted the salient
    points in red.
  39. Essentially we instantiate a SHA1
    crypto message service digest.
  40. And we do the equivalent in
    Java of an mmap.
  41. Get it all in memory.
  42. And then we just put this data straight
    into the crypto engine.
  43. And eventually at the end of the
    program we'll spit out the SHA1 hash.
  44. It's a very simple program.
  45. It's basically mmap, SHA1, output
    the hash afterwards.
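A minimal sketch of such a program, with names of my own choosing rather than the talk's actual source, might look like this:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;

public class Sha1Sum {
    public static String sha1(String path) {
        try {
            // Instantiate a SHA-1 crypto message service digest
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            try (FileChannel ch = FileChannel.open(Paths.get(path),
                                                   StandardOpenOption.READ)) {
                // The Java equivalent of an mmap: map the file read-only
                ByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY,
                                        0, ch.size());
                // Put the mapped data straight into the crypto engine
                md.update(buf);
            }
            // Spit out the SHA1 hash as hex
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1(args[0]));
    }
}
```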
  46. In order to concentrate on the CPU
    aspect rather than worry about IO
  47. I decided to cheat a little bit by
    setting this up.
  48. I decided to use a sparse file. As many of
    you know a sparse file is a file whose
  49. contents are not necessarily all stored
    on disc. The assumption is that the bits
  50. that aren't stored are zero. For instance
    on Linux you can create a 20TB sparse file
  51. on a 10MB file system and use it as
    normal.
  52. Just don't write too much to it otherwise
    you're going to run out of space.
  53. The idea behind using a sparse file is I'm
    just focusing on the computational aspects
  54. of the SHA1 sum. I'm not worried about
    the file system or anything like that.
  55. I don't want to worry about the IO. I
    just want to focus on the actual compute.
  56. In order to set up a sparse file I used
    the following runes.
  57. The important point is that you seek,
    and the other important point
  58. is you set a count, otherwise you'll
    fill your disc up.
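The runes from the slides aren't reproduced here; as a sketch, the same effect can be had from Java itself by setting a file's length without writing any data, so that on file systems with sparse-file support no blocks are allocated:

```java
import java.io.RandomAccessFile;

public class SparseFile {
    // Create a file of the given logical size without writing data;
    // the unwritten contents read back as zeros and occupy no blocks
    // on file systems that support sparse files.
    public static void create(String path, long sizeBytes) {
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            f.setLength(sizeBytes); // like seeking past the end with dd
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```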
  59. I decided to run this against, firstly,
    the native SHA1 sum command
  60. that's built into Linux, and let's
    normalise these results and say that's 1.0
  61. I used an older version of the OpenJDK
    and ran the Java program
  62. and that's 1.09 times slower than the
    reference command. That's quite good.
  63. Then I used the new OpenJDK, which is now
    the current JDK as this is a year on.
  64. And it took 0.21. It's significantly faster.
  65. I've stressed that I've done nothing
    surreptitious in the Java program.
  66. It is mmap, compute, spit result out.
  67. But the OpenJDK has essentially got
    some more context information.
  68. I'll talk about that as we go through.
  69. When I started with Java I had a very
    simplistic view of it.
  70. Traditionally Java is taught as a virtual
    machine that runs bytecode.
  71. Now when you compile a Java program it
    compiles into bytecode.
  72. The older versions of the Java Virtual
    Machine would interpret this bytecode
  73. and then run through. Newer versions
    would employ a just-in-time engine
  74. and try and compile this bytecode
    into native machine code.
  75. That is not the only thing that goes on
    when you run a Java program.
  76. There are some extra optimisations as well.
    So this alone would not account for
  77. the newer version of the SHA1
    sum being significantly faster
  78. than the distro supplied one.
  79. Java knows about context. It has a class
    library and these class libraries
  80. have reasonably well defined purposes.
  81. We have classes that provide
    crypto services.
  82. We have sun.misc.Unsafe, which every
    single project seems to pull into their
  83. project when they're not supposed to.
  84. These have well defined meanings.
  85. These do not necessarily have to be
    written in Java.
  86. They come as Java classes,
    they come supplied.
  87. But most JVMs now have a notion
    of a virtual machine intrinsic.
  88. And the virtual machine intrinsic says ok
    please do a SHA1 in the best possible way
  89. that your implementation allows. This is
    something done automatically by the JVM.
  90. You don't ask for it. If the JVM knows
    what it's running on and it's reasonably
  91. recent this will just happen
    for you for free.
  92. And there's quite a few classes
    that do this.
  93. There's quite a few clever things with
    atomics, there's crypto,
  94. there's mathematical routines as well.
    Most of these routines in the
  95. class library have a well defined notion
    of a virtual machine intrinsic
  96. and they do run reasonably optimally.
  97. They are a subject of continuous
    optimisation as well.
  98. We've got some runes that are
    presented on the slides here.
  99. These are quite useful if you
    are interested in
  100. how these intrinsics are made.
  101. You can ask the JVM to print out a lot of
    the just-in-time compiled code.
  102. You can ask the JVM to print out the
    native methods as well as these intrinsics
  103. and in this particular case after sifting
    through about 5MB of text
  104. I've come across this particular SHA1 sum
    implementation.
  105. This is AArch64. This is employing the
    cryptographic extensions
  106. in the architecture.
  107. So it's essentially using the CPU
    instructions which would explain why
  108. it's faster. But again it's done
    all this automatically.
  109. This did not require any specific runes
    or anything to activate.
  110. We'll see a bit later on how you can
    more easily find the hot spots
  111. rather than sifting through a lot
    of assembler.
  112. I've mentioned that the cryptographic
    engine is employed and again
  113. this routine was generated at run
    time as well.
  114. This is one of the important things about
    execution environments like Java.
  115. You don't have to know everything at
    compile time.
  116. You know a lot more information at
    run time and you can use that
  117. in theory to optimise.
  118. You can switch off these clever routines.
  119. For instance I've deactivated them
    here and we get back to the
  120. slower performance we expected.
  121. Again, this particular set of routines is
    present in OpenJDK,
  122. I think for all the architectures that
    support it.
  123. We get this optimisation for free on X86
    and others as well.
  124. It works quite well.
  125. That was one surprise I came across:
    the intrinsics.
  126. One thing I thought it would be quite
    good to do would be to go through
  127. a slightly more complicated example.
    And use this example to explain
  128. a lot of other things that happen
    in the JVM as well.
  129. I will spend a bit of time going through
    this example
  130. and explain roughly the notion of what
    it's supposed to be doing.
  131. This is an imaginary method that I've
    contrived to demonstrate a lot of points
  132. in the fewest possible lines of code.
  133. I'll start with what it's meant to do.
  134. This is meant to be a routine that gets a
    reference to something and lets you know
  135. whether or not it's an image in a
    hypothetical cache.
  136. I'll start with the important thing
    here the weak reference.
  137. In Java and other garbage collected
    languages we have the notion of references
  138. Most of the time when you are running a
    Java program you have something like a
  139. variable name in the current
    execution context; that is referred to as a
  140. strong reference to the object. In other
    words I can see it. I am using it.
  141. Please don't get rid of it.
    Bad things will happen if you do.
  142. So the garbage collector knows
    not to get rid of it.
  143. In Java and other languages you also
    have the notion of a weak reference.
  144. This is essentially the programmer saying
    to the virtual machine
  145. "Look I kinda care about this but
    just a little bit."
  146. "If you want to get rid of it feel free
    to but please let me know."
  147. This is why this is for a CacheClass.
    For instance the JVM in this particular
  148. case could decide that it's running quite
    low on memory, this particular xMB image
  149. has not been used for a while, so it can
    garbage collect it.
  150. The important thing is how we go about
    expressing this in the language.
  151. We can't just have a reference to the
    object because that's a strong reference
  152. and the JVM will know it can't get
    rid of this because the program
  153. can see it actively.
  154. So we have a level of indirection which
    is known as a weak reference.
  155. We have this hypothetical CacheClass
    that I've devised.
  156. At this point it is a weak reference.
  157. Then we get it. This is calling the weak
    reference routine.
  158. Now it becomes a strong reference so
    it's not going to be garbage collected.
  159. When we get to the return path it becomes
    a weak reference again
  160. because our strong reference
    has disappeared.
  161. The salient points in this example are:
  162. We're employing a method to get
    a reference.
  163. We're checking an item to see if
    it's null.
  164. So let's say that the JVM decided to
    garbage collect this
  165. before we executed the method.
  166. The weak reference class is still valid
    because we've got a strong reference to it
  167. but the actual object behind this is gone.
  168. If we're too late and the garbage
    collector has killed it
  169. it will be null and we return.
  170. So it's a level of indirection to see
    does this still exist
  171. if so can I please have it and then
    operate on it as normal
  172. and then return becomes weak
    reference again.
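A sketch of that contrived routine, assuming a hypothetical CacheClass (the names here are mine, not the talk's actual slide code):

```java
import java.lang.ref.WeakReference;

// Hypothetical cached object; stands in for whatever lives in the cache.
class CacheClass {
    boolean isImage() { return true; }
}

public class Cache {
    // Level of indirection: the cache holds only a weak reference,
    // so the garbage collector is free to reclaim the object behind it.
    private final WeakReference<CacheClass> ref;

    public Cache(CacheClass c) {
        ref = new WeakReference<>(c);
    }

    public boolean hasImage() {
        CacheClass item = ref.get(); // promote to a strong reference
        if (item == null)            // too late: the GC already killed it
            return false;
        return item.isImage();       // strong reference dies on return
    }
}
```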
  173. This example program is quite useful when
    we look at how it's implemented in the JVM
  174. and we'll go through a few things now.
  175. First off we'll go through the bytecode.
  176. The only point of this slide is to
    show it's roughly
  177. the same as this.
  178. We get our variable.
  179. We use our getter.
  180. This bit is extra: this checkcast.
    The reason that bit is extra is
  181. because we're using the equivalent of
    a template in Java.
  182. And the way that's implemented in Java is
    it just basically casts everything to an
  183. Object, so that requires extra
    compiler information.
  184. And this is the extra check.
  185. The rest of this we load the reference,
    we check to see if it is null,
  186. If it's not null we invoke a virtual
    function - is it the image?
  187. and we return as normal.
  188. Essentially the point I'm trying to make
    is when we compile this to bytecode
  189. this execution happens.
  190. This null check happens.
  191. This execution happens.
  192. And we return.
  193. In the actual Java class files we've not
    lost anything.
  194. This is what it looks like when it's
    been JIT'd.
  195. Now we've lost lots of things.
  196. The JIT has done quite a few clever things
    which I'll talk about.
  197. First off if we look down here there's
    a single branch here.
  198. And this is only if our checkcast failed.
  199. We've got comments on the
    right hand side.
  200. Our get method has been inlined so
    we're no longer calling.
  201. We seem to have lost our null check,
    that's just gone.
  202. And again we've got a get field as well.
  203. That's no longer a method,
    that's been inlined as well.
  204. We've also got some other cute things.
  205. Those more familiar with AArch64
    will understand
  206. that the pointers we're using
    are 32bit not 64bit.
  207. What we're doing is getting a pointer
    and shifting it left 3
  208. and widening it to a 64bit pointer.
  209. We've also got 32bit pointers on a
    64bit system as well.
  210. So that's saving a reasonable amount
    of memory and cache.
  211. To summarise. We don't have any
    branches or function calls
  212. and we've got a lot of inlining.
  213. We did have function calls in the
    class file so it's the JVM;
  214. it's the JIT that has done this.
  215. We've got no null checks either and I'm
    going to talk through this now.
  216. The null check elimination is quite a
    clever feature in Java and other programs.
  217. The idea behind null check elimination is
  218. most of the time this object is not
    going to be null.
  219. If this object is null the operating
    system knows this quite quickly.
  220. So if you try to dereference a null
    pointer you'll get either a SIGSEGV or
  221. a SIGBUS depending on a
    few circumstances.
  222. That goes straight back to the JVM
  223. and the JVM knows where the null
    exception took place.
  224. Because it knows where it took
    place it can look this up
  225. and unwind it as part of an exception.
  226. Those null checks just go.
    Completely gone.
  227. Most of the time this works and you are
    saving a reasonable amount of execution.
  228. I'll talk about when it doesn't work
    in a second.
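From the Java side this machinery is invisible; conceptually, the JIT can compile an explicit test like the one below into a bare dereference and recover the rare null case through the fault-and-unwind path (this is a sketch of the concept, not the JVM's actual code):

```java
public class NullCheckDemo {
    // The written null check is usually free: the JIT emits the
    // dereference directly, and if the pointer really is null the
    // resulting fault is unwound into the exception path instead
    // of an explicit compare-and-branch on every call.
    static int lengthOrZero(String s) {
        if (s == null) return 0; // rare path, recovered via unwind
        return s.length();       // common path, no explicit test needed
    }
}
```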
  229. That's reasonably clever. We have similar
    programming techniques in other places
  230. even the Linux kernel for instance when
    you copy data to and from user space
  231. it does pretty much
    the same thing.
  232. It has an exception unwind table and it
    knows if it catches a page fault on
  233. this particular program counter
    it can deal with it because it knows
  234. the program counter and it knows
    conceptually what it was doing.
  235. In a similar way the JIT knows what it's
    doing to a reasonable degree.
  236. It can handle the null check elimination.
  237. I mentioned the sneaky one. We've got
    essentially 32bit pointers
  238. on a 64bit system.
  239. Most of the time in Java people typically
    specify heap size smaller than 32GB.
  240. Which is perfect if you want to use 32bit
    pointers and left shift 3.
  241. Because that gives you 32GB of
    addressable memory.
  242. That's a significant memory saving because
    otherwise a lot of things would double up.
  243. There's a significant number of pointers
    in Java.
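The arithmetic behind that figure is worth spelling out as a sketch: 2^32 distinct 32-bit pointer values, each scaled by 8 (the left shift of 3), address 32GB:

```java
public class CompressedPointers {
    // 32-bit pointer values, each shifted left by 3 (multiplied by 8,
    // the object alignment), cover a 32GB heap.
    static long addressableBytes() {
        long distinctValues = 1L << 32; // every 32-bit pointer value
        long scale = 1L << 3;           // left shift 3 = times 8
        return distinctValues * scale;  // 32GB
    }
}
```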
  244. The one that should make people
    jump out of their seat is
  245. the fact that most methods in Java are
    actually virtual.
  246. So what the JVM has actually done is
    inlined a virtual function.
  247. A virtual function is essentially a
    function where you don't know where
  248. you're going until run time.
  249. You can have several different classes
    and they share the same virtual function
  250. in the base class and dependent upon
    which specific class you're running
  251. different virtual functions will
    get executed.
  252. In C++ that will be a read from a vtable
    and then you know where to go.
  253. The JVM's inlined it.
  254. We've saved a memory load.
  255. We've saved a branch as well
  256. The reason the JVM can inline it is
    because the JVM knows
  257. every single class that has been loaded.
  258. So it knows that although this looks
    polymorphic to the casual programmer,
  259. it actually is monomorphic.
    The JVM knows this.
  260. Because it knows this it can be clever.
    And this is really clever.
  261. That's a significant cost saving.
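As a sketch (my own contrived classes, not from the talk): the call below is written as a virtual call, but if Square is the only subclass the JVM has loaded, the call site is monomorphic and the JIT can inline area() directly:

```java
abstract class Shape {
    abstract long area(); // virtual: target unknown until run time
}

class Square extends Shape {
    final long side;
    Square(long side) { this.side = side; }
    @Override long area() { return side * side; }
}

public class Devirt {
    // Looks polymorphic, but with only Square loaded it is monomorphic,
    // so the JIT can skip the vtable-style lookup and inline the body.
    static long totalArea(Shape[] shapes) {
        long sum = 0;
        for (Shape s : shapes) sum += s.area();
        return sum;
    }
}
```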
  262. This is all great. I've already mentioned
    the null check elimination.
  263. We're taking a signal as most of you know
    if we do that a lot it's going to be slow.
  264. Jumping into kernel, into user,
    bouncing around.
  265. The JVM also has a notion of
    'OK I've been a bit too clever now;
  266. I need to back off a bit'
  267. Also there's nothing stopping the user
    loading more classes
  268. and rendering the monomorphic
    assumption invalid.
  269. So the JVM needs to have a notion of
    backpedalling and going
  270. 'OK I've gone too far and need to
    deoptimise'
  271. The JVM has the ability to deoptimise.
  272. In other words it essentially knows that
    for certain code paths everything's OK.
  273. But for certain new objects it can't get
    away with these tricks.
  274. By the time the new objects are executed
    they are going to be safe.
  275. There are ramifications for this.
    This is the important thing to consider
  276. with something like Java and other
    languages and other virtual machines.
  277. If you're trying to profile this it means
    there is a very significant ramification.
  278. You can have the same class and
    method JIT'd multiple ways
  279. and executed at the same time.
  280. So if you're trying to find a hot spot
    the program counter's nodding off.
  281. Because you can refer to the same thing
    in several different ways.
  282. This is quite common as well as
    deoptimisation does take place.
  283. That's something to bear in mind with JVM
    and similar runtime environments.
  284. You can get a notion of what the JVM's
    trying to do.
  285. You can ask it nicely and add a print
    compilation option
  286. and it will tell you what it's doing.
  287. This is reasonably verbose.
  288. Typically what happens is the JVM gets
    excited JIT'ing everything
  289. and optimising everything then
    it settles down.
  290. Until you load something new
    and it gets excited again.
  291. There's a lot of logs. This is mainly
    useful for debugging but
  292. it gives you an appreciation that it's
    doing a lot of work.
  293. You can go even further with a log
    compilation option.
  294. That produces a lot of XML and that is
    useful for people debugging the JVM as well.
  295. It's quite handy to get an idea of
    what's going on.
  296. If that is not enough information you
    also have the ability to go even further.
  297. This is beyond the limit of my
    understanding.
  298. I've gone into this little bit just to
    show you what can be done.
  299. There are release builds of OpenJDK
    and there are debug builds of OpenJDK.
  300. The release builds will by default turn
    off a lot of the diagnostic options.
  301. You can switch them back on again.
  302. When you do you can also gain insight
    into the compiler there,
  303. colloquially referred to
    as the C2 JIT.
  304. You can see, for instance, objects in
    timelines and visualize them
  305. as they're being optimised at various
    stages and various things.
  306. So this is based on a masters thesis
    by Thomas Würthinger.
  307. This is something you can play with as
    well and see how far the optimiser goes.
  308. And it's also good for people hacking
    with the JVM.
  309. I'll move onto some stuff we did.
  310. Last year we were working on the
    big data. Relatively new architecture
  311. ARM64, it's called AArch64 in OpenJDK
    land but ARM64 in Debian land.
  312. We were a bit concerned because
    everything's all shiny and new.
  313. Has it been optimised correctly?
  314. Are there any obvious things
    we need to optimise?
  315. And we're also interested because
    everything was so shiny and new
  316. in the whole system.
  317. Not just the JVM but the glibc and
    the kernel as well.
  318. So how do we get a view of all of this?
  319. I gave a quick talk before at the Debian
    mini-conf before last [2014] about perf
  320. so decided we could try and do some
    clever things with Linux perf
  321. and see if we could get some actual useful
    debugging information out.
  322. We have the flame graphs that are quite
    well known.
  323. We also have some previous work, Johannes
    had a special perf map agent that
  324. could basically hook into perf and it
    would give you a nice way of running
  325. perf-top for want of a better expression
    and viewing the top Java function names.
  326. This is really good work and it's really
    good for a particular use case
  327. if you just want to do a quick snapshot
    once and see in that snapshot
  328. where the hotspots were.
  329. For a prolonged work load with all
    the functions being JIT'd multiple ways
  330. with the optimisation going on and
    everything moving around
  331. it requires a little bit more information
    to be captured.
  332. I decided to do a little bit of work on a
    very similar thing to perf-map-agent
  333. but an agent that would capture it over
    a prolonged period of time.
  334. Here's an example Flame graph, these are
    all over the internet.
  335. This is the SHA1 computation example that
    I gave at the beginning.
  336. As expected the VM intrinsic SHA1 is the
    top one.
  337. Not expected by me was this quite
    significant chunk of CPU execution time.
  338. And there was a significant amount of
    time being spent copying memory
  339. from the mmapped memory
    region into a heap
  340. and then that was passed to
    the crypto engine.
  341. So we're doing a ton of memory copies for
    no good reason.
  342. That essentially highlighted an example.
  343. That was an assumption I made about Java
    to begin with which was if you do
  344. the equivalent of mmap it should just
    work like mmap right?
  345. You should just be able to address the
    memory. That is not the case.
  346. If you've got a file mapping object and
    you try to address it it has to be copied
  347. into safe heap memory first. And that is
    what was slowing down the programs.
  348. If that was omitted you could make
    the SHA1 computation even quicker.
  349. So that would be the logical target you
    would want to optimise.
  350. I wanted to extend Johannes' work
    with something called a
  351. Java Virtual Machine Tools Interface
    profiling agent.
  352. This is part of the Java Virtual Machine
    standard as you can make a special library
  353. and then hook this into the JVM.
  354. And the JVM can expose quite a few
    things to the library.
  355. It exposes a reasonable amount of
    information as well.
  356. Perf as well has the ability to look
    at map files natively.
  357. If you are profiling JavaScript, or
    something similar, I think the
  358. Google V8 JavaScript engine will write
    out a special map file that says
  359. these program counter addresses correspond
    to these function names.
  360. I decided to use that in a similar way to
    what Johannes did for the extended
  361. profiling agent but I also decided to
    capture some more information as well.
  362. I decided to capture the disassembly
    so when we run perf annotate
  363. we can see the actual JVM bytecode
    in our annotation.
  364. We can see how it was JIT'd at the
    time when it was JIT'd.
  365. We can see where the hotspots were.
  366. And that's good. But we can go
    even better.
  367. We can run an annotated trace that
    contains the Java class,
  368. the Java method and the bytecode all in
    one place at the same time.
  369. You can see everything from the JVM
    at the same place.
  370. This works reasonably well because the
    perf interface is extremely extensible.
  371. And again we can do entire
    system optimisation.
  372. The bits in red here are the Linux kernel.
  373. Then we got into libraries.
  374. And then we got into Java and more
    libraries as well.
  375. So we can see everything from top to
    bottom in one fell swoop.
  376. This is just a quick slide showing the
    mechanisms employed.
  377. Essentially we have this agent which is
    a shared object file.
  378. And this will spit out useful files here
    in a standard way.
  379. And the Linux perf basically just records
    the perf data dump file as normal.
  380. We have 2 sets of recording going on.
  381. To report it it's very easy to do
    normal reporting with the PID map.
  382. This is just out of the box, works with
    the Google V8 engine as well.
  383. If you want to do very clever annotations
    perf has the ability to have
  384. Python scripts passed to it.
  385. So you can craft quite a dodgy Python
    script and that can interface
  386. with the perf annotation output.
  387. That's how I was able to get the extra
    Java information in the same annotation.
  388. And this is really easy to do; it's quite
    easy to knock the script up.
  389. And again the only thing we do for this
    profiling is we hook in the profiling
  390. agent which dumps out various things.
  391. We preserve the frame pointer because
    that makes things considerably easier
  392. when unwinding. This will affect
    performance a little bit.
  393. And again when we're reporting we just
    hook in a Python script.
  394. It's really easy to hook everything in
    and get it working.
  395. At the moment we have a JVMTI agent. It's
    actually on http://git.linaro.org now.
  396. Since I gave this talk Google have
    extended perf anyway so it will do
  397. quite a lot of similar things out of the
    box anyway.
  398. It's worth having a look at the
    latest perf.
  399. These techniques in this slide deck can be
    used obviously in other JITs quite easily.
  400. The fact that perf is so easy to extend
    with scripts can be useful
  401. for other things.
  402. And OpenJDK has a significant amount of
    cleverness associated with it that
  403. I thought was very surprising and good.
    So that's what I covered in the talk.
  404. These are basically references to things
    like command line arguments
  405. and the Flame graphs and stuff like that.
  406. If anyone is interested in playing with
    OpenJDK on ARM64 I'd suggest going here:
  407. http://openjdk.linaro.org
    Where the most recent builds are.
  408. Obviously fixes are going in upstream and
    they're going into distributions as well.
  409. They're included in OpenJDK so it should
    be good as well.
  410. I've run through quite a few fundamental
    things reasonably quickly.
  411. I'd be happy to accept any questions
    or comments
  412. And if you want to talk to me privately
    about Java afterwards feel free to
  413. when no-one's looking.
  414. [Audience] Applause
  415. [Audience] It's not really a question so
    much as a comment.
  416. Last mini-Deb conf we had a talk about
  417. using the JVM with other languages.
  418. And it seems to me that all this would
    apply even if you hate Java programming
  419. language and want to write in, I don't
    know, lisp or something instead
  420. if you've got a lisp system that can
    generate JVM bytecode.
  421. [Presenter] Yeah, totally. And the other
    big data language we looked at was Scala.
  422. It uses the JVM back end but a completely
    different language on the front.
  423. Cheers guys.