Why Use CUB - Intro to Parallel Programming


Showing Revision 4 created 05/25/2016 by Udacity Robot.

  1. Now let me focus on a specific aspect of writing high performance kernel code.
  2. At a high level, GPU programming looks like this.
  3. There's a bunch of data sitting in global memory,
  4. and you have an algorithm that you want to run on that data.
  5. That algorithm will get executed by threads running on SMs,
  6. and for various reasons, you might want to stage the data through shared memory.
  7. And you've seen for yourself that moving data from global memory into shared memory
  8. or into local variables of the threads
  9. can be complicated if you are striving for really high performance.
  10. Think about problem set 2
  11. where you loaded an image tile into shared memory
  12. and also had to load a halo of extra cells around the image tile
  13. in order to account for the width of the blur,
  14. and this can be tricky because the number of threads that want to perform a computation
  15. on the pixels in that tile
  16. is naturally a different number than the number of threads
  17. you would launch if your goal was simply to load the pixels in the tile
  18. as well as the pixels outside of the tile.
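The tile-plus-halo pattern described above can be sketched as follows. This is an illustrative reconstruction, not the actual problem set 2 code; the names (`TILE`, `HALO`, `in`, `out`) and the blur radius are assumptions. The key point is that the block computes `TILE x TILE` pixels but must load `(TILE + 2*HALO)^2` pixels, so the loading loop iterates more times than there are threads:

```cuda
#define TILE 16
#define HALO 2                       // assumed blur radius
#define SMEM (TILE + 2 * HALO)

__global__ void blurTile(const float *in, float *out, int w, int h)
{
    __shared__ float tile[SMEM][SMEM];

    // Each block computes TILE x TILE output pixels, but must load
    // SMEM x SMEM input pixels -- so each thread loops to cover the
    // extra halo cells that don't map one-to-one onto threads.
    int x0 = blockIdx.x * TILE - HALO;
    int y0 = blockIdx.y * TILE - HALO;
    for (int i = threadIdx.y; i < SMEM; i += blockDim.y)
        for (int j = threadIdx.x; j < SMEM; j += blockDim.x) {
            int x = min(max(x0 + j, 0), w - 1);  // clamp at image edges
            int y = min(max(y0 + i, 0), h - 1);
            tile[i][j] = in[y * w + x];
        }
    __syncthreads();
    // ... blur computation reads tile[threadIdx.y + HALO][threadIdx.x + HALO] ...
}
```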
  19. Or think about the tile transpose example in Unit 5
  20. where we staged through shared memory in order to ensure coalesced global memory accesses.
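A minimal sketch of that transpose idea, assuming a square `n x n` matrix whose dimensions are a multiple of the tile size: the tile is read row-wise (coalesced), held in shared memory, then written with the block indices swapped so the store is coalesced too.

```cuda
#define TILE 32

__global__ void transposeTiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 column pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
    __syncthreads();

    // Swap the block indices so consecutive threads write
    // consecutive addresses in the transposed output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```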
  21. Think about our discussion of Little's Law
  22. and the trade-offs that we went over between latency,
  23. bandwidth, occupancy, the number of threads,
  24. and the transaction size per thread.
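To recap that trade-off numerically: Little's Law says the bytes you must keep in flight equal bandwidth times latency. The figures below are purely illustrative, not measurements from any particular GPU:

```
bytes in flight = bandwidth x latency
                ~ 200 GB/s x 400 ns = 80 KB

80 KB / 4 B per load   = 20,000 loads in flight  (one float per load)
80 KB / 16 B per load  =  5,000 loads in flight  (one float4 per load)
```

Wider transactions per thread mean far fewer outstanding loads (and so fewer threads) are needed to saturate memory bandwidth.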
  25. Remember that we saw a graph that looked sort of like this
  26. where the bandwidth that we achieved as we increased the number of threads
  27. was higher if we were able to access 4 floating point values in a single load
  28. versus 2 floating point values in a single load
  29. versus a single floating point value in a load.
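The difference between those curves comes down to how many bytes each load instruction moves. A sketch of the same copy written both ways (kernel names are illustrative; the float4 version assumes the data is 16-byte aligned and `n` is a multiple of 4):

```cuda
__global__ void copyScalar(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];        // 4 bytes per load instruction
}

__global__ void copyVec4(const float4 *in, float4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];       // 16 bytes per load instruction
}
```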
  30. Finally, there are ninja level optimizations we haven't even talked about in this class,
  31. like using the Kepler LDG intrinsic.
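For reference, that intrinsic is `__ldg()`, which on Kepler-class GPUs routes a load through the read-only data cache. A minimal sketch (the kernel name is illustrative):

```cuda
__global__ void scaleReadOnly(const float * __restrict__ in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * __ldg(&in[i]);  // load via the read-only cache
}
```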
  32. In short, CUDA's explicit use of user-managed shared memory
  33. enables predictable high performance.
  34. In contrast, the hardware-managed caches that you find on CPUs
  35. can result in unpredictable performance,
  36. especially when you have many threads sharing the same cache and interfering with each other.
  37. So that's an advantage of explicit shared memory,
  38. but that advantage comes at the cost of an additional burden on the programmer
  39. who has to explicitly manage the movement of data in and out of global memory.
  40. And this is a big part of where CUB comes in.
  41. CUB puts an abstraction around the algorithm and its memory access pattern
  42. and deals opaquely with the movement of data from global memory,
  43. possibly through shared memory, into the actual local variables of the threads.
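As a sketch of what that abstraction looks like in practice, here is CUB's `cub::BlockLoad` primitive, which moves a block's worth of data from global memory into per-thread registers, internally choosing whether and how to stage through shared memory. The block size, items per thread, and the `BLOCK_LOAD_TRANSPOSE` algorithm choice are illustrative:

```cuda
#include <cub/cub.cuh>

__global__ void consumeBlock(const float *in)
{
    // 128 threads, 4 items per thread; TRANSPOSE stages through shared
    // memory so the global reads are coalesced.
    typedef cub::BlockLoad<float, 128, 4,
                           cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    __shared__ typename BlockLoad::TempStorage temp;  // CUB sizes this for us

    float items[4];                                   // lands in registers
    BlockLoad(temp).Load(in + blockIdx.x * 128 * 4, items);
    // ... compute on items[0..3] without writing any staging code ...
}
```

The programmer states *what* to load; CUB owns the shared-memory staging and access pattern.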
  44. I want to mention another programming power tool called CudaDMA
  45. that focuses specifically on the movement of data from global into shared memory.