English 字幕

← Measuring Memory Usage of Our Transpose Code - Intro to Parallel Programming


Showing Revision 4 created 05/24/2016 by Udacity Robot.

  1. So, if we go through this analysis for all of these kernels, we'll see that our
  2. parallel per-element version of the code achieves 12.5 gigabytes per second,
  3. our parallel per-row version of the code gets about 1.8 gigabytes per second,
  4. and our serial version of the code gets an abysmal 0.018 gigabytes per second.
  5. This is roughly the speed of a carrier pigeon. And a better way to think about this, perhaps,
  6. is not in absolute numbers but as a percentage of what the particular GPU we're using can achieve.
  7. So, if we were to work out the percentages we're achieving,
  8. it's something like 31% of theoretical peak bandwidth with our highest performing kernel,
  9. 4.5% peak bandwidth with our per-row kernel, and less than a 10th of a percent with our serial kernel.
  10. So, back to the question. Why is this number so low?
  11. Well, we can take a pretty shrewd guess that whenever you see really low DRAM utilization,
  12. really low percentage bandwidth, your first guess is always coalescing.
  13. A way to think about coalescing is that the GPU is always accessing global memory,
  14. accessing the DRAM in pretty large chunks, 32 or 128 bytes at a time.
  15. And this means that we are going to need the fewest
  16. total memory transactions when the threads in a warp access contiguous adjacent memory locations.
  17. So, this is an example of good coalescing.
  18. Every thread is either reading or writing an adjacent memory location.
  19. And clearly, if the threads in a warp are reading and writing completely random locations and memory,
  20. then you're going to get poor coalescing, right?
  21. So, if these accesses are spread out all over the memory,
  22. then the total number of chunks of memory that we have to read could be as large as the number of threads in the warp.
  23. So, a random access pattern clearly leads to bad coalescing.
  24. So, a much more common access pattern is what's called strided, and this is where threads access
  25. a location memory that's a function of their thread ID times some stride.
  26. So, for example, thread 0 might access location 0, thread 1 location 2, thread 2 location 4, thread 3 location 6, and so on.
  27. In that case, that would be a stride of 2 because there's two elements between thread accesses,
  28. and strided accesses range from, okay, like in this case where with the stride of 2 elements,
  29. I'm really only doubling the number of memory transactions.
  30. So, I'm sort of halving the quality of my coalescing all the way to really, really bad, right?
  31. So, you can imagine that if the stride between elements is large enough,
  32. then every thread in the warp is accessing a completely different 32- or 128-byte chunk of memory,
  33. and then you're guaranteed to get bad behavior.
  34. Guaranteed to be maximizing the number of memory transactions that you have to do.
  35. So, let's look at the code for our kernels. Here's where we're reading from the input matrix,
  36. and this actually works out pretty well.
  37. Every thread is reading a value in memory which is equal to some large offset; J times N plus I.
  38. And if you look at I, I is really the thread index plus some offset.
  39. So, adjacent threads, you know, threads with adjacent thread into season X
  40. are reading adjacent values of the input matrix. That's exactly what we want,
  41. so this is good coalescing. On the other hand, when we write the output matrix,
  42. adjacent threads, threads with adjacent values of I, are riding to places separated in memory by N, right?
  43. And N was like 1024. So, adjacent threads are running memory locations that are 1024 elements away from each other.
  44. This is clearly bad, this is bad coalescing. Bad, bad, bad, bad.
  45. This is, in fact, the root of our problem.