Reduction Using Global and Shared Memory - Intro to Parallel Programming

  1. Now that we've done all the preliminaries, let's turn to some actual code.
  2. We're going to implement this twice with a similar strategy each time.
  3. In both, we're going to sum a million--actually 2^20--elements,
  4. and we're going to do this in 2 stages.
  5. In the 1st stage we're going to launch 1024 blocks,
  6. each one of which will use 1024 threads to reduce 1024 elements.
  7. Each of those will produce 1 single item.
  8. So we're going to have 1024 items left when we're done.
  9. And in the 2nd stage we're going to launch 1 block to reduce those final 1024 items into a single element.
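For orientation, here is a minimal sketch of that two-stage launch from the host side--the kernel and variable names here are placeholders, not necessarily what the posted code uses:

    const int N = 1 << 20;           // 2^20 elements to sum
    const int threads = 1024;
    const int blocks = N / threads;  // 1024 blocks in stage 1

    // Stage 1: each of 1024 blocks reduces 1024 elements to 1 partial sum.
    reduce_kernel<<<blocks, threads>>>(d_intermediate, d_in);

    // Stage 2: 1 block reduces the 1024 partial sums to the final answer.
    reduce_kernel<<<1, threads>>>(d_out, d_intermediate);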
  10. So I'll post all the code, of course, but the CPU-side code is straightforward.
  11. Instead, we're just going to take a look at the kernel. Let's see how that works.
  12. So each block is going to be responsible for a 1024 element chunk of floats,
  13. and we're going to run this loop within the kernel.
  14. On each iteration of this loop we're going to divide the active region in half.
  15. So on the 1st iteration, where we start with 1024 elements,
  16. we're going to have two 512-element regions.
  17. Then each of 512 threads will add its element in the 2nd half to its element in the 1st half,
  18. writing back to the 1st half.
  19. Now we're going to synchronize all the threads--this __syncthreads() call right here--
  20. to make sure every one of them is done.
  21. We've got 512 elements remaining,
  22. and so we're going to loop again on this resulting region of 512 elements.
  23. Now we'll divide it into two 256-element chunks, using 256 threads
  24. to add the 2nd half's 256 items to the 1st half's 256 items.
  25. And we're going to continue this loop, cutting it in half every time,
  26. until we have 1 element remaining at the very end of 10 iterations.
  27. And then we'll write that back out to global memory. So this works.
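A minimal version of the kernel just described might look like this--reduce_kernel and its parameter names are my own, so the posted code may differ:

    __global__ void reduce_kernel(float *d_out, float *d_in)
    {
        int myId = threadIdx.x + blockDim.x * blockIdx.x;
        int tid  = threadIdx.x;

        // halve the active region each iteration: 512, 256, ..., 1
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (tid < s)
            {
                d_in[myId] += d_in[myId + s];  // add 2nd half into 1st half
            }
            __syncthreads();  // make sure every thread is done this iteration
        }

        // thread 0 writes this block's result back out to global memory
        if (tid == 0)
        {
            d_out[blockIdx.x] = d_in[myId];
        }
    }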
  28. We can run it on a computer in our lab.
  29. So we're doing that now, and we notice that it finishes in 0.65 milliseconds.
  30. Less than a millisecond. That's pretty great, but it's not as efficient as we might like.
  31. Specifically, if we take a look at the code again,
  32. we're going to global memory more often than we'd like.
  33. On each iteration of the loop, we read n items from global memory and we write back n/2 items.
  34. Then we read those n/2 items back from global memory and so on.
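To make that concrete: per 1024-element block, with n = 1024, the global memory version's total traffic over the 10 iterations is roughly

    reads:  n + n/2 + n/4 + ... + 2    (the active region, halving each iteration)
    writes: n/2 + n/4 + ... + 1

while the shared memory version described below reads each of the n items once and writes 1 value out. Summing these geometric series gives the ratio the quiz at the end asks for.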
  35. In an ideal world, we'd do a single initial read, where we read all 1024 items into the thread block,
  36. do all the reduction internally, and then write back the final value.
  37. And this should be faster because we would incur less memory traffic overall.
  38. The CUDA feature we use to do this is called shared memory,
  39. and we'll store all the intermediate values in shared memory, where all the threads in the block can access them.
  40. Shared memory is considerably faster than global memory.
  41. So let's take a look at the kernel. It's going to look very similar.
  42. And in this kernel we're going to have the exact same loop structure.
  43. What's going to be different, though, is this little part right here.
  44. We 1st have to copy all the values from global memory into shared memory,
  45. and that's done with this little block.
  46. And then all the further accesses here are from shared memory--
  47. this sdata--as opposed to from global memory, which we did last time.
  48. And when we're done, we have to write this final value back to global memory again.
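Here is a minimal sketch of the shared memory variant, matching the description above--again, the kernel name and parameters are placeholders:

    __global__ void reduce_shared(float *d_out, float *d_in)
    {
        // sized by the 3rd launch argument inside the triple chevrons
        extern __shared__ float sdata[];

        int myId = threadIdx.x + blockDim.x * blockIdx.x;
        int tid  = threadIdx.x;

        // 1 read per element from global memory into shared memory
        sdata[tid] = d_in[myId];
        __syncthreads();  // make sure the whole block has loaded

        // same halving loop as before, but entirely in shared memory
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (tid < s)
            {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // thread 0 writes this block's final value back to global memory
        if (tid == 0)
        {
            d_out[blockIdx.x] = sdata[0];
        }
    }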
  49. The only other interesting part of this code is how we declare the amount of shared memory we need.
  50. And we do that here.
  51. We're declaring that we're going to have an externally defined amount of shared data.
  52. Now, we haven't actually said how much we need,
  53. so to do that, we're going to have to go down to where we actually call the kernel.
  54. So when we're calling the reduce kernel using the shared memory,
  55. we now call it with 3 arguments inside the triple chevrons--the normal blocks and the threads--
  56. but then we say how many bytes we need allocated in shared memory.
  57. In this case, every thread is going to ask for 1 float stored in shared memory.
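That launch line might look like this, assuming the sketch above--with 1024 threads per block, each contributing 1 float:

    // 3rd argument: bytes of shared memory to allocate per block
    reduce_shared<<<blocks, threads, threads * sizeof(float)>>>(d_out, d_in);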
  58. So the advantage of the shared memory version is that it saves global memory bandwidth.
  59. It's a good exercise to figure out how much memory bandwidth you'll save.
  60. So I'll ask that as a quiz.
  61. The global memory version uses how many times as much memory bandwidth
  62. as the shared memory version?
  63. Round to the nearest integer.