English subtitles

Reduction Using Global and Shared Memory - Intro to Parallel Programming

Showing Revision 4 created 05/25/2016 by Udacity Robot.

  1. So the answer is 3.
  2. And so here I'm showing the memory traffic required for the global version.
  3. Here I'm showing the memory traffic required for the shared version.
  4. I'm showing all the reads you need to do in red and all the writes you need to do in blue.
  5. And if you sum this series, you'll find that for reducing n values in shared memory
  6. you'll do n reads and 1 write, but for global memory you'll do 2n reads and n writes.
  7. So you would expect the shared memory version to run faster, which it will, and let's show that now.
  8. So now I'm going to run the shared memory version.
  9. And we see that it does in fact run faster.
  10. Now we're down to 464 microseconds as opposed to 651.
  11. But it doesn't run 3 times faster. How come?
  12. Well, the detailed reason is that we're not saturating the memory system.
  13. And there are numerous advanced techniques you would need to use
  14. to totally max out the performance of reduce.
  15. We're not doing that today, but if you're really interested in micro-optimization of this kind,
  16. this application is a really great place to start.
  17. You'll want to look in particular at processing multiple items per thread instead of just 1.
  18. You'll also want to perform the first step of the reduction
  19. right when you read the items from global memory into shared memory.
  20. And you'll want to take advantage of the fact that the threads within a warp are synchronous
  21. when you're doing the last steps of the reduction.
  22. But these are all advanced techniques.
  23. We can talk about them on the forums if you all are interested.
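
The read and write counts quoted for the two versions can be checked with a small simulation. What follows is a minimal Python sketch, not the course's CUDA kernels. It assumes a single block, n a power of two, one element per thread, and a final copy of the result to an output location. The function names `reduce_global` and `reduce_shared` are made up for illustration, and only accesses to the simulated global array are counted.

```python
def _check_power_of_two(n):
    # Assumption of this sketch: the tree reduction halves cleanly each step.
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a positive power of two")


def reduce_global(values):
    """Tree reduction performed directly on a simulated global array.

    Returns (result, global_reads, global_writes).
    """
    data = list(values)              # stands in for the global memory array
    _check_power_of_two(len(data))
    reads = writes = 0
    s = len(data) // 2               # number of active threads in this step
    while s > 0:
        for tid in range(s):
            a = data[tid]            # global read
            b = data[tid + s]        # global read
            reads += 2
            data[tid] = a + b        # global write
            writes += 1
        s //= 2
    result = data[0]                 # thread 0 reads the final value...
    reads += 1
    writes += 1                      # ...and writes it to the output location
    return result, reads, writes


def reduce_shared(values):
    """Tree reduction performed in a simulated shared memory buffer.

    Returns (result, global_reads, global_writes). Shared memory accesses
    are deliberately not counted.
    """
    global_in = list(values)
    _check_power_of_two(len(global_in))
    reads = writes = 0
    shared = []
    for tid in range(len(global_in)):
        shared.append(global_in[tid])    # one global read per thread
        reads += 1
    s = len(shared) // 2
    while s > 0:
        for tid in range(s):
            shared[tid] += shared[tid + s]   # shared memory traffic only
        s //= 2
    writes += 1                      # thread 0 writes the result to global memory
    return shared[0], reads, writes


if __name__ == "__main__":
    n = 1024
    for name, fn in (("global", reduce_global), ("shared", reduce_shared)):
        total, r, w = fn(range(n))
        print(f"{name}: sum={total} global reads={r} global writes={w}")
```

Under these assumptions, n = 1024 gives 2n - 1 reads and n writes for the global version against n reads and 1 write for the shared version, so the total global memory traffic differs by a factor of about 3. The measured speedup in the transcript is smaller than that, which is consistent with the point that the kernel is not saturating the memory system.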