English subtitles

CUB Programming Exercise - Intro to Parallel Programming

Showing Revision 4 created 05/24/2016 by Udacity Robot.

  1. So when I run all of these, this one is slightly the fastest of all of these on the diagonal--
  2. 64 threads per block operating on 16 items each.
  3. And so the thing to notice is that throughput is generally increasing
  4. as the granularity per thread increases, right?
  5. So having threads do more and more serial work is generally a good thing.
  6. On the other hand, if you've got a fixed tile size--
  7. in other words, the tile here is the total number of items that we're scanning over--
  8. for a fixed tile size there are diminishing returns.
  9. The improvements get smaller and smaller as you approach this sweet spot
  10. because you're trading off increased granularity per thread for decreased parallelism.
  11. And then the running time starts to go up again, and in my measurements on Fermi,
  12. which is the same GPU that you'll be using on the Udacity IDE,
  13. I get 32x32 being slightly slower than 64 threads with 16 items each.
  14. And at some point, you get to the point where you can no longer fit a problem size
  15. in a single thread's registers, and then performance falls off a cliff again.
  16. And for me on Fermi, 64 items per thread running at only 16 threads
  17. really starts to get pretty bad performance again.
  18. I encourage you to play around, experiment a little bit more
  19. to see what the rest of this matrix looks like.
  20. And if you're interested, go check out the CUB homepage and see what else you can do.
  21. There are different varieties of block scan, for example.
  22. There's work-efficient and step-efficient varieties of block scan,
  23. and so experiment a little bit and see how fast you can get this scan throughput,
  24. how few clocks you can spend per element scanned.
  25. And what I've found is that with the right balance,
  26. you can have a computational overhead of less than 1 clock cycle per item scanned.
  27. That is really pretty remarkable.
  28. Think about running this code on a CPU.
  29. You couldn't really achieve less than 1 clock per item scanned.
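
The trade-off described above, threads per block against items per thread for a fixed tile, can be sketched with a small host-side model. This is only an illustration in plain C++, not CUB's implementation and not GPU code: the function name `tile_exclusive_sum` and its parameters are hypothetical, and the "threads" are simulated by ordinary loops. Phases 1 and 3 are the serial work each thread does on its own items, and phase 2 stands in for the cooperative scan across threads, which gets smaller as the thread count drops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of scanning one tile (hypothetical helper, not a CUB API).
// The tile is split so that each of `threads` simulated threads owns
// `items_per_thread` consecutive items.
std::vector<int> tile_exclusive_sum(const std::vector<int>& tile,
                                    std::size_t threads,
                                    std::size_t items_per_thread) {
    // The tile size is fixed: threads * items_per_thread must cover it exactly.
    assert(tile.size() == threads * items_per_thread);

    std::vector<int> out(tile.size());

    // Phase 1: each thread serially sums its own items.
    std::vector<int> thread_total(threads, 0);
    for (std::size_t t = 0; t < threads; ++t) {
        for (std::size_t i = 0; i < items_per_thread; ++i) {
            thread_total[t] += tile[t * items_per_thread + i];
        }
    }

    // Phase 2: exclusive scan across the per-thread totals.
    // On a GPU this is the cooperative, cross-thread step.
    std::vector<int> thread_offset(threads, 0);
    int running = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        thread_offset[t] = running;
        running += thread_total[t];
    }

    // Phase 3: each thread serially scans its own items,
    // starting from the offset computed in phase 2.
    for (std::size_t t = 0; t < threads; ++t) {
        int acc = thread_offset[t];
        for (std::size_t i = 0; i < items_per_thread; ++i) {
            const std::size_t idx = t * items_per_thread + i;
            out[idx] = acc;
            acc += tile[idx];
        }
    }
    return out;
}
```

Calling this with a 1024-item tile as 64 threads of 16 items or as 32 threads of 32 items gives the same result. Only the split between serial per-thread work and cross-thread work changes, which is the balance the timings in this exercise are measuring.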