English (United States) subtitles

← Writing Efficient Programs - Intro to Parallel Programming


Showing Revision 2 created 05/24/2016 by Udacity Robot.

  1. In the next homework, we're going to implement some pretty cool image blurring techniques.
  2. You now know enough that you can go and implement that program
  3. on a massively parallel GPU, and you'd get a correct answer, and it would be pretty fast.
  4. But we can do better.
  5. Now we have all the ingredients to start talking about writing efficient parallel programs in CUDA.
  6. For now, I'm only going to talk about high-level strategies.
  7. We're going to have a whole unit later on about detailed optimization approaches
  8. to help you really squeeze the maximum performance out of the GPU.
  9. So think of this as a preview that covers some of the really important high-level things
  10. that you have to keep in mind when you are writing a GPU program.
  11. So the first thing to keep in mind is that GPUs have really incredible computational horsepower.
  12. A high-end GPU can do over 3 trillion math operations per second.
  13. You'll sometimes see this written as TFLOPS, okay?
  14. FLOPS stands for Floating-point Operations Per Second.
  15. TFLOPS is tera-FLOPS, and modern GPUs can do over 3 trillion of these every second at the high end.
  16. But all that power is wasted if the arithmetic units that are doing the math
  17. need to spend their time waiting while the system fetches the operands from memory.
  18. So the first strategy we need to keep in mind is to maximize arithmetic intensity.
  19. Arithmetic intensity is basically the amount of math we do per amount of memory that we access.
  20. So we can increase arithmetic intensity by making the numerator bigger
  21. or by making the denominator smaller.
  22. So this corresponds to maximizing the work we do per thread or minimizing the memory we access per thread.
  23. And let's be more exact about what we mean here.
  24. Really what we're talking about is maximizing the number
  25. of useful compute operations per thread.
  26. And on the memory side, what we really care about is minimizing the time spent on memory per thread.
  27. And I phrased this carefully because it is not the total number of memory operations that we care about
  28. and it's not the total amount of memory that comes
  29. and goes in the course of the thread executing its program.
  30. It's how long it takes us to do that,
  31. so there are a lot of ways to spend less time on memory accesses.
  32. And that's what we're going to talk about now.