Summary - Intro to Parallel Programming

Showing Revision 4 created 05/24/2016 by Udacity Robot.

  1. All right, it's time to wrap up. Here's what I hope you've taken away from this unit.
  2. Remember APOD--Analyze, Parallelize, Optimize, and Deploy.
  3. The important points here are do profile-guided optimization at every step and deploy early and often
  4. rather than optimizing forever in a vacuum. I can't emphasize this enough.
  5. Optimization takes effort and often complicates code,
  6. so optimize only where and when you need it and go around this cycle multiple times.
  7. Now, most codes are limited by memory bandwidth.
  8. So compare your performance to the theoretical peak bandwidth
  9. and if it's lacking, see what you can do about it.
  10. Things that will help, in order from most to least important, are:
  11. ensure sufficient occupancy--make sure you have enough threads to keep the machine busy.
  12. This doesn't mean having as many threads as you can possibly fit on the machine,
  13. but you do need enough that the machine is basically busy.
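The CUDA runtime can report how close a given launch configuration gets to full occupancy. A minimal sketch (the `saxpy` kernel and the 256-thread block size are just placeholders):

```cuda
#include <cstdio>

// Hypothetical memory-bound kernel, used here only to query occupancy.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    int blockSize = 256;          // threads per block: a common starting point
    int maxBlocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel fit per multiprocessor.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  saxpy, blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(maxBlocksPerSM * blockSize) /
                      prop.maxThreadsPerMultiProcessor;
    printf("Occupancy: %.0f%% of max resident threads\n", occupancy * 100.0f);
    return 0;
}
```

Remember the point above: you are after "busy enough," not a perfect 100% figure.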
  14. Coalesce global memory accesses.
  15. Really strive to see if you can find a way to cast your algorithm so that you achieve perfect coalescing.
  16. And if you can't, consider whether you can do a transpose operation
  17. or something that will get poor coalescing once
  18. but then put the data into a memory form where all your subsequent accesses will get good coalescing.
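The difference between good and bad coalescing comes down to which addresses neighboring threads in a warp touch. A sketch of the two patterns (hypothetical copy kernels, not from the course code):

```cuda
// Coalesced: consecutive threads in a warp touch consecutive addresses,
// so each warp's loads combine into a few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so a warp scatters across many cache lines and wastes most of each
// memory transaction.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```

A transpose, as mentioned above, is exactly a way to pay the strided cost once so that everything afterward looks like the coalesced version.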
  19. Remember Little's law.
  20. In order to achieve the maximum bandwidth,
  21. you may need to reduce the latency between your memory accesses.
  22. And so for example, we saw that in one case we spent too much time waiting at barriers.
  23. By reducing the number of threads in a block,
  24. we are able to reduce the average time spent waiting at a barrier
  25. and help saturate that global memory bandwidth.
  26. We've talked about minimizing the branch divergence experienced by threads.
  27. Remember that this really applies to threads that diverge within a warp.
  28. If only the warps themselves diverge--in other words, if all the threads within each warp take the same branch,
  29. go down the same code path--then that comes for free.
  30. There's no additional penalty for threads in different warps diverging.
  31. It's only when threads within a warp diverge that you have to execute both sides of the branch.
  32. As a rule, you should generally try to avoid branchy code--
  33. code with lots of if statements, switch statements, and so on.
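Hypothetical kernels illustrating the two cases (assuming the usual warp size of 32):

```cuda
// Diverges WITHIN a warp: even and odd lanes take different paths,
// so every warp must execute both branches, one after the other.
__global__ void intra_warp_divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) data[i] *= 2.0f;
    else            data[i] += 1.0f;
}

// Diverges only BETWEEN warps: all 32 threads of any given warp share
// one path, so each warp executes a single branch at no extra cost.
__global__ void inter_warp_divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) data[i] *= 2.0f;
    else                   data[i] += 1.0f;
}
```

Both kernels compute the same kind of thing; only the mapping of branches onto warps differs, and that mapping is what determines the cost.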
  34. And you should generally be thinking about avoiding thread workload imbalance.
  35. In other words, if you have loops in your kernels
  36. that might execute a very different number of times between threads,
  37. then the one thread that's taking much longer than the average thread
  38. can end up holding the rest of the threads hostage.
  39. All that said, don't let a little bit of thread divergence freak you out.
  40. Remember we analyzed a real-world example of dealing with boundary conditions
  41. at the edge of an image and figured out that in fact the if statements
  42. to guard the edge of the images weren't really costing us very much.
  43. Only a few warps ended up being divergent.
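That boundary-guard pattern looks like this (a sketch, with the actual image operation stubbed out):

```cuda
// Boundary guard at the image edge: only the warps straddling the last
// partial row or column diverge, so the if costs little in practice.
__global__ void process_guarded(const float *in, float *out,
                                int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {              // skip out-of-range threads
        out[y * width + x] = in[y * width + x]; // real kernel would blur etc.
    }
}
```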
  44. If you're limited by the actual computational performance of your kernel
  45. rather than the time it takes to get the data to and from your kernel,
  46. then consider using fast math operations.
  47. This includes things like the intrinsics for sine and cosine and so forth.
  48. They go quite a bit faster than their math.h counterparts,
  49. at the cost of a few bits of precision.
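In CUDA the trade looks like this; `__sinf` is the hardware intrinsic, and compiling with nvcc's `--use_fast_math` flag swaps such intrinsics in globally:

```cuda
__global__ void compare(float x, float *out) {
    out[0] = sinf(x);    // math-library version: more accurate, slower
    out[1] = __sinf(x);  // hardware intrinsic: faster, a few bits less precise
}
```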
  50. And remember that when you use double precision it should be on purpose.
  51. So just typing the literal 3.14, well, that's a 64-bit double precision number,
  52. and the compiler will treat it as such,
  53. whereas typing 3.14f tells the compiler, hey, this is a single precision operation,
  54. you don't have to promote everything I multiply this by or add this to
  55. to be a double precision number.
  56. Finally, if you're limited by host device memory transfer time,
  57. consider using streams and asynchronous memcpys to overlap computation and memory transfers.
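A sketch of that pipeline, alternating two streams over chunks of work (`process` is a hypothetical kernel, and the host buffers are assumed to be page-locked, e.g. from cudaMallocHost, as cudaMemcpyAsync requires for true overlap):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) { /* ... work on the chunk ... */ }

void pipeline(float *h_in, float *h_out, float *d_buf[2],
              int nchunks, int chunk) {
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];       // alternate streams per chunk
        float *d = d_buf[c % 2];          // one device buffer per stream
        // Copy-in, compute, and copy-out for chunk c are serialized within
        // stream st, but overlap with the neighboring chunk's work in the
        // other stream.
        cudaMemcpyAsync(d, h_in + c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);
        cudaMemcpyAsync(h_out + c * chunk, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);
}
```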
  58. And that's it. Now go forth and optimize your codes.