A Related Problem Part 1 - Intro to Parallel Programming


Showing Revision 4 created 05/24/2016 by Udacity Robot.

  1. Okay. Let's talk about a related problem.
  2. What happens when you have lots of threads both reading and writing the same memory locations?
  3. If lots of threads are all trying to read and write the same memory locations,
  4. then you're going to start to get conflicts.
  5. Let's look at an example.
  6. Suppose you had 10,000 threads that were all trying to increment 10 array elements.
  7. What's going to happen?
  8. Let's look at some code.
  9. This is a little more complicated than the examples we've seen before,
  10. so let me walk through it.
  11. What I'm going to do is I'm going to try writing with many threads
  12. to a small number of array elements.
  13. In this case, I'm going to write with 1,000,000 threads into 10 array elements,
  14. and I'm going to do so in blocks of a thousand.
  15. That's the #defines.
  16. I've got a little helper function that just prints out an array.
  17. We'll use that for debugging.
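The setup described above might look something like this; the exact names (NUM_THREADS, ARRAY_SIZE, BLOCK_WIDTH, print_array) are assumptions based on the narration, not necessarily the course's exact source:

```cuda
#include <stdio.h>

// the #defines mentioned in the narration: a million threads,
// ten array elements, blocks of a thousand threads
#define NUM_THREADS 1000000
#define ARRAY_SIZE  10
#define BLOCK_WIDTH 1000

// little helper that just prints out an array, for debugging
void print_array(int *array, int size)
{
    printf("{ ");
    for (int i = 0; i < size; i++) { printf("%d ", array[i]); }
    printf("}\n");
}
```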
  18. And then here's the kernel we're going to use.
  19. It's a kernel because it's labeled global.
  20. It's called increment_naive, and it takes a pointer to global memory, an array of integers.
  21. Each thread simply figures out its own index:
  22. which block it is, times how many threads there are in a block, plus which thread it is inside that block.
  23. Now, what we're going to do is we're going to have every thread increment
  24. one of the elements in the array.
  25. To do that, we're just going to mod the thread index by the array size.
  26. Consecutive threads are going to increment consecutive elements in the array,
  27. wrapping around at the array size.
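Putting the narration together, a sketch of the kernel might look like this (the name increment_naive is from the narration; the unsynchronized `g[i] = g[i] + 1` is exactly the read-modify-write that conflicts when many threads hit the same element):

```cuda
// it's a kernel because it's labeled __global__
__global__ void increment_naive(int *g)
{
    // which thread is this? block index times threads per block,
    // plus the thread's index inside its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread increments one element, wrapping around at the
    // array size so consecutive threads hit consecutive elements
    i = i % ARRAY_SIZE;
    g[i] = g[i] + 1;   // unsynchronized read-modify-write: a data race
}
```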
  28. The fact that we have a million threads writing into only ten elements means
  29. that after each thread has added one to its corresponding element in the array,
  30. we're going to end up with 10 elements each containing the number 100,000.
  31. And then the code itself is simple.
  32. We have a timer class.
  33. Again, I've sort of hidden that away so you don't have to deal with it right now.
  34. We're going to declare some host memory.
  35. We're going to declare some GPU memory.
  36. And we're going to zero out that memory.
  37. You haven't seen cudaMemset before, but it's exactly what you'd think.
  38. We're going to set all of the bytes of this device array to zero.
  39. Now, we're going to launch the kernel.
  40. And I've put a timer around this,
  41. because one of the things I want to show you is that atomic operations can slow things down.
  42. So here's the kernel that we called increment_naive.
  43. We're going to launch it with a number of blocks equal
  44. to the total number of threads divided by the block width
  45. and the number of threads per block equal to the block width.
  46. Remember, these numbers initially are 1,000,000 and 1,000, okay?
  47. We're going to end up launching a thousand blocks,
  48. and we're going to pass in the device array,
  49. and then each thread is going to do its incrementing.
  50. And when it's all done we're going to stop the timer and copy back the array using cudaMemcpy.
  51. Now, we'll take that array that we just incremented
  52. and copy it back to the host and then I hid away a little print array helper function.
  53. It just prints out the contents of the array.
  54. Then I'm going to print out the amount of time taken, in milliseconds,
  55. by this kernel that I measured with a timer.
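The host code being walked through might be sketched as follows; GpuTimer is the hidden-away timer class the narrator mentions, and its interface here (Start, Stop, Elapsed) is an assumption:

```cuda
int main(int argc, char **argv)
{
    GpuTimer timer;                        // timer class, hidden away
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    int h_array[ARRAY_SIZE];               // host memory

    int *d_array;                          // GPU memory
    cudaMalloc((void **)&d_array, ARRAY_BYTES);
    cudaMemset((void *)d_array, 0, ARRAY_BYTES);  // zero out the device array

    // launch with (total threads / block width) blocks
    // of BLOCK_WIDTH threads each -- a thousand blocks
    timer.Start();
    increment_naive<<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.Stop();

    // copy the incremented array back to the host and print it
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);
    printf("Time elapsed = %g ms\n", timer.Elapsed());

    cudaFree(d_array);
    return 0;
}
```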
  56. Okay. That's the whole CUDA program. Let's compile and run it.