Tiling - Intro to Parallel Programming


Showing Revision 4 created 05/25/2016 by Udacity Robot.

  1. So here is my code for this. I'll start by setting k equal to 32 and here is the actual code.
  2. I begin by figuring out the locations of the tile corners.
  3. This is going to tell me where I need to start writing in the output and start reading from the input--
  4. so just a little bookkeeping and giving things variable names that mean something to me.
  5. But as you can see, the place where I start
  6. reading the i value is a function of which block we're in times the width of the tile
  7. because each block is responsible for 1 tile.
  8. And the j value is the same but in y instead of x, and the output simply inverts y and x.
  9. Okay, so now that I know where I need to read and write my tile,
  10. I'm going to want to know which element of the tile to read and write from,
  11. and just a shorthand to make the code a little more readable.
  12. I'm going to set x to threadIdx.x, and y to threadIdx.y.
  13. So now, the code itself is really simple.
  14. I declare my floating point array in shared memory, a k by k array called tile,
  15. and I read from global memory, and write the result into shared memory.
  16. So here's my read from global memory, and it's a function of where the tile starts
  17. plus which element this thread is responsible for in the tile.
  18. To avoid an extra syncthreads, I'm going to go
  19. ahead and write this into shared memory in transposed fashion.
  20. So it's not tile x y, it's tile y x. Okay, that saves one of these syncthreads barriers.
  21. Now I've got the transposed tile sitting in shared memory.
  22. It's already been transposed, and I want to write it out to global memory.
  23. And I want to write it out in coalesced fashion, so adjacent threads write adjacent locations in memory.
  24. In other words, adjacent threads are varying by x the way I've set this out.
  25. So here's my write to global memory after my read from shared memory.
  26. You could have done this with 2 syncthreads.
  27. You could have read this into shared memory, performed the transpose,
  28. and written it out to global memory.
  29. And you would have needed a syncthreads after reading it into shared memory
  30. and again after performing the transpose.
  31. So, if you did that I encourage you to go back
  32. and convert it to this single-syncthreads version, and see how much faster it goes.
  33. Let's go ahead and run this on my laptop. Okay, there's 2 interesting things to note here.
  34. One is that the amount of time taken by the parallel per element code--
  35. the kernel that we had before--actually went up.
  36. It's almost twice as slow now as it was before.
  37. And if you think about it this code didn't change at all, except that we changed the value of k.
  38. We changed the size of the thread block that's being used in this code.
  39. We're going to come back to that. That's going to give us a hint as to a further optimization.
  40. In the meantime, transpose parallel per element tiled, our new version,
  41. is a little bit faster--not a lot faster and that's kind of disturbing.
  42. We should have gone to perfectly coalesced loads and stores which should have made a difference.
  43. Let's go ahead and fire up NVVP again and see what happened.
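
The tiled kernel walked through in the transcript can be sketched roughly as follows. This is a minimal reconstruction from the spoken description, not the code shown on screen. The matrix width N, the row-major layout (element i, j at i + j*N), the kernel name transpose_tiled, and the host helper run_and_check are all assumptions made for illustration. Only K = 32, the tile corners, the x and y shorthands, the shared tile array, and the single barrier come from the transcript.

```cuda
#include <vector>
#include <cuda_runtime.h>

// K = 32 comes from the transcript. N is an assumed matrix width and
// must be a multiple of K for this sketch to cover the whole matrix.
const int N = 1024;
const int K = 32;

// Each K x K thread block transposes one K x K tile.
// Assumed layout: element (i, j) is stored at index i + j*N.
__global__ void transpose_tiled(const float* in, float* out)
{
    // Tile corners: where this block starts reading the input and
    // where it starts writing the output (x and y roles are swapped).
    int in_corner_i  = blockIdx.x * K, in_corner_j  = blockIdx.y * K;
    int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;

    // Shorthand for this thread's element within the tile.
    int x = threadIdx.x, y = threadIdx.y;

    __shared__ float tile[K][K];

    // Coalesced read from global memory: adjacent threads (varying x)
    // read adjacent addresses. The value is stored at tile[y][x].
    tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y) * N];

    // One barrier: every element of the tile must be in shared memory
    // before any thread reads an element written by another thread.
    __syncthreads();

    // Read with the indices swapped (tile[x][y]), which performs the
    // transpose, and write to global memory in coalesced fashion.
    out[(out_corner_i + x) + (out_corner_j + y) * N] = tile[x][y];
}

// Hypothetical host helper: runs the kernel once and returns the number
// of mismatching elements (0 on success, -1 on a CUDA error).
int run_and_check()
{
    std::vector<float> h_in(N * N), h_out(N * N);
    for (int idx = 0; idx < N * N; ++idx) h_in[idx] = (float)idx;

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = (size_t)N * N * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 blocks(N / K, N / K);   // one block per tile
    dim3 threads(K, K);          // one thread per tile element
    transpose_tiled<<<blocks, threads>>>(d_in, d_out);

    // This copy waits for the kernel and reports any launch failure.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            if (h_out[j + i * N] != h_in[i + j * N]) ++mismatches;
    return mismatches;
}
```

Calling run_and_check() from a main function and compiling with nvcc should return 0 on a device that supports 1024 threads per block. Changing K changes both the tile size and the thread block size, which is the knob the transcript points to as a hint for further optimization.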