Profiling the Tiling Code - Intro to Parallel Programming

Showing Revision 5 created 05/24/2016 by Udacity Robot.

  1. Okay, so here we've run NVVP on the function again.
  2. I'll zoom in on this second function. Here's our per-element tiled transpose.
  3. This is the kernel we just wrote.
  4. And as you can see, it's running with a grid of 32 by 32 thread blocks, each of which has 32 by 32 threads.
  5. And sure enough we're achieving 100% global load efficiency and 100% global store efficiency.
  6. And yet, our DRAM utilization has actually gone down slightly.
  7. So what's going on? Why is our achieved bandwidth still so low?
  8. The answer is going to come down to this statistic here: Shared Memory Replay Overhead.
  9. But before we get into the details of Shared Memory Replay Overhead,
  10. what that means and what to do about it, I want to back up for a little bit
  11. and talk about a general principle of how do we make the GPU fast.