Okay, so here we've run NVVP on the function again.
I'll zoom in on this second function. Here's our per-element, tiled transpose.
This is the kernel we just wrote.
And as you can see, it's running with a 32 by 32 grid of thread blocks, each of which has 32 by 32 threads.
And sure enough, we're achieving 100% global load efficiency and 100% global store efficiency.
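For reference, here is a minimal sketch of the kind of per-element, tiled transpose kernel being profiled: each 32 by 32 thread block stages a tile in shared memory so that both the global read and the global write are coalesced. The kernel and parameter names, the float element type, the 1024 x 1024 matrix size in the launch example, and passing the matrix width N as an argument are assumptions for illustration, not the exact code from the lesson.

// Assumed tile width, matching the 32 by 32 thread blocks shown in the profiler.
const int K = 32;

__global__ void transpose_parallel_per_element_tiled(const float *in, float *out, int N)
{
    // Corners of this block's tile in the input and output matrices.
    int in_corner_i  = blockIdx.x * K, in_corner_j  = blockIdx.y * K;
    int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;

    int x = threadIdx.x, y = threadIdx.y;

    __shared__ float tile[K][K];

    // Coalesced read from global memory: adjacent threads (in x) read adjacent elements.
    tile[y][x] = in[(in_corner_j + y) * N + (in_corner_i + x)];
    __syncthreads();

    // Coalesced write to global memory: the transpose happens by reading the
    // tile column-wise (tile[x][y]) while writing row-wise.
    out[(out_corner_j + y) * N + (out_corner_i + x)] = tile[x][y];
}

// Example launch for an assumed 1024 x 1024 matrix:
//   dim3 blocks(1024 / K, 1024 / K);   // 32 x 32 grid of thread blocks
//   dim3 threads(K, K);                // 32 x 32 threads per block
//   transpose_parallel_per_element_tiled<<<blocks, threads>>>(d_in, d_out, 1024);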
And yet, our DRAM utilization has actually gone down slightly.
So what's going on? Why is our achieved bandwidth still so low?
The answer is going to come down to this statistic here--Shared Memory Replay Overhead.
But before we get into the details of Shared Memory Replay Overhead,
what it means and what to do about it, I want to back up for a bit
and talk about a general principle of how we make the GPU run fast.