So, if we go through this analysis for all of these kernels, we'll see that our
parallel per-element version of the code achieves 12.5 gigabytes per second,
our parallel per-row version of the code gets about 1.8 gigabytes per second,
and our serial version of the code gets an abysmal 0.018 gigabytes per second.
This is roughly the speed of a carrier pigeon. And a better way to think about this, perhaps,
is not in absolute numbers but as a percentage of what the particular GPU we're using can achieve.
So, if we were to work out the percentages we're achieving,
it's something like 31% of theoretical peak bandwidth with our highest performing kernel,
4.5% peak bandwidth with our per-row kernel, and less than a 10th of a percent with our serial kernel.
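The lecture never states the GPU's peak bandwidth directly, but we can back it out from the figures above: 12.5 GB/s at 31% of peak implies a theoretical peak of roughly 40 GB/s. The sketch below treats that ~40 GB/s figure as an inferred assumption, not a quoted spec, and checks that the three percentages are consistent with it.

```python
# Back out the implied theoretical peak from the measured numbers.
# NOTE: peak_gb_s is inferred (12.5 / 0.31 ~= 40.3), not stated in the lecture.
peak_gb_s = 40.3

measured = {
    "per-element kernel": 12.5,   # GB/s
    "per-row kernel": 1.8,        # GB/s
    "serial kernel": 0.018,       # GB/s
}

for name, bw in measured.items():
    pct = 100.0 * bw / peak_gb_s
    print(f"{name}: {bw} GB/s = {pct:.2f}% of peak")
```

Running this reproduces the quoted figures: about 31%, about 4.5%, and well under a tenth of a percent.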
So, back to the question. Why is this number so low?
Well, we can take a pretty shrewd guess here: whenever you see really low DRAM utilization,
really low percentage bandwidth, your first guess should always be coalescing.
A way to think about coalescing is that the GPU is always accessing global memory,
accessing the DRAM in pretty large chunks, 32 or 128 bytes at a time.
And this means that we are going to need the fewest
total memory transactions when the threads in a warp access contiguous adjacent memory locations.
So, this is an example of good coalescing.
Every thread is either reading or writing an adjacent memory location.
And clearly, if the threads in a warp are reading and writing completely random locations in memory,
then you're going to get poor coalescing, right?
So, if these accesses are spread out all over the memory,
then the total number of chunks of memory that we have to read could be as large as the number of threads in the warp.
So, a random access pattern clearly leads to bad coalescing.
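We can make the contrast between good and bad coalescing concrete with a small simulation. The sketch below is an assumption-laden model, not real hardware behavior: it assumes each thread reads one 4-byte element and counts how many distinct 32-byte DRAM segments a warp's 32 accesses fall into, which is a rough proxy for the number of memory transactions.

```python
import random

WARP_SIZE = 32
WORD_BYTES = 4      # assume each thread reads one 4-byte float
SEGMENT_BYTES = 32  # the smaller DRAM chunk size mentioned above

def segments_touched(element_indices):
    """Count distinct 32-byte segments that a warp's accesses land in."""
    return len({(i * WORD_BYTES) // SEGMENT_BYTES for i in element_indices})

# Good coalescing: adjacent threads read adjacent elements.
contiguous = list(range(WARP_SIZE))
print(segments_touched(contiguous))  # 4 segments: 128 bytes total

# Bad coalescing: each thread reads a scattered, far-apart element,
# so the warp can touch as many segments as it has threads.
random.seed(0)
scattered = random.sample(range(10**6), WARP_SIZE)
print(segments_touched(scattered))
```

The contiguous pattern needs only 4 segments for the whole warp, while the scattered pattern touches close to one segment per thread, exactly the "as large as the number of threads in the warp" case described above.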
So, a much more common access pattern is what's called strided, and this is where threads access
a location in memory that's a function of their thread ID times some stride.
So, for example, thread 0 might access location 0, thread 1 location 2, thread 2 location 4, thread 3 location 6, and so on.
In that case, that would be a stride of 2 because there's two elements between thread accesses,
and strided accesses range from okay, like in this case, where with a stride of 2 elements
I'm really only doubling the number of memory transactions,
so I'm sort of halving the quality of my coalescing, all the way to really, really bad, right?
So, you can imagine that if the stride between elements is large enough,
then every thread in the warp is accessing a completely different 32- or 128-byte chunk of memory,
and then you're guaranteed to get bad behavior.
Guaranteed to be maximizing the number of memory transactions that you have to do.
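The same segment-counting model makes the strided case quantitative. Under the same assumptions as before (4-byte elements, 32-byte segments, so 8 elements per segment), each doubling of the stride doubles the transaction count until the stride reaches 8 elements, at which point every thread in the warp sits in its own segment and things can't get any worse.

```python
WARP_SIZE = 32
WORD_BYTES = 4      # assumed element size
SEGMENT_BYTES = 32  # assumed transaction granularity: 8 elements per segment

def segments_for_stride(stride):
    """Distinct 32-byte segments touched when thread t reads element t * stride."""
    return len({(t * stride * WORD_BYTES) // SEGMENT_BYTES
                for t in range(WARP_SIZE)})

for stride in (1, 2, 4, 8, 1024):
    print(f"stride {stride:4d}: {segments_for_stride(stride)} segments")
```

Stride 1 gives 4 segments, stride 2 gives 8 (the "doubling" from the lecture), and by stride 8 the warp already needs all 32 segments, the same worst case as a stride of 1024.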
So, let's look at the code for our kernels. Here's where we're reading from the input matrix,
and this actually works out pretty well.
Every thread is reading a value in memory at some large offset: J times N plus I.
And if you look at I, I is really the thread index plus some offset.
So, adjacent threads, you know, threads with adjacent thread indices in x,
are reading adjacent values of the input matrix. That's exactly what we want,
so this is good coalescing. On the other hand, when we write the output matrix,
adjacent threads, threads with adjacent values of I, are writing to places separated in memory by N, right?
And N was like 1024. So, adjacent threads are writing memory locations that are 1024 elements away from each other.
This is clearly bad, this is bad coalescing. Bad, bad, bad, bad.
This is, in fact, the root of our problem.
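Applying the same segment-counting model to the index arithmetic just described shows the asymmetry between the read and the write. The addresses below follow the lecture's indexing (read at j*N + i, write at i*N + j); element size and segment size are the same assumptions as before.

```python
N = 1024            # matrix dimension from the lecture
WARP_SIZE = 32
WORD_BYTES = 4      # assumed element size
SEGMENT_BYTES = 32  # assumed transaction granularity

def segments(addresses):
    """Distinct 32-byte segments touched by a warp's element addresses."""
    return len({(a * WORD_BYTES) // SEGMENT_BYTES for a in addresses})

j = 0                      # any fixed row; adjacent threads differ only in i
warp_i = range(WARP_SIZE)

# Read from in[j*N + i]: adjacent i means adjacent addresses. Good coalescing.
reads = [j * N + i for i in warp_i]
print(segments(reads))     # 4 segments for the whole warp

# Write to out[i*N + j]: adjacent i means addresses N elements apart. Bad.
writes = [i * N + j for i in warp_i]
print(segments(writes))    # 32 segments, one per thread
```

The coalesced read costs 4 segments per warp while the stride-N write costs 32, which is why the write side dominates and drags the kernel so far below peak bandwidth.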