Now that we've done all the preliminaries, let's turn to some actual code.
We're going to implement this twice with a similar strategy each time.
In both we're going to implement a sum of a million--actually 2^20 elements,
and we're going to do this in 2 stages.
In the 1st stage we're going to launch 1024 blocks,
each one of which will use 1024 threads to reduce 1024 elements.
Each of those will produce 1 single item.
So we're going to have 1024 items left when we're done.
Then, in the 2nd stage, we're going to launch 1 block to reduce those final 1024 elements into 1 single element.
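To make the shape of this concrete, here's a minimal sketch of what the two-stage launch might look like on the host side. The names here--global_reduce_kernel, d_in, d_intermediate, d_out--are placeholders of my own; they stand in for whatever the posted code actually calls its kernel and device buffers, which I'm assuming have already been allocated and filled.

    const int N = 1 << 20;            // 2^20 elements to sum
    const int threads = 1024;         // threads per block
    const int blocks  = N / threads;  // 1024 blocks in stage 1

    // Stage 1: each block reduces its 1024-element chunk to 1 partial sum.
    global_reduce_kernel<<<blocks, threads>>>(d_intermediate, d_in);

    // Stage 2: a single block reduces the 1024 partial sums to the final value.
    global_reduce_kernel<<<1, blocks>>>(d_out, d_intermediate);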
So I'll post all the code, of course. But the CPU side code is straightforward.
Instead we're just going to take a look at the kernel. Let's see how that works.
So each block is going to be responsible for a 1024 element chunk of floats,
and we're going to run this loop within the kernel.
On each iteration of this loop we're going to divide the active region in half.
So on the 1st iteration, where we start with 1024 elements,
we're going to have two 512-element regions.
Then each of 512 threads will add its element in the 2nd half to its element in the 1st half,
writing back to the 1st half.
Now we're going to synchronize all threads--this __syncthreads call right here--
to make sure every one of them is done.
We've got 512 elements remaining,
and so we're going to loop again on this resulting region of 512 elements.
Now we'll divide it into two 256-element chunks, using 256 threads
to add the 256 items in the 2nd half to the 256 items in the 1st half.
And we're going to continue this loop, cutting it in half every time,
until we have 1 element remaining at the very end of 10 iterations.
And then we'll write that back out to global memory. So this works.
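Putting that loop together, a kernel along these lines would do the job. This is a sketch of my own rather than the exact code on screen; d_in is the input array and d_out holds one partial sum per block.

    // Each block reduces its 1024-element chunk, operating directly on global memory.
    __global__ void global_reduce_kernel(float *d_out, float *d_in)
    {
        int myId = threadIdx.x + blockDim.x * blockIdx.x;  // index into the whole array
        int tid  = threadIdx.x;                            // index within this block

        // Each iteration halves the active region: 1024 -> 512 -> ... -> 1.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (tid < s)
                d_in[myId] += d_in[myId + s];  // add element in 2nd half into 1st half
            __syncthreads();                   // wait until every add in this stage is done
        }

        // Thread 0 writes this block's single result back out to global memory.
        if (tid == 0)
            d_out[blockIdx.x] = d_in[myId];
    }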
We can run it on a computer in our lab.
So we're doing that now, and we notice that it finishes in 0.65 milliseconds.
Less than a millisecond. That's pretty great, but it's not as efficient as we might like.
Specifically, if we take a look at the code again,
we're going to global memory more often than we'd like.
On each iteration of the loop, we read n items from global memory and we write back n/2 items.
Then we read those n/2 items back from global memory and so on.
In an ideal world, we'd do one initial read where we read all of the 1024 items into the thread block,
do all the reduction internally, and then write back the final value.
And this should be faster because we would incur less memory traffic overall.
The CUDA feature we use to do this is called shared memory,
and we will store all intermediate values in shared memory, where all threads in the block can access them.
Shared memory is considerably faster than global memory.
So let's take a look at the kernel. It's going to look very similar.
And in this kernel we're going to have the exact same loop structure.
What's going to be different, though, is this little part right here.
We have to 1st copy all the values from global memory into shared memory
and that's done with this little block.
And then all the further accesses here are from shared memory--
this sdata array--as opposed to from global memory, which is what we did last time.
And when we're done, we have to write this final value back to global memory again.
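Again as a sketch of my own (the variable names, in particular sdata, are assumptions), the shared memory version might look like this:

    // Same loop structure, but staged through shared memory: each element is read
    // from global memory once, and each block writes one value back.
    __global__ void shmem_reduce_kernel(float *d_out, const float *d_in)
    {
        // The size of sdata is set by the 3rd launch parameter inside the triple chevrons.
        extern __shared__ float sdata[];

        int myId = threadIdx.x + blockDim.x * blockIdx.x;
        int tid  = threadIdx.x;

        // First copy this block's chunk from global memory into shared memory.
        sdata[tid] = d_in[myId];
        __syncthreads();  // make sure the whole chunk is loaded before reducing

        // All intermediate reads and writes now hit shared memory, not global memory.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // When we're done, thread 0 writes the block's final value back to global memory.
        if (tid == 0)
            d_out[blockIdx.x] = sdata[0];
    }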
The only other interesting part of this code is how we declare the amount of shared memory we need.
And we do that here.
We're declaring that we're going to have an externally defined amount of shared data.
Now, we haven't actually said how much we need,
so to do that, we're going to have to go down to where we actually call the kernel.
So when we're calling the reduce kernel using shared memory,
we now call it with 3 arguments inside the triple chevrons: the usual blocks and threads,
and then how many bytes we need allocated in shared memory.
In this case, every thread is going to ask for 1 float stored in shared memory.
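Sticking with the hypothetical names from the sketches above, the two launches would then look something like this, with the 3rd argument inside the triple chevrons giving the number of bytes of shared memory per block:

    // Stage 1: 1024 blocks, 1024 threads each, 1 float of shared memory per thread.
    shmem_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(d_intermediate, d_in);

    // Stage 2: 1 block of 1024 threads reduces the partial sums to the final answer.
    shmem_reduce_kernel<<<1, blocks, blocks * sizeof(float)>>>(d_out, d_intermediate);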
So the advantage of the shared memory version is that it saves global memory bandwidth.
It's a good exercise to figure out how much memory bandwidth you'll save.
So I'll ask that as a quiz.
The global memory version uses how many times as much memory bandwidth
as the shared memory version?
Round to the nearest integer.