So the answer is 3.
And so here I'm showing the memory traffic required for the global version.
Here I'm showing the memory traffic required for the shared version.
I'm showing all the reads you need to do in red and all the writes you need to do in blue.
And if you sum this series, you'll find that to reduce n values, the shared memory version
does n reads and 1 write to global memory, but the global memory version does 2n reads and n writes.
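Written out, the series looks roughly like this. This is a sketch that assumes n is a power of two and that, in the global version, each active thread reads two values and writes one at every step; the totals are approximate because the small constant terms are dropped.

```latex
\begin{align*}
\text{global version, reads}  &= n + \frac{n}{2} + \frac{n}{4} + \dots + 2 = 2n - 2 \approx 2n \\
\text{global version, writes} &= \frac{n}{2} + \frac{n}{4} + \dots + 1 = n - 1 \approx n \\
\text{shared version}         &= n \text{ reads} + 1 \text{ write} \\
\frac{\text{global traffic}}{\text{shared traffic}} &\approx \frac{2n + n}{n + 1} \approx 3
\end{align*}
```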
So you would expect the shared memory version to run faster, which it does, and let's show that now.
So now I'm going to run the shared memory version.
And we see that it does in fact run faster.
Now we're down to 464 microseconds as opposed to 651.
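For reference, a shared memory reduce of the kind being timed here might look like the sketch below. It is not necessarily the exact code run in the lecture: the names `shmem_reduce_kernel` and `reduce_shared_sketch` are hypothetical, and the host helper assumes the input length is an exact multiple of the block size, with both the block size and the number of blocks being powers of two no larger than 1024.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces blockDim.x floats to a single partial sum.
// Assumes blockDim.x is a power of two.
__global__ void shmem_reduce_kernel(float *d_out, const float *d_in)
{
    extern __shared__ float sdata[];              // blockDim.x floats, sized at launch
    unsigned int tid  = threadIdx.x;
    unsigned int myId = blockIdx.x * blockDim.x + tid;

    sdata[tid] = d_in[myId];                      // the only global read per element
    __syncthreads();

    // Tree reduction carried out entirely in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) d_out[blockIdx.x] = sdata[0];   // the only global write per block
}

// Hypothetical host helper: two launches, blocks -> partial sums -> final sum.
// Error checking is omitted to keep the sketch short.
float reduce_shared_sketch(const std::vector<float> &h_in, int threads)
{
    int n = static_cast<int>(h_in.size());
    assert(n % threads == 0);
    int blocks = n / threads;
    assert(blocks <= 1024);                       // second launch uses `blocks` threads

    float *d_in, *d_partial, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    shmem_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(d_partial, d_in);
    shmem_reduce_kernel<<<1, blocks, blocks * sizeof(float)>>>(d_out, d_partial);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return h_out;
}
```

The point to notice is that the loop touches only `sdata`, so global memory sees one read per element and one write per block.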
But it doesn't run 3 times faster. How come?
Well, the detailed reason is that we're not saturating the memory system.
And there are numerous advanced techniques you would need to apply
to totally max out the performance of reduce.
We're not doing that today, but if you're really interested in micro-optimization of this kind,
this application is a really great place to start.
You'll want to look in particular at processing multiple items per thread instead of just 1.
You'll also want to perform the first step of the reduction
right when you read the items from global memory into shared memory.
And you'll want to take advantage of the fact that warps are synchronous
when you're doing the last steps of the reduction.
But these are all advanced techniques.
We can talk about them on the forums if you all are interested.
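As a starting point, here is one way those three ideas could be combined. Treat it as a sketch under stated assumptions, not a tuned implementation: the names `reduce_multi_kernel` and `reduce_multi_sketch` are hypothetical, and the block size is assumed to be a power of two of at least 64. For the last steps it uses `__shfl_down_sync` warp shuffles (CUDA 9 or later) rather than relying on implicit warp-synchronous execution, which newer GPUs with independent thread scheduling do not guarantee.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

__global__ void reduce_multi_kernel(float *d_out, const float *d_in, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid    = threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;

    // Multiple items per thread, with the first additions done while loading:
    // each thread sums its elements in a register before touching shared memory.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
        sum += d_in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Shared memory tree reduction until 32 partial sums remain.
    for (unsigned int s = blockDim.x / 2; s >= 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last steps inside a single warp, using shuffles instead of shared memory.
    if (tid < 32) {
        float v = sdata[tid];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0) d_out[blockIdx.x] = v;
    }
}

// Hypothetical host helper; error checking omitted for brevity.
float reduce_multi_sketch(const std::vector<float> &h_in, int blocks, int threads)
{
    assert(threads >= 64 && (threads & (threads - 1)) == 0);
    unsigned int n = static_cast<unsigned int>(h_in.size());

    float *d_in, *d_partial, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    size_t shmem = threads * sizeof(float);
    reduce_multi_kernel<<<blocks, threads, shmem>>>(d_partial, d_in, n);
    // The same kernel folds the per-block partial sums into one value.
    reduce_multi_kernel<<<1, threads, shmem>>>(d_out, d_partial, blocks);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return h_out;
}
```

Launching far fewer threads than there are elements, for example `reduce_multi_sketch(data, 64, 256)` on a million values, is what makes each thread process many items; the best grid size depends on the GPU and would need to be measured.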