So, taking these one at a time: a million threads incrementing a million elements
does give the correct answer because, in this case, there's a unique element for
every thread, so there's no conflict.
So even though we didn't make these increments atomic, we're still safe.
A million threads atomically incrementing a million elements is, of course, also safe.
So you'll get the correct answer there too.
A million threads incrementing a hundred elements is the same example we saw before.
And as we saw, that will give the wrong answer unless we use atomics.
So the third option is not correct. The fourth one, which does use atomics, is correct.
And finally, ten million threads atomically incrementing a hundred elements will still give the correct answer.
So all but one of these give you the correct answer.
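To make the correctness argument concrete, here is a minimal CPU-side sketch in C++. It is an analogy, not the GPU code from the lecture: the function name `atomic_increment`, the worker count, and the `i % num_elements` indexing are all assumptions made for illustration. It shows that when every increment is atomic, each element ends up with exactly its share of the increments, even when many threads hit the same few elements.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the lecture): `num_workers` CPU threads
// together perform `num_increments` atomic increments. Increment i lands on
// element i % num_elements, so many increments share each element.
std::vector<int> atomic_increment(std::size_t num_increments,
                                  std::size_t num_elements,
                                  unsigned num_workers = 4) {
    assert(num_elements > 0 && num_workers > 0);

    std::vector<std::atomic<int>> counters(num_elements);
    for (auto& c : counters) c.store(0);

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&counters, num_increments, num_elements, num_workers, w] {
            // Worker w handles increments w, w + num_workers, w + 2 * num_workers, ...
            for (std::size_t i = w; i < num_increments; i += num_workers) {
                // fetch_add is an indivisible read-modify-write, so no update is lost.
                counters[i % num_elements].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : workers) t.join();

    // Copy the final values out into plain ints for the caller.
    std::vector<int> result(num_elements);
    for (std::size_t e = 0; e < num_elements; ++e) result[e] = counters[e].load();
    return result;
}
```

Calling `atomic_increment(1000000, 100)` should leave every one of the 100 elements at 10,000. Replacing `fetch_add` with a plain `++` on a non-atomic `int` would be a data race, which is the situation in the third option.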
Okay, so the more interesting question is how long does each of these options take?
The fastest, perhaps counter-intuitively, is going to be option 3:
a million threads writing into a hundred elements.
And the next fastest would be option 1:
a million threads writing into a million elements.
On my laptop, these two operations take around 3.2 milliseconds and 3.4 milliseconds, respectively.
Of course, option 3 is not a very useful option, since it doesn't give the correct answer.
But it's still interesting to look at what's going on.
So the reason why option 3 is slightly faster is that you have your million threads all
trying to write to the same 100 elements.
Well, those 100 elements occupy a very small fraction of memory,
and they're all going to fit into a cache line or two on this system,
whereas a million elements is large enough that it's not going to fit in cache.
You're actually going to have to touch more of the DRAM, where global memory is stored.
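To put rough numbers on that footprint argument, here is a small C++ sketch. It assumes each element is a 4-byte integer, which the lecture does not state at this point, and the helper name `footprint_bytes` is made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>

// Assumption: 4 bytes per element (a typical int); adjust if the element type differs.
constexpr std::size_t footprint_bytes(std::size_t num_elements,
                                      std::size_t bytes_per_element = 4) {
    return num_elements * bytes_per_element;
}

// Prints the memory touched by the small and the large array.
void print_footprints() {
    std::printf("100 elements:       %zu bytes\n", footprint_bytes(100));
    std::printf("1,000,000 elements: %zu bytes\n", footprint_bytes(1000000));
}
```

Under that assumption the small array is 400 bytes and the large one is about 4 MB, so the two differ by a factor of ten thousand in how much memory the threads have to touch.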
For the same reason, a million threads atomically writing into 100 elements
is going to be slightly faster than a million threads atomically writing into a million elements.
So the next fastest is option 4. The next fastest after that is option 2.
And on my system, again, these took 3.6 and 3.8 milliseconds,
which means the slowest of all the options is option 5, the one where
10 million threads atomically write into a hundred elements.
That one actually takes 36 milliseconds, approximately 10 times as long as option 4.
Not surprisingly, there's about 10 times as much going on.
So you might play around with this code for a little bit.
For example, see what happens to the time as you go to even more threads writing into even fewer elements.
The big lesson here is that atomic memory operations come with a cost.
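If you don't have the lecture code handy, the same kind of experiment can be sketched on a CPU in C++. This is only an analogy under stated assumptions: `elapsed_ms` and `contended_total` are hypothetical helpers, not part of the course code, and CPU timings will not match the GPU numbers quoted above (the trend as you shrink the array may even differ).

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical timing helper: runs `work` once and returns wall-clock milliseconds.
template <typename Work>
double elapsed_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical workload: `num_workers` threads share `num_increments` atomic
// increments spread over `num_elements` counters. Returns the sum of all
// counters, which should equal `num_increments` if nothing was lost.
long long contended_total(std::size_t num_increments,
                          std::size_t num_elements,
                          unsigned num_workers) {
    if (num_elements == 0 || num_workers == 0) return 0;

    std::vector<std::atomic<int>> counters(num_elements);
    for (auto& c : counters) c.store(0);

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&counters, num_increments, num_elements, num_workers, w] {
            for (std::size_t i = w; i < num_increments; i += num_workers) {
                counters[i % num_elements].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : workers) t.join();

    long long total = 0;
    for (auto& c : counters) total += c.load();
    return total;
}
```

One way to use it is to wrap a call such as `contended_total(10000000, 100, 4)` in a lambda, pass that to `elapsed_ms`, and then repeat with fewer elements or more increments to see how the time moves.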