Okay. Let's talk about a related problem.
What happens when you have lots of threads both reading and writing the same memory locations?
If lots of threads are all trying to read and write the same memory locations,
then you're going to start to get conflicts.
Let's look at an example.
Suppose you had 10,000 threads that were all trying to increment 10 array elements.
What's going to happen?
Let's look at some code.
This is a little more complicated than the examples we've seen before,
so let me walk through it.
What I'm going to do is I'm going to try writing with many threads
to a small number of array elements.
In this case, I'm going to write with 1,000,000 threads into 10 array elements,
and I'm going to do so in blocks of a thousand.
That's the #defines.
There's a little helper function that just prints out an array.
We'll use that for debugging.
And then here's the kernel we're going to use.
It's a kernel because it's labeled __global__.
It's called increment_naive, and it takes a pointer to global memory: an array of integers.
Each thread simply figures out which thread it is by looking at the block index:
which block it is, times how many threads there are in a block, plus which thread it is inside that block.
Now, what we're going to do is we're going to have every thread increment
one of the elements in the array.
To do that, we're just going to mod the thread index by the array size.
Consecutive threads are going to increment consecutive elements in the array,
wrapping around at the array size.
The fact that we have a million threads writing into only ten elements means
that after each thread has added one to its corresponding element in the array,
we're going to end up with 10 elements each containing the number 100,000.
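The kernel described above might look roughly like this (a sketch reconstructed from the walkthrough; the exact names and constants are assumptions):

```cuda
#define NUM_THREADS 1000000
#define ARRAY_SIZE  10

// Each of the million threads increments one of the ten array slots.
// Note: g[i] = g[i] + 1 is a separate load, add, and store, so it is
// NOT atomic; concurrent threads can clobber each other's updates.
__global__ void increment_naive(int *g)
{
    // which thread is this, globally?
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // map the million threads onto the ten array elements,
    // wrapping around at the array size
    i = i % ARRAY_SIZE;
    g[i] = g[i] + 1;   // racy read-modify-write
}
```

Because two threads can read the same old value and both write back old + 1, increments get lost, which is why the final counts from a kernel like this typically come out far below the expected 100,000.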
And then the code itself is simple.
We have a timer class.
Again, I've sort of hidden that away so you don't have to deal with it right now.
We're going to declare some host memory.
We're going to declare some GPU memory.
And we're going to zero out that memory.
You haven't seen cudaMemset before, but it's exactly what you'd think.
We're going to set all of the bytes of this device array to zero.
Now, we're going to launch the kernel.
And I've put a timer around this,
because one of the things I want to show you is that atomic operations can slow things down.
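For contrast, a race-free version of the same kernel would replace the plain increment with CUDA's atomicAdd. This is a sketch of that alternative, not the naive kernel the lecture is timing here:

```cuda
#define ARRAY_SIZE 10

// Same indexing as the naive kernel, but the increment is performed
// atomically: the hardware serializes conflicting updates to g[i],
// so no increments are lost (at some cost in speed).
__global__ void increment_atomic(int *g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i = i % ARRAY_SIZE;
    atomicAdd(&g[i], 1);
}
```

The serialization is exactly why atomics can slow things down: a million threads contending for ten locations must take turns on each update.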
So here's the kernel that we called increment_naive.
We're going to launch it with a number of blocks equal
to the total number of threads divided by the block width
and the number of threads per block equal to the block width.
Remember, these numbers initially are 1,000,000 and 1,000, okay?
We're going to end up launching a thousand blocks,
and we're going to pass in the device array,
and then each thread is going to do its incrementing.
And when it's all done we're going to stop the timer and copy back the array using cudaMemcpy.
Now, we'll take that array that we just incremented
and copy it back to the host and then I hid away a little print array helper function.
It just prints out the contents of the array.
Then I'm going to print out the amount of time taken, in milliseconds,
by this kernel, as measured with the timer.
Okay. That's the whole CUDA program. Let's compile and run it.
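Putting the whole walkthrough together, the program might look roughly like this (a sketch; the GpuTimer class, header name, and print_array helper are assumptions based on the description, not necessarily the lecture's exact code):

```cuda
#include <stdio.h>
#include "gputimer.h"   // assumed helper wrapping cudaEvent-based timing

#define NUM_THREADS 1000000
#define ARRAY_SIZE  10
#define BLOCK_WIDTH 1000

// debugging helper: print the contents of an array
void print_array(int *array, int size)
{
    printf("{ ");
    for (int i = 0; i < size; i++) { printf("%d ", array[i]); }
    printf("}\n");
}

// each thread does a non-atomic (racy) increment of one array slot
__global__ void increment_naive(int *g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i = i % ARRAY_SIZE;
    g[i] = g[i] + 1;
}

int main(int argc, char **argv)
{
    GpuTimer timer;
    int h_array[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare GPU memory and zero it out
    int *d_array;
    cudaMalloc((void **)&d_array, ARRAY_BYTES);
    cudaMemset((void *)d_array, 0, ARRAY_BYTES);

    // launch 1,000 blocks of 1,000 threads each, with a timer around it
    timer.Start();
    increment_naive<<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.Stop();

    // copy the array back to the host and inspect it
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);
    printf("Time elapsed = %g ms\n", timer.Elapsed());

    cudaFree(d_array);
    return 0;
}
```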