-
Title:
Reduction Using Global and Shared Memory - Intro to Parallel Programming
-
Description:
-
So now we've done all the preliminaries, so let's turn to some actual code.
-
We're going to implement this twice with a similar strategy each time.
-
In both we're going to implement a sum of a million--actually 2^20 elements,
-
and we're going to do this in 2 stages.
-
In the 1st stage we're going to launch 1024 blocks,
-
each one of which will use 1024 threads to reduce 1024 elements.
-
Each of those will produce 1 single item.
-
So we're going to have 1024 items left when we're done.
-
In the 2nd stage, we're going to launch 1 block to reduce the final 1024 elements into 1 single element.
-
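Sketched as launch code, the two-stage plan looks something like this (a hedged sketch, assuming a reduction kernel named `reduce` and device pointers `d_in`, `d_intermediate`, and `d_out`; these names are illustrative, not necessarily the course's exact code):

```cuda
// Stage 1: 1024 blocks of 1024 threads; each block reduces its own
// 1024-element chunk of d_in down to one partial sum in d_intermediate.
reduce<<<1024, 1024>>>(d_intermediate, d_in);

// Stage 2: a single block of 1024 threads reduces the 1024 partial
// sums down to the final answer in d_out[0].
reduce<<<1, 1024>>>(d_out, d_intermediate);
```
-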
So I'll post all the code, of course, but the CPU-side code is straightforward,
-
so here we're just going to take a look at the kernel. Let's see how that works.
-
So each block is going to be responsible for a 1024 element chunk of floats,
-
and we're going to run this loop within the kernel.
-
On each iteration of this loop we're going to divide the active region in half.
-
So on the 1st iteration, where we start with 1024 elements,
-
we're going to have two 512-element regions.
-
Then each of 512 threads will add its element in the 2nd half to its element in the 1st half,
-
writing back to the 1st half.
-
Now we're going to synchronize all threads with this `__syncthreads()` call right here,
-
to make sure every one is done.
-
We've got 512 elements remaining,
-
and so we're going to loop again on this resulting region of 512 elements.
-
Now we'll divide it into two 256-element chunks, using 256 threads
-
to add each of the 256 items in the 2nd half to the corresponding 256 items in the 1st half.
-
And we're going to continue this loop, cutting it in half every time,
-
until we have 1 element remaining at the very end of 10 iterations.
-
And then we'll write that back out to global memory. So this works.
-
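The halving loop described above can be sketched as a kernel like this (a hedged sketch, not the course's exact listing; the kernel and variable names are assumptions, and it reduces in place in global memory, which is exactly the inefficiency discussed next):

```cuda
// Global-memory version: each block reduces its 1024-element chunk
// of d_in down to one element of d_out, working directly in global memory.
__global__ void global_reduce_kernel(float *d_out, float *d_in)
{
    int myId = threadIdx.x + blockDim.x * blockIdx.x; // index into d_in
    int tid  = threadIdx.x;                           // index within the block

    // On each iteration, halve the active region: the first s threads
    // each add the element s slots away into their own slot.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            d_in[myId] += d_in[myId + s];
        }
        __syncthreads(); // make sure every addition is done before halving again
    }

    // Thread 0 writes this block's single result back out to global memory.
    if (tid == 0)
    {
        d_out[blockIdx.x] = d_in[myId];
    }
}
```

Note that every `+=` in the loop both reads from and writes to global memory, which is what the timing discussion below is about.
-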
We can run it on a computer in our lab.
-
So we're doing that now, and we notice that it finishes in 0.65 milliseconds.
-
Less than a millisecond. That's pretty great, but it's not as efficient as we might like.
-
Specifically, if we take a look at the code again,
-
we're going to global memory more often than we'd like.
-
On each iteration of the loop, we read n items from global memory and we write back n/2 items.
-
Then we read those n/2 items back from global memory and so on.
-
In an ideal world, we'd do an original read where we read all of the 1024 items into the thread block,
-
do all the reduction internally, and then write back the final value.
-
And this should be faster because we would incur less memory traffic overall.
-
The CUDA feature we use to do this is called shared memory,
-
and we'll store all intermediate values in shared memory, where all threads in the block can access them.
-
Shared memory is considerably faster than global memory.
-
So let's take a look at the kernel. It's going to look very similar.
-
And in this kernel we're going to have the exact same loop structure.
-
What's going to be different, though, is this little part right here.
-
We have to 1st copy all the values from global memory into shared memory
-
and that's done with this little block.
-
And then all the further accesses here are from shared memory--
-
this `sdata` array--as opposed to from global memory, which we used last time.
-
And when we're done, we have to write this final value back to global memory again.
-
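Putting those pieces together, the shared-memory version can be sketched like this (again a hedged sketch with assumed names, not the course's exact listing):

```cuda
// Shared-memory version: copy the block's chunk into shared memory once,
// do the whole reduction there, then write a single float back out.
__global__ void shmem_reduce_kernel(float *d_out, const float *d_in)
{
    // sdata's size is set by the 3rd launch parameter in the triple chevrons.
    extern __shared__ float sdata[];

    int myId = threadIdx.x + blockDim.x * blockIdx.x;
    int tid  = threadIdx.x;

    // Load shared memory from global memory -- the one read per element.
    sdata[tid] = d_in[myId];
    __syncthreads(); // make sure the entire block is loaded

    // Same halving loop as before, but entirely in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Only thread 0 writes this block's result back to global memory.
    if (tid == 0)
    {
        d_out[blockIdx.x] = sdata[0];
    }
}
```
-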
The only other interesting part of this code is how we declare the amount of shared memory we need.
-
And we do that here.
-
We're declaring that we're going to have an externally defined amount of shared data.
-
Now, we haven't actually said how much we need,
-
so to do that, we're going to have to go down to where we actually call the kernel.
-
So when we're calling the reduce kernel that uses shared memory,
-
we now call it with 3 arguments inside the triple chevrons: the normal blocks and threads,
-
plus how many bytes we need allocated in shared memory.
-
In this case, every thread is going to ask for 1 float stored in shared memory.
-
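That launch might look like this (a sketch with assumed names: a kernel declaring `extern __shared__ float sdata[]` and device pointers `d_in` and `d_intermediate`):

```cuda
// The 3rd launch parameter is the number of bytes of shared memory
// to allocate per block -- here, 1 float for each of the 1024 threads.
const int threads = 1024;
const int blocks  = 1024;
shmem_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>
    (d_intermediate, d_in);
```
-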
So the advantage of the shared memory version is that it saves global memory bandwidth.
-
It's a good exercise to figure out how much memory bandwidth you'll save.
-
So I'll ask that as a quiz.
-
The global memory version uses how many times as much memory bandwidth
-
as the shared memory version?
-
Round to the nearest integer.