-
Title:
Why Use CUB - Intro to Parallel Programming
-
Description:
-
Now let me focus on a specific aspect of writing high-performance kernel code.
-
At a high level, GPU programming looks like this.
-
There's a bunch of data sitting in global memory,
-
and you have an algorithm that you want to run on that data.
-
That algorithm will get executed by threads running on SMs,
-
and for various reasons, you might want to stage the data through shared memory.
-
And you've seen for yourself that moving data between global memory and shared memory
-
or the local variables of the threads
-
can be complicated if you are striving for really high performance.
-
Think about problem set 2
-
where you loaded an image tile into shared memory
-
and also had to load a halo of extra cells around the image tile
-
in order to account for the width of the blur,
-
and this can be tricky because the number of threads that want to perform a computation
-
on the pixels in that tile
-
is naturally different from the number of threads
-
you would launch if your goal was simply to load the pixels in the tile
-
as well as the pixels outside of the tile.
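To make that concrete, here is a minimal sketch of the pattern, not the actual problem set code: a block whose compute footprint is TILE_W x TILE_W cooperatively loads a padded tile of (TILE_W + 2*HALO)^2 pixels, so some threads have to issue more than one load. TILE_W, HALO, and the simple box-blur computation are assumptions made just for this example.

```cuda
#define TILE_W 16
#define HALO    2   // hypothetical blur radius for this sketch

// Launch with a TILE_W x TILE_W thread block per output tile.
__global__ void blurTile(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE_W + 2*HALO][TILE_W + 2*HALO];

    int outX = blockIdx.x * TILE_W + threadIdx.x;
    int outY = blockIdx.y * TILE_W + threadIdx.y;

    // The padded tile has (TILE_W+2*HALO)^2 elements but only TILE_W^2 threads,
    // so each thread loops to load one or more elements, clamping at the edges.
    for (int y = threadIdx.y; y < TILE_W + 2*HALO; y += TILE_W)
        for (int x = threadIdx.x; x < TILE_W + 2*HALO; x += TILE_W) {
            int gx = min(max(blockIdx.x * TILE_W + x - HALO, 0), w - 1);
            int gy = min(max(blockIdx.y * TILE_W + y - HALO, 0), h - 1);
            tile[y][x] = in[gy * w + gx];
        }
    __syncthreads();

    // Only the TILE_W x TILE_W compute threads produce output pixels.
    if (outX < w && outY < h) {
        float sum = 0.0f;
        for (int dy = -HALO; dy <= HALO; ++dy)
            for (int dx = -HALO; dx <= HALO; ++dx)
                sum += tile[threadIdx.y + HALO + dy][threadIdx.x + HALO + dx];
        out[outY * w + outX] = sum / ((2*HALO + 1) * (2*HALO + 1));
    }
}
```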
-
Or think about the tile transpose example in Unit 5
-
where we staged data through shared memory in order to keep our global memory accesses coalesced.
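For reference, here is a minimal sketch of that pattern, assuming a square n x n matrix and a 32 x 32 thread block; it mirrors the idea from Unit 5 rather than reproducing that exact code. Both the read and the write stay coalesced because the transposition itself happens inside shared memory.

```cuda
#define TILE 32

__global__ void transposeTiled(const float *in, float *out, int n)
{
    // Padding the inner dimension by 1 avoids shared-memory bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

    __syncthreads();

    // Swap the block indices so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```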
-
Think about our discussion of Little's Law
-
and the trade-offs that we went over between latency,
-
bandwidth, occupancy, the number of threads,
-
and the transaction size per thread.
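As a reminder, the form of Little's Law we applied there is: useful bytes in flight = memory bandwidth x memory latency. So at a fixed latency, you can only saturate bandwidth by keeping enough bytes in flight, either with more threads or with wider transactions per thread.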
-
Remember that we saw a graph that looked sort of like this
-
where the bandwidth that we achieved as we increased the number of threads
-
was higher if we were able to access 4 floating point values in a single load
-
versus 2 floating point values in a single load
-
versus a single floating point value in a load.
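The two kernels below are a hedged illustration of that difference (the names are invented for this example): the float4 version moves 16 bytes per load instruction instead of 4, so far fewer loads need to be in flight per thread to cover the same latency.

```cuda
// One 4-byte load and 4-byte store per thread.
__global__ void copy1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// One 16-byte load and 16-byte store per thread: each thread keeps four
// times as many bytes in flight, which is the Little's Law trade-off above.
__global__ void copy4(const float4 *in, float4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}
```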
-
Finally, there are ninja-level optimizations we haven't even talked about in this class,
-
like using the Kepler LDG intrinsic.
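For the curious, here is a minimal sketch of that intrinsic, assuming a Kepler-class device (compute capability 3.5 or later): __ldg() loads the value through the read-only data (texture) cache rather than the normal path.

```cuda
__global__ void scaleReadOnly(const float * __restrict__ in, float *out,
                              int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * __ldg(&in[i]);   // load via the read-only (texture) cache
}
```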
-
In short, CUDA's explicit use of user-managed shared memory
-
enables predictable high performance.
-
In contrast, the hardware-managed caches that you find on CPUs
-
can result in unpredictable performance,
-
especially when you have many threads sharing the same cache and interfering with each other.
-
So that's an advantage of explicit shared memory,
-
but that advantage comes at the cost of an additional burden on the programmer
-
who has to explicitly manage the movement of data in and out from global memory.
-
And this is a big part of where CUB comes in.
-
CUB puts an abstraction around the algorithm and its memory access pattern
-
and deals opaquely with the movement of data from global memory,
-
possibly through shared memory, into the actual local variables of the threads.
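Here is a hedged sketch of what that looks like in practice; the kernel itself is invented for illustration, though cub::BlockLoad and cub::BlockReduce are real CUB primitives. The BLOCK_LOAD_TRANSPOSE policy stages the data through shared memory so that the global reads are coalesced, and none of that staging code appears in the kernel.

```cuda
#include <cub/cub.cuh>

// Each block loads BLOCK_THREADS * ITEMS_PER_THREAD elements into per-thread
// registers via CUB, then reduces them, without hand-written staging code.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void blockSum(const int *in, int *block_sums)
{
    typedef cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                           cub::BLOCK_LOAD_TRANSPOSE> BlockLoadT;
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduceT;

    // CUB tells us how much shared memory it needs; we just provide it.
    __shared__ union {
        typename BlockLoadT::TempStorage   load;
        typename BlockReduceT::TempStorage reduce;
    } temp;

    int items[ITEMS_PER_THREAD];
    int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    BlockLoadT(temp.load).Load(in + block_offset, items);
    __syncthreads();   // the shared union is reused for the reduction

    int sum = BlockReduceT(temp.reduce).Sum(items);
    if (threadIdx.x == 0) block_sums[blockIdx.x] = sum;
}
```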
-
I want to mention another programming power tool called CudaDMA
-
that focuses specifically on the movement of data from global into shared memory.