-
Title:
CUB Programming Exercise - Intro to Parallel Programming
-
Description:
-
So when I run all of these, the slight winner among the configurations on the diagonal is 64 threads per block operating on 16 items each.

The thing to notice is that throughput generally increases as the granularity per thread increases; having threads do more and more serial work is generally a good thing. On the other hand, for a fixed tile size (the tile being the total number of items we're scanning over, 1024 for each configuration on this diagonal) there are diminishing returns. The improvements get smaller and smaller as you approach the sweet spot, because you're trading off increased granularity per thread for decreased parallelism.

Past the sweet spot, performance starts to fall off again: in my measurements on Fermi, which is the same GPU you'll be using in the Udacity IDE, 32 threads with 32 items each comes out slightly slower than 64 threads with 16 items each. And at some point you can no longer fit a thread's share of the problem in its registers, and performance falls off a cliff. For me on Fermi, 64 items per thread running on only 16 threads really starts to give pretty bad performance again.

I encourage you to play around and experiment a little bit more to see what the rest of this matrix looks like.
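If you'd like a starting point for that experimentation, here is a minimal sketch of how such a kernel can be parameterized with cub::BlockScan. The kernel name, tile offsets, and launch configurations are illustrative rather than the exercise's exact code; the key idea is that threads per block and items per thread are compile-time template parameters, so each cell of the matrix is just a different instantiation.

#include <cub/cub.cuh>

// Two compile-time knobs control granularity: BLOCK_THREADS (parallelism)
// and ITEMS_PER_THREAD (serial work per thread). The tile is their product.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void BlockScanKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScanT;
    __shared__ typename BlockScanT::TempStorage temp_storage;

    // Each block scans its own contiguous tile
    const int tile_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    // Each thread holds its ITEMS_PER_THREAD items in registers
    int items[ITEMS_PER_THREAD];
    cub::LoadDirectBlocked(threadIdx.x, d_in + tile_offset, items);

    // Cooperative exclusive prefix sum across the whole tile
    BlockScanT(temp_storage).ExclusiveSum(items, items);

    cub::StoreDirectBlocked(threadIdx.x, d_out + tile_offset, items);
}

// Different cells of the matrix are just different instantiations:
//   BlockScanKernel<64, 16><<<1, 64>>>(d_in, d_out);  // the sweet spot above
//   BlockScanKernel<32, 32><<<1, 32>>>(d_in, d_out);  // slightly slower on Fermi
//   BlockScanKernel<16, 64><<<1, 16>>>(d_in, d_out);  // spills out of registers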
And if you're interested, go check out the CUB homepage and see what else you can do. There are different varieties of block scan, for example: work-efficient and step-efficient variants. So experiment a little and see how fast you can get this scan throughput, how few clocks you can spend per element scanned.
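In CUB, those varieties are exposed as an algorithm template parameter on cub::BlockScan. Here is a brief sketch of selecting among them; the enum values below are from CUB's public API, though you should check the documentation for your version:

#include <cub/cub.cuh>

const int BLOCK_THREADS = 64;  // illustrative block size

// Work-efficient raking scan: fewer total operations, longer dependence chains
typedef cub::BlockScan<int, BLOCK_THREADS, cub::BLOCK_SCAN_RAKING> RakingScanT;

// Step-efficient warp-synchronous scan: more total work, fewer steps
typedef cub::BlockScan<int, BLOCK_THREADS, cub::BLOCK_SCAN_WARP_SCANS> WarpScansT;

// Raking variant that trades registers for fewer shared memory reads
typedef cub::BlockScan<int, BLOCK_THREADS, cub::BLOCK_SCAN_RAKING_MEMOIZE> MemoizeScanT;

Swapping the algorithm changes the balance between total operations and step count, which is exactly the work-efficient versus step-efficient trade-off just mentioned.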
And what I've found is that with the right balance, you can have a computational overhead of less than 1 clock cycle per item scanned. That is really pretty remarkable. Think about running this code on a CPU: a serial scan has a dependence chain of at least one addition per element, so you couldn't really achieve less than 1 clock per item scanned.
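If you want to put a number on clocks per element yourself, here is one rough way to measure it: a sketch that assumes the BlockScanKernel from earlier, illustrative buffer names, and buffers large enough for many tiles. Elapsed time is converted to clock cycles using the device clock rate and divided by the total items scanned; launching many blocks and repeating the launch keeps launch overhead from dominating the tiny per-tile work. The exercise's own timing harness may measure differently.

#include <cstdio>
#include <cuda_runtime.h>

// Assumes BlockScanKernel from the sketch above, and device buffers
// d_in/d_out holding at least kBlocks * 1024 ints each.
void MeasureClocksPerItem(const int *d_in, int *d_out)
{
    const int kBlocks = 4096;      // enough independent tiles to fill the GPU
    const int kIterations = 100;   // repeat to amortize launch overhead
    const int kTileItems = 64 * 16;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < kIterations; ++i)
        BlockScanKernel<64, 16><<<kBlocks, 64>>>(d_in, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // clockRate is reported in kHz, so milliseconds * kHz = elapsed clock cycles
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double clocks = (double)ms * prop.clockRate;

    double items = (double)kIterations * kBlocks * kTileItems;
    printf("%.3f clocks per item scanned\n", clocks / items);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}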