So we've seen some really interesting algorithms so far,
but the GPU performance leader is none of the ones that we've discussed to date.
Instead, typically on the GPU for the highest performing sorts
we use a different sorting algorithm called radix sort.
Now, all of the previous sorting algorithms were comparison sorts,
meaning the only operation we did on an item was compare it to another one.
Radix sort relies on a number representation that uses positional notation.
In this case, bits are more significant as we move further left in the word,
and it's most easily explained using integers.
So the algorithm for radix sort is as follows.
Start with the least significant bit of the integer and split the input into 2 sets:
those that have a 0 at this particular bit location and those that have a 1.
Otherwise, maintain the order; in other words, the split is stable.
Then proceed to the next least significant bit and repeat until we run out of bits.
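The steps above can be written as a minimal sequential sketch in Python. The GPU version parallelizes each pass, but the logic of the passes is the same; the function name and the 3-bit example values are just illustrative.

```python
def radix_sort(values, num_bits):
    """LSD binary radix sort on unsigned integers.

    One pass per bit, least significant bit first. Each pass is a
    stable split: zeros first, then ones, original order preserved
    within each group.
    """
    for bit in range(num_bits):
        zeros = [v for v in values if (v >> bit) & 1 == 0]
        ones = [v for v in values if (v >> bit) & 1 == 1]
        values = zeros + ones  # stable: both lists keep input order
    return values

print(radix_sort([5, 2, 7, 1, 6, 3, 0, 4], 3))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Note that the split must be stable; if a pass reordered items within the zeros or ones group, the work of the earlier passes would be destroyed.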
So as usual, we're going to do an example that will make more sense,
and we're going to use unsigned integers.
So what we're going to sort is this column of numbers to the left,
and here is the binary representation of those numbers.
And so we're going to start here with the least significant bit.
So the way that we're going to do this is take all the elements that have a 0 as the least significant bit,
and we're going to otherwise maintain their order, but we're going to put them up top.
Then we're going to take all the rest of the elements,
those that have a 1 as the least significant bit,
and we're going to again keep them in order and append them to the list that we've just created.
So what this creates is a list where all the least significant bits are 0
and then a list where all the least significant bits are 1.
Now we move to the next least significant bit,
so the bit in the middle, and we're going to do the same thing.
We're going to take all the 0s and put them up top,
and then we're going to take all the 1s and put them below.
And here the dotted lines are just showing the data movement that we're looking at.
The green lines are the ones where the middle bit is 0,
and the blue line is the one where the middle bit is 1.
Now we move on to the next most significant bit--
in this case, the very most significant bit--and we do the same operation again.
Zeroes in the most significant bit move up top, 1s move to the bottom,
otherwise, we maintain the order.
And now we have a sorted sequence. Pretty cool, huh?
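The three passes we just walked through can be traced with a short sketch. The eight 3-bit values here are my own illustrative choice, not necessarily the exact column from the example, but the pass-by-pass behavior is the same:

```python
def split_by_bit(values, bit):
    # Stable split on one bit: zeros first, then ones,
    # with the original order preserved inside each group.
    return ([v for v in values if not (v >> bit) & 1] +
            [v for v in values if (v >> bit) & 1])

values = [5, 2, 7, 1, 6, 3, 0, 4]  # illustrative 3-bit inputs
for bit in range(3):  # least significant bit first
    values = split_by_bit(values, bit)
    print(f"after bit {bit}: {[format(v, '03b') for v in values]}")
```

After the last pass the list is fully sorted, just as in the walkthrough: each pass finishes the job that the previous, less significant passes set up.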
Now, there's 2 big reasons this code runs great on GPUs.
The first is its work complexity. The best comparison-based sorts are O(n log n).
This algorithm, on the other hand, is O(kn),
meaning the runtime is linear in 2 different things.
First, it's linear in the number of bits in the representation.
So this particular integer has 3 bits in its representation,
and it took 3 stages for us to be able to sort the input.
Second, it's linear in the number of items to sort.
So we have 8 items in the input here,
and so the amount of work is proportional to 8.
Generally k is a constant, say 32 bits or 64 bits per word, for any reasonable application.
And so in general, the work complexity of this
is mostly proportional to the number of items that we need to sort.
And so that's a superior work complexity to any of the sorts that we've talked about to date,
and so that's 1 reason why this looks so good.
The second is that the underlying operations we need to perform this split of the input at each step
are actually very efficient.
And in fact, they're efficient operations that you already know.
Let's take a closer look at what we're doing.
We're only going to look at the first stage of the radix sort algorithm,
where we're only considering the value of the least significant bit,
and we're only going to look at the output for which the least significant bit is 0.
Now what are we actually doing here?
We've already learned an algorithm that does this operation today.
So what is the name of the algorithm that takes this input and creates that as the output?