1
00:00:00,000 --> 00:00:02,613
So we've seen some really interesting algorithms so far,
2
00:00:02,613 --> 00:00:06,685
but the GPU performance leader is none of the ones that we've discussed to date.
3
00:00:06,685 --> 00:00:11,023
Instead, typically on the GPU for the highest performing sorts
4
00:00:11,023 --> 00:00:13,891
we use a different sorting algorithm called radix sort.
5
00:00:13,891 --> 00:00:18,133
Now, all of the previous sorting algorithms were comparison sorts,
6
00:00:18,133 --> 00:00:21,415
meaning the only operation we did on an item was compare it to another one.
7
00:00:21,415 --> 00:00:25,451
Radix sort relies on a number representation that uses positional notation.
8
00:00:25,451 --> 00:00:29,779
In this case, bits are more significant as we move further left in the word,
9
00:00:29,779 --> 00:00:32,689
and it's most easily explained using integers.
10
00:00:32,689 --> 00:00:35,461
So the algorithm for radix sort is as follows.
11
00:00:35,461 --> 00:00:40,685
Start with the least significant bit of the integer, split the input into 2 sets,
12
00:00:40,685 --> 00:00:45,472
those that have a 0 with this particular bit location and those that have a 1.
13
00:00:45,472 --> 00:00:47,457
Otherwise, maintain the order.
14
00:00:47,457 --> 00:00:51,988
Then proceed to the next least significant bit and repeat until we run out of bits.
15
00:00:51,988 --> 00:00:55,514
So as usual, we're going to do an example that will make more sense,
16
00:00:55,514 --> 00:00:57,904
and we're going to use unsigned integers.
17
00:00:57,904 --> 00:01:01,059
So what we're going to sort is this column of numbers to the left,
18
00:01:01,059 --> 00:01:04,438
and here is the binary representation of those numbers.
19
00:01:04,438 --> 00:01:08,746
And so we're going to start here with the least significant bit.
20
00:01:08,746 --> 00:01:14,740
So the way that we're going to do this is take all the elements that have a 0 as the least significant bit,
21
00:01:14,740 --> 00:01:18,819
and we're going to otherwise maintain their order, but we're going to put them up top.
22
00:01:18,819 --> 00:01:21,220
Then we're going to take all the rest of the elements,
23
00:01:21,220 --> 00:01:23,928
those that have a 1 as the least significant bit,
24
00:01:23,928 --> 00:01:27,728
and we're going to again keep them in order and append them to the list that we've just created.
25
00:01:27,728 --> 00:01:31,172
So what this creates is a list where all the least significant bits are 0
26
00:01:31,172 --> 00:01:34,571
and then a list where all the least significant bits is 1.
27
00:01:34,571 --> 00:01:37,491
Now we move to the next least significant bit,
28
00:01:37,491 --> 00:01:40,150
so the bit in the middle, and we're going to do the same thing.
29
00:01:40,150 --> 00:01:42,447
We're going to take all the 0s and put them up top,
30
00:01:42,447 --> 00:01:44,905
and then we're going to take all the 1s and put them below.
31
00:01:44,905 --> 00:01:48,035
And here the dotted lines are just showing the data movement that we're looking at.
32
00:01:48,035 --> 00:01:51,019
The green lines are the ones where the middle bits are 0,
33
00:01:51,019 --> 00:01:53,787
and the blue line is the one where the middle bits are 1.
34
00:01:53,787 --> 00:01:56,394
Now we move on to the next most significant bit--
35
00:01:56,394 --> 00:02:00,269
in this case, the very most significant bit--and we do the same operation again.
36
00:02:00,269 --> 00:02:03,968
Zeroes in the most significant bit move up top, 1s move to the bottom,
37
00:02:03,968 --> 00:02:06,060
otherwise, we maintain the order.
38
00:02:06,060 --> 00:02:08,907
And now we have a sorted sequence. Pretty cool, huh?
39
00:02:08,907 --> 00:02:12,117
Now, there's 2 big reasons this code runs great on GPUs.
40
00:02:12,117 --> 00:02:17,477
The first is its work complexity. The best comparison base sorts are O(n log n).
41
00:02:17,477 --> 00:02:20,577
This algorithm, on the other hand, is O(kn),
42
00:02:20,577 --> 00:02:23,566
meaning the runtime is linear in 2 different things.
43
00:02:23,566 --> 00:02:27,027
First, it's linear in the number of bits in the representation.
44
00:02:27,027 --> 00:02:31,940
So this particular integer has 3 bits in its representation,
45
00:02:31,940 --> 00:02:35,643
and it took 3 stages for us to be able to sort the input.
46
00:02:35,643 --> 00:02:38,604
Second, it's linear in the number of items to sort.
47
00:02:38,604 --> 00:02:41,796
So we have 8 items in the representation here,
48
00:02:41,796 --> 00:02:45,674
and so the amount of work is proportional to 8.
49
00:02:45,674 --> 00:02:51,718
Generally k is constant, say a 32-bit word or a 64-bit word for any reasonable applications.
50
00:02:51,718 --> 00:02:54,433
And so in general, the work complexity of this
51
00:02:54,433 --> 00:02:58,160
is mostly proportional to the number of items that we need to sort.
52
00:02:58,160 --> 00:03:02,986
And so that's a superior work complexity to any of the sorts that we've talked about to date,
53
00:03:02,986 --> 00:03:05,762
and so that's 1 reason why this looks so good.
54
00:03:05,762 --> 00:03:11,540
The second is that the underlying operations that we need to do this split of the input at each step
55
00:03:11,540 --> 00:03:14,306
are ones that are actually very efficient.
56
00:03:14,306 --> 00:03:17,677
And in fact, they're efficient operations that you already know.
57
00:03:17,677 --> 00:03:19,638
Let's take a closer look at what we're doing.
58
00:03:19,638 --> 00:03:23,280
We're only going to look at the first stage of the radix sort algorithm,
59
00:03:23,280 --> 00:03:26,817
where we're only considering the value of the least significant bit,
60
00:03:26,817 --> 00:03:31,516
and we're only going to look at the output for which the least significant bit is 0.
61
00:03:31,516 --> 00:03:34,327
Now what are we actually doing here?
62
00:03:34,327 --> 00:03:37,427
We've already learned an algorithm that does this operation today.
63
00:03:37,427 --> 00:03:42,477
So what is the name of the algorithm that takes this input and creates that as the output?