
タイトル：
Quick Sort  Intro to Parallel Programming

概説：

The final sort algorithm we'll discuss is one of the most efficient ones for serial processors;

this is quick sort. It is a more complex algorithm on GPUs

because of the control complexity of the algorithm, so let's recap what the quick sort algorithm does.

First, it chooses a pivot element, one particular element from within its input,

then it compares all of the elements in its input to the pivot,

and it uses that comparison to divide the input into 3 subarrays.

Those that are less than the pivot, those that are equal to the pivot,

and those that are greater than the pivot, and then it calls quick sort on each of these subarrays

and continues until the entire input is sorted.

As an example, let's look at a particular array and choose that the pivot is equal to 3.

So what we're going to do here is compare each one of these elements to the pivot,

and we're going to decide if they're equal to, greater than, or less than the pivot.

Then we'll divide it into 3 arrays, those that are less than the pivot,

those that are equal to the pivot, and those that are greater than the pivot,

and we'll call quick sort on each of these arrays and do the same thing.

So when we have 2 and 1, let's say we choose 2 as the pivot,

then we'll divide this into smaller than the pivot and equal to the pivot.

This doesn't need to be recursed because it only has a single element.

And let's say we choose 4 as the pivot here, this is greater than the pivot, this is equal to the pivot,

and now we have a completely sorted array.

So this is a very challenging algorithm to implement on the GPU.

The other algorithms that we've looked at are pretty simple to describe, and they don't recurse.

This one is more complex, and until very recently GPUs didn't support recursion at all.

Indeed, the GPUs we use in this class don't support recursion currently.

So how can we take this seemingly recursiveonly algorithm and map it to the GPU using the primitives that we've learned?

So I'm bringing up this example for two reasons.

The first is that you have already learned all the pieces that you need to implement quick sort on the GPU,

and the second is to motivate the benefits of new GPU capabilities that do natively support recursion.

So we can implement quick sort without recursion by using the idea of segments.

Recall that segmented operations, like scans, only operate within a single segment;

operations on one segment don't affect other segments.

That sounds a little bit like recursion, and in fact it maps in a similar way.

For quick sort, when we begin to process the initial array,

we're going to use distributes, maps, and compacts to eventually divide it into 3 segments.

We can use segmented scans to do all the necessary operations that we need to make this work,

including distributing a pivot across a segment for comparisons and splitting a segment,

which is similar to the way that we split on a particular bit in radix sort.

Quick sort is interesting because it shows you how useful segments can be,

that they can let you mirror the same approach you use in recursion, without actually using recursion.

But, and I gotta be perfectly honest here,

it's really a challenge to implement and equally challenging to make it fast.

So it's very exciting that the newest in video GPUs support a more flexible programming model where kernels can actually call,

can launch other kernels, which makes quick sort's recursive calls much more straightforward.

We're not using this new capability in the class.

The Amazon instances where our code is running don't have these new GPUs that support this capability at this time,

but it's really exciting, and so when we get to unit 7,

we'll see exactly what it looks like and how it applies to Quick sort.