
Title:
Merrill's Linear-Complexity BFS on GPUs Part 1 - Intro to Parallel Programming

Description:

So in the discussion on graphs, I made the following statement; let me quote myself.

"If we had a graph that was just a linear chain, the last node would be node N minus 1.

The linear chain is very hard to parallelize.

If we're doing a BFS here, it's going to take us order of N steps to get to the end of the graph."

Let me state this problem another way.

I have N nodes in a linear chain,

and each node knows the ID of the next node in the chain.

This is just a simple linked list.

What is an algorithm by which every node finds the end of the list?

And so of course we can solve this in N steps.

Let's say that each node has a next pointer, and I've shown those in blue,

that points to the next node in the chain, and the last node has next equals null.

Now we don't want to change the next pointers at all, or else we'll lose the structure of our lists.

So we're going to assume they're read only.

So we're also going to store a second pointer per node that we can change.

For historical purposes, we'll call this pointer chum,

and we're going to designate it in red.

And at the end of the algorithm, we want each node's chum pointer to point to the last node in the chain.
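As a concrete encoding of this setup (the array names here are my own, not from the lecture), a four-node chain might look like:

```python
# Read-only successor indices: next_ptr[i] is node i's next pointer.
# None marks the last node in the chain. We never modify this array.
next_ptr = [1, 2, 3, None]

# The mutable per-node chum pointer. The goal is that, when the
# algorithm finishes, every entry points at the last node:
# chum == [3, 3, 3, 3]
chum = [None, None, None, None]
```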

So the straightforward algorithm is: on each iteration, each node

sets chum to chum's next, until we reach a node where next is null.

So we'll start off by making all chum pointers point to their own nodes.

That's how we are going to initialize them, and again, on each iteration, we will set chum to chum's next.

So on the first iteration it's going to look like this, so now we're going to do another iteration,

so for any particular node, we look at chum and then follow its next.

So on the second iteration, it's going to look like this, and so on, and so on.

So on each iteration, the length of the chum pointer is going to be 1 more than it was on the iteration before.
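The straightforward algorithm above can be sketched in code, simulating the parallel step with a serial inner loop (the function and array names are illustrative, not from the lecture):

```python
def find_end_naive(next_ptr):
    """Each node repeatedly advances its chum pointer one hop.

    next_ptr[i] is node i's read-only successor index (None at the end).
    Returns chum, where every chum[i] is the index of the last node.
    """
    n = len(next_ptr)
    chum = list(range(n))        # each chum starts pointing at its own node
    for _ in range(n - 1):       # N-1 iterations suffice for a chain of N
        # one parallel step, simulated serially: every node hops once,
        # unless its chum has already reached the node where next is null
        for i in range(n):
            if next_ptr[chum[i]] is not None:
                chum[i] = next_ptr[chum[i]]
    return chum

# 5-node chain: 0 -> 1 -> 2 -> 3 -> 4
print(find_end_naive([1, 2, 3, 4, None]))  # -> [4, 4, 4, 4, 4]
```

After iteration k, each chum pointer reaches k nodes ahead (or the end, whichever comes first), matching the "1 more each iteration" behavior described above.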

So the question, the important question, is: can we do better?

And it turns out the answer is yes,

and this algorithm, described by Danny Hillis and Guy Steele in 1986

(not discovered by them, but described very nicely),

is so cool that it's one of the reasons that I decided to do parallel computing in the first place.

So let's analyze the complexity of the algorithm that we just described.

Clearly a serial processor can do this computation in N steps, doing order of N work.

How about the parallel processor?

We know that it takes N steps; how much work?

Your choices are order of N, N log N, N square root of N, or N squared.
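As a preview of why the Hillis-Steele idea is so appealing: their pointer-jumping trick has every node, on each step, set chum to its chum's chum, so the distance each pointer covers doubles every iteration. A hedged sketch of that idea (my own code, not the lecture's; it does not answer the quiz above):

```python
def find_end_jumping(next_ptr):
    """Pointer jumping: each step, every node's chum leaps to its
    chum's chum, doubling the distance covered per iteration.

    Returns (chum, steps); steps grows like log2(N) rather than N.
    """
    n = len(next_ptr)
    # Start chum one hop ahead; the last node points at itself.
    chum = [nx if nx is not None else i for i, nx in enumerate(next_ptr)]
    steps = 0
    while any(chum[i] != chum[chum[i]] for i in range(n)):
        # Synchronous update: everyone reads old chums, then all write.
        chum = [chum[chum[i]] for i in range(n)]
        steps += 1
    return chum, steps

# 5-node chain converges in 2 steps instead of the naive 4.
print(find_end_jumping([1, 2, 3, 4, None]))  # -> ([4, 4, 4, 4, 4], 2)
```

Note the list comprehension reads the old chum array and builds a new one, mimicking the synchronous read-then-write semantics a parallel machine would need.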