WEBVTT

00:00:00.000 --> 00:00:11.000


00:00:11.000 --> 00:00:14.000
This presentation is delivered by the Stanford Center for Professional

00:00:14.000 --> 00:00:25.000
Development.

00:00:25.000 --> 00:00:27.000
Okay, we got volume and everything. All

00:00:27.000 --> 00:00:28.000
right, here we go. Lots of you think your

00:00:28.000 --> 00:00:32.000
learning things that aren't gonna actually someday help you in your future

00:00:32.000 --> 00:00:34.000
career. Let's say you plan to be president. You

00:00:34.000 --> 00:00:38.000
think, ''Should I take 106b; should I not take 106b?'' Watch this interview and

00:00:38.000 --> 00:00:39.000
you will see. [Video]

00:00:39.000 --> 00:00:41.000
We have questions, and we ask

00:00:41.000 --> 00:00:45.000
our candidates questions, and this one is from Larry Schwimmer. What

00:00:45.000 --> 00:00:48.000
- you

00:00:48.000 --> 00:00:52.000
guys think I'm kidding; it's right here.

00:00:52.000 --> 00:00:59.000
What is the most efficient way to sort a million 32-bit integers?

00:00:59.000 --> 00:01:01.000
Well - I'm

00:01:01.000 --> 00:01:07.000
sorry. Maybe we should - that's not a - I think the bubble sort

00:01:07.000 --> 00:01:09.000
would be the wrong way to go. Come on; who told him this?

00:01:09.000 --> 00:01:14.000
There you go. So, apparently -

00:01:14.000 --> 00:01:15.000
the question was

00:01:15.000 --> 00:01:16.000
actually

00:01:16.000 --> 00:01:19.000
by a former student of mine, in

00:01:19.000 --> 00:01:22.000
fact, Larry Schwimmer. So apparently, he's been prepped well. As part of going on the campaign trail, not only do you

00:01:22.000 --> 00:01:25.000
have to know all your policy statements, you also have to have your

00:01:25.000 --> 00:01:29.000
N2 and login sorts kept straight in your mind, so - just so

00:01:29.000 --> 00:01:31.000
you know. So let me give you

00:01:31.000 --> 00:01:35.000
some examples of things that are graphs, just to get you thinking about

00:01:35.000 --> 00:01:36.000


00:01:36.000 --> 00:01:39.000
the data structure that we're gonna be working with and playing around with

00:01:39.000 --> 00:01:40.000
today. Sometimes you

00:01:40.000 --> 00:01:44.000
think of graphs of being those, like, bar graphs, right, your graphing

00:01:44.000 --> 00:01:46.000
histograms. This is actually a different kind of graph. The computer scientist's form of

00:01:46.000 --> 00:01:50.000
graph is just a generalized, recursive data structure

00:01:50.000 --> 00:01:51.000
that, kind of,

00:01:51.000 --> 00:01:54.000
starts with something that looks a little bit like a tree and extends it to

00:01:54.000 --> 00:01:55.000
this notion that there are

00:01:55.000 --> 00:01:57.000
any number of nodes, in

00:01:57.000 --> 00:02:00.000
this case, represented as the cities in this airline route map,

00:02:00.000 --> 00:02:04.000
and then there are connections between them, sometimes called edges or arcs that wire up

00:02:04.000 --> 00:02:08.000
the connections that in the case, for example, of this graph are showing you

00:02:08.000 --> 00:02:10.000
which cities have connecting flights. You have a direct

00:02:10.000 --> 00:02:15.000
flight that takes you from Las Vegas to Phoenix, right, one from Phoenix to Oklahoma

00:02:15.000 --> 00:02:15.000
City.

00:02:15.000 --> 00:02:18.000
There is not a direct flight from Phoenix to Kansas City, but there's a path

00:02:18.000 --> 00:02:21.000
by which you could get there by taking a sequence

00:02:21.000 --> 00:02:25.000
of connecting flights, right, going from Phoenix, to Oklahoma City, to Kansas City is

00:02:25.000 --> 00:02:26.000
one way to get there.

00:02:26.000 --> 00:02:30.000
And so the entire, kind of, map of airline routes forms a graph of the city,

00:02:30.000 --> 00:02:34.000
and there are a lot of interesting questions that having represented that data you might want to

00:02:34.000 --> 00:02:35.000
answer about finding

00:02:35.000 --> 00:02:36.000
flights that are

00:02:36.000 --> 00:02:40.000
cheap, or meet your time specification, or get you where you want to go,

00:02:40.000 --> 00:02:47.000
and that this data would be, kind of, primary in solving that problem.

00:02:47.000 --> 00:02:48.000
Another one,

00:02:48.000 --> 00:02:49.000
right, another

00:02:49.000 --> 00:02:52.000
idea of something that's represented as a graph is something that would support

00:02:52.000 --> 00:02:54.000
the notion of word ladders, those little puzzles where you're

00:02:54.000 --> 00:02:56.000
given the word ''chord,''

00:02:56.000 --> 00:03:00.000
and you want to change it into the word ''worm,'' and you have these rules that say, well, you

00:03:00.000 --> 00:03:02.000
can change one letter at a time.

00:03:02.000 --> 00:03:04.000
And so the

00:03:04.000 --> 00:03:07.000
connections in the case of the word letter have to do with words that are

00:03:07.000 --> 00:03:12.000
one character different from their neighbors, have direct connections or arcs

00:03:12.000 --> 00:03:16.000
between them. Each of the nodes themselves represents a word, and then

00:03:16.000 --> 00:03:19.000
paths between that represent word ladders. That if you can go from here

00:03:19.000 --> 00:03:22.000
through a sequence of steps directly connecting your way, each of those stepping

00:03:22.000 --> 00:03:26.000
stones is a word in the chain that makes a word ladder. So

00:03:26.000 --> 00:03:29.000
things you might want to solve there is what's the shortest word ladder? How many different

00:03:29.000 --> 00:03:31.000
word ladders - how many different ways can I go from this word to that one

00:03:31.000 --> 00:03:36.000
could be answered by searching around in a space like this.

00:03:36.000 --> 00:03:39.000
Any kind of prerequisite structure,

00:03:39.000 --> 00:03:42.000
such as the one for the major that might say you need to take this class before you take

00:03:42.000 --> 00:03:45.000
that class, and this class before that class, right,

00:03:45.000 --> 00:03:46.000
forms a

00:03:46.000 --> 00:03:49.000
structure that also fits into the main graph.

00:03:49.000 --> 00:03:52.000
In this case, there's a connection to those arcs.

00:03:52.000 --> 00:03:54.000
That the prerequisite is such that

00:03:54.000 --> 00:03:58.000
you take 106a, and it leads into 106b, and that connection doesn't really go in

00:03:58.000 --> 00:04:01.000
the reverse, right? You don't start in 106b and move backwards.

00:04:01.000 --> 00:04:05.000
Whereas in the case, for example, of the word ladder, all of these arcs are what we called undirected,

00:04:05.000 --> 00:04:09.000
right? You can traverse from wood to word and back again, right?

00:04:09.000 --> 00:04:10.000
They're equally

00:04:10.000 --> 00:04:12.000


00:04:12.000 --> 00:04:15.000
visitable or neighbors, in this case. So there may be a situation where you have directed

00:04:15.000 --> 00:04:20.000
arcs where they really imply a one-way directionality through the paths.

00:04:20.000 --> 00:04:22.000
Another one, lots and lots of

00:04:22.000 --> 00:04:23.000
arrows that

00:04:23.000 --> 00:04:25.000
tell you about the lifecycle of

00:04:25.000 --> 00:04:26.000


00:04:26.000 --> 00:04:30.000
Stanford relationships all captured on one

00:04:30.000 --> 00:04:31.000
slide

00:04:31.000 --> 00:04:34.000
in, sort of, the flowchart. So flowcharts have a lot of things about directionality. Like, you're

00:04:34.000 --> 00:04:37.000
in this state, and you can move to this state, so that's your neighboring state, and

00:04:37.000 --> 00:04:39.000
the arcs in there connect those things.

00:04:39.000 --> 00:04:43.000
And so, as you will see, like, you get to start over here and, like, be single and love

00:04:43.000 --> 00:04:47.000
it, and then you can, kind of, flirt with various degrees of seriousness,

00:04:47.000 --> 00:04:50.000
and then eventually there evolves a trombone and serenade, which, you know,

00:04:50.000 --> 00:04:53.000
you can't go back to flirting aimlessly after the trombone and serenade; there's no

00:04:53.000 --> 00:04:55.000
arc that direction.

00:04:55.000 --> 00:04:57.000
When somebody gets the trombone out,

00:04:57.000 --> 00:04:59.000
you know, you take them seriously,

00:04:59.000 --> 00:05:02.000
or you completely blow them off. There's the [inaudible] there.

00:05:02.000 --> 00:05:04.000
And then there's dating, and then there's

00:05:04.000 --> 00:05:08.000
two-phase commit, which is a very important part of any computer scientist's education,

00:05:08.000 --> 00:05:10.000
and then, potentially, other

00:05:10.000 --> 00:05:13.000
tracks of children and what not. But it does

00:05:13.000 --> 00:05:15.000
show you, kind of, places you can go, right?

00:05:15.000 --> 00:05:18.000
It turns out you actually don't get to go from flirt aimlessly to have baby, so

00:05:18.000 --> 00:05:20.000
write that down -

00:05:20.000 --> 00:05:23.000
not allowed in this graph. And

00:05:23.000 --> 00:05:28.000
then it turns out some things that you've already seen are actually graphs in disguise.

00:05:28.000 --> 00:05:32.000
I didn't tell you when I gave you the maze problem and its solution, the solving and

00:05:32.000 --> 00:05:33.000
building of a maze,

00:05:33.000 --> 00:05:37.000
that, in fact, what you're working on though is something that is a graph.

00:05:37.000 --> 00:05:39.000
That a perfect maze, right,

00:05:39.000 --> 00:05:40.000
so we started with this

00:05:40.000 --> 00:05:44.000
fully disconnected maze. If you remember how the Aldis/Broder algorithm worked, it's like

00:05:44.000 --> 00:05:47.000
you have all these nodes that have no connections between them,

00:05:47.000 --> 00:05:51.000
and then you went bopping around breaking down walls, which is effectively

00:05:51.000 --> 00:05:53.000
connecting up the nodes to each other,

00:05:53.000 --> 00:05:57.000
and we kept doing that until actually all of the nodes were connected. So, in effect, we

00:05:57.000 --> 00:06:00.000
were building a graph out of this big grid of cells

00:06:00.000 --> 00:06:03.000
by knocking down those walls to build the connections, and when we were done, then

00:06:03.000 --> 00:06:07.000
we made it so that it was possible to trace paths starting from that lower corner

00:06:07.000 --> 00:06:10.000
and working our way through the neighbors all the way over to there.

00:06:10.000 --> 00:06:13.000
So in the case of this, if you imagine that each of these cells was broken out in this little

00:06:13.000 --> 00:06:16.000
thing, there'd be these neighboring connections where this guy has this

00:06:16.000 --> 00:06:19.000
neighbor but no other ones because actually it has no

00:06:19.000 --> 00:06:22.000
other directions you can move from there, but some of them, for example like this

00:06:22.000 --> 00:06:23.000
one,

00:06:23.000 --> 00:06:26.000
has a full set of four neighbors you can reach

00:06:26.000 --> 00:06:28.000
depending on which walls are intact or not.

00:06:28.000 --> 00:06:29.000
And so the

00:06:29.000 --> 00:06:31.000
creation of that was creating a graph out

00:06:31.000 --> 00:06:34.000
of a set of disconnected nodes, and then solving it is searching that

00:06:34.000 --> 00:06:38.000
graph for a path. Here's one that is a little horrifying that came from the newspaper, which is social networking, kind of, a very big [inaudible] of, like, on the net who do you know, and who do they know, and how do those connections, like, linked in maybe help you get a job. This is one that was actually based on the liaisons, let's say, of a group of high school students, right, and it turns out there was a lot of more interconnected in that graph than I would expect, or appreciate, or approve of but interesting to, kind of, look at. So

00:06:38.000 --> 00:06:39.000
let's talk, like, concretely.

00:06:39.000 --> 00:06:43.000
How are we gonna make stuff work on graphs? Graphs turn out to actually - by showing you that, I'm trying to

00:06:43.000 --> 00:06:46.000
get you this idea that a lot of things are really just graphs in disguise. That a lot

00:06:46.000 --> 00:06:50.000
of the problems you may want to solve, if you can represent it as a graph, then things

00:06:50.000 --> 00:06:53.000
you learn about how to manipulate graphs will help you to solve that kind of

00:06:53.000 --> 00:06:54.000
problem.

00:06:54.000 --> 00:06:59.000
So the basic idea of a graph, it is a recursive data structure based on the node where

00:06:59.000 --> 00:07:02.000
nodes have connections to other nodes.

00:07:02.000 --> 00:07:04.000
Okay. So that's, kind of, like a tree,

00:07:04.000 --> 00:07:07.000
but it actually is more freeform than a tree, right?

00:07:07.000 --> 00:07:11.000
In a tree, right, there's that single root node that has, kind of, a special place, and

00:07:11.000 --> 00:07:15.000
there's this restriction about the connections that there's a single path

00:07:15.000 --> 00:07:17.000
from every other node in the tree

00:07:17.000 --> 00:07:20.000
starting from the root that leads to it. So you never can get there by

00:07:20.000 --> 00:07:24.000
multiple ways, and there's no cycles, and it's all connected, and things like

00:07:24.000 --> 00:07:25.000
that.

00:07:25.000 --> 00:07:26.000
So, in terms of a graph,

00:07:26.000 --> 00:07:29.000
it just doesn't place those restrictions. It says there's a node.

00:07:29.000 --> 00:07:32.000
It has any number of pointers to other nodes. It potentially could even

00:07:32.000 --> 00:07:36.000
have zero pointers to other's nodes, so it could be on a little island of its own out

00:07:36.000 --> 00:07:38.000
here that has no incoming or outgoing flights, right? You

00:07:38.000 --> 00:07:40.000
have to swim if you want to get there.

00:07:40.000 --> 00:07:42.000


00:07:42.000 --> 00:07:45.000
And then there's also not a rule that says that you can't have

00:07:45.000 --> 00:07:47.000
multiple ways to get to the same

00:07:47.000 --> 00:07:50.000
node. So if you're at B, and you're interested in getting to C,

00:07:50.000 --> 00:07:52.000
you could take the direction connection here,

00:07:52.000 --> 00:07:54.000
but you could also take a longer path

00:07:54.000 --> 00:07:58.000
that bops through A and hits to C, and so there's more than one way to get there,

00:07:58.000 --> 00:07:59.000


00:07:59.000 --> 00:08:02.000
which would not be true in a tree structure where it has that constraint,

00:08:02.000 --> 00:08:05.000
but that adds some interesting challenges to solving problems in the graphs

00:08:05.000 --> 00:08:08.000
because there actually are so many different ways you can

00:08:08.000 --> 00:08:11.000
get to the same place. We have to be careful avoiding

00:08:11.000 --> 00:08:13.000
getting into circularities where we're

00:08:13.000 --> 00:08:16.000
doing that, and also not redundantly doing work we've already done before because

00:08:16.000 --> 00:08:18.000
we've visited it previously.

00:08:18.000 --> 00:08:21.000
There can be cycles where

00:08:21.000 --> 00:08:22.000
you can

00:08:22.000 --> 00:08:24.000
completely come around. I this one, actually,

00:08:24.000 --> 00:08:26.000
doesn't have a cycle in it.

00:08:26.000 --> 00:08:27.000
Nope, it doesn't,

00:08:27.000 --> 00:08:31.000
but if I add it - I could add a link, for example, from

00:08:31.000 --> 00:08:34.000
C back to B, and then there would be this place you could just, kind of, keep

00:08:34.000 --> 00:08:36.000
rolling around in the B, A, C

00:08:36.000 --> 00:08:38.000
area.

00:08:38.000 --> 00:08:41.000
So we call each of those guys, the circles here, the nodes, right?

00:08:41.000 --> 00:08:44.000
The connection between them arcs. We

00:08:44.000 --> 00:08:47.000
will talk about the arcs as being directed and undirected in different situations

00:08:47.000 --> 00:08:51.000
where we imply that there's a directionality; you can travel the arc only one-way,

00:08:51.000 --> 00:08:54.000
or in cases where it's undirected, it means it goes both ways.

00:08:54.000 --> 00:08:56.000
We'll talk about two nodes being connected,

00:08:56.000 --> 00:08:58.000
meaning there is a direct connection, an arc, between them,

00:08:58.000 --> 00:09:02.000
and if they are not directly connected, there may possibly be a path

00:09:02.000 --> 00:09:05.000
between them. A sequence of arcs forms a path that you can follow to get

00:09:05.000 --> 00:09:07.000
from one node to another,

00:09:07.000 --> 00:09:11.000
and then cycles just meaning a path that revisits one of the nodes that

00:09:11.000 --> 00:09:14.000
was previously traveled on that path. So I've got

00:09:14.000 --> 00:09:16.000
a little terminology.

00:09:16.000 --> 00:09:18.000
Let's talk a little bit about how we're gonna represent it.

00:09:18.000 --> 00:09:22.000
So before we start making stuff happen on them, it's, like, well, how does this -

00:09:22.000 --> 00:09:26.000
in C++, right, what's the best way, or what are the reasonable ways you

00:09:26.000 --> 00:09:27.000
might represent

00:09:27.000 --> 00:09:30.000
this kind of connected structure?

00:09:30.000 --> 00:09:31.000
So if I have

00:09:31.000 --> 00:09:36.000
this four-node graph, A, B, C, D over where with the connections that are shown with

00:09:36.000 --> 00:09:38.000
direction, in this case, I have directed arcs.

00:09:38.000 --> 00:09:41.000
That one way to think of it is that what a

00:09:41.000 --> 00:09:45.000
graph really is is a set of nodes and a set of arcs, and by set, I mean, sort of, just the

00:09:45.000 --> 00:09:48.000
generalization. There might be a vector of arcs; there might be a vector of

00:09:48.000 --> 00:09:51.000
nodes, whatever, but some collection of arcs, some collection of nodes,

00:09:51.000 --> 00:09:53.000
and that the

00:09:53.000 --> 00:09:56.000
nodes might have information about what location they represent, or

00:09:56.000 --> 00:09:59.000
what word, or what entity, you know, in a social network that they are,

00:09:59.000 --> 00:10:03.000
and then the arcs would show who is directly connected to whom.

00:10:03.000 --> 00:10:06.000
And so one way to manage that, right, is just to have two independent collections - a collection

00:10:06.000 --> 00:10:08.000
of nodes and a

00:10:08.000 --> 00:10:11.000
collection of arcs, and that when I need to know things about connections, I'll have to, kind of, use both

00:10:11.000 --> 00:10:12.000
in tandem.

00:10:12.000 --> 00:10:14.000
So if I want to know if A is connected to B,

00:10:14.000 --> 00:10:18.000
I might actually have to just walk through the entire set of arcs trying to find

00:10:18.000 --> 00:10:22.000
one that connects A and B as its endpoints. If I want to see if

00:10:22.000 --> 00:10:25.000
C is connected to B, same thing, just walk down the whole set.

00:10:25.000 --> 00:10:27.000


00:10:27.000 --> 00:10:29.000


00:10:29.000 --> 00:10:31.000
That's probably the simplest thing to do, right, is to just, kind of,

00:10:31.000 --> 00:10:35.000
have them just be maintained as the two things that are used in conjunction.

00:10:35.000 --> 00:10:37.000
More likely, what you're gonna want to do

00:10:37.000 --> 00:10:39.000
is provide some access that

00:10:39.000 --> 00:10:42.000
when at a current node, you're currently trying to do some processing

00:10:42.000 --> 00:10:47.000
from the Node B, it would be convenient if you could easily access those nodes

00:10:47.000 --> 00:10:51.000
that emanate or outgo from the B node. That's typically the thing you're trying

00:10:51.000 --> 00:10:52.000
to do is

00:10:52.000 --> 00:10:55.000
at B try to visit its neighbors,

00:10:55.000 --> 00:10:58.000
and rather that having to, kind of, pluck them out of the entire collection,

00:10:58.000 --> 00:11:01.000
it might be handy if they were already, kind of, stored in such a way to make it

00:11:01.000 --> 00:11:03.000
easy for you to get to that.

00:11:03.000 --> 00:11:06.000
So the two ways that try to represent adjacency

00:11:06.000 --> 00:11:10.000
in a more efficient way are the adjacency list and the adjacency matrix.

00:11:10.000 --> 00:11:14.000
So in an adjacency list representation, we have a way of associating with each node,

00:11:14.000 --> 00:11:17.000
probably just by storing it in the node structure itself,

00:11:17.000 --> 00:11:21.000
those arcs that outgo from there. So, in this case, A has as direct

00:11:21.000 --> 00:11:22.000
connection to the B

00:11:22.000 --> 00:11:23.000
and C Nodes,

00:11:23.000 --> 00:11:26.000
and so I would have that information stored with the A Node, which is my

00:11:26.000 --> 00:11:29.000
vector or set of outgoing connections.

00:11:29.000 --> 00:11:33.000
Similarly, B has one outgoing connection, which goes to D. C has an outgoing connection to D, and

00:11:33.000 --> 00:11:37.000
then D has outgoing connections that lead back to A and

00:11:37.000 --> 00:11:40.000
to B. So at a particular node I could say, well, who's my neighbors?

00:11:40.000 --> 00:11:41.000
It would be

00:11:41.000 --> 00:11:46.000
easily accessible and not have to be retrieved, or searched, or filtered.

00:11:46.000 --> 00:11:48.000
Another way of doing that

00:11:48.000 --> 00:11:51.000
is to realize that, in a sense, what we have is this, kind of,

00:11:51.000 --> 00:11:53.000
end by end grid

00:11:53.000 --> 00:11:56.000
where each node is potentially connected to

00:11:56.000 --> 00:12:00.000
the other N -1 nodes in that graph. And so if I just build

00:12:00.000 --> 00:12:03.000
a 2x2 matrix that's labeled with

00:12:03.000 --> 00:12:06.000
all the node names on this side and all the node names across the top, that the

00:12:06.000 --> 00:12:08.000
intersection of this says

00:12:08.000 --> 00:12:11.000
is there an arc that leads away from A and

00:12:11.000 --> 00:12:14.000
ends at B? Yes, there is. Is there one from A to C?

00:12:14.000 --> 00:12:18.000
And in the places where there is not an existing connection, a self loop or

00:12:18.000 --> 00:12:19.000


00:12:19.000 --> 00:12:23.000
just a connection that doesn't exist, right, I have an empty slot. So maybe these would be Booleans

00:12:23.000 --> 00:12:25.000
true and false or something, or maybe there'd be some more information here

00:12:25.000 --> 00:12:27.000
about the distance

00:12:27.000 --> 00:12:30.000
or other associated arc properties for that

00:12:30.000 --> 00:12:31.000
particular connection.

00:12:31.000 --> 00:12:34.000
In this case, it's actually very easy then to actually do the really quick

00:12:34.000 --> 00:12:35.000
search of

00:12:35.000 --> 00:12:38.000
I'm at A and I want to know if I can get to B, then it's really just a

00:12:38.000 --> 00:12:41.000
matter of reaching right into the

00:12:41.000 --> 00:12:45.000
slot in constant time in that matrix to pull it out, and

00:12:45.000 --> 00:12:47.000
so

00:12:47.000 --> 00:12:49.000
this one involves a lot of searching to find things. This one involves, perhaps, a

00:12:49.000 --> 00:12:52.000
little bit of searching. If I want to know if A is connected to B, I might have to look

00:12:52.000 --> 00:12:55.000
through all its outgoing connections, right? This one gives me that direct

00:12:55.000 --> 00:12:56.000
access,

00:12:56.000 --> 00:13:00.000
but the tradeoffs here has to do with where the space starts building up, right?

00:13:00.000 --> 00:13:03.000
A full NxN matrix could be very big,

00:13:03.000 --> 00:13:06.000
and sometimes we're allocating space as though there are a full set of

00:13:06.000 --> 00:13:09.000
connections between all the nodes, every node connected to every other,

00:13:09.000 --> 00:13:12.000
and in the cases where a lot of those things are false, there's a lot of

00:13:12.000 --> 00:13:13.000
capacity

00:13:13.000 --> 00:13:16.000
that we have set aside that we're not really tapping into.

00:13:16.000 --> 00:13:19.000
And so in the case of a graph that we call dense,

00:13:19.000 --> 00:13:23.000
where there's a lot of connections, it will use a lot of that capacity, and it might make sense

00:13:23.000 --> 00:13:27.000
in a graph that's more sparse, where there's a few connections. So you have

00:13:27.000 --> 00:13:31.000
thousands of nodes, but maybe only three or four on average are connected to one.

00:13:31.000 --> 00:13:35.000
Then having allocated 1,000 slots of which to only fill in three

00:13:35.000 --> 00:13:36.000
might become

00:13:36.000 --> 00:13:40.000
rather space inefficient. You might prefer an adjacency list representation that

00:13:40.000 --> 00:13:41.000
can get you

00:13:41.000 --> 00:13:44.000
to the ones you're actually using versus the other,

00:13:44.000 --> 00:13:47.000
but in terms of the adjacency matrices is very fast,

00:13:47.000 --> 00:13:48.000
right? A lot of space thrown at this

00:13:48.000 --> 00:13:51.000
gives you this immediate 0,1 access to know

00:13:51.000 --> 00:13:55.000
who your neighbors are. So

00:13:55.000 --> 00:13:59.000
we're gonna use the adjacency list, kind of, strategy here

00:13:59.000 --> 00:14:02.000
for the code we're gonna do today, and it's, kind

00:14:02.000 --> 00:14:07.000
of, a good compromise between the two alternatives that we've seen there.

00:14:07.000 --> 00:14:08.000
So I have a struct node.

00:14:08.000 --> 00:14:11.000
It has some information for the node. Given that I'm not being very specific

00:14:11.000 --> 00:14:14.000
about what the graph is, I'm just gonna, kind of, leave that unspecified. It might be that it's

00:14:14.000 --> 00:14:17.000
the city name. It might be that it's a word. It might be that it's a person in a

00:14:17.000 --> 00:14:19.000
social network, you know,

00:14:19.000 --> 00:14:23.000
some information that represents what this node is an entity,

00:14:23.000 --> 00:14:28.000
and then there's gonna be a set of arcs or a set of nodes it's connected to,

00:14:28.000 --> 00:14:32.000
and so I'm using vector here. I could be using set. I could be using a

00:14:32.000 --> 00:14:35.000
raw array, or a link list, or any of these things.

00:14:35.000 --> 00:14:38.000
I need to use some unbounded collection though because there's no guarantee that

00:14:38.000 --> 00:14:42.000
they will be 0, or 1, or 2 the way there is in a link list or a tree where there's a

00:14:42.000 --> 00:14:45.000
specified number of outgoing links. There can be any number of them, so I'm just leaving

00:14:45.000 --> 00:14:46.000


00:14:46.000 --> 00:14:47.000
a

00:14:47.000 --> 00:14:51.000
variable size collection here to do that work for me.

00:14:51.000 --> 00:14:52.000
The graph itself,

00:14:52.000 --> 00:14:55.000
so I have this idea of all the nodes in the graph, right, would each get a new

00:14:55.000 --> 00:14:58.000
node structure and then be wired up to the other nodes that they are connected

00:14:58.000 --> 00:15:00.000
to,

00:15:00.000 --> 00:15:02.000
but then when I'm trying to operate on that graph,

00:15:02.000 --> 00:15:06.000
I can't just take one pointer and say here's a pointer to some node in the

00:15:06.000 --> 00:15:09.000
graph and say this is - and from here you have accessed everything. In the way

00:15:09.000 --> 00:15:11.000
that a tree,

00:15:11.000 --> 00:15:13.000
the only thing we need to keep track of is the pointer to the root,

00:15:13.000 --> 00:15:17.000
or a link list, the only thing we keep a pointer to, typically, is that frontmost

00:15:17.000 --> 00:15:19.000
cell, and from there, we can reach everything else.

00:15:19.000 --> 00:15:23.000
There is no special head or root cell in a graph.

00:15:23.000 --> 00:15:27.000
A graph is this

00:15:27.000 --> 00:15:30.000
being, kind of, a crazy collection without a lot of rules about

00:15:30.000 --> 00:15:35.000
how they are connected. In fact, it's not even guaranteed that they totally are

00:15:35.000 --> 00:15:36.000
connected.

00:15:36.000 --> 00:15:38.000
That I can't

00:15:38.000 --> 00:15:41.000
guarantee that if I had a pointer, for example, to C,

00:15:41.000 --> 00:15:44.000
right, I may or may not be able to reach all the nodes from there.

00:15:44.000 --> 00:15:49.000
If I had a pointer to A, right, A can get to C and B, and then also

00:15:49.000 --> 00:15:52.000
down to D, but, for example, there could just be an E node over here that was only

00:15:52.000 --> 00:15:55.000
connected to C or not connected to anything at all for that matter, connected

00:15:55.000 --> 00:15:57.000
to F in their own little island.

00:15:57.000 --> 00:16:01.000
That there is no way to identify some special node from which you could reach

00:16:01.000 --> 00:16:02.000
all the nodes.

00:16:02.000 --> 00:16:04.000
There isn't a special root node,

00:16:04.000 --> 00:16:07.000
and you can't even just arbitrarily pick one to be the root because actually there's no

00:16:07.000 --> 00:16:10.000
guarantee that it will be connected to all of the others. So it would not give you

00:16:10.000 --> 00:16:13.000
access to the full entirety of your graph if you just picked on and said I want anything

00:16:13.000 --> 00:16:16.000
reachable from here, it might not get you the whole thing. So,

00:16:16.000 --> 00:16:18.000
typically, what you need to really operate on is

00:16:18.000 --> 00:16:21.000
the entire collection of node stars. You'll have a set or a vector

00:16:21.000 --> 00:16:24.000
that contains all of the nodes that are in the graph,

00:16:24.000 --> 00:16:26.000
and then within that, there's a bunch of other connections that are

00:16:26.000 --> 00:16:35.000
being made on the graph connectivity that you're also exploring.

00:16:35.000 --> 00:16:37.000
So let's make it a little bit more concrete

00:16:37.000 --> 00:16:39.000


00:16:39.000 --> 00:16:40.000
and talk a little bit about

00:16:40.000 --> 00:16:43.000
what really is a node; what really is an arc?

00:16:43.000 --> 00:16:46.000
There may be, actually, be some more information I need to store than just a

00:16:46.000 --> 00:16:48.000
node is connected to other nodes.

00:16:48.000 --> 00:16:51.000
So in the case of a list or a tree,

00:16:51.000 --> 00:16:54.000
the next field and the left and the right field are really just pointers for the

00:16:54.000 --> 00:16:56.000
purpose of organization.

00:16:56.000 --> 00:16:58.000
That's the pointer to the next cell.

00:16:58.000 --> 00:17:02.000
There's no data that is really associated with that link itself.

00:17:02.000 --> 00:17:05.000
That's not true in the case of a graph.

00:17:05.000 --> 00:17:09.000
That often what the links that connect up the nodes

00:17:09.000 --> 00:17:12.000
actually do have a real role to play in terms of being data.

00:17:12.000 --> 00:17:14.000
It may be that what they tell you is

00:17:14.000 --> 00:17:18.000
what road you're taking here, or how long this path is, or how much this

00:17:18.000 --> 00:17:22.000
flight costs, or what time this flight leaves, or how full this flight already is,

00:17:22.000 --> 00:17:24.000
or who knows, you know, just other information

00:17:24.000 --> 00:17:27.000
that is more than just this nodes connected to that one. It's, like, how is it

00:17:27.000 --> 00:17:30.000
connected; what is it connected by, and what are the details of how that

00:17:30.000 --> 00:17:31.000
connection

00:17:31.000 --> 00:17:32.000


00:17:32.000 --> 00:17:34.000
is being represented?

00:17:34.000 --> 00:17:38.000
So it's likely that what you really want is not just pointers to other nodes but

00:17:38.000 --> 00:17:42.000
information about that actual link stored with an arc structure.

00:17:42.000 --> 00:17:44.000
So, typically, we're gonna have

00:17:44.000 --> 00:17:45.000
an arc

00:17:45.000 --> 00:17:48.000
which has information about that arc, how long, how much it costs, you know,

00:17:48.000 --> 00:17:50.000
what flight number it is,

00:17:50.000 --> 00:17:52.000
that will have pointers to the start and end node that it is connecting.

00:17:52.000 --> 00:17:54.000
The node

00:17:54.000 --> 00:17:57.000
then has a collection of arcs,

00:17:57.000 --> 00:18:00.000
and those arcs are expected to, in the adjacency list form, will all be arcs

00:18:00.000 --> 00:18:02.000
that start at this node.

00:18:02.000 --> 00:18:05.000
So their start node is equal to the one that's holding onto to them, and then

00:18:05.000 --> 00:18:06.000
they have an end

00:18:06.000 --> 00:18:10.000
pointer that points to the node at the other end of the arc. Well,

00:18:10.000 --> 00:18:13.000
this gets us into a little C++ bind

00:18:13.000 --> 00:18:16.000
because I've described these data structures

00:18:16.000 --> 00:18:17.000
in a way

00:18:17.000 --> 00:18:22.000
that they both depend on each other; that an arc has pointers to the start and the end

00:18:22.000 --> 00:18:22.000
node.

00:18:22.000 --> 00:18:27.000
A node has a collection of arcs, which is probably a vector or a set of arc pointers,

00:18:27.000 --> 00:18:28.000
and so

00:18:28.000 --> 00:18:31.000
I'm starting to define the structure for arc,

00:18:31.000 --> 00:18:33.000
and then I want to talk about node in it,

00:18:33.000 --> 00:18:35.000
and I think, oh, okay, well, then I better define node first because C++ always likes to

00:18:35.000 --> 00:18:38.000
see things in order. Remember, it always wants to see A,

00:18:38.000 --> 00:18:42.000
and then B if B uses A, and so in terms of how your functions and data

00:18:42.000 --> 00:18:45.000
structures work, you've always gotten this idea like, well, go do the one first. Well, if I say,

00:18:45.000 --> 00:18:48.000
okay, arc's gonna need node. I start to write arc, and I say, oh, I

00:18:48.000 --> 00:18:50.000
need to talk about node. Well, I better put node up here,

00:18:50.000 --> 00:18:54.000
right? But then when I'm starting to write node, I start talking about arc, and they have a

00:18:54.000 --> 00:18:57.000
circular reference to each other, right? They both

00:18:57.000 --> 00:19:00.000
want to depend on the other one before we've gotten around to telling the

00:19:00.000 --> 00:19:01.000
compiler about it.

00:19:01.000 --> 00:19:04.000
One of them has to go first, right? What are we gonna do?

00:19:04.000 --> 00:19:08.000
What we need to do is use the C++ forward reference

00:19:08.000 --> 00:19:09.000
as a

00:19:09.000 --> 00:19:12.000
little hint to the compiler

00:19:12.000 --> 00:19:13.000
that

00:19:13.000 --> 00:19:17.000
we are gonna be defining a Node T structure in a little bit;

00:19:17.000 --> 00:19:18.000
we're gonna get to it,

00:19:18.000 --> 00:19:22.000
but first we're gonna define the Arc T structure that uses that Node T, and then

00:19:22.000 --> 00:19:25.000
we're gonna get around to it. So we're gonna give it, like, a little heads up that says

00:19:25.000 --> 00:19:29.000
there later will be an Act 2 of this play. We're gonna introduce a

00:19:29.000 --> 00:19:31.000
character called Node T.

00:19:31.000 --> 00:19:33.000
So you can now

00:19:33.000 --> 00:19:34.000
start talking about Node

00:19:34.000 --> 00:19:37.000
T in some simple ways

00:19:37.000 --> 00:19:40.000
based on your agreement to the compiler that you plan on telling it more

00:19:40.000 --> 00:19:41.000
about Node T later.

00:19:41.000 --> 00:19:45.000
So the struct Node T says there will be a struct called Node T.

00:19:45.000 --> 00:19:47.000
Okay, the compiler says, okay, I'll write that down.

00:19:47.000 --> 00:19:51.000
You start defining struct Arc T, and it says, oh, I see you have these pointers

00:19:51.000 --> 00:19:54.000
to the node T. Okay. Well, you told me there'd be a struct like that; I guess that'll work out.

00:19:54.000 --> 00:19:58.000
And now you told it what struct Node T is. It's, like, oh, I have a vector

00:19:58.000 --> 00:19:59.000
with these pointers to arc,

00:19:59.000 --> 00:20:02.000
and then it says, okay, now I see the whole picture, and

00:20:02.000 --> 00:20:03.000
it all worked out.

00:20:03.000 --> 00:20:08.000
So it's just a little bit of a

00:20:08.000 --> 00:20:11.000
requirement of how the C++ compiler likes to see stuff in order without ever having

00:20:11.000 --> 00:20:16.000
to go backwards to check anything out again. Okay. So

00:20:16.000 --> 00:20:19.000
I got myself some node structures, got

00:20:19.000 --> 00:20:21.000
some arc structures; they're all

00:20:21.000 --> 00:20:22.000
working together.

00:20:22.000 --> 00:20:24.000
I'm gonna try to do some traversals.

00:20:24.000 --> 00:20:26.000
Try to do some operations that

00:20:26.000 --> 00:20:29.000
work their way around the graph.

00:20:29.000 --> 00:20:30.000
So

00:20:30.000 --> 00:20:33.000
tree traversals, right, the pre, and post, and in order

00:20:33.000 --> 00:20:36.000
are ways that you start at the root, and you visit all the nodes on the tree,

00:20:36.000 --> 00:20:40.000
a very common need to, kind of, process, to delete nodes,

00:20:40.000 --> 00:20:43.000
to print the nodes, to whatnot. So we might want to do the same thing on graphs. I've got a

00:20:43.000 --> 00:20:47.000
graph out there, and I'd like to go exploring it. Maybe I just want to print, and if

00:20:47.000 --> 00:20:50.000
I'm in the Gates building, what are all the places that I can reach from

00:20:50.000 --> 00:20:53.000
the Gates building? Where can I go to; what

00:20:53.000 --> 00:20:54.000
options do I have?

00:20:54.000 --> 00:20:57.000
What I'm gonna do is a traversal

00:20:57.000 --> 00:21:01.000
starting at a node and, kind of, working my way outward to the visible, reachable

00:21:01.000 --> 00:21:04.000
nodes that I can follow those paths to get there.

00:21:04.000 --> 00:21:07.000
We're gonna look at two different ways to do this.

00:21:07.000 --> 00:21:09.000
One is Depth-first and one is Breadth-first,

00:21:09.000 --> 00:21:13.000
and I'll show you both algorithms, and they

00:21:13.000 --> 00:21:15.000
will visit the same nodes;

00:21:15.000 --> 00:21:19.000
when all is done and said, they will visit all the reachable nodes starting from a

00:21:19.000 --> 00:21:20.000
position,

00:21:20.000 --> 00:21:21.000
but they will visit them in different order.

00:21:21.000 --> 00:21:25.000
So just like the pre/post in order tree traversals, they visit all the same

00:21:25.000 --> 00:21:28.000
notes. It's just a matter of when they get around to doing it

00:21:28.000 --> 00:21:29.000
is how

00:21:29.000 --> 00:21:32.000
we distinguish depth and breadth-first traversals.

00:21:32.000 --> 00:21:34.000
Because we have this graph structure

00:21:34.000 --> 00:21:40.000
that has this loose connectivity and the possibility of cycles and multiple paths

00:21:40.000 --> 00:21:41.000
to the same node,

00:21:41.000 --> 00:21:44.000
we are gonna have to be a little bit careful at how we do our work to make

00:21:44.000 --> 00:21:48.000
sure that we don't end up getting stuck in some infinite loop when we keep

00:21:48.000 --> 00:21:51.000
going around the ABC cycle in a particular graph.

00:21:51.000 --> 00:21:55.000
That we need to realize when we're revisiting something we've seen before,

00:21:55.000 --> 00:21:56.000
and then not

00:21:56.000 --> 00:21:57.000


00:21:57.000 --> 00:21:59.000
trigger, kind of, an

00:21:59.000 --> 00:22:03.000
exploration we've already done.

00:22:03.000 --> 00:22:06.000
So let's look at depth-first first.

00:22:06.000 --> 00:22:09.000
This is probably the simpler of the two. They're both, actually, pretty simple.

00:22:09.000 --> 00:22:13.000
Depth-first traversal uses recursion to do its work,

00:22:13.000 --> 00:22:18.000
and the idea is that you pick a starting node - let's say I want to start at Gates,

00:22:18.000 --> 00:22:21.000
and that maybe what I'm trying to find is all the reachable nodes from Gates.

00:22:21.000 --> 00:22:24.000
If I don't have a starting node, then I can just pick one arbitrarily from the

00:22:24.000 --> 00:22:27.000
graph because there is actually no root node of special status,

00:22:27.000 --> 00:22:30.000
and the idea is to go deep. That

00:22:30.000 --> 00:22:33.000
in the case of, for example, a maze, it might be that I

00:22:33.000 --> 00:22:36.000
choose to go north, and then I just keep going north. I just go north, north, north, north, north until

00:22:36.000 --> 00:22:40.000
I run into a dead wall, and then I say, oh, I have to go east, and I go east, east, east, east, east. The idea is to

00:22:40.000 --> 00:22:42.000
just go as far away from

00:22:42.000 --> 00:22:46.000
the original state as I can, just keep going outward, outward, outward, outward,

00:22:46.000 --> 00:22:47.000
outward, outward, outward,

00:22:47.000 --> 00:22:50.000
and eventually I'm gonna run into some dead end, and that dead end could be I

00:22:50.000 --> 00:22:52.000
get to a node that has no outgoing arcs,

00:22:52.000 --> 00:22:54.000
or it could be that I get to a node

00:22:54.000 --> 00:22:57.000
that whose only arcs come back to places I've already been,

00:22:57.000 --> 00:22:58.000
so it cycles back on itself, in

00:22:58.000 --> 00:23:01.000
which case, that also is really a dead end.

00:23:01.000 --> 00:23:03.000
So you go deep.

00:23:03.000 --> 00:23:06.000
You go north, and you see as far as you can get, and

00:23:06.000 --> 00:23:10.000
only after you've, kind of, fully explored everything reachable from starting in

00:23:10.000 --> 00:23:11.000
the north direction

00:23:11.000 --> 00:23:13.000
do you backtrack,

00:23:13.000 --> 00:23:16.000
unmake that decision, and say, well, how about not north; how about east, right? And then

00:23:16.000 --> 00:23:19.000
find everything you can get to from the east, and

00:23:19.000 --> 00:23:22.000
after, kind of, going all through that, come back and try south, and so on. So all of the

00:23:22.000 --> 00:23:23.000
options you have

00:23:23.000 --> 00:23:26.000
exhaustively getting through all your

00:23:26.000 --> 00:23:26.000
neighbors

00:23:26.000 --> 00:23:29.000
in this, kind of, go-deep strategy. Well,

00:23:29.000 --> 00:23:32.000
the important thing is you realize if you've got this recursion going, so what

00:23:32.000 --> 00:23:35.000
you're actually doing, basically, is saying I'm gonna step to my neighbor to the

00:23:35.000 --> 00:23:40.000
right, and I'm gonna explore all the things, so do a depth-first search from there

00:23:40.000 --> 00:23:43.000
to find all the places you can reach, and that one says, well, I'm gonna step through this spot

00:23:43.000 --> 00:23:45.000
and go to one of my neighbors.

00:23:45.000 --> 00:23:48.000
And so we can't do that infinitely without getting into trouble. We need

00:23:48.000 --> 00:23:53.000
to have a base case that stops that recursion, and the two cases that I've

00:23:53.000 --> 00:23:56.000
talked about - one is just when you have no neighbors left to go.

00:23:56.000 --> 00:23:57.000
So when you

00:23:57.000 --> 00:23:59.000


00:23:59.000 --> 00:24:04.000
get to a node that actually, kind of, is a dead end, or that neighbor only has visited

00:24:04.000 --> 00:24:05.000
neighbors

00:24:05.000 --> 00:24:09.000
that surround it. So a

00:24:09.000 --> 00:24:10.000
very simple little piece of

00:24:10.000 --> 00:24:13.000
code that depends on recursion doing its job.

00:24:13.000 --> 00:24:17.000
In this case, we need to know which nodes we have visited, right? We're gonna have to

00:24:17.000 --> 00:24:19.000
have some kind of strategy for saying I

00:24:19.000 --> 00:24:21.000
have been here before.

00:24:21.000 --> 00:24:24.000
In this case, I'm using just a set because that's an easy thing to do.

00:24:24.000 --> 00:24:27.000
I make a set of node pointers, and I update it

00:24:27.000 --> 00:24:29.000
each time I see a new node; I add it to

00:24:29.000 --> 00:24:34.000
that set, and if I ever get to a node who is already previously visited

00:24:34.000 --> 00:24:38.000
from that, there's no reason to do any work from there; we can immediately

00:24:38.000 --> 00:24:41.000
stop. So

00:24:41.000 --> 00:24:43.000
starting that step would be empty.

00:24:43.000 --> 00:24:46.000
If the visit already contains the node that we're currently trying to visit,

00:24:46.000 --> 00:24:50.000
then there's nothing to do. Otherwise, we go ahead and add it, and there's a little node

00:24:50.000 --> 00:24:53.000
here that says we'll do something would occur. What am I trying to do here? Am I trying to print those nodes? Am

00:24:53.000 --> 00:24:56.000
I trying to draw pictures on those nodes, highlight those nodes, you know,

00:24:56.000 --> 00:25:00.000
something I'm doing with them. I don't know what it is, but this is the structure for the depth-first search

00:25:00.000 --> 00:25:01.000
here, and

00:25:01.000 --> 00:25:02.000
then

00:25:02.000 --> 00:25:04.000
for all of the neighbors,

00:25:04.000 --> 00:25:07.000
using the vector of outgoing connections,

00:25:07.000 --> 00:25:08.000
then I run a depth-first search

00:25:08.000 --> 00:25:11.000
on each of those neighbors

00:25:11.000 --> 00:25:14.000
passing that visited set by reference

00:25:14.000 --> 00:25:16.000
through the whole operation. So I will

00:25:16.000 --> 00:25:20.000
always be adding to and modifying this one set, so I will

00:25:20.000 --> 00:25:24.000
be able to know where I am and where I still need to go.

00:25:24.000 --> 00:25:27.000
So if we trace this guy in action -

00:25:27.000 --> 00:25:31.000
if I start arbitrarily from A, I say I'd like to begin with A.

00:25:31.000 --> 00:25:35.000
Then the depth-first search is gonna pick a neighbor, and let's say I happen to just work

00:25:35.000 --> 00:25:38.000
over my neighbors in alphabetical order.

00:25:38.000 --> 00:25:41.000
It would say okay, well, I've picked B. Let's go to everywhere we can get to from B. So A

00:25:41.000 --> 00:25:42.000
gets marked as visited.

00:25:42.000 --> 00:25:44.000
I move onto B; I say, well,

00:25:44.000 --> 00:25:48.000
B, search everywhere you can to from B, and B says okay, well, I need to pick a

00:25:48.000 --> 00:25:50.000
neighbor, how about I pick

00:25:50.000 --> 00:25:51.000
C? And says

00:25:51.000 --> 00:25:53.000
and now,

00:25:53.000 --> 00:25:56.000
having visited C, go everywhere you can get to from C. Well,

00:25:56.000 --> 00:25:59.000
C has no neighbors, right, no outgoing connections whatsoever.

00:25:59.000 --> 00:26:02.000
So C says, well, okay, everywhere you can get to from C is C,

00:26:02.000 --> 00:26:04.000
you know, I'm done.

00:26:04.000 --> 00:26:09.000
That will cause it to backtrack to this place, to B, and B says okay, I explored everywhere I can get to

00:26:09.000 --> 00:26:10.000
from C;

00:26:10.000 --> 00:26:13.000
what other neighbors do I have? B has no other neighbors. So, in fact, B backtracks all the way

00:26:13.000 --> 00:26:15.000
back to A.

00:26:15.000 --> 00:26:19.000
So now A says okay, well, I went everywhere I can get to from B,

00:26:19.000 --> 00:26:20.000
right? Let me look at my next neighbor.

00:26:20.000 --> 00:26:23.000
My next neighbor is C. Ah,

00:26:23.000 --> 00:26:25.000
but C, already visited,

00:26:25.000 --> 00:26:27.000
so there's no reason to

00:26:27.000 --> 00:26:29.000
go crazy on that thing. So, in fact, I can

00:26:29.000 --> 00:26:33.000
immediately see that I've visited that, so I don't have any further path to look at

00:26:33.000 --> 00:26:33.000
there.

00:26:33.000 --> 00:26:35.000
I'll hit D then

00:26:35.000 --> 00:26:37.000
as the next one in sequence. So

00:26:37.000 --> 00:26:40.000
at this point, I've done everything reachable from the B arm, everything

00:26:40.000 --> 00:26:43.000
reachable from the C arm, and now I'm starting on the D arm. I said okay, where can I get to

00:26:43.000 --> 00:26:44.000
from D?

00:26:44.000 --> 00:26:46.000
Well, I can get to E. All right,

00:26:46.000 --> 00:26:48.000
where can I get to from E?

00:26:48.000 --> 00:26:52.000
E goes to F, so the idea is you think of it going deep. It's going as far away as it

00:26:52.000 --> 00:26:56.000
can, as deep as possible, before it, kind of, unwinds

00:26:56.000 --> 00:26:58.000
getting to the F that has no neighbors

00:26:58.000 --> 00:27:01.000
and saying okay. Well, how about - where can I get to

00:27:01.000 --> 00:27:04.000
also from E? I can get to G.

00:27:04.000 --> 00:27:07.000
From G where can I get to? Nowhere,

00:27:07.000 --> 00:27:09.000
so we, kind of, unwind the

00:27:09.000 --> 00:27:12.000
D arm, and then we will eventually get back to and hit the H.

00:27:12.000 --> 00:27:16.000
So the order they actually were hit in this case happened to be alphabetical. I

00:27:16.000 --> 00:27:18.000
assigned them such that that would come out that way.

00:27:18.000 --> 00:27:21.000
That's not really a property of what depth-first search does,

00:27:21.000 --> 00:27:24.000
but think of it as like it's going as deep as possible, as far away as possible,

00:27:24.000 --> 00:27:28.000
and so it makes a choice - it's pretty much like the recursive backtrackers that we've seen. In

00:27:28.000 --> 00:27:31.000
fact, they make a choice, commit to it, and just move forward on it, never, kind of,

00:27:31.000 --> 00:27:35.000
giving a backwards glance, and only if the whole process

00:27:35.000 --> 00:27:35.000
bottoms out

00:27:35.000 --> 00:27:39.000
does it come back and start unmaking decisions and try alternatives,

00:27:39.000 --> 00:27:43.000
eventually exploring everything reachable from A

00:27:43.000 --> 00:27:46.000
when all is done and said. So if

00:27:46.000 --> 00:27:49.000
you, for example, use that on the maze,

00:27:49.000 --> 00:27:52.000
right, the depth-first search on a maze is very much the same strategy we're talking about

00:27:52.000 --> 00:27:53.000
here.

00:27:53.000 --> 00:27:56.000
Instead of the way we did it was actually breadth-first search, if you

00:27:56.000 --> 00:27:57.000
remember back to that assignment,

00:27:57.000 --> 00:28:00.000
the depth-first search alternative is actually doing it the -

00:28:00.000 --> 00:28:01.000
just go deep,

00:28:01.000 --> 00:28:04.000
make a choice, go with it, run all the way until it bottoms out,

00:28:04.000 --> 00:28:12.000
and only then back up and try some other alternatives.

00:28:12.000 --> 00:28:15.000
The breadth-first traversal, gonna hit

00:28:15.000 --> 00:28:16.000
the same nodes

00:28:16.000 --> 00:28:19.000
but just gonna hit them in a different way.

00:28:19.000 --> 00:28:21.000
If my goal were to actually hit all of them,

00:28:21.000 --> 00:28:25.000
then either of them is gonna be a fine strategy. There may be some reason why

00:28:25.000 --> 00:28:25.000


00:28:25.000 --> 00:28:29.000
I'm hoping to stop the search early, and that actually might dictate why I would prefer

00:28:29.000 --> 00:28:31.000
which ones to look at first.

00:28:31.000 --> 00:28:35.000
The breadth-first traversal is going to explore things in terms of distance,

00:28:35.000 --> 00:28:40.000
in this case, expressed in terms of number of hops away from the starting

00:28:40.000 --> 00:28:40.000
node.

00:28:40.000 --> 00:28:44.000
So it's gonna look at everything that's one hop away, and then go back and look at

00:28:44.000 --> 00:28:46.000
everything two hops away, and then three hops away.

00:28:46.000 --> 00:28:49.000
And so if the goal, for example, was to find a node

00:28:49.000 --> 00:28:51.000
in a shortest path connection between it,

00:28:51.000 --> 00:28:55.000
then breadth-first search, by looking at all the ones one hop away, if it was a one hop

00:28:55.000 --> 00:28:56.000
path, it'll find it,

00:28:56.000 --> 00:28:59.000
and if it's a two hop path, it'll find it on that next round, and then the three hop

00:28:59.000 --> 00:29:00.000
path,

00:29:00.000 --> 00:29:02.000
and there might be paths that are 10, and 20, and 100

00:29:02.000 --> 00:29:03.000
steps long,

00:29:03.000 --> 00:29:06.000
but if I find it earlier, I won't go looking at them. It actually might prove to

00:29:06.000 --> 00:29:09.000
be a more efficient way to get to the shortest solution as opposed to depth-first

00:29:09.000 --> 00:29:11.000
search which go off and try all these deep

00:29:11.000 --> 00:29:14.000
paths that may not ever end up where I want it to be

00:29:14.000 --> 00:29:16.000
before it eventually made its way to the short

00:29:16.000 --> 00:29:20.000
solution I was wanting to find. So the thing

00:29:20.000 --> 00:29:21.000
is we have the starting node.

00:29:21.000 --> 00:29:24.000
We're gonna visit all the immediate neighbors.

00:29:24.000 --> 00:29:27.000
So, basically, a loop over the

00:29:27.000 --> 00:29:28.000
immediate one,

00:29:28.000 --> 00:29:31.000
and then while we're doing that, we're gonna actually, kind of, be gathering up that

00:29:31.000 --> 00:29:33.000
next generation.

00:29:33.000 --> 00:29:33.000


00:29:33.000 --> 00:29:37.000
Sometimes that's called the Frontier, sort of, advancing the frontier

00:29:37.000 --> 00:29:39.000
of the nodes to explore next,

00:29:39.000 --> 00:29:42.000
so all the nodes that are two hops away,

00:29:42.000 --> 00:29:44.000
and then while we're processing the ones that are two hops away, we're gonna

00:29:44.000 --> 00:29:47.000
be gathering the frontier

00:29:47.000 --> 00:29:50.000
that is three hops away. And so at each stage we'll be, kind of, processing a

00:29:50.000 --> 00:29:54.000
generation but while assembling the generation N+1

00:29:54.000 --> 00:29:58.000
in case we need to go further to find those things.

00:29:58.000 --> 00:30:00.000
We'll use an auxiliary data structure during

00:30:00.000 --> 00:30:04.000
this. That management of the frontier, keeping track of the nodes that are, kind

00:30:04.000 --> 00:30:05.000
of,

00:30:05.000 --> 00:30:07.000
staged for the

00:30:07.000 --> 00:30:09.000
subsequent iterations,

00:30:09.000 --> 00:30:12.000
the queue is a perfect data structure for that. If I put the

00:30:12.000 --> 00:30:14.000
initial neighbors into a queue,

00:30:14.000 --> 00:30:19.000
as I dequeue them, if I put their neighbors on the back of the queue, so enqueue them

00:30:19.000 --> 00:30:21.000
behind all the one hop neighbors, then

00:30:21.000 --> 00:30:24.000
if I do that with code, as I'm processing Generation 1 off the front of the queue, I'll be

00:30:24.000 --> 00:30:27.000
stacking Generation 2 in the back of the queue,

00:30:27.000 --> 00:30:30.000
and then when all of Generation 1 is exhausted, I'll be processing Generation

00:30:30.000 --> 00:30:31.000
2,

00:30:31.000 --> 00:30:35.000
but, kind of, enqueuing Generation 3 behind it.

00:30:35.000 --> 00:30:39.000
Same deal that we had with depth-first search though is we will have to be

00:30:39.000 --> 00:30:41.000
careful about cycles will pull past that

00:30:41.000 --> 00:30:44.000
when there's more than one possibility of how we get to something, we don't

00:30:44.000 --> 00:30:46.000
want to, kind of, keep going around that cycle

00:30:46.000 --> 00:30:52.000
or do a lot of redundant work visiting nodes that we've already seen. So

00:30:52.000 --> 00:30:55.000
I have the same idea of keeping track of a set

00:30:55.000 --> 00:30:57.000
that knows what

00:30:57.000 --> 00:31:00.000
we've previously visited,

00:31:00.000 --> 00:31:01.000
and the

00:31:01.000 --> 00:31:05.000
initial node gets enqueued just, kind of, by itself, just like Generation 0 gets

00:31:05.000 --> 00:31:07.000
put into the queue,

00:31:07.000 --> 00:31:10.000
and then while the queue is not empty, so while there's still neighbors I haven't

00:31:10.000 --> 00:31:11.000
yet visited,

00:31:11.000 --> 00:31:13.000
I dequeue the frontmost one,

00:31:13.000 --> 00:31:16.000
if it hasn't already been visited on some previous iteration, then I mark it as

00:31:16.000 --> 00:31:17.000
visited,

00:31:17.000 --> 00:31:22.000
and then I enqueue all of its children or neighbors at the back of the queue.

00:31:22.000 --> 00:31:26.000
And so, as of right now, I'm processing Generation 0, that means all of Generation 1, so the

00:31:26.000 --> 00:31:29.000
neighbors that are one hop away, get placed in the queue, and then

00:31:29.000 --> 00:31:34.000
the next iteration coming out here will pull off a Generation 1,

00:31:34.000 --> 00:31:37.000
and then enqueue all of these two hop neighbors.

00:31:37.000 --> 00:31:40.000
Come back and other Generation 1's will get pulled off, enqueue all their two hop

00:31:40.000 --> 00:31:44.000
neighbors, and so the two hop generation will, kind of, be built up behind, and then

00:31:44.000 --> 00:31:48.000
once the last of the one hop generation gets managed,

00:31:48.000 --> 00:31:51.000
start processing the Generation 2 and so on.

00:31:51.000 --> 00:31:55.000
And so this will keep going while the queue has some

00:31:55.000 --> 00:31:58.000
contents in there. So that means some neighbors

00:31:58.000 --> 00:32:01.000
we have connections to that we haven't yet had a chance to explore, and

00:32:01.000 --> 00:32:05.000
either we will find out that they have all been visited, so this will end when

00:32:05.000 --> 00:32:09.000
all of the nodes that are in the queue have been visited or

00:32:09.000 --> 00:32:14.000
there are no more neighbors to enqueue. So we've gotten to those dead ends or those cycles

00:32:14.000 --> 00:32:25.000
that mean we've reach everything that was reachable from that start node.

00:32:25.000 --> 00:32:28.000
And so I've traced this guy

00:32:28.000 --> 00:32:30.000
to see it doing its work.

00:32:30.000 --> 00:32:33.000
So if I start again with A,

00:32:33.000 --> 00:32:36.000
and, again, assuming that I'm gonna process my nodes

00:32:36.000 --> 00:32:40.000
in alphabetical order when I have children, so I will through and I

00:32:40.000 --> 00:32:44.000
will enqueue, so the queue right now has just node A in it. I pull it off. I say

00:32:44.000 --> 00:32:45.000
have I visited? No, I have not.

00:32:45.000 --> 00:32:47.000
So then I mark it as visited,

00:32:47.000 --> 00:32:52.000
and now for each of its neighboring nodes, B, C, D, H, I'm gonna put those into

00:32:52.000 --> 00:32:54.000
the queue,

00:32:54.000 --> 00:32:57.000
and so then they're loaded up, and then I come back around, I say do I still have stuff in the queue? I

00:32:57.000 --> 00:33:01.000
do, so let's take out the first thing that was put in; it was B. Okay,

00:33:01.000 --> 00:33:05.000
and I'll say okay, well, where can you get to from B? You can get to C, so I'm gonna put C in the queue. Now,

00:33:05.000 --> 00:33:09.000
C is actually already in the queue, but we're gonna pull it out

00:33:09.000 --> 00:33:12.000
earlier in terms it'll get handled by the

00:33:12.000 --> 00:33:14.000
visiting status later.

00:33:14.000 --> 00:33:16.000
So, okay, we'll put C, D, and E in

00:33:16.000 --> 00:33:20.000
and then put C behind it. So right now, we'll have C, D, H, and C again.

00:33:20.000 --> 00:33:24.000
It'll find the C that's in the front there. It'll say where can you get to from C?

00:33:24.000 --> 00:33:24.000
Nowhere,

00:33:24.000 --> 00:33:26.000
so no additional

00:33:26.000 --> 00:33:28.000
Generation 2's are added by that one.

00:33:28.000 --> 00:33:31.000
I'll look at D, and I'll say okay, well, what Generation 2's do we have from D? It says, well, I can

00:33:31.000 --> 00:33:32.000
get to E. All right, so put E on the

00:33:32.000 --> 00:33:34.000
back of the

00:33:34.000 --> 00:33:36.000


00:33:36.000 --> 00:33:39.000
queue. Look at H; where can H get to? H can get to G,

00:33:39.000 --> 00:33:42.000
so go ahead and put G on the back of the queue.

00:33:42.000 --> 00:33:45.000
So now, it comes back around to the things that are two hops away.

00:33:45.000 --> 00:33:47.000
The first one it's gonna pull off is C. It's

00:33:47.000 --> 00:33:49.000
gonna say, well, C is already visited,

00:33:49.000 --> 00:33:53.000
so there's no need to redundantly visit that or do anything with that.

00:33:53.000 --> 00:33:54.000
It passes over C. It

00:33:54.000 --> 00:33:56.000
finds the E

00:33:56.000 --> 00:33:58.000
that's behind that,

00:33:58.000 --> 00:34:00.000
and then it finds the G that was behind the

00:34:00.000 --> 00:34:03.000
H. The E also enqueued the F

00:34:03.000 --> 00:34:07.000
that was four hops, and the very last one, right, output will be that F that was

00:34:07.000 --> 00:34:09.000
only reachable by taking

00:34:09.000 --> 00:34:12.000
three hops away from the thing. So this one's, kind of,

00:34:12.000 --> 00:34:13.000
working radially. The

00:34:13.000 --> 00:34:16.000
idea that I'm looking at all the things that I can get to, kind of, in one hop

00:34:16.000 --> 00:34:19.000
are my first generation, then two hops, then three hops,

00:34:19.000 --> 00:34:23.000
and so it's like throwing a stone in the water and watching it ripple out

00:34:23.000 --> 00:34:26.000
that the visiting

00:34:26.000 --> 00:34:29.000
is managed in terms of, kind of, growing length of path,

00:34:29.000 --> 00:34:32.000
and number of hops that it took to get there.

00:34:32.000 --> 00:34:37.000
So this is the strategy that you used for maze solving, if you remember.

00:34:37.000 --> 00:34:37.000
That

00:34:37.000 --> 00:34:42.000
what you were tracking was the queue of paths where the path was

00:34:42.000 --> 00:34:44.000
AB, or ADC, or

00:34:44.000 --> 00:34:46.000
ADE, and that they grew

00:34:46.000 --> 00:34:47.000


00:34:47.000 --> 00:34:50.000
step wise. That, kind of, each iteration through the

00:34:50.000 --> 00:34:54.000
traversal there where you're saying, okay, well, I've seen all the paths that are one hop. Now I've seen all the ones of two,

00:34:54.000 --> 00:34:56.000
now the ones three, and four, and five,

00:34:56.000 --> 00:34:58.000
and when you keep doing that, eventually you get to the hop,

00:34:58.000 --> 00:35:01.000
you know, 75, which leads you

00:35:01.000 --> 00:35:05.000
all the way to the goal, and since you have looked at all the paths that were

00:35:05.000 --> 00:35:09.000
74 and shorter, the first path that you dequeue that leads to your goal

00:35:09.000 --> 00:35:12.000
must be the best or the shortest overall

00:35:12.000 --> 00:35:13.000
because

00:35:13.000 --> 00:35:15.000
of the order you processed them.

00:35:15.000 --> 00:35:18.000
That's not true in depth-first search, right? That depth-first search might find a much

00:35:18.000 --> 00:35:21.000
longer and more circuitous path that led to the goal

00:35:21.000 --> 00:35:25.000
when there is a shorter one that would be eventually found if I had a graph that

00:35:25.000 --> 00:35:28.000
looked,

00:35:28.000 --> 00:35:36.000
you know, had this long path, let's say,

00:35:36.000 --> 00:35:41.000
and so this is goal node, let's say, and this is the start node that

00:35:41.000 --> 00:35:44.000
there could be a two hop path, kind of, off to this angle,

00:35:44.000 --> 00:35:47.000
but if this was the first one explored in the depth-first, right,

00:35:47.000 --> 00:35:50.000
it could eventually, kind of, work its way all the way around through this

00:35:50.000 --> 00:35:52.000
eight hop path, and say, okay, well, I found it, right?

00:35:52.000 --> 00:35:55.000
It would be the first one we found, but there's no guarantee that would be the

00:35:55.000 --> 00:35:58.000
shortest in depth-first search, right? It would just happen to be the one that based on

00:35:58.000 --> 00:36:01.000
its arbitrary choice of which

00:36:01.000 --> 00:36:04.000
neighbors to pursue it happened to get to that there could be a shorter

00:36:04.000 --> 00:36:08.000
one that would be found later in the traversal. That's not true in

00:36:08.000 --> 00:36:12.000
the breadth-first search strategy because it will be looking at this, and then these two,

00:36:12.000 --> 00:36:16.000
and then these two, and, kind of, working its way outward down those edges.

00:36:16.000 --> 00:36:19.000
So as soon as we get to one that hits the goal, we

00:36:19.000 --> 00:36:25.000
know we have found the shortest we can have. Question?

00:36:25.000 --> 00:36:28.000
In that case,

00:36:28.000 --> 00:36:33.000
you could return, okay, we need two

00:36:33.000 --> 00:36:34.000
jumps to get to

00:36:34.000 --> 00:36:35.000
the goal,

00:36:35.000 --> 00:36:36.000
like, as far

00:36:36.000 --> 00:36:40.000
winning the goal. Yeah. How do we remember which path led us there?

00:36:40.000 --> 00:36:42.000
So if you were really doing that, you know, in terms of depth-first search, what

00:36:42.000 --> 00:36:46.000
you'd probably be tracking is what's the path? Probably a stack that said here's the stack

00:36:46.000 --> 00:36:49.000
that led to that path. Let me hold

00:36:49.000 --> 00:36:53.000
onto this, right, and I will compare it to these alternatives. So I'll try it on Neighbor 1,

00:36:53.000 --> 00:36:54.000
get the stack back from that; it's the best.

00:36:54.000 --> 00:36:58.000
Try it on Neighbor 2, get that stack back, see which one is smaller and use

00:36:58.000 --> 00:37:01.000
that as the better choice, and so, eventually, when I tried it out this side, I'd get

00:37:01.000 --> 00:37:03.000
the stack back that was just two,

00:37:03.000 --> 00:37:06.000
and I could say, yeah, that was better than this ten step path I had over here, so I

00:37:06.000 --> 00:37:09.000
will prefer that. So just using your standard, kind of, backtracking with

00:37:09.000 --> 00:37:13.000
using your return values, incorporating them, and deciding which was better.

00:37:13.000 --> 00:37:14.000
And so in this case,

00:37:14.000 --> 00:37:16.000
the depth-first search

00:37:16.000 --> 00:37:19.000
can't really be pruned very easily

00:37:19.000 --> 00:37:24.000
because it has to, kind of, explore the whole thing to know that this

00:37:24.000 --> 00:37:26.000
short little path right over here happened to be just

00:37:26.000 --> 00:37:29.000
arbitrarily examined last, whereas, the breadth-first search actually does, kind of, because

00:37:29.000 --> 00:37:31.000
it's prioritizing by

00:37:31.000 --> 00:37:35.000
order of length. If the goal was shortest path, it will find it sooner and be able to

00:37:35.000 --> 00:37:35.000
stop

00:37:35.000 --> 00:37:39.000
when it found it rather than look at these longer paths. So it won't end up -

00:37:39.000 --> 00:37:42.000
if there were hundreds of other paths over here, breadth-first search won't actually

00:37:42.000 --> 00:37:47.000
have to ever reach them or explore them.

00:37:47.000 --> 00:37:52.000
So there are a lot of things that actually end up being graph search in disguise. That's why

00:37:52.000 --> 00:37:54.000
a, kind of, a graph is a great data structure to, kind of, get to in

00:37:54.000 --> 00:37:57.000
106b because there are lots of problems

00:37:57.000 --> 00:38:01.000
that, in the end, if you can model them using the same

00:38:01.000 --> 00:38:03.000
setup of nodes, and arcs, and traversals,

00:38:03.000 --> 00:38:06.000
that you can answer interesting questions about that data by applying

00:38:06.000 --> 00:38:09.000
the same graph search techniques.

00:38:09.000 --> 00:38:12.000
So as long as you want to know things like, well, which nodes are reachable from this node at

00:38:12.000 --> 00:38:12.000
all?

00:38:12.000 --> 00:38:14.000
And then

00:38:14.000 --> 00:38:17.000
that just means, okay, I'm here and I want to get to these places; what places could

00:38:17.000 --> 00:38:18.000
I get to? Or

00:38:18.000 --> 00:38:21.000
knowing what courses I could take

00:38:21.000 --> 00:38:25.000
that would lead me to a CS degree, it's, kind of, like well, looking for the ones

00:38:25.000 --> 00:38:29.000
that lead to having all your graduation requirements

00:38:29.000 --> 00:38:32.000
satisfied, and each of those arcs could be a course you could take, right, how close did

00:38:32.000 --> 00:38:35.000
that get you to the goal? How can I get to the

00:38:35.000 --> 00:38:40.000
place I'd like to be, or what are the places that this course satisfies

00:38:40.000 --> 00:38:41.000
requirements that lead to?

00:38:41.000 --> 00:38:44.000
Knowing things about

00:38:44.000 --> 00:38:47.000
connectivity is, kind of, useful in things like the, kind of, bacon numbers or

00:38:47.000 --> 00:38:50.000
erdish numbers where people want to know, well, how close am I to greatness by -

00:38:50.000 --> 00:38:54.000
have I been in a movie with somebody who was in a movie, who was in a movie with Kevin Bacon? And,

00:38:54.000 --> 00:38:55.000
for me, it turns out

00:38:55.000 --> 00:38:57.000
zero; I have been in no movies. But have

00:38:57.000 --> 00:38:59.000
you written a paper with someone who then

00:38:59.000 --> 00:39:02.000
connects through a chain to some other famous person, right? Tells you about

00:39:02.000 --> 00:39:06.000
your, kind of, social network distance or something, or I'm linked in. You're trying to get a

00:39:06.000 --> 00:39:09.000
new job, and you want to find somebody to introduce you to

00:39:09.000 --> 00:39:13.000
the head of this cool company. It's, like, figuring who you know who knows them, right?

00:39:13.000 --> 00:39:18.000
It's finding paths through a social network graph.

00:39:18.000 --> 00:39:19.000
Finding out things like

00:39:19.000 --> 00:39:22.000
are there paths with cycles? What's the shortest path

00:39:22.000 --> 00:39:25.000
through the cycle, longest path through the cycle often tells you about things about

00:39:25.000 --> 00:39:27.000
redundancies in that structure

00:39:27.000 --> 00:39:31.000
that, for example, when you're building a network infrastructure, sometimes you do have

00:39:31.000 --> 00:39:34.000
cycles in it because you actually want to have multiple ways to get to things in case

00:39:34.000 --> 00:39:37.000
there's a breakdown. Like if there's a freeway accident, you want to have another way

00:39:37.000 --> 00:39:38.000
to get around it,

00:39:38.000 --> 00:39:41.000
but you are interested in, maybe, trying to keep your cost down is trying to find

00:39:41.000 --> 00:39:47.000
out where are the right places to put in those redundancies, those cycles that

00:39:47.000 --> 00:39:50.000
give you the best benefit with the least cost.

00:39:50.000 --> 00:39:54.000
Trying to find things like a continuous path that visits all the nodes once and exactly

00:39:54.000 --> 00:39:57.000
once, especially doing so efficiently, picking a short

00:39:57.000 --> 00:40:00.000
overall path for that. We talked a little bit about that in terms of what's called the

00:40:00.000 --> 00:40:02.000
traveling salesman problem.

00:40:02.000 --> 00:40:04.000
You are trying to get from - you have ten

00:40:04.000 --> 00:40:06.000
cities to visit on your book tour. You don't want to spend all your lifetime

00:40:06.000 --> 00:40:08.000
on a plane. What's the

00:40:08.000 --> 00:40:13.000
path of crisscrossing the country that gets you to all ten of those with the least

00:40:13.000 --> 00:40:15.000
time painfully

00:40:15.000 --> 00:40:17.000
spent on an airline? And so

00:40:17.000 --> 00:40:20.000
those are just graph search problems. You have these nodes. You have these arcs.

00:40:20.000 --> 00:40:21.000
How can you

00:40:21.000 --> 00:40:23.000
traverse them and visit

00:40:23.000 --> 00:40:25.000
and come up with

00:40:25.000 --> 00:40:27.000
a good solution. There's other

00:40:27.000 --> 00:40:30.000
problems I think that are kind of neat. Word letters, we used to give out an assignment on word letters. We

00:40:30.000 --> 00:40:31.000
didn't this time, but I thought it was an

00:40:31.000 --> 00:40:34.000
interesting problem to think about. The idea of

00:40:34.000 --> 00:40:36.000
how it is you can transform

00:40:36.000 --> 00:40:41.000
one letter at a time from one word to another is very much a

00:40:41.000 --> 00:40:45.000
graph problem behind the scenes, and that largely a breadth-first

00:40:45.000 --> 00:40:47.000
traversal is exactly what you're looking for there. How

00:40:47.000 --> 00:40:51.000
can I transform this with the shortest number of steps is, kind of,

00:40:51.000 --> 00:40:54.000
working radially out from the starting word.

00:40:54.000 --> 00:40:54.000


00:40:54.000 --> 00:40:58.000
Your boggle board turns out, really, is just a graph. If

00:40:58.000 --> 00:41:01.000
you think of

00:41:01.000 --> 00:41:07.000
having this sequence of letters

00:41:07.000 --> 00:41:12.000
that you're trying to work through,

00:41:12.000 --> 00:41:16.000
that really each of these letters can actually just be thought of as a node,

00:41:16.000 --> 00:41:20.000
and then the connections it has are to its eight neighbors, and that finding

00:41:20.000 --> 00:41:24.000
words in there, especially finding paths, starting from

00:41:24.000 --> 00:41:27.000
this node that lead away, that don't revisit, so, in fact, doing breadth-first

00:41:27.000 --> 00:41:31.000
or depth-first search. You most likely did depth-first search when you were writing the

00:41:31.000 --> 00:41:32.000
implementation of boggle,

00:41:32.000 --> 00:41:35.000
but that's, kind of, a deep search on I found an A. I found a P. I found a P, and, kind

00:41:35.000 --> 00:41:40.000
of, keeps going as long as that path looks viable. So, in this case, that the arc

00:41:40.000 --> 00:41:41.000
information we're using is

00:41:41.000 --> 00:41:44.000
having a sub of those letters, do they form a prefix that could be leading

00:41:44.000 --> 00:41:45.000
somewhere good,

00:41:45.000 --> 00:41:47.000
and then the, kind of, recursion bottoming out when

00:41:47.000 --> 00:41:52.000
we have run ourselves into a corner where we have no unvisited nodes

00:41:52.000 --> 00:41:53.000
that neighbor us,

00:41:53.000 --> 00:41:57.000
or only nodes that would extend to a prefix that no longer forms any

00:41:57.000 --> 00:41:58.000
viable words

00:41:58.000 --> 00:42:01.000
as a search problem.

00:42:01.000 --> 00:42:04.000
So it feels like everything you've done this quarter has actually been a graph in disguise; I just didn't

00:42:04.000 --> 00:42:08.000
tell you. The maze problem is very much a graph search problem, building a

00:42:08.000 --> 00:42:10.000
graph and searching it.

00:42:10.000 --> 00:42:13.000
Another one that I think is, kind of, neat to think about is, kind of, related to the idea of word letters

00:42:13.000 --> 00:42:18.000
is that often when you mistype a word, a word

00:42:18.000 --> 00:42:19.000
processor will suggest to you

00:42:19.000 --> 00:42:22.000
here's a word that you might have meant instead. You type a word that it doesn't

00:42:22.000 --> 00:42:24.000
have in its lexicon, and it says, well,

00:42:24.000 --> 00:42:25.000
that could just be

00:42:25.000 --> 00:42:28.000
a word I don't know, but it also could mean that you actually mistyped something, and so

00:42:28.000 --> 00:42:31.000
what you want to look at is having neighboring nodes,

00:42:31.000 --> 00:42:33.000
sort of, diagramming a graph

00:42:33.000 --> 00:42:35.000
of all the words you know of

00:42:35.000 --> 00:42:39.000
and then the, kind of, permutations, the tiny little twiddles to that word that

00:42:39.000 --> 00:42:41.000
are neighbors to it so that

00:42:41.000 --> 00:42:44.000
you could describe a set of suggestions

00:42:44.000 --> 00:42:47.000
as being those things that are within a certain

00:42:47.000 --> 00:42:50.000
hop distance from the original mistyped word. And

00:42:50.000 --> 00:42:53.000
so you could say, yeah, things that have two letters different are probably

00:42:53.000 --> 00:42:56.000
close enough, but things that have three or more letters different aren't actually

00:42:56.000 --> 00:42:58.000
likely to be

00:42:58.000 --> 00:43:00.000
the mistyped word,

00:43:00.000 --> 00:43:04.000
and then that, kind of, leads to supporting things like wildcard searches

00:43:04.000 --> 00:43:08.000
where you're trying to remember whether it's E-R-A-T or A-R-T-E at the

00:43:08.000 --> 00:43:10.000
end of desperate.

00:43:10.000 --> 00:43:14.000
Those things could be modeled as, kind of, like they search through a graph space where

00:43:14.000 --> 00:43:16.000
you're modeling those letters, and then there's, like, a

00:43:16.000 --> 00:43:20.000
branching where you try all of the letters coming out of D-E-S-P, seeing if any

00:43:20.000 --> 00:43:23.000
of them lead to an R-A-T-E.

00:43:23.000 --> 00:43:26.000
So the last thing I wanted

00:43:26.000 --> 00:43:28.000
to just bring up, and this is the one that's going into this week's assignment. I'm just

00:43:28.000 --> 00:43:31.000
gonna actually show it to you to, kind of, get you thinking about it because it's

00:43:31.000 --> 00:43:35.000
actually the problem at hand that you're gonna be solving

00:43:35.000 --> 00:43:36.000
is that

00:43:36.000 --> 00:43:41.000
the one of where I'm interested in finding the shortest path, but I'm gonna add in

00:43:41.000 --> 00:43:43.000
this new idea that

00:43:43.000 --> 00:43:45.000
all hops are not created equal.

00:43:45.000 --> 00:43:49.000
That a flight from San Francisco to Seattle is a shorter distance than the

00:43:49.000 --> 00:43:52.000
one that goes from San Francisco to Boston,

00:43:52.000 --> 00:43:53.000
and if I'm interested in getting

00:43:53.000 --> 00:43:57.000
to someplace, it's not just that I would like to take a direct flight. I'd also

00:43:57.000 --> 00:44:00.000
like to take the set of flights that involves the least, kind of, going out of my way,

00:44:00.000 --> 00:44:03.000
and going through other cities, and other directions.

00:44:03.000 --> 00:44:07.000
And so in this case, a breadth-first search is a little bit too simplistic.

00:44:07.000 --> 00:44:10.000
That if I just look at all the places I can get to in one hop,

00:44:10.000 --> 00:44:11.000


00:44:11.000 --> 00:44:15.000
those are not really equally

00:44:15.000 --> 00:44:16.000
weighted

00:44:16.000 --> 00:44:19.000
in terms of my goal of minimizing my total path distance,

00:44:19.000 --> 00:44:22.000
but if I, instead, prioritize -

00:44:22.000 --> 00:44:25.000
prioritize in your mind, immediately conjure up images of the

00:44:25.000 --> 00:44:26.000
priority queue,

00:44:26.000 --> 00:44:28.000
to look at paths that are,

00:44:28.000 --> 00:44:32.000
if I'm trying to minimize total distance, the shortest path, even if they

00:44:32.000 --> 00:44:36.000
represent several hops, if they're still shorter - if I have three flights, and each are

00:44:36.000 --> 00:44:37.000
100 miles,

00:44:37.000 --> 00:44:40.000
then they represent a flight of 300 total miles,

00:44:40.000 --> 00:44:44.000
and that's still preferred to one that actually is 2,000 miles or 7,000 miles

00:44:44.000 --> 00:44:45.000
going to Asia

00:44:45.000 --> 00:44:49.000
alternatively. And so if I worked on a breadth-first that's been modified by a total

00:44:49.000 --> 00:44:50.000
distance,

00:44:50.000 --> 00:44:51.000
that it would be a way

00:44:51.000 --> 00:44:54.000
to work out the shortest path

00:44:54.000 --> 00:44:55.000
in terms of total weight

00:44:55.000 --> 00:44:58.000
from San Francisco to my destination

00:44:58.000 --> 00:45:01.000
by a

00:45:01.000 --> 00:45:03.000
modified breadth-first, kind of, weighted strategy,

00:45:03.000 --> 00:45:06.000
and that is exactly what you're gonna be doing in this assignment. I'm not gonna

00:45:06.000 --> 00:45:08.000
give away too much by, kind of,

00:45:08.000 --> 00:45:11.000
delving too much in it, but the handout does a pretty good job of talking you through what

00:45:11.000 --> 00:45:15.000
you're doing, but it's a, kind of, neat way to think about that's how the GPS, the MapQuest, and things like

00:45:15.000 --> 00:45:16.000
that are working is

00:45:16.000 --> 00:45:17.000
you're at a site;

00:45:17.000 --> 00:45:21.000
you have a goal, and you have this information that you're trying to

00:45:21.000 --> 00:45:24.000
guide that search, and you're gonna use heuristics about

00:45:24.000 --> 00:45:28.000
appears to be the, kind of, optimal path to, kind of, pick that, and go with it,

00:45:28.000 --> 00:45:31.000
and see where it leads you. So

00:45:31.000 --> 00:45:32.000
that will be your

00:45:32.000 --> 00:45:40.000
task for this week. What we'll talk about on Friday is caching.