WEBVTT 00:00:00.000 --> 00:00:11.000 00:00:11.000 --> 00:00:14.000 This presentation is delivered by the Stanford Center for Professional 00:00:14.000 --> 00:00:25.000 Development. 00:00:25.000 --> 00:00:27.000 Okay, we got volume and everything. All 00:00:27.000 --> 00:00:28.000 right, here we go. Lots of you think your 00:00:28.000 --> 00:00:32.000 learning things that aren't gonna actually someday help you in your future 00:00:32.000 --> 00:00:34.000 career. Let's say you plan to be president. You 00:00:34.000 --> 00:00:38.000 think, ''Should I take 106b; should I not take 106b?'' Watch this interview and 00:00:38.000 --> 00:00:39.000 you will see. [Video] 00:00:39.000 --> 00:00:41.000 We have questions, and we ask 00:00:41.000 --> 00:00:45.000 our candidates questions, and this one is from Larry Schwimmer. What 00:00:45.000 --> 00:00:48.000 - you 00:00:48.000 --> 00:00:52.000 guys think I'm kidding; it's right here. 00:00:52.000 --> 00:00:59.000 What is the most efficient way to sort a million 32-bit integers? 00:00:59.000 --> 00:01:01.000 Well - I'm 00:01:01.000 --> 00:01:07.000 sorry. Maybe we should - that's not a - I think the bubble sort 00:01:07.000 --> 00:01:09.000 would be the wrong way to go. Come on; who told him this? 00:01:09.000 --> 00:01:14.000 There you go. So, apparently - 00:01:14.000 --> 00:01:15.000 the question was 00:01:15.000 --> 00:01:16.000 actually 00:01:16.000 --> 00:01:19.000 by a former student of mine, in 00:01:19.000 --> 00:01:22.000 fact, Larry Schwimmer. So apparently, he's been prepped well. As part of going on the campaign trail, not only do you 00:01:22.000 --> 00:01:25.000 have to know all your policy statements, you also have to have your 00:01:25.000 --> 00:01:29.000 N2 and login sorts kept straight in your mind, so - just so 00:01:29.000 --> 00:01:31.000 you know. So let me give you 00:01:31.000 --> 00:01:35.000 some examples of things that are graphs, just to get you thinking about 00:01:35.000 --> 00:01:36.000 00:01:36.000 --> 00:01:39.000 the data structure that we're gonna be working with and playing around with 00:01:39.000 --> 00:01:40.000 today. Sometimes you 00:01:40.000 --> 00:01:44.000 think of graphs of being those, like, bar graphs, right, your graphing 00:01:44.000 --> 00:01:46.000 histograms. This is actually a different kind of graph. The computer scientist's form of 00:01:46.000 --> 00:01:50.000 graph is just a generalized, recursive data structure 00:01:50.000 --> 00:01:51.000 that, kind of, 00:01:51.000 --> 00:01:54.000 starts with something that looks a little bit like a tree and extends it to 00:01:54.000 --> 00:01:55.000 this notion that there are 00:01:55.000 --> 00:01:57.000 any number of nodes, in 00:01:57.000 --> 00:02:00.000 this case, represented as the cities in this airline route map, 00:02:00.000 --> 00:02:04.000 and then there are connections between them, sometimes called edges or arcs that wire up 00:02:04.000 --> 00:02:08.000 the connections that in the case, for example, of this graph are showing you 00:02:08.000 --> 00:02:10.000 which cities have connecting flights. You have a direct 00:02:10.000 --> 00:02:15.000 flight that takes you from Las Vegas to Phoenix, right, one from Phoenix to Oklahoma 00:02:15.000 --> 00:02:15.000 City. 00:02:15.000 --> 00:02:18.000 There is not a direct flight from Phoenix to Kansas City, but there's a path 00:02:18.000 --> 00:02:21.000 by which you could get there by taking a sequence 00:02:21.000 --> 00:02:25.000 of connecting flights, right, going from Phoenix, to Oklahoma City, to Kansas City is 00:02:25.000 --> 00:02:26.000 one way to get there. 00:02:26.000 --> 00:02:30.000 And so the entire, kind of, map of airline routes forms a graph of the city, 00:02:30.000 --> 00:02:34.000 and there are a lot of interesting questions that having represented that data you might want to 00:02:34.000 --> 00:02:35.000 answer about finding 00:02:35.000 --> 00:02:36.000 flights that are 00:02:36.000 --> 00:02:40.000 cheap, or meet your time specification, or get you where you want to go, 00:02:40.000 --> 00:02:47.000 and that this data would be, kind of, primary in solving that problem. 00:02:47.000 --> 00:02:48.000 Another one, 00:02:48.000 --> 00:02:49.000 right, another 00:02:49.000 --> 00:02:52.000 idea of something that's represented as a graph is something that would support 00:02:52.000 --> 00:02:54.000 the notion of word ladders, those little puzzles where you're 00:02:54.000 --> 00:02:56.000 given the word ''chord,'' 00:02:56.000 --> 00:03:00.000 and you want to change it into the word ''worm,'' and you have these rules that say, well, you 00:03:00.000 --> 00:03:02.000 can change one letter at a time. 00:03:02.000 --> 00:03:04.000 And so the 00:03:04.000 --> 00:03:07.000 connections in the case of the word letter have to do with words that are 00:03:07.000 --> 00:03:12.000 one character different from their neighbors, have direct connections or arcs 00:03:12.000 --> 00:03:16.000 between them. Each of the nodes themselves represents a word, and then 00:03:16.000 --> 00:03:19.000 paths between that represent word ladders. That if you can go from here 00:03:19.000 --> 00:03:22.000 through a sequence of steps directly connecting your way, each of those stepping 00:03:22.000 --> 00:03:26.000 stones is a word in the chain that makes a word ladder. So 00:03:26.000 --> 00:03:29.000 things you might want to solve there is what's the shortest word ladder? How many different 00:03:29.000 --> 00:03:31.000 word ladders - how many different ways can I go from this word to that one 00:03:31.000 --> 00:03:36.000 could be answered by searching around in a space like this. 00:03:36.000 --> 00:03:39.000 Any kind of prerequisite structure, 00:03:39.000 --> 00:03:42.000 such as the one for the major that might say you need to take this class before you take 00:03:42.000 --> 00:03:45.000 that class, and this class before that class, right, 00:03:45.000 --> 00:03:46.000 forms a 00:03:46.000 --> 00:03:49.000 structure that also fits into the main graph. 00:03:49.000 --> 00:03:52.000 In this case, there's a connection to those arcs. 00:03:52.000 --> 00:03:54.000 That the prerequisite is such that 00:03:54.000 --> 00:03:58.000 you take 106a, and it leads into 106b, and that connection doesn't really go in 00:03:58.000 --> 00:04:01.000 the reverse, right? You don't start in 106b and move backwards. 00:04:01.000 --> 00:04:05.000 Whereas in the case, for example, of the word ladder, all of these arcs are what we called undirected, 00:04:05.000 --> 00:04:09.000 right? You can traverse from wood to word and back again, right? 00:04:09.000 --> 00:04:10.000 They're equally 00:04:10.000 --> 00:04:12.000 00:04:12.000 --> 00:04:15.000 visitable or neighbors, in this case. So there may be a situation where you have directed 00:04:15.000 --> 00:04:20.000 arcs where they really imply a one-way directionality through the paths. 00:04:20.000 --> 00:04:22.000 Another one, lots and lots of 00:04:22.000 --> 00:04:23.000 arrows that 00:04:23.000 --> 00:04:25.000 tell you about the lifecycle of 00:04:25.000 --> 00:04:26.000 00:04:26.000 --> 00:04:30.000 Stanford relationships all captured on one 00:04:30.000 --> 00:04:31.000 slide 00:04:31.000 --> 00:04:34.000 in, sort of, the flowchart. So flowcharts have a lot of things about directionality. Like, you're 00:04:34.000 --> 00:04:37.000 in this state, and you can move to this state, so that's your neighboring state, and 00:04:37.000 --> 00:04:39.000 the arcs in there connect those things. 00:04:39.000 --> 00:04:43.000 And so, as you will see, like, you get to start over here and, like, be single and love 00:04:43.000 --> 00:04:47.000 it, and then you can, kind of, flirt with various degrees of seriousness, 00:04:47.000 --> 00:04:50.000 and then eventually there evolves a trombone and serenade, which, you know, 00:04:50.000 --> 00:04:53.000 you can't go back to flirting aimlessly after the trombone and serenade; there's no 00:04:53.000 --> 00:04:55.000 arc that direction. 00:04:55.000 --> 00:04:57.000 When somebody gets the trombone out, 00:04:57.000 --> 00:04:59.000 you know, you take them seriously, 00:04:59.000 --> 00:05:02.000 or you completely blow them off. There's the [inaudible] there. 00:05:02.000 --> 00:05:04.000 And then there's dating, and then there's 00:05:04.000 --> 00:05:08.000 two-phase commit, which is a very important part of any computer scientist's education, 00:05:08.000 --> 00:05:10.000 and then, potentially, other 00:05:10.000 --> 00:05:13.000 tracks of children and what not. But it does 00:05:13.000 --> 00:05:15.000 show you, kind of, places you can go, right? 00:05:15.000 --> 00:05:18.000 It turns out you actually don't get to go from flirt aimlessly to have baby, so 00:05:18.000 --> 00:05:20.000 write that down - 00:05:20.000 --> 00:05:23.000 not allowed in this graph. And 00:05:23.000 --> 00:05:28.000 then it turns out some things that you've already seen are actually graphs in disguise. 00:05:28.000 --> 00:05:32.000 I didn't tell you when I gave you the maze problem and its solution, the solving and 00:05:32.000 --> 00:05:33.000 building of a maze, 00:05:33.000 --> 00:05:37.000 that, in fact, what you're working on though is something that is a graph. 00:05:37.000 --> 00:05:39.000 That a perfect maze, right, 00:05:39.000 --> 00:05:40.000 so we started with this 00:05:40.000 --> 00:05:44.000 fully disconnected maze. If you remember how the Aldis/Broder algorithm worked, it's like 00:05:44.000 --> 00:05:47.000 you have all these nodes that have no connections between them, 00:05:47.000 --> 00:05:51.000 and then you went bopping around breaking down walls, which is effectively 00:05:51.000 --> 00:05:53.000 connecting up the nodes to each other, 00:05:53.000 --> 00:05:57.000 and we kept doing that until actually all of the nodes were connected. So, in effect, we 00:05:57.000 --> 00:06:00.000 were building a graph out of this big grid of cells 00:06:00.000 --> 00:06:03.000 by knocking down those walls to build the connections, and when we were done, then 00:06:03.000 --> 00:06:07.000 we made it so that it was possible to trace paths starting from that lower corner 00:06:07.000 --> 00:06:10.000 and working our way through the neighbors all the way over to there. 00:06:10.000 --> 00:06:13.000 So in the case of this, if you imagine that each of these cells was broken out in this little 00:06:13.000 --> 00:06:16.000 thing, there'd be these neighboring connections where this guy has this 00:06:16.000 --> 00:06:19.000 neighbor but no other ones because actually it has no 00:06:19.000 --> 00:06:22.000 other directions you can move from there, but some of them, for example like this 00:06:22.000 --> 00:06:23.000 one, 00:06:23.000 --> 00:06:26.000 has a full set of four neighbors you can reach 00:06:26.000 --> 00:06:28.000 depending on which walls are intact or not. 00:06:28.000 --> 00:06:29.000 And so the 00:06:29.000 --> 00:06:31.000 creation of that was creating a graph out 00:06:31.000 --> 00:06:34.000 of a set of disconnected nodes, and then solving it is searching that 00:06:34.000 --> 00:06:38.000 graph for a path. Here's one that is a little horrifying that came from the newspaper, which is social networking, kind of, a very big [inaudible] of, like, on the net who do you know, and who do they know, and how do those connections, like, linked in maybe help you get a job. This is one that was actually based on the liaisons, let's say, of a group of high school students, right, and it turns out there was a lot of more interconnected in that graph than I would expect, or appreciate, or approve of but interesting to, kind of, look at. So 00:06:38.000 --> 00:06:39.000 let's talk, like, concretely. 00:06:39.000 --> 00:06:43.000 How are we gonna make stuff work on graphs? Graphs turn out to actually - by showing you that, I'm trying to 00:06:43.000 --> 00:06:46.000 get you this idea that a lot of things are really just graphs in disguise. That a lot 00:06:46.000 --> 00:06:50.000 of the problems you may want to solve, if you can represent it as a graph, then things 00:06:50.000 --> 00:06:53.000 you learn about how to manipulate graphs will help you to solve that kind of 00:06:53.000 --> 00:06:54.000 problem. 00:06:54.000 --> 00:06:59.000 So the basic idea of a graph, it is a recursive data structure based on the node where 00:06:59.000 --> 00:07:02.000 nodes have connections to other nodes. 00:07:02.000 --> 00:07:04.000 Okay. So that's, kind of, like a tree, 00:07:04.000 --> 00:07:07.000 but it actually is more freeform than a tree, right? 00:07:07.000 --> 00:07:11.000 In a tree, right, there's that single root node that has, kind of, a special place, and 00:07:11.000 --> 00:07:15.000 there's this restriction about the connections that there's a single path 00:07:15.000 --> 00:07:17.000 from every other node in the tree 00:07:17.000 --> 00:07:20.000 starting from the root that leads to it. So you never can get there by 00:07:20.000 --> 00:07:24.000 multiple ways, and there's no cycles, and it's all connected, and things like 00:07:24.000 --> 00:07:25.000 that. 00:07:25.000 --> 00:07:26.000 So, in terms of a graph, 00:07:26.000 --> 00:07:29.000 it just doesn't place those restrictions. It says there's a node. 00:07:29.000 --> 00:07:32.000 It has any number of pointers to other nodes. It potentially could even 00:07:32.000 --> 00:07:36.000 have zero pointers to other's nodes, so it could be on a little island of its own out 00:07:36.000 --> 00:07:38.000 here that has no incoming or outgoing flights, right? You 00:07:38.000 --> 00:07:40.000 have to swim if you want to get there. 00:07:40.000 --> 00:07:42.000 00:07:42.000 --> 00:07:45.000 And then there's also not a rule that says that you can't have 00:07:45.000 --> 00:07:47.000 multiple ways to get to the same 00:07:47.000 --> 00:07:50.000 node. So if you're at B, and you're interested in getting to C, 00:07:50.000 --> 00:07:52.000 you could take the direction connection here, 00:07:52.000 --> 00:07:54.000 but you could also take a longer path 00:07:54.000 --> 00:07:58.000 that bops through A and hits to C, and so there's more than one way to get there, 00:07:58.000 --> 00:07:59.000 00:07:59.000 --> 00:08:02.000 which would not be true in a tree structure where it has that constraint, 00:08:02.000 --> 00:08:05.000 but that adds some interesting challenges to solving problems in the graphs 00:08:05.000 --> 00:08:08.000 because there actually are so many different ways you can 00:08:08.000 --> 00:08:11.000 get to the same place. We have to be careful avoiding 00:08:11.000 --> 00:08:13.000 getting into circularities where we're 00:08:13.000 --> 00:08:16.000 doing that, and also not redundantly doing work we've already done before because 00:08:16.000 --> 00:08:18.000 we've visited it previously. 00:08:18.000 --> 00:08:21.000 There can be cycles where 00:08:21.000 --> 00:08:22.000 you can 00:08:22.000 --> 00:08:24.000 completely come around. I this one, actually, 00:08:24.000 --> 00:08:26.000 doesn't have a cycle in it. 00:08:26.000 --> 00:08:27.000 Nope, it doesn't, 00:08:27.000 --> 00:08:31.000 but if I add it - I could add a link, for example, from 00:08:31.000 --> 00:08:34.000 C back to B, and then there would be this place you could just, kind of, keep 00:08:34.000 --> 00:08:36.000 rolling around in the B, A, C 00:08:36.000 --> 00:08:38.000 area. 00:08:38.000 --> 00:08:41.000 So we call each of those guys, the circles here, the nodes, right? 00:08:41.000 --> 00:08:44.000 The connection between them arcs. We 00:08:44.000 --> 00:08:47.000 will talk about the arcs as being directed and undirected in different situations 00:08:47.000 --> 00:08:51.000 where we imply that there's a directionality; you can travel the arc only one-way, 00:08:51.000 --> 00:08:54.000 or in cases where it's undirected, it means it goes both ways. 00:08:54.000 --> 00:08:56.000 We'll talk about two nodes being connected, 00:08:56.000 --> 00:08:58.000 meaning there is a direct connection, an arc, between them, 00:08:58.000 --> 00:09:02.000 and if they are not directly connected, there may possibly be a path 00:09:02.000 --> 00:09:05.000 between them. A sequence of arcs forms a path that you can follow to get 00:09:05.000 --> 00:09:07.000 from one node to another, 00:09:07.000 --> 00:09:11.000 and then cycles just meaning a path that revisits one of the nodes that 00:09:11.000 --> 00:09:14.000 was previously traveled on that path. So I've got 00:09:14.000 --> 00:09:16.000 a little terminology. 00:09:16.000 --> 00:09:18.000 Let's talk a little bit about how we're gonna represent it. 00:09:18.000 --> 00:09:22.000 So before we start making stuff happen on them, it's, like, well, how does this - 00:09:22.000 --> 00:09:26.000 in C++, right, what's the best way, or what are the reasonable ways you 00:09:26.000 --> 00:09:27.000 might represent 00:09:27.000 --> 00:09:30.000 this kind of connected structure? 00:09:30.000 --> 00:09:31.000 So if I have 00:09:31.000 --> 00:09:36.000 this four-node graph, A, B, C, D over where with the connections that are shown with 00:09:36.000 --> 00:09:38.000 direction, in this case, I have directed arcs. 00:09:38.000 --> 00:09:41.000 That one way to think of it is that what a 00:09:41.000 --> 00:09:45.000 graph really is is a set of nodes and a set of arcs, and by set, I mean, sort of, just the 00:09:45.000 --> 00:09:48.000 generalization. There might be a vector of arcs; there might be a vector of 00:09:48.000 --> 00:09:51.000 nodes, whatever, but some collection of arcs, some collection of nodes, 00:09:51.000 --> 00:09:53.000 and that the 00:09:53.000 --> 00:09:56.000 nodes might have information about what location they represent, or 00:09:56.000 --> 00:09:59.000 what word, or what entity, you know, in a social network that they are, 00:09:59.000 --> 00:10:03.000 and then the arcs would show who is directly connected to whom. 00:10:03.000 --> 00:10:06.000 And so one way to manage that, right, is just to have two independent collections - a collection 00:10:06.000 --> 00:10:08.000 of nodes and a 00:10:08.000 --> 00:10:11.000 collection of arcs, and that when I need to know things about connections, I'll have to, kind of, use both 00:10:11.000 --> 00:10:12.000 in tandem. 00:10:12.000 --> 00:10:14.000 So if I want to know if A is connected to B, 00:10:14.000 --> 00:10:18.000 I might actually have to just walk through the entire set of arcs trying to find 00:10:18.000 --> 00:10:22.000 one that connects A and B as its endpoints. If I want to see if 00:10:22.000 --> 00:10:25.000 C is connected to B, same thing, just walk down the whole set. 00:10:25.000 --> 00:10:27.000 00:10:27.000 --> 00:10:29.000 00:10:29.000 --> 00:10:31.000 That's probably the simplest thing to do, right, is to just, kind of, 00:10:31.000 --> 00:10:35.000 have them just be maintained as the two things that are used in conjunction. 00:10:35.000 --> 00:10:37.000 More likely, what you're gonna want to do 00:10:37.000 --> 00:10:39.000 is provide some access that 00:10:39.000 --> 00:10:42.000 when at a current node, you're currently trying to do some processing 00:10:42.000 --> 00:10:47.000 from the Node B, it would be convenient if you could easily access those nodes 00:10:47.000 --> 00:10:51.000 that emanate or outgo from the B node. That's typically the thing you're trying 00:10:51.000 --> 00:10:52.000 to do is 00:10:52.000 --> 00:10:55.000 at B try to visit its neighbors, 00:10:55.000 --> 00:10:58.000 and rather that having to, kind of, pluck them out of the entire collection, 00:10:58.000 --> 00:11:01.000 it might be handy if they were already, kind of, stored in such a way to make it 00:11:01.000 --> 00:11:03.000 easy for you to get to that. 00:11:03.000 --> 00:11:06.000 So the two ways that try to represent adjacency 00:11:06.000 --> 00:11:10.000 in a more efficient way are the adjacency list and the adjacency matrix. 00:11:10.000 --> 00:11:14.000 So in an adjacency list representation, we have a way of associating with each node, 00:11:14.000 --> 00:11:17.000 probably just by storing it in the node structure itself, 00:11:17.000 --> 00:11:21.000 those arcs that outgo from there. So, in this case, A has as direct 00:11:21.000 --> 00:11:22.000 connection to the B 00:11:22.000 --> 00:11:23.000 and C Nodes, 00:11:23.000 --> 00:11:26.000 and so I would have that information stored with the A Node, which is my 00:11:26.000 --> 00:11:29.000 vector or set of outgoing connections. 00:11:29.000 --> 00:11:33.000 Similarly, B has one outgoing connection, which goes to D. C has an outgoing connection to D, and 00:11:33.000 --> 00:11:37.000 then D has outgoing connections that lead back to A and 00:11:37.000 --> 00:11:40.000 to B. So at a particular node I could say, well, who's my neighbors? 00:11:40.000 --> 00:11:41.000 It would be 00:11:41.000 --> 00:11:46.000 easily accessible and not have to be retrieved, or searched, or filtered. 00:11:46.000 --> 00:11:48.000 Another way of doing that 00:11:48.000 --> 00:11:51.000 is to realize that, in a sense, what we have is this, kind of, 00:11:51.000 --> 00:11:53.000 end by end grid 00:11:53.000 --> 00:11:56.000 where each node is potentially connected to 00:11:56.000 --> 00:12:00.000 the other N -1 nodes in that graph. And so if I just build 00:12:00.000 --> 00:12:03.000 a 2x2 matrix that's labeled with 00:12:03.000 --> 00:12:06.000 all the node names on this side and all the node names across the top, that the 00:12:06.000 --> 00:12:08.000 intersection of this says 00:12:08.000 --> 00:12:11.000 is there an arc that leads away from A and 00:12:11.000 --> 00:12:14.000 ends at B? Yes, there is. Is there one from A to C? 00:12:14.000 --> 00:12:18.000 And in the places where there is not an existing connection, a self loop or 00:12:18.000 --> 00:12:19.000 00:12:19.000 --> 00:12:23.000 just a connection that doesn't exist, right, I have an empty slot. So maybe these would be Booleans 00:12:23.000 --> 00:12:25.000 true and false or something, or maybe there'd be some more information here 00:12:25.000 --> 00:12:27.000 about the distance 00:12:27.000 --> 00:12:30.000 or other associated arc properties for that 00:12:30.000 --> 00:12:31.000 particular connection. 00:12:31.000 --> 00:12:34.000 In this case, it's actually very easy then to actually do the really quick 00:12:34.000 --> 00:12:35.000 search of 00:12:35.000 --> 00:12:38.000 I'm at A and I want to know if I can get to B, then it's really just a 00:12:38.000 --> 00:12:41.000 matter of reaching right into the 00:12:41.000 --> 00:12:45.000 slot in constant time in that matrix to pull it out, and 00:12:45.000 --> 00:12:47.000 so 00:12:47.000 --> 00:12:49.000 this one involves a lot of searching to find things. This one involves, perhaps, a 00:12:49.000 --> 00:12:52.000 little bit of searching. If I want to know if A is connected to B, I might have to look 00:12:52.000 --> 00:12:55.000 through all its outgoing connections, right? This one gives me that direct 00:12:55.000 --> 00:12:56.000 access, 00:12:56.000 --> 00:13:00.000 but the tradeoffs here has to do with where the space starts building up, right? 00:13:00.000 --> 00:13:03.000 A full NxN matrix could be very big, 00:13:03.000 --> 00:13:06.000 and sometimes we're allocating space as though there are a full set of 00:13:06.000 --> 00:13:09.000 connections between all the nodes, every node connected to every other, 00:13:09.000 --> 00:13:12.000 and in the cases where a lot of those things are false, there's a lot of 00:13:12.000 --> 00:13:13.000 capacity 00:13:13.000 --> 00:13:16.000 that we have set aside that we're not really tapping into. 00:13:16.000 --> 00:13:19.000 And so in the case of a graph that we call dense, 00:13:19.000 --> 00:13:23.000 where there's a lot of connections, it will use a lot of that capacity, and it might make sense 00:13:23.000 --> 00:13:27.000 in a graph that's more sparse, where there's a few connections. So you have 00:13:27.000 --> 00:13:31.000 thousands of nodes, but maybe only three or four on average are connected to one. 00:13:31.000 --> 00:13:35.000 Then having allocated 1,000 slots of which to only fill in three 00:13:35.000 --> 00:13:36.000 might become 00:13:36.000 --> 00:13:40.000 rather space inefficient. You might prefer an adjacency list representation that 00:13:40.000 --> 00:13:41.000 can get you 00:13:41.000 --> 00:13:44.000 to the ones you're actually using versus the other, 00:13:44.000 --> 00:13:47.000 but in terms of the adjacency matrices is very fast, 00:13:47.000 --> 00:13:48.000 right? A lot of space thrown at this 00:13:48.000 --> 00:13:51.000 gives you this immediate 0,1 access to know 00:13:51.000 --> 00:13:55.000 who your neighbors are. So 00:13:55.000 --> 00:13:59.000 we're gonna use the adjacency list, kind of, strategy here 00:13:59.000 --> 00:14:02.000 for the code we're gonna do today, and it's, kind 00:14:02.000 --> 00:14:07.000 of, a good compromise between the two alternatives that we've seen there. 00:14:07.000 --> 00:14:08.000 So I have a struct node. 00:14:08.000 --> 00:14:11.000 It has some information for the node. Given that I'm not being very specific 00:14:11.000 --> 00:14:14.000 about what the graph is, I'm just gonna, kind of, leave that unspecified. It might be that it's 00:14:14.000 --> 00:14:17.000 the city name. It might be that it's a word. It might be that it's a person in a 00:14:17.000 --> 00:14:19.000 social network, you know, 00:14:19.000 --> 00:14:23.000 some information that represents what this node is an entity, 00:14:23.000 --> 00:14:28.000 and then there's gonna be a set of arcs or a set of nodes it's connected to, 00:14:28.000 --> 00:14:32.000 and so I'm using vector here. I could be using set. I could be using a 00:14:32.000 --> 00:14:35.000 raw array, or a link list, or any of these things. 00:14:35.000 --> 00:14:38.000 I need to use some unbounded collection though because there's no guarantee that 00:14:38.000 --> 00:14:42.000 they will be 0, or 1, or 2 the way there is in a link list or a tree where there's a 00:14:42.000 --> 00:14:45.000 specified number of outgoing links. There can be any number of them, so I'm just leaving 00:14:45.000 --> 00:14:46.000 00:14:46.000 --> 00:14:47.000 a 00:14:47.000 --> 00:14:51.000 variable size collection here to do that work for me. 00:14:51.000 --> 00:14:52.000 The graph itself, 00:14:52.000 --> 00:14:55.000 so I have this idea of all the nodes in the graph, right, would each get a new 00:14:55.000 --> 00:14:58.000 node structure and then be wired up to the other nodes that they are connected 00:14:58.000 --> 00:15:00.000 to, 00:15:00.000 --> 00:15:02.000 but then when I'm trying to operate on that graph, 00:15:02.000 --> 00:15:06.000 I can't just take one pointer and say here's a pointer to some node in the 00:15:06.000 --> 00:15:09.000 graph and say this is - and from here you have accessed everything. In the way 00:15:09.000 --> 00:15:11.000 that a tree, 00:15:11.000 --> 00:15:13.000 the only thing we need to keep track of is the pointer to the root, 00:15:13.000 --> 00:15:17.000 or a link list, the only thing we keep a pointer to, typically, is that frontmost 00:15:17.000 --> 00:15:19.000 cell, and from there, we can reach everything else. 00:15:19.000 --> 00:15:23.000 There is no special head or root cell in a graph. 00:15:23.000 --> 00:15:27.000 A graph is this 00:15:27.000 --> 00:15:30.000 being, kind of, a crazy collection without a lot of rules about 00:15:30.000 --> 00:15:35.000 how they are connected. In fact, it's not even guaranteed that they totally are 00:15:35.000 --> 00:15:36.000 connected. 00:15:36.000 --> 00:15:38.000 That I can't 00:15:38.000 --> 00:15:41.000 guarantee that if I had a pointer, for example, to C, 00:15:41.000 --> 00:15:44.000 right, I may or may not be able to reach all the nodes from there. 00:15:44.000 --> 00:15:49.000 If I had a pointer to A, right, A can get to C and B, and then also 00:15:49.000 --> 00:15:52.000 down to D, but, for example, there could just be an E node over here that was only 00:15:52.000 --> 00:15:55.000 connected to C or not connected to anything at all for that matter, connected 00:15:55.000 --> 00:15:57.000 to F in their own little island. 00:15:57.000 --> 00:16:01.000 That there is no way to identify some special node from which you could reach 00:16:01.000 --> 00:16:02.000 all the nodes. 00:16:02.000 --> 00:16:04.000 There isn't a special root node, 00:16:04.000 --> 00:16:07.000 and you can't even just arbitrarily pick one to be the root because actually there's no 00:16:07.000 --> 00:16:10.000 guarantee that it will be connected to all of the others. So it would not give you 00:16:10.000 --> 00:16:13.000 access to the full entirety of your graph if you just picked on and said I want anything 00:16:13.000 --> 00:16:16.000 reachable from here, it might not get you the whole thing. So, 00:16:16.000 --> 00:16:18.000 typically, what you need to really operate on is 00:16:18.000 --> 00:16:21.000 the entire collection of node stars. You'll have a set or a vector 00:16:21.000 --> 00:16:24.000 that contains all of the nodes that are in the graph, 00:16:24.000 --> 00:16:26.000 and then within that, there's a bunch of other connections that are 00:16:26.000 --> 00:16:35.000 being made on the graph connectivity that you're also exploring. 00:16:35.000 --> 00:16:37.000 So let's make it a little bit more concrete 00:16:37.000 --> 00:16:39.000 00:16:39.000 --> 00:16:40.000 and talk a little bit about 00:16:40.000 --> 00:16:43.000 what really is a node; what really is an arc? 00:16:43.000 --> 00:16:46.000 There may be, actually, be some more information I need to store than just a 00:16:46.000 --> 00:16:48.000 node is connected to other nodes. 00:16:48.000 --> 00:16:51.000 So in the case of a list or a tree, 00:16:51.000 --> 00:16:54.000 the next field and the left and the right field are really just pointers for the 00:16:54.000 --> 00:16:56.000 purpose of organization. 00:16:56.000 --> 00:16:58.000 That's the pointer to the next cell. 00:16:58.000 --> 00:17:02.000 There's no data that is really associated with that link itself. 00:17:02.000 --> 00:17:05.000 That's not true in the case of a graph. 00:17:05.000 --> 00:17:09.000 That often what the links that connect up the nodes 00:17:09.000 --> 00:17:12.000 actually do have a real role to play in terms of being data. 00:17:12.000 --> 00:17:14.000 It may be that what they tell you is 00:17:14.000 --> 00:17:18.000 what road you're taking here, or how long this path is, or how much this 00:17:18.000 --> 00:17:22.000 flight costs, or what time this flight leaves, or how full this flight already is, 00:17:22.000 --> 00:17:24.000 or who knows, you know, just other information 00:17:24.000 --> 00:17:27.000 that is more than just this nodes connected to that one. It's, like, how is it 00:17:27.000 --> 00:17:30.000 connected; what is it connected by, and what are the details of how that 00:17:30.000 --> 00:17:31.000 connection 00:17:31.000 --> 00:17:32.000 00:17:32.000 --> 00:17:34.000 is being represented? 00:17:34.000 --> 00:17:38.000 So it's likely that what you really want is not just pointers to other nodes but 00:17:38.000 --> 00:17:42.000 information about that actual link stored with an arc structure. 00:17:42.000 --> 00:17:44.000 So, typically, we're gonna have 00:17:44.000 --> 00:17:45.000 an arc 00:17:45.000 --> 00:17:48.000 which has information about that arc, how long, how much it costs, you know, 00:17:48.000 --> 00:17:50.000 what flight number it is, 00:17:50.000 --> 00:17:52.000 that will have pointers to the start and end node that it is connecting. 00:17:52.000 --> 00:17:54.000 The node 00:17:54.000 --> 00:17:57.000 then has a collection of arcs, 00:17:57.000 --> 00:18:00.000 and those arcs are expected to, in the adjacency list form, will all be arcs 00:18:00.000 --> 00:18:02.000 that start at this node. 00:18:02.000 --> 00:18:05.000 So their start node is equal to the one that's holding onto to them, and then 00:18:05.000 --> 00:18:06.000 they have an end 00:18:06.000 --> 00:18:10.000 pointer that points to the node at the other end of the arc. Well, 00:18:10.000 --> 00:18:13.000 this gets us into a little C++ bind 00:18:13.000 --> 00:18:16.000 because I've described these data structures 00:18:16.000 --> 00:18:17.000 in a way 00:18:17.000 --> 00:18:22.000 that they both depend on each other; that an arc has pointers to the start and the end 00:18:22.000 --> 00:18:22.000 node. 00:18:22.000 --> 00:18:27.000 A node has a collection of arcs, which is probably a vector or a set of arc pointers, 00:18:27.000 --> 00:18:28.000 and so 00:18:28.000 --> 00:18:31.000 I'm starting to define the structure for arc, 00:18:31.000 --> 00:18:33.000 and then I want to talk about node in it, 00:18:33.000 --> 00:18:35.000 and I think, oh, okay, well, then I better define node first because C++ always likes to 00:18:35.000 --> 00:18:38.000 see things in order. Remember, it always wants to see A, 00:18:38.000 --> 00:18:42.000 and then B if B uses A, and so in terms of how your functions and data 00:18:42.000 --> 00:18:45.000 structures work, you've always gotten this idea like, well, go do the one first. Well, if I say, 00:18:45.000 --> 00:18:48.000 okay, arc's gonna need node. I start to write arc, and I say, oh, I 00:18:48.000 --> 00:18:50.000 need to talk about node. Well, I better put node up here, 00:18:50.000 --> 00:18:54.000 right? But then when I'm starting to write node, I start talking about arc, and they have a 00:18:54.000 --> 00:18:57.000 circular reference to each other, right? They both 00:18:57.000 --> 00:19:00.000 want to depend on the other one before we've gotten around to telling the 00:19:00.000 --> 00:19:01.000 compiler about it. 00:19:01.000 --> 00:19:04.000 One of them has to go first, right? What are we gonna do? 00:19:04.000 --> 00:19:08.000 What we need to do is use the C++ forward reference 00:19:08.000 --> 00:19:09.000 as a 00:19:09.000 --> 00:19:12.000 little hint to the compiler 00:19:12.000 --> 00:19:13.000 that 00:19:13.000 --> 00:19:17.000 we are gonna be defining a Node T structure in a little bit; 00:19:17.000 --> 00:19:18.000 we're gonna get to it, 00:19:18.000 --> 00:19:22.000 but first we're gonna define the Arc T structure that uses that Node T, and then 00:19:22.000 --> 00:19:25.000 we're gonna get around to it. So we're gonna give it, like, a little heads up that says 00:19:25.000 --> 00:19:29.000 there later will be an Act 2 of this play. We're gonna introduce a 00:19:29.000 --> 00:19:31.000 character called Node T. 00:19:31.000 --> 00:19:33.000 So you can now 00:19:33.000 --> 00:19:34.000 start talking about Node 00:19:34.000 --> 00:19:37.000 T in some simple ways 00:19:37.000 --> 00:19:40.000 based on your agreement to the compiler that you plan on telling it more 00:19:40.000 --> 00:19:41.000 about Node T later. 00:19:41.000 --> 00:19:45.000 So the struct Node T says there will be a struct called Node T. 00:19:45.000 --> 00:19:47.000 Okay, the compiler says, okay, I'll write that down. 00:19:47.000 --> 00:19:51.000 You start defining struct Arc T, and it says, oh, I see you have these pointers 00:19:51.000 --> 00:19:54.000 to the node T. Okay. Well, you told me there'd be a struct like that; I guess that'll work out. 00:19:54.000 --> 00:19:58.000 And now you told it what struct Node T is. It's, like, oh, I have a vector 00:19:58.000 --> 00:19:59.000 with these pointers to arc, 00:19:59.000 --> 00:20:02.000 and then it says, okay, now I see the whole picture, and 00:20:02.000 --> 00:20:03.000 it all worked out. 00:20:03.000 --> 00:20:08.000 So it's just a little bit of a 00:20:08.000 --> 00:20:11.000 requirement of how the C++ compiler likes to see stuff in order without ever having 00:20:11.000 --> 00:20:16.000 to go backwards to check anything out again. Okay. So 00:20:16.000 --> 00:20:19.000 I got myself some node structures, got 00:20:19.000 --> 00:20:21.000 some arc structures; they're all 00:20:21.000 --> 00:20:22.000 working together. 00:20:22.000 --> 00:20:24.000 I'm gonna try to do some traversals. 00:20:24.000 --> 00:20:26.000 Try to do some operations that 00:20:26.000 --> 00:20:29.000 work their way around the graph. 00:20:29.000 --> 00:20:30.000 So 00:20:30.000 --> 00:20:33.000 tree traversals, right, the pre, and post, and in order 00:20:33.000 --> 00:20:36.000 are ways that you start at the root, and you visit all the nodes on the tree, 00:20:36.000 --> 00:20:40.000 a very common need to, kind of, process, to delete nodes, 00:20:40.000 --> 00:20:43.000 to print the nodes, to whatnot. So we might want to do the same thing on graphs. I've got a 00:20:43.000 --> 00:20:47.000 graph out there, and I'd like to go exploring it. Maybe I just want to print, and if 00:20:47.000 --> 00:20:50.000 I'm in the Gates building, what are all the places that I can reach from 00:20:50.000 --> 00:20:53.000 the Gates building? Where can I go to; what 00:20:53.000 --> 00:20:54.000 options do I have? 00:20:54.000 --> 00:20:57.000 What I'm gonna do is a traversal 00:20:57.000 --> 00:21:01.000 starting at a node and, kind of, working my way outward to the visible, reachable 00:21:01.000 --> 00:21:04.000 nodes that I can follow those paths to get there. 00:21:04.000 --> 00:21:07.000 We're gonna look at two different ways to do this. 00:21:07.000 --> 00:21:09.000 One is Depth-first and one is Breadth-first, 00:21:09.000 --> 00:21:13.000 and I'll show you both algorithms, and they 00:21:13.000 --> 00:21:15.000 will visit the same nodes; 00:21:15.000 --> 00:21:19.000 when all is done and said, they will visit all the reachable nodes starting from a 00:21:19.000 --> 00:21:20.000 position, 00:21:20.000 --> 00:21:21.000 but they will visit them in different order. 00:21:21.000 --> 00:21:25.000 So just like the pre/post in order tree traversals, they visit all the same 00:21:25.000 --> 00:21:28.000 notes. It's just a matter of when they get around to doing it 00:21:28.000 --> 00:21:29.000 is how 00:21:29.000 --> 00:21:32.000 we distinguish depth and breadth-first traversals. 00:21:32.000 --> 00:21:34.000 Because we have this graph structure 00:21:34.000 --> 00:21:40.000 that has this loose connectivity and the possibility of cycles and multiple paths 00:21:40.000 --> 00:21:41.000 to the same node, 00:21:41.000 --> 00:21:44.000 we are gonna have to be a little bit careful at how we do our work to make 00:21:44.000 --> 00:21:48.000 sure that we don't end up getting stuck in some infinite loop when we keep 00:21:48.000 --> 00:21:51.000 going around the ABC cycle in a particular graph. 00:21:51.000 --> 00:21:55.000 That we need to realize when we're revisiting something we've seen before, 00:21:55.000 --> 00:21:56.000 and then not 00:21:56.000 --> 00:21:57.000 00:21:57.000 --> 00:21:59.000 trigger, kind of, an 00:21:59.000 --> 00:22:03.000 exploration we've already done. 00:22:03.000 --> 00:22:06.000 So let's look at depth-first first. 00:22:06.000 --> 00:22:09.000 This is probably the simpler of the two. They're both, actually, pretty simple. 00:22:09.000 --> 00:22:13.000 Depth-first traversal uses recursion to do its work, 00:22:13.000 --> 00:22:18.000 and the idea is that you pick a starting node - let's say I want to start at Gates, 00:22:18.000 --> 00:22:21.000 and that maybe what I'm trying to find is all the reachable nodes from Gates. 00:22:21.000 --> 00:22:24.000 If I don't have a starting node, then I can just pick one arbitrarily from the 00:22:24.000 --> 00:22:27.000 graph because there is actually no root node of special status, 00:22:27.000 --> 00:22:30.000 and the idea is to go deep. That 00:22:30.000 --> 00:22:33.000 in the case of, for example, a maze, it might be that I 00:22:33.000 --> 00:22:36.000 choose to go north, and then I just keep going north. I just go north, north, north, north, north until 00:22:36.000 --> 00:22:40.000 I run into a dead wall, and then I say, oh, I have to go east, and I go east, east, east, east, east. The idea is to 00:22:40.000 --> 00:22:42.000 just go as far away from 00:22:42.000 --> 00:22:46.000 the original state as I can, just keep going outward, outward, outward, outward, 00:22:46.000 --> 00:22:47.000 outward, outward, outward, 00:22:47.000 --> 00:22:50.000 and eventually I'm gonna run into some dead end, and that dead end could be I 00:22:50.000 --> 00:22:52.000 get to a node that has no outgoing arcs, 00:22:52.000 --> 00:22:54.000 or it could be that I get to a node 00:22:54.000 --> 00:22:57.000 that whose only arcs come back to places I've already been, 00:22:57.000 --> 00:22:58.000 so it cycles back on itself, in 00:22:58.000 --> 00:23:01.000 which case, that also is really a dead end. 00:23:01.000 --> 00:23:03.000 So you go deep. 00:23:03.000 --> 00:23:06.000 You go north, and you see as far as you can get, and 00:23:06.000 --> 00:23:10.000 only after you've, kind of, fully explored everything reachable from starting in 00:23:10.000 --> 00:23:11.000 the north direction 00:23:11.000 --> 00:23:13.000 do you backtrack, 00:23:13.000 --> 00:23:16.000 unmake that decision, and say, well, how about not north; how about east, right? And then 00:23:16.000 --> 00:23:19.000 find everything you can get to from the east, and 00:23:19.000 --> 00:23:22.000 after, kind of, going all through that, come back and try south, and so on. So all of the 00:23:22.000 --> 00:23:23.000 options you have 00:23:23.000 --> 00:23:26.000 exhaustively getting through all your 00:23:26.000 --> 00:23:26.000 neighbors 00:23:26.000 --> 00:23:29.000 in this, kind of, go-deep strategy. Well, 00:23:29.000 --> 00:23:32.000 the important thing is you realize if you've got this recursion going, so what 00:23:32.000 --> 00:23:35.000 you're actually doing, basically, is saying I'm gonna step to my neighbor to the 00:23:35.000 --> 00:23:40.000 right, and I'm gonna explore all the things, so do a depth-first search from there 00:23:40.000 --> 00:23:43.000 to find all the places you can reach, and that one says, well, I'm gonna step through this spot 00:23:43.000 --> 00:23:45.000 and go to one of my neighbors. 00:23:45.000 --> 00:23:48.000 And so we can't do that infinitely without getting into trouble. We need 00:23:48.000 --> 00:23:53.000 to have a base case that stops that recursion, and the two cases that I've 00:23:53.000 --> 00:23:56.000 talked about - one is just when you have no neighbors left to go. 00:23:56.000 --> 00:23:57.000 So when you 00:23:57.000 --> 00:23:59.000 00:23:59.000 --> 00:24:04.000 get to a node that actually, kind of, is a dead end, or that neighbor only has visited 00:24:04.000 --> 00:24:05.000 neighbors 00:24:05.000 --> 00:24:09.000 that surround it. So a 00:24:09.000 --> 00:24:10.000 very simple little piece of 00:24:10.000 --> 00:24:13.000 code that depends on recursion doing its job. 00:24:13.000 --> 00:24:17.000 In this case, we need to know which nodes we have visited, right? We're gonna have to 00:24:17.000 --> 00:24:19.000 have some kind of strategy for saying I 00:24:19.000 --> 00:24:21.000 have been here before. 00:24:21.000 --> 00:24:24.000 In this case, I'm using just a set because that's an easy thing to do. 00:24:24.000 --> 00:24:27.000 I make a set of node pointers, and I update it 00:24:27.000 --> 00:24:29.000 each time I see a new node; I add it to 00:24:29.000 --> 00:24:34.000 that set, and if I ever get to a node who is already previously visited 00:24:34.000 --> 00:24:38.000 from that, there's no reason to do any work from there; we can immediately 00:24:38.000 --> 00:24:41.000 stop. So 00:24:41.000 --> 00:24:43.000 starting that step would be empty. 00:24:43.000 --> 00:24:46.000 If the visit already contains the node that we're currently trying to visit, 00:24:46.000 --> 00:24:50.000 then there's nothing to do. Otherwise, we go ahead and add it, and there's a little node 00:24:50.000 --> 00:24:53.000 here that says we'll do something would occur. What am I trying to do here? Am I trying to print those nodes? Am 00:24:53.000 --> 00:24:56.000 I trying to draw pictures on those nodes, highlight those nodes, you know, 00:24:56.000 --> 00:25:00.000 something I'm doing with them. I don't know what it is, but this is the structure for the depth-first search 00:25:00.000 --> 00:25:01.000 here, and 00:25:01.000 --> 00:25:02.000 then 00:25:02.000 --> 00:25:04.000 for all of the neighbors, 00:25:04.000 --> 00:25:07.000 using the vector of outgoing connections, 00:25:07.000 --> 00:25:08.000 then I run a depth-first search 00:25:08.000 --> 00:25:11.000 on each of those neighbors 00:25:11.000 --> 00:25:14.000 passing that visited set by reference 00:25:14.000 --> 00:25:16.000 through the whole operation. So I will 00:25:16.000 --> 00:25:20.000 always be adding to and modifying this one set, so I will 00:25:20.000 --> 00:25:24.000 be able to know where I am and where I still need to go. 00:25:24.000 --> 00:25:27.000 So if we trace this guy in action - 00:25:27.000 --> 00:25:31.000 if I start arbitrarily from A, I say I'd like to begin with A. 00:25:31.000 --> 00:25:35.000 Then the depth-first search is gonna pick a neighbor, and let's say I happen to just work 00:25:35.000 --> 00:25:38.000 over my neighbors in alphabetical order. 00:25:38.000 --> 00:25:41.000 It would say okay, well, I've picked B. Let's go to everywhere we can get to from B. So A 00:25:41.000 --> 00:25:42.000 gets marked as visited. 00:25:42.000 --> 00:25:44.000 I move onto B; I say, well, 00:25:44.000 --> 00:25:48.000 B, search everywhere you can to from B, and B says okay, well, I need to pick a 00:25:48.000 --> 00:25:50.000 neighbor, how about I pick 00:25:50.000 --> 00:25:51.000 C? And says 00:25:51.000 --> 00:25:53.000 and now, 00:25:53.000 --> 00:25:56.000 having visited C, go everywhere you can get to from C. Well, 00:25:56.000 --> 00:25:59.000 C has no neighbors, right, no outgoing connections whatsoever. 00:25:59.000 --> 00:26:02.000 So C says, well, okay, everywhere you can get to from C is C, 00:26:02.000 --> 00:26:04.000 you know, I'm done. 00:26:04.000 --> 00:26:09.000 That will cause it to backtrack to this place, to B, and B says okay, I explored everywhere I can get to 00:26:09.000 --> 00:26:10.000 from C; 00:26:10.000 --> 00:26:13.000 what other neighbors do I have? B has no other neighbors. So, in fact, B backtracks all the way 00:26:13.000 --> 00:26:15.000 back to A. 00:26:15.000 --> 00:26:19.000 So now A says okay, well, I went everywhere I can get to from B, 00:26:19.000 --> 00:26:20.000 right? Let me look at my next neighbor. 00:26:20.000 --> 00:26:23.000 My next neighbor is C. Ah, 00:26:23.000 --> 00:26:25.000 but C, already visited, 00:26:25.000 --> 00:26:27.000 so there's no reason to 00:26:27.000 --> 00:26:29.000 go crazy on that thing. So, in fact, I can 00:26:29.000 --> 00:26:33.000 immediately see that I've visited that, so I don't have any further path to look at 00:26:33.000 --> 00:26:33.000 there. 00:26:33.000 --> 00:26:35.000 I'll hit D then 00:26:35.000 --> 00:26:37.000 as the next one in sequence. So 00:26:37.000 --> 00:26:40.000 at this point, I've done everything reachable from the B arm, everything 00:26:40.000 --> 00:26:43.000 reachable from the C arm, and now I'm starting on the D arm. I said okay, where can I get to 00:26:43.000 --> 00:26:44.000 from D? 00:26:44.000 --> 00:26:46.000 Well, I can get to E. All right, 00:26:46.000 --> 00:26:48.000 where can I get to from E? 00:26:48.000 --> 00:26:52.000 E goes to F, so the idea is you think of it going deep. It's going as far away as it 00:26:52.000 --> 00:26:56.000 can, as deep as possible, before it, kind of, unwinds 00:26:56.000 --> 00:26:58.000 getting to the F that has no neighbors 00:26:58.000 --> 00:27:01.000 and saying okay. Well, how about - where can I get to 00:27:01.000 --> 00:27:04.000 also from E? I can get to G. 00:27:04.000 --> 00:27:07.000 From G where can I get to? Nowhere, 00:27:07.000 --> 00:27:09.000 so we, kind of, unwind the 00:27:09.000 --> 00:27:12.000 D arm, and then we will eventually get back to and hit the H. 00:27:12.000 --> 00:27:16.000 So the order they actually were hit in this case happened to be alphabetical. I 00:27:16.000 --> 00:27:18.000 assigned them such that that would come out that way. 00:27:18.000 --> 00:27:21.000 That's not really a property of what depth-first search does, 00:27:21.000 --> 00:27:24.000 but think of it as like it's going as deep as possible, as far away as possible, 00:27:24.000 --> 00:27:28.000 and so it makes a choice - it's pretty much like the recursive backtrackers that we've seen. In 00:27:28.000 --> 00:27:31.000 fact, they make a choice, commit to it, and just move forward on it, never, kind of, 00:27:31.000 --> 00:27:35.000 giving a backwards glance, and only if the whole process 00:27:35.000 --> 00:27:35.000 bottoms out 00:27:35.000 --> 00:27:39.000 does it come back and start unmaking decisions and try alternatives, 00:27:39.000 --> 00:27:43.000 eventually exploring everything reachable from A 00:27:43.000 --> 00:27:46.000 when all is done and said. So if 00:27:46.000 --> 00:27:49.000 you, for example, use that on the maze, 00:27:49.000 --> 00:27:52.000 right, the depth-first search on a maze is very much the same strategy we're talking about 00:27:52.000 --> 00:27:53.000 here. 00:27:53.000 --> 00:27:56.000 Instead of the way we did it was actually breadth-first search, if you 00:27:56.000 --> 00:27:57.000 remember back to that assignment, 00:27:57.000 --> 00:28:00.000 the depth-first search alternative is actually doing it the - 00:28:00.000 --> 00:28:01.000 just go deep, 00:28:01.000 --> 00:28:04.000 make a choice, go with it, run all the way until it bottoms out, 00:28:04.000 --> 00:28:12.000 and only then back up and try some other alternatives. 00:28:12.000 --> 00:28:15.000 The breadth-first traversal, gonna hit 00:28:15.000 --> 00:28:16.000 the same nodes 00:28:16.000 --> 00:28:19.000 but just gonna hit them in a different way. 00:28:19.000 --> 00:28:21.000 If my goal were to actually hit all of them, 00:28:21.000 --> 00:28:25.000 then either of them is gonna be a fine strategy. There may be some reason why 00:28:25.000 --> 00:28:25.000 00:28:25.000 --> 00:28:29.000 I'm hoping to stop the search early, and that actually might dictate why I would prefer 00:28:29.000 --> 00:28:31.000 which ones to look at first. 00:28:31.000 --> 00:28:35.000 The breadth-first traversal is going to explore things in terms of distance, 00:28:35.000 --> 00:28:40.000 in this case, expressed in terms of number of hops away from the starting 00:28:40.000 --> 00:28:40.000 node. 00:28:40.000 --> 00:28:44.000 So it's gonna look at everything that's one hop away, and then go back and look at 00:28:44.000 --> 00:28:46.000 everything two hops away, and then three hops away. 00:28:46.000 --> 00:28:49.000 And so if the goal, for example, was to find a node 00:28:49.000 --> 00:28:51.000 in a shortest path connection between it, 00:28:51.000 --> 00:28:55.000 then breadth-first search, by looking at all the ones one hop away, if it was a one hop 00:28:55.000 --> 00:28:56.000 path, it'll find it, 00:28:56.000 --> 00:28:59.000 and if it's a two hop path, it'll find it on that next round, and then the three hop 00:28:59.000 --> 00:29:00.000 path, 00:29:00.000 --> 00:29:02.000 and there might be paths that are 10, and 20, and 100 00:29:02.000 --> 00:29:03.000 steps long, 00:29:03.000 --> 00:29:06.000 but if I find it earlier, I won't go looking at them. It actually might prove to 00:29:06.000 --> 00:29:09.000 be a more efficient way to get to the shortest solution as opposed to depth-first 00:29:09.000 --> 00:29:11.000 search which go off and try all these deep 00:29:11.000 --> 00:29:14.000 paths that may not ever end up where I want it to be 00:29:14.000 --> 00:29:16.000 before it eventually made its way to the short 00:29:16.000 --> 00:29:20.000 solution I was wanting to find. So the thing 00:29:20.000 --> 00:29:21.000 is we have the starting node. 00:29:21.000 --> 00:29:24.000 We're gonna visit all the immediate neighbors. 00:29:24.000 --> 00:29:27.000 So, basically, a loop over the 00:29:27.000 --> 00:29:28.000 immediate one, 00:29:28.000 --> 00:29:31.000 and then while we're doing that, we're gonna actually, kind of, be gathering up that 00:29:31.000 --> 00:29:33.000 next generation. 00:29:33.000 --> 00:29:33.000 00:29:33.000 --> 00:29:37.000 Sometimes that's called the Frontier, sort of, advancing the frontier 00:29:37.000 --> 00:29:39.000 of the nodes to explore next, 00:29:39.000 --> 00:29:42.000 so all the nodes that are two hops away, 00:29:42.000 --> 00:29:44.000 and then while we're processing the ones that are two hops away, we're gonna 00:29:44.000 --> 00:29:47.000 be gathering the frontier 00:29:47.000 --> 00:29:50.000 that is three hops away. And so at each stage we'll be, kind of, processing a 00:29:50.000 --> 00:29:54.000 generation but while assembling the generation N+1 00:29:54.000 --> 00:29:58.000 in case we need to go further to find those things. 00:29:58.000 --> 00:30:00.000 We'll use an auxiliary data structure during 00:30:00.000 --> 00:30:04.000 this. That management of the frontier, keeping track of the nodes that are, kind 00:30:04.000 --> 00:30:05.000 of, 00:30:05.000 --> 00:30:07.000 staged for the 00:30:07.000 --> 00:30:09.000 subsequent iterations, 00:30:09.000 --> 00:30:12.000 the queue is a perfect data structure for that. If I put the 00:30:12.000 --> 00:30:14.000 initial neighbors into a queue, 00:30:14.000 --> 00:30:19.000 as I dequeue them, if I put their neighbors on the back of the queue, so enqueue them 00:30:19.000 --> 00:30:21.000 behind all the one hop neighbors, then 00:30:21.000 --> 00:30:24.000 if I do that with code, as I'm processing Generation 1 off the front of the queue, I'll be 00:30:24.000 --> 00:30:27.000 stacking Generation 2 in the back of the queue, 00:30:27.000 --> 00:30:30.000 and then when all of Generation 1 is exhausted, I'll be processing Generation 00:30:30.000 --> 00:30:31.000 2, 00:30:31.000 --> 00:30:35.000 but, kind of, enqueuing Generation 3 behind it. 00:30:35.000 --> 00:30:39.000 Same deal that we had with depth-first search though is we will have to be 00:30:39.000 --> 00:30:41.000 careful about cycles will pull past that 00:30:41.000 --> 00:30:44.000 when there's more than one possibility of how we get to something, we don't 00:30:44.000 --> 00:30:46.000 want to, kind of, keep going around that cycle 00:30:46.000 --> 00:30:52.000 or do a lot of redundant work visiting nodes that we've already seen. So 00:30:52.000 --> 00:30:55.000 I have the same idea of keeping track of a set 00:30:55.000 --> 00:30:57.000 that knows what 00:30:57.000 --> 00:31:00.000 we've previously visited, 00:31:00.000 --> 00:31:01.000 and the 00:31:01.000 --> 00:31:05.000 initial node gets enqueued just, kind of, by itself, just like Generation 0 gets 00:31:05.000 --> 00:31:07.000 put into the queue, 00:31:07.000 --> 00:31:10.000 and then while the queue is not empty, so while there's still neighbors I haven't 00:31:10.000 --> 00:31:11.000 yet visited, 00:31:11.000 --> 00:31:13.000 I dequeue the frontmost one, 00:31:13.000 --> 00:31:16.000 if it hasn't already been visited on some previous iteration, then I mark it as 00:31:16.000 --> 00:31:17.000 visited, 00:31:17.000 --> 00:31:22.000 and then I enqueue all of its children or neighbors at the back of the queue. 00:31:22.000 --> 00:31:26.000 And so, as of right now, I'm processing Generation 0, that means all of Generation 1, so the 00:31:26.000 --> 00:31:29.000 neighbors that are one hop away, get placed in the queue, and then 00:31:29.000 --> 00:31:34.000 the next iteration coming out here will pull off a Generation 1, 00:31:34.000 --> 00:31:37.000 and then enqueue all of these two hop neighbors. 00:31:37.000 --> 00:31:40.000 Come back and other Generation 1's will get pulled off, enqueue all their two hop 00:31:40.000 --> 00:31:44.000 neighbors, and so the two hop generation will, kind of, be built up behind, and then 00:31:44.000 --> 00:31:48.000 once the last of the one hop generation gets managed, 00:31:48.000 --> 00:31:51.000 start processing the Generation 2 and so on. 00:31:51.000 --> 00:31:55.000 And so this will keep going while the queue has some 00:31:55.000 --> 00:31:58.000 contents in there. So that means some neighbors 00:31:58.000 --> 00:32:01.000 we have connections to that we haven't yet had a chance to explore, and 00:32:01.000 --> 00:32:05.000 either we will find out that they have all been visited, so this will end when 00:32:05.000 --> 00:32:09.000 all of the nodes that are in the queue have been visited or 00:32:09.000 --> 00:32:14.000 there are no more neighbors to enqueue. So we've gotten to those dead ends or those cycles 00:32:14.000 --> 00:32:25.000 that mean we've reach everything that was reachable from that start node. 00:32:25.000 --> 00:32:28.000 And so I've traced this guy 00:32:28.000 --> 00:32:30.000 to see it doing its work. 00:32:30.000 --> 00:32:33.000 So if I start again with A, 00:32:33.000 --> 00:32:36.000 and, again, assuming that I'm gonna process my nodes 00:32:36.000 --> 00:32:40.000 in alphabetical order when I have children, so I will through and I 00:32:40.000 --> 00:32:44.000 will enqueue, so the queue right now has just node A in it. I pull it off. I say 00:32:44.000 --> 00:32:45.000 have I visited? No, I have not. 00:32:45.000 --> 00:32:47.000 So then I mark it as visited, 00:32:47.000 --> 00:32:52.000 and now for each of its neighboring nodes, B, C, D, H, I'm gonna put those into 00:32:52.000 --> 00:32:54.000 the queue, 00:32:54.000 --> 00:32:57.000 and so then they're loaded up, and then I come back around, I say do I still have stuff in the queue? I 00:32:57.000 --> 00:33:01.000 do, so let's take out the first thing that was put in; it was B. Okay, 00:33:01.000 --> 00:33:05.000 and I'll say okay, well, where can you get to from B? You can get to C, so I'm gonna put C in the queue. Now, 00:33:05.000 --> 00:33:09.000 C is actually already in the queue, but we're gonna pull it out 00:33:09.000 --> 00:33:12.000 earlier in terms it'll get handled by the 00:33:12.000 --> 00:33:14.000 visiting status later. 00:33:14.000 --> 00:33:16.000 So, okay, we'll put C, D, and E in 00:33:16.000 --> 00:33:20.000 and then put C behind it. So right now, we'll have C, D, H, and C again. 00:33:20.000 --> 00:33:24.000 It'll find the C that's in the front there. It'll say where can you get to from C? 00:33:24.000 --> 00:33:24.000 Nowhere, 00:33:24.000 --> 00:33:26.000 so no additional 00:33:26.000 --> 00:33:28.000 Generation 2's are added by that one. 00:33:28.000 --> 00:33:31.000 I'll look at D, and I'll say okay, well, what Generation 2's do we have from D? It says, well, I can 00:33:31.000 --> 00:33:32.000 get to E. All right, so put E on the 00:33:32.000 --> 00:33:34.000 back of the 00:33:34.000 --> 00:33:36.000 00:33:36.000 --> 00:33:39.000 queue. Look at H; where can H get to? H can get to G, 00:33:39.000 --> 00:33:42.000 so go ahead and put G on the back of the queue. 00:33:42.000 --> 00:33:45.000 So now, it comes back around to the things that are two hops away. 00:33:45.000 --> 00:33:47.000 The first one it's gonna pull off is C. It's 00:33:47.000 --> 00:33:49.000 gonna say, well, C is already visited, 00:33:49.000 --> 00:33:53.000 so there's no need to redundantly visit that or do anything with that. 00:33:53.000 --> 00:33:54.000 It passes over C. It 00:33:54.000 --> 00:33:56.000 finds the E 00:33:56.000 --> 00:33:58.000 that's behind that, 00:33:58.000 --> 00:34:00.000 and then it finds the G that was behind the 00:34:00.000 --> 00:34:03.000 H. The E also enqueued the F 00:34:03.000 --> 00:34:07.000 that was four hops, and the very last one, right, output will be that F that was 00:34:07.000 --> 00:34:09.000 only reachable by taking 00:34:09.000 --> 00:34:12.000 three hops away from the thing. So this one's, kind of, 00:34:12.000 --> 00:34:13.000 working radially. The 00:34:13.000 --> 00:34:16.000 idea that I'm looking at all the things that I can get to, kind of, in one hop 00:34:16.000 --> 00:34:19.000 are my first generation, then two hops, then three hops, 00:34:19.000 --> 00:34:23.000 and so it's like throwing a stone in the water and watching it ripple out 00:34:23.000 --> 00:34:26.000 that the visiting 00:34:26.000 --> 00:34:29.000 is managed in terms of, kind of, growing length of path, 00:34:29.000 --> 00:34:32.000 and number of hops that it took to get there. 00:34:32.000 --> 00:34:37.000 So this is the strategy that you used for maze solving, if you remember. 00:34:37.000 --> 00:34:37.000 That 00:34:37.000 --> 00:34:42.000 what you were tracking was the queue of paths where the path was 00:34:42.000 --> 00:34:44.000 AB, or ADC, or 00:34:44.000 --> 00:34:46.000 ADE, and that they grew 00:34:46.000 --> 00:34:47.000 00:34:47.000 --> 00:34:50.000 step wise. That, kind of, each iteration through the 00:34:50.000 --> 00:34:54.000 traversal there where you're saying, okay, well, I've seen all the paths that are one hop. Now I've seen all the ones of two, 00:34:54.000 --> 00:34:56.000 now the ones three, and four, and five, 00:34:56.000 --> 00:34:58.000 and when you keep doing that, eventually you get to the hop, 00:34:58.000 --> 00:35:01.000 you know, 75, which leads you 00:35:01.000 --> 00:35:05.000 all the way to the goal, and since you have looked at all the paths that were 00:35:05.000 --> 00:35:09.000 74 and shorter, the first path that you dequeue that leads to your goal 00:35:09.000 --> 00:35:12.000 must be the best or the shortest overall 00:35:12.000 --> 00:35:13.000 because 00:35:13.000 --> 00:35:15.000 of the order you processed them. 00:35:15.000 --> 00:35:18.000 That's not true in depth-first search, right? That depth-first search might find a much 00:35:18.000 --> 00:35:21.000 longer and more circuitous path that led to the goal 00:35:21.000 --> 00:35:25.000 when there is a shorter one that would be eventually found if I had a graph that 00:35:25.000 --> 00:35:28.000 looked, 00:35:28.000 --> 00:35:36.000 you know, had this long path, let's say, 00:35:36.000 --> 00:35:41.000 and so this is goal node, let's say, and this is the start node that 00:35:41.000 --> 00:35:44.000 there could be a two hop path, kind of, off to this angle, 00:35:44.000 --> 00:35:47.000 but if this was the first one explored in the depth-first, right, 00:35:47.000 --> 00:35:50.000 it could eventually, kind of, work its way all the way around through this 00:35:50.000 --> 00:35:52.000 eight hop path, and say, okay, well, I found it, right? 00:35:52.000 --> 00:35:55.000 It would be the first one we found, but there's no guarantee that would be the 00:35:55.000 --> 00:35:58.000 shortest in depth-first search, right? It would just happen to be the one that based on 00:35:58.000 --> 00:36:01.000 its arbitrary choice of which 00:36:01.000 --> 00:36:04.000 neighbors to pursue it happened to get to that there could be a shorter 00:36:04.000 --> 00:36:08.000 one that would be found later in the traversal. That's not true in 00:36:08.000 --> 00:36:12.000 the breadth-first search strategy because it will be looking at this, and then these two, 00:36:12.000 --> 00:36:16.000 and then these two, and, kind of, working its way outward down those edges. 00:36:16.000 --> 00:36:19.000 So as soon as we get to one that hits the goal, we 00:36:19.000 --> 00:36:25.000 know we have found the shortest we can have. Question? 00:36:25.000 --> 00:36:28.000 In that case, 00:36:28.000 --> 00:36:33.000 you could return, okay, we need two 00:36:33.000 --> 00:36:34.000 jumps to get to 00:36:34.000 --> 00:36:35.000 the goal, 00:36:35.000 --> 00:36:36.000 like, as far 00:36:36.000 --> 00:36:40.000 winning the goal. Yeah. How do we remember which path led us there? 00:36:40.000 --> 00:36:42.000 So if you were really doing that, you know, in terms of depth-first search, what 00:36:42.000 --> 00:36:46.000 you'd probably be tracking is what's the path? Probably a stack that said here's the stack 00:36:46.000 --> 00:36:49.000 that led to that path. Let me hold 00:36:49.000 --> 00:36:53.000 onto this, right, and I will compare it to these alternatives. So I'll try it on Neighbor 1, 00:36:53.000 --> 00:36:54.000 get the stack back from that; it's the best. 00:36:54.000 --> 00:36:58.000 Try it on Neighbor 2, get that stack back, see which one is smaller and use 00:36:58.000 --> 00:37:01.000 that as the better choice, and so, eventually, when I tried it out this side, I'd get 00:37:01.000 --> 00:37:03.000 the stack back that was just two, 00:37:03.000 --> 00:37:06.000 and I could say, yeah, that was better than this ten step path I had over here, so I 00:37:06.000 --> 00:37:09.000 will prefer that. So just using your standard, kind of, backtracking with 00:37:09.000 --> 00:37:13.000 using your return values, incorporating them, and deciding which was better. 00:37:13.000 --> 00:37:14.000 And so in this case, 00:37:14.000 --> 00:37:16.000 the depth-first search 00:37:16.000 --> 00:37:19.000 can't really be pruned very easily 00:37:19.000 --> 00:37:24.000 because it has to, kind of, explore the whole thing to know that this 00:37:24.000 --> 00:37:26.000 short little path right over here happened to be just 00:37:26.000 --> 00:37:29.000 arbitrarily examined last, whereas, the breadth-first search actually does, kind of, because 00:37:29.000 --> 00:37:31.000 it's prioritizing by 00:37:31.000 --> 00:37:35.000 order of length. If the goal was shortest path, it will find it sooner and be able to 00:37:35.000 --> 00:37:35.000 stop 00:37:35.000 --> 00:37:39.000 when it found it rather than look at these longer paths. So it won't end up - 00:37:39.000 --> 00:37:42.000 if there were hundreds of other paths over here, breadth-first search won't actually 00:37:42.000 --> 00:37:47.000 have to ever reach them or explore them. 00:37:47.000 --> 00:37:52.000 So there are a lot of things that actually end up being graph search in disguise. That's why 00:37:52.000 --> 00:37:54.000 a, kind of, a graph is a great data structure to, kind of, get to in 00:37:54.000 --> 00:37:57.000 106b because there are lots of problems 00:37:57.000 --> 00:38:01.000 that, in the end, if you can model them using the same 00:38:01.000 --> 00:38:03.000 setup of nodes, and arcs, and traversals, 00:38:03.000 --> 00:38:06.000 that you can answer interesting questions about that data by applying 00:38:06.000 --> 00:38:09.000 the same graph search techniques. 00:38:09.000 --> 00:38:12.000 So as long as you want to know things like, well, which nodes are reachable from this node at 00:38:12.000 --> 00:38:12.000 all? 00:38:12.000 --> 00:38:14.000 And then 00:38:14.000 --> 00:38:17.000 that just means, okay, I'm here and I want to get to these places; what places could 00:38:17.000 --> 00:38:18.000 I get to? Or 00:38:18.000 --> 00:38:21.000 knowing what courses I could take 00:38:21.000 --> 00:38:25.000 that would lead me to a CS degree, it's, kind of, like well, looking for the ones 00:38:25.000 --> 00:38:29.000 that lead to having all your graduation requirements 00:38:29.000 --> 00:38:32.000 satisfied, and each of those arcs could be a course you could take, right, how close did 00:38:32.000 --> 00:38:35.000 that get you to the goal? How can I get to the 00:38:35.000 --> 00:38:40.000 place I'd like to be, or what are the places that this course satisfies 00:38:40.000 --> 00:38:41.000 requirements that lead to? 00:38:41.000 --> 00:38:44.000 Knowing things about 00:38:44.000 --> 00:38:47.000 connectivity is, kind of, useful in things like the, kind of, bacon numbers or 00:38:47.000 --> 00:38:50.000 erdish numbers where people want to know, well, how close am I to greatness by - 00:38:50.000 --> 00:38:54.000 have I been in a movie with somebody who was in a movie, who was in a movie with Kevin Bacon? And, 00:38:54.000 --> 00:38:55.000 for me, it turns out 00:38:55.000 --> 00:38:57.000 zero; I have been in no movies. But have 00:38:57.000 --> 00:38:59.000 you written a paper with someone who then 00:38:59.000 --> 00:39:02.000 connects through a chain to some other famous person, right? Tells you about 00:39:02.000 --> 00:39:06.000 your, kind of, social network distance or something, or I'm linked in. You're trying to get a 00:39:06.000 --> 00:39:09.000 new job, and you want to find somebody to introduce you to 00:39:09.000 --> 00:39:13.000 the head of this cool company. It's, like, figuring who you know who knows them, right? 00:39:13.000 --> 00:39:18.000 It's finding paths through a social network graph. 00:39:18.000 --> 00:39:19.000 Finding out things like 00:39:19.000 --> 00:39:22.000 are there paths with cycles? What's the shortest path 00:39:22.000 --> 00:39:25.000 through the cycle, longest path through the cycle often tells you about things about 00:39:25.000 --> 00:39:27.000 redundancies in that structure 00:39:27.000 --> 00:39:31.000 that, for example, when you're building a network infrastructure, sometimes you do have 00:39:31.000 --> 00:39:34.000 cycles in it because you actually want to have multiple ways to get to things in case 00:39:34.000 --> 00:39:37.000 there's a breakdown. Like if there's a freeway accident, you want to have another way 00:39:37.000 --> 00:39:38.000 to get around it, 00:39:38.000 --> 00:39:41.000 but you are interested in, maybe, trying to keep your cost down is trying to find 00:39:41.000 --> 00:39:47.000 out where are the right places to put in those redundancies, those cycles that 00:39:47.000 --> 00:39:50.000 give you the best benefit with the least cost. 00:39:50.000 --> 00:39:54.000 Trying to find things like a continuous path that visits all the nodes once and exactly 00:39:54.000 --> 00:39:57.000 once, especially doing so efficiently, picking a short 00:39:57.000 --> 00:40:00.000 overall path for that. We talked a little bit about that in terms of what's called the 00:40:00.000 --> 00:40:02.000 traveling salesman problem. 00:40:02.000 --> 00:40:04.000 You are trying to get from - you have ten 00:40:04.000 --> 00:40:06.000 cities to visit on your book tour. You don't want to spend all your lifetime 00:40:06.000 --> 00:40:08.000 on a plane. What's the 00:40:08.000 --> 00:40:13.000 path of crisscrossing the country that gets you to all ten of those with the least 00:40:13.000 --> 00:40:15.000 time painfully 00:40:15.000 --> 00:40:17.000 spent on an airline? And so 00:40:17.000 --> 00:40:20.000 those are just graph search problems. You have these nodes. You have these arcs. 00:40:20.000 --> 00:40:21.000 How can you 00:40:21.000 --> 00:40:23.000 traverse them and visit 00:40:23.000 --> 00:40:25.000 and come up with 00:40:25.000 --> 00:40:27.000 a good solution. There's other 00:40:27.000 --> 00:40:30.000 problems I think that are kind of neat. Word letters, we used to give out an assignment on word letters. We 00:40:30.000 --> 00:40:31.000 didn't this time, but I thought it was an 00:40:31.000 --> 00:40:34.000 interesting problem to think about. The idea of 00:40:34.000 --> 00:40:36.000 how it is you can transform 00:40:36.000 --> 00:40:41.000 one letter at a time from one word to another is very much a 00:40:41.000 --> 00:40:45.000 graph problem behind the scenes, and that largely a breadth-first 00:40:45.000 --> 00:40:47.000 traversal is exactly what you're looking for there. How 00:40:47.000 --> 00:40:51.000 can I transform this with the shortest number of steps is, kind of, 00:40:51.000 --> 00:40:54.000 working radially out from the starting word. 00:40:54.000 --> 00:40:54.000 00:40:54.000 --> 00:40:58.000 Your boggle board turns out, really, is just a graph. If 00:40:58.000 --> 00:41:01.000 you think of 00:41:01.000 --> 00:41:07.000 having this sequence of letters 00:41:07.000 --> 00:41:12.000 that you're trying to work through, 00:41:12.000 --> 00:41:16.000 that really each of these letters can actually just be thought of as a node, 00:41:16.000 --> 00:41:20.000 and then the connections it has are to its eight neighbors, and that finding 00:41:20.000 --> 00:41:24.000 words in there, especially finding paths, starting from 00:41:24.000 --> 00:41:27.000 this node that lead away, that don't revisit, so, in fact, doing breadth-first 00:41:27.000 --> 00:41:31.000 or depth-first search. You most likely did depth-first search when you were writing the 00:41:31.000 --> 00:41:32.000 implementation of boggle, 00:41:32.000 --> 00:41:35.000 but that's, kind of, a deep search on I found an A. I found a P. I found a P, and, kind 00:41:35.000 --> 00:41:40.000 of, keeps going as long as that path looks viable. So, in this case, that the arc 00:41:40.000 --> 00:41:41.000 information we're using is 00:41:41.000 --> 00:41:44.000 having a sub of those letters, do they form a prefix that could be leading 00:41:44.000 --> 00:41:45.000 somewhere good, 00:41:45.000 --> 00:41:47.000 and then the, kind of, recursion bottoming out when 00:41:47.000 --> 00:41:52.000 we have run ourselves into a corner where we have no unvisited nodes 00:41:52.000 --> 00:41:53.000 that neighbor us, 00:41:53.000 --> 00:41:57.000 or only nodes that would extend to a prefix that no longer forms any 00:41:57.000 --> 00:41:58.000 viable words 00:41:58.000 --> 00:42:01.000 as a search problem. 00:42:01.000 --> 00:42:04.000 So it feels like everything you've done this quarter has actually been a graph in disguise; I just didn't 00:42:04.000 --> 00:42:08.000 tell you. The maze problem is very much a graph search problem, building a 00:42:08.000 --> 00:42:10.000 graph and searching it. 00:42:10.000 --> 00:42:13.000 Another one that I think is, kind of, neat to think about is, kind of, related to the idea of word letters 00:42:13.000 --> 00:42:18.000 is that often when you mistype a word, a word 00:42:18.000 --> 00:42:19.000 processor will suggest to you 00:42:19.000 --> 00:42:22.000 here's a word that you might have meant instead. You type a word that it doesn't 00:42:22.000 --> 00:42:24.000 have in its lexicon, and it says, well, 00:42:24.000 --> 00:42:25.000 that could just be 00:42:25.000 --> 00:42:28.000 a word I don't know, but it also could mean that you actually mistyped something, and so 00:42:28.000 --> 00:42:31.000 what you want to look at is having neighboring nodes, 00:42:31.000 --> 00:42:33.000 sort of, diagramming a graph 00:42:33.000 --> 00:42:35.000 of all the words you know of 00:42:35.000 --> 00:42:39.000 and then the, kind of, permutations, the tiny little twiddles to that word that 00:42:39.000 --> 00:42:41.000 are neighbors to it so that 00:42:41.000 --> 00:42:44.000 you could describe a set of suggestions 00:42:44.000 --> 00:42:47.000 as being those things that are within a certain 00:42:47.000 --> 00:42:50.000 hop distance from the original mistyped word. And 00:42:50.000 --> 00:42:53.000 so you could say, yeah, things that have two letters different are probably 00:42:53.000 --> 00:42:56.000 close enough, but things that have three or more letters different aren't actually 00:42:56.000 --> 00:42:58.000 likely to be 00:42:58.000 --> 00:43:00.000 the mistyped word, 00:43:00.000 --> 00:43:04.000 and then that, kind of, leads to supporting things like wildcard searches 00:43:04.000 --> 00:43:08.000 where you're trying to remember whether it's E-R-A-T or A-R-T-E at the 00:43:08.000 --> 00:43:10.000 end of desperate. 00:43:10.000 --> 00:43:14.000 Those things could be modeled as, kind of, like they search through a graph space where 00:43:14.000 --> 00:43:16.000 you're modeling those letters, and then there's, like, a 00:43:16.000 --> 00:43:20.000 branching where you try all of the letters coming out of D-E-S-P, seeing if any 00:43:20.000 --> 00:43:23.000 of them lead to an R-A-T-E. 00:43:23.000 --> 00:43:26.000 So the last thing I wanted 00:43:26.000 --> 00:43:28.000 to just bring up, and this is the one that's going into this week's assignment. I'm just 00:43:28.000 --> 00:43:31.000 gonna actually show it to you to, kind of, get you thinking about it because it's 00:43:31.000 --> 00:43:35.000 actually the problem at hand that you're gonna be solving 00:43:35.000 --> 00:43:36.000 is that 00:43:36.000 --> 00:43:41.000 the one of where I'm interested in finding the shortest path, but I'm gonna add in 00:43:41.000 --> 00:43:43.000 this new idea that 00:43:43.000 --> 00:43:45.000 all hops are not created equal. 00:43:45.000 --> 00:43:49.000 That a flight from San Francisco to Seattle is a shorter distance than the 00:43:49.000 --> 00:43:52.000 one that goes from San Francisco to Boston, 00:43:52.000 --> 00:43:53.000 and if I'm interested in getting 00:43:53.000 --> 00:43:57.000 to someplace, it's not just that I would like to take a direct flight. I'd also 00:43:57.000 --> 00:44:00.000 like to take the set of flights that involves the least, kind of, going out of my way, 00:44:00.000 --> 00:44:03.000 and going through other cities, and other directions. 00:44:03.000 --> 00:44:07.000 And so in this case, a breadth-first search is a little bit too simplistic. 00:44:07.000 --> 00:44:10.000 That if I just look at all the places I can get to in one hop, 00:44:10.000 --> 00:44:11.000 00:44:11.000 --> 00:44:15.000 those are not really equally 00:44:15.000 --> 00:44:16.000 weighted 00:44:16.000 --> 00:44:19.000 in terms of my goal of minimizing my total path distance, 00:44:19.000 --> 00:44:22.000 but if I, instead, prioritize - 00:44:22.000 --> 00:44:25.000 prioritize in your mind, immediately conjure up images of the 00:44:25.000 --> 00:44:26.000 priority queue, 00:44:26.000 --> 00:44:28.000 to look at paths that are, 00:44:28.000 --> 00:44:32.000 if I'm trying to minimize total distance, the shortest path, even if they 00:44:32.000 --> 00:44:36.000 represent several hops, if they're still shorter - if I have three flights, and each are 00:44:36.000 --> 00:44:37.000 100 miles, 00:44:37.000 --> 00:44:40.000 then they represent a flight of 300 total miles, 00:44:40.000 --> 00:44:44.000 and that's still preferred to one that actually is 2,000 miles or 7,000 miles 00:44:44.000 --> 00:44:45.000 going to Asia 00:44:45.000 --> 00:44:49.000 alternatively. And so if I worked on a breadth-first that's been modified by a total 00:44:49.000 --> 00:44:50.000 distance, 00:44:50.000 --> 00:44:51.000 that it would be a way 00:44:51.000 --> 00:44:54.000 to work out the shortest path 00:44:54.000 --> 00:44:55.000 in terms of total weight 00:44:55.000 --> 00:44:58.000 from San Francisco to my destination 00:44:58.000 --> 00:45:01.000 by a 00:45:01.000 --> 00:45:03.000 modified breadth-first, kind of, weighted strategy, 00:45:03.000 --> 00:45:06.000 and that is exactly what you're gonna be doing in this assignment. I'm not gonna 00:45:06.000 --> 00:45:08.000 give away too much by, kind of, 00:45:08.000 --> 00:45:11.000 delving too much in it, but the handout does a pretty good job of talking you through what 00:45:11.000 --> 00:45:15.000 you're doing, but it's a, kind of, neat way to think about that's how the GPS, the MapQuest, and things like 00:45:15.000 --> 00:45:16.000 that are working is 00:45:16.000 --> 00:45:17.000 you're at a site; 00:45:17.000 --> 00:45:21.000 you have a goal, and you have this information that you're trying to 00:45:21.000 --> 00:45:24.000 guide that search, and you're gonna use heuristics about 00:45:24.000 --> 00:45:28.000 appears to be the, kind of, optimal path to, kind of, pick that, and go with it, 00:45:28.000 --> 00:45:31.000 and see where it leads you. So 00:45:31.000 --> 00:45:32.000 that will be your 00:45:32.000 --> 00:45:40.000 task for this week. What we'll talk about on Friday is caching.