WEBVTT 00:00:00.000 --> 00:00:07.000 00:00:07.000 --> 00:00:12.000 00:00:12.000 --> 00:00:15.000 This presentation is delivered by the Stanford Center for Professional 00:00:15.000 --> 00:00:25.000 Development. 00:00:25.000 --> 00:00:27.000 How are we doing? 00:00:27.000 --> 00:00:28.000 Spring apparently is here; 00:00:28.000 --> 00:00:33.000 just like California, one day it's winter, one day it's spring. Hopefully it's here to say. 00:00:33.000 --> 00:00:38.000 So we had talked about doing it, we did this last time, we did an unsorted vector 00:00:38.000 --> 00:00:40.000 implementation of the map, we just did a little struct of 00:00:40.000 --> 00:00:42.000 the key value pair. 00:00:42.000 --> 00:00:44.000 And then we went through 00:00:44.000 --> 00:00:48.000 that both add and get value are running at linear time in that form because of the need 00:00:48.000 --> 00:00:52.000 to search; search to find a value to override it, and search to find a value to 00:00:52.000 --> 00:00:54.000 pull out the existing 00:00:54.000 --> 00:00:55.000 match to it. 00:00:55.000 --> 00:00:58.000 And if we went to the sorted form of that, we could get get value to be able to 00:00:58.000 --> 00:01:02.000 capitalize on that binary search property and be able to quickly find the 00:01:02.000 --> 00:01:06.000 key if it exists in there or to determine it doesn't exist. 00:01:06.000 --> 00:01:09.000 But the add still suffered from the shuffling problem. We can quickly 00:01:09.000 --> 00:01:12.000 find, using the binary search to find where the 00:01:12.000 --> 00:01:15.000 new element would need to go. But if it doesn't already go there, it doesn't 00:01:15.000 --> 00:01:17.000 already exist. We're going to have to shuffle to make that space. 00:01:17.000 --> 00:01:21.000 So we have still a linear factor in there that we're working on getting rid of. 00:01:21.000 --> 00:01:24.000 And then I know at the bottom that none of them have any 00:01:24.000 --> 00:01:27.000 built in overhead per element. There may be some capacity, excess capacity in the 00:01:27.000 --> 00:01:30.000 vector that we're absorbing, but there's no 00:01:30.000 --> 00:01:33.000 four, ten, eight-byte chunk that every 00:01:33.000 --> 00:01:36.000 element is taking as a cost. 00:01:36.000 --> 00:01:39.000 So we started to talk about this at the end on Friday, and we're going to go with this and 00:01:39.000 --> 00:01:41.000 see where this takes 00:01:41.000 --> 00:01:45.000 us. We had the idea of the vector, and we know what vector constraints are, right? This 00:01:45.000 --> 00:01:48.000 idea of the contiguous memory, and we were going to think about the link list for a minute to 00:01:48.000 --> 00:01:51.000 see if it was going to buy us anything. 00:01:51.000 --> 00:01:54.000 It tends to have an inverse relationship, that once you know where you 00:01:54.000 --> 00:01:57.000 want to insert something it's easy to do that rearrangement because it's just a 00:01:57.000 --> 00:02:01.000 little bit of pointer rewiring that needs to happen in that situation. 00:02:01.000 --> 00:02:03.000 But it's hard to find a position insert, 00:02:03.000 --> 00:02:06.000 even if the link list is in sorted order, 00:02:06.000 --> 00:02:10.000 knowing where the S's are, or the M's are, or the B's are is still a 00:02:10.000 --> 00:02:13.000 matter of starting from the beginning; that your only access in the 00:02:13.000 --> 00:02:16.000 singly linked list is to start at the beginning and work your way down. 00:02:16.000 --> 00:02:20.000 And so I'm going to take you through a little bit of a thought exercise about well, what if 00:02:20.000 --> 00:02:23.000 we try to do something about that? Right now having a pointer coming into 00:02:23.000 --> 00:02:26.000 Bashful and then subsequent ones going in sorted order 00:02:26.000 --> 00:02:27.000 isn't really very helpful. 00:02:27.000 --> 00:02:30.000 But if we were willing to try to rearrange our pointers 00:02:30.000 --> 00:02:33.000 to give us the access that we have in the 00:02:33.000 --> 00:02:36.000 sorted array form; for example, if I had a pointer 00:02:36.000 --> 00:02:39.000 to the middlemost element, in 00:02:39.000 --> 00:02:42.000 this case the Grumpy that's 00:02:42.000 --> 00:02:43.000 midway down the list, if I had 00:02:43.000 --> 00:02:45.000 that pointer 00:02:45.000 --> 00:02:49.000 and from there I could say, is the element I'm looking for 00:02:49.000 --> 00:02:53.000 ahead of Grumpy or behind Grumpy. So having Grumpy split my input in 00:02:53.000 --> 00:02:54.000 half. 00:02:54.000 --> 00:02:56.000 If I had that, right, then that would actually get me 00:02:56.000 --> 00:02:59.000 sort of moving towards that direction that I wanted to get, which is using the binary 00:02:59.000 --> 00:03:03.000 search property to facilitate throwing away half the data 00:03:03.000 --> 00:03:06.000 and look into just the half remaining that's interesting. 00:03:06.000 --> 00:03:09.000 Well, if I had a pointer to Grumpy, that doesn't quite solve my problem. A pointer to Grumpy 00:03:09.000 --> 00:03:12.000 could get me halfway down the list, but then the pointer's coming in this way. If I 00:03:12.000 --> 00:03:16.000 inverted the pointers leading away from it, 00:03:16.000 --> 00:03:17.000 that's also kind of helping 00:03:17.000 --> 00:03:20.000 to facilitate what I'm trying to do. So if I could point toward the middle 00:03:20.000 --> 00:03:24.000 and then kind of move away, and then if I actually kind of applied the same strategy not just at the topmost 00:03:24.000 --> 00:03:28.000 level, but at every subsequent division down, from the half, to the 00:03:28.000 --> 00:03:29.000 quarter to the eight, 00:03:29.000 --> 00:03:32.000 to the sixteenth. And instead of having a pointer 00:03:32.000 --> 00:03:34.000 to the front of the list, what I really maintained was a pointer to the 00:03:34.000 --> 00:03:36.000 middlemost element, and then 00:03:36.000 --> 00:03:38.000 that element maintained pointers 00:03:38.000 --> 00:03:40.000 to the middle of the 00:03:40.000 --> 00:03:44.000 upper quarter and the middle of the lower quarter, and all the way down. 00:03:44.000 --> 00:03:46.000 Then I could pull up this list 00:03:46.000 --> 00:03:50.000 into something that we call a binary search tree. 00:03:50.000 --> 00:03:54.000 So the way I would be storing this is having a pointer to a node, which 00:03:54.000 --> 00:03:58.000 is expected to the median or the middle, or somewhere right along 00:03:58.000 --> 00:03:59.000 the 00:03:59.000 --> 00:04:00.000 values being stored. 00:04:00.000 --> 00:04:04.000 And that it has two pointers, not just a next or previous or something like that, that are related to all 00:04:04.000 --> 00:04:08.000 the elements that precede this one in the ordering, and all the elements that 00:04:08.000 --> 00:04:09.000 follow it are in 00:04:09.000 --> 00:04:11.000 these two pointers. 00:04:11.000 --> 00:04:14.000 They call these subtrees coming off the smaller trees over here with three 00:04:14.000 --> 00:04:17.000 elements on this side, and three elements on that side. 00:04:17.000 --> 00:04:19.000 And that each of those nodes 00:04:19.000 --> 00:04:22.000 has recursively the same formulation; that it has two 00:04:22.000 --> 00:04:25.000 subtree hanging off of it 00:04:25.000 --> 00:04:27.000 that contain, 00:04:27.000 --> 00:04:32.000 given any particular node, that there's this binary search ordering here that says 00:04:32.000 --> 00:04:36.000 everything that's in that left subtree precedes this node in ordering. 00:04:36.000 --> 00:04:38.000 And everything that's in that right subtree 00:04:38.000 --> 00:04:39.000 follows it. 00:04:39.000 --> 00:04:44.000 And that's true for every node throughout the entire arrangement here. 00:04:44.000 --> 00:04:47.000 So this is called a tree, and 00:04:47.000 --> 00:04:50.000 this particular form is actually called the binary search tree. I'm 00:04:50.000 --> 00:04:53.000 going to throw a little bit of vocabulary at you, because I'm going to start using these words, and I just want to get 00:04:53.000 --> 00:04:54.000 them out there 00:04:54.000 --> 00:04:58.000 early so that you understand what those words mean when I start using them. 00:04:58.000 --> 00:05:01.000 Is that each of the cells in here, 00:05:01.000 --> 00:05:02.000 like the cells that are 00:05:02.000 --> 00:05:04.000 allocated for a link list, 00:05:04.000 --> 00:05:07.000 are individually heap allocated, we're going to have pointers that allow us to get back 00:05:07.000 --> 00:05:08.000 to 00:05:08.000 --> 00:05:12.000 them. We typically call those nodes; in terms of tree I'll call it a cell or node in the list kind 00:05:12.000 --> 00:05:16.000 of interchangeably. I'm more likely to use node when I'm talking about tree. 00:05:16.000 --> 00:05:19.000 And a tree is really just a pointer to a node. 00:05:19.000 --> 00:05:21.000 The empty tree 00:05:21.000 --> 00:05:24.000 would be a pointer whose value is null, so pointing at nothing. 00:05:24.000 --> 00:05:27.000 A singleton tree would be a pointer to a single node 00:05:27.000 --> 00:05:29.000 that has left and right subtrees 00:05:29.000 --> 00:05:31.000 that are both empty or both null. 00:05:31.000 --> 00:05:35.000 So Bashful by itself is a singleton, a pointer to 00:05:35.000 --> 00:05:38.000 Bashful. When we talk about subtrees, it just kind of means looking at the top; Grumpy has two 00:05:38.000 --> 00:05:40.000 subtrees, a left and a right, 00:05:40.000 --> 00:05:41.000 00:05:41.000 --> 00:05:42.000 that 00:05:42.000 --> 00:05:45.000 themselves are just smaller forms of the same recursive structure. 00:05:45.000 --> 00:05:49.000 And then we would say that Grumpy is the parent of Doc and Sleepy. 00:05:49.000 --> 00:05:52.000 So those are the two children nodes, and we'll call them the left child and the right 00:05:52.000 --> 00:05:53.000 child. 00:05:53.000 --> 00:05:56.000 And the node that's at the very top, 00:05:56.000 --> 00:06:00.000 the one that our external pointer is coming into and then kind of leads the way, is called 00:06:00.000 --> 00:06:02.000 the root node. The 00:06:02.000 --> 00:06:02.000 first node 00:06:02.000 --> 00:06:05.000 of the tree where we first start our work is going to be called the root. 00:06:05.000 --> 00:06:09.000 And these nodes down here at the bottom that have empty subtrees are called leaf nodes. 00:06:09.000 --> 00:06:14.000 So a node that has no further subtrees, so by itself has null 00:06:14.000 --> 00:06:16.000 left and right trees, is called a leaf. 00:06:16.000 --> 00:06:18.000 So that actually means the tree is kind of upside down 00:06:18.000 --> 00:06:21.000 if you look at it. That Grumpy's the root and these things are the 00:06:21.000 --> 00:06:25.000 leaves shows you that apparently most computer scientists don't get out very much, 00:06:25.000 --> 00:06:28.000 because they haven't noticed that trees actually grow the other direction. 00:06:28.000 --> 00:06:29.000 00:06:29.000 --> 00:06:32.000 But just kind of go with it, and 00:06:32.000 --> 00:06:33.000 you'll be 00:06:33.000 --> 00:06:36.000 speaking tree-speak in no time. 00:06:36.000 --> 00:06:43.000 Anybody have any questions about what we've got basically set up here? 00:06:43.000 --> 00:06:45.000 So there is a 00:06:45.000 --> 00:06:48.000 whole family of things that come under the heading of tree in computer 00:06:48.000 --> 00:06:52.000 science. We're going to look in particular at this one embodiment, the binary search 00:06:52.000 --> 00:06:53.000 tree. 00:06:53.000 --> 00:06:56.000 But before I get deep down in the details of that, I just want to back 00:06:56.000 --> 00:06:59.000 up for a second and just talk about what trees in general mean, and what kind of 00:06:59.000 --> 00:07:01.000 things in general are represented in manipulative trees. 00:07:01.000 --> 00:07:06.000 So that you have a sense of what is the broad notion of tree as a 00:07:06.000 --> 00:07:08.000 computer scientist would see it, 00:07:08.000 --> 00:07:10.000 is just that a tree is a recursive branching structure 00:07:10.000 --> 00:07:14.000 where there are nodes, which have pointers to subtrees. 00:07:14.000 --> 00:07:17.000 There may be one of those subtrees. There may be ten of those subtrees. There 00:07:17.000 --> 00:07:20.000 may be two; depending on the nature of the tree you're working on there may be a 00:07:20.000 --> 00:07:22.000 constraint on exactly how many children they have 00:07:22.000 --> 00:07:26.000 or some flexibility in the number of children they have 00:07:26.000 --> 00:07:27.000 00:07:27.000 --> 00:07:29.000 depending on what's being modeled here. 00:07:29.000 --> 00:07:31.000 There's always a single root node. 00:07:31.000 --> 00:07:35.000 There's a node that kind of sits at the top of this hierarchy that leads down to 00:07:35.000 --> 00:07:36.000 these things, 00:07:36.000 --> 00:07:41.000 and that to form a tree there has to be a reachable path from the root 00:07:41.000 --> 00:07:43.000 to every single node. 00:07:43.000 --> 00:07:46.000 So there aren't some nodes that exist way out here, and there aren't two different ways to 00:07:46.000 --> 00:07:48.000 get to the same 00:07:48.000 --> 00:07:50.000 node in a tree; that if the 00:07:50.000 --> 00:07:53.000 midterm is kept in the folder called exams, which is in the folder 00:07:53.000 --> 00:07:56.000 called cs106, which is in my home folder. There's not some other way that 00:07:56.000 --> 00:07:59.000 you can get to the midterm; the midterm lives exactly that part in the tree, 00:07:59.000 --> 00:08:01.000 and there's no 00:08:01.000 --> 00:08:04.000 circularity or loops in the structure. 00:08:04.000 --> 00:08:07.000 And so a lot of things do get modeled tree, the one I've drawn here is just the file system. 00:08:07.000 --> 00:08:10.000 Think about how your files are maintained on a computer, there tends to be a 00:08:10.000 --> 00:08:11.000 structure of an 00:08:11.000 --> 00:08:14.000 outermost folder, which has inner folders, 00:08:14.000 --> 00:08:17.000 which have inner folders. And along the way there are these files, which you can think of as leaf 00:08:17.000 --> 00:08:21.000 nodes that have no further subtrees underneath them. 00:08:21.000 --> 00:08:24.000 And you can talk about, say, what is the size of 00:08:24.000 --> 00:08:27.000 all the folders in my home directory? Well it would kind 00:08:27.000 --> 00:08:28.000 of be recursively defined as summing over 00:08:28.000 --> 00:08:30.000 all the folders within it, 00:08:30.000 --> 00:08:33.000 how much storage is being used by those; which when it gets down to a leaf node can say, how 00:08:33.000 --> 00:08:38.000 much space does this file take up and kind of add those together. 00:08:38.000 --> 00:08:42.000 Things like your family trees, right, sort of modeling genealogy. 00:08:42.000 --> 00:08:44.000 Things like the decomposition tree in your program; 00:08:44.000 --> 00:08:47.000 main calling, 00:08:47.000 --> 00:08:50.000 read file calling, set up boggle board calling, 00:08:50.000 --> 00:08:52.000 give instructions. 00:08:52.000 --> 00:08:56.000 There is a tree structure to that, about what pieces use what pieces and 00:08:56.000 --> 00:08:57.000 descendants from there. 00:08:57.000 --> 00:09:01.000 The ones that we're going to look at in particularly are the ones that are in the family of the binary trees. So a binary 00:09:01.000 --> 00:09:05.000 tree has two children, exactly two children. There's a left and a right 00:09:05.000 --> 00:09:07.000 pointer coming out of each node. 00:09:07.000 --> 00:09:11.000 One or both of those can be null, but there can never be more than two 00:09:11.000 --> 00:09:11.000 children. 00:09:11.000 --> 00:09:15.000 And there is sort of space for recording two children, as opposed to recording things which 00:09:15.000 --> 00:09:17.000 are trinary, which have three 00:09:17.000 --> 00:09:20.000 possibilities for children, or where there may be some four or eight or 00:09:20.000 --> 00:09:22.000 some other number there. 00:09:22.000 --> 00:09:24.000 And in particular it's the search tree 00:09:24.000 --> 00:09:26.000 that's going to help us get 00:09:26.000 --> 00:09:28.000 the work done that we want, which is 00:09:28.000 --> 00:09:32.000 there's not just any old arrangement of what's on the left, and what's on the right side. 00:09:32.000 --> 00:09:35.000 That the nodes are going to be organized in here to facilitate binary search by 00:09:35.000 --> 00:09:36.000 keeping 00:09:36.000 --> 00:09:37.000 one half 00:09:37.000 --> 00:09:40.000 over on the left and one half over on the right, and that we'll know which 00:09:40.000 --> 00:09:44.000 half something goes in by virtue of looking at the key. So 00:09:44.000 --> 00:09:47.000 let's take a look at a simple example of a binary search tree with 00:09:47.000 --> 00:09:49.000 numbers. So 00:09:49.000 --> 00:09:53.000 if I have a struct node that keeps track of an integer value and has two pointers, the 00:09:53.000 --> 00:09:55.000 left and the right, 00:09:55.000 --> 00:09:59.000 that point away from it. And in this case my root node is 57, 00:09:59.000 --> 00:10:03.000 and looking at 57 I can tell you that every value in this tree 00:10:03.000 --> 00:10:07.000 that is less than 57 will be in the left subtree. There will be no values 00:10:07.000 --> 00:10:09.000 that are 56 or lower that are on the 00:10:09.000 --> 00:10:10.000 right side. 00:10:10.000 --> 00:10:14.000 And that will be true at every step down the tree. So if I'm looking for 00:10:14.000 --> 00:10:15.000 something 00:10:15.000 --> 00:10:18.000 that actually gives me that immediate access we were using the vector's 00:10:18.000 --> 00:10:19.000 sorting property for. Which is if 00:10:19.000 --> 00:10:24.000 I'm looking for the number 70, and I start at the root node of 57, then I know 00:10:24.000 --> 00:10:25.000 it's not in the left subtree, 00:10:25.000 --> 00:10:28.000 so assuming that about half the nodes are 00:10:28.000 --> 00:10:30.000 pointed to over on that side, 00:10:30.000 --> 00:10:34.000 I've now discarded half of the input even from consideration. If 00:10:34.000 --> 00:10:36.000 I start at the 70, 00:10:36.000 --> 00:10:39.000 and I say okay well if 57 must be over here, then again recursively apply the 00:10:39.000 --> 00:10:44.000 same search property here. It's like if I'm looking for 70, is it going to be 00:10:44.000 --> 00:10:45.000 before or after 79? 00:10:45.000 --> 00:10:49.000 It's going to be before, so in fact I throw away anything that leads away on 00:10:49.000 --> 00:10:52.000 the right subtree of 79, and work my way one level down. 00:10:52.000 --> 00:10:53.000 So each step 00:10:53.000 --> 00:10:58.000 in the recursive search is represented by sort of taking a step down 00:10:58.000 --> 00:11:02.000 in that search tree; working my way down from the root to the leaf nodes. And 00:11:02.000 --> 00:11:05.000 I either will find the value I'm looking for along the way, 00:11:05.000 --> 00:11:08.000 or I will fall off the bottom of that tree 00:11:08.000 --> 00:11:10.000 when I haven't been able to find it. 00:11:10.000 --> 00:11:13.000 And that will tell me; oh it couldn't have been in the tree, because the only place 00:11:13.000 --> 00:11:16.000 it could have been is the one that followed the binary search property. 00:11:16.000 --> 00:11:19.000 And so if I look for 70, I say is it greater or less than 00:11:19.000 --> 00:11:21.000 62? It's greater. Is it 00:11:21.000 --> 00:11:24.000 greater or less than 71? It's less then. And then I work off there I get to the empty 00:11:24.000 --> 00:11:26.000 tree. Then I say well if there was a 60, that's where it was 00:11:26.000 --> 00:11:31.000 going to be, and it wasn't there so there much not be a 70 in this tree. 00:11:31.000 --> 00:11:32.000 So it is exactly designed 00:11:32.000 --> 00:11:36.000 to support one kind of search, and one kind of search only, which is this 00:11:36.000 --> 00:11:40.000 efficient divide and conquer binary search, work its way down. 00:11:40.000 --> 00:11:40.000 00:11:40.000 --> 00:11:42.000 The other kinds of things, let's say, that 00:11:42.000 --> 00:11:44.000 trees 00:11:44.000 --> 00:11:45.000 are used for 00:11:45.000 --> 00:11:48.000 may not need this kind of support, so they may not go out of their way to do this. But the one 00:11:48.000 --> 00:11:52.000 we're actually very interested in is because of maps doing a lot of 00:11:52.000 --> 00:11:55.000 additions into a sorted facility and 00:11:55.000 --> 00:11:57.000 searches in that sort of facility. If we can 00:11:57.000 --> 00:12:00.000 bottle it as a binary search tree, we're able to get the best of both worlds. To 00:12:00.000 --> 00:12:03.000 have that super fast logarithmic search, 00:12:03.000 --> 00:12:07.000 as well as have the flexibility of the pointer based structure 00:12:07.000 --> 00:12:11.000 to help us do an insert. So if I'm ready to put a 70 into this tree, 00:12:11.000 --> 00:12:14.000 the same path that led to me to where it should be 00:12:14.000 --> 00:12:16.000 tells us where to put that new node 00:12:16.000 --> 00:12:20.000 in place. And if all I need to do is to wire that in place; and if all I need to do to wire that in place is to kind of create a new node in the 00:12:20.000 --> 00:12:21.000 heap and 00:12:21.000 --> 00:12:24.000 put the pointers coming in and out of it the way that I would in a link list, 00:12:24.000 --> 00:12:27.000 then I should be able to do both searches and inserts 00:12:27.000 --> 00:12:30.000 in time proportional to the height of the tree, 00:12:30.000 --> 00:12:35.000 which for a relatively balanced tree would be logarithmic. So 00:12:35.000 --> 00:12:37.000 it sounds like we can get everything we wanted here 00:12:37.000 --> 00:12:40.000 with a little bit of work. So 00:12:40.000 --> 00:12:43.000 let me tell you - just do a little bit of playing with trees, just to kind of get used to thinking 00:12:43.000 --> 00:12:46.000 about how we do stuff on them, and then I'm going to go forward with 00:12:46.000 --> 00:12:49.000 implementing map using a binary search tree. 00:12:49.000 --> 00:12:52.000 Trees are very, very recursive. So for those of you who still feel like 00:12:52.000 --> 00:12:54.000 recursion is a little bit hazy, 00:12:54.000 --> 00:12:58.000 this is another great area to strengthen your thinking about 00:12:58.000 --> 00:13:01.000 recursive. In fact, I think a lot of people find operating on trees very, very 00:13:01.000 --> 00:13:03.000 natural. Because trees are so recursive, the 00:13:03.000 --> 00:13:07.000 idea that you write recursive algorithms to them is almost like you do it without even thinking, you don't 00:13:07.000 --> 00:13:08.000 have to think hard. You're like, 00:13:08.000 --> 00:13:11.000 you typically need to do something to your current node, 00:13:11.000 --> 00:13:14.000 and then recursively operate on your left and your right. 00:13:14.000 --> 00:13:17.000 And it's so natural that you don't even have stop and think hard about 00:13:17.000 --> 00:13:18.000 what that means. So 00:13:18.000 --> 00:13:21.000 your base case tends to be getting to an empty tree and having a 00:13:21.000 --> 00:13:23.000 pointer that points to null, 00:13:23.000 --> 00:13:28.000 and in the other cases have some node operate on your children. 00:13:28.000 --> 00:13:31.000 Some of the simplest things are just traversals. 00:13:31.000 --> 00:13:33.000 If I tree structure and I'd like to 00:13:33.000 --> 00:13:35.000 do something to every node, 00:13:35.000 --> 00:13:37.000 go through it and print them all, 00:13:37.000 --> 00:13:39.000 go through it and write them to a file, 00:13:39.000 --> 00:13:42.000 go through and add them all up to determine the average score, I 00:13:42.000 --> 00:13:46.000 need these things. I'm going to need to do some processing that visits every node of the 00:13:46.000 --> 00:13:48.000 tree once and exactly once, 00:13:48.000 --> 00:13:50.000 and recursion is exactly the way to do that. 00:13:50.000 --> 00:13:53.000 And there's a little bit of some choices, and when you're handling a 00:13:53.000 --> 00:13:55.000 particular part of a tree, are you handling the current node and then 00:13:55.000 --> 00:13:59.000 recurring on the left and right subtrees? Are you recurring first and then handling 00:13:59.000 --> 00:14:02.000 the nodes. So some of those things that we saw in terms of the link list 00:14:02.000 --> 00:14:03.000 that 00:14:03.000 --> 00:14:06.000 change the order of the processing, but they don't change 00:14:06.000 --> 00:14:08.000 whether you visit everybody. It's just a matter of which one you want to do 00:14:08.000 --> 00:14:09.000 before. 00:14:09.000 --> 00:14:12.000 I'm going to look at a couple of the simple traversals here, 00:14:12.000 --> 00:14:14.000 and I'll point out 00:14:14.000 --> 00:14:16.000 why that mattered. 00:14:16.000 --> 00:14:18.000 If we look at in order, 00:14:18.000 --> 00:14:22.000 so we'll consider that an in order traversal is for you to operate on your 00:14:22.000 --> 00:14:24.000 left subtree recursively, 00:14:24.000 --> 00:14:27.000 handle the current node right here, and then operate on your right subtree 00:14:27.000 --> 00:14:29.000 recursively. 00:14:29.000 --> 00:14:31.000 The base case, which is 00:14:31.000 --> 00:14:34.000 kind of hidden in here, is if T is null, we get to an empty subtree we 00:14:34.000 --> 00:14:37.000 don't do anything. So only in the nonempty case are we actually 00:14:37.000 --> 00:14:39.000 processing and making recursive calls. 00:14:39.000 --> 00:14:43.000 So if I start with my node 57 being the root of my tree, and I say 00:14:43.000 --> 00:14:45.000 print that tree starting from the root. 00:14:45.000 --> 00:14:48.000 It will print everything out of the left subtree, 00:14:48.000 --> 00:14:52.000 then print the 57, and then print everything out of the right subtree. 00:14:52.000 --> 00:14:55.000 And given that gets applied everywhere down, 00:14:55.000 --> 00:14:58.000 the effect will be 57 says first print my left subtree. 00:14:58.000 --> 00:15:01.000 And 37 says, well print my left subtree. And 15 says, well first 00:15:01.000 --> 00:15:03.000 print my left subtree. 00:15:03.000 --> 00:15:06.000 And then seven says, first print my left subtree. Well, that ones empty, 00:15:06.000 --> 00:15:09.000 and so nothing gets printed. 00:15:09.000 --> 00:15:13.000 As the recursion unwinds, it comes back to the last most call. The last most call was 00:15:13.000 --> 00:15:17.000 seven. It says, okay, print seven and now print seven's right subtree. Again that's an empty, 00:15:17.000 --> 00:15:19.000 so nothing gets printed there. It 00:15:19.000 --> 00:15:22.000 will come back and then do the same thing with 15. I've done everything 00:15:22.000 --> 00:15:23.000 to the left of 15; 00:15:23.000 --> 00:15:26.000 let's print 15 and its right. 00:15:26.000 --> 00:15:29.000 And doing that kind of all the way throughout the tree means that we will get the numbers out 00:15:29.000 --> 00:15:31.000 in sorted order; 00:15:31.000 --> 00:15:34.000 all right, five, 17, seven, 00:15:34.000 --> 00:15:37.000 15, 32, 57, 59, 62, 71, 00:15:37.000 --> 00:15:40.000 79, 93. An in order traversal 00:15:40.000 --> 00:15:42.000 in a binary search tree 00:15:42.000 --> 00:15:50.000 rediscovers, visits all the nodes from smallest to largest. [Inaudible] 00:15:50.000 --> 00:15:54.000 So duplicates, you just have to have a strategy. It turns out in the map, the duplicates 00:15:54.000 --> 00:15:54.000 overwrite. 00:15:54.000 --> 00:15:58.000 So any time you get a new value you print there. Typically you just need to decide, 00:15:58.000 --> 00:16:00.000 do they go to the left or do they go to the right? 00:16:00.000 --> 00:16:03.000 So that you know when you're equal to if you wanted to find another one 00:16:03.000 --> 00:16:04.000 of these, where would it go? 00:16:04.000 --> 00:16:07.000 In a lot of situations you just don't have duplicates, so it's not actually a big deal. 00:16:07.000 --> 00:16:10.000 You can't throw them arbitrarily on either side because you would end up having to search 00:16:10.000 --> 00:16:14.000 both sides to find them. So you want to just know that's it's less than or equal to on the 00:16:14.000 --> 00:16:18.000 left and greater than on the right. So that's 00:16:18.000 --> 00:16:22.000 pretty neat, right, that that means that this 00:16:22.000 --> 00:16:22.000 00:16:22.000 --> 00:16:25.000 for purposes of being able to iterate for something, if 00:16:25.000 --> 00:16:26.000 you were to walk 00:16:26.000 --> 00:16:30.000 the structure in order, then you can produce them back out in sorted order, which tends to be 00:16:30.000 --> 00:16:31.000 a convenient way 00:16:31.000 --> 00:16:33.000 00:16:33.000 --> 00:16:36.000 to print the output. There are a lot of situations where you do want that browsable 00:16:36.000 --> 00:16:43.000 list in alphanumerical order. And the tree makes that easy to do. 00:16:43.000 --> 00:16:45.000 When it comes time to delete the tree, I'm 00:16:45.000 --> 00:16:48.000 going to need to go through and delete each of those nodes individually, the 00:16:48.000 --> 00:16:51.000 same way I would on a link list to delete the whole structure. 00:16:51.000 --> 00:16:54.000 The 00:16:54.000 --> 00:16:57.000 order of traversal that's going to be most convenient for doing that is going to be to do it 00:16:57.000 --> 00:16:58.000 post order. 00:16:58.000 --> 00:17:00.000 Which is to say, when I'm looking at the root node 57, 00:17:00.000 --> 00:17:05.000 I want to delete my entire left subtree recursively, delete my entire right 00:17:05.000 --> 00:17:10.000 subtree recursively, and only after both of those pieces have been cleaned up, 00:17:10.000 --> 00:17:12.000 delete the node we're on; 00:17:12.000 --> 00:17:14.000 so deleting down below to the left, down below to the right 00:17:14.000 --> 00:17:17.000 and then delete this one. If I did it in the other order, if 00:17:17.000 --> 00:17:21.000 I tried to delete in between in the in order or in the pre order case where I did 00:17:21.000 --> 00:17:23.000 my node and then to my subtrees, 00:17:23.000 --> 00:17:26.000 I'd be reaching into parts of my memory I've deleted. 00:17:26.000 --> 00:17:31.000 So we do want to do it in that order, so 00:17:31.000 --> 00:17:32.000 recursion 00:17:32.000 --> 00:17:35.000 coming back to visit you again. 00:17:35.000 --> 00:17:42.000 Let's talk about how to make map work as a tree. 00:17:42.000 --> 00:17:46.000 So if you think about what maps hold, maps hold a key, which is string type; a 00:17:46.000 --> 00:17:48.000 value, which is client 00:17:48.000 --> 00:17:51.000 some sort of template parameter there. 00:17:51.000 --> 00:17:55.000 Now that I've built a structure that raps those two things with a pair and 00:17:55.000 --> 00:17:58.000 adds to it, right these two additional pointers to the left and to the right 00:17:58.000 --> 00:18:01.000 subtrees that point back to a recursive struct, 00:18:01.000 --> 00:18:05.000 the one data member that the map will have at the top is just the root 00:18:05.000 --> 00:18:09.000 pointer. The pointer to the node at the top of the tree, 00:18:09.000 --> 00:18:12.000 and then from there we'll traverse pointers to find our way to other ones. 00:18:12.000 --> 00:18:15.000 And we will organize that tree for the binary search property. 00:18:15.000 --> 00:18:16.000 So using the 00:18:16.000 --> 00:18:20.000 key as the discriminate of what goes left and right, 00:18:20.000 --> 00:18:25.000 we'll compare each new key to the key at the root and decide is it going to go in the left subtree 00:18:25.000 --> 00:18:27.000 root or the right subtree root? And so on down to the 00:18:27.000 --> 00:18:31.000 matching position when we're doing a find, or to the place to insert if we're ready to 00:18:31.000 --> 00:18:33.000 put a new node in. 00:18:33.000 --> 00:18:36.000 So get value and add will have very similar structures, in terms of 00:18:36.000 --> 00:18:38.000 searching that tree 00:18:38.000 --> 00:18:39.000 00:18:39.000 --> 00:18:40.000 from the root 00:18:40.000 --> 00:18:45.000 down to the leaf to find that match along the way, or to 00:18:45.000 --> 00:18:49.000 report where to put a new one in. So 00:18:49.000 --> 00:18:52.000 let's look at the actual code; it has 00:18:52.000 --> 00:18:54.000 my struct, it 00:18:54.000 --> 00:18:57.000 has the strike key, it has the value type node, 00:18:57.000 --> 00:19:00.000 two pointers there, and then the one data member for the map 00:19:00.000 --> 00:19:03.000 itself here is the pointer to the root node. I 00:19:03.000 --> 00:19:09.000 also put a couple of helper number functions that I'm going to need. 00:19:09.000 --> 00:19:10.000 00:19:10.000 --> 00:19:13.000 Searching the tree and entering a new node in the tree are going to require a little 00:19:13.000 --> 00:19:18.000 helper, and we're going to see why that is. That basically that the add and get value public 00:19:18.000 --> 00:19:20.000 number functions are going to be just wrappers 00:19:20.000 --> 00:19:23.000 that turn and pass the code off to these recursive helpers. 00:19:23.000 --> 00:19:26.000 And if these cases, the additional information that each of these is 00:19:26.000 --> 00:19:28.000 tracking is what node we're at 00:19:28.000 --> 00:19:32.000 as we work our way down the tree. But the initial call to add is just going to say, well 00:19:32.000 --> 00:19:34.000 here's a key to value go put it in the right place. 00:19:34.000 --> 00:19:37.000 And that while we're doing the recursion we need to know where we are in the tree. So 00:19:37.000 --> 00:19:39.000 we're adding that extra parameter. 00:19:39.000 --> 00:19:43.000 What a lot of recursive code needs is just a little more housekeeping, and the 00:19:43.000 --> 00:19:44.000 housekeeping is 00:19:44.000 --> 00:19:51.000 what part of the tree we're currently searching or trying to enter into. 00:19:51.000 --> 00:19:54.000 So this is the outer call for get value. 00:19:54.000 --> 00:19:57.000 So get value is really just a wrapper, 00:19:57.000 --> 00:19:59.000 it's going to call tree search. It's going to 00:19:59.000 --> 00:20:03.000 ask the starting value for root. It's the 00:20:03.000 --> 00:20:04.000 beginning point for the search, 00:20:04.000 --> 00:20:07.000 looking for a particular key. And what tree searches are going to return is a 00:20:07.000 --> 00:20:10.000 pointer to a node. If 00:20:10.000 --> 00:20:14.000 that node was not found, if there was no node that has a string 00:20:14.000 --> 00:20:17.000 key that matches the one that we're looking for, then we'll return null. Otherwise, we'll return 00:20:17.000 --> 00:20:18.000 the node. 00:20:18.000 --> 00:20:20.000 And so it either will error 00:20:20.000 --> 00:20:23.000 when it didn't find it or 00:20:23.000 --> 00:20:25.000 pull out the value that was out of 00:20:25.000 --> 00:20:25.000 that 00:20:25.000 --> 00:20:27.000 node structure. 00:20:27.000 --> 00:20:30.000 Well, this part looks pretty okay. The template, right, there's always just a 00:20:30.000 --> 00:20:33.000 little bit of goo up there, right, seeing kind of 00:20:33.000 --> 00:20:36.000 the template header on the top and the use of the val type, and all this other stuff. 00:20:36.000 --> 00:20:39.000 It's going to get a little bit worse here. So just kind of 00:20:39.000 --> 00:20:42.000 hopefully take a deep breath and say, okay that part actually 00:20:42.000 --> 00:20:46.000 doesn't scare me too much. We're going to look at what happens when I write the tree search code, where 00:20:46.000 --> 00:20:48.000 it really actually goes off and does the search down the tree. And there's going to be 00:20:48.000 --> 00:20:49.000 00:20:49.000 --> 00:20:52.000 a little bit more machinations required on this. So let's just talk about it first, 00:20:52.000 --> 00:20:55.000 just what the code does. And then I'm going to come back and talk about some of the syntax that's 00:20:55.000 --> 00:20:58.000 required to make this work correctly. 00:20:58.000 --> 00:21:02.000 So the idea of tree search is, given a pointer to a particular 00:21:02.000 --> 00:21:02.000 subtree, it 00:21:02.000 --> 00:21:05.000 could be the root of the original tree or some 00:21:05.000 --> 00:21:07.000 smaller subtree further down, and a key that we're looking for, we're 00:21:07.000 --> 00:21:10.000 going to return a match when we find that node. 00:21:10.000 --> 00:21:15.000 So looking at the step here, where if the key that we have matches the key of the node 00:21:15.000 --> 00:21:17.000 we're looking at, then this is the node. 00:21:17.000 --> 00:21:19.000 This is the one we wanted. 00:21:19.000 --> 00:21:23.000 Otherwise, right, we look at two cases. If the key is less than this node's key, 00:21:23.000 --> 00:21:26.000 then we're just going to return the result from searching in the left subtree. 00:21:26.000 --> 00:21:31.000 Otherwise it must be greater than, and we're going to search in the right subtree. 00:21:31.000 --> 00:21:34.000 So this case, right, duplicates aren't allowed, so there will be exactly one match, if 00:21:34.000 --> 00:21:37.000 at all. 00:21:37.000 --> 00:21:39.000 And when we find it we return. 00:21:39.000 --> 00:21:43.000 The other case that needs to be handled as part of base case is if what if we ever get to an 00:21:43.000 --> 00:21:44.000 empty tree. 00:21:44.000 --> 00:21:47.000 So we're looking for 32 and the last node we just passed was 00:21:47.000 --> 00:21:50.000 35. We needed to be the left of it, and it turns out there was nothing down there, we 00:21:50.000 --> 00:21:52.000 have an empty tree. 00:21:52.000 --> 00:21:54.000 We'll hit this case where T is null, and we'll return null. That means there was never 00:21:54.000 --> 00:21:58.000 a match to that. There's no other place in the tree 00:21:58.000 --> 00:22:01.000 where that 32 could have been. The binary search property completely dictates the 00:22:01.000 --> 00:22:04.000 series of left and right turns. There's nowhere else to look. 00:22:04.000 --> 00:22:06.000 So that's actually, really solving 00:22:06.000 --> 00:22:07.000 00:22:07.000 --> 00:22:11.000 a lot of our - cleaning up our work for us, streamlining where we need to go; 00:22:11.000 --> 00:22:14.000 because the tree is organized through the 00:22:14.000 --> 00:22:16.000 mid point and the quarter point and then the eighth point 00:22:16.000 --> 00:22:19.000 to narrow down to where we could go. So as 00:22:19.000 --> 00:22:21.000 it is, this code works correctly. 00:22:21.000 --> 00:22:25.000 So do you have any questions about how it's thinking about doing its work? 00:22:25.000 --> 00:22:30.000 And then I'm going to show you why the syntax is a little bit wrong though. 00:22:30.000 --> 00:22:32.000 Okay, so let's talk about this 00:22:32.000 --> 00:22:34.000 return type coming out of 00:22:34.000 --> 00:22:36.000 the function. 00:22:36.000 --> 00:22:38.000 So let's go back to this for a second. 00:22:38.000 --> 00:22:41.000 That node is defined 00:22:41.000 --> 00:22:43.000 as a private member 00:22:43.000 --> 00:22:46.000 of the map class. 00:22:46.000 --> 00:22:48.000 So node is actually not a global type. 00:22:48.000 --> 00:22:53.000 It does not exist outside the map class. Its real name is 00:22:53.000 --> 00:22:56.000 00:22:56.000 --> 00:22:58.000 map 00:22:58.000 --> 00:23:00.000 val type:: 00:23:00.000 --> 00:23:03.000 node. 00:23:03.000 --> 00:23:05.000 That's its full name, 00:23:05.000 --> 00:23:08.000 is that. Now, 00:23:08.000 --> 00:23:08.000 inside 00:23:08.000 --> 00:23:10.000 the implementation of map, 00:23:10.000 --> 00:23:14.000 which is to say inside the class map, or inside member functions that we're 00:23:14.000 --> 00:23:17.000 defining later, you'll not that we can just kind of drop 00:23:17.000 --> 00:23:21.000 the big full name. The kind of my name is mister such and such, and such and 00:23:21.000 --> 00:23:26.000 such; it's like okay, you can just call it node, the short name internal to the class 00:23:26.000 --> 00:23:28.000 because it's inside that scope. And 00:23:28.000 --> 00:23:31.000 it's able to say, okay in this scope is there something called node? Okay there is. Okay, that's the 00:23:31.000 --> 00:23:34.000 one. Outside of that scope, though, if you were trying to use that outside of the 00:23:34.000 --> 00:23:37.000 class or some other places, it's going to 00:23:37.000 --> 00:23:40.000 have a little bit more trouble with this idea of node by itself. It's going to need to see 00:23:40.000 --> 00:23:42.000 that full name. Let 00:23:42.000 --> 00:23:44.000 me show you on this piece of code, 00:23:44.000 --> 00:23:46.000 that node here 00:23:46.000 --> 00:23:51.000 happens to be in the slight quirk of the way sequel plus defines things, that that 00:23:51.000 --> 00:23:56.000 use of node is not considered in class scope yet. 00:23:56.000 --> 00:23:57.000 That 00:23:57.000 --> 00:24:01.000 the compiler leads from left to write, so we'll look at the first one. It says, template 00:24:01.000 --> 00:24:04.000 type name val type, and so what the compiler sees so far is, I'm about to 00:24:04.000 --> 00:24:07.000 see something that's templated on val type. 00:24:07.000 --> 00:24:10.000 So it's a pattern for which val type is a placeholder in. 00:24:10.000 --> 00:24:12.000 And then it sees val type 00:24:12.000 --> 00:24:16.000 as the return value, and it says, okay, that's the placeholder. And then it sees this map :: and 00:24:16.000 --> 00:24:18.000 00:24:18.000 --> 00:24:21.000 it says, okay, this is within the - the thing 00:24:21.000 --> 00:24:26.000 that I'm defining here is within scope of the map class. And so it sees the map 00:24:26.000 --> 00:24:26.000 :: 00:24:26.000 --> 00:24:29.000 and that as soon as it crosses those double colons, 00:24:29.000 --> 00:24:31.000 from that point forward 00:24:31.000 --> 00:24:33.000 it's inside the map scope. 00:24:33.000 --> 00:24:34.000 And so, 00:24:34.000 --> 00:24:37.000 when I use things like tree search, or I use the name node, it knows what I'm 00:24:37.000 --> 00:24:40.000 talking about. It says, oh there's a tree search in map class. There's a node struct in map 00:24:40.000 --> 00:24:43.000 class, it's all good. 00:24:43.000 --> 00:24:47.000 In the case of this one I've seen template type and val type, and 00:24:47.000 --> 00:24:50.000 at this point it doesn't know anything yet. 00:24:50.000 --> 00:24:53.000 And so it sees me using node, and it say, 00:24:53.000 --> 00:24:54.000 node I've 00:24:54.000 --> 00:24:55.000 never heard of node. 00:24:55.000 --> 00:24:58.000 I'm in the middle of making some template based on val type, but I haven't heard 00:24:58.000 --> 00:25:01.000 anything about node. Node isn't something that exists in the global name space. 00:25:01.000 --> 00:25:03.000 You must have made some 00:25:03.000 --> 00:25:04.000 tragic error. 00:25:04.000 --> 00:25:08.000 What it wants you to say is 00:25:08.000 --> 00:25:09.000 to give it the full name 00:25:09.000 --> 00:25:12.000 of map :: node. 00:25:12.000 --> 00:25:14.000 That the real full 00:25:14.000 --> 00:25:16.000 this is my name, 00:25:16.000 --> 00:25:21.000 and otherwise it will be quite confused. 00:25:21.000 --> 00:25:24.000 If that weren't enough, 00:25:24.000 --> 00:25:26.000 we'd like to think that we were done 00:25:26.000 --> 00:25:27.000 with 00:25:27.000 --> 00:25:30.000 placating the C compiler. But it turns out there's one other little 00:25:30.000 --> 00:25:32.000 thing that has to happen here, 00:25:32.000 --> 00:25:34.000 which is there is an obscure 00:25:34.000 --> 00:25:38.000 reason why, even though I've given it the full name, and I feel like I've done my 00:25:38.000 --> 00:25:40.000 duty in describing this thing, 00:25:40.000 --> 00:25:43.000 that the compiler still is a little bit confused 00:25:43.000 --> 00:25:45.000 by this usage. 00:25:45.000 --> 00:25:48.000 And because the C language is just immensely complicated, 00:25:48.000 --> 00:25:50.000 there are a lot of funny interactions, 00:25:50.000 --> 00:25:52.000 there are a couple of situations it can get in to, 00:25:52.000 --> 00:25:53.000 where 00:25:53.000 --> 00:25:56.000 because it's trying hard to read left to right and see the things. 00:25:56.000 --> 00:25:59.000 Sometimes it needs to know some things in advance that it doesn't have all the 00:25:59.000 --> 00:26:00.000 information for. 00:26:00.000 --> 00:26:03.000 And it turns out in this case there's actually some obscure situation where 00:26:03.000 --> 00:26:06.000 it might not realize this is really a type. 00:26:06.000 --> 00:26:08.000 It might think it was 00:26:08.000 --> 00:26:10.000 more of a global constant 00:26:10.000 --> 00:26:14.000 that we actually have to use the type name key word again here 00:26:14.000 --> 00:26:16.000 on this one 00:26:16.000 --> 00:26:17.000 construction. 00:26:17.000 --> 00:26:20.000 It is basically a hint to the compiler 00:26:20.000 --> 00:26:22.000 that says, the next thing coming up 00:26:22.000 --> 00:26:25.000 really is a type name. 00:26:25.000 --> 00:26:28.000 So given that you can't tell that, you'd think it would be obvious; in 00:26:28.000 --> 00:26:31.000 this case it doesn't seem like there's a lot for it to be confused at, but there's 00:26:31.000 --> 00:26:33.000 actually some other constructions that we can't rule out. 00:26:33.000 --> 00:26:37.000 So the compiler is a little bit confused in this situation, and it wants you to 00:26:37.000 --> 00:26:38.000 tell it in advance, 00:26:38.000 --> 00:26:41.000 and so I'll tell you that the basic rules for when you're going to need to use this is 00:26:41.000 --> 00:26:45.000 exactly in this and only this situation. 00:26:45.000 --> 00:26:48.000 And I'll tell you what the things are that are required for type name to 00:26:48.000 --> 00:26:50.000 become part of your 00:26:50.000 --> 00:26:51.000 repertoire. 00:26:51.000 --> 00:26:53.000 You have a return type 00:26:53.000 --> 00:26:54.000 on a member function 00:26:54.000 --> 00:26:58.000 that is defined as part of a class that is a template, 00:26:58.000 --> 00:27:00.000 and that type that you are using 00:27:00.000 --> 00:27:03.000 is defined within the scope of that template. 00:27:03.000 --> 00:27:07.000 So it's trying to use something that's within a template class scope 00:27:07.000 --> 00:27:08.000 outside of that scope 00:27:08.000 --> 00:27:09.000 00:27:09.000 --> 00:27:11.000 that requires this type name. 00:27:11.000 --> 00:27:14.000 And the only place where we see things used out of their scope is on the 00:27:14.000 --> 00:27:15.000 return value. 00:27:15.000 --> 00:27:19.000 So this is kind of the one 00:27:19.000 --> 00:27:21.000 configuration that causes you to have to do this. 00:27:21.000 --> 00:27:22.000 00:27:22.000 --> 00:27:25.000 It's really just, you know, a little bit of a C quirk. 00:27:25.000 --> 00:27:28.000 It doesn't change anything about what the code's doing or anything interesting about binary 00:27:28.000 --> 00:27:31.000 search trees, but it will come up in any situation where you have this helper member 00:27:31.000 --> 00:27:34.000 function that wants to return something that's templated. You 00:27:34.000 --> 00:27:35.000 have 00:27:35.000 --> 00:27:37.000 to placate the compiler there. 00:27:37.000 --> 00:27:39.000 The error message when you don't do it, actually turns out to be very good in 00:27:39.000 --> 00:27:45.000 GCC and very terrible in visual studio. GCC says type name required here, 00:27:45.000 --> 00:27:47.000 and visual studio says something completely 00:27:47.000 --> 00:27:48.000 mystical. 00:27:48.000 --> 00:27:49.000 So it 00:27:49.000 --> 00:27:52.000 is something to be just a little bit careful about when you're trying to write 00:27:52.000 --> 00:27:56.000 that kind of 00:27:56.000 --> 00:27:58.000 code. Question? The asterisk? The asterisk is just because it returns a pointer. 00:27:58.000 --> 00:28:00.000 00:28:00.000 --> 00:28:03.000 In all the situations, I'm always returning a pointer to the node back 00:28:03.000 --> 00:28:04.000 there. 00:28:04.000 --> 00:28:16.000 So here's the pointer to the match I found or node when I didn't find one. All right, that's a little bit of grimy code. Let's keep going. 00:28:16.000 --> 00:28:17.000 So how is it 00:28:17.000 --> 00:28:18.000 that add 00:28:18.000 --> 00:28:22.000 works? Well, it starts like get value, that 00:28:22.000 --> 00:28:23.000 the 00:28:23.000 --> 00:28:27.000 strategy we're going to use for inserting into the tree is the really simple one that say, 00:28:27.000 --> 00:28:30.000 well don't rearrange anything that's there. For example, if I were trying to put 00:28:30.000 --> 00:28:32.000 the number 54 into this tree, 00:28:32.000 --> 00:28:33.000 you could say - well 00:28:33.000 --> 00:28:36.000 you could imagine putting it at the root and moving things around. Imagine 00:28:36.000 --> 00:28:40.000 that your goal is to put it in with a minimal amount of rearrangement. 00:28:40.000 --> 00:28:44.000 So leaving all the existing nodes connected the way they are, 00:28:44.000 --> 00:28:47.000 we're going to follow the binary search path of 00:28:47.000 --> 00:28:49.000 the get value code 00:28:49.000 --> 00:28:51.000 to find the place where 54 would have been 00:28:51.000 --> 00:28:54.000 if we left everything else in tact. 00:28:54.000 --> 00:28:57.000 Looking at 57 you could say, well it has to be to the left of 00:28:57.000 --> 00:29:01.000 57. Looking to the 32 you say, it has to be to the right of 00:29:01.000 --> 00:29:01.000 32. 00:29:01.000 --> 00:29:05.000 Here's where we get some empty tree, so we say, well that's the place where we'll put it. 00:29:05.000 --> 00:29:06.000 We're going to tack it off, kind 00:29:06.000 --> 00:29:12.000 of hang it off one of the empty trees 00:29:12.000 --> 00:29:15.000 along the front tier, the bottom of that tree. 00:29:15.000 --> 00:29:18.000 And so, we'll just walk our way down to where we fall off the tree, and put the 00:29:18.000 --> 00:29:19.000 new node in place there. 00:29:19.000 --> 00:29:22.000 So if I want the node 58 in, I'll say it 00:29:22.000 --> 00:29:25.000 has to go to the right; it has to go to the left; it has to go to the left; it has to go to the left; it 00:29:25.000 --> 00:29:27.000 will go right down there. If 00:29:27.000 --> 00:29:30.000 I want the node 100 in, it has to go right, right, right, right, right. 00:29:30.000 --> 00:29:35.000 That just kind of following the path, leaving all the current pointers and 00:29:35.000 --> 00:29:40.000 nodes in place, kind of finding a new place. And there's exactly one 00:29:40.000 --> 00:29:43.000 that given any number in a current configuration 00:29:43.000 --> 00:29:46.000 and the goal of not rearranging anything, there's exactly one place to put 00:29:46.000 --> 00:29:48.000 that new node in there 00:29:48.000 --> 00:29:49.000 that will 00:29:49.000 --> 00:29:51.000 keep the binary search property in tact. 00:29:51.000 --> 00:29:55.000 So it actually does a lot of decision making for us. 00:29:55.000 --> 00:29:58.000 We don't have to make any hard decisions about where to go. It just all this 00:29:58.000 --> 00:30:05.000 left, right, left, right based on less than and greater than. What if the values are equal? If 00:30:05.000 --> 00:30:09.000 the values are equal you have to have a plan, and it could be 00:30:09.000 --> 00:30:11.000 that you're using all the less than or equal to or greater than 00:30:11.000 --> 00:30:14.000 or equal to. You just have to have a plan. In terms of our map it turns out to 00:30:14.000 --> 00:30:17.000 equal to all the rights. 00:30:17.000 --> 00:30:20.000 So we handle it differently, which we don't every put a duplicate into the 00:30:20.000 --> 00:30:23.000 tree. 00:30:23.000 --> 00:30:26.000 So the add call is going to basically just be a wrapper 00:30:26.000 --> 00:30:29.000 that calls tree enter, starting from the root 00:30:29.000 --> 00:30:31.000 with the key and the value 00:30:31.000 --> 00:30:34.000 that then goes and does that work, walks its way down to the bottom to 00:30:34.000 --> 00:30:40.000 find the right place to stick the new one in. 00:30:40.000 --> 00:30:43.000 Here is 00:30:43.000 --> 00:30:44.000 a 00:30:44.000 --> 00:30:49.000 very dense little piece of code that will put a node into a tree 00:30:49.000 --> 00:30:52.000 using the strategy that we've just talked about of, 00:30:52.000 --> 00:30:55.000 don't rearrange any of the pointers, just walk your way down to where you find an empty 00:30:55.000 --> 00:30:59.000 tree. And once you find that empty tree, replace the empty tree with a singleton node 00:30:59.000 --> 00:31:02.000 with the new key and value that you have. 00:31:02.000 --> 00:31:06.000 So let's not look at the base case first. Let's actually look at the 00:31:06.000 --> 00:31:09.000 recursive cases, because I think that they're a little bit easier to think about. 00:31:09.000 --> 00:31:13.000 Is if we every find a match, so we have a key coming in, and it matches 00:31:13.000 --> 00:31:14.000 the current node, then we just overwrite. 00:31:14.000 --> 00:31:17.000 So that's the way the map handles 00:31:17.000 --> 00:31:18.000 insertion of a duplicate value. 00:31:18.000 --> 00:31:21.000 Now, in some other situation you might actually insert one. But in our case it turns out we 00:31:21.000 --> 00:31:24.000 never want a duplicate. We always want to take the existing value and replace 00:31:24.000 --> 00:31:26.000 it with a new one. 00:31:26.000 --> 00:31:28.000 If it matches, we're done. 00:31:28.000 --> 00:31:29.000 Otherwise, 00:31:29.000 --> 00:31:32.000 we need to decide whether we go left or right. 00:31:32.000 --> 00:31:36.000 Based on the key we're attempting to insert or the current node value, it either 00:31:36.000 --> 00:31:39.000 will go in the left of the subtree, left of the right, in which case we make 00:31:39.000 --> 00:31:44.000 exactly one recursive call either to the left or to the right depending on the 00:31:44.000 --> 00:31:46.000 relationship of key to the current 00:31:46.000 --> 00:31:48.000 key's value. 00:31:48.000 --> 00:31:49.000 And then 00:31:49.000 --> 00:31:52.000 the one base case that everything eventually gets down to, that's going to create a 00:31:52.000 --> 00:31:53.000 new node, 00:31:53.000 --> 00:31:57.000 is when you have an empty tree, 00:31:57.000 --> 00:31:59.000 then that says that you have followed the path that 00:31:59.000 --> 00:32:04.000 has lead off the tree and fallen off the bottom. It says replace that with 00:32:04.000 --> 00:32:05.000 this new singleton node. 00:32:05.000 --> 00:32:07.000 So we make a new node out in the heap. 00:32:07.000 --> 00:32:09.000 We're assigning the key 00:32:09.000 --> 00:32:12.000 based on this. We're assigning the value based on the parameter coming in. And then we set 00:32:12.000 --> 00:32:14.000 its left and right null to do 00:32:14.000 --> 00:32:16.000 00:32:16.000 --> 00:32:20.000 a new leaf node, right, it has left and right null subtrees. 00:32:20.000 --> 00:32:22.000 And the way this got wired in the tree, 00:32:22.000 --> 00:32:26.000 you don't see who is it that's now pointing to this tree, this new 00:32:26.000 --> 00:32:28.000 singleton node that I placed into here? 00:32:28.000 --> 00:32:31.000 You don't typically see that wire that's coming into it. Who 00:32:31.000 --> 00:32:33.000 is pointing to it 00:32:33.000 --> 00:32:34.000 is actually being done 00:32:34.000 --> 00:32:39.000 by T being passed by reference. 00:32:39.000 --> 00:32:42.000 The node right above it, 00:32:42.000 --> 00:32:44.000 the left or the right 00:32:44.000 --> 00:32:48.000 of what's going to become the parent of the new node being entered, 00:32:48.000 --> 00:32:51.000 was passed by reference. It previously was null, 00:32:51.000 --> 00:32:55.000 and it got reassigned to point to this new node. So the kind of wiring in of 00:32:55.000 --> 00:32:58.000 that node is happening by the pass by reference of the pointer, which is a 00:32:58.000 --> 00:33:04.000 tricky thing to kind of watch and get your head around. Let's 00:33:04.000 --> 00:33:06.000 look at the code operating. Do you have a 00:33:06.000 --> 00:33:13.000 question? Yes. [Inaudible]. Can you 00:33:13.000 --> 00:33:14.000 explain 00:33:14.000 --> 00:33:16.000 00:33:16.000 --> 00:33:17.000 that little step? 00:33:17.000 --> 00:33:22.000 Instructor: Well this step is just something in the case of the map, if we every find an existing entity that has that key, and 00:33:22.000 --> 00:33:23.000 we overwrite. 00:33:23.000 --> 00:33:27.000 So in the map you said previously, Bob's phone number is this, and then you install a new 00:33:27.000 --> 00:33:30.000 phone number on Bob. If it finds a map to Bob saying Bob was previously entered, it just 00:33:30.000 --> 00:33:34.000 replaces his value. So no nodes are created, no pointers are rearranged; 00:33:34.000 --> 00:33:38.000 it just commandeers that node and changes its value. 00:33:38.000 --> 00:33:43.000 That's the behaviors specified by map. 00:33:43.000 --> 00:33:45.000 So if we have this 00:33:45.000 --> 00:33:46.000 right now, we've got 00:33:46.000 --> 00:33:49.000 some handful of fruits that are installed 00:33:49.000 --> 00:33:53.000 in our tree with some value. It's doesn't matter what the value is, so I didn't even bother to fill 00:33:53.000 --> 00:33:57.000 it in to confuse you. So this is what I have, papaya is my root node, and then 00:33:57.000 --> 00:34:01.000 the next level down has banana and peach, and the pear and what not. So 00:34:01.000 --> 00:34:03.000 root is pointing to papaya right now. 00:34:03.000 --> 00:34:06.000 When I make my call to insert a new value, 00:34:06.000 --> 00:34:09.000 let's say that the one that I want to put in the tree now is orange, 00:34:09.000 --> 00:34:13.000 the very first call to tree enter looks like this, 00:34:13.000 --> 00:34:16.000 T is the reference parameter that's coming in, and 00:34:16.000 --> 00:34:20.000 it refers to root. 00:34:20.000 --> 00:34:24.000 So the very first call to tree enter says please insert 00:34:24.000 --> 00:34:27.000 orange and its value 00:34:27.000 --> 00:34:28.000 into the tree 00:34:28.000 --> 00:34:31.000 who is pointed to by 00:34:31.000 --> 00:34:34.000 root. So it actually has that by reference. It's going to need that by 00:34:34.000 --> 00:34:37.000 reference because it's actually planning on updating that 00:34:37.000 --> 00:34:39.000 when it finally installs the new node. 00:34:39.000 --> 00:34:43.000 So the first call to tree enter says, okay well is orange less than or greater than 00:34:43.000 --> 00:34:45.000 papaya. 00:34:45.000 --> 00:34:47.000 Orange precedes papaya 00:34:47.000 --> 00:34:48.000 in the alphabet, 00:34:48.000 --> 00:34:50.000 so okay, well call tree enter 00:34:50.000 --> 00:34:52.000 the second time 00:34:52.000 --> 00:34:53.000 passing 00:34:53.000 --> 00:34:58.000 the left subtree of the current value of tee. And so T 00:34:58.000 --> 00:35:00.000 was a reference to root right then, 00:35:00.000 --> 00:35:04.000 now T becomes a reference to the left subtree coming out of papaya. So it says, now 00:35:04.000 --> 00:35:06.000 okay, for the tree 00:35:06.000 --> 00:35:10.000 that we're talking about here, which is the one that points to the banana and 00:35:10.000 --> 00:35:14.000 below. That's the new root of that subtree is banana, it says, does orange go 00:35:14.000 --> 00:35:15.000 to the left or right 00:35:15.000 --> 00:35:19.000 of the root node 00:35:19.000 --> 00:35:21.000 banana? And it goes to the right. So the third stack frame 00:35:21.000 --> 00:35:24.000 of tree enter now says, well the reference pointer 00:35:24.000 --> 00:35:28.000 is now looking at the right subtree coming out of banana. And it 00:35:28.000 --> 00:35:29.000 says, well does melon go to 00:35:29.000 --> 00:35:31.000 the left or the right of 00:35:31.000 --> 00:35:32.000 that? 00:35:32.000 --> 00:35:34.000 It goes to the right, 00:35:34.000 --> 00:35:36.000 so now on this call, right, 00:35:36.000 --> 00:35:40.000 the current value of T is going to be null. It is a reference, so a synonym for 00:35:40.000 --> 00:35:44.000 what was the right field coming out of the melon node, 00:35:44.000 --> 00:35:47.000 and in this is the case where it says, if orange was in the tree, 00:35:47.000 --> 00:35:50.000 then this is where it would be hanging off of. It would be coming off of the 00:35:50.000 --> 00:35:51.000 right 00:35:51.000 --> 00:35:52.000 00:35:52.000 --> 00:35:53.000 subtree of melon. 00:35:53.000 --> 00:35:55.000 That part is null, 00:35:55.000 --> 00:35:58.000 and so that's our place. That's the place where we're going to take out the 00:35:58.000 --> 00:35:59.000 existing null, 00:35:59.000 --> 00:36:00.000 wipe over it, 00:36:00.000 --> 00:36:04.000 and put in a new pointer to the node orange 00:36:04.000 --> 00:36:06.000 that's out here. 00:36:06.000 --> 00:36:09.000 And so that's how the thing got wired in. It's actually kind of 00:36:09.000 --> 00:36:12.000 subtle to see how that happened, because you don?t ever see a 00:36:12.000 --> 00:36:16.000 direct descendant, like T is left equals this, equals that new node; or T is 00:36:16.000 --> 00:36:17.000 right equals that new 00:36:17.000 --> 00:36:20.000 node. It happened by virtue of passing T right 00:36:20.000 --> 00:36:22.000 by reference into the 00:36:22.000 --> 00:36:26.000 calls further down the stack. That eventually, right when it hit the base 00:36:26.000 --> 00:36:26.000 case, 00:36:26.000 --> 00:36:30.000 it took what was previously a pointer to null and overwrote it with a pointer to this 00:36:30.000 --> 00:36:31.000 new cell, 00:36:31.000 --> 00:36:37.000 that then got attached into the tree. Let's take a 00:36:37.000 --> 00:36:38.000 look at this piece 00:36:38.000 --> 00:36:40.000 of code. 00:36:40.000 --> 00:36:44.000 That by reference there turns out to be really, really important. 00:36:44.000 --> 00:36:46.000 If that 00:36:46.000 --> 00:36:49.000 T was just being passed as an ordinary node star, 00:36:49.000 --> 00:36:52.000 nodes would just get lost, dropped on the floor like crazy. 00:36:52.000 --> 00:36:56.000 For example, consider the very first time when you're passing in root and root 00:36:56.000 --> 00:36:58.000 is empty. If root is null, 00:36:58.000 --> 00:36:59.000 it would say, well if root is null 00:36:59.000 --> 00:37:02.000 and then it would change this local parameter T; T is some new node of these 00:37:02.000 --> 00:37:06.000 things. But if it really wasn't making permanent changes to the root, 00:37:06.000 --> 00:37:08.000 really wasn't making root pointer of that new node, root 00:37:08.000 --> 00:37:12.000 would continue pointing to null, and that node would just get orphaned. And 00:37:12.000 --> 00:37:14.000 every subsequent one, the same thing would happen. 00:37:14.000 --> 00:37:17.000 So in order to have this be a persistent change, 00:37:17.000 --> 00:37:19.000 the pointer that's coming in, 00:37:19.000 --> 00:37:22.000 either the root in the first case or T left or T 00:37:22.000 --> 00:37:23.000 right 00:37:23.000 --> 00:37:25.000 in subsequent calls, 00:37:25.000 --> 00:37:30.000 needed to have the ability to be permanently modified when we attach that new 00:37:30.000 --> 00:37:33.000 subtree. 00:37:33.000 --> 00:37:37.000 That's really very tricky code, so I would be happy to 00:37:37.000 --> 00:37:39.000 try to answer 00:37:39.000 --> 00:37:41.000 questions or go through something to try to get your head around what 00:37:41.000 --> 00:37:44.000 it's supposed to be doing. If you could ask a question I 00:37:44.000 --> 00:37:47.000 could 00:37:47.000 --> 00:37:49.000 help you. Or you could just nod your head and say, I'm just totally fine. 00:37:49.000 --> 00:37:52.000 Could you use the tree search algorithm that you programmed before? 00:37:52.000 --> 00:37:53.000 They're 00:37:53.000 --> 00:37:58.000 very close, right, and 00:37:58.000 --> 00:37:59.000 where did I put it? 00:37:59.000 --> 00:38:02.000 It's on this slide. 00:38:02.000 --> 00:38:05.000 The problem with this one right now is it does really stop and return the 00:38:05.000 --> 00:38:07.000 null. So I could actually unify them, but it would take a little more 00:38:07.000 --> 00:38:11.000 work, and at some point you say, wow I could unify this 00:38:11.000 --> 00:38:15.000 at the expense of making this other piece of code a little more complicated. So in this case I chose 00:38:15.000 --> 00:38:18.000 to keep them separate. It's not impossible to unify them, but you end up 00:38:18.000 --> 00:38:19.000 having to point to 00:38:19.000 --> 00:38:22.000 where the node is or a null where you would insert a new node. Tree 00:38:22.000 --> 00:38:26.000 search doesn't care about, but tree enter does. It sort of would 00:38:26.000 --> 00:38:29.000 modify it to fit the needs of both at the expense of making it a little more awkward 00:38:29.000 --> 00:38:38.000 on both sides. Question here. 00:38:38.000 --> 00:38:43.000 [Inaudible] the number example. 00:38:43.000 --> 00:38:47.000 If you were inserting 54 where you said, if you called print, 00:38:47.000 --> 00:38:49.000 wouldn't it print 00:38:49.000 --> 00:38:51.000 out of order? So 00:38:51.000 --> 00:38:56.000 54 would go left of 57, right of 32, so it'd go over here. 00:38:56.000 --> 00:38:59.000 But the way in order traversals work is says, 00:38:59.000 --> 00:39:03.000 print everything in my left subtree, print everything in my left subtree, left subtree. So it would kind of print seven 00:39:03.000 --> 00:39:07.000 as it came back five, 32, and then it would print 54 before it got to the 00:39:07.000 --> 00:39:08.000 57. So 00:39:08.000 --> 00:39:12.000 an in order traversal of a binary search tree, as long as the binary search tree 00:39:12.000 --> 00:39:16.000 is properly maintained, will print them in sorted order. 00:39:16.000 --> 00:39:19.000 It's almost like, don't even think too hard about it, if 00:39:19.000 --> 00:39:22.000 everything in my left subtree is smaller, and everything in my right subtree is greater than me, 00:39:22.000 --> 00:39:26.000 if I print all of those I have to come out in sorted order. 00:39:26.000 --> 00:39:30.000 If the property is correctly maintained in the tree, then that 00:39:30.000 --> 00:39:31.000 by definition 00:39:31.000 --> 00:39:34.000 is a sorted way to pass through all the numbers. So if 00:39:34.000 --> 00:39:38.000 at any point an in order traversal is hitting numbers out of order, it would mean that the tree wasn't 00:39:38.000 --> 00:39:40.000 properly 00:39:40.000 --> 00:39:51.000 organized. 00:39:51.000 --> 00:39:55.000 So let's talk about how that works, there are a 00:39:55.000 --> 00:39:59.000 couple of other operations on the tree, things like doing the size, 00:39:59.000 --> 00:40:01.000 which would count the number of members. Which you could do by just doing a traversal, or you 00:40:01.000 --> 00:40:04.000 might do by just caching a number 00:40:04.000 --> 00:40:07.000 that you updated as you added and moved nodes. And 00:40:07.000 --> 00:40:10.000 the contains keys, that's basically just a tree search 00:40:10.000 --> 00:40:14.000 that determines whether it found a node or not. 00:40:14.000 --> 00:40:17.000 The rest of the implementation is not interest. There's also deleting, but I 00:40:17.000 --> 00:40:18.000 didn't show you 00:40:18.000 --> 00:40:23.000 that. The space use, how does this relate to the other known 00:40:23.000 --> 00:40:24.000 implementations that we have for math? 00:40:24.000 --> 00:40:27.000 It adds certainly some space overhead. We're 00:40:27.000 --> 00:40:31.000 going to have two pointers, a left and a right off every entry. So every entry is 00:40:31.000 --> 00:40:35.000 guaranteed to have an excess 8-byte capacity 00:40:35.000 --> 00:40:37.000 relative to the data being stored. 00:40:37.000 --> 00:40:40.000 It does actually add those tightly to the allocation. So like a link 00:40:40.000 --> 00:40:43.000 list, it allocates nodes on a per 00:40:43.000 --> 00:40:46.000 entry basis, so there's actually not a lot of 00:40:46.000 --> 00:40:49.000 reserve capacity that's waiting around unused the way it might been in a 00:40:49.000 --> 00:40:51.000 vector situation. 00:40:51.000 --> 00:40:53.000 And the performance of it is that it's based on the 00:40:53.000 --> 00:40:55.000 height of the tree, 00:40:55.000 --> 00:41:00.000 that both the add and the get value are doing a traversal starting from the root and 00:41:00.000 --> 00:41:03.000 working their way down to the bottom of the tree, either finding what it wanted 00:41:03.000 --> 00:41:05.000 along the way 00:41:05.000 --> 00:41:07.000 or falling off the bottom of the tree 00:41:07.000 --> 00:41:11.000 when it wasn't found or when it needs to be inserting a new node. 00:41:11.000 --> 00:41:15.000 That height of the tree, you're expecting it to be logarithmic. And so 00:41:15.000 --> 00:41:17.000 if I look at, for example, the values, I've got 00:41:17.000 --> 00:41:22.000 seven values here, two, eight, 14, 15, 20, and 21. 00:41:22.000 --> 00:41:23.000 Those seven, 00:41:23.000 --> 00:41:27.000 eight values can be inserted in a tree to get very 00:41:27.000 --> 00:41:28.000 different results. 00:41:28.000 --> 00:41:30.000 There's not one binary search tree 00:41:30.000 --> 00:41:33.000 that represents those values 00:41:33.000 --> 00:41:37.000 uniquely. Depending on what order you manage to insert them you'll have 00:41:37.000 --> 00:41:38.000 different things. 00:41:38.000 --> 00:41:39.000 For example, 00:41:39.000 --> 00:41:41.000 given the way we're doing it, 00:41:41.000 --> 00:41:44.000 whatever node is inserted first will be the root, and we never move the 00:41:44.000 --> 00:41:48.000 root. We leave the root along. So if you look at this as a historical 00:41:48.000 --> 00:41:50.000 artifact and say, well they entered a bunch of numbers, 00:41:50.000 --> 00:41:52.000 and this is the tree I got. 00:41:52.000 --> 00:41:55.000 I can tell you for sure that 15 was the very first number that they entered. 00:41:55.000 --> 00:41:57.000 There's no other possibility, 15 got entered first. 00:41:57.000 --> 00:42:01.000 The next number that got inserted was one of eight or 20, I don't know 00:42:01.000 --> 00:42:03.000 which. 00:42:03.000 --> 00:42:06.000 I can tell you, for example, that eight got inserted before 14. In terms 00:42:06.000 --> 00:42:09.000 of a level-by-level arrangement, the ones on level one 00:42:09.000 --> 00:42:13.000 were always inserted before the ones on level two. 00:42:13.000 --> 00:42:16.000 But one possibility I said here was well maybe 15 went in first, 00:42:16.000 --> 00:42:18.000 then eight, and then two, 00:42:18.000 --> 00:42:21.000 and then 20, and then 21, and then 14, and then 18. There 00:42:21.000 --> 00:42:25.000 are some other rearrangements that are also possible that would create the exact same 00:42:25.000 --> 00:42:26.000 tree. 00:42:26.000 --> 00:42:30.000 And there are other variations that would produce a different, but still valid, 00:42:30.000 --> 00:42:32.000 binary search tree. 00:42:32.000 --> 00:42:33.000 So with these seven nodes, right, 00:42:33.000 --> 00:42:36.000 this tree is height three, 00:42:36.000 --> 00:42:39.000 and is what's considered perfectly balanced. Half the nodes are on the 00:42:39.000 --> 00:42:42.000 left and the right of each tree all the way through it, 00:42:42.000 --> 00:42:45.000 leading to that just beautifully balanced 00:42:45.000 --> 00:42:46.000 00:42:46.000 --> 00:42:47.000 result we're hoping for 00:42:47.000 --> 00:42:52.000 where both our operations, both add and get value, will be logarithmic. So 00:42:52.000 --> 00:42:54.000 if I have 00:42:54.000 --> 00:42:56.000 1,000 00:42:56.000 --> 00:42:59.000 nodes inserted, that height tree should be about ten. And if everything kind of went 00:42:59.000 --> 00:43:03.000 well and we got kind of went well, and we got an even split all the way around, then it would take ten 00:43:03.000 --> 00:43:06.000 comparisons to find something in that tree or determine it wasn't there' and 00:43:06.000 --> 00:43:07.000 also to insert, 00:43:07.000 --> 00:43:10.000 which was the thing we were having trouble with at the vector case, that 00:43:10.000 --> 00:43:13.000 because that search leads us right to the place where it needs to go, 00:43:13.000 --> 00:43:16.000 and the inserting because of the flexibility of the pointer base structure is 00:43:16.000 --> 00:43:17.000 very easy, 00:43:17.000 --> 00:43:21.000 we can make both that insert and add operate logarithmically. 00:43:21.000 --> 00:43:24.000 Let's think about some cases that aren't so good. 00:43:24.000 --> 00:43:28.000 I take those same seven numbers and I insert them 00:43:28.000 --> 00:43:32.000 not quite so perfectly, and I get a not quite so even result. 00:43:32.000 --> 00:43:36.000 So on this side I can tell you that 20 was the first one inserted. 00:43:36.000 --> 00:43:38.000 It was the root and the root doesn't move, 00:43:38.000 --> 00:43:41.000 but then subsequent ones was inserted an eight, and then a 21, and then an 18, 00:43:41.000 --> 00:43:46.000 and then a 14, and then a 15, and then it went back and got that two, 00:43:46.000 --> 00:43:47.000 the one over there. 00:43:47.000 --> 00:43:50.000 Another possibility for how it got inserted 00:43:50.000 --> 00:43:51.000 00:43:51.000 --> 00:43:53.000 18, 14, 15, and so on. 00:43:53.000 --> 00:43:54.000 These are 00:43:54.000 --> 00:43:58.000 not quite as balanced as the last one, but they're not terribly unbalanced. 00:43:58.000 --> 00:44:00.000 They're like one or two 00:44:00.000 --> 00:44:03.000 more than the perfectly balanced case. 00:44:03.000 --> 00:44:09.000 So at logarithmic plus a little constant still doesn't look 00:44:09.000 --> 00:44:11.000 that bad. Okay, how bad can it really get? 00:44:11.000 --> 00:44:15.000 It can get really bad. 00:44:15.000 --> 00:44:18.000 I can take those seven numbers and insert them in the worst 00:44:18.000 --> 00:44:21.000 case, and produce a completely degenerate tree 00:44:21.000 --> 00:44:24.000 where, for example, if I insert them in sorted order 00:44:24.000 --> 00:44:25.000 or in reverse sorted order, 00:44:25.000 --> 00:44:26.000 00:44:26.000 --> 00:44:28.000 I really have a link list. 00:44:28.000 --> 00:44:32.000 I don't have a tree at all. It goes two, 18, 14, 15, 20, 21, 00:44:32.000 --> 00:44:34.000 that's still a valid binary search tree. 00:44:34.000 --> 00:44:35.000 If you 00:44:35.000 --> 00:44:38.000 print it in order, it still comes out in sorted order. 00:44:38.000 --> 00:44:41.000 It still can be searched using a binary search strategy. 00:44:41.000 --> 00:44:45.000 It's just because the way it's been divided up here, it's not buying 00:44:45.000 --> 00:44:48.000 us a lot. So if I'm looking for the number 22, I'll say, well 00:44:48.000 --> 00:44:52.000 is it to the left or the right of two? It's to the right. Is it to the left or right of 00:44:52.000 --> 00:44:55.000 eight? Oh, it's to the right. Is it to the left or right, you know like all the way down, so finding 00:44:55.000 --> 00:44:57.000 22 looking at every single one 00:44:57.000 --> 00:45:01.000 of the current entries in the tree, before I could conclude, as I fell off the 00:45:01.000 --> 00:45:04.000 bottom, that it wasn't there. 00:45:04.000 --> 00:45:06.000 Similarly kind of over there; 00:45:06.000 --> 00:45:09.000 it didn't buy us a lot. It has a lot to do with 00:45:09.000 --> 00:45:13.000 how we inserted it as to how good a balance we're going to get. 00:45:13.000 --> 00:45:16.000 And there are some cases that will produce just drastically unbalanced 00:45:16.000 --> 00:45:17.000 00:45:17.000 --> 00:45:21.000 things that require linear searches, totally linear to walk that way 00:45:21.000 --> 00:45:22.000 down 00:45:22.000 --> 00:45:25.000 to find something. 00:45:25.000 --> 00:45:26.000 There's a couple of other 00:45:26.000 --> 00:45:29.000 degenerate trees, just to mention it's not only the sorted or reverse sorted. 00:45:29.000 --> 00:45:33.000 There's also these other kind of saw tooth inputs, they're called, 00:45:33.000 --> 00:45:35.000 where you just happen to at any given point to 00:45:35.000 --> 00:45:39.000 have inserted the max or the min of the elements that remain, so it 00:45:39.000 --> 00:45:42.000 keeps kind of dividing the input into one and 00:45:42.000 --> 00:45:44.000 N-1. So having the 00:45:44.000 --> 00:45:47.000 largest and the smallest, and the largest and the smallest, and then 00:45:47.000 --> 00:45:48.000 subsequent smallest down there 00:45:48.000 --> 00:45:52.000 can also produce a linear line. It's not quite all just down the left or 00:45:52.000 --> 00:45:55.000 down the right, but it does mean that every single node has an empty 00:45:55.000 --> 00:45:59.000 left or right subtree is what's considered the degenerate case here. And 00:45:59.000 --> 00:46:01.000 both have the same property where 00:46:01.000 --> 00:46:04.000 half of the pointers off 00:46:04.000 --> 00:46:08.000 of each cell are going nowhere. 00:46:08.000 --> 00:46:10.000 There is an interesting relationship here 00:46:10.000 --> 00:46:12.000 between these worst-case inputs for transversion and quick sort, and they're 00:46:12.000 --> 00:46:14.000 actually kind of exactly one to one. 00:46:14.000 --> 00:46:18.000 That what made quick sort deteriorate was the element of the input, 00:46:18.000 --> 00:46:21.000 the max or the min of what remain. And that's exactly the same case 00:46:21.000 --> 00:46:25.000 for transversion, max or min of what remains means that you are not dividing your 00:46:25.000 --> 00:46:29.000 input into two partitions of roughly equal size. You're dividing them into the 00:46:29.000 --> 00:46:33.000 one, N-1, which is getting you nowhere. What do you 00:46:33.000 --> 00:46:37.000 get to do about it? 00:46:37.000 --> 00:46:40.000 You have to decide first of all if it's a problem worth solving. 00:46:40.000 --> 00:46:41.000 00:46:41.000 --> 00:46:44.000 So that's what the first one say. Sometimes you just say that 00:46:44.000 --> 00:46:46.000 it will sometimes happen in some 00:46:46.000 --> 00:46:47.000 inputs 00:46:47.000 --> 00:46:50.000 that I consider rare enough that it will degenerate. I don't think they're 00:46:50.000 --> 00:46:52.000 important enough or common enough to make a big deal out of. 00:46:52.000 --> 00:46:56.000 I don't think that applies here, because it turns out that getting your data in sorted order is 00:46:56.000 --> 00:46:59.000 probably not unlikely. That somebody might be refilling a map 00:46:59.000 --> 00:47:02.000 with data they read from a file that they wrote out in sorted order last time. 00:47:02.000 --> 00:47:05.000 So saying that if it is coming in sorted, 00:47:05.000 --> 00:47:09.000 then I'm probably going to tolerate that, is probably not going to work. 00:47:09.000 --> 00:47:12.000 And there's two main strategies then, if you've agreed to take action about it, 00:47:12.000 --> 00:47:14.000 about what you're going to do. 00:47:14.000 --> 00:47:15.000 00:47:15.000 --> 00:47:19.000 The first one is to wait until 00:47:19.000 --> 00:47:21.000 the problem gets bad enough 00:47:21.000 --> 00:47:23.000 that you decide to do something about it. 00:47:23.000 --> 00:47:28.000 So you don't do any rebalancing as you go. You just let stuff go, left, right, left, 00:47:28.000 --> 00:47:30.000 right, whatever's happening. And 00:47:30.000 --> 00:47:31.000 00:47:31.000 --> 00:47:34.000 you try to keep track of the height of your tree, 00:47:34.000 --> 00:47:37.000 and you realize when the height of your tree seems sufficiently out of whack with 00:47:37.000 --> 00:47:39.000 your expected logarithmic height, 00:47:39.000 --> 00:47:42.000 to do something about it. Then you just fix the whole tree. 00:47:42.000 --> 00:47:45.000 You can clean the whole thing up. Pull them all out into a vector, rearrange them, 00:47:45.000 --> 00:47:49.000 put them back in, so the middlemost node is the root all the way down. And kind of 00:47:49.000 --> 00:47:50.000 create the perfectly balanced tree, 00:47:50.000 --> 00:47:52.000 and then keep going 00:47:52.000 --> 00:47:54.000 and wait for it to get out of whack, and fix it again; 00:47:54.000 --> 00:47:56.000 kind 00:47:56.000 --> 00:47:58.000 of wait and see. 00:47:58.000 --> 00:48:01.000 The other strategy, and this happens to be a more common one in 00:48:01.000 --> 00:48:06.000 practice, is you just constantly do a little bit of fixing up all the time. 00:48:06.000 --> 00:48:09.000 Whenever you notice that the tree is looking just a little bit lopsided, you let 00:48:09.000 --> 00:48:12.000 it get a little bit; usually you'll let it be out of balance by one. 00:48:12.000 --> 00:48:15.000 So you might have height four on that side and height three on that side. 00:48:15.000 --> 00:48:17.000 But as soon as it starts to get to be three and five, 00:48:17.000 --> 00:48:20.000 you do a little shuffle to bring them back to both four. And it 00:48:20.000 --> 00:48:21.000 involves just a little bit of work 00:48:21.000 --> 00:48:23.000 all the time 00:48:23.000 --> 00:48:25.000 to where you just don't let 00:48:25.000 --> 00:48:28.000 the tree turn into a real mess. 00:48:28.000 --> 00:48:31.000 But there's a little bit of extra fixing up being done all the time. 00:48:31.000 --> 00:48:35.000 And so the particular kind of tree that's discussed in the text, which you can read about, 00:48:35.000 --> 00:48:37.000 but we won't cover 00:48:37.000 --> 00:48:40.000 as part of our core material, is called the ADL tree. It's interesting to 00:48:40.000 --> 00:48:43.000 read about it to see what does it take, what kind of 00:48:43.000 --> 00:48:47.000 process is required to monitor this and do something about it. And 00:48:47.000 --> 00:48:50.000 so if we have that in place, 00:48:50.000 --> 00:48:51.000 00:48:51.000 --> 00:48:53.000 we can get it to where 00:48:53.000 --> 00:48:57.000 the binary search tree part of this is logarithmic on both 00:48:57.000 --> 00:48:59.000 inserting new values and searching for values, 00:48:59.000 --> 00:49:03.000 which is that Holy Grail of efficiency boundary. You can say, yeah logarithmic is 00:49:03.000 --> 00:49:04.000 something we can tolerate 00:49:04.000 --> 00:49:08.000 by virtue of adding 8 bytes of pointer, potentially some balance factor, 00:49:08.000 --> 00:49:12.000 and some fanciness to guarantee that performance all over, 00:49:12.000 --> 00:49:15.000 but gets us somewhere that will scale very, very well for thousands 00:49:15.000 --> 00:49:18.000 and millions of entries, right, logarithmic is still a very, very fast 00:49:18.000 --> 00:49:20.000 function. So it 00:49:20.000 --> 00:49:24.000 creates a map that could really actually go the distance in scale. So, that's 00:49:24.000 --> 00:49:28.000 your talk about binary search trees. We'll come back and talk about 00:49:28.000 --> 00:49:38.000 hashing in graphs and stuff. I will see you on Wednesday.