This presentation is delivered by the Stanford Center for Professional Development.

What are we doing today? We're going to talk about hashing. Hashing is one of the coolest things you're ever going to learn in 106B, so it's a good day to be here and learn a really neat, clever, inventive idea for how you do search and retrieval in a very efficient way.

Let me put us back to where we were a lecture ago, two lectures back to Monday, where we talked about ways we might implement the map abstraction, the key-value pairing, to allow for quick search and retrieval. We looked at the vector and did a little thought experiment about what sorting the vector would buy us: it would let us search faster, because we could use binary search. But in the case of add we'd still be slow, because the elements would need to be shuffled or rearranged within that contiguous memory. That's what motivated our introduction to trees, the binary search tree, which is designed exactly to enable that kind of searching, the logarithmic divide-and-conquer, divide in half and work your way down. And because of the flexibility of the pointer-based structure, we can also do the insert in the same logarithmic time, because we're no longer constrained by the contiguous memory that the array-based data structures force on us.

So at that point we've seen logarithmic for both getValue and add, and on the PQ, which you were just finishing, you'll notice that logarithmic is pretty fast. In fact, for most values of n that you'd run into in ordinary use, logarithmic is unmeasurable. Those of you who got your heap up and running and correctly working in logarithmic time discovered that you can put 10,000, 20,000, 50,000, 100,000 things in it and never be able to measure what it took to enqueue or dequeue from those structures, because logarithmic is very, very fast.
That said, this would be a fine place to stop: we have a great map implementation, we don't have to look further. But in fact I'm going to push it a little bit today, and we're going to get it down to where we can do both of those operations in constant time, which is to say that no matter how large the table gets, we expect to be able to insert and retrieve in the same time for a 10,000-entry table as for a million-entry table, which is pretty neat.

As we start to get to these more complicated data structures, one thing we need to keep in mind is that there are trade-offs in adding that complexity to the code. Starting with something that's simple and a little slower-performing, but easy to write, might actually solve the problem we have at hand. Maybe we don't really want to go to one of these fancier things because we don't really need it to be that much faster. The BST adds a bunch of memory: left and right pointers for every cell, an eight-byte overhead on top of the entry itself. Now we have to deal with pointers, dynamic memory, all this allocation and deallocation; we've got a lot of recursion going on in there, and so there are lots of opportunities for errors to creep in. And if we take the steps to provide tree balancing, which we only grazed the surface of (we just mentioned the need for it), that adds quite a lot of complication to the code to do the rotations or the rebalancing on top of a simple binary search tree. So the whole package of building a guaranteed log-n binary search tree is actually a pretty big investment of human capital that you have to balance against the future payoff of it running faster. I'm going to show you a strategy today that is actually kind of simple to write, gets the job done, and doesn't have the degenerate case looming the way the binary search tree did.

So let me give you a conceptual model to think about how this works. You have a big dictionary, a many-thousand-page dictionary, and you want to look up a word. You certainly don't do linear search.
You're about to look up the word xylophone, and you're not going to start from the A's and page your way through 2,000 pages to get to the X's. You don't even do a binary search the way we've described it, starting at the M's to get to something in the X's. You tend to know about where X is, and often dictionaries have little tabs along the pages that give you an idea of where the breaks for certain words are; maybe it says Xa is here and Xe is there, and that lets you quickly eliminate all the surrounding noise in the dictionary and get straight to the region where the X words are. Then you can do a simple search from there, maybe a little bit of a linear or binary search, to zero in on the page where the word you want to find is.

This is the idea behind the hash table, based on the algorithmic technique of hashing. Rather than trying to figure out how to keep this very large collection sorted or organized in a way that lets you jump around within it, it says: how about we maintain a bunch of different collections? We break the data set up into little tiny pieces, and if we can tell, given a key, which piece of the data set it's supposed to be in, we can go straight there. We'll call those pieces buckets. If I had a whole bunch of students, I might put all the freshmen in this bucket and the sophomores in that bucket, and then when I look for a student I'd ask, "Which class are you?" You say, "Freshman." I look in the freshman bucket.

The idea in a simple case is that I have some criteria by which I can divide you into buckets that don't overlap, so everyone has a unique bucket to go in. And if I can make up enough of those categorizations, have enough buckets around to divide you up into, then I can hopefully keep the set that lands in each bucket pretty small. So maybe I do it by the first letter of your last name.
I could say everybody whose last name begins with W goes here, everybody whose name begins with B goes here, and so on. Then when somebody comes to me, I only have to look through one letter's worth. If you have a class of 200 students and 26 letters of the alphabet, then if they were evenly divided across those letters (they're not going to be, it turns out in practice, but as a thought experiment) there'd be about 4 people in a bucket, and it'd be pretty fast to look through a bucket and find the one I was looking for.

So let's take this idea. There's going to be some strategy for how we divide things into buckets, and we call that strategy the hash function. We have some number of buckets, and we'll label them with numbers, zero to the number of buckets minus one; if I have 99 buckets, I've got zero to 98. I take the key, in this case the string key we're using in the map, and I have a hash function which, given a key as input, produces a number as output in the range zero to B minus one. That number is called the hash value for that key, and it tells me which bucket to look in, either to find a match for a getValue, or to decide where to place a new entry for an add into the table.

So that's the basic idea. Let me talk a little bit about hash functions. Writing a really good hash function is actually pretty complicated, so we're going to think about it from a big-picture perspective and ask what qualities a hash function has to have. I'm not going to prove to you what mathematically constitutes a good hash function; I'm just going to give you some intuitive sense of what kinds of things matter when designing one.

So it's given a key and maps it to a number, typically zero to the number of buckets minus one. Often that means it computes something and then uses the mod operation to bring the result back into range. So it's going to take your number and say:
if you have five buckets to divide into, I'll give you a number zero to four.

It's very important that the function be stable: you put the key in, you get a number out, and if you put that same key in later, you get the same number out. It has to be a reliable, guaranteed mapping; it can't be random. That said, you want it to almost look as though it's random in terms of how it maps things to buckets. You don't want it to have a systematic plan where it puts a whole bunch of things in bucket one and very few in bucket two, or somehow always uses the even buckets and never the odd ones. You want a strategy that takes the keys and uses the whole bucket range evenly, so that if you put 100 keys into 100 buckets, you're hoping that, on average, each bucket holds one thing, not clusters where 25 things are in one bucket, ten in another, and a whole bunch of buckets are empty. You want it divided across the range.

Given that we have string input, likely what we're going to use is the characters in the string. How can I take these characters and jumble them up in a way that spreads them around the buckets, reliably, so the same string always comes back to the same place, but that doesn't produce a lot of artifacts, such as similar strings all mapping to the same bucket?

For example, here are some really simple things you could use. You could take just the first letter. I talked about this with last names: I could divide my students by the first letter of their last name. That would use exactly 26 buckets, no more, no less, and if I happened to have 26 buckets, then at least there's some chance of coverage across the range. There is likely to be clustering, since some letters are a lot more commonly used than others, so it would produce a bit of an artifact, but it would be reliable and it would use all 26 buckets.
If I had 100 buckets, this would be a crummy function, because it would use the first 26 buckets and none of the others if it's just taking that one letter and deciding. So in some situations this gets me a little of what I wanted.

What about using the length of the word? That's a very, very bad clustering function, because there'd be a bunch of strings that are six, seven, ten characters long, but very few that are one or two, and very few longer than ten or so. You'd have a very tight cluster in the middle, and it would not use a large bucket range effectively at all.

More likely, you want to use more of the information you have about the key. Instead of a single letter or the count of letters, try to use all the data at hand, which is to say all the letters. One way to do it is to take the ASCII values of the letters, so 'a' is 97, 'b' is 98, and for a word like "atoms" you'd add 97 plus 116 plus 111 and so on, add them all together, and you'd get some sum, 500, 600, whatever it comes to depending on how many letters there are. Then you use mod to bring it back into range. Say you had 100 buckets, so you want zero to 99: you take that sum and mod it by 100, which keeps the last two digits. So if the sum came to 524, it goes into bucket 24.

So if "act" is the word you have, you'd add these codes together, 97 for the 'a', 99 for the 'c', a hundred-something for the 't', and we end up with (I'm making this number up) 283; then we mod to keep the last two digits and say this one goes into bucket 83. And if we have "dog", maybe that adds to 267, and we mod that down to 67.
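As a sketch, that add-the-codes strategy looks something like this in C++ (sumHash is a name I've made up for illustration; it is not the function the lecture's map ends up using):

```cpp
#include <string>

// Naive hash: sum the ASCII codes of the characters, then use mod to
// bring the sum back into the bucket range [0, nBuckets).
int sumHash(const std::string &key, int nBuckets) {
    int sum = 0;
    for (size_t i = 0; i < key.length(); i++) {
        sum += key[i];   // e.g. 'a' contributes 97, 'c' 99, 't' 116
    }
    return sum % nBuckets;
}
```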
So this tries to use all the information at hand, all the letters we have, and bring them down into range. If you figure that across the space of all the sums produced here, you're just as likely to get 210 as 299 or any of the numbers in between, then hopefully you get a spread across your buckets.

There is one thing this clusters on. Does anybody see what words cluster in this system, words you'd think of as unrelated that would land in the same bucket? [Student: rearrangements of the same letters.] Yes, any anagram. Take "cat", which is an anagram of "act": it's going to add to the same sum, so there would be a bit of clustering in that respect. Anagrams come down to the same place. Words like "act" and "cat", which seem like different words to you, are going to collide in this system, and that might be something to worry about.

One way to avoid that, which the hash function we're eventually going to use will do, is to not only add the letters together but also multiply each letter by a power of a very large prime number: the large prime times 'c', plus the large prime squared times 'a', plus the large prime cubed times 't'. The position of each letter is then also encoded in the number that comes out, so something that's positionally different won't land in the same bucket without some other factors coming into play. So it's a little bit of taking the information and jumbling it up.

Let me tell you about collisions, and then I'll show you the data structure. I said there's going to be this problem where things collide: "act" and "cat", in this simple system of adding the codes, come out the same. So it can't be that each of these buckets holds exactly one thing.
With hashing, although we try to keep the buckets as small as possible, we have to account for the possibility that two things will collide: that two different keys, based on how the function jumbled them up, will land in the same bucket. We try to avoid that, but when it does happen we also have to have a strategy for it. We need a way of saying there can be more than one thing in a bucket, or some other way of deciding, when there's a collision, where to put the entry that was placed in the table second.

The strategy we're going to use is called chaining. Chaining says: once we have something in a bucket, like "act", if we hash something else and it comes to the same bucket, we just tack it onto that bucket, in this case using a linked list. So each bucket is actually going to be a linked list of the entries, and the entire table will be a vector or array of those.

So let me show you a little live action. This is a hash table that has seven buckets, numbered 0 through 6. They all start empty, so these are all empty linked lists, illustrated by the null. The hash function is shown here as a black box: you don't know the workings of the hash function, but you know its job, which is that you give it a key and it gives you a number in the range 0 to 6, and that same key, passed in again, gives you the same number.

So if I say 106B equals Julie, it pushes "106B" through the hash function and says, okay, "106B" comes out as one; you add those characters together, jumble them up, and you get one. So it goes into the table and says, that's bucket number one, make a new cell for 106B and Julie and stick it in the table. Now there's a non-empty chain in one place and the rest are empty. If I say 107 is being taught by Jerry, 107 happens to get the same hash value.
107 and 106B are different letters, but it just happened to come out that way: it hashes to one, and so in this case it just tacks it onto the front of that list. If I add 106A being taught by Patrick, "106A" happens to hash to zero; that's all we know about the black box, and it ends up loading it into the zero slot. If I add another course being taught by Nick, it also comes out to be zero, and so it pushes onto the front of that list. I add 103 being taught by Maron and it lands in bucket five, and as I type things in, I'm seeing that entries go to different places in the table. A little bit of clustering at the top, a little bit of empty at the bottom, but as I continued to type in strings for other course numbers, I'd hopefully fill out the table to a roughly even spread.

When I want to look something up, because I want to see who's teaching 106B, it uses the same process in reverse. It asks the hash function, which bucket should I look in? And it says, look in bucket one, because 106B's hash code is one. All the other buckets are eliminated from consideration, a very quick narrowing down to the right place, and then it just does a linked-list traversal to find 106B in the bucket where it has to be.

The key to realizing what makes this so magical is that the choice of the number of buckets is up to me. I can pick the number of buckets to be whatever I want: five, ten, 1,000, a million. If I choose the number of buckets to be roughly the same order of magnitude as the number of entries, say I want to put a thousand things in my table, so I make a thousand buckets to hold them, and my hash function divides things pretty equally across those thousand buckets, then I can expect each of those linked lists to be about length one.
Some will be two, some will be zero, but if it's doing its job right, I have a bunch of little tiny buckets, each of which holds a tiny fraction of the data set, in this case one or two entries, and it's very easy to search through them. If I know I'm going to put a million things in, then I make a million buckets, and to find one out of a million things I run it through the hash function, it gives me a number, say 862,444, I go to that bucket, and the item is either right there or I have to look a little bit to find it, but there's no other place in the table I have to look. The fact that there are a million other entries nearby in those other buckets is completely unimportant to us. The searching and the updating, adding, replacing, and removing all happen at the bucket level, and by choosing the total number of buckets to be roughly equal to the number of entries, I get constant-time access to the tiny subset that matters. That's pretty cool.

Let's write some code; it's good to see the whole thing doing what it needs to do. So I'm going to go back over here. I've got my map, which has add and getValue and not much else, just set up and ready to go. What I'm going to build here is a linked-list cell that holds a string key, a ValType value, and a pointer to the next cell; I plan on having one of these cells for each entry that goes into the table. Then I'm going to have an array of those, and I'm going to make a constant for its size, and then I'll talk a little bit about the consequences of that. I'll make it 99, so I have 99 buckets to start with. That's not going to solve all my problems; it means the table is tuned for maps that expect to hold about a hundred things. So I have an array here, and deciphering this declaration is a little bit complicated.
It says that buckets is the name of the variable, and it is an array of NumBuckets length, a fixed-size, 99-entry array. Each of those entries is a pointer to a cell, ready to be the head pointer of a linked list for us.

I'm going to go over here and work on the constructor first. The first thing to do is make sure those pointers have good values. If I do not do this, they will be junk and there will be badness, so I make sure each of them starts off as an empty list. Correspondingly, in the destructor I would need to delete all the list cells; I've been lazy about doing that, and I'm going to continue to be lazy, just noting that I'd have to iterate through and delete all the cells in each of those lists.

Then for add and getValue, what we need to do in both cases is figure out the appropriate list to operate on: run the hash function on the key, see where it tells us to look, and then search that list, either to find the match for getValue and return the matching value, or in the add case, to find the matching entry to overwrite if it exists and, if not, to create a new cell and tack it onto the front of the list.

So let's write helper functions, because both operations need the same two things. I'm going to put a hash function in here that, given a key and a number of buckets, returns the right bucket to look in, and I'm going to write a findCell function that, given a key and the pointer to a list, finds the cell that matches that key, or returns null if there is no match.
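Here is a minimal sketch of the declarations just described. The names (MyMap, ValType, cellT, NumBuckets, buckets) follow the lecture's wording, but the exact code is my reconstruction, not a copy of the course source:

```cpp
#include <cstddef>
#include <string>

template <typename ValType>
class MyMap {
public:
    MyMap();
    void add(std::string key, ValType value);
    ValType getValue(std::string key);

private:
    static const int NumBuckets = 99;  // tuned for maps of roughly 100 entries

    struct cellT {              // one cell per entry in the table
        std::string key;
        ValType value;
        cellT *next;            // next cell in this bucket's chain
    };

    cellT *buckets[NumBuckets]; // each slot heads one linked list

    int hash(std::string key, int nBuckets);
    cellT *findCell(std::string key, cellT *list);
};

// Constructor: every bucket starts as an empty (null) list. Skipping
// this would leave junk pointers in the array.
template <typename ValType>
MyMap<ValType>::MyMap() {
    for (int i = 0; i < NumBuckets; i++) {
        buckets[i] = NULL;
    }
}
```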
Let me put my hash function in. I have a good hash function that I'll pull out in a minute, but first I'm going to make an interesting choice: I'm going to have my hash function just return zero, which seems pretty surprising. Let me talk about why. When I'm first building this, that's actually an easy way to test that my code is working without being dependent on the behavior of the hash function. At this point it says, we'll just put everything in bucket zero. That's not a good strategy for long-term efficiency, but it is a good way to test that my handling of the buckets, and the searching, and the finding, and the inserting is correct. The performance I should expect to be abysmal: totally linear in the number of elements to add or get them, because they'll all be in one linked list with no quick access to them. Eventually I can change this to divide across the buckets and get a better spread, but the correctness of the hash table isn't affected by this choice. It's an interesting way to think about designing the code.

The other helper down here is findCell which, given the head pointer to a list and the key, goes and finds the match. It's basically: if this current cell's key equals the key, return it, and if we've gone through the whole list without finding it, return null.

This function returns a cellT *, the matching cell within the list, and that provokes the little C++ compilation snafu we encountered on the binary search tree as well: cellT is a type declared in the private section, within the scope of MyMap. Outside of that scope, which is to say before the compiler has crossed the MyMap<ValType>:: part of the definition, it doesn't yet believe we're in MyMap's scope; it hasn't seen that and can't anticipate it, so I have to give the full name with the scope specifier on it. And because this use of a template's nested type outside of its class is a special case for the C++ compiler, the typename keyword also needs to go there.

Okay, so that's the heavy lifting of these two helpers: hashing to the right place, which I've short-changed for now, and searching to find a cell.
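A sketch of those two helpers, with the placeholder hash and the typename-qualified return type the lecture describes (again my reconstruction of the code on screen):

```cpp
// Placeholder hash for testing: everything lands in bucket 0, so the
// table degenerates to one long chain. Correctness is unaffected;
// only performance suffers.
template <typename ValType>
int MyMap<ValType>::hash(std::string key, int nBuckets) {
    return 0;
}

// Walk the chain and return the cell whose key matches, or NULL if
// none does. The return type must be written with the full scope and
// the typename keyword, because cellT is nested inside the template.
template <typename ValType>
typename MyMap<ValType>::cellT *MyMap<ValType>::findCell(std::string key,
                                                         cellT *list) {
    for (cellT *cur = list; cur != NULL; cur = cur->next) {
        if (cur->key == key) return cur;
    }
    return NULL;
}
```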
If I go back up to getValue: I call the hash function on my key and the number of buckets to find out which bucket to work on, then I look for a match by calling findCell with the key on that bucket's list. If match does not equal null, I return the match's value; otherwise we report "no such key found." That's the error case for getValue: if it doesn't find the key, its behavior is to raise an error so the caller knows they've asked for something it can't deal with.

The key point is that the hash is some constant-time operation that jumbles the key up (in this version it just returns zero, but the real one will actually look at the elements of the key), and then it does its traversal down that linked list. Given a correctly working hash function, the entries should be divided up into these little tiny chains, so we find the one we want quickly by looking through a short list. If we find it, we've got it; otherwise we're in the error case.

Add looks almost exactly the same. I start by getting the bucket and doing the findCell. If it did find a match, we overwrite the value. In the second case, where we didn't find it, we make ourselves a new cell and copy in the key and the value. The easiest place to add something to a linked list is the front; no need to make it hard on ourselves. We're not keeping the cells in any particular order whatsoever, they're just in order of entry, so if somebody gives us a new one, we might as well stick it in the front. The new cell's next pointer points at the current front of the bucket's chain, and then we update the bucket pointer to point to this guy.

So I think we're in okay shape here. Let me make sure I like what I did: hashing, finding the match, and if it found an existing cell, that's the overwrite case, we just overwrite its value; otherwise we make a new cell and attach it as the front of the bucket.
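Filling those in, the two public operations might read like this; the lecture's version raises the error through the course library on a missing key, for which I've substituted a standard exception so the sketch stands alone:

```cpp
#include <stdexcept>

// getValue: hash to the right chain, search it, and return the value.
template <typename ValType>
ValType MyMap<ValType>::getValue(std::string key) {
    int whichBucket = hash(key, NumBuckets);
    cellT *match = findCell(key, buckets[whichBucket]);
    if (match != NULL) return match->value;
    throw std::runtime_error("getValue: no such key found");
}

// add: overwrite the value if the key already exists, otherwise
// prepend a new cell to the front of the bucket's chain.
template <typename ValType>
void MyMap<ValType>::add(std::string key, ValType value) {
    int whichBucket = hash(key, NumBuckets);
    cellT *match = findCell(key, buckets[whichBucket]);
    if (match != NULL) {
        match->value = value;              // existing key: overwrite
    } else {
        cellT *cell = new cellT;           // new key: fresh cell
        cell->key = key;
        cell->value = value;
        cell->next = buckets[whichBucket]; // link in front of the old head
        buckets[whichBucket] = cell;
    }
}
```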
For example, it's good to trace the case where the bucket is totally empty, since that's the most common case when we start. If we hash to bucket zero and search a null linked list, we get null out, so we create a new cell and set its next field to what the bucket's head pointer currently is, which is null. Then we update the bucket to point to this guy, so we have a single list cell with null trailing it. That's inserting into the empty-bucket case. I like it.

Let's see, I have a little code over here that adds some things, car, car, cat, with some numbers, and then tries to retrieve the value of car, which was written with a three and then a four on top of it, so the answer we should get out of this is four. It's also interesting to ask for something we know is not in the table, to see that our error case gets handled correctly: no such key found.

If I continued, I could do a little stress testing to see the performance implications of that return-zero hash. If I sat there and put tons and tons of things in, made up strings like crazy and stuffed them in with numbers, I would just be creating one long chain off the zero bucket, ignoring all the other ones. That stress-tests my linked-list handling of adding and searching, but it would also show that as we get to 100 entries, 200 entries, 300 entries, subsequent adds and getValues show a linear increase in time because of the cluster we've got going there.
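A tiny driver along the lines of that test; the keys and values come from the walkthrough above, and the printing is my own framing:

```cpp
#include <iostream>

int main() {
    MyMap<int> map;
    map.add("car", 3);
    map.add("car", 4);   // second add overwrites the 3
    map.add("cat", 1);
    std::cout << map.getValue("car") << std::endl;   // prints 4
    // map.getValue("dog") would hit the "no such key found" error case
    return 0;
}
```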
So let me give you a real hash function. I'll talk you through what it's doing, and then I'll leave you with the thought that writing one of these correctly and reliably is a more advanced exercise than we're going to take on here. This hash function is taken from Don Knuth's The Art of Computer Programming, work that guides a lot of computer scientists to this day, and it has the strategy I told you about: for each character in the string, it adds the character in, having first multiplied the hash code built up so far by a large multiplier, which in effect encodes the positional information.

The multiplier happens to be a very large negative number that's prime. The first time through, it takes the hash code of zero, multiplies it, and adds the first character, so the first character gets added in without any modification. For the next character, it takes the previous result, multiplies it by one power of the multiplier, and adds the character in; then that sum, on the subsequent iteration, gets multiplied by the multiplier again, so as it works down the string the earlier characters get raised to the squared, the cubed, the fourth powers.

The effect is that it overflows a lot. That's a very large number, and it's adding and multiplying in this very large space, so we're counting on the fact that the addition and multiplication built into C++ don't raise any errors when the numbers get too large to represent; the result just truncates down to what fits, wrapping around in a modular way, like a ring. So we end up making a very big number out of this (the longer the string, the more multiplies and the higher the powers), truncated to what fits in a long, and when we're done, we take whatever number fell out of that process and mod it by the number of buckets to get a number from zero to buckets minus one.

And it's reliable in the sense that every time you put the same key in, you get the same number back.
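A reconstruction of that function as I recall it from the course reader; the multiplier constant is the published one, but treat the details as a sketch. I've used an unsigned accumulator so the wraparound on overflow is well-defined C++ rather than undefined signed overflow. Swapping this in for the return-zero placeholder changes nothing visible except speed:

```cpp
// Knuth-style string hash: fold each character in after multiplying the
// running code by a large prime, so letter position matters and anagrams
// like "act"/"cat" no longer collide. Overflow simply wraps, ring-like.
template <typename ValType>
int MyMap<ValType>::hash(std::string s, int nBuckets) {
    const long MULTIPLIER = -1664117991L;  // large prime; L marks a long literal
    unsigned long hashcode = 0;
    for (size_t i = 0; i < s.length(); i++) {
        hashcode = hashcode * MULTIPLIER + s[i];
    }
    return hashcode % nBuckets;            // bring it into [0, nBuckets)
}
```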
We know the result will be in the range of the number of buckets because of that last mod call, and we use it to decide which bucket to look in when we're ready to do our work. If I switch that in for my old return-zero hash function, I shouldn't see any change from the outside, other than that the performance should suddenly be a lot faster, which is hard to tell when I have three things in there. I'll take out that error case. Now car and cat are probably not going to the same bucket; in fact I can guarantee they don't go in the same bucket, whereas before they were overlapping.

I'll leave that code there for a second, although really I'd say the point of hashing is not to worry about exactly how the hash code does its work, but to understand the general idea of what a hashing function has to do, what its role is, and how it relates to getting this constant-time performance.

So let me give you an example of hashing in the real world; I think it's really interesting to see it happen where people don't even realize how useful hashing is. I ordered something from REI, the camping store, and they didn't have it in stock, so they had to place an order for it. This was a while ago, but when I called to see if it was in, they didn't ask me my name. I thought this was very funny. I'm like, "I'm Julie, I want to know about my order," and they say, "Okay, what are the last two digits of your phone number?" It's not one of those things you can answer right off the top of your head. Okay, 21; it turns out the last two digits of my phone number are 21. So they go and look in the 21 basket, and only then do they ask, "What's your name?"
I'm like, "Okay, my name's Julie." They look it up: "Okay, it's here." So I go to get it a couple of days later, and I'm standing in line waiting, and I look up on the wall, and they have 100 little wire baskets up there, labeled 0 through 9 one way and 0 through 9 the other. When I come in they ask me the same question: "What are the last two digits of your phone number?" Now I'm all prepared, because I've been primed on this question. I say, "21." They walk over to the 21 basket, one digit picking the row and the other the column, and there's only one piece of paper in there; they pull out my order slip, they get me my tent, and everybody's happy.

While I'm there, I try to engage the cashier on how this is an example of a real-world hashing system. I still got my tent, I'll have you know, but it was close; they're like, "Okay, yeah, you crazy crackpot, you can leave now." But it's a very good example of the investment in the number of buckets versus the trade-off it gives you. Because this is a very physical setup, choosing a larger number of buckets gives you more fine-grained access, but the investment in buckets means you're permanently installing them on that wall. They didn't use just the last digit of my phone number; they could have, and they'd have had only ten baskets on the wall, but then they'd expect ten times as many things in each basket. Apparently their estimate was that the number of active orders in this backorder category was significantly more than ten but not much more than 100, so 100 baskets was the right investment in their bucket strategy. And they didn't go to three digits, because what are they going to do, build some big 3D arrangement of a thousand baskets? They didn't enjoy this discussion, by the way; I went on for hours and they were really not amused.
With three digits you'd get an even greater likelihood that each basket would be empty or hold a very small number of things, but given their setup, 100 baskets seemed to be the right strategy. And notice: they didn't ask me for the first two digits of my phone number, they asked for the last two. Why does that matter? [Student: because you're using all (inaudible).] Yeah, the last digits vary a lot more. If you asked Stanford students back when campus numbers all started with 49 or 72 or something, then using the first two digits, everybody would be in the 49 basket or the 72 basket and a whole bunch of other baskets would go unused. For example, the first two digits are never 00; phone numbers never start with zero, and they rarely have zeros or ones in the leading digits at all, since those are reserved for things like area codes. So a whole bunch of baskets would never even get used on the first two digits. I thought it was a really interesting exercise: they had exactly thought through hashing. Of course, they did not think it was hashing, and they thought I was a crackpot, and they didn't want to give me my stuff, but I paid my money and they handed it over. It shows you what you're trying to do here: make the number of buckets roughly equal to the number of elements, so that if your hash function divides things up across your whole world, you've got constant access to what you have.

[Student: Why is there an L at the end of one number?] That's just C++ notation: it's a long value. Without it, the compiler assumes the literal is an int, and that number is too big to fit in an int; a long is twice as big in terms of space, so the L needs to be there, otherwise it would try to read it as an int and throw away part of the information I was trying to give it. It's the way of identifying a long constant.
00:34:35.000 --> 00:34:37.000 So let me talk about this just a little bit more 00:34:37.000 --> 00:34:39.000 in terms of 00:34:39.000 --> 00:34:41.000 00:34:41.000 --> 00:34:43.000 actual performance, right. We've got N 00:34:43.000 --> 00:34:46.000 over B, right, so if the distribution is even, right, the number of entries over 00:34:46.000 --> 00:34:48.000 the number of buckets, 00:34:48.000 --> 00:34:48.000 00:34:48.000 --> 00:34:51.000 if we make them kind of on the same order of 00:34:51.000 --> 00:34:52.000 magnitude, kind of in the same range, 00:34:52.000 --> 00:34:55.000 that ratio stays small; doing this does mean making a little bit of an estimated guess about 00:34:55.000 --> 00:34:58.000 where we're going. 00:34:58.000 --> 00:35:01.000 And then there are some choices here in how we store each bucket, right. We could use 00:35:01.000 --> 00:35:05.000 a little array, we could use a linked list, right, we could use a vector. 00:35:05.000 --> 00:35:07.000 We expect this to be very small, is the truth. 00:35:07.000 --> 00:35:09.000 We're hoping that it's gonna be one or two, 00:35:09.000 --> 00:35:13.000 and so there's not a lot of reason to buy any real expensive structure or even bother 00:35:13.000 --> 00:35:17.000 doing things like sorting. Like, you could sort them, but if you expect there to be two 00:35:17.000 --> 00:35:19.000 things, what's the point? You're gonna spend more time 00:35:19.000 --> 00:35:21.000 figuring out whether to put it in the front or the back 00:35:21.000 --> 00:35:25.000 than you would gain at all by being able to 00:35:25.000 --> 00:35:27.000 find it a little faster. 00:35:27.000 --> 00:35:28.000 So the linked list 00:35:28.000 --> 00:35:32.000 is just gonna be the easiest way to get it allocated without over-capacity, 00:35:32.000 --> 00:35:35.000 excess capacity that's unused in the bucket. You know it's gonna be small; use the 00:35:35.000 --> 00:35:37.000 simple strategy. 00:35:37.000 --> 00:35:41.000 There's also, though, an important thing that the hash table is likely to do in kind of an 00:35:41.000 --> 00:35:44.000 industrial-strength situation: it does have to deal with this idea, well, what if that number of 00:35:44.000 --> 00:35:47.000 buckets that you predicted was wrong? 00:35:47.000 --> 00:35:51.000 And so the map as we have given it to you, right, has to absorb this, because 00:35:51.000 --> 00:35:54.000 it doesn't know in advance how many things you're gonna put in. 00:35:54.000 --> 00:35:57.000 The only way to know that is to wait and see as things get added, 00:35:57.000 --> 00:35:59.000 and then realize your buckets are getting full. 00:35:59.000 --> 00:36:02.000 So typically the industrial-strength version of the hash table is gonna 00:36:02.000 --> 00:36:06.000 track what's called the load factor. And the load factor is just the number of total 00:36:06.000 --> 00:36:08.000 entries divided by the number of buckets. 00:36:08.000 --> 00:36:09.000 When that 00:36:09.000 --> 00:36:13.000 factor gets high, and high is actually quite small, in fact, if 00:36:13.000 --> 00:36:15.000 it's two, that's often the trigger for 00:36:15.000 --> 00:36:19.000 causing a rehash situation, so 00:36:19.000 --> 00:36:22.000 not even letting it get to be three or four.
Just saying, as soon as you realize that 00:36:22.000 --> 00:36:22.000 you have 00:36:22.000 --> 00:36:26.000 exceeded by double the capacity you intended, you planned for 100 things, 00:36:26.000 --> 00:36:27.000 you got 200 things, 00:36:27.000 --> 00:36:28.000 go ahead 00:36:28.000 --> 00:36:31.000 and redo your whole bucket array. 00:36:31.000 --> 00:36:39.000 So in the case, for example, of our simple 0 through 7 case, right, maybe 00:36:39.000 --> 00:36:43.000 it's 0 through 5. One, two, three, 00:36:43.000 --> 00:36:45.000 four. So I have one that had six buckets, right, 00:36:45.000 --> 00:36:48.000 and then I've gotten to where 00:36:48.000 --> 00:36:50.000 each of them has 00:36:50.000 --> 00:36:54.000 two, or maybe three in some and one in another. 00:36:54.000 --> 00:36:57.000 The plan is to just take this whole bucket array and 00:36:57.000 --> 00:37:01.000 enlarge it by a factor of two, so grow it, in this case, from 6 to 12, 00:37:01.000 --> 00:37:02.000 and then rehash 00:37:02.000 --> 00:37:04.000 to move everybody. The 00:37:04.000 --> 00:37:08.000 idea is that the earlier decision about where to place them was based on knowing that 00:37:08.000 --> 00:37:11.000 there were exactly six buckets, and so where it was placed before 00:37:11.000 --> 00:37:15.000 is not likely to be the place it will land if you have, you know, 00:37:15.000 --> 00:37:18.000 12 choices to lay them down into. 00:37:18.000 --> 00:37:21.000 And ideally, in fact, you don't even want that to be the case. Like, if they all landed into the 00:37:21.000 --> 00:37:23.000 same buckets, right, you would have 00:37:23.000 --> 00:37:25.000 the same clustering and then a big empty clump, 00:37:25.000 --> 00:37:28.000 and so you rehash everything, run it through the new hash function, having 00:37:28.000 --> 00:37:31.000 changed the number of buckets, and say, ''Well, now where does it go in the hashing scheme?'' 00:37:31.000 --> 00:37:34.000 And hopefully you'd end up getting a clean table, right, where you had 12 00:37:34.000 --> 00:37:37.000 items now with one in each bucket, 00:37:37.000 --> 00:37:40.000 ready to kinda give constant performance through 00:37:40.000 --> 00:37:42.000 that, and then potentially, 00:37:42.000 --> 00:37:43.000 again if you overload 00:37:43.000 --> 00:37:47.000 your load factor, go back in and rearrange stuff again. 00:37:47.000 --> 00:37:50.000 So it's using a strategy a little bit like the one we talked about with the binary search tree: just 00:37:50.000 --> 00:37:51.000 wait and see. 00:37:51.000 --> 00:37:55.000 Just because you get two items in a bucket, it doesn't immediately cause any 00:37:55.000 --> 00:37:57.000 real alarm or crisis; 00:37:57.000 --> 00:38:00.000 it waits until there's kinda sufficiently clear demand 00:38:00.000 --> 00:38:03.000 that exceeds the capacity planned for 00:38:03.000 --> 00:38:05.000 to do a big rearrangement. 00:38:05.000 --> 00:38:08.000 So you'd end up seeing that adds would be fast, fast, fast, fast, fast.
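To make the load-factor trigger and the expand-and-rehash step just described concrete, here is a minimal sketch of a chained hash table. The names (CharMap, Cell, kMaxLoadFactor) and details are illustrative assumptions, not the lecture's actual Map code:

    #include <string>
    #include <vector>
    using namespace std;

    // Minimal sketch of a chained hash table that tracks its load factor.
    class CharMap {
    public:
        CharMap(int numBuckets = 6) : buckets(numBuckets, nullptr), numEntries(0) {}

        void put(const string& key, int value) {
            int b = hashString(key, buckets.size());
            for (Cell* c = buckets[b]; c != nullptr; c = c->next) {
                if (c->key == key) { c->value = value; return; }   // overwrite duplicates
            }
            buckets[b] = new Cell{key, value, buckets[b]};   // prepend; order within a bucket is irrelevant
            numEntries++;
            if (numEntries > kMaxLoadFactor * (int)buckets.size()) {
                expandAndRehash();                           // load factor passed 2: grow now
            }
        }

    private:
        struct Cell { string key; int value; Cell* next; };
        static const int kMaxLoadFactor = 2;
        vector<Cell*> buckets;
        int numEntries;

        static int hashString(const string& s, int numBuckets) {
            unsigned long code = 0;
            for (size_t i = 0; i < s.length(); i++) code = code * 31 + s[i];
            return (int)(code % numBuckets);
        }

        void expandAndRehash() {
            vector<Cell*> old = buckets;
            buckets.assign(old.size() * 2, nullptr);         // double the bucket array
            for (size_t i = 0; i < old.size(); i++) {
                for (Cell* cur = old[i]; cur != nullptr; ) {
                    Cell* next = cur->next;
                    int b = hashString(cur->key, buckets.size());  // re-run hash with new count
                    cur->next = buckets[b];                        // splice cell into its new bucket
                    buckets[b] = cur;
                    cur = next;
                }
            }
        }
    };

The key step is that the rehash re-runs the hash function with the new bucket count, which is exactly why entries migrate to different homes.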
00:38:08.000 --> 00:38:12.000 Starting to get just a hair slower, right, having to do a search through a few things as opposed 00:38:12.000 --> 00:38:13.000 to one, 00:38:13.000 --> 00:38:16.000 and then every now and then, right, you'd see an add which caused a big reshuffle of 00:38:16.000 --> 00:38:19.000 the table, which would be a totally linear operation in the number of 00:38:19.000 --> 00:38:20.000 elements, 00:38:20.000 --> 00:38:22.000 and then they'd be fast again 00:38:22.000 --> 00:38:24.000 for another big clump of inserts until that 00:38:24.000 --> 00:38:27.000 same expansion might be repeated, triggered by 00:38:27.000 --> 00:38:28.000 it growing even more. 00:38:28.000 --> 00:38:31.000 But dividing it sorta by the total number of inserts, so if you had 1,000 inserts 00:38:31.000 --> 00:38:35.000 before you overloaded, and then you had to copy all 1,000 things to get ready for the 00:38:35.000 --> 00:38:36.000 next thousand, 00:38:36.000 --> 00:38:39.000 if you average that out, it still ends up 00:38:39.000 --> 00:38:43.000 being what's called a constant operation. So 00:38:43.000 --> 00:38:46.000 that's pretty neat. 00:38:46.000 --> 00:38:47.000 It's 00:38:47.000 --> 00:38:50.000 really one of the, sort of, more, I think, 00:38:50.000 --> 00:38:53.000 easily understood and beautiful algorithmic techniques, right, for solving, right, 00:38:53.000 --> 00:38:55.000 this hard problem of 00:38:55.000 --> 00:38:57.000 search and retrieval 00:38:57.000 --> 00:38:58.000 that 00:38:58.000 --> 00:39:01.000 the vector and the sorting and the BST are all trying to get at. But sometimes the 00:39:01.000 --> 00:39:04.000 sorted vector and the BST are solving actually a much harder problem, right, 00:39:04.000 --> 00:39:09.000 which is keeping a total ordering of all the data, and being able to 00:39:09.000 --> 00:39:12.000 traverse it in sorted order is one of the advantages that the sorted vector 00:39:12.000 --> 00:39:12.000 and 00:39:12.000 --> 00:39:14.000 the BST have that the hash table doesn't have at all. 00:39:14.000 --> 00:39:17.000 In fact, it's jumbled it all up, and so if you think about how the iterator for 00:39:17.000 --> 00:39:21.000 the map works, our map is actually implemented using a hash table, 00:39:21.000 --> 00:39:24.000 it actually just iterates over the buckets' linked lists, and that explains why it almost appears 00:39:24.000 --> 00:39:27.000 completely random. If you put a bunch of things in the table and you go to iterate 00:39:27.000 --> 00:39:28.000 and 00:39:28.000 --> 00:39:29.000 visit them again, 00:39:29.000 --> 00:39:33.000 the order you visit the keys seems to make no sense, and that's because it's based on 00:39:33.000 --> 00:39:35.000 their hash codes, right, which 00:39:35.000 --> 00:39:36.000 aren't designed 00:39:36.000 --> 00:39:37.000 to be 00:39:37.000 --> 00:39:39.000 any real sensible ordering 00:39:39.000 --> 00:39:42.000 to anybody not in the know of how the hash function works, 00:39:42.000 --> 00:39:46.000 whereas iterating over a vector that's sorted, or iterating over a BST 00:39:46.000 --> 00:39:47.000 or, 00:39:47.000 --> 00:39:51.000 in this case, our set, for example, which is backed by a binary search tree, 00:39:51.000 --> 00:39:54.000 will give you that sorted order.
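To put numbers to that averaging argument (the figures are illustrative, not from the lecture):

    1,000 fast inserts at ~1 unit of work each  =  1,000 units
    one rehash that moves all 1,000 entries     =  1,000 units
    total                                       =  2,000 units over 1,000 inserts
                                                =  ~2 units per insert

Every doubling pays for itself the same way, so the average cost of an insert, the amortized cost, stays constant no matter how large the table grows.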
So it solves this harder problem, right, which causes there to 00:39:54.000 --> 00:39:57.000 be kinda more investment in that problem, whereas hashing solves just exactly the 00:39:57.000 --> 00:39:58.000 one problem, which is: 00:39:58.000 --> 00:40:02.000 I want to be able to find exactly this value again 00:40:02.000 --> 00:40:03.000 and update this value, 00:40:03.000 --> 00:40:08.000 and nothing more is needed than just identifying this, 00:40:08.000 --> 00:40:10.000 not finding the things that are less than this. For example, if you needed to find all 00:40:10.000 --> 00:40:13.000 the values that were less than this key alphabetically, 00:40:13.000 --> 00:40:15.000 right, the hash table makes that 00:40:15.000 --> 00:40:16.000 no easier 00:40:16.000 --> 00:40:18.000 than an unsorted vector, 00:40:18.000 --> 00:40:21.000 whereas things like a sorted vector and binary search tree actually enable that kinda search: 00:40:21.000 --> 00:40:24.000 you could find the place in the tree and kind of work your way down the left side 00:40:24.000 --> 00:40:27.000 to find the things that were less than that. 00:40:27.000 --> 00:40:31.000 That's not actually being solved by the hash at all. Its use of 00:40:31.000 --> 00:40:34.000 memory is actually comparable to a 00:40:34.000 --> 00:40:35.000 binary search tree, 00:40:35.000 --> 00:40:38.000 in that it has a four-byte pointer per entry, which is the next field on the 00:40:38.000 --> 00:40:40.000 linked list chain, and 00:40:40.000 --> 00:40:44.000 then there's a four-byte pointer for each bucket, the head of the linked list 00:40:44.000 --> 00:40:47.000 cell. Given that we expect the number of buckets to be roughly equal to the number of entries, 00:40:47.000 --> 00:40:51.000 we can kinda just summarize that as: each entry 00:40:51.000 --> 00:40:53.000 represents an eight-byte overhead, 00:40:53.000 --> 00:40:56.000 which is the same eight-byte overhead, right, that the left and right pointers of 00:40:56.000 --> 00:40:58.000 the binary search 00:40:58.000 --> 00:40:59.000 tree add. 00:40:59.000 --> 00:41:01.000 Does it have any degenerate cases? That's a good 00:41:01.000 --> 00:41:06.000 question. So one of the things that made the binary search tree a little bit of a 00:41:06.000 --> 00:41:08.000 heavy investment was that to get that 00:41:08.000 --> 00:41:11.000 balancing behavior, even when we're getting data coming in in a way that's 00:41:11.000 --> 00:41:16.000 causing a lopsided construction, we have to go out of our way to do something special. 00:41:16.000 --> 00:41:19.000 What is the degenerate case for hash? Is there something that causes it to behave really, 00:41:19.000 --> 00:41:24.000 really badly? Anyone wanna help me out with 00:41:24.000 --> 00:41:28.000 that? Is it always good? Does it always give a good spread? Can we 00:41:28.000 --> 00:41:32.000 end up with everything in the same bucket somehow? [Inaudible] repeated it that way? 00:41:32.000 --> 00:41:35.000 So if you have repeated elements, the way that 00:41:35.000 --> 00:41:37.000 both versions of the map work is they overwrite. 00:41:37.000 --> 00:41:41.000 So actually the nice thing is, yeah, if you did it again, you'd just take that same cell and 00:41:41.000 --> 00:41:43.000 overwrite it, so in fact we wouldn't get a clustering based on the same entry 00:41:43.000 --> 00:41:45.000 going in and out. 00:41:45.000 --> 00:41:51.000 But how would we end up with everything being in the same bucket?
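One concrete way that can happen, anticipating the answer that follows, is a hash function that ignores its input entirely. A hypothetical sketch:

    #include <string>
    using namespace std;

    // The degenerate case: every key lands in bucket 0, the whole table
    // collapses into one long chain, and lookup becomes linear search.
    int dumbHash(const string& s, int numBuckets) {
        return 0;
    }

    // Better, but still careless: it uses every character, yet anagrams like
    // "cat" and "act" always collide because addition ignores position.
    int sumHash(const string& s, int numBuckets) {
        unsigned int total = 0;
        for (size_t i = 0; i < s.length(); i++) total += (unsigned char)s[i];
        return (int)(total % numBuckets);
    }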
00:41:51.000 --> 00:41:54.000 It mostly comes down to the hash function. If you look at your hash function, when would a hash function collide? 00:41:54.000 --> 00:41:58.000 Right, and if you have a dumb hash function, right, you can definitely have some degenerate 00:41:58.000 --> 00:42:01.000 cases, my dumb hash function of ''return zero,'' right, being the 00:42:01.000 --> 00:42:03.000 worst example of that, 00:42:03.000 --> 00:42:07.000 but any hash function, right, that wasn't being particularly clever about using all of its 00:42:07.000 --> 00:42:09.000 information might actually have some clustering in 00:42:09.000 --> 00:42:12.000 it. It is possible, for example, even with a particularly good hash function, that 00:42:12.000 --> 00:42:15.000 there are strings that will hash to the same thing, so if you somehow got 00:42:15.000 --> 00:42:18.000 really, really unlucky, 00:42:18.000 --> 00:42:19.000 or 00:42:19.000 --> 00:42:23.000 had a hash function that wasn't doing the best job possible, you could get some 00:42:23.000 --> 00:42:25.000 clustering. But in general, 00:42:25.000 --> 00:42:29.000 the sole responsibility, right, for the degenerate case comes down to your hash function. 00:42:29.000 --> 00:42:31.000 So as long as your hash function 00:42:31.000 --> 00:42:34.000 and your inputs match your expectation, 00:42:34.000 --> 00:42:35.000 you don't have any surprises about 00:42:35.000 --> 00:42:38.000 how they got inserted, the way you did in a binary search tree, which is sometimes hard 00:42:38.000 --> 00:42:40.000 to control. 00:42:40.000 --> 00:42:43.000 Just as you could underestimate the number of buckets you need ? Yeah. ? could 00:42:43.000 --> 00:42:46.000 you overestimate the number of buckets you 00:42:46.000 --> 00:42:51.000 need by a large amount ? Um hm. ? that's wasting a lot of ? Exactly. ? memory, and 00:42:51.000 --> 00:42:52.000 do you then go through and take 00:42:52.000 --> 00:42:54.000 that memory back? Yeah. 00:42:54.000 --> 00:42:57.000 Yeah, so that's a great question. So a lot of data structures 00:42:57.000 --> 00:42:58.000 don't shrink, 00:42:58.000 --> 00:43:02.000 right. For example, the vector often will only grow on demand, and then even if you deleted a 00:43:02.000 --> 00:43:04.000 bunch of, or removed a bunch of, elements, it doesn't 00:43:04.000 --> 00:43:06.000 necessarily consider shrinking a big deal. And so 00:43:06.000 --> 00:43:10.000 that's often, maybe, a little bit lazy, but it turns out, right, that having a bunch of memory around 00:43:10.000 --> 00:43:12.000 you're not using 00:43:12.000 --> 00:43:15.000 doesn't have nearly the same penalty as not 00:43:15.000 --> 00:43:17.000 having the memory when you need it and having to do a lot of extra work to find 00:43:17.000 --> 00:43:18.000 things. 00:43:18.000 --> 00:43:21.000 So typically, right, you would try to pick a size that you are 00:43:21.000 --> 00:43:26.000 willing to commit to, and say, ''Well, no matter what, I'll stay at this size. I won't get any smaller.'' 00:43:26.000 --> 00:43:28.000 00:43:28.000 --> 00:43:30.000 But as you grow, you might not shrink back to that size. You might just 00:43:30.000 --> 00:43:34.000 let the capacity kinda lie in waste. There are good 00:43:34.000 --> 00:43:37.000 reasons to actually be tidy about it, but again, some of this is hard to predict.
00:43:37.000 --> 00:43:40.000 It might be that your table kinda grows big and then a bunch of things get taken out, and then it grows 00:43:40.000 --> 00:43:41.000 big again. Like, maybe you 00:43:41.000 --> 00:43:43.000 have a new freshman class come in, 00:43:43.000 --> 00:43:45.000 and then you graduate a whole bunch of people. Then 00:43:45.000 --> 00:43:47.000 at the point where you graduate them all, you're about ready to take in a new freshman 00:43:47.000 --> 00:43:51.000 class, and so it might be that in fact, right, the capacity you just got rid of, you're gonna plan on 00:43:51.000 --> 00:43:53.000 reusing in a minute anyway, and so maybe 00:43:53.000 --> 00:43:56.000 clearing it out and totally releasing it 00:43:56.000 --> 00:43:57.000 may actually be an exercise 00:43:57.000 --> 00:44:01.000 that was unimportant. You're planning on getting back to that size eventually, anyway. 00:44:01.000 --> 00:44:04.000 The truth is, most hash tables tend to grow. Right, you tend to 00:44:04.000 --> 00:44:07.000 be collecting data in there, and 00:44:07.000 --> 00:44:11.000 they tend to only enlarge. It's kind of unusual that you take out a bunch of 00:44:11.000 --> 00:44:15.000 entries you already put in. 00:44:15.000 --> 00:44:18.000 And so I just put a little last note, which is like, okay, well, I had said earlier that when we 00:44:18.000 --> 00:44:19.000 have the map, 00:44:19.000 --> 00:44:22.000 the key had to be string type, and I was gonna at 00:44:22.000 --> 00:44:25.000 some point come back to this and talk about 00:44:25.000 --> 00:44:28.000 why that was the one restriction in all our template classes, right: you can put anything you 00:44:28.000 --> 00:44:31.000 want in set or vector or stack or queue. 00:44:31.000 --> 00:44:34.000 The map lets you store any kind of value associated with that key, but that key was 00:44:34.000 --> 00:44:36.000 string type, and 00:44:36.000 --> 00:44:40.000 so given how our map is implemented as a hash table, that actually starts to make sense: if 00:44:40.000 --> 00:44:42.000 you're gonna take the key and kind of jumble it up and map it to 00:44:42.000 --> 00:44:44.000 a bucket, 00:44:44.000 --> 00:44:47.000 you need to know something about how to extract information from the key, and in 00:44:47.000 --> 00:44:51.000 this case, knowing it's a string means you know it has characters, you know its length, and you can do 00:44:51.000 --> 00:44:52.000 those things. 00:44:52.000 --> 00:44:53.000 If you wanted to make a map 00:44:53.000 --> 00:44:57.000 that could take any kind of key, maybe integers as keys or student structures as 00:44:57.000 --> 00:44:59.000 keys or, you 00:44:59.000 --> 00:45:00.000 know, 00:45:00.000 --> 00:45:01.000 doubles as keys, 00:45:01.000 --> 00:45:05.000 any of these things, it's like, well, how can you write a generic form 00:45:05.000 --> 00:45:06.000 that could use 00:45:06.000 --> 00:45:08.000 a hashing operation on that 00:45:08.000 --> 00:45:11.000 kind of key and map it to bucket numbers? 00:45:11.000 --> 00:45:14.000 You would build a two-type template to kinda get this working in C++, so 00:45:14.000 --> 00:45:18.000 you can actually have a template that has a type name for the key type, a type name for 00:45:18.000 --> 00:45:19.000 the value type, 00:45:19.000 --> 00:45:20.000 and then 00:45:20.000 --> 00:45:25.000 in the member functions you'd refer to both key type and value type as these two 00:45:25.000 --> 00:45:26.000 distinct placeholders.
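A sketch of what such a two-type template might look like, folding in the client-supplied hash callback described just below; the names and details are hypothetical, not the actual CS106B Map interface:

    #include <string>
    #include <vector>
    using namespace std;

    // Hypothetical two-type map: KeyType and ValueType are the two distinct
    // placeholders, and the client supplies the hash callback, since only
    // the client knows how to digest its own key type.
    template <typename KeyType, typename ValueType>
    class MyMap {
    public:
        typedef int (*HashFn)(const KeyType& key, int numBuckets);

        MyMap(HashFn fn, int numBuckets = 99) : hash(fn), buckets(numBuckets) {}

        void put(const KeyType& key, const ValueType& value) {
            vector<Pair>& bucket = buckets[hash(key, buckets.size())];
            for (size_t i = 0; i < bucket.size(); i++) {
                if (bucket[i].key == key) { bucket[i].value = value; return; }  // overwrite duplicates
            }
            bucket.push_back(Pair(key, value));
        }

        ValueType* get(const KeyType& key) {
            vector<Pair>& bucket = buckets[hash(key, buckets.size())];
            for (size_t i = 0; i < bucket.size(); i++) {
                if (bucket[i].key == key) return &bucket[i].value;
            }
            return nullptr;                       // key not present
        }

    private:
        struct Pair {
            Pair(const KeyType& k, const ValueType& v) : key(k), value(v) {}
            KeyType key;
            ValueType value;
        };
        HashFn hash;
        vector< vector<Pair> > buckets;           // vector-of-vectors chaining, for brevity
    };

    // The client's side of the bargain: a hash for its own key type.
    int hashInt(const int& key, int numBuckets) {
        return (key % numBuckets + numBuckets) % numBuckets;   // stays valid for negative keys
    }

A client would then declare something like MyMap<int, vector<string> > scores(hashInt) for the exam-score example mentioned next.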
00:45:26.000 --> 00:45:29.000 And then when the client set it up, right, they'd be saying, ''Okay, I want a map 00:45:29.000 --> 00:45:32.000 that has strings that map to integers.'' So maybe this is 00:45:32.000 --> 00:45:36.000 words and the page they appear on in a document. Or I have a map of integers to vectors of 00:45:36.000 --> 00:45:40.000 strings; maybe this is the score on an exam and the names of the students who got that 00:45:40.000 --> 00:45:42.000 score, doing it that way. 00:45:42.000 --> 00:45:44.000 And to make this work, right, 00:45:44.000 --> 00:45:48.000 there has to be some kind of universal hashing that, given some generic type, 00:45:48.000 --> 00:45:53.000 can turn it into an integer in the range of buckets. 00:45:53.000 --> 00:45:53.000 The 00:45:53.000 --> 00:45:56.000 thing it's gonna require here is that the client get involved. 00:45:56.000 --> 00:45:58.000 You cannot write this 00:45:58.000 --> 00:46:00.000 generic 00:46:00.000 --> 00:46:01.000 hash function 00:46:01.000 --> 00:46:04.000 that will work for all types, right. If it's a student structure, what are you gonna look at? The 00:46:04.000 --> 00:46:08.000 ID field, the name field, or what? There's just no sensible way that you can 00:46:08.000 --> 00:46:10.000 talk about a generic type and talk about how to hash 00:46:10.000 --> 00:46:14.000 it, so what you would have to do is have some sort of client callback. So probably you would 00:46:14.000 --> 00:46:18.000 create this by passing in a callback that, given a key-type thing and a number of 00:46:18.000 --> 00:46:20.000 buckets, would 00:46:20.000 --> 00:46:24.000 do the necessary machinations, right, to hash 00:46:24.000 --> 00:46:26.000 it into the right number. 00:46:26.000 --> 00:46:30.000 And so that would be kinda one of those coordination things between the client and the implementer: 00:46:30.000 --> 00:46:34.000 the client knows what the data is it's trying to store, and kinda how to 00:46:34.000 --> 00:46:35.000 00:46:35.000 --> 00:46:41.000 mash it up, and then the map would know, okay, given that number, where to stick it. So 00:46:41.000 --> 00:46:42.000 that's what it would take, and then, 00:46:42.000 --> 00:46:45.000 rather than sorta load that onto your 00:46:45.000 --> 00:46:48.000 novice heads, we went ahead and just said, ''Okay, it'll always be a string, so that we can 00:46:48.000 --> 00:46:53.000 do a hash internally without you having to get involved.'' All right, 00:46:53.000 --> 00:46:55.000 any questions about hashing? I'm gonna 00:46:55.000 --> 00:46:59.000 give you, like, a two-second talk about set, 00:46:59.000 --> 00:47:01.000 because it's the one 00:47:01.000 --> 00:47:04.000 ADT where I never said, ''Well, how does it work? What do we do?'' It turns 00:47:04.000 --> 00:47:07.000 out it doesn't add any new things that we haven't already done, so in fact I'm just 00:47:07.000 --> 00:47:10.000 gonna tell you what it does, and 00:47:10.000 --> 00:47:13.000 then your picture will be complete, right. 00:47:13.000 --> 00:47:16.000 It wants to do fast searching, fast updating, add and remove, and it also has all 00:47:16.000 --> 00:47:21.000 these high-level operations, such as intersect, union and difference, 00:47:21.000 --> 00:47:24.000 that are kind of its primary set of things to do.
00:47:24.000 --> 00:47:25.000 00:47:25.000 --> 00:47:27.000 A bunch of the things we've already seen, right, 00:47:27.000 --> 00:47:30.000 would actually be a reasonable strategy for implementing set, right, using some kind 00:47:30.000 --> 00:47:31.000 of vector or array. 00:47:31.000 --> 00:47:34.000 Most likely you'd want to keep it in sorted order, because that would buy you 00:47:34.000 --> 00:47:37.000 the fast lookup for contains, 00:47:37.000 --> 00:47:41.000 but the add and remove, right, would necessarily be slow in those cases because of the 00:47:41.000 --> 00:47:43.000 contiguous memory. 00:47:43.000 --> 00:47:44.000 00:47:44.000 --> 00:47:47.000 The linked list is probably not too good of an option here, 00:47:47.000 --> 00:47:50.000 because contains then becomes 00:47:50.000 --> 00:47:51.000 linear, and 00:47:51.000 --> 00:47:54.000 add and remove similarly, right, linear, so it doesn't actually get you anywhere; in 00:47:54.000 --> 00:47:57.000 fact it just adds the memory in pointers and gunk like that. 00:47:57.000 --> 00:47:59.000 The binary search tree, probably a 00:47:59.000 --> 00:48:00.000 pretty good call, 00:48:00.000 --> 00:48:03.000 right; that 00:48:03.000 --> 00:48:06.000 kinda meshes the advantage of a dynamic structure with the sorted property to 00:48:06.000 --> 00:48:11.000 be able to give it fast update: adding and removing and searching all can be done 00:48:11.000 --> 00:48:14.000 in logarithmic time if you have balancing. 00:48:14.000 --> 00:48:17.000 And then it actually also enables the high-level ops, so the idea of being able to do a union, 00:48:17.000 --> 00:48:19.000 intersection, difference, and actually being able to 00:48:19.000 --> 00:48:23.000 walk over both of the trees in sorted order, which is very easily done 00:48:23.000 --> 00:48:24.000 with an in-order traversal, 00:48:24.000 --> 00:48:28.000 makes those operations actually more efficient to implement than kinda having to deal with 00:48:28.000 --> 00:48:32.000 data coming at you all different ways. It 00:48:32.000 --> 00:48:33.000 turns out hashing really 00:48:33.000 --> 00:48:35.000 won't quite do it 00:48:35.000 --> 00:48:36.000 00:48:36.000 --> 00:48:37.000 very easily for us, 00:48:37.000 --> 00:48:40.000 because of the same need about the template situation, like, well, you have to have a hash function that 00:48:40.000 --> 00:48:43.000 can hash these things and do the right thing, and so the tree happens to be a 00:48:43.000 --> 00:48:46.000 good fit: if the client can just give you ordering information, you can 00:48:46.000 --> 00:48:48.000 do all the hard work without 00:48:48.000 --> 00:48:50.000 pushing hashing onto them. So in fact 00:48:50.000 --> 00:48:53.000 the way our set really does work is it does use a binary search tree. It happens to 00:48:53.000 --> 00:48:56.000 use it through another layered abstraction: 00:48:56.000 --> 00:48:59.000 there actually is a BST class that we've never really 00:48:59.000 --> 00:49:02.000 encountered directly, but the BST class just takes the idea of a binary search 00:49:02.000 --> 00:49:06.000 tree, adds balancing and 00:49:06.000 --> 00:49:08.000 all the operations for finding and inserting and adding, 00:49:08.000 --> 00:49:12.000 and packages it up, and then set actually just turns out to be one-liners 00:49:12.000 --> 00:49:14.000 making calls into the binary search tree. 00:49:14.000 --> 00:49:17.000 That's where all the heavy lifting 00:49:17.000 --> 00:49:18.000 goes: down there.
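A minimal sketch of that layering, with hypothetical names. The toy BST below is unbalanced and orders with operator<, where the real class, per the lecture, adds balancing (and takes a client-supplied comparison callback for ordering):

    template <typename T>
    class BST {
    public:
        BST() : root(nullptr) {}

        void insert(const T& v) {
            Node** cur = &root;
            while (*cur != nullptr) {
                if (v == (*cur)->value) return;    // sets ignore duplicates
                cur = (v < (*cur)->value) ? &(*cur)->left : &(*cur)->right;
            }
            *cur = new Node{v, nullptr, nullptr};
        }

        bool find(const T& v) const {
            Node* cur = root;
            while (cur != nullptr) {
                if (v == cur->value) return true;
                cur = (v < cur->value) ? cur->left : cur->right;
            }
            return false;
        }

    private:
        struct Node { T value; Node* left; Node* right; };
        Node* root;
    };

    // Set itself is just one-liners delegating to the tree: the heavy
    // lifting all happens down in the BST layer.
    template <typename T>
    class Set {
    public:
        void add(const T& elem)            { tree.insert(elem); }
        bool contains(const T& elem) const { return tree.find(elem); }
    private:
        BST<T> tree;
    };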
00:49:18.000 --> 00:49:23.000 That kinda completes the big picture. So our map uses the hash table, right, our 00:49:23.000 --> 00:49:26.000 set uses the BST, which is the binary search tree, 00:49:26.000 --> 00:49:28.000 right, our vector, right, is using an array, 00:49:28.000 --> 00:49:33.000 and then the stacks and queues have viable strategies both for array- or 00:49:33.000 --> 00:49:37.000 vector-based backing as well as linked list, both getting good time performance. And then 00:49:37.000 --> 00:49:38.000 you saw with the PQ, right, 00:49:38.000 --> 00:49:40.000 even 00:49:40.000 --> 00:49:44.000 yet another structure, the heap, right, which is kinda a variant of the tree, and so those tend to be the 00:49:44.000 --> 00:49:46.000 kind of really classic things on which a lot of things get built, 00:49:46.000 --> 00:49:51.000 and so having a chance to use them both as a client and kinda dig 00:49:51.000 --> 00:49:53.000 around as an implementer kinda gives you this big picture of 00:49:53.000 --> 00:49:55.000 some of the more common ways that 00:49:55.000 --> 00:49:57.000 cool data structures get built. 00:49:57.000 --> 00:50:00.000 So that's all for Friday. We're gonna do some more good stuff next week. I'm gonna go to the Terman 00:50:00.000 --> 00:50:03.000 cafe in a little bit; if you've got some time, please come, and have 00:50:03.000 --> 00:50:04.000 a good weekend. 00:50:04.000 --> 00:50:07.000 Work hard on Pathfinder.