This presentation is delivered by the Stanford Center for Professional Development.

What are we doing today? We're gonna talk about hashing. Hashing's one of the coolest things you're ever gonna learn in 106B, so it's a good day to be here and learn something really neat — a clever, inventive idea for how you do search and retrieval in a very efficient way.

Let me put us back to where we were a lecture ago, back to Monday, where we had talked about ways we might implement the map abstraction — the key-value pairing — to allow for quick search and retrieval. We had looked at the vector and done a little thought experiment about what sorting the vector would buy us: it would give us the ability to do searches faster, because we could use binary search. But in the case of add, we'd still be slow, because the elements need to be shuffled or rearranged in that contiguous memory. That's what motivated our introduction to trees — the binary search tree, which is designed exactly to enable that kind of searching, that logarithmic divide-and-conquer, divide in half and work your way down. And then, because of the flexibility of the pointer-based structure, we can also do the insert in the same logarithmic time, because we're no longer constrained by the contiguous memory that array-backed data structures force on us.

So at that point we had seen logarithmic time for both getValue and add. And on the PQ, which you were just finishing, you'll notice that logarithmic is pretty fast. In fact, for most values of n that you would normally run into in ordinary use, logarithmic is unmeasurable. Those of you who got your heap up and running, correctly working in logarithmic time, discovered that you can put 10,000, 20,000, 50,000, 100,000 things in it and never be able to measure what it took to enqueue or dequeue from those structures, because logarithmic is very, very fast.

That said, that would actually be a fine place to stop — logarithmic is fast enough that we don't have to look further. But in fact I'm gonna push it a little bit today, and we're gonna get it down to where we can do both of those operations in constant time. Which is to say: no matter how large the table gets, we expect to be able to insert and retrieve in the exact same time for a 10,000-entry table as for a million-entry table, which is pretty neat.
As we start to get to these more complicated data structures, one thing we also need to keep in mind is that there are trade-offs in adding that complexity to the code. Starting with something that's simple and a little slower, but easy to write, might actually solve the problem we have at hand. Maybe we don't really wanna go to one of these fancier things, because we don't really need it to be that much faster. The BST adds a bunch of memory — left and right pointers for every cell, an 8-byte overhead on top of the entry itself. Now we have to deal with pointers, dynamic memory, all this allocation and deallocation, and we've got a lot of recursion going on in there — plenty of opportunities for errors to creep in. And if we take the steps to provide tree balancing, which we only grazed the surface of — we just mentioned the need for it — it actually adds quite a lot of complication to do the rotations or the rebalancing beyond a simple binary search tree. So the whole package of building a binary search tree with guaranteed log n performance is actually a pretty big investment of human capital that you have to balance against the future payoff of it running faster. The strategy I'm gonna show you today is actually kind of simple to write, gets us to constant time in the common case, and doesn't have the degenerate case looming that the binary search tree did.

So let me give you a conceptual model to think about first. You have a big dictionary — a many-thousand-page dictionary — and you want to look up a word. You certainly don't do linear search: about to look up the word xylophone, you're not going to start from the A's and page your way through 2,000 pages to get to the X's. You don't even quite do binary search. Binary search is pretty fast, but it turns out you don't start at the M's to get to something in the X's. You tend to know about where X is, and often dictionaries have little tabs along the pages that give you an idea of where the breaks for certain words are.
Maybe it says Xa is here and Xe is there, and that lets you quickly eliminate all the surrounding noise in the dictionary and get straight to the region where the X words are. Then you can do a simple search from there — maybe a little bit of a linear or binary search to zero in on the page where the word you wanna find is.

This is the idea behind the hash table, based on the algorithmic technique of hashing. Rather than trying to keep one very large collection sorted or organized in a way that lets you jump around within it, it says: how about we maintain a bunch of different collections? We break the data set up into little tiny pieces, and if we can tell, given a key, which piece of the data set it's supposed to be in — and we'll call those pieces buckets — then I can put things into different buckets. If I had a whole bunch of students, I might put all the freshmen in this bucket and the sophomores in that bucket, and then when I look for a student I'd say, "Which class are you?" You say, "Freshman." I look in the freshmen bucket.

The idea in a simple case is: I have some criterion by which I can divide you into buckets that don't overlap, so everyone has a unique bucket to go in. If I can make up enough of those categorizations — have enough buckets around to divide you up into — then I can hopefully keep the set that lands in each bucket pretty small. Maybe I do it by the first letter of your last name: everybody whose last name begins with W goes here, everybody whose name begins with B goes there, and so on. Then when somebody comes to me, I don't have to look through anybody else's bucket. If you have a class of 200 students and 26 letters of the alphabet, then if it were evenly divided across those letters — it's not going to be, it turns out in practice, but as a thought experiment — there'd be about 4 people per bucket, and it'd be pretty fast to look through the bucket and find the one I was looking for.

Okay, so let's take this idea. There's going to be some strategy for how we divide into buckets, and we call that strategy the hash function.
Given a key, and a number of buckets — we'll just label the buckets with numbers, zero to the number of buckets minus one, so if I have 99 buckets, they're numbered 0 to 98 — the hash function takes the key as input (in this case, the string key we're using in the map) and produces a number as output in the range zero to B minus one. That number is called the hash value for that key, and it tells me which bucket to look in: either to find a match, if I'm trying to do a getValue, or to decide where to place a new entry, if I'm trying to add into the table. So that's the basic idea.

I'm gonna talk a little bit about hash functions. Writing a really good hash function is actually pretty complicated, so we're gonna think about it from a big-picture perspective and consider what qualities a hash function has to have. I'm not gonna prove to you what mathematically constitutes a good hash function; I'm just gonna give you some intuitive sense of what kinds of things matter when designing one.

So: given a key, it maps it to a number, typically zero to the number of buckets minus one. Often that means it computes something and then uses the mod operation to bring the result back into range. It takes your number and says, "Okay, if you have five buckets to divide things into, I'll give you a number zero to four." It's very important that it be stable: you put the key in, you get a number out; if you put that same key in later, you should get the same number out. It has to be a reliable, guaranteed mapping — it can't be random. That said, you want it to almost look random in terms of how it maps things to buckets. You don't want a real systematic plan where it puts a whole bunch of things in bucket one and very few in bucket two, or somehow always uses the even buckets and not the odd ones. You want a strategy that takes the keys and uses your whole bucket range in an even manner, so that if you put 100 keys in and you have 100 buckets, you're hoping that, on average, each bucket will have one thing — not clusters where 25 things are in one bucket, ten here, and then a whole bunch of buckets empty.
So you want it divided across the range. Given that we have string input, what we're likely to use is the characters in the string: how can I take these characters and jumble them up in a way that moves keys around these buckets reliably — the same string always gives me back the same bucket — but doesn't produce a lot of artifacts, like similar strings all mapping to the same place, and hopefully spreads them around?

For example, one really simple thing you could use is just the first letter. I talked about this with last names: I could divide my students by the first letter of their last name. That uses exactly 26 buckets, no more, no less, and if I happen to have 26 buckets, then at least there's some chance it gets some coverage across the range. There is likely to be clustering — some names, or rather some letters, are a lot more commonly used than others — so it would tend to produce a bit of an artifact, but it would be reliable and it would use 26 buckets. If I had 100 buckets, this would be kind of a crummy function, because it would use the first 26 buckets and none of the others. So in some situations this would get me a little bit of what I wanted.

Something like the length of the word is a very, very bad function in terms of clustering: there'd be a bunch of strings that are six, seven, ten characters, but very few that are one or two, and very few that are more than ten or so. You'd have this very tight little cluster in the middle, and it would not use a large bucket range effectively at all.

More likely, what you're gonna try to use is more of the information about the word — the key — that you have.
Instead of just a single letter or the count of letters, try to use all the data you have at hand, which is to say all the letters. One way to do that is to take the ASCII values of the letters — 'a' is 97, 'b' is 98, and so on — so for a word like "atoms" you'd add 97 plus 116 plus 111 and so on, add them all together, and get some sum, 500, 600, whatever it is, depending on how many letters there are. Then use mod to bring it back into range: say you had 100 buckets, so you want zero to 99 — you'd take that sum and mod it by 100, which is just taking the last two digits. If the sum is 524, it goes into bucket 24.

So if "act" is the word you have, you add these codes together — 97 for the 'a', 99 for the 'c', a hundred-something for the 't' — and we end up with, I'm making this number up, 283. We mod down to the last two digits, and this one goes into bucket 83. And if we have "dog," maybe that adds to 267, and we mod that down to 67. So this actually tries to use all the information at hand — all the letters we have — and bring it down into range. And since, across the space of all the sums being produced, you're just as likely to get 210 as 299 or any of the numbers in between, hopefully you get a spread across your buckets.

There is one thing this clusters on, though. Does anybody see which words cluster in this system — words you'd think of as unrelated that would land in the same bucket? [Student: rearrangements?] Yes — any anagram. You take "cat," which is just an anagram of "act": it's gonna add to the same sum, so there'd be a little bit of clustering in that respect. Anagrams come down to the same place, so words like "act" and "cat," which seem like different words to you, are gonna collide in this system, and that might be a little bit of something we worry about.
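As a concrete sketch, that letter-sum scheme might look like this in C++. The function name and the example numbers are just illustrative, not the lecture's actual code:

```cpp
#include <string>

// Naive letter-sum hash: add the ASCII codes of all the characters,
// then mod by the bucket count to land in [0, numBuckets).
// Weakness noted above: anagrams like "act" and "cat" hash identically.
int sumHash(const std::string& key, int numBuckets) {
    int sum = 0;
    for (char c : key) {
        sum += c;                // 'a' contributes 97, 'c' 99, 't' 116, ...
    }
    return sum % numBuckets;     // e.g. a sum of 524 with 100 buckets -> bucket 24
}
```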
One way to avoid that — and the hash function we're eventually going to use does this — is not only to add the letters together, but also to multiply each letter by a power of a very large prime number: the large prime times 'c', plus the large prime squared times 'a', plus the large prime cubed times 't', and so on. That way, each letter's position is also encoded in the number that comes out, so something that's positionally different won't necessarily land in the same bucket without some other factors coming into play. It's a little bit of taking the information and jumbling it up.

So let me show you how this looks. Actually, I'm gonna tell you about collisions first, and then I'll show you the data structure. As I said, there's gonna be this problem where things collide: "act" and "cat," in this simple system of adding the codes, come out the same. So it can't be that each bucket holds exactly one thing. With hashing, although we try to keep the buckets as small as possible, we have to account for the possibility that two things will collide — that two different keys will hash to the same bucket, based on how the function jumbled them up. We're trying to avoid that, but when it does happen, we have to have a strategy: a way of saying there can be more than one thing in a bucket, or some other way, when there's a collision, of deciding where to put the entry that was placed in the table second.

The strategy we're gonna use is called chaining. Chaining basically says: once we have something in a bucket, like "act," if we hash something else and it comes to the same bucket, we just tack it onto that bucket, in this case using a linked list. So each bucket is actually gonna be a linked list of entries, and the entire table will be a vector or array of those.

So let me show you a little live action. This is a hash table that has seven buckets, numbered 0 through 6.
They all start empty — these are all empty linked lists, illustrated by the null — and the hash function is shown here like a black box. You don't actually know the workings of the hash function, but you know its job: you give it a key and it gives you a number in the range 0 to 6, and the same key, when passed again, gives you the same number.

So if I say 106B = Julie, it pushes 106B through the hash function and says, "Okay, 106B comes out as one." You add those numbers together, jumble up the letters, and you get one. So it went into the table and said, okay, that's bucket number one — make a new cell for 106B/Julie and stick it in the table. Now there's one non-empty chain, and the rest are empty. If I say 107 is being taught by Jerry, 107 happens to get the same hash value — 107 and 106B are different letters, but it just happens to come out that way, hashing to one — so in this case it just tacked it onto the front of the list. If I add 106A being taught by Patrick, 106A happens to sum to zero in the hash function — that's all we know about the black box — and it ends up loaded into the zero slot. If I add another course being taught by Nick, it also comes out to be zero, so it ends up pushed onto the front of that chain. I add 103 being taught by Maron and it's in bucket five. So as I type things in, I'm starting to see things landing in different places in the table — a little bit of clustering at the top, a little bit of empty at the bottom — but as I continued to type in strings for other course numbers, I'd hopefully fill out the table to where there was a roughly even spread across it.

When I want to look something up — because I wanna see who's teaching 106B — it uses the same process in reverse.
It asks the hash function, "Which bucket should I look in?" And it says, "You should look in bucket one, because 106B's hash code is one." All the other buckets are eliminated from consideration — a very quick narrowing down to the right place — and then it just does a linked-list traversal to find 106B in the bucket where it has to be.

The key to realizing what makes this so magical is that the choice of the number of buckets is up to me. I can pick the number of buckets to be whatever I want: five, 10, 1,000, a million. If I choose the number of buckets to be roughly the same order of magnitude as the number of entries — so if I wanna put a thousand things in my table, I make a thousand buckets to hold them — and my hash function can divide things pretty equally across those thousand buckets, then I can expect each of those linked lists to be about length one. Some will be two, some will be zero, but if the function is doing its job right, I have a bunch of little tiny buckets, each of which holds a tiny fraction of the data set — in this case one or two entries — and it's very easy to search through them.

If I know I'm gonna put a million things in, then I make a million buckets. Then, to find one out of a million things, I run it through the hash function, it gives me a number that says, "Oh, it's 862,444," I go to that bucket, and the item is either there in that bucket or I have to look a little bit within it to find it — but there's no other place in the table I have to look. The fact that there are a million other entries nearby in the other buckets is completely unimportant to us. The searching and updating — adding, replacing, removing, and so on — all happens at the bucket level, and by choosing the total number of buckets to be roughly equal to the number of entries, I get constant-time access to the tiny subset of the data that matters. That's pretty cool.

Let's write some code. It's good to see the whole thing doing what it needs to do. So I'm gonna go ahead and go back over here.
All right, so I've got my map, which has add and getValue and not much else — just set up and ready for us to go. What I'm gonna build here is a linked-list cell that holds a string key, a val-type value, and a pointer to the next cell. I plan on having one of these cells for each entry that goes in the table, and then I'm going to have an array of these lists. I'm gonna make a constant for the array size, and then talk a little bit about the consequences of that. I'll make it 99, so I have 99 buckets to start with. That's not gonna solve all my problems — it means the table is tuned for maps that expect to hold about a hundred things — but we'll come back to that.

Deciphering this declaration is a little bit complicated. It says that buckets is the name of the variable, and it is an array of NumBuckets length — a fixed-size, 99-entry array. Each of those entries is a pointer to a cell, ready to be the head pointer of a linked list for us.

I'm gonna go over here and work on the constructor first. The first thing to do is make sure those pointers have good values. If I do not do this, they will be junk and there will be badness that happens, so I make sure each of them starts off as an empty list. Correspondingly, in the destructor I would need to be deleting all the list cells — iterating through and deleting the cells in each of those lists — but I've been lazy about doing that, and I'm gonna continue to be lazy, just knowing that it ought to be there. Here's a sketch of the skeleton so far.
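This is roughly what that skeleton might look like, assuming names like MyMap, Cell, and NumBuckets stand in for whatever the lecture code actually uses:

```cpp
#include <string>

template <typename ValType>
class MyMap {
public:
    MyMap();
    void add(const std::string& key, const ValType& value);
    ValType getValue(const std::string& key) const;

private:
    static const int NumBuckets = 99;   // tuned for ~100 entries

    // One cell per entry; each bucket is a singly linked chain of these.
    struct Cell {
        std::string key;
        ValType value;
        Cell* next;
    };

    Cell* buckets[NumBuckets];          // head pointers, one per bucket

    // Helpers sketched later in the lecture.
    int hash(const std::string& key, int numBuckets) const;
    Cell* findCell(Cell* head, const std::string& key) const;
};

// Constructor: every bucket starts as an empty list. (A real destructor
// would walk each chain and delete every cell; the lecture skips it.)
template <typename ValType>
MyMap<ValType>::MyMap() {
    for (int i = 0; i < NumBuckets; i++) {
        buckets[i] = nullptr;
    }
}
```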
Then, for add and getValue, what we need to do in both cases is figure out the appropriate list to operate on: run the hash function on my key, see where it tells me to look, and then either search the list to find the match — for getValue, to return the matching value — or, in the add case, search for the matching entry to overwrite if it exists and, if not, create a new cell and tack it on the front of the list. So let's write helper functions, because both operations are gonna need the same two things. I'm gonna put a hash function in here that, given a key and a number of buckets, returns the right bucket to look in, and then I'm gonna write a findCell function that, given a key and the pointer to a list, finds the cell that matches that key, or returns null if it didn't find it.

Okay, let me put my hash function in. I have a good hash function that I'm gonna pull out in a minute, but for now I'm gonna make an interesting choice: I'm gonna have my hash function just return zero, which seems pretty surprising. Let me talk about why. It turns out that when I'm first building this, that's an easy way to test that my code is working without being dependent on the behavior of the hash function. At this point it says, "We'll just put everything in the zero bucket." Now that's not a good strategy for long-term efficiency, but it is a good way to test that my handling of the buckets, the searching, the finding, and the inserting is correct. The performance I should expect to be abysmal — totally linear in the number of elements for add and getValue, because they'll all be in one linked list with no quick access to them. Eventually I can change this to divide across the buckets and get a better spread, but the correctness of the hash table isn't affected by this choice. It's kind of an interesting way to think about designing the code.

The other helper down here is my findCell which, given the head pointer of a list and the key, goes to find it. It's basically: if this current cell's key equals the key, return it; if we've gone through the whole list and haven't found it, return null.

So this guy returns a Cell* — the matching cell within the list — and that return type provokes the little C++ compilation snafu we encountered on the binary search tree as well: Cell is a type declared in the private section, within the scope of MyMap. Outside of that scope — which is to say, before I've crossed the MyMap<ValType>:: in the definition — the compiler doesn't actually believe we're in MyMap's scope yet.
It hasn't seen that and can't anticipate it, so I have to give the full name with the scope specifier on it, and then, because this use of a template's nested type outside its class is a special place for the C++ compiler, the typename keyword also needs to go there.

Okay, so that's the heavy lifting of these two things: hashing to the right place, which I kinda short-changed, and searching to find a cell. If I go back up here to getValue: I call the hash function, given my key and the number of buckets, to find out which bucket to work on; I go looking for a match by calling findCell with the key on buckets[whichBucket]; and then, if match does not equal null, I can return match's value; otherwise we raise an error, "No such key found." That's the error case for getValue: if it didn't find one, its behavior is to raise an error so that the caller knows they've asked for something it can't deal with.

The key thing here is that the hash is usually some sort of constant-time operation that just jumbles the key up — in this case it just returned zero, but it could actually look at the elements of the key — and then it does its traversal down that linked list, which, given a correctly working hash function, should have divided the entries into these little tiny chains, so we can find the one we want quickly by looking through that list. If we find it, we've got it; otherwise we're in the error case.

Add looks almost exactly the same. I start by getting the bucket and doing the findCell. If it did find a match, then we overwrite. In the second case, where we didn't find it, we need to make ourselves a new cell, copy in the key, copy in the value, and attach it — and the easiest place to add something to a linked list is the front. No need to make it hard on ourselves: we're not keeping them in any particular order whatsoever, they're just in order of entry, so if somebody gives us a new one we might as well stick it in the front. So the new cell will point to the current front of the bucket's list, and then we update the bucket pointer to point to this guy.
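Filling those in as a sketch, under the same assumed names — the placeholder hash is the deliberate return-0 version, and a standard exception stands in for whatever error mechanism the course library provides:

```cpp
#include <stdexcept>

// Placeholder hash: everything lands in bucket 0. The table's
// correctness doesn't depend on this choice; only performance does.
template <typename ValType>
int MyMap<ValType>::hash(const std::string& key, int numBuckets) const {
    return 0;
}

// Walk one chain looking for a matching key; null means not found.
// Note the typename and full scope qualification on the return type —
// the compilation snafu described above.
template <typename ValType>
typename MyMap<ValType>::Cell*
MyMap<ValType>::findCell(Cell* head, const std::string& key) const {
    for (Cell* cur = head; cur != nullptr; cur = cur->next) {
        if (cur->key == key) return cur;
    }
    return nullptr;
}

template <typename ValType>
ValType MyMap<ValType>::getValue(const std::string& key) const {
    Cell* match = findCell(buckets[hash(key, NumBuckets)], key);
    if (match != nullptr) return match->value;
    throw std::runtime_error("No such key found");  // stand-in for the course's error call
}

template <typename ValType>
void MyMap<ValType>::add(const std::string& key, const ValType& value) {
    int which = hash(key, NumBuckets);
    Cell* match = findCell(buckets[which], key);
    if (match != nullptr) {
        match->value = value;          // existing key: overwrite
    } else {
        Cell* cell = new Cell;         // new key: prepend to the chain
        cell->key = key;
        cell->value = value;
        cell->next = buckets[which];
        buckets[which] = cell;
    }
}
```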
So I think we're in okay shape here. Let me take a look and make sure I like what I did: hashing, finding the match; if it found an existing cell — this is the overwrite case — we just overwrite with the new value; otherwise we make a new cell and attach it at the front of the bucket.

The case where the bucket is totally empty is good to trace, since that's the most common case when we start. We hash to bucket zero, we search a null linked list, and we should get null out; so we create a new cell, set its next field to the current bucket's head pointer, which was null, and update the bucket to point to this guy. We end up with a single list cell with null trailing it — the inserting-into-the-empty-bucket case. I like it.

Let's see, I have a little code over here that adds some things — car, car, cat, with some numbers — and then tries to retrieve the value of car, which was written with a three and then a four on top of it, so the answer we should get out of this is four. Okay. It would also be interesting to ask for something we know is not in the table, to see that our error case gets handled correctly — and it is: no such key found.

If I were to continue, I could do a little bit of stress testing to see the performance implications of that return-zero hash. If I sat there and just put tons and tons of things in — made up strings like crazy and stuffed them in with numbers — I would just be creating one long chain off the zero bucket, ignoring all the other ones. That would stress-test my linked-list handling of adding and searching, but it would also show that as we got to 100 entries, 200 entries, 300 entries, subsequent adds and getValues would show a linear increase in time because of the clustering we've got going there. Here's what that little test drive looks like as client code.
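A sketch of the test described above, assuming the MyMap sketched earlier:

```cpp
#include <iostream>

int main() {
    MyMap<int> m;
    m.add("car", 3);
    m.add("car", 4);    // same key: the 4 overwrites the 3
    m.add("cat", 1);
    std::cout << m.getValue("car") << std::endl;   // prints 4
    // m.getValue("dog");   // would raise the "No such key found" error
    return 0;
}
```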
So let me give you a real hash function. I'll talk you through what it's doing, and then leave you with the thought that writing one of these correctly and reliably is a more advanced exercise than we're gonna take on here. This is a hash function taken from Don Knuth's The Art of Computer Programming — some of the work that guides a lot of computer scientists to this day — and it has basically the strategy I told you about: for all the characters in the string, add each one in, but first multiply the hash code being built up so far by a large multiplier, which in effect takes advantage of the positional information.

The multiplier happens to be a very large negative number that's prime. The first time through, it takes the hash code of zero, multiplies it, and adds the first character, so the first character gets added in without any modification. The next character, though, takes the previous result, multiplies it by one power of the multiplier, and adds in the next character; then that sum, on the subsequent iteration, gets multiplied by the multiplier again. So, raising it to the squared, the cubed, and the fourth powers, it works its way down the string.

Part of the effectiveness is that it's gonna do a lot of overflow. That's a very large number, and it's adding and multiplying in this very large space, so in effect we're counting on the fact that the addition and multiplication built into C++ don't raise any errors when the numbers get so large that they can't be represented. It just truncates down to what fits, wrapping around in a modular way, like a ring. So we end up making a very big number out of this — the longer the string, the more of those multiplies, the higher the powers — and it's truncated back to what fits into a long. When we're done, we take whatever number fell out of this process and mod it by the number of buckets, to give us a number from zero to buckets minus one.

So whatever gets stuffed into here, it's reliable, in the sense that every time you put in the same key you get the same thing back. Here's a sketch of what that looks like.
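A sketch of that multiplier-style hash as the member function, with the constant assumed to match the "large negative long" the lecture describes. (Strictly speaking, signed overflow is undefined behavior in C++; the lecture leans on the wraparound behavior of real machines.)

```cpp
// Multiply-and-add hash in the style described above: the running
// code is multiplied by a large constant before each character is
// added, so a letter's position changes its contribution.
template <typename ValType>
int MyMap<ValType>::hash(const std::string& key, int numBuckets) const {
    const long Multiplier = -1664117991L;   // assumed lecture constant; note the L suffix
    long hashcode = 0;
    for (size_t i = 0; i < key.length(); i++) {
        hashcode = hashcode * Multiplier + key[i];  // overflow just wraps in practice
    }
    // Cast to unsigned before the mod so the result is guaranteed to
    // land in [0, numBuckets) even if hashcode ended up negative.
    return static_cast<int>(static_cast<unsigned long>(hashcode) % numBuckets);
}
```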
We know the result will be in the range of numBuckets because of that last mod call, and we then use it to say which bucket to go look in when we're ready to do our work.

If I switch that in for my old hash function, I shouldn't actually see any change from the outside, other than the performance, which should suddenly be a lot faster — hard to tell when I have three things in there. I'm gonna take out that error case. "Car" and "cat" are now probably not going to the same bucket — in fact, I can guarantee they don't go in the same bucket — whereas before they were overlapping.

I'll leave that code there for a second, although really I'd say the point of hashing is not to get worried about how the hash code does its work, but to understand the general idea of what a hashing function has to do — what its role is and how it relates to getting this constant-time performance.

So let me give you an example of hashing in the real world. I think it's really interesting to see it actually happen, with people not even realizing how useful hashing is. I ordered something from REI, the camping store, and they didn't have it in stock, so they had to place an order for it — this was actually a while ago. When I called to see if it was in, I said, "Oh, is my order in?" And they didn't ask me my name. I thought this was very funny. I'm like, "I'm Julie. I wanna know about my order." They're like, "Okay, what are the last two digits of your phone number?" It's not one of those things you can answer right off the top of your head — turns out the last two digits of my phone number are 21. And they go and look in the 21 basket, and then they say, "What's your name?" Now they want to know my name.
I'm like, "Okay, my name's Julie." They look it up: "Okay, it's here." So then I go to get it a couple of days later, and I'm standing in line waiting, and I look up on the wall, and they have 100 little wire baskets up there, labeled 0 through 9 one way and 0 through 9 the other. When I come in, they ask me the same question: "What are the last two digits of your phone number?" Now I'm all prepared, because I've already been primed on this question. I say, "21." They walk over to the 21 basket — the top digit, then this digit — and there's only one piece of paper in there. They pull out my order slip, they get me my tent, and everybody's happy.

So while I'm there, I try to engage the cashier on how this is an example of a real-world hashing system. I still got my tent, I'll have you know, but it was close — they're like, "Okay, yeah, you crazy crackpot, you can leave now." But in some sense this is a very good example of what the investment in the number of buckets is versus the trade-off it gives you, because there's a very physical setup here: by choosing a larger number of buckets, you get more fine-grained access, but that investment in buckets means you're permanently installing them on the wall. They didn't use just the last digit of my phone number. They could have, and they'd have had only ten baskets on the wall, but then they'd expect ten times as many orders in each basket. And apparently their estimate was that the number of active orders in this backorder category was significantly more than ten but not so much more than 100, so 100 baskets was the right investment in their bucket strategy. They didn't put a thousand baskets on the wall — what are they gonna do, have some big 3D arrangement? (They did not enjoy this discussion; I went on for hours, and they were really not amused.) With three digits you'd get an even higher likelihood that each basket would be empty —
or have a very small number of things in it — but given their setup, 100 buckets seemed to be the right strategy. And notice: they didn't ask me the first two digits of my phone number, they asked me the last two. Why does that matter? [Student: because you're using all (inaudible).] Yeah — the first digits cluster a lot. If you ask Stanford students — especially back when campus numbers all started with 49 or 72 or something — if you used the first two digits, everybody would be in the 49 basket or the 72 basket, and a whole bunch of other baskets wouldn't be used. For example, the first two digits are never 00 — phone numbers never start with zero — and they rarely have zeros or ones in the leading digits at all; those tend to be reserved for area codes and such. So there's a whole bunch of baskets that would never even get used. I thought it was just a really interesting exercise: they had exactly thought through hashing, even though they didn't think of it as hashing, and they thought I was a crackpot and didn't want to give me my stuff. But I paid my money and I got it.

It shows you what you're trying to do here: make the number of buckets roughly equal to the number of elements, so that if your hash function is dividing things up across your whole range, you've got constant access to what you have.

[Student: Why is there an L at the end of one number?] That's just C++ notation: it marks a long value. Without it, the compiler assumes the constant is an int, and that number is too big to fit in an int — here a long is twice as big as an int in terms of space — so the L needs to be there; otherwise it would try to read it as an int and throw away part of the information I was trying to give it. It's the way of identifying a long constant.

So let me talk about this just a little bit more in terms of actual performance. We've got N over B — the number of entries divided by the number of buckets — and if we make those roughly the same order of magnitude, the same range, which requires making a bit of an estimated guess about where we're going, then there are some choices in how we store each bucket. We could use a little array, we could use a linked list, we could use a vector.
We expect each bucket to be very small, is the truth — we're hoping one or two entries — so there's not a lot of reason to buy any real expensive structure, or even to bother with things like sorting. You could sort them, but if you expect there to be two things, what's the point? You're gonna spend more time figuring out whether to put an entry in the front or the back than you'd ever gain by being able to find it a little faster. The linked list is just the easiest way to get allocation without excess capacity sitting unused in the bucket: you know it's gonna be small, so use the simple strategy.

There's one important thing, though, that a hash table in an industrial-strength situation is likely to do: deal with the question of what happens if the number of buckets you predicted was wrong. The map, as we have given it to you, has to absorb this, because it doesn't know in advance how many things you're gonna put in. The only way to know is to wait and see as things get added, and then notice the buckets getting full. So typically the industrial-strength version of a hash table tracks what's called the load factor, which is just the total number of entries divided by the number of buckets. When that factor gets high — and "high" is actually quite small; a load factor of two is often the trigger — it causes a rehash. Not even letting it get to three or four: as soon as you realize you've exceeded double the capacity you intended — you planned for 100 things, you've got 200 — you go ahead and redo your whole bucket array.

So, for example, in a simple case, say I have a table with six buckets, and I've gotten to where some have two or maybe three entries and another has one. The plan is to take the whole bucket array and enlarge it by a factor of two — grow it, in this case, from 6 to 12 — and then rehash to move everybody. The earlier decision about where to place each entry was based on knowing there were exactly six buckets, so where it was placed before is not likely to be where it lands when there are 12 choices to lay it down into. A sketch of this grow-and-rehash step follows.
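For this sketch to work, the fixed 99-entry array from before would have to become a dynamically allocated array with an int numBuckets data member, so treat the names here as assumptions:

```cpp
// Grow-and-rehash: double the bucket array and re-home every cell.
// Assumes buckets is now a Cell** and numBuckets an int data member.
template <typename ValType>
void MyMap<ValType>::rehash() {
    int oldNum = numBuckets;
    Cell** oldBuckets = buckets;

    numBuckets = oldNum * 2;                 // e.g. 6 buckets becomes 12
    buckets = new Cell*[numBuckets];
    for (int i = 0; i < numBuckets; i++) buckets[i] = nullptr;

    // Run every existing cell back through the hash function, since its
    // old position was based on the old bucket count.
    for (int i = 0; i < oldNum; i++) {
        Cell* cur = oldBuckets[i];
        while (cur != nullptr) {
            Cell* next = cur->next;
            int which = hash(cur->key, numBuckets);  // new home under new count
            cur->next = buckets[which];              // splice onto the new chain
            buckets[which] = cur;
            cur = next;
        }
    }
    delete[] oldBuckets;
}
```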
Ideally, in fact, you don't even want the old neighbors to stay together: if they all landed in the same bucket again, you'd have the same clustering and then a big empty clump. So you rehash everything, running each entry through the hash function again now that the number of buckets has changed, and asking, "Well, where does it go in the new scheme?" Hopefully you end up with a clean table, say 12 items with one in each bucket, ready to give constant-time performance again; and then potentially, if you overload the load factor later, you go back and rearrange things once more.

This strategy is a little like the one we talked about with the binary search tree: wait and see. Getting two items in a bucket doesn't immediately cause any alarm or crisis; the table waits until there's sufficiently clear demand exceeding the planned capacity before doing a big rearrangement.

So the adds would go fast, fast, fast, fast, then start to get just a hair slower (searching through a few things in a chain as opposed to one), and then every now and then you'd see an add that caused a big reshuffle of the table, which is a totally linear operation in the number of elements, and then adds would be fast again for another big clump of inserts until the same expansion was triggered by the table growing even more. But divide that cost over the total number of inserts: if you had 1,000 cheap inserts before you overloaded, and then had to copy all 1,000 things to get ready for the next thousand, that's roughly 2,000 units of work spread over 1,000 inserts, about two per insert, so averaged out it still counts as a constant-time operation. That's pretty neat.
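If you want to check that averaging argument yourself, here is a small back-of-the-envelope simulation (the numbers are made up; it just counts one unit of work per insert plus one unit per entry moved during each rebuild):

    #include <iostream>

    int main() {
        long work = 0;
        int buckets = 100;                    // assumed initial plan
        const int inserts = 1000000;
        for (int n = 1; n <= inserts; n++) {
            work += 1;                        // the cheap insert itself
            if (n > 2 * buckets) {            // load factor passed two: rehash
                work += n;                    // move every entry once
                buckets *= 2;                 // double the bucket array
            }
        }
        // Prints a small constant (around 2 or 3), not something growing with n.
        std::cout << "average work per insert: "
                  << double(work) / inserts << std::endl;
        return 0;
    }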
It's really one of the more easily understood and beautiful algorithmic techniques for solving this hard problem of search and retrieval that the sorted vector and the BST are also trying to get at. In fact, the sorted vector and the BST are solving a much harder problem: keeping a total ordering of all the data and being able to traverse it in sorted order. That's an advantage the sorted vector and the BST have that the hash table doesn't have at all; the hash table has jumbled everything up. If you think about how the iterator for the Map works (our Map is actually implemented using a hash table), it just iterates over the linked lists bucket by bucket, and that explains why the order appears almost completely random. If you put a bunch of things in the table and then go to iterate and visit them again, the order you visit the keys seems to make no sense: it's based on their hash codes, which aren't designed to be any sensible ordering to anybody not in the know of how the hash function works. Iterating over a sorted vector, or over a BST (our Set, for example, is backed by a binary search tree), will give you that sorted order.

So those structures solve a harder problem, which is why they require more investment, whereas hashing solves exactly one problem: I want to be able to find exactly this value again and update it, and nothing more than identifying this one key is needed. For example, if you needed to find all the values whose keys come alphabetically before this one, the hash table makes that no easier than an unsorted vector, whereas a sorted vector or a binary search tree actually enables that kind of search: you could find the place in the tree and work your way down the left side to find the things that are less. That's not solved by the hash table at all.
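You can see that jumbling for yourself in standard C++ (this uses the standard library's std::unordered_map, a chained hash table much like the one described here, and the tree-backed std::map, not our class library's Map):

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> hashed =
            {{"apple", 1}, {"pear", 2}, {"plum", 3}};
        for (const auto& entry : hashed) {
            std::cout << entry.first << " ";   // hash-code order: looks random
        }
        std::cout << std::endl;

        std::map<std::string, int> ordered =
            {{"apple", 1}, {"pear", 2}, {"plum", 3}};
        for (const auto& entry : ordered) {
            std::cout << entry.first << " ";   // tree order: apple pear plum
        }
        std::cout << std::endl;
        return 0;
    }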
The hash table's use of memory is comparable to a binary search tree's: there's a four-byte pointer per entry (the next field on the linked-list chain), and then a four-byte pointer for each bucket (the head of the chain). Given that we expect the number of buckets to be roughly equal to the number of entries, we can summarize that as an eight-byte overhead per entry, the same eight-byte overhead that the left and right pointers of the binary search tree add.

Does it have any degenerate cases? That's a good question. One of the things that made the binary search tree a heavy investment was that to get the balancing behavior, even when data comes in an order that causes lopsided construction, we have to go out of our way to do something special. What's the degenerate case for hashing? Is there something that causes it to behave really, really badly? Does it always give a good spread, or can we end up with everything in the same bucket somehow?

Student: What if an element is repeated?

If you insert the same key again, the way both versions of the map work is that they overwrite: you find that same cell and replace its value. So we don't get clustering from the same entry going in and out. But how could we end up with everything in the same bucket? It mostly comes down to the hash function: when would a hash function collide? If you have a dumb hash function, you can definitely hit degenerate cases, my dumb function of "return zero" being the worst example, but any hash function that isn't particularly clever about using all of its information can produce some clustering. It's even possible, with a perfectly good hash function, for two strings to hash to the same thing; if you got really, really unlucky, or had a hash function that wasn't doing the best job possible, you could get some clustering. In general, though, the sole responsibility for the degenerate case comes down to your hash function. As long as your hash function and your inputs match your expectations, you don't get surprises based on the order things happened to be inserted, the way you did with a binary search tree, which is sometimes hard to control.
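That "dumb hash function" really is the whole degenerate story, and it's worth seeing how innocent it looks. A sketch, matching the (key, bucket count) shape used above:

    #include <string>

    // Perfectly legal, catastrophically bad: every key maps to bucket 0,
    // so the table degrades to one long chain and lookup becomes linear.
    int dumbHash(const std::string& key, int numBuckets) {
        return 0;   // ignores all the information in the key
    }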
Student: Just as you could underestimate the number of buckets you need, could you overestimate them by a large amount, wasting a lot of memory? And do you then go through and take that memory back?

Yeah, that's a great question. A lot of data structures don't shrink. The vector, for example, typically only grows on demand, and even if you remove a bunch of elements it doesn't necessarily consider shrinking a big deal. Maybe that's a little bit lazy, but it turns out that having a bunch of memory around that you're not using doesn't carry nearly the same penalty as not having the memory when you need it and having to do a lot of extra work to find things. So typically you'd pick a size you're willing to commit to and say, "No matter what, I'll stay at this size; I won't get any smaller." As you grow beyond it, you might not shrink back; you might just let the extra capacity lie in waste. There are good reasons to be tidy about it, but some of this is hard to predict. It might be that your table grows big, then a bunch of things get taken out, then it grows big again. Maybe you take in a new freshman class and then graduate a whole bunch of people, and right when you graduate them all, you're about ready to take in a new freshman class; the capacity you just got rid of, you're planning on reusing in a minute anyway, so clearing it out and totally releasing it may be an exercise that's unimportant, since you're planning on getting back to that size eventually. The truth is that most hash tables tend only to grow; you tend to keep collecting data in there, and it's unusual to take back out a bunch of entries you already put in.

One last note. I said earlier that for our Map the key had to be of string type, and that at some point I'd come back and talk about why that's the one restriction in all our template classes. You can put anything you want in a Set or Vector or Stack or Queue, and the Map lets you store any kind of value associated with a key, but the key has to be a string. Given that our Map is implemented as a hash table, that starts to make sense: if you're going to take the key, jumble it up, and map it to a bucket, you need to know something about how to extract information from the key. Knowing it's a string means you know it has characters, you know its length, and you can work with those. If you wanted a map that could take any kind of key (integers as keys, student structures as keys, doubles as keys), how could you write a generic form that can run a hashing operation on that kind of key and map it to bucket numbers?
You would build a two-type template to get this working in C++: a template that has one type name for the key type and another for the value type, and whose member functions refer to the key type and value type as two distinct placeholders. When the client sets it up, they'd say, "Okay, I want a map from strings to integers" (maybe words and the pages they appear on in a document), or "a map from integers to vectors of strings" (maybe exam scores and the names of the students who got each score). To make that work, there has to be some kind of universal hashing that, given a value of the generic key type, can turn it into an integer in the range of bucket numbers. And what that requires is that the client get involved. You cannot write one generic hash function that works for all types: if the key is a student structure, what are you going to look at, the ID field, the name fields? There's just no sensible way to take a fully generic type and know how to hash it. So you'd use a client callback: probably you'd construct the map by passing in a callback function that, given a key-type value and a number of buckets, does the necessary machinations to hash it to the right number. That becomes one of those coordination points between the client and the implementer: the client knows what data it's trying to store and how to mash it up, and the map knows, given that number, where to stick the entry. That's what it would take, and rather than load all of that onto your novice heads, we went ahead and said, "The key will always be a string," so we can hash internally without you having to get involved.
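As a sketch of that shape (this is not the class library's actual interface; MyMap, kNumBuckets, and hashInt are names made up for illustration), the two placeholders plus the client callback might look like this:

    #include <string>

    // Two distinct placeholders: one for the key type, one for the value type.
    template <typename KeyType, typename ValueType>
    class MyMap {
    public:
        // The client supplies the callback: (key, numBuckets) -> bucket index.
        explicit MyMap(int (*hashFn)(const KeyType&, int)) : hash(hashFn) {}

        void put(const KeyType& key, const ValueType& value) {
            int bucket = hash(key, kNumBuckets);   // only the client knows how
            // ... insert (key, value) into that bucket's chain (elided) ...
        }

    private:
        static const int kNumBuckets = 99;
        int (*hash)(const KeyType&, int);
    };

    // An example client callback for integer keys:
    int hashInt(const int& key, int numBuckets) {
        return (key % numBuckets + numBuckets) % numBuckets;  // keep it non-negative
    }

A client would then write something like MyMap<int, std::string> byScore(hashInt); and the map never has to understand what an int means; it just trusts the callback to land in range.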
All right, any questions about hashing? Let me give you a two-second talk about Set, because it's the one ADT where I never said, "Well, how does it work?" It doesn't add anything new we haven't already done, so I'll just tell you what it does and your picture will be complete. It wants fast searching and fast updating (add and remove), and it also has the high-level operations, such as intersect, union, and difference, as its primary set of things to do.

A bunch of the things we've already seen would be reasonable strategies for implementing Set. You could use some kind of vector or array; most likely you'd keep it in sorted order, because that buys you fast lookup for contains, but add and remove would necessarily be slow because of the contiguous memory. The linked list isn't really a good option here, because contains becomes linear, and add and remove are similarly linear, so it gets you nowhere; it just adds the memory for pointers and gunk like that. The binary search tree is probably a pretty good call: it meshes the advantage of a dynamic structure with the sorted property, so adding, removing, and searching can all be done in logarithmic time if you have balancing. It also enables the high-level ops: for union, intersection, and difference you can walk over both trees in sorted order, which is very easily done with an in-order traversal, and that makes those operations more efficient to implement than having to deal with data coming at you in any old order (see the merge sketch below).

It turns out hashing won't quite do the job easily here, because of the same template situation: you'd have to have a hash function that can hash these keys and do the right thing, whereas the tree only needs the client to give you ordering information, and then you can do all the hard work without pushing hashing onto them. So the way our Set really does work is that it uses a binary search tree, through another layer of abstraction: there's a BST class we've never encountered directly that takes the idea of a binary search tree, adds balancing and all the operations for finding, inserting, and removing, and packages it up. Set then turns out to be one-liners making calls into that BST class; that's where all the heavy lifting goes.
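To see why walking both trees in sorted order helps, here is a minimal sketch under an assumption: each set has been flattened to a sorted vector, which is exactly what an in-order traversal produces. Union then becomes one linear merge pass rather than repeated searches:

    #include <vector>

    // a and b are sorted and duplicate-free (as in-order tree walks yield);
    // the result is their union, built in a single linear pass.
    std::vector<int> setUnion(const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> result;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j]) {
                result.push_back(a[i++]);
            } else if (b[j] < a[i]) {
                result.push_back(b[j++]);
            } else {                        // present in both: keep one copy
                result.push_back(a[i]);
                i++;
                j++;
            }
        }
        while (i < a.size()) result.push_back(a[i++]);  // leftovers from a
        while (j < b.size()) result.push_back(b[j++]);  // leftovers from b
        return result;
    }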
That completes the big picture: our Map uses a hash table; our Set uses the BST class, which is a binary search tree; our Vector uses an array; the Stack and Queue each have two viable strategies, an array or vector-based backing as well as a linked list, both giving good time performance; and then you saw with the PQ yet another structure, the heap, which is a variant of the tree. Those tend to be the really classic structures out of which a lot of things get built, and having the chance to use them both as a client and to dig around as an implementer gives you the big picture of some of the more common ways that cool data structures get built.

So that's all for Friday. We're going to do some more good stuff next week. I'm going to the Terman cafe in a little bit; if you've got some time, please come. Have a good weekend, and work hard on Pathfinder.