Lecture 24 | Programming Abstractions (Stanford)
-
0:00 - 0:12
-
0:12 - 0:15This presentation is delivered by the Stanford Center for Professional
-
0:15 - 0:28Development.
-
0:28 - 0:29
-
0:29 - 0:33What are we doing today? We're gonna talk about hashing. Hashing's one of the coolest things you're ever
-
0:33 - 0:36gonna learn in 106b, so it's a good day to be here and learn something really neat, kinda
-
0:36 - 0:39clever, inventive idea for how you do search and retrieval in a very efficient
-
0:39 - 0:40way.
-
0:40 - 0:44So I'm gonna put us back to where we were a lecture ago, two lectures back to Monday,
-
0:44 - 0:48where we had talked about ways we might implement the map abstraction, right, the key
-
0:48 - 0:49value pairing
-
0:49 - 0:52to allow for quick search and retrieval
-
0:52 - 0:56and we had looked at vector and had just done a little thought experiment about
-
0:56 - 0:59what sorting the vector would buy us, right, and that would give us the
-
0:59 - 1:02ability to do searches faster because we could use binary search,
-
1:02 - 1:07right. But in the case of add, right, we'd still be slow because the
-
1:07 - 1:10need be kinda shuffled or rearranged in the continuous memory
-
1:10 - 1:13and then that's what motivated our introduction to trees there, the binary search tree,
-
1:13 - 1:17which is designed exactly to enable that kind of searching, right, that logarithmic
-
1:17 - 1:20divide and conquer, divide in half and work your way down
-
1:20 - 1:22and then because of the flexibility of the pointer base structure there, right, we
-
1:22 - 1:24can also do the insert
-
1:24 - 1:26in the same logarithmic
-
1:26 - 1:30time because we're no longer constrained by that contiguous memory that the array
-
1:30 - 1:30fact
-
1:30 - 1:32data structures force us to.
-
1:32 - 1:36And so, at that point what we have seen this logarithmic and both get valued at and so
-
1:36 - 1:38on the PQ, right, which you were just finishing,
-
1:38 - 1:42you'll notice that logarithmic is pretty fast. In fact logarithmic for most values of n that you
-
1:42 - 1:44would normally run into in an
-
1:44 - 1:45ordinary use, right,
-
1:45 - 1:49it's un-measurable, right. So those of you who got your heap up and running and
-
1:49 - 1:51correctly working logarithmic and time was discovering that you can put 10,000, 20,000,
-
1:51 - 1:5450,000, 100,000 things in it and just never be able to measure
-
1:54 - 1:58what it took to NQ or DQ from those structures because logarithmic is very,
-
1:58 - 2:00very fast.
-
2:00 - 2:01So that said,
-
2:01 - 2:04that's actually a fine point to say, we have a great math notation, we don't have to look further,
-
2:04 - 2:07but in fact I'm gonna push it a little bit today and we're gonna actually getting it down to where
-
2:07 - 2:10we can do both those operations in constant time, which is to say that
-
2:10 - 2:13no matter how large the table gets, right, we expect
-
2:13 - 2:16we'll be able to insert and retrieve with
-
2:16 - 2:21the exact same time required kind of for a 10,000 table as a million entry
-
2:21 - 2:22table which is pretty neat.
-
2:22 - 2:26To note that as we start to get to these more complicated data structures,
-
2:26 - 2:29one thing we also need to keep in mind, right, is that there are other things
-
2:29 - 2:33of trade offs, right, in adding that complexity of the code that
-
2:33 - 2:36starting with something that's simple and
-
2:36 - 2:39a little slower performer but that's easy to write might actually solve the problem
-
2:39 - 2:42we have at hand. So maybe that we don't really wanna go to one of these fancier
-
2:42 - 2:45things because we don't really need it to be that much faster
-
2:45 - 2:48because the BST, right, adds a bunch of memory, right, so left and right pointers
-
2:48 - 2:53for every cell, which is an 8 byte overhead on top of the entry itself.
-
2:53 - 2:56Now we have to deal with pointers, dynamic memories, all this allocation,
-
2:56 - 2:59deallocation, right, we've got a lot of recursion going on in there
-
2:59 - 3:01and so just opportunities for error
-
3:01 - 3:03and stuff to creep into
-
3:03 - 3:07and if we take the steps to provide tree balancing, which we only kind of grazed
-
3:07 - 3:11the surface of. We just mentioned the need for it. It actually adds quite a lot of
-
3:11 - 3:14complication to the code to do the rotations
-
3:14 - 3:16or the rebalancing
-
3:16 - 3:21that from a simple binary search tree. So in fact the whole package, right, of
-
3:21 - 3:24building a really guaranteed log n performing
-
3:24 - 3:27binary search tree is actually a pretty big investment of human capital
-
3:27 - 3:29that you
-
3:29 - 3:30have to kinda
-
3:30 - 3:33balance against the future hit off of it running faster. I'm
-
3:33 - 3:36gonna show this strategy today that actually is
-
3:36 - 3:37kind of simple to write.
-
3:37 - 3:40It gets us solve one case and doesn't have the degenerate
-
3:40 - 3:43looming that the binary search tree did.
-
3:43 - 3:46So, let me kinda just give you a theoretical concept to think about how
-
3:46 - 3:49you do stuff.
-
3:49 - 3:51You have a big dictionary. You've got, you know, like a
-
3:51 - 3:54many thousand page dictionary, right, and you want to look up a word. You certainly
-
3:54 - 3:57don't do linear search. You're about to look up the word xylophone and you're actually not
-
3:57 - 4:00going to start from the A's and kind of page your way through, you
-
4:00 - 4:03know, 2,000 pages to get to the X's. You
-
4:03 - 4:07don't even ? like we do binary search. You can do binary searches pretty fast, right? It
-
4:07 - 4:10turns out like starting at the M's to get something that's in X, right? You
-
4:10 - 4:11tend to know about where X
-
4:11 - 4:14is and often dictionaries have little tabs
-
4:14 - 4:17along the pages that give you an idea of kinda where the breaks for
-
4:17 - 4:22certain words are. Maybe it says Xa is here and Xe is there or something,
-
4:22 - 4:25that lets you kinda more quickly just illuminate all the surrounding noise
-
4:25 - 4:29in the dictionary, kinda get straight to the right region where the X words are
-
4:29 - 4:30and then you can do
-
4:30 - 4:33a simple search from there, maybe a little bit of a linear or binary search to kind of zero
-
4:33 - 4:37in on the page where the word is you wanna find is.
-
4:37 - 4:38This is the idea
-
4:38 - 4:45behind this concept called a hash table or based on the algorithmic technique of hashing
-
4:45 - 4:46is that
-
4:46 - 4:49rather than try to figure out how to kinda keep this very large collection sorted or organized
-
4:49 - 4:51in a way that you can kind of
-
4:51 - 4:52jump around within it, it says,
-
4:52 - 4:56''Well how about we maintain a bunch of different collections.'' We break up the
-
4:56 - 4:59data set into little tiny pieces
-
4:59 - 5:00and if we can
-
5:00 - 5:02tell, given a key,
-
5:02 - 5:05like which piece of the data set it's supposed to be in and we'll call those pieces
-
5:05 - 5:08buckets. I'm gonna put these in different buckets.
-
5:08 - 5:11So if I had a whole bunch of students I might put all the freshmen in this bucket
-
5:11 - 5:13and the sophomores in this bucket or something like that and then when I look
-
5:13 - 5:16for a student I would say, ''Well which class are you?'' You say, ''Freshmen.'' I'd look in the
-
5:16 - 5:18freshmen bucket.
-
5:18 - 5:19Okay. You know, the idea
-
5:19 - 5:22in a simple case would be like, ''Okay, well I ? it has some
-
5:22 - 5:26criteria by which I can divide you into these buckets that don't overlap and everyone
-
5:26 - 5:28has a unique bucket to go in
-
5:28 - 5:31and if I can make up enough of those categorizations,
-
5:31 - 5:33have as many of those buckets around
-
5:33 - 5:35to divide you up into,
-
5:35 - 5:38then I can hopefully keep the set that lands in that bucket pretty small,
-
5:38 - 5:42so that I don't have a smaller of students. So maybe I did it by the
-
5:42 - 5:45first letter of your last name. I could say, ''Well everybody whose last name begins with W
-
5:45 - 5:47goes here and everybody who's B goes here and so on.'' Well,
-
5:47 - 5:50when somebody comes to me then I don't have to look through the B's.
-
5:50 - 5:51You know, and if you have
-
5:51 - 5:53a class of 200 students
-
5:53 - 5:56and you have 26 letters of the alphabet then you'd think that if it were
-
5:56 - 5:59evenly divided across those letters, it's not going to be to be, it turns out in practice,
-
5:59 - 6:02but as a thought experiment there'd be about 4 people in a bucket and
-
6:02 - 6:06it'd be pretty fast to just look through the bucket and find the one who I was
-
6:06 - 6:08looking for. Okay,
-
6:08 - 6:10so let's take this idea. This idea of, like,
-
6:10 - 6:14there's going to be some strategy for how we divide into buckets and we call
-
6:14 - 6:16that strategy the hash function.
-
6:16 - 6:18That, given a key,
-
6:18 - 6:23and a buckets and we'll just label them with numbers. Zero to the number of buckets
-
6:23 - 6:26minus one. So let's say I have 99 buckets, I've got zero to
-
6:26 - 6:2798
-
6:27 - 6:28that I will take the key
-
6:28 - 6:32and the key in this case is the string key that we're using in the map,
-
6:32 - 6:33and I have a hash function
-
6:33 - 6:37which given a key is input produces a number as output
-
6:37 - 6:39in the range zero to B minus one.
-
6:39 - 6:42And that is what's called the hash value for that key
-
6:42 - 6:45and that tells me which bucket to look at,
-
6:45 - 6:48either find a match for that if I'm trying to do a get value or where to place a
-
6:48 - 6:52new entry if I'm trying to add into the table.
-
6:52 - 6:57So that's the basic idea. I'm
-
6:57 - 7:00gonna talk a little bit about hash functions.
-
7:00 - 7:03In terms of writing a really good hash function is actually pretty complicated
-
7:03 - 7:04and so
-
7:04 - 7:07we're gonna think about it in sort of a ? kind
-
7:07 - 7:10of a big picture perspective and think about what qualities a hash
-
7:10 - 7:13function has to have and then I'm actually not gonna go through
-
7:13 - 7:16proving to you about what mathematically constitutes a good hash function. I'm just gonna give you
-
7:16 - 7:21some intuitive sense of what kinda things matter when designing a hash function.
-
7:21 - 7:23So it's given a key that maps it to a number.
-
7:23 - 7:25Typically it's gonna say, you know,
-
7:25 - 7:27zero to the number of buckets minus one.
-
7:27 - 7:30Often that means that somehow it's gonna compute something and then use
-
7:30 - 7:33the mod operation to bring it back in range.
-
7:33 - 7:35So it's gonna take your number
-
7:35 - 7:37and then say, ''Okay. Well, if you have five buckets to divide it in, I will give you
-
7:37 - 7:39a number zero to four.''
-
7:39 - 7:41It's very important that it be
-
7:41 - 7:44stable, that you put the key in, you get a number out. If you put that same key in
-
7:44 - 7:48later you should get the same number out. So it has to be sort of reliable guaranteed
-
7:48 - 7:52mapping. It can't be random. That
-
7:52 - 7:55said, you want it to almost look as though it's random in terms of
-
7:55 - 7:59how it maps things to buckets, that you don't want it to have a real systematic plan
-
7:59 - 8:02where it puts a whole bunch of things in bucket one and very few things in bucket two
-
8:02 - 8:06or somehow always uses the even buckets and not the odds ones or something like that. So
-
8:06 - 8:08you wanna have a strategy that's gonna take those keys
-
8:08 - 8:11and use your whole bucket range
-
8:11 - 8:12in an equal manner.
-
8:12 - 8:15So that if you put 100 keys in and you have 100 buckets you're hoping
-
8:15 - 8:16that,
-
8:16 - 8:19on average, each bucket will have one thing. That you won't have these clusters
-
8:19 - 8:22where 25 things are in one bucket, ten here and then a whole bunch of buckets
-
8:22 - 8:24are empty. So
-
8:24 - 8:27you want this kinda divided across the range.
-
8:27 - 8:29So given that we have string input,
-
8:29 - 8:32likely what we're gonna be using is the characters in the string as part of
-
8:32 - 8:34the ? how can I take these characters
-
8:34 - 8:40and somehow kinda jungle them up in a way to sort of move them around in these
-
8:40 - 8:43buckets in a way that's reliable so that I get that same string back, I get the
-
8:43 - 8:44same one.
-
8:44 - 8:47But that doesn't actually produce a lot of artifacts of
-
8:47 - 8:50similar strings, let's say mapping to the same place, hopefully spreading them
-
8:50 - 8:52around.
-
8:52 - 8:55So for example, some really simple things you could use, you could just take the first
-
8:55 - 8:56letter. So I
-
8:56 - 8:59talked about this sorta with last names. If I was trying to divide up my students, right, I could divide
-
8:59 - 9:01them by the first letter of their last name.
-
9:01 - 9:05That would use exactly 26 buckets, no more, no less,
-
9:05 - 9:08and if I happen to have 26 buckets
-
9:08 - 9:11then at least there's some chance it would get some coverage across the range.
-
9:11 - 9:14There is likely to be clustering, right, some
-
9:14 - 9:15names are a lot more common
-
9:15 - 9:18or letters are a lot more commonly used than others.
-
9:18 - 9:21So it would tend to produce a little bit of an artifact, but it would be reliable and it would
-
9:21 - 9:24use 26 buckets. If I had 100 buckets,
-
9:24 - 9:26then this would be kind of a crummy function
-
9:26 - 9:28because in fact it would use the first 26 buckets and none of the other ones,
-
9:28 - 9:30right, if it was just taking
-
9:30 - 9:32that letter and deciding. So,
-
9:32 - 9:36you know, in some situations this would get me a little bit of the thing I wanted.
-
9:36 - 9:40Things like using the length of the word. A
-
9:40 - 9:42very, very bad clustering function here,
-
9:42 - 9:44right because there'd be a bunch of strings that are, like, six characters, seven
-
9:44 - 9:46characters, ten characters,
-
9:46 - 9:50but very few that are one or two, and very few that are more than ten, or something.
-
9:50 - 9:53You'd have this very tight little cluster in the middle, right, and would not use a
-
9:53 - 9:57large bucket range effectively at all.
-
9:57 - 10:01More likely what you're gonna try to use is kind of more of the information about the word,
-
10:01 - 10:04the key that you have. Instead of just a single letter or the count of letters, it's
-
10:04 - 10:06trying to use kinda all the data you have at hand,
-
10:06 - 10:08which is to say all the letters,
-
10:08 - 10:12and so one way to do it would be to take the ASCII values for the letters, so A is
-
10:12 - 10:1397, you
-
10:13 - 10:15know, B is 98,
-
10:15 - 10:17is you would take a letter, a word like atoms and you'd add
-
10:17 - 10:1896
-
10:18 - 10:22plus 99 plus 96, you know, add
-
10:22 - 10:23them together
-
10:23 - 10:26and then you'd get some sum of them, 500, 600, whatever it is,
-
10:26 - 10:28depending on how many letters there are
-
10:28 - 10:32and then use mod to back it back into ? let's say your range was, you had 100
-
10:32 - 10:34buckets, right, so you'd want it to be zero to 99, so
-
10:34 - 10:37you'd take that thing and then mod it by 100, then take the last two digits of it. So it'd
-
10:37 - 10:42be, okay, the sum to 524 then it goes into bucket 24.
-
10:42 - 10:47So if you did that, right, and so if act
-
10:47 - 10:50is the word that you have, right, you'd add these codes together and let's say this is
-
10:50 - 10:5397 and then this is 99 and then
-
10:53 - 10:57that's, you know, a hundred and something and so we end up with ? I'm making this
-
10:57 - 10:59number up, right ? 283, then we might
-
10:59 - 11:03mod to the last two digit and we'll say this one goes into bucket 83.
-
11:03 - 11:07And if we have, you know, dog,
-
11:07 - 11:11maybe that adds to 267 and we would
-
11:11 - 11:13mod that down to
-
11:13 - 11:1767. So this actually tries to use, kinda, all the information at hand, all the letters
-
11:17 - 11:18that we have
-
11:18 - 11:19and bring them down,
-
11:19 - 11:22and so if you figure that across the space of all the numbers that are being added here
-
11:22 - 11:25you are just as likely to get 210 as you are to get 299, you know
-
11:25 - 11:28that any of the subsequent numbers in between,
-
11:28 - 11:32then hopefully you'd get kind of a spread across your buckets.
-
11:32 - 11:36There is one thing this clusters on. Does anybody see what words cluster in this
-
11:36 - 11:40system?
-
11:40 - 11:41That you'd think of as
-
11:41 - 11:47maybe not being related, but would land in the same bucket? To
-
11:47 - 11:49rearrange some numbers. Yes. Any
-
11:49 - 11:51anagram, right, so
-
11:51 - 11:52you
-
11:52 - 11:54take cat which is just a ranagram ?
-
11:54 - 11:57anagram of act, right, it's gonna add to the same thing, right,
-
11:57 - 12:00and so it would be a little bit of clustering in that respect.
-
12:00 - 12:02So anagrams come down to the same place.
-
12:02 - 12:05And so the words that you expect to, kinda like, act and cat to
-
12:05 - 12:09you seem like different words, they are gonna collide in this system and that might be a little
-
12:09 - 12:11bit of something we worry about. So one way you can actually kind of avoid that in the
-
12:11 - 12:12way that
-
12:12 - 12:15the hash function we're going to eventually use will do, is
-
12:15 - 12:16it'll do this thing where
-
12:16 - 12:21not only does it add the letters together, but it also multiplies the letter by a
-
12:21 - 12:22very large prime number
-
12:22 - 12:25that helps to kinda say, ''Well the large prime number
-
12:25 - 12:26times C plus
-
12:26 - 12:30the large prime number squared times A plus the large prime number cubed times C and
-
12:30 - 12:33then it just allows the, each position of the letter actually is also then
-
12:33 - 12:35encoded in the number that comes out
-
12:35 - 12:37and so something that's positionally different
-
12:37 - 12:40won't necessarily land in the same bucket
-
12:40 - 12:43without some other factors kinda coming into play. So a little
-
12:43 - 12:49bit of trying to just take the information and jumble it up a little bit.
-
12:49 - 12:51So let me show you a little bit about how this looks. Actually I'm gonna tell you about
-
12:51 - 12:54the collisions and then I'll show you the data structure.
-
12:54 - 12:58So certainly I said there's gonna be this problem where things collide,
-
12:58 - 13:01that 83 and 83 in this simple system of adding the codes will
-
13:01 - 13:03come out the same.
-
13:03 - 13:06So it can't be that each of these buckets holds exactly one thing.
-
13:06 - 13:07The hashing,
-
13:07 - 13:08right, although we may
-
13:08 - 13:11try to get those buckets as small as possible we have to account for the
-
13:11 - 13:14possibility that two things will collide, that their hash function, even though
-
13:14 - 13:16they were different keys,
-
13:16 - 13:18will land in the same bucket
-
13:18 - 13:20based on how I've jumbled it up
-
13:20 - 13:23and so we're trying to avoid that, but when it does happen we also have to have a
-
13:23 - 13:25strategy for it. So we have to have a way of saying, ''Well
-
13:25 - 13:28there can be more than one thing in a bucket.'' Or some other way of when there's
-
13:28 - 13:33a collision, deciding where to put the one that was placed in the table second.
-
13:33 - 13:36And in the strategy, we're gonna use the one called chaning,
-
13:36 - 13:39and the chaning basically says, ''Well if once we have something in a bucket like ACT,
-
13:39 - 13:43if we hash something and it comes to the same bucket, we'll just tack it onto that bucket,
-
13:43 - 13:45in this case using a link list.''
-
13:45 - 13:48So each of these buckets is actually gonna be a link list of
-
13:48 - 13:49the entries
-
13:49 - 13:52and so the entire table will be a
-
13:52 - 13:54vector or array of those.
-
13:54 - 13:56So let me show you,
-
13:56 - 13:59a little
-
13:59 - 14:05live action.
-
14:05 - 14:08So this is a hash table that has seven buckets.
-
14:08 - 14:08In
-
14:08 - 14:11this case number 0 through 6 are assigned to each one.
-
14:11 - 14:14They all start empty, and so these are all empty link lists illustrated by the
-
14:14 - 14:15null,
-
14:15 - 14:17and then the hash function
-
14:17 - 14:20is shown here like a black box. You don't actually know how the workings of hash
-
14:20 - 14:23function is, but you know what it's job is to do, which is you give it a key and it gives you
-
14:23 - 14:27a number in the range 0 to 6, and that same key when passed, and again gives you
-
14:27 - 14:28the same number.
-
14:28 - 14:31So if I say 106B
-
14:31 - 14:34equals Julie, it pushes 106B through the hash function and says,
-
14:34 - 14:35''Okay, well 106B
-
14:35 - 14:37comes out as one.''
-
14:37 - 14:40You add those numbers together, jumble them up, the letters, and you get one.
-
14:40 - 14:43And so it went into the table, and it said, ''Okay, well that's bucket number one, make a new
-
14:43 - 14:46cell for 106B Julie and stick it in the table.''
-
14:46 - 14:48So now there's a
-
14:48 - 14:51non empty chain in one place, the rest are empty.
-
14:51 - 14:53If I say 107
-
14:53 - 14:55is being taught by
-
14:55 - 14:57Jerry, 107 happens to get the same hash function.
-
14:57 - 15:00107, 106B, different letters but it just happen to come out that way,
-
15:00 - 15:01hash to one
-
15:01 - 15:04and so it went ahead in this case just tacked it on to the front of the list.
-
15:04 - 15:05If I do
-
15:05 - 15:08106A
-
15:08 - 15:10being taught by Patrick,
-
15:10 - 15:11106A
-
15:11 - 15:14happens to sum to zero in the hash function.
-
15:14 - 15:17That's all we know about the black box that ends up loading it up into the zero's
-
15:17 - 15:17slot.
-
15:17 - 15:21I say, well wait, it's being taught by
-
15:21 - 15:25Nick, it also come out to be zero and so
-
15:25 - 15:27it ends up kinda pushing on the front of that. So I go 103
-
15:27 - 15:30is being taught by Maron and it's in
-
15:30 - 15:33bucket five and so as I kinda type things in I'm starting to see that things are just kinda
-
15:33 - 15:35going around in difference place in the table.
-
15:35 - 15:38A little bit of clustering in the top, a little bit of empty at the bottom, but as I continue
-
15:38 - 15:42to type in strings for other course numbers, I'd hopefully kinda fill out the table
-
15:42 - 15:43to where there was kind of an even,
-
15:43 - 15:46roughly even spread across that.
-
15:46 - 15:47When I want to look something up
-
15:47 - 15:51because I wanna see who's teaching 106B, then
-
15:51 - 15:54it will use the same process in reverse. It asks the hash function,
-
15:54 - 15:56''Well which bucket should I look in?'' And it
-
15:56 - 15:59says, ''Oh, you should look in bucket one because 106B's hash code is one.'' So
-
15:59 - 16:01all the other buckets are eliminated
-
16:01 - 16:04from consideration, so very quick kind of
-
16:04 - 16:07focused down to the right place and then it actually just does a link list reversal to
-
16:07 - 16:10find 106B in the bucket
-
16:10 - 16:11where it has to be,
-
16:11 - 16:13and so
-
16:13 - 16:16the key to realizing what makes this so magical, right,
-
16:16 - 16:19is that the choice of the number of buckets
-
16:19 - 16:21is up to me.
-
16:21 - 16:24I can pick the number of buckets to be whatever I want.
-
16:24 - 16:27I can pick it to be five, I can pick it to be 10, I can be to be
-
16:27 - 16:291,000, I can pick it to be a million. If
-
16:29 - 16:34I choose the number of buckets to be kind of roughly in the same order of magnitude
-
16:34 - 16:35as the number of entries. So if I
-
16:35 - 16:38wanna put a thousand things in my table,
-
16:38 - 16:40if I make a thousand buckets to hold them in
-
16:40 - 16:43and my hash function can divide stuff
-
16:43 - 16:45pretty equally across those thousand buckets,
-
16:45 - 16:47then I will expect that each of those link lists is about length one.
-
16:47 - 16:52Some will be two, some will be zero, but if it's doing its job right then I have a
-
16:52 - 16:56bunch of little tiny buckets, each of which holds a tiny fraction of
-
16:56 - 16:58the data set, in this case one or two,
-
16:58 - 17:00and it's very easy to search through them.
-
17:00 - 17:05If I know I'm gonna put a million things in, then I make a million buckets, and then
-
17:05 - 17:08I divide one out of a million things, I run it through the hash function, it gives me
-
17:08 - 17:09this number
-
17:09 - 17:12that says, ''Oh, it's 862,444.'' I go to
-
17:12 - 17:16that bucket and the item is either there in that bucket or I have to look a little
-
17:16 - 17:18bit to find it,
-
17:18 - 17:20but there's no other place in the table I have to look. So the fact that there's
-
17:20 - 17:21a
-
17:21 - 17:24million other entries kinda near by in these other buckets is completely not
-
17:24 - 17:26important to us, right, and so
-
17:26 - 17:28the searching and
-
17:28 - 17:32updating, adding, and replacing, and removing, and stuff all happens at the bucket level
-
17:32 - 17:35and by choosing the buckets to be
-
17:35 - 17:36
-
17:36 - 17:39the number, total number are gonna be roughly equal to the size of the things, then
-
17:39 - 17:40I have this
-
17:40 - 17:43constant time access to
-
17:43 - 17:50the tiny part of the subset of the set that matters. That's pretty cool. Let's write some
-
17:50 - 17:51code. It's kinda
-
17:51 - 17:53good to see the whole thing
-
17:53 - 18:00doing what it needs to do. So I'm gonna go ahead and go back over here.
-
18:00 - 18:02All right, so I've got my
-
18:02 - 18:05map which has add and get value and
-
18:05 - 18:07not much else there, right,
-
18:07 - 18:12but kinda just set up and ready for us to go. What I'm gonna build here
-
18:12 - 18:13is a
-
18:13 - 18:15link list cell
-
18:15 - 18:20that's gonna hold a string key, a val type value, and
-
18:20 - 18:24a pointer to the next, and
-
18:24 - 18:27so I plan on having each one of these cells, right, for each entry that goes in the
-
18:27 - 18:27table
-
18:27 - 18:31and then I'm going to have an array of
-
18:31 - 18:32these, and
-
18:32 - 18:35I'm gonna make a constant for it, and then I'll talk a little bit about the
-
18:35 - 18:42consequences of that. So I'm gonna make a constant that is ? so
-
18:42 - 18:45I make it 99. So then I have 99 buckets to start with. That's not
-
18:45 - 18:49gonna solve all my problems, it's gonna show that it's kinda tuned for maps that
-
18:49 - 18:51expect to hold about a hundred things, but
-
18:51 - 18:56? and so I have an array in this case, so if you look at deciphering this
-
18:56 - 18:58declaration is a little bit complicated here.
-
18:58 - 19:00It says that buckets is the name of the variable,
-
19:00 - 19:06it is an array of num buckets length, so a fixed size, 99
-
19:06 - 19:07entry array here.
-
19:07 - 19:11Each of those entries is a pointer to a cell, so ready to be the
-
19:11 - 19:16head pointer of a link list for us.
-
19:16 - 19:19I'm gonna add a ?
-
19:19 - 19:23well, we'll get there in a second. I'm gonna go over here and work on the constructor first.
-
19:23 - 19:25So
-
19:25 - 19:30the first thing I'm gonna do is make sure that those pointers have good values.
-
19:30 - 19:32If I do not do this, right, they will be junk
-
19:32 - 19:36and there will be badness that will happen, so I can make sure that each of them starts off as an
-
19:36 - 19:37empty list,
-
19:37 - 19:42and then correspondently, right, I would need to be doing a delete all list
-
19:42 - 19:43cells here, but
-
19:43 - 19:48I've been lazy about doing that and I'm gonna continue to be lazy just knowing that I'd have to iterate through and then
-
19:48 - 19:52delete all the cells in each of those lists that are there.
-
19:52 - 19:54And then for add and get value
-
19:54 - 19:56what I'm gonna ? else we need to do
-
19:56 - 19:57is figure out
-
19:57 - 19:58in both cases, right, which
-
19:58 - 20:00is the appropriate list to operate on.
-
20:00 - 20:03So running the hash function on my key,
-
20:03 - 20:08seeing where it tells me to look, and then either searching it to find the match
-
20:08 - 20:09for get value
-
20:09 - 20:13to return the matching value or in the add case, to search for the matching
-
20:13 - 20:16entry to overwrite if it exists and if not, then to create a new cell and tack it on the
-
20:16 - 20:19front of the list. So
-
20:19 - 20:21let's write a
-
20:21 - 20:25helper function because they're both gonna do the same thing. I'm gonna need
-
20:25 - 20:28the same thing twice. So I'm gonna put a hash function in here
-
20:28 - 20:31that, given a key and a number of buckets, is gonna
-
20:31 - 20:35return to me the right bucket to look in,
-
20:35 - 20:39and then I'm gonna write a find cell function that, given a key and
-
20:39 - 20:39the pointer to a list,
-
20:39 - 20:43will find the cell that matches that key or return a null if it didn't find
-
20:43 - 20:52it. Okay, let me put my hash
-
20:52 - 21:03function in. I have a good hash function down there in a minute that I'm gonna pull out, but I just ?
-
21:03 - 21:05so what I'm gonna do, I'm
-
21:05 - 21:07gonna make an interesting choice here
-
21:07 - 21:12and I'm gonna have my hash function just return zero, which
-
21:12 - 21:13seems pretty surprising.
-
21:13 - 21:16But I'm gonna talk about why that is. It turns out, when I'm first building it that
-
21:16 - 21:19actually is an easy way to test that my code is working
-
21:19 - 21:22and to not be dependent on the behavior of the hash function. At this point it says, ''We'll
-
21:22 - 21:23just put
-
21:23 - 21:25everything in the zero's bucket.''
-
21:25 - 21:27Now that's not a good strategy for my long term efficiency,
-
21:27 - 21:29but it is a good way to test that
-
21:29 - 21:33my handling of the bucket, and the searching, and the finding, and
-
21:33 - 21:36inserting is correct, and the performance I should expect to be abysmal on this, right, it should be
-
21:36 - 21:37totally linear
-
21:37 - 21:39in terms of the number of elements to add or to get them because they'll all
-
21:39 - 21:41be in one link list
-
21:41 - 21:45with no easy access to them. And then eventually I can change this to divide across
-
21:45 - 21:48the buckets and get better spread, but the correctness of the hash table
-
21:48 - 21:52isn't affected by this choice. It's kind of an interesting way to think about designing
-
21:52 - 21:54the code.
-
21:54 - 22:01The other helper I have down here is gonna be my find cell that, given the
-
22:01 - 22:04head pointer to a list and the key,
-
22:04 - 22:14is gonna go find it, and it's basically just
-
22:14 - 22:17if this ker cells key equals the key,
-
22:17 - 22:20then we return it and then if we've gone through the whole list and haven't found it, we return
-
22:20 - 22:22a null.
-
22:22 - 22:25So this guy is returning a cell
-
22:25 - 22:27T star, right,
-
22:27 - 22:30as the matching cell within the thing, and this one is gonna provoke
-
22:30 - 22:31that
-
22:31 - 22:35little C++ compilation snafu that we encountered on the binary search
-
22:35 - 22:36tree, as well,
-
22:36 - 22:40which is that cell T, right, is a type that's declared in the private
-
22:40 - 22:43section within the scope of my map. So
-
22:43 - 22:46outside of the scope, which to say, before I've crossed the my map
-
22:46 - 22:51angle bracket val type colon, colon, it doesn't actually believe we're in my map scope
-
22:51 - 22:54yet. It hasn't seen that and it can't anticipate it, so I have to actually give
-
22:54 - 22:56the full name
-
22:56 - 22:58with the scope specifier on it,
-
22:58 - 23:03and then because of this use of the template name outside of its class
-
23:03 - 23:09is a special place for the C++ compiler, the type name keyword also
-
23:09 - 23:12needs to go there. Okay, so there's the heavy lifting of these two things, right, getting it to hash to the
-
23:12 - 23:14right place, which I kinda
-
23:14 - 23:15short changed,
-
23:15 - 23:18and then searching to find a cell. If
-
23:18 - 23:23I go back up here to get value,
-
23:23 - 23:25and I call the hash function,
-
23:25 - 23:31given my key and the number of buckets to find out which bucket to work on,
-
23:31 - 23:35I go looking for a match by calling find cell
-
23:35 - 23:40of the key in buckets of which bucket,
-
23:40 - 23:43and then if match does not equal null then
-
23:43 - 23:45I can return match as val and
-
23:45 - 23:46then
-
23:46 - 23:47otherwise
-
23:47 - 23:52we say, ''No such key found.'' So that's
-
23:52 - 23:58the error case, right, for get value was that if it didn't find one, its behavior is to
-
23:58 - 23:59raise an error
-
23:59 - 24:01so that someone knows that they've
-
24:01 - 24:05asked something it can't deal
-
24:05 - 24:09with. Okay, so the key thing here usually is that the hash is some sort of constant operation that
-
24:09 - 24:12just jumbles the key up, in this case it just returned zero, but could actually look at
-
24:12 - 24:14the elements of the key,
-
24:14 - 24:16and then it does its reversal down that link list,
-
24:16 - 24:20which given a correct working hash function should have divided them up in
-
24:20 - 24:20these
-
24:20 - 24:23little tiny chains, then we can just find the one
-
24:23 - 24:26quickly by looking through that link list. If we find it, right, we've got it,
-
24:26 - 24:28otherwise, right, we're in
-
24:28 - 24:31the error case.
-
24:31 - 24:37Add looks almost exactly the same. I
-
24:37 - 24:40start by getting the bucket, doing the find cell,
-
24:40 - 24:41if it's not ?
-
24:41 - 24:43if it did find it, right,
-
24:43 - 24:46then we overwrite,
-
24:46 - 24:49and then in the second case, right, where we didn't find it, right, then we need to
-
24:49 - 24:55make ourselves a new cell, and
-
24:55 - 24:59copy in the key, copy in the
-
24:59 - 25:01value,
-
25:01 - 25:05set it ? in this case, the easiest place, right, to add something to a link list is I should just do append
-
25:05 - 25:08it to the front, right, no need to make it hard on ourselves, right, we're not
-
25:08 - 25:11keeping them in any particular order, whatsoever, right, they're base on kind of order
-
25:11 - 25:12of entry.
-
25:12 - 25:17So if somebody gives us a new one, we might as well just stick it in the front, that'd make our job easy. So
-
25:17 - 25:18it will
-
25:18 - 25:20point to the
-
25:20 - 25:23current front pointer of the cell,
-
25:23 - 25:26and then we will update the
-
25:26 - 25:34bucket pointer to point to this guy. So I think
-
25:34 - 25:35we are
-
25:35 - 25:37in okay shape here. Let me
-
25:37 - 25:40take a look and make sure I like what I did, right,
-
25:40 - 25:44so hashing, finding the match, if it found a non empty, so ? this is the
-
25:44 - 25:46overwrite case of finding an existing cell, we just overwrite with that. Otherwise,
-
25:46 - 25:50making a new cell and then attaching it to be the
-
25:50 - 25:50front
-
25:50 - 25:52of the bucket there.
-
25:52 - 25:55So for example, in the case where the bucket's totally empty, it's good to kinda trace
-
25:55 - 25:59that. That's the most common case when we start, right, if we hash to bucket zero
-
25:59 - 26:01and we do a search of a null link list, we should get a null out,
-
26:01 - 26:03so then we create a new cell, right,
-
26:03 - 26:06and we set the next field to be what the current bucket's head pointer was, which
-
26:06 - 26:10was null. So there should be a single to list and then we update the bucket
-
26:10 - 26:12to point to this guy. So then we should have a
-
26:12 - 26:14single list cell with null trailing it
-
26:14 - 26:18or inserting into the empty bucket case.
-
26:18 - 26:20I like it. Let's see if I have a little code over
-
26:20 - 26:24here that adds
-
26:24 - 26:27some things, car, car, cat, with some numbers. Okay, and that tries to
-
26:27 - 26:31retrieve the value of car, which it wrote a three and then it wrote a four on top
-
26:31 - 26:39of it, so hopefully the answer we should get out of this is four. Okay, if I ? be
-
26:39 - 26:41interesting to do something, like ask for something that we know is not in the
-
26:41 - 26:45table, right, to see that our error case gets handled correctly,
-
26:45 - 26:46the switch
-
26:46 - 26:47key's
-
26:47 - 26:50found. And if I were to continue to do this, I could do a little bit of stress testing to see how
-
26:50 - 26:55that zero is causing us some performance implications, but if I sat there and just put tons and
-
26:55 - 26:56tons of things in there, right,
-
26:56 - 27:00made up strings like crazy and stuffed them in that with numbers, right, I would just be creating one
-
27:00 - 27:04long chain off the zero bucket, ignoring all the other ones, right,
-
27:04 - 27:08but then it's kinda stress testing my link list handling about the adding and
-
27:08 - 27:09searching that list, but
-
27:09 - 27:12would also show us that we expect that as we got to be, have 100 entries, 200
-
27:12 - 27:16entries, 300 entries that subsequent adds and get values, right, would
-
27:16 - 27:21show a linear increase in performance because of the
-
27:21 - 27:27cluster that we got going there.
-
27:27 - 27:30So let me give you a real hash function. I'll talk
-
27:30 - 27:37you through what it's doing and then I'll
-
27:37 - 27:39leave you with the thought of,
-
27:39 - 27:41that writing one of these actually correctly and reliably is
-
27:41 - 27:44actually
-
27:44 - 27:47more of an advance exercise than we're gonna look at, but this is actually a hash
-
27:47 - 27:52function that's taken from Don Knuth's Art of Computer Science,
-
27:52 - 27:55some of the work that guides a lot of computer scientists still to this day,
-
27:55 - 27:59and it has the strategy I basically told you about of,
-
27:59 - 28:02for all the characters that are in the string,
-
28:02 - 28:06right, adding it into the thing, but having multiplied,
-
28:06 - 28:10right, the previous hash code that's being built up by this large multiplier
-
28:10 - 28:12which is in effect kinda taking
-
28:12 - 28:14advantage of the positional information.
-
28:14 - 28:17And so the multiplier happens to be a very large negative number
-
28:17 - 28:19that's a prime that says, okay,
-
28:19 - 28:20the first time through, right,
-
28:20 - 28:23it'll take the hash code of zero and multiply it by and add the first character. So the first
-
28:23 - 28:26character gets added in without any modification.
-
28:26 - 28:29The next character, though, will take the previous one and multiply it by
-
28:29 - 28:31one power of the multiplier
-
28:31 - 28:33and then add in the next character,
-
28:33 - 28:36and then that sum, right, on the sub [inaudible] gets multiplied again by the multiplier, so
-
28:36 - 28:39raising it to the squared, and the third, and to the fourth powers, it works down the
-
28:39 - 28:40string. The
-
28:40 - 28:43effectiveness actually is that it's gonna do a lot of overflow.
-
28:43 - 28:45That's a very large number
-
28:45 - 28:47and it's adding
-
28:47 - 28:51and multiplying in this very large space, so in effect we're actually counting on the
-
28:51 - 28:54fact that the addition and multiplication
-
28:54 - 28:56that is built into C++
-
28:56 - 28:58doesn't raise any errors when the numbers get so large that they can't be represented. It
-
28:58 - 29:03just kind of truncates back down to what fits, and so it kind of wraps around using a modular
-
29:03 - 29:06strategy here, like a ring. And so we'll actually end up kind of making a very big number out
-
29:06 - 29:10of this. The longer the string is, the more those multiplies, the more those powers raise. We've got
-
29:10 - 29:13this huge number, what we do is kinda truncating it back to what fits in to a long
-
29:13 - 29:16and then when we're done, we take whatever
-
29:16 - 29:18number fell out of this process
-
29:18 - 29:21and we mod it by the number of buckets to give us a number from zero to buckets minus
-
29:21 - 29:22one.
-
29:22 - 29:26And so whatever gets stuffed into here, right, it's reliable in the sense that it's,
-
29:26 - 29:28every time you put in the number you'll get the same thing back.
-
29:28 - 29:33We know it will be in the range end buckets because of that last mod
-
29:33 - 29:34call
-
29:34 - 29:36that worked it out for us
-
29:36 - 29:37and we
-
29:37 - 29:40then use that to say which bucket to go look in when we're ready
-
29:40 - 29:41to do our work.
-
29:41 - 29:44So if I switch that in for my old hash
-
29:44 - 29:47function, I shouldn't actually see any
-
29:47 - 29:47change
-
29:47 - 29:48from the outside,
-
29:48 - 29:51other than the performance, right, should suddenly actually be a lot faster, which is
-
29:51 - 29:55hard to tell when I have three things in there, but ? I'm gonna
-
29:55 - 30:01take out that error case. In this case car
-
30:01 - 30:03and cat now
-
30:03 - 30:06are probably not very likely to be going to the same bucket. In fact I can
-
30:06 - 30:09guarantee they don't go in the same bucket, whereas
-
30:09 - 30:16before they were kind of overlapping.
-
30:16 - 30:18I'll leave that code there for a second, although
-
30:18 - 30:19really I would say that point of
-
30:19 - 30:22hashing really is not to get so worried about how it is that the hash code does it's
-
30:22 - 30:25work, but to understand the general idea of what a hashing function has to do,
-
30:25 - 30:29right, what its role is and how it relates to getting this constant time
-
30:29 - 30:30performance.
-
30:30 - 30:33So I'll give you an example of hashing in the real world, I think is really interesting
-
30:33 - 30:37to see it actually happen and people not even realizing how hashing is so useful.
-
30:37 - 30:39So I ordered something from REI, which is
-
30:39 - 30:41this camping store and
-
30:41 - 30:42it ?
-
30:42 - 30:43they didn't have it in stock, so they had to
-
30:43 - 30:46place an order for it. This is actually kinda awhile ago, but I ?
-
30:46 - 30:50so then when I called to see if it's in, I call them on the phone, I say, ''Oh, is my order in?'' And they don't
-
30:50 - 30:52ask me my name. I thought this was very funny. Like,
-
30:52 - 30:55''Oh, it's ? I'm Julie. I wanna know about my order.'' They're like,
-
30:55 - 30:59''Okay, what are the last two digits of your phone number?'' And I'm
-
30:59 - 31:02like ? you know, it's not one of those things you can answer right off the top of your head. Okay, 21. Turns out last
-
31:02 - 31:04two digits of my phone number are 21.
-
31:04 - 31:06And they say ? and they go and they look in the 21 basket and then they say, ''What's your
-
31:06 - 31:09name?'' Now they want to know my name. I'm like, ''Okay, my name's Julie.'' Okay, they look it up and they're
-
31:09 - 31:11like, ''Okay, it's here.'' I'm like,
-
31:11 - 31:12''Okay.'' So then I go to get it, like
-
31:12 - 31:13a couple days later, and I go to the store
-
31:13 - 31:15and I'm standing in the line waiting for them,
-
31:15 - 31:17and I look up on the wall
-
31:17 - 31:20and they have
-
31:20 - 31:22100 baskets, little
-
31:22 - 31:24wire baskets up on the wall
-
31:24 - 31:27and they're labeled 0 through 9 over here
-
31:27 - 31:29and 0 through 9 over there,
-
31:29 - 31:33and when I come in they ask me the same question. ''What's the last two digits of your phone number?''
-
31:33 - 31:36Now I'm all prepared because I've already been primed on this question. I say, ''21.'' Then
-
31:36 - 31:38they walk over here, right,
-
31:38 - 31:40to the 21 basket here.
-
31:40 - 31:42It's like the top digit in this digit,
-
31:42 - 31:45and there's only one piece of paper in there, and they pull out my order thing,
-
31:45 - 31:48and then they get me my tent and everybody's happy.
-
31:48 - 31:52So while I'm there, I try to engage the cashier on how this is an example of a real world
-
31:52 - 31:55hashing system
-
31:55 - 31:58and I still got my tent, I'll have you know, but it was close,
-
31:58 - 32:02right. They're like, ''Okay, yeah, you crazy crackpot. You can now leave now.''
-
32:02 - 32:06I was like looking at it and them and I tried to talk to them about the number of digits because in some sets, right, this is a
-
32:06 - 32:10very good example of what the investment in the number of buckets is versus what
-
32:10 - 32:12tradeoff it gives you, right. Because there's this very physical
-
32:12 - 32:13sort of set up of this, it's like
-
32:13 - 32:17by choosing a larger number of buckets, right, you can more fine grain your
-
32:17 - 32:18access,
-
32:18 - 32:22but that investment in buckets means you're kind of permanently installing that story. So
-
32:22 - 32:25in this case, right, they didn't use just the last digit of my phone number. They could've done that and
-
32:25 - 32:28they would have only had ten buckets on the wall, but then they would expect,
-
32:28 - 32:30you know, ten times as many things in each bucket.
-
32:30 - 32:33And apparently their
-
32:33 - 32:37estimate was the number of active orders in this backorder category was enough
-
32:37 - 32:38to
-
32:38 - 32:41warrant being more than ten, you know, significantly more than ten,
-
32:41 - 32:45but not so much more than 100, then in fact 100 buckets was the right
-
32:45 - 32:46investment to make in their bucket strategy.
-
32:46 - 32:49They didn't put three buckets on the wall because, you know, like what are they
-
32:49 - 32:52gonna do, have this big 3D sort of thing. They didn't enjoy this
-
32:52 - 32:56discussion, I went on for hours with them and they were, like, really not amused. But it
-
32:56 - 32:59had, you know, the three digits, then you'd get this even more likelihood that each
-
32:59 - 33:03bucket would be empty, but ? or have a very small number of things, but given
-
33:03 - 33:07their set up, that seemed to be the right strategy was to say 100 buckets and now ? they didn't ask me
-
33:07 - 33:09the
-
33:09 - 33:11first two digits of my phone number.
-
33:11 - 33:14They asked me the last two. Why does that
-
33:14 - 33:18matter? Because you're using all [inaudible]. Yeah. It's hated a
-
33:18 - 33:21lot. If you ask, like Stanford students, right, like especially
-
33:21 - 33:25when, you know, like campus numbers used to all be 49's or 72's or something, so like, if
-
33:25 - 33:27you use the first two digits, right, you'd have everybody's in the 49 bucket or
-
33:27 - 33:29the 72 bucket or
-
33:29 - 33:32something, and a whole bunch of other ones not used. An example, even the first two digits are never
-
33:32 - 33:3500, like there's a whole bunch of buckets that don't even get used as the first two digits of a
-
33:35 - 33:38phone number. Phone numbers never start with zeros.
-
33:38 - 33:42That they rarely actually have zeros or ones in them at all. They try to use those for the
-
33:42 - 33:44area code and stuff in the leading digits. So I
-
33:44 - 33:45thought it was just a really
-
33:45 - 33:48interesting exercise and like, oh yeah, they exactly had kinda thought through hashing. Of
-
33:48 - 33:52course, they did not think it was hashing, and they thought I was a crackpot, and they didn't want to give me
-
33:52 - 33:56my stuff, but I paid my money and they [inaudible],
-
33:56 - 34:00but it kinda shows you, okay what you're trying to do here is try to make that number of
-
34:00 - 34:00buckets,
-
34:00 - 34:03right, roughly equal to the number of elements
-
34:03 - 34:04so that
-
34:04 - 34:09if your hash function's dividing up across your whole world,
-
34:09 - 34:11you've got
-
34:11 - 34:14constant access to what
-
34:14 - 34:17you have. Why is there an L at the end of one item? You know, that's actually just a ? C++ isn't for this. This is a long value
-
34:17 - 34:21that without it, it assumes it's an int and that number is to big to fit in an int and it
-
34:21 - 34:25is twice as big as a long in terms of space and so the L needs to be there, otherwise it
-
34:25 - 34:26will
-
34:26 - 34:30try to read it as an int and through away part of the information I was trying to give it. So it's the
-
34:30 - 34:31way
-
34:31 - 34:35of identifying it along constant.
-
34:35 - 34:37So let me talk about this just a little bit more
-
34:37 - 34:39in terms of
-
34:39 - 34:41
-
34:41 - 34:43actual performance, right, we've got N
-
34:43 - 34:46over B, right, so if the division is complete, right, the number of entries over
-
34:46 - 34:48the number of buckets,
-
34:48 - 34:48
-
34:48 - 34:51if we make them kind of on the same order of
-
34:51 - 34:52magnitude, kind of in the same range,
-
34:52 - 34:55so having to sum this could be to make a little bit of an estimated guess about
-
34:55 - 34:58where we're going,
-
34:58 - 35:01and then there are some choices here in how we store each bucket, right. We could use
-
35:01 - 35:05a little array, we could use a link list, right, we could use a vector.
-
35:05 - 35:07We expect this to be very small, is the truth.
-
35:07 - 35:09We're hoping that it's gonna be one or two,
-
35:09 - 35:13and so there's not a lot of reason to buy any real expensive structure or even bother
-
35:13 - 35:17doing things like sorting. Like you could sort them, but like if you expect there to be two
-
35:17 - 35:19things, what's the point? You're gonna spend more time
-
35:19 - 35:21figuring out whether to put it in the front or the back
-
35:21 - 35:25then you would gain at all by being able to
-
35:25 - 35:27find it a little faster.
-
35:27 - 35:28So the link list
-
35:28 - 35:32is just gonna be easiest way to get kind of allocated without over capacity,
-
35:32 - 35:35excess capacity that's unused in the bucket. You know it's gonna be small, use the
-
35:35 - 35:37simple strategy.
-
35:37 - 35:41It's also, though, an important thing that the hash type is likely to do in kind of a
-
35:41 - 35:44industrial strength situation is it does have to deal with this idea, well what if that number of
-
35:44 - 35:47buckets that you predicted was wrong?
-
35:47 - 35:51And so the map as we have given it to you, right, has to take this sponge because
-
35:51 - 35:54it doesn't know in advance how many things are you gonna put in.
-
35:54 - 35:57The only way to know that is to wait and see as things get added,
-
35:57 - 35:59and then realize your buckets are getting full.
-
35:59 - 36:02So typically the industrial strength version of the hash table is gonna
-
36:02 - 36:06track what's called the load factor. And the load factor is just the number of total
-
36:06 - 36:08entries divided by the number of buckets.
-
36:08 - 36:09When that
-
36:09 - 36:13factor gets high, and high actually is actually quite small, in fact. If
-
36:13 - 36:15it's two, is often the trigger for
-
36:15 - 36:19causing a rehash situation, so
-
36:19 - 36:22not even letting it get to be three or four. Just saying as soon as you realize that
-
36:22 - 36:22you have
-
36:22 - 36:26exceeded by double the capacity you intended, you planned for 100 things,
-
36:26 - 36:27you got 200 things,
-
36:27 - 36:28go ahead
-
36:28 - 36:31and redo your whole bucket array.
-
36:31 - 36:39So in the case, for example, of our simple like 0 through 7 case, right, maybe
-
36:39 - 36:43it's 0 through 6. One, two, three,
-
36:43 - 36:45four, so I have one that had six buckets, right,
-
36:45 - 36:48and then I've gotten to where
-
36:48 - 36:50each of them has
-
36:50 - 36:54two or maybe three in some and one in another.
-
36:54 - 36:57The plan is to just take this whole bucket array and
-
36:57 - 37:01enlarge it by a factor of two, so grow it, and in this case, from 6 to 12,
-
37:01 - 37:02and then rehash
-
37:02 - 37:04to move everybody. The
-
37:04 - 37:08idea is that the earlier decision about where to place them was based on knowing that
-
37:08 - 37:11there was exactly six buckets and so where it was placed before
-
37:11 - 37:15is not likely to be the place it will place if you have, you know,
-
37:15 - 37:1812 choices to lay them down into.
-
37:18 - 37:21And ideal ? in fact you don't even want it to be that case. Like, if they all land into the
-
37:21 - 37:23same bucket, right, you would have
-
37:23 - 37:25the same clustering and then a big empty clump,
-
37:25 - 37:28and so you would rehash everything, run it through the new hash function having
-
37:28 - 37:31changed the number of buckets, and say, ''Well, now where does it go in the hashing scheme?''
-
37:31 - 37:34And hopefully you'd end up getting a clean table, right, where you had 12
-
37:34 - 37:37items now with one in each bucket,
-
37:37 - 37:40ready to kinda give constant performance through
-
37:40 - 37:42that and then potentially,
-
37:42 - 37:43again if you overload
-
37:43 - 37:47your load factor, go back in and rearrange stuff again.
-
37:47 - 37:50So using this strategy a little bit like the one we talked about with the binary search tree of just
-
37:50 - 37:51wait and see. Rather
-
37:51 - 37:55than if you got, you know, just because you get two items in a bucket it doesn't immediately cause any
-
37:55 - 37:57real alarm or crisis,
-
37:57 - 38:00it waits until there's kinda sufficiently clear demand
-
38:00 - 38:03that exceeds the capacity plan for,
-
38:03 - 38:05to do a big rearrangement.
-
38:05 - 38:08So you'd end up saying that adds, adds, adds would be fast, fast, fast, fast, fast.
-
38:08 - 38:12Starting to get just a hair slower, right, having to do a search through a few things as opposed
-
38:12 - 38:13to one,
-
38:13 - 38:16and then every now and then, right, you'd see an add which caused a big reshuffle of
-
38:16 - 38:19the table, which would be a totally linear operation in the number of
-
38:19 - 38:20elements
-
38:20 - 38:22and then they'd be fast again
-
38:22 - 38:24for another big clump of inserts until that
-
38:24 - 38:27same expansion might be complete or triggered by
-
38:27 - 38:28it growing even more.
-
38:28 - 38:31But dividing it sorta by the total number of inserts, so if you had 1,000 inserts
-
38:31 - 38:35before you overloaded and then you had to copy all 1,000 things to get ready for the
-
38:35 - 38:36next thousand,
-
38:36 - 38:39if you average that out, it still ends up
-
38:39 - 38:43being called a constant operation. So
-
38:43 - 38:46that's pretty neat.
-
38:46 - 38:47It's a ?
-
38:47 - 38:50really one of the, sort of, more, I think,
-
38:50 - 38:53easily understood and beautiful algorithmic techniques, right, for solving, right,
-
38:53 - 38:55this hard problem of
-
38:55 - 38:57search and retrieval
-
38:57 - 38:58that
-
38:58 - 39:01the vector and the sorting and the BST are all trying to get at, but ? and sometimes the
-
39:01 - 39:04sorter, vector and the BST are solving actually a much harder problem, right,
-
39:04 - 39:09which is keeping a total ordering of all the data and being able to flip it,
-
39:09 - 39:12traverse it and sorted order is one of the advantages that the sorted vector
-
39:12 - 39:12and
-
39:12 - 39:14the BST have that hash table doesn't have at all.
-
39:14 - 39:17In fact, it's jumbled it all up, and so if you think about how the iterator for
-
39:17 - 39:21the map works, our map is actually implemented using a hash table,
-
39:21 - 39:24it actually just iterates over the link list, and that explains why it almost appears
-
39:24 - 39:27completely random. If you put a bunch of things in the table and you go to iterate
-
39:27 - 39:28and
-
39:28 - 39:29visit them again,
-
39:29 - 39:33that the order you visit the keys seems to make no sense, and that's because it's based on
-
39:33 - 39:35their hash codes, right, which
-
39:35 - 39:36aren't designed
-
39:36 - 39:37to be
-
39:37 - 39:39any real sensible ordering
-
39:39 - 39:42to anybody not in the know of how the hash function works,
-
39:42 - 39:46whereas iterating over a vector that's sorted or iterating over a BST
-
39:46 - 39:47or a,
-
39:47 - 39:51in this case, our set, for example, is backed by a binary research tree
-
39:51 - 39:54will give you that sorted order. So it solves this harder problem, right, which cause there to
-
39:54 - 39:57be more kinda more investment in that problem, whereas hashing solves just exactly the
-
39:57 - 39:58one problem, which is
-
39:58 - 40:02I want to be able to find exactly this value again
-
40:02 - 40:03and update this value,
-
40:03 - 40:08and nothing more is needed than just identifying this,
-
40:08 - 40:10not finding the things that are less than this. For example, if you needed to find all
-
40:10 - 40:13the values that were less than this key alphabetically,
-
40:13 - 40:15right, the hash table makes that
-
40:15 - 40:16no easier
-
40:16 - 40:18than an unsorted vector,
-
40:18 - 40:21whereas things like assorted vector and binary search tree actually enable that kinda search,
-
40:21 - 40:24you could find the place in the tree and kind of work your way down the left side
-
40:24 - 40:27to find the things that were less than that.
-
40:27 - 40:31That's not actually being solved by the hash at all. Its use of
-
40:31 - 40:34memory is actually comparable to a
-
40:34 - 40:35binary search tree
-
40:35 - 40:38in that it has a four byte pointer per entry, which is the next field on the
-
40:38 - 40:40link list chain, and
-
40:40 - 40:44then there's a four byte pointer for each bucket, the head of the link list
-
40:44 - 40:47cell. Given that we expect that the buckets to be roughly equal to entry,
-
40:47 - 40:51than we can kinda just summarize that as well. Each entry
-
40:51 - 40:53represented an eight byte overhead,
-
40:53 - 40:56which is the same eight byte overhead, right, that the left and right pointers of
-
40:56 - 40:58the binary search
-
40:58 - 40:59tree add.
-
40:59 - 41:01Does it have any degenerate cases? That's a good
-
41:01 - 41:06question. So one of the things that made the binary search tree a little bit of a
-
41:06 - 41:08heavy investment was that to get that
-
41:08 - 41:11balancing behavior, even when we're getting data coming in in a way that's
-
41:11 - 41:16causing a lopsided construction, we have to go out of our way to do something special.
-
41:16 - 41:19What is the degenerate case for hash? Is there something that caused it to behave really,
-
41:19 - 41:24really badly? Anyone wanna help me out with
-
41:24 - 41:28that? Is it always good? Does it always give a good spread? Can we
-
41:28 - 41:32end up with everything in the same bucket somehow? [Inaudible] repeated it that way?
-
41:32 - 41:35So if you have repeated elements the way that ?
-
41:35 - 41:37both versions of the map work is they overwrite.
-
41:37 - 41:41So actually the nice thing is, yeah, if you did it again you just take that same cell and
-
41:41 - 41:43overwrite it, so in fact we wouldn't get a clustering based on the same entry
-
41:43 - 41:45going in and out.
-
41:45 - 41:51But how'd we end up with everything being in the same bucket?
-
41:51 - 41:54Mostly comes out, if you look at your hash function, when would a hash function collide?
-
41:54 - 41:58Right, and if you have a dumb hash function, right, you can definitely have some degenerate
-
41:58 - 42:01cases. My dumb hash function of return zero, right, being the
-
42:01 - 42:03worst example of that,
-
42:03 - 42:07but any hash function, right, that wasn't being particularly clever about using all of its
-
42:07 - 42:09information might actually have some clustering in
-
42:09 - 42:12it. It is possible, for example, even with a particularly good hash function that
-
42:12 - 42:15there are strings that will hash to the same thing, and it's like, if you somehow got
-
42:15 - 42:18really, really unlucky
-
42:18 - 42:19that you ?
-
42:19 - 42:23and had a hash function that wasn't doing the best job possible that you could get some
-
42:23 - 42:25clustering, but in general it doesn't require
-
42:25 - 42:29? the sole responsibility, right, for the generate case comes down to your hash function.
-
42:29 - 42:31So as long as your hash function
-
42:31 - 42:34and your inputs match your expectation,
-
42:34 - 42:35you don't have any surprises about
-
42:35 - 42:38how they got inserted the way it was in a binary search tree, which is sometimes hard
-
42:38 - 42:40to control.
-
42:40 - 42:43Just as you could underestimate the number of buckets you need ? Yeah. ? could
-
42:43 - 42:46you over estimate the number of buckets you
-
42:46 - 42:51need by a large amount ? Um hm. ? that's wasting a lot of it ? Exactly. ? memory, and
-
42:51 - 42:52do you then go through and take
-
42:52 - 42:54that memory back? Yeah.
-
42:54 - 42:57Yeah, so that's a great question. So a lot of data structures
-
42:57 - 42:58don't shrink,
-
42:58 - 43:02right, and for example, the vector often will only grow on demand and then even if you deleted a
-
43:02 - 43:04bunch of, or removed a bunch of elements, it doesn't
-
43:04 - 43:06necessarily consider shrinking a big deal, and so
-
43:06 - 43:10that's often a ? maybe it's a little bit lazy, but it turns out, right, that having a bunch of memory around
-
43:10 - 43:12you're not using
-
43:12 - 43:15doesn't have nearly the same penalty that not
-
43:15 - 43:17having the memory when you needed it and having to do a lot of extra work to find
-
43:17 - 43:18things.
-
43:18 - 43:21So typically, right, you would try to pick a size that you are
-
43:21 - 43:26willing to commit to, and say, ''Well, no matter what, I'll stay at this size. I won't get any smaller.''
-
43:26 - 43:28
-
43:28 - 43:30But that, as you grow, you might not shrink back to that size. You might just
-
43:30 - 43:34let the capacity just kinda lay in waste. There are good
-
43:34 - 43:37reasons to actually be tidy about it, but again, some of this is hard to predict.
-
43:37 - 43:40It might be that your table kinda grows big and then a bunch of things get taken out, and then it grows
-
43:40 - 43:41big again. Like, maybe you
-
43:41 - 43:43have a new freshman class come in,
-
43:43 - 43:45and then you graduate a whole bunch of people. Then
-
43:45 - 43:47at the point where you graduate them all, you're about ready to take in a new freshman
-
43:47 - 43:51class and so it might be that in fact, right, the capacity you just got rid of, you're gonna plan on
-
43:51 - 43:53reusing in a minute anyway, and so maybe
-
43:53 - 43:56that clearing it out and totally releasing
-
43:56 - 43:57may actually be an exercise
-
43:57 - 44:01that was unimportant. You're planning on getting back to that size eventually, anyway.
-
44:01 - 44:04Most hash tables tend to grow is actually the truth. Right, you tend to
-
44:04 - 44:07be collecting data in there and
-
44:07 - 44:11they tend to only enlarge. It's kind of unusual that you take out a bunch of
-
44:11 - 44:15entries you already put in.
-
44:15 - 44:18And so I just put a little last note, which is like, okay, well I had said earlier that when we
-
44:18 - 44:19have the map,
-
44:19 - 44:22the key had to be string type and I was gonna at
-
44:22 - 44:25some point come back to this and talk about
-
44:25 - 44:28why that was the one restriction in all our template classes, right, you can put anything you
-
44:28 - 44:31want in statter or vector or stacker queue.
-
44:31 - 44:34That map lets you store any kind of value associated with that key, but that key was
-
44:34 - 44:36string type, and
-
44:36 - 44:40so given how our map is implemented as a hash table, that actually starts to make sense, that if
-
44:40 - 44:42you're gonna take the key and kind of jumble it up and map it to
-
44:42 - 44:44a bucket,
-
44:44 - 44:47you need to know something about how to extract information from the key, and so
-
44:47 - 44:51this case, knowing it's a string means you know it has characters, you know its length, and you can do
-
44:51 - 44:52those things.
-
44:52 - 44:53If you wanted to make a map
-
44:53 - 44:57that could take any kinda key, maybe integers is key or student structures is
-
44:57 - 44:59key or, you
-
44:59 - 45:00know,
-
45:00 - 45:01doubles is key,
-
45:01 - 45:05any of these things, it's like, well, how can you write a generic form
-
45:05 - 45:06that could use
-
45:06 - 45:08a hashing operation on that
-
45:08 - 45:11kind of key and map it to bucket numbers.
-
45:11 - 45:14You would build a two type template to kinda get this working in C++, so
-
45:14 - 45:18you can actually have a template that has a type name for the key type, a type name for
-
45:18 - 45:19the value type,
-
45:19 - 45:20and then
-
45:20 - 45:25in the member functions you'd refer to both key type and value type as these two
-
45:25 - 45:26distinct place holders.
-
45:26 - 45:29And then when the client set it up, right, they'd be saying, ''Okay, I want a map
-
45:29 - 45:32that has strings that map to integers.'' So maybe this is
-
45:32 - 45:36words and the page they appear in a document. I have a map of integers and a vector
-
45:36 - 45:40string. Maybe this is score on an exam and the names of the students who got that
-
45:40 - 45:42score, doing it that way,
-
45:42 - 45:44and to make this work, right,
-
45:44 - 45:48there has to be some kind of universal hashing that, given some generic type,
-
45:48 - 45:53can turn it into an integer in the range of buckets. The
-
45:53 - 45:53?
-
45:53 - 45:56what it's gonna require here is that the client get involved.
-
45:56 - 45:58That you can not write this
-
45:58 - 46:00generic
-
46:00 - 46:01hash function
-
46:01 - 46:04that will work for all types, right, if it's a student structure, what are you gonna look at? The
-
46:04 - 46:08ID field, the names field, under like ? there's just no sensible way that you can
-
46:08 - 46:10talk about a generic type and talk about how to hash
-
46:10 - 46:14it, so what you would have to do is have some sort of client call back, so probably you would
-
46:14 - 46:18create this by passing out a call back that's given a key type thing, and a number of
-
46:18 - 46:20buckets would
-
46:20 - 46:24do the necessary machinations, right, to hash
-
46:24 - 46:26it into a right number.
-
46:26 - 46:30And so that would be kinda one of those coordination things between the client and the implementer
-
46:30 - 46:34about the client knows what the data is what the data is it's trying to store, and kinda how to
-
46:34 - 46:35
-
46:35 - 46:41mash it up and then the map would know, okay, given that number, where to stick it. So
-
46:41 - 46:42that's what it would take, and then
-
46:42 - 46:45rather sorta load that onto our
-
46:45 - 46:48novice heads, we went ahead just said, ''Okay, it'll always be a string, so that we can
-
46:48 - 46:53do a hash internally without having to get involved.'' All right,
-
46:53 - 46:55any questions about hashing? I'm gonna
-
46:55 - 46:59give you like, a two second talk about set
-
46:59 - 47:01because it's the one
-
47:01 - 47:04ADT I never said, ''Well how does it work? How does it work? What do we do?'' We're talking
-
47:04 - 47:07about, it doesn't add any new things that we haven't already done, so in fact I'm just
-
47:07 - 47:10gonna tell you what it does, and
-
47:10 - 47:13then your picture will be complete, right.
-
47:13 - 47:16It wants to do fast searching, fast updating, add and remove, and it also has all
-
47:16 - 47:21these high level operations, such as intersect, union and difference,
-
47:21 - 47:24that are kind of its primary set of things to do.
-
47:24 - 47:25
-
47:25 - 47:27A bunch of the things we've already seen, right,
-
47:27 - 47:30would actually be a reasonable strategy for upholding stat, right, using some kind
-
47:30 - 47:31of vector or array.
-
47:31 - 47:34Most likely you'd want to keep it in sorted order because that would buy you
-
47:34 - 47:37the fast lookup for contains,
-
47:37 - 47:41but the add and remove, right, would necessarily be slow in those cases because of the
-
47:41 - 47:43continuous memory.
-
47:43 - 47:44
-
47:44 - 47:47The link list, not really probably too good of an option here
-
47:47 - 47:50because it contains then becomes
-
47:50 - 47:51linear and
-
47:51 - 47:54add and remove similarly, right, liner, so it doesn't actually get you anywhere, in
-
47:54 - 47:57fact it just adds the memory in pointers and gunk like that.
-
47:57 - 47:59The binary search tree, probably a
-
47:59 - 48:00pretty good call,
-
48:00 - 48:03right, that's the ?
-
48:03 - 48:06kinda meshes the advantage of Adamic structure with the sorted property to
-
48:06 - 48:11be able to give it fast update adding and removing and searching all can be done
-
48:11 - 48:14in logarithmic time if you have balancing.
-
48:14 - 48:17And then it actually also enables the high level ops, so the idea of being able to do a union
-
48:17 - 48:19intersection difference, and actually being able to
-
48:19 - 48:23walk over both of the trees in sorted order, which is very easily done
-
48:23 - 48:24with a [inaudible] reversal,
-
48:24 - 48:28makes those operations actually more efficient to implement than actually kinda having to deal with
-
48:28 - 48:32data coming at you all different ways. It
-
48:32 - 48:33turns out hashing really
-
48:33 - 48:35won't quite do it
-
48:35 - 48:36
-
48:36 - 48:37very easily for us
-
48:37 - 48:40because of the same need about the template situations, like, well you have to have a hash function that
-
48:40 - 48:43can hash these things and do the right thing, and so the tree happens to be a
-
48:43 - 48:46good of just ? if the client can just give you ordering information, you can
-
48:46 - 48:48do all the hard work without
-
48:48 - 48:50pushing hashing onto them. So in fact
-
48:50 - 48:53the way our set really does work is it does use a binary search tree. It happens to
-
48:53 - 48:56use it through another layered abstraction
-
48:56 - 48:59that there actually is a BST class that we've never really
-
48:59 - 49:02encountered directly, but the BST class just takes the idea of a binary search
-
49:02 - 49:06tree, adds balancing, and
-
49:06 - 49:08all the properties about finding and inserting and adding,
-
49:08 - 49:12and packages it up, and then set actually just turns out to be one liners
-
49:12 - 49:14making calls into the binary search tree.
-
49:14 - 49:17This is where all the heavy lifting
-
49:17 - 49:18goes, is down there.
-
49:18 - 49:23That kinda completes the big picture, so some of stuff, right, so our map uses our hash, right, our
-
49:23 - 49:26BST uses ? or set uses the BST, which is the binary search tree,
-
49:26 - 49:28right, our vector, right, using an array,
-
49:28 - 49:33and then that stacks and queues having both viable strategies both for array
-
49:33 - 49:37vector based backing as well as link list both getting good time performance. And then
-
49:37 - 49:38you saw with the PQ, right,
-
49:38 - 49:40even
-
49:40 - 49:44yet another structure, the heap, right, which is kinda a variant of the tree, and so those tend to be the
-
49:44 - 49:46kind of really classic things in which a lot of things get built,
-
49:46 - 49:51and so having a chance to use them both as a client and kinda dig
-
49:51 - 49:53around as an implementer kinda gives you this big picture of
-
49:53 - 49:55some of the more common ways that
-
49:55 - 49:57cool data structures get built.
-
49:57 - 50:00So that's all for Friday. We're gonna do some more good stuff next week. I'm gonna go to the Terman
-
50:00 - 50:03cafe in a little bit, if you've got some time, please come, and have
-
50:03 - 50:04a good weekend.
-
50:04 - 50:07Work hard on pathfinder.
- Title:
- Lecture 24 | Programming Abstractions (Stanford)
- Description:
-
Lecture 24 by Julie Zelenski for the Programming Abstractions Course (CS106B) in the Stanford Computer Science Department.
Julie introduces hashing and it's uses in search and retrieval; map implementations and the different kinds of search algorithms are then discussed. Thereafter she explains that logarithmic searches are relatively fast and often finish the search in an immeasurable amount of time. She introduces a different approach to search that works in a faster manner than linear search.
Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=FE6E58F856038C69CS 106B Course Website:
http://cs106b.stanford.eduStanford Center for Professional Development:
http://scpd.stanford.edu/Stanford University:
http://www.stanford.edu/Stanford University Channel on YouTube:
http://www.youtube.com/stanforduniversity/ - Video Language:
- English
- Duration:
- 50:19
N. Ueda edited English subtitles for Lecture 24 | Programming Abstractions (Stanford) | ||
Eunjeong_Kim added a translation |