Lecture 24 | Programming Abstractions (Stanford)

Edit subtitles

0:00 - 0:12
0:12 - 0:15

This presentation is delivered by the Stanford Center for Professional
0:15 - 0:28

Development.
0:28 - 0:29
0:29 - 0:33

What are we doing today? We're gonna talk about hashing. Hashing's one of the coolest things you're ever
0:33 - 0:36

gonna learn in 106b, so it's a good day to be here and learn something really neat, kinda
0:36 - 0:39

clever, inventive idea for how you do search and retrieval in a very efficient
0:39 - 0:40

way.
0:40 - 0:44

So I'm gonna put us back to where we were a lecture ago, two lectures back to Monday,
0:44 - 0:48

where we had talked about ways we might implement the map abstraction, right, the key
0:48 - 0:49

value pairing
0:49 - 0:52

to allow for quick search and retrieval
0:52 - 0:56

and we had looked at vector and had just done a little thought experiment about
0:56 - 0:59

what sorting the vector would buy us, right, and that would give us the
0:59 - 1:02

ability to do searches faster because we could use binary search,
1:02 - 1:07

right. But in the case of add, right, we'd still be slow because the
1:07 - 1:10

need be kinda shuffled or rearranged in the continuous memory
1:10 - 1:13

and then that's what motivated our introduction to trees there, the binary search tree,
1:13 - 1:17

which is designed exactly to enable that kind of searching, right, that logarithmic
1:17 - 1:20

divide and conquer, divide in half and work your way down
1:20 - 1:22

and then because of the flexibility of the pointer base structure there, right, we
1:22 - 1:24

can also do the insert
1:24 - 1:26

in the same logarithmic
1:26 - 1:30

time because we're no longer constrained by that contiguous memory that the array
1:30 - 1:30

fact
1:30 - 1:32

data structures force us to.
1:32 - 1:36

And so, at that point what we have seen this logarithmic and both get valued at and so
1:36 - 1:38

on the PQ, right, which you were just finishing,
1:38 - 1:42

you'll notice that logarithmic is pretty fast. In fact logarithmic for most values of n that you
1:42 - 1:44

would normally run into in an
1:44 - 1:45

ordinary use, right,
1:45 - 1:49

it's un-measurable, right. So those of you who got your heap up and running and
1:49 - 1:51

correctly working logarithmic and time was discovering that you can put 10,000, 20,000,
1:51 - 1:54

50,000, 100,000 things in it and just never be able to measure
1:54 - 1:58

what it took to NQ or DQ from those structures because logarithmic is very,
1:58 - 2:00

very fast.
2:00 - 2:01

So that said,
2:01 - 2:04

that's actually a fine point to say, we have a great math notation, we don't have to look further,
2:04 - 2:07

but in fact I'm gonna push it a little bit today and we're gonna actually getting it down to where
2:07 - 2:10

we can do both those operations in constant time, which is to say that
2:10 - 2:13

no matter how large the table gets, right, we expect
2:13 - 2:16

we'll be able to insert and retrieve with
2:16 - 2:21

the exact same time required kind of for a 10,000 table as a million entry
2:21 - 2:22

table which is pretty neat.
2:22 - 2:26

To note that as we start to get to these more complicated data structures,
2:26 - 2:29

one thing we also need to keep in mind, right, is that there are other things
2:29 - 2:33

of trade offs, right, in adding that complexity of the code that
2:33 - 2:36

starting with something that's simple and
2:36 - 2:39

a little slower performer but that's easy to write might actually solve the problem
2:39 - 2:42

we have at hand. So maybe that we don't really wanna go to one of these fancier
2:42 - 2:45

things because we don't really need it to be that much faster
2:45 - 2:48

because the BST, right, adds a bunch of memory, right, so left and right pointers
2:48 - 2:53

for every cell, which is an 8 byte overhead on top of the entry itself.
2:53 - 2:56

Now we have to deal with pointers, dynamic memories, all this allocation,
2:56 - 2:59

deallocation, right, we've got a lot of recursion going on in there
2:59 - 3:01

and so just opportunities for error
3:01 - 3:03

and stuff to creep into
3:03 - 3:07

and if we take the steps to provide tree balancing, which we only kind of grazed
3:07 - 3:11

the surface of. We just mentioned the need for it. It actually adds quite a lot of
3:11 - 3:14

complication to the code to do the rotations
3:14 - 3:16

or the rebalancing
3:16 - 3:21

that from a simple binary search tree. So in fact the whole package, right, of
3:21 - 3:24

building a really guaranteed log n performing
3:24 - 3:27

binary search tree is actually a pretty big investment of human capital
3:27 - 3:29

that you
3:29 - 3:30

have to kinda
3:30 - 3:33

balance against the future hit off of it running faster. I'm
3:33 - 3:36

gonna show this strategy today that actually is
3:36 - 3:37

kind of simple to write.
3:37 - 3:40

It gets us solve one case and doesn't have the degenerate
3:40 - 3:43

looming that the binary search tree did.
3:43 - 3:46

So, let me kinda just give you a theoretical concept to think about how
3:46 - 3:49

you do stuff.
3:49 - 3:51

You have a big dictionary. You've got, you know, like a
3:51 - 3:54

many thousand page dictionary, right, and you want to look up a word. You certainly
3:54 - 3:57

don't do linear search. You're about to look up the word xylophone and you're actually not
3:57 - 4:00

going to start from the A's and kind of page your way through, you
4:00 - 4:03

know, 2,000 pages to get to the X's. You
4:03 - 4:07

don't even ? like we do binary search. You can do binary searches pretty fast, right? It
4:07 - 4:10

turns out like starting at the M's to get something that's in X, right? You
4:10 - 4:11

tend to know about where X
4:11 - 4:14

is and often dictionaries have little tabs
4:14 - 4:17

along the pages that give you an idea of kinda where the breaks for
4:17 - 4:22

certain words are. Maybe it says Xa is here and Xe is there or something,
4:22 - 4:25

that lets you kinda more quickly just illuminate all the surrounding noise
4:25 - 4:29

in the dictionary, kinda get straight to the right region where the X words are
4:29 - 4:30

and then you can do
4:30 - 4:33

a simple search from there, maybe a little bit of a linear or binary search to kind of zero
4:33 - 4:37

in on the page where the word is you wanna find is.
4:37 - 4:38

This is the idea
4:38 - 4:45

behind this concept called a hash table or based on the algorithmic technique of hashing
4:45 - 4:46

is that
4:46 - 4:49

rather than try to figure out how to kinda keep this very large collection sorted or organized
4:49 - 4:51

in a way that you can kind of
4:51 - 4:52

jump around within it, it says,
4:52 - 4:56

''Well how about we maintain a bunch of different collections.'' We break up the
4:56 - 4:59

data set into little tiny pieces
4:59 - 5:00

and if we can
5:00 - 5:02

tell, given a key,
5:02 - 5:05

like which piece of the data set it's supposed to be in and we'll call those pieces
5:05 - 5:08

buckets. I'm gonna put these in different buckets.
5:08 - 5:11

So if I had a whole bunch of students I might put all the freshmen in this bucket
5:11 - 5:13

and the sophomores in this bucket or something like that and then when I look
5:13 - 5:16

for a student I would say, ''Well which class are you?'' You say, ''Freshmen.'' I'd look in the
5:16 - 5:18

freshmen bucket.
5:18 - 5:19

Okay. You know, the idea
5:19 - 5:22

in a simple case would be like, ''Okay, well I ? it has some
5:22 - 5:26

criteria by which I can divide you into these buckets that don't overlap and everyone
5:26 - 5:28

has a unique bucket to go in
5:28 - 5:31

and if I can make up enough of those categorizations,
5:31 - 5:33

have as many of those buckets around
5:33 - 5:35

to divide you up into,
5:35 - 5:38

then I can hopefully keep the set that lands in that bucket pretty small,
5:38 - 5:42

so that I don't have a smaller of students. So maybe I did it by the
5:42 - 5:45

first letter of your last name. I could say, ''Well everybody whose last name begins with W
5:45 - 5:47

goes here and everybody who's B goes here and so on.'' Well,
5:47 - 5:50

when somebody comes to me then I don't have to look through the B's.
5:50 - 5:51

You know, and if you have
5:51 - 5:53

a class of 200 students
5:53 - 5:56

and you have 26 letters of the alphabet then you'd think that if it were
5:56 - 5:59

evenly divided across those letters, it's not going to be to be, it turns out in practice,
5:59 - 6:02

but as a thought experiment there'd be about 4 people in a bucket and
6:02 - 6:06

it'd be pretty fast to just look through the bucket and find the one who I was
6:06 - 6:08

looking for. Okay,
6:08 - 6:10

so let's take this idea. This idea of, like,
6:10 - 6:14

there's going to be some strategy for how we divide into buckets and we call
6:14 - 6:16

that strategy the hash function.
6:16 - 6:18

That, given a key,
6:18 - 6:23

and a buckets and we'll just label them with numbers. Zero to the number of buckets
6:23 - 6:26

minus one. So let's say I have 99 buckets, I've got zero to
6:26 - 6:27

98
6:27 - 6:28

that I will take the key
6:28 - 6:32

and the key in this case is the string key that we're using in the map,
6:32 - 6:33

and I have a hash function
6:33 - 6:37

which given a key is input produces a number as output
6:37 - 6:39

in the range zero to B minus one.
6:39 - 6:42

And that is what's called the hash value for that key
6:42 - 6:45

and that tells me which bucket to look at,
6:45 - 6:48

either find a match for that if I'm trying to do a get value or where to place a
6:48 - 6:52

new entry if I'm trying to add into the table.
6:52 - 6:57

So that's the basic idea. I'm
6:57 - 7:00

gonna talk a little bit about hash functions.
7:00 - 7:03

In terms of writing a really good hash function is actually pretty complicated
7:03 - 7:04

and so
7:04 - 7:07

we're gonna think about it in sort of a ? kind
7:07 - 7:10

of a big picture perspective and think about what qualities a hash
7:10 - 7:13

function has to have and then I'm actually not gonna go through
7:13 - 7:16

proving to you about what mathematically constitutes a good hash function. I'm just gonna give you
7:16 - 7:21

some intuitive sense of what kinda things matter when designing a hash function.
7:21 - 7:23

So it's given a key that maps it to a number.
7:23 - 7:25

Typically it's gonna say, you know,
7:25 - 7:27

zero to the number of buckets minus one.
7:27 - 7:30

Often that means that somehow it's gonna compute something and then use
7:30 - 7:33

the mod operation to bring it back in range.
7:33 - 7:35

So it's gonna take your number
7:35 - 7:37

and then say, ''Okay. Well, if you have five buckets to divide it in, I will give you
7:37 - 7:39

a number zero to four.''
7:39 - 7:41

It's very important that it be
7:41 - 7:44

stable, that you put the key in, you get a number out. If you put that same key in
7:44 - 7:48

later you should get the same number out. So it has to be sort of reliable guaranteed
7:48 - 7:52

mapping. It can't be random. That
7:52 - 7:55

said, you want it to almost look as though it's random in terms of
7:55 - 7:59

how it maps things to buckets, that you don't want it to have a real systematic plan
7:59 - 8:02

where it puts a whole bunch of things in bucket one and very few things in bucket two
8:02 - 8:06

or somehow always uses the even buckets and not the odds ones or something like that. So
8:06 - 8:08

you wanna have a strategy that's gonna take those keys
8:08 - 8:11

and use your whole bucket range
8:11 - 8:12

in an equal manner.
8:12 - 8:15

So that if you put 100 keys in and you have 100 buckets you're hoping
8:15 - 8:16

that,
8:16 - 8:19

on average, each bucket will have one thing. That you won't have these clusters
8:19 - 8:22

where 25 things are in one bucket, ten here and then a whole bunch of buckets
8:22 - 8:24

are empty. So
8:24 - 8:27

you want this kinda divided across the range.
8:27 - 8:29

So given that we have string input,
8:29 - 8:32

likely what we're gonna be using is the characters in the string as part of
8:32 - 8:34

the ? how can I take these characters
8:34 - 8:40

and somehow kinda jungle them up in a way to sort of move them around in these
8:40 - 8:43

buckets in a way that's reliable so that I get that same string back, I get the
8:43 - 8:44

same one.
8:44 - 8:47

But that doesn't actually produce a lot of artifacts of
8:47 - 8:50

similar strings, let's say mapping to the same place, hopefully spreading them
8:50 - 8:52

around.
8:52 - 8:55

So for example, some really simple things you could use, you could just take the first
8:55 - 8:56

letter. So I
8:56 - 8:59

talked about this sorta with last names. If I was trying to divide up my students, right, I could divide
8:59 - 9:01

them by the first letter of their last name.
9:01 - 9:05

That would use exactly 26 buckets, no more, no less,
9:05 - 9:08

and if I happen to have 26 buckets
9:08 - 9:11

then at least there's some chance it would get some coverage across the range.
9:11 - 9:14

There is likely to be clustering, right, some
9:14 - 9:15

names are a lot more common
9:15 - 9:18

or letters are a lot more commonly used than others.
9:18 - 9:21

So it would tend to produce a little bit of an artifact, but it would be reliable and it would
9:21 - 9:24

use 26 buckets. If I had 100 buckets,
9:24 - 9:26

then this would be kind of a crummy function
9:26 - 9:28

because in fact it would use the first 26 buckets and none of the other ones,
9:28 - 9:30

right, if it was just taking
9:30 - 9:32

that letter and deciding. So,
9:32 - 9:36

you know, in some situations this would get me a little bit of the thing I wanted.
9:36 - 9:40

Things like using the length of the word. A
9:40 - 9:42

very, very bad clustering function here,
9:42 - 9:44

right because there'd be a bunch of strings that are, like, six characters, seven
9:44 - 9:46

characters, ten characters,
9:46 - 9:50

but very few that are one or two, and very few that are more than ten, or something.
9:50 - 9:53

You'd have this very tight little cluster in the middle, right, and would not use a
9:53 - 9:57

large bucket range effectively at all.
9:57 - 10:01

More likely what you're gonna try to use is kind of more of the information about the word,
10:01 - 10:04

the key that you have. Instead of just a single letter or the count of letters, it's
10:04 - 10:06

trying to use kinda all the data you have at hand,
10:06 - 10:08

which is to say all the letters,
10:08 - 10:12

and so one way to do it would be to take the ASCII values for the letters, so A is
10:12 - 10:13

97, you
10:13 - 10:15

know, B is 98,
10:15 - 10:17

is you would take a letter, a word like atoms and you'd add
10:17 - 10:18

96
10:18 - 10:22

plus 99 plus 96, you know, add
10:22 - 10:23

them together
10:23 - 10:26

and then you'd get some sum of them, 500, 600, whatever it is,
10:26 - 10:28

depending on how many letters there are
10:28 - 10:32

and then use mod to back it back into ? let's say your range was, you had 100
10:32 - 10:34

buckets, right, so you'd want it to be zero to 99, so
10:34 - 10:37

you'd take that thing and then mod it by 100, then take the last two digits of it. So it'd
10:37 - 10:42

be, okay, the sum to 524 then it goes into bucket 24.
10:42 - 10:47

So if you did that, right, and so if act
10:47 - 10:50

is the word that you have, right, you'd add these codes together and let's say this is
10:50 - 10:53

97 and then this is 99 and then
10:53 - 10:57

that's, you know, a hundred and something and so we end up with ? I'm making this
10:57 - 10:59

number up, right ? 283, then we might
10:59 - 11:03

mod to the last two digit and we'll say this one goes into bucket 83.
11:03 - 11:07

And if we have, you know, dog,
11:07 - 11:11

maybe that adds to 267 and we would
11:11 - 11:13

mod that down to
11:13 - 11:17

67. So this actually tries to use, kinda, all the information at hand, all the letters
11:17 - 11:18

that we have
11:18 - 11:19

and bring them down,
11:19 - 11:22

and so if you figure that across the space of all the numbers that are being added here
11:22 - 11:25

you are just as likely to get 210 as you are to get 299, you know
11:25 - 11:28

that any of the subsequent numbers in between,
11:28 - 11:32

then hopefully you'd get kind of a spread across your buckets.
11:32 - 11:36

There is one thing this clusters on. Does anybody see what words cluster in this
11:36 - 11:40

system?
11:40 - 11:41

That you'd think of as
11:41 - 11:47

maybe not being related, but would land in the same bucket? To
11:47 - 11:49

rearrange some numbers. Yes. Any
11:49 - 11:51

anagram, right, so
11:51 - 11:52

you
11:52 - 11:54

take cat which is just a ranagram ?
11:54 - 11:57

anagram of act, right, it's gonna add to the same thing, right,
11:57 - 12:00

and so it would be a little bit of clustering in that respect.
12:00 - 12:02

So anagrams come down to the same place.
12:02 - 12:05

And so the words that you expect to, kinda like, act and cat to
12:05 - 12:09

you seem like different words, they are gonna collide in this system and that might be a little
12:09 - 12:11

bit of something we worry about. So one way you can actually kind of avoid that in the
12:11 - 12:12

way that
12:12 - 12:15

the hash function we're going to eventually use will do, is
12:15 - 12:16

it'll do this thing where
12:16 - 12:21

not only does it add the letters together, but it also multiplies the letter by a
12:21 - 12:22

very large prime number
12:22 - 12:25

that helps to kinda say, ''Well the large prime number
12:25 - 12:26

times C plus
12:26 - 12:30

the large prime number squared times A plus the large prime number cubed times C and
12:30 - 12:33

then it just allows the, each position of the letter actually is also then
12:33 - 12:35

encoded in the number that comes out
12:35 - 12:37

and so something that's positionally different
12:37 - 12:40

won't necessarily land in the same bucket
12:40 - 12:43

without some other factors kinda coming into play. So a little
12:43 - 12:49

bit of trying to just take the information and jumble it up a little bit.
12:49 - 12:51

So let me show you a little bit about how this looks. Actually I'm gonna tell you about
12:51 - 12:54

the collisions and then I'll show you the data structure.
12:54 - 12:58

So certainly I said there's gonna be this problem where things collide,
12:58 - 13:01

that 83 and 83 in this simple system of adding the codes will
13:01 - 13:03

come out the same.
13:03 - 13:06

So it can't be that each of these buckets holds exactly one thing.
13:06 - 13:07

The hashing,
13:07 - 13:08

right, although we may
13:08 - 13:11

try to get those buckets as small as possible we have to account for the
13:11 - 13:14

possibility that two things will collide, that their hash function, even though
13:14 - 13:16

they were different keys,
13:16 - 13:18

will land in the same bucket
13:18 - 13:20

based on how I've jumbled it up
13:20 - 13:23

and so we're trying to avoid that, but when it does happen we also have to have a
13:23 - 13:25

strategy for it. So we have to have a way of saying, ''Well
13:25 - 13:28

there can be more than one thing in a bucket.'' Or some other way of when there's
13:28 - 13:33

a collision, deciding where to put the one that was placed in the table second.
13:33 - 13:36

And in the strategy, we're gonna use the one called chaning,
13:36 - 13:39

and the chaning basically says, ''Well if once we have something in a bucket like ACT,
13:39 - 13:43

if we hash something and it comes to the same bucket, we'll just tack it onto that bucket,
13:43 - 13:45

in this case using a link list.''
13:45 - 13:48

So each of these buckets is actually gonna be a link list of
13:48 - 13:49

the entries
13:49 - 13:52

and so the entire table will be a
13:52 - 13:54

vector or array of those.
13:54 - 13:56

So let me show you,
13:56 - 13:59

a little
13:59 - 14:05

live action.
14:05 - 14:08

So this is a hash table that has seven buckets.
14:08 - 14:08

In
14:08 - 14:11

this case number 0 through 6 are assigned to each one.
14:11 - 14:14

They all start empty, and so these are all empty link lists illustrated by the
14:14 - 14:15

null,
14:15 - 14:17

and then the hash function
14:17 - 14:20

is shown here like a black box. You don't actually know how the workings of hash
14:20 - 14:23

function is, but you know what it's job is to do, which is you give it a key and it gives you
14:23 - 14:27

a number in the range 0 to 6, and that same key when passed, and again gives you
14:27 - 14:28

the same number.
14:28 - 14:31

So if I say 106B
14:31 - 14:34

equals Julie, it pushes 106B through the hash function and says,
14:34 - 14:35

''Okay, well 106B
14:35 - 14:37

comes out as one.''
14:37 - 14:40

You add those numbers together, jumble them up, the letters, and you get one.
14:40 - 14:43

And so it went into the table, and it said, ''Okay, well that's bucket number one, make a new
14:43 - 14:46

cell for 106B Julie and stick it in the table.''
14:46 - 14:48

So now there's a
14:48 - 14:51

non empty chain in one place, the rest are empty.
14:51 - 14:53

If I say 107
14:53 - 14:55

is being taught by
14:55 - 14:57

Jerry, 107 happens to get the same hash function.
14:57 - 15:00

107, 106B, different letters but it just happen to come out that way,
15:00 - 15:01

hash to one
15:01 - 15:04

and so it went ahead in this case just tacked it on to the front of the list.
15:04 - 15:05

If I do
15:05 - 15:08

106A
15:08 - 15:10

being taught by Patrick,
15:10 - 15:11

106A
15:11 - 15:14

happens to sum to zero in the hash function.
15:14 - 15:17

That's all we know about the black box that ends up loading it up into the zero's
15:17 - 15:17

slot.
15:17 - 15:21

I say, well wait, it's being taught by
15:21 - 15:25

Nick, it also come out to be zero and so
15:25 - 15:27

it ends up kinda pushing on the front of that. So I go 103
15:27 - 15:30

is being taught by Maron and it's in
15:30 - 15:33

bucket five and so as I kinda type things in I'm starting to see that things are just kinda
15:33 - 15:35

going around in difference place in the table.
15:35 - 15:38

A little bit of clustering in the top, a little bit of empty at the bottom, but as I continue
15:38 - 15:42

to type in strings for other course numbers, I'd hopefully kinda fill out the table
15:42 - 15:43

to where there was kind of an even,
15:43 - 15:46

roughly even spread across that.
15:46 - 15:47

When I want to look something up
15:47 - 15:51

because I wanna see who's teaching 106B, then
15:51 - 15:54

it will use the same process in reverse. It asks the hash function,
15:54 - 15:56

''Well which bucket should I look in?'' And it
15:56 - 15:59

says, ''Oh, you should look in bucket one because 106B's hash code is one.'' So
15:59 - 16:01

all the other buckets are eliminated
16:01 - 16:04

from consideration, so very quick kind of
16:04 - 16:07

focused down to the right place and then it actually just does a link list reversal to
16:07 - 16:10

find 106B in the bucket
16:10 - 16:11

where it has to be,
16:11 - 16:13

and so
16:13 - 16:16

the key to realizing what makes this so magical, right,
16:16 - 16:19

is that the choice of the number of buckets
16:19 - 16:21

is up to me.
16:21 - 16:24

I can pick the number of buckets to be whatever I want.
16:24 - 16:27

I can pick it to be five, I can pick it to be 10, I can be to be
16:27 - 16:29

1,000, I can pick it to be a million. If
16:29 - 16:34

I choose the number of buckets to be kind of roughly in the same order of magnitude
16:34 - 16:35

as the number of entries. So if I
16:35 - 16:38

wanna put a thousand things in my table,
16:38 - 16:40

if I make a thousand buckets to hold them in
16:40 - 16:43

and my hash function can divide stuff
16:43 - 16:45

pretty equally across those thousand buckets,
16:45 - 16:47

then I will expect that each of those link lists is about length one.
16:47 - 16:52

Some will be two, some will be zero, but if it's doing its job right then I have a
16:52 - 16:56

bunch of little tiny buckets, each of which holds a tiny fraction of
16:56 - 16:58

the data set, in this case one or two,
16:58 - 17:00

and it's very easy to search through them.
17:00 - 17:05

If I know I'm gonna put a million things in, then I make a million buckets, and then
17:05 - 17:08

I divide one out of a million things, I run it through the hash function, it gives me
17:08 - 17:09

this number
17:09 - 17:12

that says, ''Oh, it's 862,444.'' I go to
17:12 - 17:16

that bucket and the item is either there in that bucket or I have to look a little
17:16 - 17:18

bit to find it,
17:18 - 17:20

but there's no other place in the table I have to look. So the fact that there's
17:20 - 17:21

a
17:21 - 17:24

million other entries kinda near by in these other buckets is completely not
17:24 - 17:26

important to us, right, and so
17:26 - 17:28

the searching and
17:28 - 17:32

updating, adding, and replacing, and removing, and stuff all happens at the bucket level
17:32 - 17:35

and by choosing the buckets to be
17:35 - 17:36
17:36 - 17:39

the number, total number are gonna be roughly equal to the size of the things, then
17:39 - 17:40

I have this
17:40 - 17:43

constant time access to
17:43 - 17:50

the tiny part of the subset of the set that matters. That's pretty cool. Let's write some
17:50 - 17:51

code. It's kinda
17:51 - 17:53

good to see the whole thing
17:53 - 18:00

doing what it needs to do. So I'm gonna go ahead and go back over here.
18:00 - 18:02

All right, so I've got my
18:02 - 18:05

map which has add and get value and
18:05 - 18:07

not much else there, right,
18:07 - 18:12

but kinda just set up and ready for us to go. What I'm gonna build here
18:12 - 18:13

is a
18:13 - 18:15

link list cell
18:15 - 18:20

that's gonna hold a string key, a val type value, and
18:20 - 18:24

a pointer to the next, and
18:24 - 18:27

so I plan on having each one of these cells, right, for each entry that goes in the
18:27 - 18:27

table
18:27 - 18:31

and then I'm going to have an array of
18:31 - 18:32

these, and
18:32 - 18:35

I'm gonna make a constant for it, and then I'll talk a little bit about the
18:35 - 18:42

consequences of that. So I'm gonna make a constant that is ? so
18:42 - 18:45

I make it 99. So then I have 99 buckets to start with. That's not
18:45 - 18:49

gonna solve all my problems, it's gonna show that it's kinda tuned for maps that
18:49 - 18:51

expect to hold about a hundred things, but
18:51 - 18:56

? and so I have an array in this case, so if you look at deciphering this
18:56 - 18:58

declaration is a little bit complicated here.
18:58 - 19:00

It says that buckets is the name of the variable,
19:00 - 19:06

it is an array of num buckets length, so a fixed size, 99
19:06 - 19:07

entry array here.
19:07 - 19:11

Each of those entries is a pointer to a cell, so ready to be the
19:11 - 19:16

head pointer of a link list for us.
19:16 - 19:19

I'm gonna add a ?
19:19 - 19:23

well, we'll get there in a second. I'm gonna go over here and work on the constructor first.
19:23 - 19:25

So
19:25 - 19:30

the first thing I'm gonna do is make sure that those pointers have good values.
19:30 - 19:32

If I do not do this, right, they will be junk
19:32 - 19:36

and there will be badness that will happen, so I can make sure that each of them starts off as an
19:36 - 19:37

empty list,
19:37 - 19:42

and then correspondently, right, I would need to be doing a delete all list
19:42 - 19:43

cells here, but
19:43 - 19:48

I've been lazy about doing that and I'm gonna continue to be lazy just knowing that I'd have to iterate through and then
19:48 - 19:52

delete all the cells in each of those lists that are there.
19:52 - 19:54

And then for add and get value
19:54 - 19:56

what I'm gonna ? else we need to do
19:56 - 19:57

is figure out
19:57 - 19:58

in both cases, right, which
19:58 - 20:00

is the appropriate list to operate on.
20:00 - 20:03

So running the hash function on my key,
20:03 - 20:08

seeing where it tells me to look, and then either searching it to find the match
20:08 - 20:09

for get value
20:09 - 20:13

to return the matching value or in the add case, to search for the matching
20:13 - 20:16

entry to overwrite if it exists and if not, then to create a new cell and tack it on the
20:16 - 20:19

front of the list. So
20:19 - 20:21

let's write a
20:21 - 20:25

helper function because they're both gonna do the same thing. I'm gonna need
20:25 - 20:28

the same thing twice. So I'm gonna put a hash function in here
20:28 - 20:31

that, given a key and a number of buckets, is gonna
20:31 - 20:35

return to me the right bucket to look in,
20:35 - 20:39

and then I'm gonna write a find cell function that, given a key and
20:39 - 20:39

the pointer to a list,
20:39 - 20:43

will find the cell that matches that key or return a null if it didn't find
20:43 - 20:52

it. Okay, let me put my hash
20:52 - 21:03

function in. I have a good hash function down there in a minute that I'm gonna pull out, but I just ?
21:03 - 21:05

so what I'm gonna do, I'm
21:05 - 21:07

gonna make an interesting choice here
21:07 - 21:12

and I'm gonna have my hash function just return zero, which
21:12 - 21:13

seems pretty surprising.
21:13 - 21:16

But I'm gonna talk about why that is. It turns out, when I'm first building it that
21:16 - 21:19

actually is an easy way to test that my code is working
21:19 - 21:22

and to not be dependent on the behavior of the hash function. At this point it says, ''We'll
21:22 - 21:23

just put
21:23 - 21:25

everything in the zero's bucket.''
21:25 - 21:27

Now that's not a good strategy for my long term efficiency,
21:27 - 21:29

but it is a good way to test that
21:29 - 21:33

my handling of the bucket, and the searching, and the finding, and
21:33 - 21:36

inserting is correct, and the performance I should expect to be abysmal on this, right, it should be
21:36 - 21:37

totally linear
21:37 - 21:39

in terms of the number of elements to add or to get them because they'll all
21:39 - 21:41

be in one link list
21:41 - 21:45

with no easy access to them. And then eventually I can change this to divide across
21:45 - 21:48

the buckets and get better spread, but the correctness of the hash table
21:48 - 21:52

isn't affected by this choice. It's kind of an interesting way to think about designing
21:52 - 21:54

the code.
21:54 - 22:01

The other helper I have down here is gonna be my find cell that, given the
22:01 - 22:04

head pointer to a list and the key,
22:04 - 22:14

is gonna go find it, and it's basically just
22:14 - 22:17

if this ker cells key equals the key,
22:17 - 22:20

then we return it and then if we've gone through the whole list and haven't found it, we return
22:20 - 22:22

a null.
22:22 - 22:25

So this guy is returning a cell
22:25 - 22:27

T star, right,
22:27 - 22:30

as the matching cell within the thing, and this one is gonna provoke
22:30 - 22:31

that
22:31 - 22:35

little C++ compilation snafu that we encountered on the binary search
22:35 - 22:36

tree, as well,
22:36 - 22:40

which is that cell T, right, is a type that's declared in the private
22:40 - 22:43

section within the scope of my map. So
22:43 - 22:46

outside of the scope, which to say, before I've crossed the my map
22:46 - 22:51

angle bracket val type colon, colon, it doesn't actually believe we're in my map scope
22:51 - 22:54

yet. It hasn't seen that and it can't anticipate it, so I have to actually give
22:54 - 22:56

the full name
22:56 - 22:58

with the scope specifier on it,
22:58 - 23:03

and then because of this use of the template name outside of its class
23:03 - 23:09

is a special place for the C++ compiler, the type name keyword also
23:09 - 23:12

needs to go there. Okay, so there's the heavy lifting of these two things, right, getting it to hash to the
23:12 - 23:14

right place, which I kinda
23:14 - 23:15

short changed,
23:15 - 23:18

and then searching to find a cell. If
23:18 - 23:23

I go back up here to get value,
23:23 - 23:25

and I call the hash function,
23:25 - 23:31

given my key and the number of buckets to find out which bucket to work on,
23:31 - 23:35

I go looking for a match by calling find cell
23:35 - 23:40

of the key in buckets of which bucket,
23:40 - 23:43

and then if match does not equal null then
23:43 - 23:45

I can return match as val and
23:45 - 23:46

then
23:46 - 23:47

otherwise
23:47 - 23:52

we say, ''No such key found.'' So that's
23:52 - 23:58

the error case, right, for get value was that if it didn't find one, its behavior is to
23:58 - 23:59

raise an error
23:59 - 24:01

so that someone knows that they've
24:01 - 24:05

asked something it can't deal
24:05 - 24:09

with. Okay, so the key thing here usually is that the hash is some sort of constant operation that
24:09 - 24:12

just jumbles the key up, in this case it just returned zero, but could actually look at
24:12 - 24:14

the elements of the key,
24:14 - 24:16

and then it does its reversal down that link list,
24:16 - 24:20

which given a correct working hash function should have divided them up in
24:20 - 24:20

these
24:20 - 24:23

little tiny chains, then we can just find the one
24:23 - 24:26

quickly by looking through that link list. If we find it, right, we've got it,
24:26 - 24:28

otherwise, right, we're in
24:28 - 24:31

the error case.
24:31 - 24:37

Add looks almost exactly the same. I
24:37 - 24:40

start by getting the bucket, doing the find cell,
24:40 - 24:41

if it's not ?
24:41 - 24:43

if it did find it, right,
24:43 - 24:46

then we overwrite,
24:46 - 24:49

and then in the second case, right, where we didn't find it, right, then we need to
24:49 - 24:55

make ourselves a new cell, and
24:55 - 24:59

copy in the key, copy in the
24:59 - 25:01

value,
25:01 - 25:05

set it ? in this case, the easiest place, right, to add something to a link list is I should just do append
25:05 - 25:08

it to the front, right, no need to make it hard on ourselves, right, we're not
25:08 - 25:11

keeping them in any particular order, whatsoever, right, they're base on kind of order
25:11 - 25:12

of entry.
25:12 - 25:17

So if somebody gives us a new one, we might as well just stick it in the front, that'd make our job easy. So
25:17 - 25:18

it will
25:18 - 25:20

point to the
25:20 - 25:23

current front pointer of the cell,
25:23 - 25:26

and then we will update the
25:26 - 25:34

bucket pointer to point to this guy. So I think
25:34 - 25:35

we are
25:35 - 25:37

in okay shape here. Let me
25:37 - 25:40

take a look and make sure I like what I did, right,
25:40 - 25:44

so hashing, finding the match, if it found a non empty, so ? this is the
25:44 - 25:46

overwrite case of finding an existing cell, we just overwrite with that. Otherwise,
25:46 - 25:50

making a new cell and then attaching it to be the
25:50 - 25:50

front
25:50 - 25:52

of the bucket there.
25:52 - 25:55

So for example, in the case where the bucket's totally empty, it's good to kinda trace
25:55 - 25:59

that. That's the most common case when we start, right, if we hash to bucket zero
25:59 - 26:01

and we do a search of a null link list, we should get a null out,
26:01 - 26:03

so then we create a new cell, right,
26:03 - 26:06

and we set the next field to be what the current bucket's head pointer was, which
26:06 - 26:10

was null. So there should be a single to list and then we update the bucket
26:10 - 26:12

to point to this guy. So then we should have a
26:12 - 26:14

single list cell with null trailing it
26:14 - 26:18

or inserting into the empty bucket case.
26:18 - 26:20

I like it. Let's see if I have a little code over
26:20 - 26:24

here that adds
26:24 - 26:27

some things, car, car, cat, with some numbers. Okay, and that tries to
26:27 - 26:31

retrieve the value of car, which it wrote a three and then it wrote a four on top
26:31 - 26:39

of it, so hopefully the answer we should get out of this is four. Okay, if I ? be
26:39 - 26:41

interesting to do something, like ask for something that we know is not in the
26:41 - 26:45

table, right, to see that our error case gets handled correctly,
26:45 - 26:46

the switch
26:46 - 26:47

key's
26:47 - 26:50

found. And if I were to continue to do this, I could do a little bit of stress testing to see how
26:50 - 26:55

that zero is causing us some performance implications, but if I sat there and just put tons and
26:55 - 26:56

tons of things in there, right,
26:56 - 27:00

made up strings like crazy and stuffed them in that with numbers, right, I would just be creating one
27:00 - 27:04

long chain off the zero bucket, ignoring all the other ones, right,
27:04 - 27:08

but then it's kinda stress testing my link list handling about the adding and
27:08 - 27:09

searching that list, but
27:09 - 27:12

would also show us that we expect that as we got to be, have 100 entries, 200
27:12 - 27:16

entries, 300 entries that subsequent adds and get values, right, would
27:16 - 27:21

show a linear increase in performance because of the
27:21 - 27:27

cluster that we got going there.
27:27 - 27:30

So let me give you a real hash function. I'll talk
27:30 - 27:37

you through what it's doing and then I'll
27:37 - 27:39

leave you with the thought of,
27:39 - 27:41

that writing one of these actually correctly and reliably is
27:41 - 27:44

actually
27:44 - 27:47

more of an advance exercise than we're gonna look at, but this is actually a hash
27:47 - 27:52

function that's taken from Don Knuth's Art of Computer Science,
27:52 - 27:55

some of the work that guides a lot of computer scientists still to this day,
27:55 - 27:59

and it has the strategy I basically told you about of,
27:59 - 28:02

for all the characters that are in the string,
28:02 - 28:06

right, adding it into the thing, but having multiplied,
28:06 - 28:10

right, the previous hash code that's being built up by this large multiplier
28:10 - 28:12

which is in effect kinda taking
28:12 - 28:14

advantage of the positional information.
28:14 - 28:17

And so the multiplier happens to be a very large negative number
28:17 - 28:19

that's a prime that says, okay,
28:19 - 28:20

the first time through, right,
28:20 - 28:23

it'll take the hash code of zero and multiply it by and add the first character. So the first
28:23 - 28:26

character gets added in without any modification.
28:26 - 28:29

The next character, though, will take the previous one and multiply it by
28:29 - 28:31

one power of the multiplier
28:31 - 28:33

and then add in the next character,
28:33 - 28:36

and then that sum, right, on the sub [inaudible] gets multiplied again by the multiplier, so
28:36 - 28:39

raising it to the squared, and the third, and to the fourth powers, it works down the
28:39 - 28:40

string. The
28:40 - 28:43

effectiveness actually is that it's gonna do a lot of overflow.
28:43 - 28:45

That's a very large number
28:45 - 28:47

and it's adding
28:47 - 28:51

and multiplying in this very large space, so in effect we're actually counting on the
28:51 - 28:54

fact that the addition and multiplication
28:54 - 28:56

that is built into C++
28:56 - 28:58

doesn't raise any errors when the numbers get so large that they can't be represented. It
28:58 - 29:03

just kind of truncates back down to what fits, and so it kind of wraps around using a modular
29:03 - 29:06

strategy here, like a ring. And so we'll actually end up kind of making a very big number out
29:06 - 29:10

of this. The longer the string is, the more those multiplies, the more those powers raise. We've got
29:10 - 29:13

this huge number, what we do is kinda truncating it back to what fits in to a long
29:13 - 29:16

and then when we're done, we take whatever
29:16 - 29:18

number fell out of this process
29:18 - 29:21

and we mod it by the number of buckets to give us a number from zero to buckets minus
29:21 - 29:22

one.
29:22 - 29:26

And so whatever gets stuffed into here, right, it's reliable in the sense that it's,
29:26 - 29:28

every time you put in the number you'll get the same thing back.
29:28 - 29:33

We know it will be in the range end buckets because of that last mod
29:33 - 29:34

call
29:34 - 29:36

that worked it out for us
29:36 - 29:37

and we
29:37 - 29:40

then use that to say which bucket to go look in when we're ready
29:40 - 29:41

to do our work.
29:41 - 29:44

So if I switch that in for my old hash
29:44 - 29:47

function, I shouldn't actually see any
29:47 - 29:47

change
29:47 - 29:48

from the outside,
29:48 - 29:51

other than the performance, right, should suddenly actually be a lot faster, which is
29:51 - 29:55

hard to tell when I have three things in there, but ? I'm gonna
29:55 - 30:01

take out that error case. In this case car
30:01 - 30:03

and cat now
30:03 - 30:06

are probably not very likely to be going to the same bucket. In fact I can
30:06 - 30:09

guarantee they don't go in the same bucket, whereas
30:09 - 30:16

before they were kind of overlapping.
30:16 - 30:18

I'll leave that code there for a second, although
30:18 - 30:19

really I would say that point of
30:19 - 30:22

hashing really is not to get so worried about how it is that the hash code does it's
30:22 - 30:25

work, but to understand the general idea of what a hashing function has to do,
30:25 - 30:29

right, what its role is and how it relates to getting this constant time
30:29 - 30:30

performance.
30:30 - 30:33

So I'll give you an example of hashing in the real world, I think is really interesting
30:33 - 30:37

to see it actually happen and people not even realizing how hashing is so useful.
30:37 - 30:39

So I ordered something from REI, which is
30:39 - 30:41

this camping store and
30:41 - 30:42

it ?
30:42 - 30:43

they didn't have it in stock, so they had to
30:43 - 30:46

place an order for it. This is actually kinda awhile ago, but I ?
30:46 - 30:50

so then when I called to see if it's in, I call them on the phone, I say, ''Oh, is my order in?'' And they don't
30:50 - 30:52

ask me my name. I thought this was very funny. Like,
30:52 - 30:55

''Oh, it's ? I'm Julie. I wanna know about my order.'' They're like,
30:55 - 30:59

''Okay, what are the last two digits of your phone number?'' And I'm
30:59 - 31:02

like ? you know, it's not one of those things you can answer right off the top of your head. Okay, 21. Turns out last
31:02 - 31:04

two digits of my phone number are 21.
31:04 - 31:06

And they say ? and they go and they look in the 21 basket and then they say, ''What's your
31:06 - 31:09

name?'' Now they want to know my name. I'm like, ''Okay, my name's Julie.'' Okay, they look it up and they're
31:09 - 31:11

like, ''Okay, it's here.'' I'm like,
31:11 - 31:12

''Okay.'' So then I go to get it, like
31:12 - 31:13

a couple days later, and I go to the store
31:13 - 31:15

and I'm standing in the line waiting for them,
31:15 - 31:17

and I look up on the wall
31:17 - 31:20

and they have
31:20 - 31:22

100 baskets, little
31:22 - 31:24

wire baskets up on the wall
31:24 - 31:27

and they're labeled 0 through 9 over here
31:27 - 31:29

and 0 through 9 over there,
31:29 - 31:33

and when I come in they ask me the same question. ''What's the last two digits of your phone number?''
31:33 - 31:36

Now I'm all prepared because I've already been primed on this question. I say, ''21.'' Then
31:36 - 31:38

they walk over here, right,
31:38 - 31:40

to the 21 basket here.
31:40 - 31:42

It's like the top digit in this digit,
31:42 - 31:45

and there's only one piece of paper in there, and they pull out my order thing,
31:45 - 31:48

and then they get me my tent and everybody's happy.
31:48 - 31:52

So while I'm there, I try to engage the cashier on how this is an example of a real world
31:52 - 31:55

hashing system
31:55 - 31:58

and I still got my tent, I'll have you know, but it was close,
31:58 - 32:02

right. They're like, ''Okay, yeah, you crazy crackpot. You can now leave now.''
32:02 - 32:06

I was like looking at it and them and I tried to talk to them about the number of digits because in some sets, right, this is a
32:06 - 32:10

very good example of what the investment in the number of buckets is versus what
32:10 - 32:12

tradeoff it gives you, right. Because there's this very physical
32:12 - 32:13

sort of set up of this, it's like
32:13 - 32:17

by choosing a larger number of buckets, right, you can more fine grain your
32:17 - 32:18

access,
32:18 - 32:22

but that investment in buckets means you're kind of permanently installing that story. So
32:22 - 32:25

in this case, right, they didn't use just the last digit of my phone number. They could've done that and
32:25 - 32:28

they would have only had ten buckets on the wall, but then they would expect,
32:28 - 32:30

you know, ten times as many things in each bucket.
32:30 - 32:33

And apparently their
32:33 - 32:37

estimate was the number of active orders in this backorder category was enough
32:37 - 32:38

to
32:38 - 32:41

warrant being more than ten, you know, significantly more than ten,
32:41 - 32:45

but not so much more than 100, then in fact 100 buckets was the right
32:45 - 32:46

investment to make in their bucket strategy.
32:46 - 32:49

They didn't put three buckets on the wall because, you know, like what are they
32:49 - 32:52

gonna do, have this big 3D sort of thing. They didn't enjoy this
32:52 - 32:56

discussion, I went on for hours with them and they were, like, really not amused. But it
32:56 - 32:59

had, you know, the three digits, then you'd get this even more likelihood that each
32:59 - 33:03

bucket would be empty, but ? or have a very small number of things, but given
33:03 - 33:07

their set up, that seemed to be the right strategy was to say 100 buckets and now ? they didn't ask me
33:07 - 33:09

the
33:09 - 33:11

first two digits of my phone number.
33:11 - 33:14

They asked me the last two. Why does that
33:14 - 33:18

matter? Because you're using all [inaudible]. Yeah. It's hated a
33:18 - 33:21

lot. If you ask, like Stanford students, right, like especially
33:21 - 33:25

when, you know, like campus numbers used to all be 49's or 72's or something, so like, if
33:25 - 33:27

you use the first two digits, right, you'd have everybody's in the 49 bucket or
33:27 - 33:29

the 72 bucket or
33:29 - 33:32

something, and a whole bunch of other ones not used. An example, even the first two digits are never
33:32 - 33:35

00, like there's a whole bunch of buckets that don't even get used as the first two digits of a
33:35 - 33:38

phone number. Phone numbers never start with zeros.
33:38 - 33:42

That they rarely actually have zeros or ones in them at all. They try to use those for the
33:42 - 33:44

area code and stuff in the leading digits. So I
33:44 - 33:45

thought it was just a really
33:45 - 33:48

interesting exercise and like, oh yeah, they exactly had kinda thought through hashing. Of
33:48 - 33:52

course, they did not think it was hashing, and they thought I was a crackpot, and they didn't want to give me
33:52 - 33:56

my stuff, but I paid my money and they [inaudible],
33:56 - 34:00

but it kinda shows you, okay what you're trying to do here is try to make that number of
34:00 - 34:00

buckets,
34:00 - 34:03

right, roughly equal to the number of elements
34:03 - 34:04

so that
34:04 - 34:09

if your hash function's dividing up across your whole world,
34:09 - 34:11

you've got
34:11 - 34:14

constant access to what
34:14 - 34:17

you have. Why is there an L at the end of one item? You know, that's actually just a ? C++ isn't for this. This is a long value
34:17 - 34:21

that without it, it assumes it's an int and that number is to big to fit in an int and it
34:21 - 34:25

is twice as big as a long in terms of space and so the L needs to be there, otherwise it
34:25 - 34:26

will
34:26 - 34:30

try to read it as an int and through away part of the information I was trying to give it. So it's the
34:30 - 34:31

way
34:31 - 34:35

of identifying it along constant.
34:35 - 34:37

So let me talk about this just a little bit more
34:37 - 34:39

in terms of
34:39 - 34:41
34:41 - 34:43

actual performance, right, we've got N
34:43 - 34:46

over B, right, so if the division is complete, right, the number of entries over
34:46 - 34:48

the number of buckets,
34:48 - 34:48
34:48 - 34:51

if we make them kind of on the same order of
34:51 - 34:52

magnitude, kind of in the same range,
34:52 - 34:55

so having to sum this could be to make a little bit of an estimated guess about
34:55 - 34:58

where we're going,
34:58 - 35:01

and then there are some choices here in how we store each bucket, right. We could use
35:01 - 35:05

a little array, we could use a link list, right, we could use a vector.
35:05 - 35:07

We expect this to be very small, is the truth.
35:07 - 35:09

We're hoping that it's gonna be one or two,
35:09 - 35:13

and so there's not a lot of reason to buy any real expensive structure or even bother
35:13 - 35:17

doing things like sorting. Like you could sort them, but like if you expect there to be two
35:17 - 35:19

things, what's the point? You're gonna spend more time
35:19 - 35:21

figuring out whether to put it in the front or the back
35:21 - 35:25

then you would gain at all by being able to
35:25 - 35:27

find it a little faster.
35:27 - 35:28

So the link list
35:28 - 35:32

is just gonna be easiest way to get kind of allocated without over capacity,
35:32 - 35:35

excess capacity that's unused in the bucket. You know it's gonna be small, use the
35:35 - 35:37

simple strategy.
35:37 - 35:41

It's also, though, an important thing that the hash type is likely to do in kind of a
35:41 - 35:44

industrial strength situation is it does have to deal with this idea, well what if that number of
35:44 - 35:47

buckets that you predicted was wrong?
35:47 - 35:51

And so the map as we have given it to you, right, has to take this sponge because
35:51 - 35:54

it doesn't know in advance how many things are you gonna put in.
35:54 - 35:57

The only way to know that is to wait and see as things get added,
35:57 - 35:59

and then realize your buckets are getting full.
35:59 - 36:02

So typically the industrial strength version of the hash table is gonna
36:02 - 36:06

track what's called the load factor. And the load factor is just the number of total
36:06 - 36:08

entries divided by the number of buckets.
36:08 - 36:09

When that
36:09 - 36:13

factor gets high, and high actually is actually quite small, in fact. If
36:13 - 36:15

it's two, is often the trigger for
36:15 - 36:19

causing a rehash situation, so
36:19 - 36:22

not even letting it get to be three or four. Just saying as soon as you realize that
36:22 - 36:22

you have
36:22 - 36:26

exceeded by double the capacity you intended, you planned for 100 things,
36:26 - 36:27

you got 200 things,
36:27 - 36:28

go ahead
36:28 - 36:31

and redo your whole bucket array.
36:31 - 36:39

So in the case, for example, of our simple like 0 through 7 case, right, maybe
36:39 - 36:43

it's 0 through 6. One, two, three,
36:43 - 36:45

four, so I have one that had six buckets, right,
36:45 - 36:48

and then I've gotten to where
36:48 - 36:50

each of them has
36:50 - 36:54

two or maybe three in some and one in another.
36:54 - 36:57

The plan is to just take this whole bucket array and
36:57 - 37:01

enlarge it by a factor of two, so grow it, and in this case, from 6 to 12,
37:01 - 37:02

and then rehash
37:02 - 37:04

to move everybody. The
37:04 - 37:08

idea is that the earlier decision about where to place them was based on knowing that
37:08 - 37:11

there was exactly six buckets and so where it was placed before
37:11 - 37:15

is not likely to be the place it will place if you have, you know,
37:15 - 37:18

12 choices to lay them down into.
37:18 - 37:21

And ideal ? in fact you don't even want it to be that case. Like, if they all land into the
37:21 - 37:23

same bucket, right, you would have
37:23 - 37:25

the same clustering and then a big empty clump,
37:25 - 37:28

and so you would rehash everything, run it through the new hash function having
37:28 - 37:31

changed the number of buckets, and say, ''Well, now where does it go in the hashing scheme?''
37:31 - 37:34

And hopefully you'd end up getting a clean table, right, where you had 12
37:34 - 37:37

items now with one in each bucket,
37:37 - 37:40

ready to kinda give constant performance through
37:40 - 37:42

that and then potentially,
37:42 - 37:43

again if you overload
37:43 - 37:47

your load factor, go back in and rearrange stuff again.
37:47 - 37:50

So using this strategy a little bit like the one we talked about with the binary search tree of just
37:50 - 37:51

wait and see. Rather
37:51 - 37:55

than if you got, you know, just because you get two items in a bucket it doesn't immediately cause any
37:55 - 37:57

real alarm or crisis,
37:57 - 38:00

it waits until there's kinda sufficiently clear demand
38:00 - 38:03

that exceeds the capacity plan for,
38:03 - 38:05

to do a big rearrangement.
38:05 - 38:08

So you'd end up saying that adds, adds, adds would be fast, fast, fast, fast, fast.
38:08 - 38:12

Starting to get just a hair slower, right, having to do a search through a few things as opposed
38:12 - 38:13

to one,
38:13 - 38:16

and then every now and then, right, you'd see an add which caused a big reshuffle of
38:16 - 38:19

the table, which would be a totally linear operation in the number of
38:19 - 38:20

elements
38:20 - 38:22

and then they'd be fast again
38:22 - 38:24

for another big clump of inserts until that
38:24 - 38:27

same expansion might be complete or triggered by
38:27 - 38:28

it growing even more.
38:28 - 38:31

But dividing it sorta by the total number of inserts, so if you had 1,000 inserts
38:31 - 38:35

before you overloaded and then you had to copy all 1,000 things to get ready for the
38:35 - 38:36

next thousand,
38:36 - 38:39

if you average that out, it still ends up
38:39 - 38:43

being called a constant operation. So
38:43 - 38:46

that's pretty neat.
38:46 - 38:47

It's a ?
38:47 - 38:50

really one of the, sort of, more, I think,
38:50 - 38:53

easily understood and beautiful algorithmic techniques, right, for solving, right,
38:53 - 38:55

this hard problem of
38:55 - 38:57

search and retrieval
38:57 - 38:58

that
38:58 - 39:01

the vector and the sorting and the BST are all trying to get at, but ? and sometimes the
39:01 - 39:04

sorter, vector and the BST are solving actually a much harder problem, right,
39:04 - 39:09

which is keeping a total ordering of all the data and being able to flip it,
39:09 - 39:12

traverse it and sorted order is one of the advantages that the sorted vector
39:12 - 39:12

and
39:12 - 39:14

the BST have that hash table doesn't have at all.
39:14 - 39:17

In fact, it's jumbled it all up, and so if you think about how the iterator for
39:17 - 39:21

the map works, our map is actually implemented using a hash table,
39:21 - 39:24

it actually just iterates over the link list, and that explains why it almost appears
39:24 - 39:27

completely random. If you put a bunch of things in the table and you go to iterate
39:27 - 39:28

and
39:28 - 39:29

visit them again,
39:29 - 39:33

that the order you visit the keys seems to make no sense, and that's because it's based on
39:33 - 39:35

their hash codes, right, which
39:35 - 39:36

aren't designed
39:36 - 39:37

to be
39:37 - 39:39

any real sensible ordering
39:39 - 39:42

to anybody not in the know of how the hash function works,
39:42 - 39:46

whereas iterating over a vector that's sorted or iterating over a BST
39:46 - 39:47

or a,
39:47 - 39:51

in this case, our set, for example, is backed by a binary research tree
39:51 - 39:54

will give you that sorted order. So it solves this harder problem, right, which cause there to
39:54 - 39:57

be more kinda more investment in that problem, whereas hashing solves just exactly the
39:57 - 39:58

one problem, which is
39:58 - 40:02

I want to be able to find exactly this value again
40:02 - 40:03

and update this value,
40:03 - 40:08

and nothing more is needed than just identifying this,
40:08 - 40:10

not finding the things that are less than this. For example, if you needed to find all
40:10 - 40:13

the values that were less than this key alphabetically,
40:13 - 40:15

right, the hash table makes that
40:15 - 40:16

no easier
40:16 - 40:18

than an unsorted vector,
40:18 - 40:21

whereas things like assorted vector and binary search tree actually enable that kinda search,
40:21 - 40:24

you could find the place in the tree and kind of work your way down the left side
40:24 - 40:27

to find the things that were less than that.
40:27 - 40:31

That's not actually being solved by the hash at all. Its use of
40:31 - 40:34

memory is actually comparable to a
40:34 - 40:35

binary search tree
40:35 - 40:38

in that it has a four byte pointer per entry, which is the next field on the
40:38 - 40:40

link list chain, and
40:40 - 40:44

then there's a four byte pointer for each bucket, the head of the link list
40:44 - 40:47

cell. Given that we expect that the buckets to be roughly equal to entry,
40:47 - 40:51

than we can kinda just summarize that as well. Each entry
40:51 - 40:53

represented an eight byte overhead,
40:53 - 40:56

which is the same eight byte overhead, right, that the left and right pointers of
40:56 - 40:58

the binary search
40:58 - 40:59

tree add.
40:59 - 41:01

Does it have any degenerate cases? That's a good
41:01 - 41:06

question. So one of the things that made the binary search tree a little bit of a
41:06 - 41:08

heavy investment was that to get that
41:08 - 41:11

balancing behavior, even when we're getting data coming in in a way that's
41:11 - 41:16

causing a lopsided construction, we have to go out of our way to do something special.
41:16 - 41:19

What is the degenerate case for hash? Is there something that caused it to behave really,
41:19 - 41:24

really badly? Anyone wanna help me out with
41:24 - 41:28

that? Is it always good? Does it always give a good spread? Can we
41:28 - 41:32

end up with everything in the same bucket somehow? [Inaudible] repeated it that way?
41:32 - 41:35

So if you have repeated elements the way that ?
41:35 - 41:37

both versions of the map work is they overwrite.
41:37 - 41:41

So actually the nice thing is, yeah, if you did it again you just take that same cell and
41:41 - 41:43

overwrite it, so in fact we wouldn't get a clustering based on the same entry
41:43 - 41:45

going in and out.
41:45 - 41:51

But how'd we end up with everything being in the same bucket?
41:51 - 41:54

Mostly comes out, if you look at your hash function, when would a hash function collide?
41:54 - 41:58

Right, and if you have a dumb hash function, right, you can definitely have some degenerate
41:58 - 42:01

cases. My dumb hash function of return zero, right, being the
42:01 - 42:03

worst example of that,
42:03 - 42:07

but any hash function, right, that wasn't being particularly clever about using all of its
42:07 - 42:09

information might actually have some clustering in
42:09 - 42:12

it. It is possible, for example, even with a particularly good hash function that
42:12 - 42:15

there are strings that will hash to the same thing, and it's like, if you somehow got
42:15 - 42:18

really, really unlucky
42:18 - 42:19

that you ?
42:19 - 42:23

and had a hash function that wasn't doing the best job possible that you could get some
42:23 - 42:25

clustering, but in general it doesn't require
42:25 - 42:29

? the sole responsibility, right, for the generate case comes down to your hash function.
42:29 - 42:31

So as long as your hash function
42:31 - 42:34

and your inputs match your expectation,
42:34 - 42:35

you don't have any surprises about
42:35 - 42:38

how they got inserted the way it was in a binary search tree, which is sometimes hard
42:38 - 42:40

to control.
42:40 - 42:43

Just as you could underestimate the number of buckets you need ? Yeah. ? could
42:43 - 42:46

you over estimate the number of buckets you
42:46 - 42:51

need by a large amount ? Um hm. ? that's wasting a lot of it ? Exactly. ? memory, and
42:51 - 42:52

do you then go through and take
42:52 - 42:54

that memory back? Yeah.
42:54 - 42:57

Yeah, so that's a great question. So a lot of data structures
42:57 - 42:58

don't shrink,
42:58 - 43:02

right, and for example, the vector often will only grow on demand and then even if you deleted a
43:02 - 43:04

bunch of, or removed a bunch of elements, it doesn't
43:04 - 43:06

necessarily consider shrinking a big deal, and so
43:06 - 43:10

that's often a ? maybe it's a little bit lazy, but it turns out, right, that having a bunch of memory around
43:10 - 43:12

you're not using
43:12 - 43:15

doesn't have nearly the same penalty that not
43:15 - 43:17

having the memory when you needed it and having to do a lot of extra work to find
43:17 - 43:18

things.
43:18 - 43:21

So typically, right, you would try to pick a size that you are
43:21 - 43:26

willing to commit to, and say, ''Well, no matter what, I'll stay at this size. I won't get any smaller.''
43:26 - 43:28
43:28 - 43:30

But that, as you grow, you might not shrink back to that size. You might just
43:30 - 43:34

let the capacity just kinda lay in waste. There are good
43:34 - 43:37

reasons to actually be tidy about it, but again, some of this is hard to predict.
43:37 - 43:40

It might be that your table kinda grows big and then a bunch of things get taken out, and then it grows
43:40 - 43:41

big again. Like, maybe you
43:41 - 43:43

have a new freshman class come in,
43:43 - 43:45

and then you graduate a whole bunch of people. Then
43:45 - 43:47

at the point where you graduate them all, you're about ready to take in a new freshman
43:47 - 43:51

class and so it might be that in fact, right, the capacity you just got rid of, you're gonna plan on
43:51 - 43:53

reusing in a minute anyway, and so maybe
43:53 - 43:56

that clearing it out and totally releasing
43:56 - 43:57

may actually be an exercise
43:57 - 44:01

that was unimportant. You're planning on getting back to that size eventually, anyway.
44:01 - 44:04

Most hash tables tend to grow is actually the truth. Right, you tend to
44:04 - 44:07

be collecting data in there and
44:07 - 44:11

they tend to only enlarge. It's kind of unusual that you take out a bunch of
44:11 - 44:15

entries you already put in.
44:15 - 44:18

And so I just put a little last note, which is like, okay, well I had said earlier that when we
44:18 - 44:19

have the map,
44:19 - 44:22

the key had to be string type and I was gonna at
44:22 - 44:25

some point come back to this and talk about
44:25 - 44:28

why that was the one restriction in all our template classes, right, you can put anything you
44:28 - 44:31

want in statter or vector or stacker queue.
44:31 - 44:34

That map lets you store any kind of value associated with that key, but that key was
44:34 - 44:36

string type, and
44:36 - 44:40

so given how our map is implemented as a hash table, that actually starts to make sense, that if
44:40 - 44:42

you're gonna take the key and kind of jumble it up and map it to
44:42 - 44:44

a bucket,
44:44 - 44:47

you need to know something about how to extract information from the key, and so
44:47 - 44:51

this case, knowing it's a string means you know it has characters, you know its length, and you can do
44:51 - 44:52

those things.
44:52 - 44:53

If you wanted to make a map
44:53 - 44:57

that could take any kinda key, maybe integers is key or student structures is
44:57 - 44:59

key or, you
44:59 - 45:00

know,
45:00 - 45:01

doubles is key,
45:01 - 45:05

any of these things, it's like, well, how can you write a generic form
45:05 - 45:06

that could use
45:06 - 45:08

a hashing operation on that
45:08 - 45:11

kind of key and map it to bucket numbers.
45:11 - 45:14

You would build a two type template to kinda get this working in C++, so
45:14 - 45:18

you can actually have a template that has a type name for the key type, a type name for
45:18 - 45:19

the value type,
45:19 - 45:20

and then
45:20 - 45:25

in the member functions you'd refer to both key type and value type as these two
45:25 - 45:26

distinct place holders.
45:26 - 45:29

And then when the client set it up, right, they'd be saying, ''Okay, I want a map
45:29 - 45:32

that has strings that map to integers.'' So maybe this is
45:32 - 45:36

words and the page they appear in a document. I have a map of integers and a vector
45:36 - 45:40

string. Maybe this is score on an exam and the names of the students who got that
45:40 - 45:42

score, doing it that way,
45:42 - 45:44

and to make this work, right,
45:44 - 45:48

there has to be some kind of universal hashing that, given some generic type,
45:48 - 45:53

can turn it into an integer in the range of buckets. The
45:53 - 45:53

?
45:53 - 45:56

what it's gonna require here is that the client get involved.
45:56 - 45:58

That you can not write this
45:58 - 46:00

generic
46:00 - 46:01

hash function
46:01 - 46:04

that will work for all types, right, if it's a student structure, what are you gonna look at? The
46:04 - 46:08

ID field, the names field, under like ? there's just no sensible way that you can
46:08 - 46:10

talk about a generic type and talk about how to hash
46:10 - 46:14

it, so what you would have to do is have some sort of client call back, so probably you would
46:14 - 46:18

create this by passing out a call back that's given a key type thing, and a number of
46:18 - 46:20

buckets would
46:20 - 46:24

do the necessary machinations, right, to hash
46:24 - 46:26

it into a right number.
46:26 - 46:30

And so that would be kinda one of those coordination things between the client and the implementer
46:30 - 46:34

about the client knows what the data is what the data is it's trying to store, and kinda how to
46:34 - 46:35
46:35 - 46:41

mash it up and then the map would know, okay, given that number, where to stick it. So
46:41 - 46:42

that's what it would take, and then
46:42 - 46:45

rather sorta load that onto our
46:45 - 46:48

novice heads, we went ahead just said, ''Okay, it'll always be a string, so that we can
46:48 - 46:53

do a hash internally without having to get involved.'' All right,
46:53 - 46:55

any questions about hashing? I'm gonna
46:55 - 46:59

give you like, a two second talk about set
46:59 - 47:01

because it's the one
47:01 - 47:04

ADT I never said, ''Well how does it work? How does it work? What do we do?'' We're talking
47:04 - 47:07

about, it doesn't add any new things that we haven't already done, so in fact I'm just
47:07 - 47:10

gonna tell you what it does, and
47:10 - 47:13

then your picture will be complete, right.
47:13 - 47:16

It wants to do fast searching, fast updating, add and remove, and it also has all
47:16 - 47:21

these high level operations, such as intersect, union and difference,
47:21 - 47:24

that are kind of its primary set of things to do.
47:24 - 47:25
47:25 - 47:27

A bunch of the things we've already seen, right,
47:27 - 47:30

would actually be a reasonable strategy for upholding stat, right, using some kind
47:30 - 47:31

of vector or array.
47:31 - 47:34

Most likely you'd want to keep it in sorted order because that would buy you
47:34 - 47:37

the fast lookup for contains,
47:37 - 47:41

but the add and remove, right, would necessarily be slow in those cases because of the
47:41 - 47:43

continuous memory.
47:43 - 47:44
47:44 - 47:47

The link list, not really probably too good of an option here
47:47 - 47:50

because it contains then becomes
47:50 - 47:51

linear and
47:51 - 47:54

add and remove similarly, right, liner, so it doesn't actually get you anywhere, in
47:54 - 47:57

fact it just adds the memory in pointers and gunk like that.
47:57 - 47:59

The binary search tree, probably a
47:59 - 48:00

pretty good call,
48:00 - 48:03

right, that's the ?
48:03 - 48:06

kinda meshes the advantage of Adamic structure with the sorted property to
48:06 - 48:11

be able to give it fast update adding and removing and searching all can be done
48:11 - 48:14

in logarithmic time if you have balancing.
48:14 - 48:17

And then it actually also enables the high level ops, so the idea of being able to do a union
48:17 - 48:19

intersection difference, and actually being able to
48:19 - 48:23

walk over both of the trees in sorted order, which is very easily done
48:23 - 48:24

with a [inaudible] reversal,
48:24 - 48:28

makes those operations actually more efficient to implement than actually kinda having to deal with
48:28 - 48:32

data coming at you all different ways. It
48:32 - 48:33

turns out hashing really
48:33 - 48:35

won't quite do it
48:35 - 48:36
48:36 - 48:37

very easily for us
48:37 - 48:40

because of the same need about the template situations, like, well you have to have a hash function that
48:40 - 48:43

can hash these things and do the right thing, and so the tree happens to be a
48:43 - 48:46

good of just ? if the client can just give you ordering information, you can
48:46 - 48:48

do all the hard work without
48:48 - 48:50

pushing hashing onto them. So in fact
48:50 - 48:53

the way our set really does work is it does use a binary search tree. It happens to
48:53 - 48:56

use it through another layered abstraction
48:56 - 48:59

that there actually is a BST class that we've never really
48:59 - 49:02

encountered directly, but the BST class just takes the idea of a binary search
49:02 - 49:06

tree, adds balancing, and
49:06 - 49:08

all the properties about finding and inserting and adding,
49:08 - 49:12

and packages it up, and then set actually just turns out to be one liners
49:12 - 49:14

making calls into the binary search tree.
49:14 - 49:17

This is where all the heavy lifting
49:17 - 49:18

goes, is down there.
49:18 - 49:23

That kinda completes the big picture, so some of stuff, right, so our map uses our hash, right, our
49:23 - 49:26

BST uses ? or set uses the BST, which is the binary search tree,
49:26 - 49:28

right, our vector, right, using an array,
49:28 - 49:33

and then that stacks and queues having both viable strategies both for array
49:33 - 49:37

vector based backing as well as link list both getting good time performance. And then
49:37 - 49:38

you saw with the PQ, right,
49:38 - 49:40

even
49:40 - 49:44

yet another structure, the heap, right, which is kinda a variant of the tree, and so those tend to be the
49:44 - 49:46

kind of really classic things in which a lot of things get built,
49:46 - 49:51

and so having a chance to use them both as a client and kinda dig
49:51 - 49:53

around as an implementer kinda gives you this big picture of
49:53 - 49:55

some of the more common ways that
49:55 - 49:57

cool data structures get built.
49:57 - 50:00

So that's all for Friday. We're gonna do some more good stuff next week. I'm gonna go to the Terman
50:00 - 50:03

cafe in a little bit, if you've got some time, please come, and have
50:03 - 50:04

a good weekend.
50:04 - 50:07

Work hard on pathfinder.

Title:: Lecture 24 | Programming Abstractions (Stanford)
Description:: Lecture 24 by Julie Zelenski for the Programming Abstractions Course (CS106B) in the Stanford Computer Science Department.

Julie introduces hashing and it's uses in search and retrieval; map implementations and the different kinds of search algorithms are then discussed. Thereafter she explains that logarithmic searches are relatively fast and often finish the search in an immeasurable amount of time. She introduces a different approach to search that works in a faster manner than linear search.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=FE6E58F856038C69

CS 106B Course Website:
http://cs106b.stanford.edu

Stanford Center for Professional Development:
http://scpd.stanford.edu/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanforduniversity/

more » « less
Video Language:: English
Duration:: 50:19

	N. Ueda edited English subtitles for Lecture 24 \| Programming Abstractions (Stanford)
	Eunjeong_Kim added a translation

English subtitles

Revisions

Revision 2

N. Ueda

Lecture 24 | Programming Abstractions (Stanford)

Revisions

Our website uses cookies

Operating cookies (Required)