This presentation is delivered by the Stanford Center for Professional Development.

What are we doing today? We're going to talk about hashing. Hashing is one of the coolest things you're ever going to learn in 106B, so it's a good day to be here and learn a really neat, clever, inventive idea for how you do search and retrieval in a very efficient way.

Let me put us back to where we were a lecture ago, two lectures back to Monday, where we talked about ways we might implement the map abstraction, the key-value pairing, to allow for quick search and retrieval. We looked at the vector and did a little thought experiment about what sorting the vector would buy us: it would let us search faster, because we could use binary search. But in the case of add we'd still be slow, because the elements would need to be shuffled or rearranged within that contiguous memory. That's what motivated our introduction to trees, the binary search tree, which is designed exactly to enable that kind of searching, the logarithmic divide-and-conquer, divide in half and work your way down. And because of the flexibility of the pointer-based structure, we can also do the insert in the same logarithmic time, because we're no longer constrained by the contiguous memory that the array-based data structures force on us.

So at that point we've seen logarithmic for both getValue and add, and on the PQ, which you were just finishing, you'll notice that logarithmic is pretty fast. In fact, for most values of n that you'd run into in ordinary use, logarithmic is unmeasurable. Those of you who got your heap up and running and correctly working in logarithmic time discovered that you can put 10,000, 20,000, 50,000, 100,000 things in it and never be able to measure what it took to enqueue or dequeue from those structures, because logarithmic is very, very fast.
That said, this would be a fine place to stop: we have a great map implementation, we don't have to look further. But in fact I'm going to push it a little bit today, and we're going to get it down to where we can do both of those operations in constant time, which is to say that no matter how large the table gets, we expect to be able to insert and retrieve in the same time for a 10,000-entry table as for a million-entry table, which is pretty neat.

As we start to get to these more complicated data structures, one thing we need to keep in mind is that there are trade-offs in adding that complexity to the code. Starting with something that's simple and a little slower-performing, but easy to write, might actually solve the problem we have at hand. Maybe we don't really want to go to one of these fancier things because we don't really need it to be that much faster. The BST adds a bunch of memory: left and right pointers for every cell, an eight-byte overhead on top of the entry itself. Now we have to deal with pointers, dynamic memory, all this allocation and deallocation; we've got a lot of recursion going on in there, and so there are lots of opportunities for errors to creep in. And if we take the steps to provide tree balancing, which we only grazed the surface of (we just mentioned the need for it), that adds quite a lot of complication to the code to do the rotations or the rebalancing on top of a simple binary search tree. So the whole package of building a guaranteed log-n binary search tree is actually a pretty big investment of human capital that you have to balance against the future payoff of it running faster. I'm going to show you a strategy today that is actually kind of simple to write, gets the job done, and doesn't have the degenerate case looming the way the binary search tree did.

So let me give you a conceptual model to think about how this works. You have a big dictionary, a many-thousand-page dictionary, and you want to look up a word. You certainly don't do linear search.
You're about to look up the word xylophone, and you're not going to start from the A's and page your way through 2,000 pages to get to the X's. You don't even do a binary search the way we've described it, starting at the M's to get to something in the X's. You tend to know about where X is, and often dictionaries have little tabs along the pages that give you an idea of where the breaks for certain words are; maybe it says Xa is here and Xe is there, and that lets you quickly eliminate all the surrounding noise in the dictionary and get straight to the region where the X words are. Then you can do a simple search from there, maybe a little bit of a linear or binary search, to zero in on the page where the word you want to find is.

This is the idea behind the hash table, based on the algorithmic technique of hashing. Rather than trying to figure out how to keep this very large collection sorted or organized in a way that lets you jump around within it, it says: how about we maintain a bunch of different collections? We break the data set up into little tiny pieces, and if we can tell, given a key, which piece of the data set it's supposed to be in, we can go straight there. We'll call those pieces buckets. If I had a whole bunch of students, I might put all the freshmen in this bucket and the sophomores in that bucket, and then when I look for a student I'd ask, "Which class are you?" You say, "Freshman." I look in the freshman bucket.

The idea in a simple case is that I have some criteria by which I can divide you into buckets that don't overlap, so everyone has a unique bucket to go in. And if I can make up enough of those categorizations, have enough buckets around to divide you up into, then I can hopefully keep the set that lands in each bucket pretty small. So maybe I do it by the first letter of your last name.
I could say everybody whose last name begins with W goes here, everybody whose name begins with B goes here, and so on. Then when somebody comes to me, I only have to look through one letter's worth. If you have a class of 200 students and 26 letters of the alphabet, then if they were evenly divided across those letters (they're not going to be, it turns out in practice, but as a thought experiment) there'd be about 4 people in a bucket, and it'd be pretty fast to look through a bucket and find the one I was looking for.

So let's take this idea. There's going to be some strategy for how we divide things into buckets, and we call that strategy the hash function. We have some number of buckets, and we'll label them with numbers, zero to the number of buckets minus one; if I have 99 buckets, I've got zero to 98. I take the key, in this case the string key we're using in the map, and I have a hash function which, given a key as input, produces a number as output in the range zero to B minus one. That number is called the hash value for that key, and it tells me which bucket to look in, either to find a match for a getValue, or to decide where to place a new entry for an add into the table.

So that's the basic idea. Let me talk a little bit about hash functions. Writing a really good hash function is actually pretty complicated, so we're going to think about it from a big-picture perspective and ask what qualities a hash function has to have. I'm not going to prove to you what mathematically constitutes a good hash function; I'm just going to give you some intuitive sense of what kinds of things matter when designing one.

So it's given a key and maps it to a number, typically zero to the number of buckets minus one. Often that means it computes something and then uses the mod operation to bring the result back into range. So it's going to take your number and say:
if you have five buckets to divide into, I'll give you a number zero to four.

It's very important that the function be stable: you put the key in, you get a number out, and if you put that same key in later, you get the same number out. It has to be a reliable, guaranteed mapping; it can't be random. That said, you want it to almost look as though it's random in terms of how it maps things to buckets. You don't want it to have a systematic plan where it puts a whole bunch of things in bucket one and very few in bucket two, or somehow always uses the even buckets and never the odd ones. You want a strategy that takes the keys and uses the whole bucket range evenly, so that if you put 100 keys into 100 buckets, you're hoping that, on average, each bucket holds one thing, not clusters where 25 things are in one bucket, ten in another, and a whole bunch of buckets are empty. You want it divided across the range.

Given that we have string input, likely what we're going to use is the characters in the string. How can I take these characters and jumble them up in a way that spreads them around the buckets, reliably, so the same string always comes back to the same place, but that doesn't produce a lot of artifacts, such as similar strings all mapping to the same bucket?

For example, here are some really simple things you could use. You could take just the first letter. I talked about this with last names: I could divide my students by the first letter of their last name. That would use exactly 26 buckets, no more, no less, and if I happened to have 26 buckets, then at least there's some chance of coverage across the range. There is likely to be clustering, since some letters are a lot more commonly used than others, so it would produce a bit of an artifact, but it would be reliable and it would use all 26 buckets.
If I had 100 buckets, this would be a crummy function, because it would use the first 26 buckets and none of the others if it's just taking that one letter and deciding. So in some situations this gets me a little of what I wanted.

What about using the length of the word? That's a very, very bad clustering function, because there'd be a bunch of strings that are six, seven, ten characters long, but very few that are one or two, and very few longer than ten or so. You'd have a very tight cluster in the middle, and it would not use a large bucket range effectively at all.

More likely, you want to use more of the information you have about the key. Instead of a single letter or the count of letters, try to use all the data at hand, which is to say all the letters. One way to do it is to take the ASCII values of the letters, so 'a' is 97, 'b' is 98, and for a word like "atoms" you'd add 97 plus 116 plus 111 and so on, add them all together, and you'd get some sum, 500, 600, whatever it comes to depending on how many letters there are. Then you use mod to bring it back into range. Say you had 100 buckets, so you want zero to 99: you take that sum and mod it by 100, which keeps the last two digits. So if the sum came to 524, it goes into bucket 24.

So if "act" is the word you have, you'd add these codes together, 97 for the 'a', 99 for the 'c', a hundred-something for the 't', and we end up with (I'm making this number up) 283; then we mod to keep the last two digits and say this one goes into bucket 83. And if we have "dog", maybe that adds to 267, and we mod that down to 67.
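As a sketch, that add-the-codes strategy looks something like this in C++ (sumHash is a name I've made up for illustration; it is not the function the lecture's map ends up using):

```cpp
#include <string>

// Naive hash: sum the ASCII codes of the characters, then use mod to
// bring the sum back into the bucket range [0, nBuckets).
int sumHash(const std::string &key, int nBuckets) {
    int sum = 0;
    for (size_t i = 0; i < key.length(); i++) {
        sum += key[i];   // e.g. 'a' contributes 97, 'c' 99, 't' 116
    }
    return sum % nBuckets;
}
```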
So this tries to use all the information at hand, all the letters we have, and bring them down into range. If you figure that across the space of all the sums produced here, you're just as likely to get 210 as 299 or any of the numbers in between, then hopefully you get a spread across your buckets.

There is one thing this clusters on. Does anybody see what words cluster in this system, words you'd think of as unrelated that would land in the same bucket? [Student: rearrangements of the same letters.] Yes, any anagram. Take "cat", which is an anagram of "act": it's going to add to the same sum, so there would be a bit of clustering in that respect. Anagrams come down to the same place. Words like "act" and "cat", which seem like different words to you, are going to collide in this system, and that might be something to worry about.

One way to avoid that, which the hash function we're eventually going to use will do, is to not only add the letters together but also multiply each letter by a power of a very large prime number: the large prime times 'c', plus the large prime squared times 'a', plus the large prime cubed times 't'. The position of each letter is then also encoded in the number that comes out, so something that's positionally different won't land in the same bucket without some other factors coming into play. So it's a little bit of taking the information and jumbling it up.

Let me tell you about collisions, and then I'll show you the data structure. I said there's going to be this problem where things collide: "act" and "cat", in this simple system of adding the codes, come out the same. So it can't be that each of these buckets holds exactly one thing.
With hashing, although we try to keep the buckets as small as possible, we have to account for the possibility that two things will collide: that two different keys, based on how the function jumbled them up, will land in the same bucket. We try to avoid that, but when it does happen we also have to have a strategy for it. We need a way of saying there can be more than one thing in a bucket, or some other way of deciding, when there's a collision, where to put the entry that was placed in the table second.

The strategy we're going to use is called chaining. Chaining says: once we have something in a bucket, like "act", if we hash something else and it comes to the same bucket, we just tack it onto that bucket, in this case using a linked list. So each bucket is actually going to be a linked list of the entries, and the entire table will be a vector or array of those.

So let me show you a little live action. This is a hash table that has seven buckets, numbered 0 through 6. They all start empty, so these are all empty linked lists, illustrated by the null. The hash function is shown here as a black box: you don't know the workings of the hash function, but you know its job, which is that you give it a key and it gives you a number in the range 0 to 6, and that same key, passed in again, gives you the same number.

So if I say 106B equals Julie, it pushes "106B" through the hash function and says, okay, "106B" comes out as one; you add those characters together, jumble them up, and you get one. So it goes into the table and says, that's bucket number one, make a new cell for 106B and Julie and stick it in the table. Now there's a non-empty chain in one place and the rest are empty. If I say 107 is being taught by Jerry, 107 happens to get the same hash value.
107 and 106B are different letters, but it just happened to come out that way: it hashes to one, and so in this case it just tacks it onto the front of that list. If I add 106A being taught by Patrick, "106A" happens to hash to zero; that's all we know about the black box, and it ends up loading it into the zero slot. If I add another course being taught by Nick, it also comes out to be zero, and so it pushes onto the front of that list. I add 103 being taught by Maron and it lands in bucket five, and as I type things in, I'm seeing that entries go to different places in the table. A little bit of clustering at the top, a little bit of empty at the bottom, but as I continued to type in strings for other course numbers, I'd hopefully fill out the table to a roughly even spread.

When I want to look something up, because I want to see who's teaching 106B, it uses the same process in reverse. It asks the hash function, which bucket should I look in? And it says, look in bucket one, because 106B's hash code is one. All the other buckets are eliminated from consideration, a very quick narrowing down to the right place, and then it just does a linked-list traversal to find 106B in the bucket where it has to be.

The key to realizing what makes this so magical is that the choice of the number of buckets is up to me. I can pick the number of buckets to be whatever I want: five, ten, 1,000, a million. If I choose the number of buckets to be roughly the same order of magnitude as the number of entries, say I want to put a thousand things in my table, so I make a thousand buckets to hold them, and my hash function divides things pretty equally across those thousand buckets, then I can expect each of those linked lists to be about length one.
Some will be two, some will be zero, but if it's doing its job right, I have a bunch of little tiny buckets, each of which holds a tiny fraction of the data set, in this case one or two entries, and it's very easy to search through them. If I know I'm going to put a million things in, then I make a million buckets, and to find one out of a million things I run it through the hash function, it gives me a number, say 862,444, I go to that bucket, and the item is either right there or I have to look a little bit to find it, but there's no other place in the table I have to look. The fact that there are a million other entries nearby in those other buckets is completely unimportant to us. The searching and the updating, adding, replacing, and removing all happen at the bucket level, and by choosing the total number of buckets to be roughly equal to the number of entries, I get constant-time access to the tiny subset that matters. That's pretty cool.

Let's write some code; it's good to see the whole thing doing what it needs to do. So I'm going to go back over here. I've got my map, which has add and getValue and not much else, just set up and ready to go. What I'm going to build here is a linked-list cell that holds a string key, a ValType value, and a pointer to the next cell; I plan on having one of these cells for each entry that goes into the table. Then I'm going to have an array of those, and I'm going to make a constant for its size, and then I'll talk a little bit about the consequences of that. I'll make it 99, so I have 99 buckets to start with. That's not going to solve all my problems; it means the table is tuned for maps that expect to hold about a hundred things. So I have an array here, and deciphering this declaration is a little bit complicated.
It says that buckets is the name of the variable, and it is an array of NumBuckets length, a fixed-size, 99-entry array. Each of those entries is a pointer to a cell, ready to be the head pointer of a linked list for us.

I'm going to go over here and work on the constructor first. The first thing to do is make sure those pointers have good values. If I do not do this, they will be junk and there will be badness, so I make sure each of them starts off as an empty list. Correspondingly, in the destructor I would need to delete all the list cells; I've been lazy about doing that, and I'm going to continue to be lazy, just noting that I'd have to iterate through and delete all the cells in each of those lists.

Then for add and getValue, what we need to do in both cases is figure out the appropriate list to operate on: run the hash function on the key, see where it tells us to look, and then search that list, either to find the match for getValue and return the matching value, or in the add case, to find the matching entry to overwrite if it exists and, if not, to create a new cell and tack it onto the front of the list.

So let's write helper functions, because both operations need the same two things. I'm going to put a hash function in here that, given a key and a number of buckets, returns the right bucket to look in, and I'm going to write a findCell function that, given a key and the pointer to a list, finds the cell that matches that key, or returns null if there is no match.
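Here is a minimal sketch of the declarations just described. The names (MyMap, ValType, cellT, NumBuckets, buckets) follow the lecture's wording, but the exact code is my reconstruction, not a copy of the course source:

```cpp
#include <cstddef>
#include <string>

template <typename ValType>
class MyMap {
public:
    MyMap();
    void add(std::string key, ValType value);
    ValType getValue(std::string key);

private:
    static const int NumBuckets = 99;  // tuned for maps of roughly 100 entries

    struct cellT {              // one cell per entry in the table
        std::string key;
        ValType value;
        cellT *next;            // next cell in this bucket's chain
    };

    cellT *buckets[NumBuckets]; // each slot heads one linked list

    int hash(std::string key, int nBuckets);
    cellT *findCell(std::string key, cellT *list);
};

// Constructor: every bucket starts as an empty (null) list. Skipping
// this would leave junk pointers in the array.
template <typename ValType>
MyMap<ValType>::MyMap() {
    for (int i = 0; i < NumBuckets; i++) {
        buckets[i] = NULL;
    }
}
```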
Let me put my hash function in. I have a good hash function that I'll pull out in a minute, but first I'm going to make an interesting choice: I'm going to have my hash function just return zero, which seems pretty surprising. Let me talk about why. When I'm first building this, that's actually an easy way to test that my code is working without being dependent on the behavior of the hash function. At this point it says, we'll just put everything in bucket zero. That's not a good strategy for long-term efficiency, but it is a good way to test that my handling of the buckets, and the searching, and the finding, and the inserting is correct. The performance I should expect to be abysmal: totally linear in the number of elements to add or get them, because they'll all be in one linked list with no quick access to them. Eventually I can change this to divide across the buckets and get a better spread, but the correctness of the hash table isn't affected by this choice. It's an interesting way to think about designing the code.

The other helper down here is findCell which, given the head pointer to a list and the key, goes and finds the match. It's basically: if this current cell's key equals the key, return it, and if we've gone through the whole list without finding it, return null.

This function returns a cellT *, the matching cell within the list, and that provokes the little C++ compilation snafu we encountered on the binary search tree as well: cellT is a type declared in the private section, within the scope of MyMap. Outside of that scope, which is to say before the compiler has crossed the MyMap<ValType>:: part of the definition, it doesn't yet believe we're in MyMap's scope; it hasn't seen that and can't anticipate it, so I have to give the full name with the scope specifier on it. And because this use of a template's nested type outside of its class is a special case for the C++ compiler, the typename keyword also needs to go there.

Okay, so that's the heavy lifting of these two helpers: hashing to the right place, which I've short-changed for now, and searching to find a cell.
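A sketch of those two helpers, with the placeholder hash and the typename-qualified return type the lecture describes (again my reconstruction of the code on screen):

```cpp
// Placeholder hash for testing: everything lands in bucket 0, so the
// table degenerates to one long chain. Correctness is unaffected;
// only performance suffers.
template <typename ValType>
int MyMap<ValType>::hash(std::string key, int nBuckets) {
    return 0;
}

// Walk the chain and return the cell whose key matches, or NULL if
// none does. The return type must be written with the full scope and
// the typename keyword, because cellT is nested inside the template.
template <typename ValType>
typename MyMap<ValType>::cellT *MyMap<ValType>::findCell(std::string key,
                                                         cellT *list) {
    for (cellT *cur = list; cur != NULL; cur = cur->next) {
        if (cur->key == key) return cur;
    }
    return NULL;
}
```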
If I go back up to getValue: I call the hash function on my key and the number of buckets to find out which bucket to work on, then I look for a match by calling findCell with the key on that bucket's list. If match does not equal null, I return the match's value; otherwise we report "no such key found." That's the error case for getValue: if it doesn't find the key, its behavior is to raise an error so the caller knows they've asked for something it can't deal with.

The key point is that the hash is some constant-time operation that jumbles the key up (in this version it just returns zero, but the real one will actually look at the elements of the key), and then it does its traversal down that linked list. Given a correctly working hash function, the entries should be divided up into these little tiny chains, so we find the one we want quickly by looking through a short list. If we find it, we've got it; otherwise we're in the error case.

Add looks almost exactly the same. I start by getting the bucket and doing the findCell. If it did find a match, we overwrite the value. In the second case, where we didn't find it, we make ourselves a new cell and copy in the key and the value. The easiest place to add something to a linked list is the front; no need to make it hard on ourselves. We're not keeping the cells in any particular order whatsoever, they're just in order of entry, so if somebody gives us a new one, we might as well stick it in the front. The new cell's next pointer points at the current front of the bucket's chain, and then we update the bucket pointer to point to this guy.

So I think we're in okay shape here. Let me make sure I like what I did: hashing, finding the match, and if it found an existing cell, that's the overwrite case, we just overwrite its value; otherwise we make a new cell and attach it as the front of the bucket.
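Filling those in, the two public operations might read like this; the lecture's version raises the error through the course library on a missing key, for which I've substituted a standard exception so the sketch stands alone:

```cpp
#include <stdexcept>

// getValue: hash to the right chain, search it, and return the value.
template <typename ValType>
ValType MyMap<ValType>::getValue(std::string key) {
    int whichBucket = hash(key, NumBuckets);
    cellT *match = findCell(key, buckets[whichBucket]);
    if (match != NULL) return match->value;
    throw std::runtime_error("getValue: no such key found");
}

// add: overwrite the value if the key already exists, otherwise
// prepend a new cell to the front of the bucket's chain.
template <typename ValType>
void MyMap<ValType>::add(std::string key, ValType value) {
    int whichBucket = hash(key, NumBuckets);
    cellT *match = findCell(key, buckets[whichBucket]);
    if (match != NULL) {
        match->value = value;              // existing key: overwrite
    } else {
        cellT *cell = new cellT;           // new key: fresh cell
        cell->key = key;
        cell->value = value;
        cell->next = buckets[whichBucket]; // link in front of the old head
        buckets[whichBucket] = cell;
    }
}
```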
For example, it's good to trace the case where the bucket is totally empty, since that's the most common case when we start. If we hash to bucket zero and search a null linked list, we get null out, so we create a new cell and set its next field to what the bucket's head pointer currently is, which is null. Then we update the bucket to point to this guy, so we have a single list cell with null trailing it. That's inserting into the empty-bucket case. I like it.

Let's see, I have a little code over here that adds some things, car, car, cat, with some numbers, and then tries to retrieve the value of car, which was written with a three and then a four on top of it, so the answer we should get out of this is four. It's also interesting to ask for something we know is not in the table, to see that our error case gets handled correctly: no such key found.

If I continued, I could do a little stress testing to see the performance implications of that return-zero hash. If I sat there and put tons and tons of things in, made up strings like crazy and stuffed them in with numbers, I would just be creating one long chain off the zero bucket, ignoring all the other ones. That stress-tests my linked-list handling of adding and searching, but it would also show that as we get to 100 entries, 200 entries, 300 entries, subsequent adds and getValues show a linear increase in time because of the cluster we've got going there.
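A tiny driver along the lines of that test; the keys and values come from the walkthrough above, and the printing is my own framing:

```cpp
#include <iostream>

int main() {
    MyMap<int> map;
    map.add("car", 3);
    map.add("car", 4);   // second add overwrites the 3
    map.add("cat", 1);
    std::cout << map.getValue("car") << std::endl;   // prints 4
    // map.getValue("dog") would hit the "no such key found" error case
    return 0;
}
```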
So let me give you a real hash function. I'll talk you through what it's doing, and then I'll leave you with the thought that writing one of these correctly and reliably is a more advanced exercise than we're going to take on here. This hash function is taken from Don Knuth's The Art of Computer Programming, work that guides a lot of computer scientists to this day, and it has the strategy I told you about: for each character in the string, it adds the character in, having first multiplied the hash code built up so far by a large multiplier, which in effect encodes the positional information.

The multiplier happens to be a very large negative number that's prime. The first time through, it takes the hash code of zero, multiplies it, and adds the first character, so the first character gets added in without any modification. For the next character, it takes the previous result, multiplies it by one power of the multiplier, and adds the character in; then that sum, on the subsequent iteration, gets multiplied by the multiplier again, so as it works down the string the earlier characters get raised to the squared, the cubed, the fourth powers.

The effect is that it overflows a lot. That's a very large number, and it's adding and multiplying in this very large space, so we're counting on the fact that the addition and multiplication built into C++ don't raise any errors when the numbers get too large to represent; the result just truncates down to what fits, wrapping around in a modular way, like a ring. So we end up making a very big number out of this (the longer the string, the more multiplies and the higher the powers), truncated to what fits in a long, and when we're done, we take whatever number fell out of that process and mod it by the number of buckets to get a number from zero to buckets minus one.

And it's reliable in the sense that every time you put the same key in, you get the same number back.
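A reconstruction of that function as I recall it from the course reader; the multiplier constant is the published one, but treat the details as a sketch. I've used an unsigned accumulator so the wraparound on overflow is well-defined C++ rather than undefined signed overflow. Swapping this in for the return-zero placeholder changes nothing visible except speed:

```cpp
// Knuth-style string hash: fold each character in after multiplying the
// running code by a large prime, so letter position matters and anagrams
// like "act"/"cat" no longer collide. Overflow simply wraps, ring-like.
template <typename ValType>
int MyMap<ValType>::hash(std::string s, int nBuckets) {
    const long MULTIPLIER = -1664117991L;  // large prime; L marks a long literal
    unsigned long hashcode = 0;
    for (size_t i = 0; i < s.length(); i++) {
        hashcode = hashcode * MULTIPLIER + s[i];
    }
    return hashcode % nBuckets;            // bring it into [0, nBuckets)
}
```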
We know the result will be in the range of the number of buckets because of that last mod call, and we use it to decide which bucket to look in when we're ready to do our work. If I switch that in for my old return-zero hash function, I shouldn't see any change from the outside, other than that the performance should suddenly be a lot faster, which is hard to tell when I have three things in there. I'll take out that error case. Now car and cat are probably not going to the same bucket; in fact I can guarantee they don't go in the same bucket, whereas before they were overlapping.

I'll leave that code there for a second, although really I'd say the point of hashing is not to worry about exactly how the hash code does its work, but to understand the general idea of what a hashing function has to do, what its role is, and how it relates to getting this constant-time performance.

So let me give you an example of hashing in the real world; I think it's really interesting to see it happen where people don't even realize how useful hashing is. I ordered something from REI, the camping store, and they didn't have it in stock, so they had to place an order for it. This was a while ago, but when I called to see if it was in, they didn't ask me my name. I thought this was very funny. I'm like, "I'm Julie, I want to know about my order," and they say, "Okay, what are the last two digits of your phone number?" It's not one of those things you can answer right off the top of your head. Okay, 21; it turns out the last two digits of my phone number are 21. So they go and look in the 21 basket, and only then do they ask, "What's your name?"
I'm like, "Okay, my name's Julie." They look it up: "Okay, it's here." So I go to get it a couple of days later, and I'm standing in line waiting, and I look up on the wall, and they have 100 little wire baskets up there, labeled 0 through 9 one way and 0 through 9 the other. When I come in they ask me the same question: "What are the last two digits of your phone number?" Now I'm all prepared, because I've been primed on this question. I say, "21." They walk over to the 21 basket, one digit picking the row and the other the column, and there's only one piece of paper in there; they pull out my order slip, they get me my tent, and everybody's happy.

While I'm there, I try to engage the cashier on how this is an example of a real-world hashing system. I still got my tent, I'll have you know, but it was close; they're like, "Okay, yeah, you crazy crackpot, you can leave now." But it's a very good example of the investment in the number of buckets versus the trade-off it gives you. Because this is a very physical setup, choosing a larger number of buckets gives you more fine-grained access, but the investment in buckets means you're permanently installing them on that wall. They didn't use just the last digit of my phone number; they could have, and they'd have had only ten baskets on the wall, but then they'd expect ten times as many things in each basket. Apparently their estimate was that the number of active orders in this backorder category was significantly more than ten but not much more than 100, so 100 baskets was the right investment in their bucket strategy. And they didn't go to three digits, because what are they going to do, build some big 3D arrangement of a thousand baskets? They didn't enjoy this discussion, by the way; I went on for hours and they were really not amused.
With three digits you'd get an even greater likelihood that each basket would be empty or hold a very small number of things, but given their setup, 100 baskets seemed to be the right strategy. And notice: they didn't ask me for the first two digits of my phone number, they asked for the last two. Why does that matter? [Student: because you're using all (inaudible).] Yeah, the last digits vary a lot more. If you asked Stanford students back when campus numbers all started with 49 or 72 or something, then using the first two digits, everybody would be in the 49 basket or the 72 basket and a whole bunch of other baskets would go unused. For example, the first two digits are never 00; phone numbers never start with zero, and they rarely have zeros or ones in the leading digits at all, since those are reserved for things like area codes. So a whole bunch of baskets would never even get used on the first two digits. I thought it was a really interesting exercise: they had exactly thought through hashing. Of course, they did not think it was hashing, and they thought I was a crackpot, and they didn't want to give me my stuff, but I paid my money and they handed it over. It shows you what you're trying to do here: make the number of buckets roughly equal to the number of elements, so that if your hash function divides things up across your whole world, you've got constant access to what you have.

[Student: Why is there an L at the end of one number?] That's just C++ notation: it's a long value. Without it, the compiler assumes the literal is an int, and that number is too big to fit in an int; a long is twice as big in terms of space, so the L needs to be there, otherwise it would try to read it as an int and throw away part of the information I was trying to give it. It's the way of identifying a long constant.
00:34:35.000 --> 00:34:37.000 So let me talk about this just a little bit more 00:34:37.000 --> 00:34:39.000 in terms of 00:34:39.000 --> 00:34:41.000 00:34:41.000 --> 00:34:43.000 actual performance, right. We've got N 00:34:43.000 --> 00:34:46.000 over B, right, so if the distribution is even, right, the number of entries over 00:34:46.000 --> 00:34:48.000 the number of buckets, 00:34:48.000 --> 00:34:48.000 00:34:48.000 --> 00:34:51.000 if we make them kind of on the same order of 00:34:51.000 --> 00:34:52.000 magnitude, kind of in the same range, 00:34:52.000 --> 00:34:55.000 that ratio stays small; doing this does mean making a little bit of an estimated guess about 00:34:55.000 --> 00:34:58.000 where we're going. 00:34:58.000 --> 00:35:01.000 And then there are some choices here in how we store each bucket, right. We could use 00:35:01.000 --> 00:35:05.000 a little array, we could use a linked list, right, we could use a vector. 00:35:05.000 --> 00:35:07.000 We expect this to be very small, is the truth. 00:35:07.000 --> 00:35:09.000 We're hoping that it's gonna be one or two, 00:35:09.000 --> 00:35:13.000 and so there's not a lot of reason to buy any real expensive structure or even bother 00:35:13.000 --> 00:35:17.000 doing things like sorting. Like, you could sort them, but if you expect there to be two 00:35:17.000 --> 00:35:19.000 things, what's the point? You're gonna spend more time 00:35:19.000 --> 00:35:21.000 figuring out whether to put it in the front or the back 00:35:21.000 --> 00:35:25.000 than you would gain at all by being able to 00:35:25.000 --> 00:35:27.000 find it a little faster. 00:35:27.000 --> 00:35:28.000 So the linked list 00:35:28.000 --> 00:35:32.000 is just gonna be the easiest way to get it allocated without over-capacity, 00:35:32.000 --> 00:35:35.000 excess capacity that's unused in the bucket. You know it's gonna be small; use the 00:35:35.000 --> 00:35:37.000 simple strategy. 00:35:37.000 --> 00:35:41.000 There's also, though, an important thing that the hash table is likely to do in kind of an 00:35:41.000 --> 00:35:44.000 industrial-strength situation: it does have to deal with this idea, well, what if that number of 00:35:44.000 --> 00:35:47.000 buckets that you predicted was wrong? 00:35:47.000 --> 00:35:51.000 And so the map as we have given it to you, right, has to absorb this, because 00:35:51.000 --> 00:35:54.000 it doesn't know in advance how many things you're gonna put in. 00:35:54.000 --> 00:35:57.000 The only way to know that is to wait and see as things get added, 00:35:57.000 --> 00:35:59.000 and then realize your buckets are getting full. 00:35:59.000 --> 00:36:02.000 So typically the industrial-strength version of the hash table is gonna 00:36:02.000 --> 00:36:06.000 track what's called the load factor. And the load factor is just the number of total 00:36:06.000 --> 00:36:08.000 entries divided by the number of buckets. 00:36:08.000 --> 00:36:09.000 When that 00:36:09.000 --> 00:36:13.000 factor gets high, and high is actually quite small, in fact, if 00:36:13.000 --> 00:36:15.000 it's two, that's often the trigger for 00:36:15.000 --> 00:36:19.000 causing a rehash situation, so 00:36:19.000 --> 00:36:22.000 not even letting it get to be three or four.
Just saying, as soon as you realize that 00:36:22.000 --> 00:36:22.000 you have 00:36:22.000 --> 00:36:26.000 exceeded by double the capacity you intended, you planned for 100 things, 00:36:26.000 --> 00:36:27.000 you got 200 things, 00:36:27.000 --> 00:36:28.000 go ahead 00:36:28.000 --> 00:36:31.000 and redo your whole bucket array. 00:36:31.000 --> 00:36:39.000 So in the case, for example, of our simple 0 through 7 case, right, maybe 00:36:39.000 --> 00:36:43.000 it's 0 through 5. One, two, three, 00:36:43.000 --> 00:36:45.000 four. So I have one that had six buckets, right, 00:36:45.000 --> 00:36:48.000 and then I've gotten to where 00:36:48.000 --> 00:36:50.000 each of them has 00:36:50.000 --> 00:36:54.000 two, or maybe three in some and one in another. 00:36:54.000 --> 00:36:57.000 The plan is to just take this whole bucket array and 00:36:57.000 --> 00:37:01.000 enlarge it by a factor of two, so grow it, in this case, from 6 to 12, 00:37:01.000 --> 00:37:02.000 and then rehash 00:37:02.000 --> 00:37:04.000 to move everybody. The 00:37:04.000 --> 00:37:08.000 idea is that the earlier decision about where to place them was based on knowing that 00:37:08.000 --> 00:37:11.000 there were exactly six buckets, and so where it was placed before 00:37:11.000 --> 00:37:15.000 is not likely to be the place it will land if you have, you know, 00:37:15.000 --> 00:37:18.000 12 choices to lay them down into. 00:37:18.000 --> 00:37:21.000 And ideally, in fact, you don't even want that to be the case. Like, if they all landed into the 00:37:21.000 --> 00:37:23.000 same buckets, right, you would have 00:37:23.000 --> 00:37:25.000 the same clustering and then a big empty clump, 00:37:25.000 --> 00:37:28.000 and so you rehash everything, run it through the new hash function, having 00:37:28.000 --> 00:37:31.000 changed the number of buckets, and say, ''Well, now where does it go in the hashing scheme?'' 00:37:31.000 --> 00:37:34.000 And hopefully you'd end up getting a clean table, right, where you had 12 00:37:34.000 --> 00:37:37.000 items now with one in each bucket, 00:37:37.000 --> 00:37:40.000 ready to kinda give constant performance through 00:37:40.000 --> 00:37:42.000 that, and then potentially, 00:37:42.000 --> 00:37:43.000 again if you overload 00:37:43.000 --> 00:37:47.000 your load factor, go back in and rearrange stuff again. 00:37:47.000 --> 00:37:50.000 So it's using a strategy a little bit like the one we talked about with the binary search tree: just 00:37:50.000 --> 00:37:51.000 wait and see. 00:37:51.000 --> 00:37:55.000 Just because you get two items in a bucket, it doesn't immediately cause any 00:37:55.000 --> 00:37:57.000 real alarm or crisis; 00:37:57.000 --> 00:38:00.000 it waits until there's kinda sufficiently clear demand 00:38:00.000 --> 00:38:03.000 that exceeds the capacity planned for 00:38:03.000 --> 00:38:05.000 to do a big rearrangement. 00:38:05.000 --> 00:38:08.000 So you'd end up seeing that adds would be fast, fast, fast, fast, fast.
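To make the load-factor trigger and the expand-and-rehash step just described concrete, here is a minimal sketch of a chained hash table. The names (CharMap, Cell, kMaxLoadFactor) and details are illustrative assumptions, not the lecture's actual Map code:

    #include <string>
    #include <vector>
    using namespace std;

    // Minimal sketch of a chained hash table that tracks its load factor.
    class CharMap {
    public:
        CharMap(int numBuckets = 6) : buckets(numBuckets, nullptr), numEntries(0) {}

        void put(const string& key, int value) {
            int b = hashString(key, buckets.size());
            for (Cell* c = buckets[b]; c != nullptr; c = c->next) {
                if (c->key == key) { c->value = value; return; }   // overwrite duplicates
            }
            buckets[b] = new Cell{key, value, buckets[b]};   // prepend; order within a bucket is irrelevant
            numEntries++;
            if (numEntries > kMaxLoadFactor * (int)buckets.size()) {
                expandAndRehash();                           // load factor passed 2: grow now
            }
        }

    private:
        struct Cell { string key; int value; Cell* next; };
        static const int kMaxLoadFactor = 2;
        vector<Cell*> buckets;
        int numEntries;

        static int hashString(const string& s, int numBuckets) {
            unsigned long code = 0;
            for (size_t i = 0; i < s.length(); i++) code = code * 31 + s[i];
            return (int)(code % numBuckets);
        }

        void expandAndRehash() {
            vector<Cell*> old = buckets;
            buckets.assign(old.size() * 2, nullptr);         // double the bucket array
            for (size_t i = 0; i < old.size(); i++) {
                for (Cell* cur = old[i]; cur != nullptr; ) {
                    Cell* next = cur->next;
                    int b = hashString(cur->key, buckets.size());  // re-run hash with new count
                    cur->next = buckets[b];                        // splice cell into its new bucket
                    buckets[b] = cur;
                    cur = next;
                }
            }
        }
    };

The key step is that the rehash re-runs the hash function with the new bucket count, which is exactly why entries migrate to different homes.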
00:38:08.000 --> 00:38:12.000 Starting to get just a hair slower, right, having to do a search through a few things as opposed 00:38:12.000 --> 00:38:13.000 to one, 00:38:13.000 --> 00:38:16.000 and then every now and then, right, you'd see an add which caused a big reshuffle of 00:38:16.000 --> 00:38:19.000 the table, which would be a totally linear operation in the number of 00:38:19.000 --> 00:38:20.000 elements, 00:38:20.000 --> 00:38:22.000 and then they'd be fast again 00:38:22.000 --> 00:38:24.000 for another big clump of inserts until that 00:38:24.000 --> 00:38:27.000 same expansion might be repeated, triggered by 00:38:27.000 --> 00:38:28.000 it growing even more. 00:38:28.000 --> 00:38:31.000 But dividing it sorta by the total number of inserts, so if you had 1,000 inserts 00:38:31.000 --> 00:38:35.000 before you overloaded, and then you had to copy all 1,000 things to get ready for the 00:38:35.000 --> 00:38:36.000 next thousand, 00:38:36.000 --> 00:38:39.000 if you average that out, it still ends up 00:38:39.000 --> 00:38:43.000 being what's called a constant operation. So 00:38:43.000 --> 00:38:46.000 that's pretty neat. 00:38:46.000 --> 00:38:47.000 It's 00:38:47.000 --> 00:38:50.000 really one of the, sort of, more, I think, 00:38:50.000 --> 00:38:53.000 easily understood and beautiful algorithmic techniques, right, for solving, right, 00:38:53.000 --> 00:38:55.000 this hard problem of 00:38:55.000 --> 00:38:57.000 search and retrieval 00:38:57.000 --> 00:38:58.000 that 00:38:58.000 --> 00:39:01.000 the vector and the sorting and the BST are all trying to get at. But sometimes the 00:39:01.000 --> 00:39:04.000 sorted vector and the BST are solving actually a much harder problem, right, 00:39:04.000 --> 00:39:09.000 which is keeping a total ordering of all the data, and being able to 00:39:09.000 --> 00:39:12.000 traverse it in sorted order is one of the advantages that the sorted vector 00:39:12.000 --> 00:39:12.000 and 00:39:12.000 --> 00:39:14.000 the BST have that the hash table doesn't have at all. 00:39:14.000 --> 00:39:17.000 In fact, it's jumbled it all up, and so if you think about how the iterator for 00:39:17.000 --> 00:39:21.000 the map works, our map is actually implemented using a hash table, 00:39:21.000 --> 00:39:24.000 it actually just iterates over the buckets' linked lists, and that explains why it almost appears 00:39:24.000 --> 00:39:27.000 completely random. If you put a bunch of things in the table and you go to iterate 00:39:27.000 --> 00:39:28.000 and 00:39:28.000 --> 00:39:29.000 visit them again, 00:39:29.000 --> 00:39:33.000 the order you visit the keys seems to make no sense, and that's because it's based on 00:39:33.000 --> 00:39:35.000 their hash codes, right, which 00:39:35.000 --> 00:39:36.000 aren't designed 00:39:36.000 --> 00:39:37.000 to be 00:39:37.000 --> 00:39:39.000 any real sensible ordering 00:39:39.000 --> 00:39:42.000 to anybody not in the know of how the hash function works, 00:39:42.000 --> 00:39:46.000 whereas iterating over a vector that's sorted, or iterating over a BST 00:39:46.000 --> 00:39:47.000 or, 00:39:47.000 --> 00:39:51.000 in this case, our set, for example, which is backed by a binary search tree, 00:39:51.000 --> 00:39:54.000 will give you that sorted order.
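To put numbers to that averaging argument (the figures are illustrative, not from the lecture):

    1,000 fast inserts at ~1 unit of work each  =  1,000 units
    one rehash that moves all 1,000 entries     =  1,000 units
    total                                       =  2,000 units over 1,000 inserts
                                                =  ~2 units per insert

Every doubling pays for itself the same way, so the average cost of an insert, the amortized cost, stays constant no matter how large the table grows.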
So it solves this harder problem, right, which causes there to 00:39:54.000 --> 00:39:57.000 be kinda more investment in that problem, whereas hashing solves just exactly the 00:39:57.000 --> 00:39:58.000 one problem, which is: 00:39:58.000 --> 00:40:02.000 I want to be able to find exactly this value again 00:40:02.000 --> 00:40:03.000 and update this value, 00:40:03.000 --> 00:40:08.000 and nothing more is needed than just identifying this, 00:40:08.000 --> 00:40:10.000 not finding the things that are less than this. For example, if you needed to find all 00:40:10.000 --> 00:40:13.000 the values that were less than this key alphabetically, 00:40:13.000 --> 00:40:15.000 right, the hash table makes that 00:40:15.000 --> 00:40:16.000 no easier 00:40:16.000 --> 00:40:18.000 than an unsorted vector, 00:40:18.000 --> 00:40:21.000 whereas things like a sorted vector and binary search tree actually enable that kinda search: 00:40:21.000 --> 00:40:24.000 you could find the place in the tree and kind of work your way down the left side 00:40:24.000 --> 00:40:27.000 to find the things that were less than that. 00:40:27.000 --> 00:40:31.000 That's not actually being solved by the hash at all. Its use of 00:40:31.000 --> 00:40:34.000 memory is actually comparable to a 00:40:34.000 --> 00:40:35.000 binary search tree, 00:40:35.000 --> 00:40:38.000 in that it has a four-byte pointer per entry, which is the next field on the 00:40:38.000 --> 00:40:40.000 linked list chain, and 00:40:40.000 --> 00:40:44.000 then there's a four-byte pointer for each bucket, the head of the linked list 00:40:44.000 --> 00:40:47.000 cell. Given that we expect the number of buckets to be roughly equal to the number of entries, 00:40:47.000 --> 00:40:51.000 we can kinda just summarize that as: each entry 00:40:51.000 --> 00:40:53.000 represents an eight-byte overhead, 00:40:53.000 --> 00:40:56.000 which is the same eight-byte overhead, right, that the left and right pointers of 00:40:56.000 --> 00:40:58.000 the binary search 00:40:58.000 --> 00:40:59.000 tree add. 00:40:59.000 --> 00:41:01.000 Does it have any degenerate cases? That's a good 00:41:01.000 --> 00:41:06.000 question. So one of the things that made the binary search tree a little bit of a 00:41:06.000 --> 00:41:08.000 heavy investment was that to get that 00:41:08.000 --> 00:41:11.000 balancing behavior, even when we're getting data coming in in a way that's 00:41:11.000 --> 00:41:16.000 causing a lopsided construction, we have to go out of our way to do something special. 00:41:16.000 --> 00:41:19.000 What is the degenerate case for hash? Is there something that causes it to behave really, 00:41:19.000 --> 00:41:24.000 really badly? Anyone wanna help me out with 00:41:24.000 --> 00:41:28.000 that? Is it always good? Does it always give a good spread? Can we 00:41:28.000 --> 00:41:32.000 end up with everything in the same bucket somehow? [Inaudible] repeated it that way? 00:41:32.000 --> 00:41:35.000 So if you have repeated elements, the way that 00:41:35.000 --> 00:41:37.000 both versions of the map work is they overwrite. 00:41:37.000 --> 00:41:41.000 So actually the nice thing is, yeah, if you did it again, you'd just take that same cell and 00:41:41.000 --> 00:41:43.000 overwrite it, so in fact we wouldn't get a clustering based on the same entry 00:41:43.000 --> 00:41:45.000 going in and out. 00:41:45.000 --> 00:41:51.000 But how would we end up with everything being in the same bucket?
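One concrete way that can happen, anticipating the answer that follows, is a hash function that ignores its input entirely. A hypothetical sketch:

    #include <string>
    using namespace std;

    // The degenerate case: every key lands in bucket 0, the whole table
    // collapses into one long chain, and lookup becomes linear search.
    int dumbHash(const string& s, int numBuckets) {
        return 0;
    }

    // Better, but still careless: it uses every character, yet anagrams like
    // "cat" and "act" always collide because addition ignores position.
    int sumHash(const string& s, int numBuckets) {
        unsigned int total = 0;
        for (size_t i = 0; i < s.length(); i++) total += (unsigned char)s[i];
        return (int)(total % numBuckets);
    }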
00:41:51.000 --> 00:41:54.000 It mostly comes down to the hash function. If you look at your hash function, when would a hash function collide? 00:41:54.000 --> 00:41:58.000 Right, and if you have a dumb hash function, right, you can definitely have some degenerate 00:41:58.000 --> 00:42:01.000 cases, my dumb hash function of ''return zero,'' right, being the 00:42:01.000 --> 00:42:03.000 worst example of that, 00:42:03.000 --> 00:42:07.000 but any hash function, right, that wasn't being particularly clever about using all of its 00:42:07.000 --> 00:42:09.000 information might actually have some clustering in 00:42:09.000 --> 00:42:12.000 it. It is possible, for example, even with a particularly good hash function, that 00:42:12.000 --> 00:42:15.000 there are strings that will hash to the same thing, so if you somehow got 00:42:15.000 --> 00:42:18.000 really, really unlucky, 00:42:18.000 --> 00:42:19.000 or 00:42:19.000 --> 00:42:23.000 had a hash function that wasn't doing the best job possible, you could get some 00:42:23.000 --> 00:42:25.000 clustering. But in general, 00:42:25.000 --> 00:42:29.000 the sole responsibility, right, for the degenerate case comes down to your hash function. 00:42:29.000 --> 00:42:31.000 So as long as your hash function 00:42:31.000 --> 00:42:34.000 and your inputs match your expectation, 00:42:34.000 --> 00:42:35.000 you don't have any surprises about 00:42:35.000 --> 00:42:38.000 how they got inserted, the way you did in a binary search tree, which is sometimes hard 00:42:38.000 --> 00:42:40.000 to control. 00:42:40.000 --> 00:42:43.000 Just as you could underestimate the number of buckets you need ? Yeah. ? could 00:42:43.000 --> 00:42:46.000 you overestimate the number of buckets you 00:42:46.000 --> 00:42:51.000 need by a large amount ? Um hm. ? that's wasting a lot of ? Exactly. ? memory, and 00:42:51.000 --> 00:42:52.000 do you then go through and take 00:42:52.000 --> 00:42:54.000 that memory back? Yeah. 00:42:54.000 --> 00:42:57.000 Yeah, so that's a great question. So a lot of data structures 00:42:57.000 --> 00:42:58.000 don't shrink, 00:42:58.000 --> 00:43:02.000 right. For example, the vector often will only grow on demand, and then even if you deleted a 00:43:02.000 --> 00:43:04.000 bunch of, or removed a bunch of, elements, it doesn't 00:43:04.000 --> 00:43:06.000 necessarily consider shrinking a big deal. And so 00:43:06.000 --> 00:43:10.000 that's often, maybe, a little bit lazy, but it turns out, right, that having a bunch of memory around 00:43:10.000 --> 00:43:12.000 you're not using 00:43:12.000 --> 00:43:15.000 doesn't have nearly the same penalty as not 00:43:15.000 --> 00:43:17.000 having the memory when you need it and having to do a lot of extra work to find 00:43:17.000 --> 00:43:18.000 things. 00:43:18.000 --> 00:43:21.000 So typically, right, you would try to pick a size that you are 00:43:21.000 --> 00:43:26.000 willing to commit to, and say, ''Well, no matter what, I'll stay at this size. I won't get any smaller.'' 00:43:26.000 --> 00:43:28.000 00:43:28.000 --> 00:43:30.000 But as you grow, you might not shrink back to that size. You might just 00:43:30.000 --> 00:43:34.000 let the capacity kinda lie in waste. There are good 00:43:34.000 --> 00:43:37.000 reasons to actually be tidy about it, but again, some of this is hard to predict.
00:43:37.000 --> 00:43:40.000 It might be that your table kinda grows big and then a bunch of things get taken out, and then it grows 00:43:40.000 --> 00:43:41.000 big again. Like, maybe you 00:43:41.000 --> 00:43:43.000 have a new freshman class come in, 00:43:43.000 --> 00:43:45.000 and then you graduate a whole bunch of people. Then 00:43:45.000 --> 00:43:47.000 at the point where you graduate them all, you're about ready to take in a new freshman 00:43:47.000 --> 00:43:51.000 class, and so it might be that in fact, right, the capacity you just got rid of, you're gonna plan on 00:43:51.000 --> 00:43:53.000 reusing in a minute anyway, and so maybe 00:43:53.000 --> 00:43:56.000 clearing it out and totally releasing it 00:43:56.000 --> 00:43:57.000 may actually be an exercise 00:43:57.000 --> 00:44:01.000 that was unimportant. You're planning on getting back to that size eventually, anyway. 00:44:01.000 --> 00:44:04.000 The truth is, most hash tables tend to grow. Right, you tend to 00:44:04.000 --> 00:44:07.000 be collecting data in there, and 00:44:07.000 --> 00:44:11.000 they tend to only enlarge. It's kind of unusual that you take out a bunch of 00:44:11.000 --> 00:44:15.000 entries you already put in. 00:44:15.000 --> 00:44:18.000 And so I just put a little last note, which is like, okay, well, I had said earlier that when we 00:44:18.000 --> 00:44:19.000 have the map, 00:44:19.000 --> 00:44:22.000 the key had to be string type, and I was gonna at 00:44:22.000 --> 00:44:25.000 some point come back to this and talk about 00:44:25.000 --> 00:44:28.000 why that was the one restriction in all our template classes, right: you can put anything you 00:44:28.000 --> 00:44:31.000 want in set or vector or stack or queue. 00:44:31.000 --> 00:44:34.000 The map lets you store any kind of value associated with that key, but that key was 00:44:34.000 --> 00:44:36.000 string type, and 00:44:36.000 --> 00:44:40.000 so given how our map is implemented as a hash table, that actually starts to make sense: if 00:44:40.000 --> 00:44:42.000 you're gonna take the key and kind of jumble it up and map it to 00:44:42.000 --> 00:44:44.000 a bucket, 00:44:44.000 --> 00:44:47.000 you need to know something about how to extract information from the key, and in 00:44:47.000 --> 00:44:51.000 this case, knowing it's a string means you know it has characters, you know its length, and you can do 00:44:51.000 --> 00:44:52.000 those things. 00:44:52.000 --> 00:44:53.000 If you wanted to make a map 00:44:53.000 --> 00:44:57.000 that could take any kind of key, maybe integers as keys or student structures as 00:44:57.000 --> 00:44:59.000 keys or, you 00:44:59.000 --> 00:45:00.000 know, 00:45:00.000 --> 00:45:01.000 doubles as keys, 00:45:01.000 --> 00:45:05.000 any of these things, it's like, well, how can you write a generic form 00:45:05.000 --> 00:45:06.000 that could use 00:45:06.000 --> 00:45:08.000 a hashing operation on that 00:45:08.000 --> 00:45:11.000 kind of key and map it to bucket numbers? 00:45:11.000 --> 00:45:14.000 You would build a two-type template to kinda get this working in C++, so 00:45:14.000 --> 00:45:18.000 you can actually have a template that has a type name for the key type, a type name for 00:45:18.000 --> 00:45:19.000 the value type, 00:45:19.000 --> 00:45:20.000 and then 00:45:20.000 --> 00:45:25.000 in the member functions you'd refer to both key type and value type as these two 00:45:25.000 --> 00:45:26.000 distinct placeholders.
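A sketch of what such a two-type template might look like, folding in the client-supplied hash callback described just below; the names and details are hypothetical, not the actual CS106B Map interface:

    #include <string>
    #include <vector>
    using namespace std;

    // Hypothetical two-type map: KeyType and ValueType are the two distinct
    // placeholders, and the client supplies the hash callback, since only
    // the client knows how to digest its own key type.
    template <typename KeyType, typename ValueType>
    class MyMap {
    public:
        typedef int (*HashFn)(const KeyType& key, int numBuckets);

        MyMap(HashFn fn, int numBuckets = 99) : hash(fn), buckets(numBuckets) {}

        void put(const KeyType& key, const ValueType& value) {
            vector<Pair>& bucket = buckets[hash(key, buckets.size())];
            for (size_t i = 0; i < bucket.size(); i++) {
                if (bucket[i].key == key) { bucket[i].value = value; return; }  // overwrite duplicates
            }
            bucket.push_back(Pair(key, value));
        }

        ValueType* get(const KeyType& key) {
            vector<Pair>& bucket = buckets[hash(key, buckets.size())];
            for (size_t i = 0; i < bucket.size(); i++) {
                if (bucket[i].key == key) return &bucket[i].value;
            }
            return nullptr;                       // key not present
        }

    private:
        struct Pair {
            Pair(const KeyType& k, const ValueType& v) : key(k), value(v) {}
            KeyType key;
            ValueType value;
        };
        HashFn hash;
        vector< vector<Pair> > buckets;           // vector-of-vectors chaining, for brevity
    };

    // The client's side of the bargain: a hash for its own key type.
    int hashInt(const int& key, int numBuckets) {
        return (key % numBuckets + numBuckets) % numBuckets;   // stays valid for negative keys
    }

A client would then declare something like MyMap<int, vector<string> > scores(hashInt) for the exam-score example mentioned next.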
00:45:26.000 --> 00:45:29.000 And then when the client set it up, right, they'd be saying, ''Okay, I want a map 00:45:29.000 --> 00:45:32.000 that has strings that map to integers.'' So maybe this is 00:45:32.000 --> 00:45:36.000 words and the page they appear on in a document. Or I have a map of integers to vectors of 00:45:36.000 --> 00:45:40.000 strings; maybe this is the score on an exam and the names of the students who got that 00:45:40.000 --> 00:45:42.000 score, doing it that way. 00:45:42.000 --> 00:45:44.000 And to make this work, right, 00:45:44.000 --> 00:45:48.000 there has to be some kind of universal hashing that, given some generic type, 00:45:48.000 --> 00:45:53.000 can turn it into an integer in the range of buckets. 00:45:53.000 --> 00:45:53.000 The 00:45:53.000 --> 00:45:56.000 thing it's gonna require here is that the client get involved. 00:45:56.000 --> 00:45:58.000 You cannot write this 00:45:58.000 --> 00:46:00.000 generic 00:46:00.000 --> 00:46:01.000 hash function 00:46:01.000 --> 00:46:04.000 that will work for all types, right. If it's a student structure, what are you gonna look at? The 00:46:04.000 --> 00:46:08.000 ID field, the name field, or what? There's just no sensible way that you can 00:46:08.000 --> 00:46:10.000 talk about a generic type and talk about how to hash 00:46:10.000 --> 00:46:14.000 it, so what you would have to do is have some sort of client callback. So probably you would 00:46:14.000 --> 00:46:18.000 create this by passing in a callback that, given a key-type thing and a number of 00:46:18.000 --> 00:46:20.000 buckets, would 00:46:20.000 --> 00:46:24.000 do the necessary machinations, right, to hash 00:46:24.000 --> 00:46:26.000 it into the right number. 00:46:26.000 --> 00:46:30.000 And so that would be kinda one of those coordination things between the client and the implementer: 00:46:30.000 --> 00:46:34.000 the client knows what the data is it's trying to store, and kinda how to 00:46:34.000 --> 00:46:35.000 00:46:35.000 --> 00:46:41.000 mash it up, and then the map would know, okay, given that number, where to stick it. So 00:46:41.000 --> 00:46:42.000 that's what it would take, and then, 00:46:42.000 --> 00:46:45.000 rather than sorta load that onto your 00:46:45.000 --> 00:46:48.000 novice heads, we went ahead and just said, ''Okay, it'll always be a string, so that we can 00:46:48.000 --> 00:46:53.000 do a hash internally without you having to get involved.'' All right, 00:46:53.000 --> 00:46:55.000 any questions about hashing? I'm gonna 00:46:55.000 --> 00:46:59.000 give you, like, a two-second talk about set, 00:46:59.000 --> 00:47:01.000 because it's the one 00:47:01.000 --> 00:47:04.000 ADT where I never said, ''Well, how does it work? What do we do?'' It turns 00:47:04.000 --> 00:47:07.000 out it doesn't add any new things that we haven't already done, so in fact I'm just 00:47:07.000 --> 00:47:10.000 gonna tell you what it does, and 00:47:10.000 --> 00:47:13.000 then your picture will be complete, right. 00:47:13.000 --> 00:47:16.000 It wants to do fast searching, fast updating, add and remove, and it also has all 00:47:16.000 --> 00:47:21.000 these high-level operations, such as intersect, union and difference, 00:47:21.000 --> 00:47:24.000 that are kind of its primary set of things to do.
00:47:24.000 --> 00:47:25.000 00:47:25.000 --> 00:47:27.000 A bunch of the things we've already seen, right, 00:47:27.000 --> 00:47:30.000 would actually be a reasonable strategy for implementing set, right, using some kind 00:47:30.000 --> 00:47:31.000 of vector or array. 00:47:31.000 --> 00:47:34.000 Most likely you'd want to keep it in sorted order, because that would buy you 00:47:34.000 --> 00:47:37.000 the fast lookup for contains, 00:47:37.000 --> 00:47:41.000 but the add and remove, right, would necessarily be slow in those cases because of the 00:47:41.000 --> 00:47:43.000 contiguous memory. 00:47:43.000 --> 00:47:44.000 00:47:44.000 --> 00:47:47.000 The linked list is probably not too good of an option here, 00:47:47.000 --> 00:47:50.000 because contains then becomes 00:47:50.000 --> 00:47:51.000 linear, and 00:47:51.000 --> 00:47:54.000 add and remove similarly, right, linear, so it doesn't actually get you anywhere; in 00:47:54.000 --> 00:47:57.000 fact it just adds the memory in pointers and gunk like that. 00:47:57.000 --> 00:47:59.000 The binary search tree, probably a 00:47:59.000 --> 00:48:00.000 pretty good call, 00:48:00.000 --> 00:48:03.000 right; that 00:48:03.000 --> 00:48:06.000 kinda meshes the advantage of a dynamic structure with the sorted property to 00:48:06.000 --> 00:48:11.000 be able to give it fast update: adding and removing and searching all can be done 00:48:11.000 --> 00:48:14.000 in logarithmic time if you have balancing. 00:48:14.000 --> 00:48:17.000 And then it actually also enables the high-level ops, so the idea of being able to do a union, 00:48:17.000 --> 00:48:19.000 intersection, difference, and actually being able to 00:48:19.000 --> 00:48:23.000 walk over both of the trees in sorted order, which is very easily done 00:48:23.000 --> 00:48:24.000 with an in-order traversal, 00:48:24.000 --> 00:48:28.000 makes those operations actually more efficient to implement than kinda having to deal with 00:48:28.000 --> 00:48:32.000 data coming at you all different ways. It 00:48:32.000 --> 00:48:33.000 turns out hashing really 00:48:33.000 --> 00:48:35.000 won't quite do it 00:48:35.000 --> 00:48:36.000 00:48:36.000 --> 00:48:37.000 very easily for us, 00:48:37.000 --> 00:48:40.000 because of the same need about the template situation, like, well, you have to have a hash function that 00:48:40.000 --> 00:48:43.000 can hash these things and do the right thing, and so the tree happens to be a 00:48:43.000 --> 00:48:46.000 good fit: if the client can just give you ordering information, you can 00:48:46.000 --> 00:48:48.000 do all the hard work without 00:48:48.000 --> 00:48:50.000 pushing hashing onto them. So in fact 00:48:50.000 --> 00:48:53.000 the way our set really does work is it does use a binary search tree. It happens to 00:48:53.000 --> 00:48:56.000 use it through another layered abstraction: 00:48:56.000 --> 00:48:59.000 there actually is a BST class that we've never really 00:48:59.000 --> 00:49:02.000 encountered directly, but the BST class just takes the idea of a binary search 00:49:02.000 --> 00:49:06.000 tree, adds balancing and 00:49:06.000 --> 00:49:08.000 all the operations for finding and inserting and adding, 00:49:08.000 --> 00:49:12.000 and packages it up, and then set actually just turns out to be one-liners 00:49:12.000 --> 00:49:14.000 making calls into the binary search tree. 00:49:14.000 --> 00:49:17.000 That's where all the heavy lifting 00:49:17.000 --> 00:49:18.000 goes: down there.
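A minimal sketch of that layering, with hypothetical names. The toy BST below is unbalanced and orders with operator<, where the real class, per the lecture, adds balancing (and takes a client-supplied comparison callback for ordering):

    template <typename T>
    class BST {
    public:
        BST() : root(nullptr) {}

        void insert(const T& v) {
            Node** cur = &root;
            while (*cur != nullptr) {
                if (v == (*cur)->value) return;    // sets ignore duplicates
                cur = (v < (*cur)->value) ? &(*cur)->left : &(*cur)->right;
            }
            *cur = new Node{v, nullptr, nullptr};
        }

        bool find(const T& v) const {
            Node* cur = root;
            while (cur != nullptr) {
                if (v == cur->value) return true;
                cur = (v < cur->value) ? cur->left : cur->right;
            }
            return false;
        }

    private:
        struct Node { T value; Node* left; Node* right; };
        Node* root;
    };

    // Set itself is just one-liners delegating to the tree: the heavy
    // lifting all happens down in the BST layer.
    template <typename T>
    class Set {
    public:
        void add(const T& elem)            { tree.insert(elem); }
        bool contains(const T& elem) const { return tree.find(elem); }
    private:
        BST<T> tree;
    };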
00:49:18.000 --> 00:49:23.000 That kinda completes the big picture. So our map uses the hash table, right, our 00:49:23.000 --> 00:49:26.000 set uses the BST, which is the binary search tree, 00:49:26.000 --> 00:49:28.000 right, our vector, right, is using an array, 00:49:28.000 --> 00:49:33.000 and then the stacks and queues have viable strategies both for array- or 00:49:33.000 --> 00:49:37.000 vector-based backing as well as linked list, both getting good time performance. And then 00:49:37.000 --> 00:49:38.000 you saw with the PQ, right, 00:49:38.000 --> 00:49:40.000 even 00:49:40.000 --> 00:49:44.000 yet another structure, the heap, right, which is kinda a variant of the tree, and so those tend to be the 00:49:44.000 --> 00:49:46.000 kind of really classic things on which a lot of things get built, 00:49:46.000 --> 00:49:51.000 and so having a chance to use them both as a client and kinda dig 00:49:51.000 --> 00:49:53.000 around as an implementer kinda gives you this big picture of 00:49:53.000 --> 00:49:55.000 some of the more common ways that 00:49:55.000 --> 00:49:57.000 cool data structures get built. 00:49:57.000 --> 00:50:00.000 So that's all for Friday. We're gonna do some more good stuff next week. I'm gonna go to the Terman 00:50:00.000 --> 00:50:03.000 cafe in a little bit; if you've got some time, please come, and have 00:50:03.000 --> 00:50:04.000 a good weekend. 00:50:04.000 --> 00:50:07.000 Work hard on Pathfinder.