This presentation is delivered by the Stanford Center for Professional Development.

What are we doing today? We're gonna talk about hashing. Hashing's one of the coolest things you're ever gonna learn in 106B, so it's a good day to be here and learn something really neat — a clever, inventive idea for how you do search and retrieval in a very efficient way.

Let me put us back to where we were a lecture ago, back to Monday, where we had talked about ways we might implement the map abstraction — the key-value pairing — to allow for quick search and retrieval. We had looked at the vector and done a little thought experiment about what sorting the vector would buy us: it would give us the ability to do searches faster, because we could use binary search. But in the case of add, we'd still be slow, because the elements need to be shuffled or rearranged in that contiguous memory. That's what motivated our introduction to trees — the binary search tree, which is designed exactly to enable that kind of searching, that logarithmic divide-and-conquer, divide in half and work your way down. And then, because of the flexibility of the pointer-based structure, we can also do the insert in the same logarithmic time, because we're no longer constrained by the contiguous memory that array-backed data structures force on us.

So at that point we had seen logarithmic time for both getValue and add. And on the PQ, which you were just finishing, you'll notice that logarithmic is pretty fast. In fact, for most values of n that you would normally run into in ordinary use, logarithmic is unmeasurable. Those of you who got your heap up and running, correctly working in logarithmic time, discovered that you can put 10,000, 20,000, 50,000, 100,000 things in it and never be able to measure what it took to enqueue or dequeue from those structures, because logarithmic is very, very fast.

That said, that would actually be a fine place to stop — logarithmic is fast enough that we don't have to look further. But in fact I'm gonna push it a little bit today, and we're gonna get it down to where we can do both of those operations in constant time. Which is to say: no matter how large the table gets, we expect to be able to insert and retrieve in the exact same time for a 10,000-entry table as for a million-entry table, which is pretty neat.
As we start to get to these more complicated data structures, one thing we also need to keep in mind is that there are trade-offs in adding that complexity to the code. Starting with something that's simple and a little slower, but easy to write, might actually solve the problem we have at hand. Maybe we don't really wanna go to one of these fancier things, because we don't really need it to be that much faster. The BST adds a bunch of memory — left and right pointers for every cell, an 8-byte overhead on top of the entry itself. Now we have to deal with pointers, dynamic memory, all this allocation and deallocation, and we've got a lot of recursion going on in there — plenty of opportunities for errors to creep in. And if we take the steps to provide tree balancing, which we only grazed the surface of — we just mentioned the need for it — it actually adds quite a lot of complication to do the rotations or the rebalancing beyond a simple binary search tree. So the whole package of building a binary search tree with guaranteed log n performance is actually a pretty big investment of human capital that you have to balance against the future payoff of it running faster. The strategy I'm gonna show you today is actually kind of simple to write, gets us to constant time in the common case, and doesn't have the degenerate case looming that the binary search tree did.

So let me give you a conceptual model to think about first. You have a big dictionary — a many-thousand-page dictionary — and you want to look up a word. You certainly don't do linear search: about to look up the word xylophone, you're not going to start from the A's and page your way through 2,000 pages to get to the X's. You don't even quite do binary search. Binary search is pretty fast, but it turns out you don't start at the M's to get to something in the X's. You tend to know about where X is, and often dictionaries have little tabs along the pages that give you an idea of where the breaks for certain words are.
Maybe it says Xa is here and Xe is there, and that lets you quickly eliminate all the surrounding noise in the dictionary and get straight to the region where the X words are. Then you can do a simple search from there — maybe a little bit of a linear or binary search to zero in on the page where the word you wanna find is.

This is the idea behind the hash table, based on the algorithmic technique of hashing. Rather than trying to keep one very large collection sorted or organized in a way that lets you jump around within it, it says: how about we maintain a bunch of different collections? We break the data set up into little tiny pieces, and if we can tell, given a key, which piece of the data set it's supposed to be in — and we'll call those pieces buckets — then I can put things into different buckets. If I had a whole bunch of students, I might put all the freshmen in this bucket and the sophomores in that bucket, and then when I look for a student I'd say, "Which class are you?" You say, "Freshman." I look in the freshmen bucket.

The idea in a simple case is: I have some criterion by which I can divide you into buckets that don't overlap, so everyone has a unique bucket to go in. If I can make up enough of those categorizations — have enough buckets around to divide you up into — then I can hopefully keep the set that lands in each bucket pretty small. Maybe I do it by the first letter of your last name: everybody whose last name begins with W goes here, everybody whose name begins with B goes there, and so on. Then when somebody comes to me, I don't have to look through anybody else's bucket. If you have a class of 200 students and 26 letters of the alphabet, then if it were evenly divided across those letters — it's not going to be, it turns out in practice, but as a thought experiment — there'd be about 4 people per bucket, and it'd be pretty fast to look through the bucket and find the one I was looking for.

Okay, so let's take this idea. There's going to be some strategy for how we divide into buckets, and we call that strategy the hash function.
Given a key, and a number of buckets — we'll just label the buckets with numbers, zero to the number of buckets minus one, so if I have 99 buckets, they're numbered 0 to 98 — the hash function takes the key as input (in this case, the string key we're using in the map) and produces a number as output in the range zero to B minus one. That number is called the hash value for that key, and it tells me which bucket to look in: either to find a match, if I'm trying to do a getValue, or to decide where to place a new entry, if I'm trying to add into the table. So that's the basic idea.

I'm gonna talk a little bit about hash functions. Writing a really good hash function is actually pretty complicated, so we're gonna think about it from a big-picture perspective and consider what qualities a hash function has to have. I'm not gonna prove to you what mathematically constitutes a good hash function; I'm just gonna give you some intuitive sense of what kinds of things matter when designing one.

So: given a key, it maps it to a number, typically zero to the number of buckets minus one. Often that means it computes something and then uses the mod operation to bring the result back into range. It takes your number and says, "Okay, if you have five buckets to divide things into, I'll give you a number zero to four." It's very important that it be stable: you put the key in, you get a number out; if you put that same key in later, you should get the same number out. It has to be a reliable, guaranteed mapping — it can't be random. That said, you want it to almost look random in terms of how it maps things to buckets. You don't want a real systematic plan where it puts a whole bunch of things in bucket one and very few in bucket two, or somehow always uses the even buckets and not the odd ones. You want a strategy that takes the keys and uses your whole bucket range in an even manner, so that if you put 100 keys in and you have 100 buckets, you're hoping that, on average, each bucket will have one thing — not clusters where 25 things are in one bucket, ten here, and then a whole bunch of buckets empty.
So you want it divided across the range. Given that we have string input, what we're likely to use is the characters in the string: how can I take these characters and jumble them up in a way that moves keys around these buckets reliably — the same string always gives me back the same bucket — but doesn't produce a lot of artifacts, like similar strings all mapping to the same place, and hopefully spreads them around?

For example, one really simple thing you could use is just the first letter. I talked about this with last names: I could divide my students by the first letter of their last name. That uses exactly 26 buckets, no more, no less, and if I happen to have 26 buckets, then at least there's some chance it gets some coverage across the range. There is likely to be clustering — some names, or rather some letters, are a lot more commonly used than others — so it would tend to produce a bit of an artifact, but it would be reliable and it would use 26 buckets. If I had 100 buckets, this would be kind of a crummy function, because it would use the first 26 buckets and none of the others. So in some situations this would get me a little bit of what I wanted.

Something like the length of the word is a very, very bad function in terms of clustering: there'd be a bunch of strings that are six, seven, ten characters, but very few that are one or two, and very few that are more than ten or so. You'd have this very tight little cluster in the middle, and it would not use a large bucket range effectively at all.

More likely, what you're gonna try to use is more of the information about the word — the key — that you have.
Instead of just a single letter or the count of letters, try to use all the data you have at hand, which is to say all the letters. One way to do that is to take the ASCII values of the letters — 'a' is 97, 'b' is 98, and so on — so for a word like "atoms" you'd add 97 plus 116 plus 111 and so on, add them all together, and get some sum, 500, 600, whatever it is, depending on how many letters there are. Then use mod to bring it back into range: say you had 100 buckets, so you want zero to 99 — you'd take that sum and mod it by 100, which is just taking the last two digits. If the sum is 524, it goes into bucket 24.

So if "act" is the word you have, you add these codes together — 97 for the 'a', 99 for the 'c', a hundred-something for the 't' — and we end up with, I'm making this number up, 283. We mod down to the last two digits, and this one goes into bucket 83. And if we have "dog," maybe that adds to 267, and we mod that down to 67. So this actually tries to use all the information at hand — all the letters we have — and bring it down into range. And since, across the space of all the sums being produced, you're just as likely to get 210 as 299 or any of the numbers in between, hopefully you get a spread across your buckets.

There is one thing this clusters on, though. Does anybody see which words cluster in this system — words you'd think of as unrelated that would land in the same bucket? [Student: rearrangements?] Yes — any anagram. You take "cat," which is just an anagram of "act": it's gonna add to the same sum, so there'd be a little bit of clustering in that respect. Anagrams come down to the same place, so words like "act" and "cat," which seem like different words to you, are gonna collide in this system, and that might be a little bit of something we worry about.
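As a concrete sketch, that letter-sum scheme might look like this in C++. The function name and the example numbers are just illustrative, not the lecture's actual code:

```cpp
#include <string>

// Naive letter-sum hash: add the ASCII codes of all the characters,
// then mod by the bucket count to land in [0, numBuckets).
// Weakness noted above: anagrams like "act" and "cat" hash identically.
int sumHash(const std::string& key, int numBuckets) {
    int sum = 0;
    for (char c : key) {
        sum += c;                // 'a' contributes 97, 'c' 99, 't' 116, ...
    }
    return sum % numBuckets;     // e.g. a sum of 524 with 100 buckets -> bucket 24
}
```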
One way to avoid that — and the hash function we're eventually going to use does this — is not only to add the letters together, but also to multiply each letter by a power of a very large prime number: the large prime times 'c', plus the large prime squared times 'a', plus the large prime cubed times 't', and so on. That way, each letter's position is also encoded in the number that comes out, so something that's positionally different won't necessarily land in the same bucket without some other factors coming into play. It's a little bit of taking the information and jumbling it up.

So let me show you how this looks. Actually, I'm gonna tell you about collisions first, and then I'll show you the data structure. As I said, there's gonna be this problem where things collide: "act" and "cat," in this simple system of adding the codes, come out the same. So it can't be that each bucket holds exactly one thing. With hashing, although we try to keep the buckets as small as possible, we have to account for the possibility that two things will collide — that two different keys will hash to the same bucket, based on how the function jumbled them up. We're trying to avoid that, but when it does happen, we have to have a strategy: a way of saying there can be more than one thing in a bucket, or some other way, when there's a collision, of deciding where to put the entry that was placed in the table second.

The strategy we're gonna use is called chaining. Chaining basically says: once we have something in a bucket, like "act," if we hash something else and it comes to the same bucket, we just tack it onto that bucket, in this case using a linked list. So each bucket is actually gonna be a linked list of entries, and the entire table will be a vector or array of those.

So let me show you a little live action. This is a hash table that has seven buckets, numbered 0 through 6.
They all start empty — these are all empty linked lists, illustrated by the null — and the hash function is shown here like a black box. You don't actually know the workings of the hash function, but you know its job: you give it a key and it gives you a number in the range 0 to 6, and the same key, when passed again, gives you the same number.

So if I say 106B = Julie, it pushes 106B through the hash function and says, "Okay, 106B comes out as one." You add those numbers together, jumble up the letters, and you get one. So it went into the table and said, okay, that's bucket number one — make a new cell for 106B/Julie and stick it in the table. Now there's one non-empty chain, and the rest are empty. If I say 107 is being taught by Jerry, 107 happens to get the same hash value — 107 and 106B are different letters, but it just happens to come out that way, hashing to one — so in this case it just tacked it onto the front of the list. If I add 106A being taught by Patrick, 106A happens to sum to zero in the hash function — that's all we know about the black box — and it ends up loaded into the zero slot. If I add another course being taught by Nick, it also comes out to be zero, so it ends up pushed onto the front of that chain. I add 103 being taught by Maron and it's in bucket five. So as I type things in, I'm starting to see things landing in different places in the table — a little bit of clustering at the top, a little bit of empty at the bottom — but as I continued to type in strings for other course numbers, I'd hopefully fill out the table to where there was a roughly even spread across it.

When I want to look something up — because I wanna see who's teaching 106B — it uses the same process in reverse.
It asks the hash function, "Which bucket should I look in?" And it says, "You should look in bucket one, because 106B's hash code is one." All the other buckets are eliminated from consideration — a very quick narrowing down to the right place — and then it just does a linked-list traversal to find 106B in the bucket where it has to be.

The key to realizing what makes this so magical is that the choice of the number of buckets is up to me. I can pick the number of buckets to be whatever I want: five, 10, 1,000, a million. If I choose the number of buckets to be roughly the same order of magnitude as the number of entries — so if I wanna put a thousand things in my table, I make a thousand buckets to hold them — and my hash function can divide things pretty equally across those thousand buckets, then I can expect each of those linked lists to be about length one. Some will be two, some will be zero, but if the function is doing its job right, I have a bunch of little tiny buckets, each of which holds a tiny fraction of the data set — in this case one or two entries — and it's very easy to search through them.

If I know I'm gonna put a million things in, then I make a million buckets. Then, to find one out of a million things, I run it through the hash function, it gives me a number that says, "Oh, it's 862,444," I go to that bucket, and the item is either there in that bucket or I have to look a little bit within it to find it — but there's no other place in the table I have to look. The fact that there are a million other entries nearby in the other buckets is completely unimportant to us. The searching and updating — adding, replacing, removing, and so on — all happens at the bucket level, and by choosing the total number of buckets to be roughly equal to the number of entries, I get constant-time access to the tiny subset of the data that matters. That's pretty cool.

Let's write some code. It's good to see the whole thing doing what it needs to do. So I'm gonna go ahead and go back over here.
All right, so I've got my map, which has add and getValue and not much else — just set up and ready for us to go. What I'm gonna build here is a linked-list cell that holds a string key, a val-type value, and a pointer to the next cell. I plan on having one of these cells for each entry that goes in the table, and then I'm going to have an array of these lists. I'm gonna make a constant for the array size, and then talk a little bit about the consequences of that. I'll make it 99, so I have 99 buckets to start with. That's not gonna solve all my problems — it means the table is tuned for maps that expect to hold about a hundred things — but we'll come back to that.

Deciphering this declaration is a little bit complicated. It says that buckets is the name of the variable, and it is an array of NumBuckets length — a fixed-size, 99-entry array. Each of those entries is a pointer to a cell, ready to be the head pointer of a linked list for us.

I'm gonna go over here and work on the constructor first. The first thing to do is make sure those pointers have good values. If I do not do this, they will be junk and there will be badness that happens, so I make sure each of them starts off as an empty list. Correspondingly, in the destructor I would need to be deleting all the list cells — iterating through and deleting the cells in each of those lists — but I've been lazy about doing that, and I'm gonna continue to be lazy, just knowing that it ought to be there. Here's a sketch of the skeleton so far.
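This is roughly what that skeleton might look like, assuming names like MyMap, Cell, and NumBuckets stand in for whatever the lecture code actually uses:

```cpp
#include <string>

template <typename ValType>
class MyMap {
public:
    MyMap();
    void add(const std::string& key, const ValType& value);
    ValType getValue(const std::string& key) const;

private:
    static const int NumBuckets = 99;   // tuned for ~100 entries

    // One cell per entry; each bucket is a singly linked chain of these.
    struct Cell {
        std::string key;
        ValType value;
        Cell* next;
    };

    Cell* buckets[NumBuckets];          // head pointers, one per bucket

    // Helpers sketched later in the lecture.
    int hash(const std::string& key, int numBuckets) const;
    Cell* findCell(Cell* head, const std::string& key) const;
};

// Constructor: every bucket starts as an empty list. (A real destructor
// would walk each chain and delete every cell; the lecture skips it.)
template <typename ValType>
MyMap<ValType>::MyMap() {
    for (int i = 0; i < NumBuckets; i++) {
        buckets[i] = nullptr;
    }
}
```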
Then, for add and getValue, what we need to do in both cases is figure out the appropriate list to operate on: run the hash function on my key, see where it tells me to look, and then either search the list to find the match — for getValue, to return the matching value — or, in the add case, search for the matching entry to overwrite if it exists and, if not, create a new cell and tack it on the front of the list. So let's write helper functions, because both operations are gonna need the same two things. I'm gonna put a hash function in here that, given a key and a number of buckets, returns the right bucket to look in, and then I'm gonna write a findCell function that, given a key and the pointer to a list, finds the cell that matches that key, or returns null if it didn't find it.

Okay, let me put my hash function in. I have a good hash function that I'm gonna pull out in a minute, but for now I'm gonna make an interesting choice: I'm gonna have my hash function just return zero, which seems pretty surprising. Let me talk about why. It turns out that when I'm first building this, that's an easy way to test that my code is working without being dependent on the behavior of the hash function. At this point it says, "We'll just put everything in the zero bucket." Now that's not a good strategy for long-term efficiency, but it is a good way to test that my handling of the buckets, the searching, the finding, and the inserting is correct. The performance I should expect to be abysmal — totally linear in the number of elements for add and getValue, because they'll all be in one linked list with no quick access to them. Eventually I can change this to divide across the buckets and get a better spread, but the correctness of the hash table isn't affected by this choice. It's kind of an interesting way to think about designing the code.

The other helper down here is my findCell which, given the head pointer of a list and the key, goes to find it. It's basically: if this current cell's key equals the key, return it; if we've gone through the whole list and haven't found it, return null.

So this guy returns a Cell* — the matching cell within the list — and that return type provokes the little C++ compilation snafu we encountered on the binary search tree as well: Cell is a type declared in the private section, within the scope of MyMap. Outside of that scope — which is to say, before I've crossed the MyMap<ValType>:: in the definition — the compiler doesn't actually believe we're in MyMap's scope yet.
It hasn't seen that and can't anticipate it, so I have to give the full name with the scope specifier on it, and then, because this use of a template's nested type outside its class is a special place for the C++ compiler, the typename keyword also needs to go there.

Okay, so that's the heavy lifting of these two things: hashing to the right place, which I kinda short-changed, and searching to find a cell. If I go back up here to getValue: I call the hash function, given my key and the number of buckets, to find out which bucket to work on; I go looking for a match by calling findCell with the key on buckets[whichBucket]; and then, if match does not equal null, I can return match's value; otherwise we raise an error, "No such key found." That's the error case for getValue: if it didn't find one, its behavior is to raise an error so that the caller knows they've asked for something it can't deal with.

The key thing here is that the hash is usually some sort of constant-time operation that just jumbles the key up — in this case it just returned zero, but it could actually look at the elements of the key — and then it does its traversal down that linked list, which, given a correctly working hash function, should have divided the entries into these little tiny chains, so we can find the one we want quickly by looking through that list. If we find it, we've got it; otherwise we're in the error case.

Add looks almost exactly the same. I start by getting the bucket and doing the findCell. If it did find a match, then we overwrite. In the second case, where we didn't find it, we need to make ourselves a new cell, copy in the key, copy in the value, and attach it — and the easiest place to add something to a linked list is the front. No need to make it hard on ourselves: we're not keeping them in any particular order whatsoever, they're just in order of entry, so if somebody gives us a new one we might as well stick it in the front. So the new cell will point to the current front of the bucket's list, and then we update the bucket pointer to point to this guy.
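Filling those in as a sketch, under the same assumed names — the placeholder hash is the deliberate return-0 version, and a standard exception stands in for whatever error mechanism the course library provides:

```cpp
#include <stdexcept>

// Placeholder hash: everything lands in bucket 0. The table's
// correctness doesn't depend on this choice; only performance does.
template <typename ValType>
int MyMap<ValType>::hash(const std::string& key, int numBuckets) const {
    return 0;
}

// Walk one chain looking for a matching key; null means not found.
// Note the typename and full scope qualification on the return type —
// the compilation snafu described above.
template <typename ValType>
typename MyMap<ValType>::Cell*
MyMap<ValType>::findCell(Cell* head, const std::string& key) const {
    for (Cell* cur = head; cur != nullptr; cur = cur->next) {
        if (cur->key == key) return cur;
    }
    return nullptr;
}

template <typename ValType>
ValType MyMap<ValType>::getValue(const std::string& key) const {
    Cell* match = findCell(buckets[hash(key, NumBuckets)], key);
    if (match != nullptr) return match->value;
    throw std::runtime_error("No such key found");  // stand-in for the course's error call
}

template <typename ValType>
void MyMap<ValType>::add(const std::string& key, const ValType& value) {
    int which = hash(key, NumBuckets);
    Cell* match = findCell(buckets[which], key);
    if (match != nullptr) {
        match->value = value;          // existing key: overwrite
    } else {
        Cell* cell = new Cell;         // new key: prepend to the chain
        cell->key = key;
        cell->value = value;
        cell->next = buckets[which];
        buckets[which] = cell;
    }
}
```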
So I think we're in okay shape here. Let me take a look and make sure I like what I did: hashing, finding the match; if it found an existing cell — this is the overwrite case — we just overwrite with the new value; otherwise we make a new cell and attach it at the front of the bucket.

The case where the bucket is totally empty is good to trace, since that's the most common case when we start. We hash to bucket zero, we search a null linked list, and we should get null out; so we create a new cell, set its next field to the current bucket's head pointer, which was null, and update the bucket to point to this guy. We end up with a single list cell with null trailing it — the inserting-into-the-empty-bucket case. I like it.

Let's see, I have a little code over here that adds some things — car, car, cat, with some numbers — and then tries to retrieve the value of car, which was written with a three and then a four on top of it, so the answer we should get out of this is four. Okay. It would also be interesting to ask for something we know is not in the table, to see that our error case gets handled correctly — and it is: no such key found.

If I were to continue, I could do a little bit of stress testing to see the performance implications of that return-zero hash. If I sat there and just put tons and tons of things in — made up strings like crazy and stuffed them in with numbers — I would just be creating one long chain off the zero bucket, ignoring all the other ones. That would stress-test my linked-list handling of adding and searching, but it would also show that as we got to 100 entries, 200 entries, 300 entries, subsequent adds and getValues would show a linear increase in time because of the clustering we've got going there. Here's what that little test drive looks like as client code.
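A sketch of the test described above, assuming the MyMap sketched earlier:

```cpp
#include <iostream>

int main() {
    MyMap<int> m;
    m.add("car", 3);
    m.add("car", 4);    // same key: the 4 overwrites the 3
    m.add("cat", 1);
    std::cout << m.getValue("car") << std::endl;   // prints 4
    // m.getValue("dog");   // would raise the "No such key found" error
    return 0;
}
```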
So let me give you a real hash function. I'll talk you through what it's doing, and then leave you with the thought that writing one of these correctly and reliably is a more advanced exercise than we're gonna take on here. This is a hash function taken from Don Knuth's The Art of Computer Programming — some of the work that guides a lot of computer scientists to this day — and it has basically the strategy I told you about: for all the characters in the string, add each one in, but first multiply the hash code being built up so far by a large multiplier, which in effect takes advantage of the positional information.

The multiplier happens to be a very large negative number that's prime. The first time through, it takes the hash code of zero, multiplies it, and adds the first character, so the first character gets added in without any modification. The next character, though, takes the previous result, multiplies it by one power of the multiplier, and adds in the next character; then that sum, on the subsequent iteration, gets multiplied by the multiplier again. So, raising it to the squared, the cubed, and the fourth powers, it works its way down the string.

Part of the effectiveness is that it's gonna do a lot of overflow. That's a very large number, and it's adding and multiplying in this very large space, so in effect we're counting on the fact that the addition and multiplication built into C++ don't raise any errors when the numbers get so large that they can't be represented. It just truncates down to what fits, wrapping around in a modular way, like a ring. So we end up making a very big number out of this — the longer the string, the more of those multiplies, the higher the powers — and it's truncated back to what fits into a long. When we're done, we take whatever number fell out of this process and mod it by the number of buckets, to give us a number from zero to buckets minus one.

So whatever gets stuffed into here, it's reliable, in the sense that every time you put in the same key you get the same thing back. Here's a sketch of what that looks like.
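A sketch of that multiplier-style hash as the member function, with the constant assumed to match the "large negative long" the lecture describes. (Strictly speaking, signed overflow is undefined behavior in C++; the lecture leans on the wraparound behavior of real machines.)

```cpp
// Multiply-and-add hash in the style described above: the running
// code is multiplied by a large constant before each character is
// added, so a letter's position changes its contribution.
template <typename ValType>
int MyMap<ValType>::hash(const std::string& key, int numBuckets) const {
    const long Multiplier = -1664117991L;   // assumed lecture constant; note the L suffix
    long hashcode = 0;
    for (size_t i = 0; i < key.length(); i++) {
        hashcode = hashcode * Multiplier + key[i];  // overflow just wraps in practice
    }
    // Cast to unsigned before the mod so the result is guaranteed to
    // land in [0, numBuckets) even if hashcode ended up negative.
    return static_cast<int>(static_cast<unsigned long>(hashcode) % numBuckets);
}
```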
We know the result will be in the range of numBuckets because of that last mod call, and we then use it to say which bucket to go look in when we're ready to do our work.

If I switch that in for my old hash function, I shouldn't actually see any change from the outside, other than the performance, which should suddenly be a lot faster — hard to tell when I have three things in there. I'm gonna take out that error case. "Car" and "cat" are now probably not going to the same bucket — in fact, I can guarantee they don't go in the same bucket — whereas before they were overlapping.

I'll leave that code there for a second, although really I'd say the point of hashing is not to get worried about how the hash code does its work, but to understand the general idea of what a hashing function has to do — what its role is and how it relates to getting this constant-time performance.

So let me give you an example of hashing in the real world. I think it's really interesting to see it actually happen, with people not even realizing how useful hashing is. I ordered something from REI, the camping store, and they didn't have it in stock, so they had to place an order for it — this was actually a while ago. When I called to see if it was in, I said, "Oh, is my order in?" And they didn't ask me my name. I thought this was very funny. I'm like, "I'm Julie. I wanna know about my order." They're like, "Okay, what are the last two digits of your phone number?" It's not one of those things you can answer right off the top of your head — turns out the last two digits of my phone number are 21. And they go and look in the 21 basket, and then they say, "What's your name?" Now they want to know my name.
I'm like, "Okay, my name's Julie." They look it up: "Okay, it's here." So then I go to get it a couple of days later, and I'm standing in line waiting, and I look up on the wall, and they have 100 little wire baskets up there, labeled 0 through 9 one way and 0 through 9 the other. When I come in, they ask me the same question: "What are the last two digits of your phone number?" Now I'm all prepared, because I've already been primed on this question. I say, "21." They walk over to the 21 basket — the top digit, then this digit — and there's only one piece of paper in there. They pull out my order slip, they get me my tent, and everybody's happy.

So while I'm there, I try to engage the cashier on how this is an example of a real-world hashing system. I still got my tent, I'll have you know, but it was close — they're like, "Okay, yeah, you crazy crackpot, you can leave now." But in some sense this is a very good example of what the investment in the number of buckets is versus the trade-off it gives you, because there's a very physical setup here: by choosing a larger number of buckets, you get more fine-grained access, but that investment in buckets means you're permanently installing them on the wall. They didn't use just the last digit of my phone number. They could have, and they'd have had only ten baskets on the wall, but then they'd expect ten times as many orders in each basket. And apparently their estimate was that the number of active orders in this backorder category was significantly more than ten but not so much more than 100, so 100 baskets was the right investment in their bucket strategy. They didn't put a thousand baskets on the wall — what are they gonna do, have some big 3D arrangement? (They did not enjoy this discussion; I went on for hours, and they were really not amused.) With three digits you'd get an even higher likelihood that each basket would be empty —
or have a very small number of things in it — but given their setup, 100 buckets seemed to be the right strategy. And notice: they didn't ask me the first two digits of my phone number, they asked me the last two. Why does that matter? [Student: because you're using all (inaudible).] Yeah — the first digits cluster a lot. If you ask Stanford students — especially back when campus numbers all started with 49 or 72 or something — if you used the first two digits, everybody would be in the 49 basket or the 72 basket, and a whole bunch of other baskets wouldn't be used. For example, the first two digits are never 00 — phone numbers never start with zero — and they rarely have zeros or ones in the leading digits at all; those tend to be reserved for area codes and such. So there's a whole bunch of baskets that would never even get used. I thought it was just a really interesting exercise: they had exactly thought through hashing, even though they didn't think of it as hashing, and they thought I was a crackpot and didn't want to give me my stuff. But I paid my money and I got it.

It shows you what you're trying to do here: make the number of buckets roughly equal to the number of elements, so that if your hash function is dividing things up across your whole range, you've got constant access to what you have.

[Student: Why is there an L at the end of one number?] That's just C++ notation: it marks a long value. Without it, the compiler assumes the constant is an int, and that number is too big to fit in an int — here a long is twice as big as an int in terms of space — so the L needs to be there; otherwise it would try to read it as an int and throw away part of the information I was trying to give it. It's the way of identifying a long constant.

So let me talk about this just a little bit more in terms of actual performance. We've got N over B — the number of entries divided by the number of buckets — and if we make those roughly the same order of magnitude, the same range, which requires making a bit of an estimated guess about where we're going, then there are some choices in how we store each bucket. We could use a little array, we could use a linked list, we could use a vector.
We expect each bucket to be very small, is the truth — we're hoping one or two entries — so there's not a lot of reason to buy any real expensive structure, or even to bother with things like sorting. You could sort them, but if you expect there to be two things, what's the point? You're gonna spend more time figuring out whether to put an entry in the front or the back than you'd ever gain by being able to find it a little faster. The linked list is just the easiest way to get allocation without excess capacity sitting unused in the bucket: you know it's gonna be small, so use the simple strategy.

There's one important thing, though, that a hash table in an industrial-strength situation is likely to do: deal with the question of what happens if the number of buckets you predicted was wrong. The map, as we have given it to you, has to absorb this, because it doesn't know in advance how many things you're gonna put in. The only way to know is to wait and see as things get added, and then notice the buckets getting full. So typically the industrial-strength version of a hash table tracks what's called the load factor, which is just the total number of entries divided by the number of buckets. When that factor gets high — and "high" is actually quite small; a load factor of two is often the trigger — it causes a rehash. Not even letting it get to three or four: as soon as you realize you've exceeded double the capacity you intended — you planned for 100 things, you've got 200 — you go ahead and redo your whole bucket array.

So, for example, in a simple case, say I have a table with six buckets, and I've gotten to where some have two or maybe three entries and another has one. The plan is to take the whole bucket array and enlarge it by a factor of two — grow it, in this case, from 6 to 12 — and then rehash to move everybody. The earlier decision about where to place each entry was based on knowing there were exactly six buckets, so where it was placed before is not likely to be where it lands when there are 12 choices to lay it down into. A sketch of this grow-and-rehash step follows.
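For this sketch to work, the fixed 99-entry array from before would have to become a dynamically allocated array with an int numBuckets data member, so treat the names here as assumptions:

```cpp
// Grow-and-rehash: double the bucket array and re-home every cell.
// Assumes buckets is now a Cell** and numBuckets an int data member.
template <typename ValType>
void MyMap<ValType>::rehash() {
    int oldNum = numBuckets;
    Cell** oldBuckets = buckets;

    numBuckets = oldNum * 2;                 // e.g. 6 buckets becomes 12
    buckets = new Cell*[numBuckets];
    for (int i = 0; i < numBuckets; i++) buckets[i] = nullptr;

    // Run every existing cell back through the hash function, since its
    // old position was based on the old bucket count.
    for (int i = 0; i < oldNum; i++) {
        Cell* cur = oldBuckets[i];
        while (cur != nullptr) {
            Cell* next = cur->next;
            int which = hash(cur->key, numBuckets);  // new home under new count
            cur->next = buckets[which];              // splice onto the new chain
            buckets[which] = cur;
            cur = next;
        }
    }
    delete[] oldBuckets;
}
```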
Ideally, in fact, you don't even want the old neighbors to stay together: if they all landed in the same bucket again, you'd have the same clustering and then a big empty clump. So you rehash everything, running each entry through the hash function again now that the number of buckets has changed, and asking, "Well, where does it go in the new scheme?" Hopefully you end up with a clean table, say 12 items with one in each bucket, ready to give constant-time performance again; and then potentially, if you overload the load factor later, you go back and rearrange things once more.

This strategy is a little like the one we talked about with the binary search tree: wait and see. Getting two items in a bucket doesn't immediately cause any alarm or crisis; the table waits until there's sufficiently clear demand exceeding the planned capacity before doing a big rearrangement.

So the adds would go fast, fast, fast, fast, then start to get just a hair slower (searching through a few things in a chain as opposed to one), and then every now and then you'd see an add that caused a big reshuffle of the table, which is a totally linear operation in the number of elements, and then adds would be fast again for another big clump of inserts until the same expansion was triggered by the table growing even more. But divide that cost over the total number of inserts: if you had 1,000 cheap inserts before you overloaded, and then had to copy all 1,000 things to get ready for the next thousand, that's roughly 2,000 units of work spread over 1,000 inserts, about two per insert, so averaged out it still counts as a constant-time operation. That's pretty neat.
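If you want to check that averaging argument yourself, here is a small back-of-the-envelope simulation (the numbers are made up; it just counts one unit of work per insert plus one unit per entry moved during each rebuild):

    #include <iostream>

    int main() {
        long work = 0;
        int buckets = 100;                    // assumed initial plan
        const int inserts = 1000000;
        for (int n = 1; n <= inserts; n++) {
            work += 1;                        // the cheap insert itself
            if (n > 2 * buckets) {            // load factor passed two: rehash
                work += n;                    // move every entry once
                buckets *= 2;                 // double the bucket array
            }
        }
        // Prints a small constant (around 2 or 3), not something growing with n.
        std::cout << "average work per insert: "
                  << double(work) / inserts << std::endl;
        return 0;
    }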
It's really one of the more easily understood and beautiful algorithmic techniques for solving this hard problem of search and retrieval that the sorted vector and the BST are also trying to get at. In fact, the sorted vector and the BST are solving a much harder problem: keeping a total ordering of all the data and being able to traverse it in sorted order. That's an advantage the sorted vector and the BST have that the hash table doesn't have at all; the hash table has jumbled everything up. If you think about how the iterator for the Map works (our Map is actually implemented using a hash table), it just iterates over the linked lists bucket by bucket, and that explains why the order appears almost completely random. If you put a bunch of things in the table and then go to iterate and visit them again, the order you visit the keys seems to make no sense: it's based on their hash codes, which aren't designed to be any sensible ordering to anybody not in the know of how the hash function works. Iterating over a sorted vector, or over a BST (our Set, for example, is backed by a binary search tree), will give you that sorted order.

So those structures solve a harder problem, which is why they require more investment, whereas hashing solves exactly one problem: I want to be able to find exactly this value again and update it, and nothing more than identifying this one key is needed. For example, if you needed to find all the values whose keys come alphabetically before this one, the hash table makes that no easier than an unsorted vector, whereas a sorted vector or a binary search tree actually enables that kind of search: you could find the place in the tree and work your way down the left side to find the things that are less. That's not solved by the hash table at all.
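You can see that jumbling for yourself in standard C++ (this uses the standard library's std::unordered_map, a chained hash table much like the one described here, and the tree-backed std::map, not our class library's Map):

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> hashed =
            {{"apple", 1}, {"pear", 2}, {"plum", 3}};
        for (const auto& entry : hashed) {
            std::cout << entry.first << " ";   // hash-code order: looks random
        }
        std::cout << std::endl;

        std::map<std::string, int> ordered =
            {{"apple", 1}, {"pear", 2}, {"plum", 3}};
        for (const auto& entry : ordered) {
            std::cout << entry.first << " ";   // tree order: apple pear plum
        }
        std::cout << std::endl;
        return 0;
    }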
The hash table's use of memory is comparable to a binary search tree's: there's a four-byte pointer per entry (the next field on the linked-list chain), and then a four-byte pointer for each bucket (the head of the chain). Given that we expect the number of buckets to be roughly equal to the number of entries, we can summarize that as an eight-byte overhead per entry, the same eight-byte overhead that the left and right pointers of the binary search tree add.

Does it have any degenerate cases? That's a good question. One of the things that made the binary search tree a heavy investment was that to get the balancing behavior, even when data comes in an order that causes lopsided construction, we have to go out of our way to do something special. What's the degenerate case for hashing? Is there something that causes it to behave really, really badly? Does it always give a good spread, or can we end up with everything in the same bucket somehow?

Student: What if an element is repeated?

If you insert the same key again, the way both versions of the map work is that they overwrite: you find that same cell and replace its value. So we don't get clustering from the same entry going in and out. But how could we end up with everything in the same bucket? It mostly comes down to the hash function: when would a hash function collide? If you have a dumb hash function, you can definitely hit degenerate cases, my dumb function of "return zero" being the worst example, but any hash function that isn't particularly clever about using all of its information can produce some clustering. It's even possible, with a perfectly good hash function, for two strings to hash to the same thing; if you got really, really unlucky, or had a hash function that wasn't doing the best job possible, you could get some clustering. In general, though, the sole responsibility for the degenerate case comes down to your hash function. As long as your hash function and your inputs match your expectations, you don't get surprises based on the order things happened to be inserted, the way you did with a binary search tree, which is sometimes hard to control.
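That "dumb hash function" really is the whole degenerate story, and it's worth seeing how innocent it looks. A sketch, matching the (key, bucket count) shape used above:

    #include <string>

    // Perfectly legal, catastrophically bad: every key maps to bucket 0,
    // so the table degrades to one long chain and lookup becomes linear.
    int dumbHash(const std::string& key, int numBuckets) {
        return 0;   // ignores all the information in the key
    }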
Student: Just as you could underestimate the number of buckets you need, could you overestimate them by a large amount, wasting a lot of memory? And do you then go through and take that memory back?

Yeah, that's a great question. A lot of data structures don't shrink. The vector, for example, typically only grows on demand, and even if you remove a bunch of elements it doesn't necessarily consider shrinking a big deal. Maybe that's a little bit lazy, but it turns out that having a bunch of memory around that you're not using doesn't carry nearly the same penalty as not having the memory when you need it and having to do a lot of extra work to find things. So typically you'd pick a size you're willing to commit to and say, "No matter what, I'll stay at this size; I won't get any smaller." As you grow beyond it, you might not shrink back; you might just let the extra capacity lie in waste. There are good reasons to be tidy about it, but some of this is hard to predict. It might be that your table grows big, then a bunch of things get taken out, then it grows big again. Maybe you take in a new freshman class and then graduate a whole bunch of people, and right when you graduate them all, you're about ready to take in a new freshman class; the capacity you just got rid of, you're planning on reusing in a minute anyway, so clearing it out and totally releasing it may be an exercise that's unimportant, since you're planning on getting back to that size eventually. The truth is that most hash tables tend only to grow; you tend to keep collecting data in there, and it's unusual to take back out a bunch of entries you already put in.

One last note. I said earlier that for our Map the key had to be of string type, and that at some point I'd come back and talk about why that's the one restriction in all our template classes. You can put anything you want in a Set or Vector or Stack or Queue, and the Map lets you store any kind of value associated with a key, but the key has to be a string. Given that our Map is implemented as a hash table, that starts to make sense: if you're going to take the key, jumble it up, and map it to a bucket, you need to know something about how to extract information from the key. Knowing it's a string means you know it has characters, you know its length, and you can work with those. If you wanted a map that could take any kind of key (integers as keys, student structures as keys, doubles as keys), how could you write a generic form that can run a hashing operation on that kind of key and map it to bucket numbers?
You would build a two-type template to get this working in C++: a template that has one type name for the key type and another for the value type, and whose member functions refer to the key type and value type as two distinct placeholders. When the client sets it up, they'd say, "Okay, I want a map from strings to integers" (maybe words and the pages they appear on in a document), or "a map from integers to vectors of strings" (maybe exam scores and the names of the students who got each score). To make that work, there has to be some kind of universal hashing that, given a value of the generic key type, can turn it into an integer in the range of bucket numbers. And what that requires is that the client get involved. You cannot write one generic hash function that works for all types: if the key is a student structure, what are you going to look at, the ID field, the name fields? There's just no sensible way to take a fully generic type and know how to hash it. So you'd use a client callback: probably you'd construct the map by passing in a callback function that, given a key-type value and a number of buckets, does the necessary machinations to hash it to the right number. That becomes one of those coordination points between the client and the implementer: the client knows what data it's trying to store and how to mash it up, and the map knows, given that number, where to stick the entry. That's what it would take, and rather than load all of that onto your novice heads, we went ahead and said, "The key will always be a string," so we can hash internally without you having to get involved.
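As a sketch of that shape (this is not the class library's actual interface; MyMap, kNumBuckets, and hashInt are names made up for illustration), the two placeholders plus the client callback might look like this:

    #include <string>

    // Two distinct placeholders: one for the key type, one for the value type.
    template <typename KeyType, typename ValueType>
    class MyMap {
    public:
        // The client supplies the callback: (key, numBuckets) -> bucket index.
        explicit MyMap(int (*hashFn)(const KeyType&, int)) : hash(hashFn) {}

        void put(const KeyType& key, const ValueType& value) {
            int bucket = hash(key, kNumBuckets);   // only the client knows how
            // ... insert (key, value) into that bucket's chain (elided) ...
        }

    private:
        static const int kNumBuckets = 99;
        int (*hash)(const KeyType&, int);
    };

    // An example client callback for integer keys:
    int hashInt(const int& key, int numBuckets) {
        return (key % numBuckets + numBuckets) % numBuckets;  // keep it non-negative
    }

A client would then write something like MyMap<int, std::string> byScore(hashInt); and the map never has to understand what an int means; it just trusts the callback to land in range.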
All right, any questions about hashing? Let me give you a two-second talk about Set, because it's the one ADT where I never said, "Well, how does it work?" It doesn't add anything new we haven't already done, so I'll just tell you what it does and your picture will be complete. It wants fast searching and fast updating (add and remove), and it also has the high-level operations, such as intersect, union, and difference, as its primary set of things to do.

A bunch of the things we've already seen would be reasonable strategies for implementing Set. You could use some kind of vector or array; most likely you'd keep it in sorted order, because that buys you fast lookup for contains, but add and remove would necessarily be slow because of the contiguous memory. The linked list isn't really a good option here, because contains becomes linear, and add and remove are similarly linear, so it gets you nowhere; it just adds the memory for pointers and gunk like that. The binary search tree is probably a pretty good call: it meshes the advantage of a dynamic structure with the sorted property, so adding, removing, and searching can all be done in logarithmic time if you have balancing. It also enables the high-level ops: for union, intersection, and difference you can walk over both trees in sorted order, which is very easily done with an in-order traversal, and that makes those operations more efficient to implement than having to deal with data coming at you in any old order (see the merge sketch below).

It turns out hashing won't quite do the job easily here, because of the same template situation: you'd have to have a hash function that can hash these keys and do the right thing, whereas the tree only needs the client to give you ordering information, and then you can do all the hard work without pushing hashing onto them. So the way our Set really does work is that it uses a binary search tree, through another layer of abstraction: there's a BST class we've never encountered directly that takes the idea of a binary search tree, adds balancing and all the operations for finding, inserting, and removing, and packages it up. Set then turns out to be one-liners making calls into that BST class; that's where all the heavy lifting goes.
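To see why walking both trees in sorted order helps, here is a minimal sketch under an assumption: each set has been flattened to a sorted vector, which is exactly what an in-order traversal produces. Union then becomes one linear merge pass rather than repeated searches:

    #include <vector>

    // a and b are sorted and duplicate-free (as in-order tree walks yield);
    // the result is their union, built in a single linear pass.
    std::vector<int> setUnion(const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> result;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j]) {
                result.push_back(a[i++]);
            } else if (b[j] < a[i]) {
                result.push_back(b[j++]);
            } else {                        // present in both: keep one copy
                result.push_back(a[i]);
                i++;
                j++;
            }
        }
        while (i < a.size()) result.push_back(a[i++]);  // leftovers from a
        while (j < b.size()) result.push_back(b[j++]);  // leftovers from b
        return result;
    }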
That completes the big picture: our Map uses a hash table; our Set uses the BST class, which is a binary search tree; our Vector uses an array; the Stack and Queue each have two viable strategies, an array or vector-based backing as well as a linked list, both giving good time performance; and then you saw with the PQ yet another structure, the heap, which is a variant of the tree. Those tend to be the really classic structures out of which a lot of things get built, and having the chance to use them both as a client and to dig around as an implementer gives you the big picture of some of the more common ways that cool data structures get built.

So that's all for Friday. We're going to do some more good stuff next week. I'm going to the Terman cafe in a little bit; if you've got some time, please come. Have a good weekend, and work hard on Pathfinder.