
Title:
Hash Tables - Intro to Computer Science

Description:

>>So we had a lot of great questions

about hash tables on the forums

and one example is Student Baracha asks

"How does Python decide how many buckets

a dictionary has as it grows?"

>>So this is an important question, and

there are lots and lots of interesting things about hash tables

that we did not get into in Unit 5.

If memory were free and cheap, and equally fast

no matter how much you needed,

then you'd want the hash table to be as big as possible, right?

You'd want billions of buckets in your hash tables

so they'd never have to have more than one entry in a bucket.

But as we saw in Unit 3, memory can be expensive.

And the faster and the closer to the processor the memory is,

the more expensive it is.

Since you have a very limited amount of that,

there's a reason to try to keep your hash tables small as well.

So this is a tough tradeoff.

Good hash table implementations

try to make this tradeoff in a way that gives you the right balance

of performance and memory use.

They do this by looking at the load factor.

We actually used this in Unit 4.

This is just the number of entries divided by the number of buckets.

We had a question about that in Unit 5, the one about "N divided by B,"

where you looked at the impact of changing the number of buckets and the number of entries.
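As a quick sketch in Python (with made-up numbers, not from the course), the load factor is just that ratio:

```python
# Load factor = number of entries (N) divided by number of buckets (B).
def load_factor(n_entries, n_buckets):
    return n_entries / n_buckets

# For example, 6 entries spread across 4 buckets:
print(load_factor(6, 4))  # → 1.5
```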

One thing you have to worry about when you do that

is that if you're just looking at the number of keys divided by the number of buckets,

that's the average bucket size. What matters in many applications

is more the worst-case size. Even if the average size is fairly small,

the worst size could be much larger than that,

and if the lookup for the worst case entry

starts to get very bad, then you want a larger number of buckets

or to change your hash function in some way.
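To see the difference, here is a small made-up example: the average bucket size can look fine while one bucket is much worse.

```python
# A hypothetical chained table: 6 entries in 4 buckets.
# The keys shown are placeholders, not real hash results.
buckets = [['a', 'b', 'c', 'd'], [], ['e'], ['f']]

average_size = sum(len(b) for b in buckets) / len(buckets)
worst_size = max(len(b) for b in buckets)

print(average_size)  # → 1.5 (looks fine)
print(worst_size)    # → 4   (a lookup here scans 4 entries)
```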

So, for a typical hash table implementation,

the goal is usually to keep the load factor below 1.

For Python's dictionary implementation,

if the number of entries exceeds about 2/3,

and I think it's actually exactly 2/3 of the size of the table,

that's the point where the table is resized.

So the table would then be doubled, and that changes

the bucket each word would appear in,

because we saw that the result of the hash function

depends on the number of buckets you have.

So you'd have to copy things into the new hash table

that has more space, but that's going to make the lookups faster.
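You can see why everything has to move with a one-line sketch, using Python's built-in hash() reduced modulo the bucket count (CPython actually uses bit masking, but the effect is the same: the bucket depends on the table size):

```python
def bucket_index(key, n_buckets):
    # Which bucket a key lands in depends on how many buckets there are.
    return hash(key) % n_buckets

# After doubling from 8 to 16 buckets, many keys land somewhere new,
# so every entry has to be re-inserted into the bigger table.
key = 'udacity'
old_bucket = bucket_index(key, 8)
new_bucket = bucket_index(key, 16)
```

(No expected output is shown because string hashes are randomized per run in modern Python.)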

So this means that if you had about a million entries in your hash table,

you would expect to have about a million and a half buckets.

But as you increase above that 2/3 threshold,

then you'd double the number of buckets.

So you'd end up with 3 million buckets,

if you added one more entry beyond that.
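We can check that arithmetic with a tiny sketch of the 2/3 rule (the exact CPython growth policy varies across versions; this only tests the threshold):

```python
def needs_resize(n_entries, n_buckets):
    # Resize once the load factor goes above 2/3.
    return n_entries / n_buckets > 2 / 3

# A million entries in 1.5 million buckets sits exactly at 2/3:
print(needs_resize(1_000_000, 1_500_000))  # → False
# One more entry crosses the threshold and triggers the doubling:
print(needs_resize(1_000_001, 1_500_000))  # → True
```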

So if you compare this to what we did in Unit 5,

you might be surprised that it's so low.

We were doing example hash tables,

where the number of buckets was very small,

and each bucket had many entries in it.

This was partly to make it easier to see what's going on,

because if you saw a hash table with thousands of

empty buckets, that would be kind of hard to print out.

But the other reason for that is the way we implemented

the hash table in Unit 5 was that each bucket was a list.

And a list is a fairly expensive data structure.

You've got to create all those empty lists to

create your hash table.
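Roughly, that Unit 5-style table looks like this sketch (the function names here are mine, not necessarily the course's):

```python
# Chained hash table: one Python list per bucket,
# so even an empty table pays for all those empty lists.
def make_table(n_buckets):
    return [[] for _ in range(n_buckets)]

def add_entry(table, key, value):
    bucket = table[hash(key) % len(table)]
    bucket.append([key, value])

def lookup(table, key):
    bucket = table[hash(key) % len(table)]
    for entry in bucket:
        if entry[0] == key:
            return entry[1]
    return None

table = make_table(5)
add_entry(table, 'python', 1991)
print(lookup(table, 'python'))  # → 1991
```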

The way the Python dictionary is implemented,

there's just one flat list.

That means there's one slot for each

hash value: if a key hashes to a given bucket,

there's only one space there.

That means if two things hash to the same bucket,

you've got to do something else.

And what happens with the Python dictionary implementation

is you look for another free place to put it,

and you have a way of deciding where to look next

if the first one is full.

This makes lookups and adding to the table more complicated,

so that's why we didn't do it.
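Here is a minimal sketch of that idea, open addressing with linear probing. CPython's actual probe sequence is more elaborate, but the flat-list structure is the same:

```python
EMPTY = None  # sentinel marking a free slot

def probe_insert(slots, key, value):
    i = hash(key) % len(slots)
    # On a collision, step forward until we find a free slot
    # (or a slot already holding this same key).
    while slots[i] is not EMPTY and slots[i][0] != key:
        i = (i + 1) % len(slots)
    slots[i] = (key, value)

def probe_lookup(slots, key):
    i = hash(key) % len(slots)
    while slots[i] is not EMPTY:
        if slots[i][0] == key:
            return slots[i][1]
        i = (i + 1) % len(slots)
    return None

slots = [EMPTY] * 8  # one flat list, one space per slot
probe_insert(slots, 'a', 1)
probe_insert(slots, 'b', 2)
print(probe_lookup(slots, 'a'))  # → 1
```

This sketch assumes the table is never completely full, which the resizing policy above guarantees in practice.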

But it means you're using less memory,

because you don't have all those empty lists

for all the empty buckets.

There's a great chapter in a book called "Beautiful Code"

that's all about the Python dictionary implementation.

So if you're interested in how that's actually done,

I encourage you to look at that chapter.