English subtitles

← Hash Tables - Intro to Computer Science

Get Embed Code
3 Languages

Showing Revision 5 created 05/24/2016 by Udacity Robot.

  1. >>So we had a lot of great questions
  2. about hash tables on the forums
  3. and one example is Student Baracha asks
  4. "How does Python decide how many buckets
  5. a dictionary has as it grows?"
  6. >>So this is an important question, and
  7. there are lots and lots of interesting things about hash tables
  8. that we did not get into in Unit 5.
  9. If memory was free and cheap, and equally fast
  10. no matter how much you needed,
  11. then you'd want the hash table to be as big as possible, right?
  12. You'd want billions of buckets in your hash tables
  13. so they'd never have to have more than one entry in a bucket.
  14. But as we saw in Unit 3, memory can be expensive.
  15. And the faster and the closer to the processor the memory is,
  16. the more expensive it is.
  17. So you have a very limited amount of that,
  18. there's a reason to try to keep the hash tables small as well.
  19. So this is a tough trade-off.
  20. Good hash table implementations
  21. try to make this trade-off in a way that gives you the right balance
  22. of performance and memory use.
  23. They do this by looking at the load factor.
  24. We actually used this in Unit 4.
  25. This is just the number of entries divided by the number of buckets.
  26. We had that question in Unit 5. The question where it was "N divided by B"
  27. where you're looking at the impact of the change of the number of buckets and the number of entries.
  28. One thing you have to worry about when you do that
  29. is if you're just looking at the number of keywords and the number of buckets,
  30. that's the average size. What matters in many applications
  31. is more the worst size. Even if the average size is fairly small,
  32. the worst size could be much larger than that,
  33. and if the lookup for the worst case entry
  34. starts to get very bad, then you want a larger number of buckets
  35. or to change your hash function in some way.
  36. So, for a typical hash table implementation,
  37. the goal is to keep the load factor actually below 1 usually.
  38. For Python's dictionary implementation,
  39. if the number of keywords exceeds about 2/3,
  40. and I think it's actually exactly 2/3 of the size of the table,
  41. that's the point where the table is resized.
  42. So the table would then be doubled, and that changes
  43. the bucket each word would appear in,
  44. because we saw that the result of the hash function
  45. depends on the number of buckets you have.
  46. So you'd have to copy things into the new hash table
  47. that has more space, but that's going to make the look-ups faster.
  48. So this means that if you had about a million entries in your hash table,
  49. you would expect to have about a million and a half buckets.
  50. But as you increase above that 2/3 threshold,
  51. then you'd double the size of the buckets.
  52. So you'd end up with 3 million buckets,
  53. if you added one more entry beyond that.
  54. So if you compare this to what we did in Unit 5,
  55. you might be surprised that it's so low.
  56. We were doing example hash tables,
  57. where the number of buckets was very small,
  58. and each bucket had many entries in it.
  59. This is partly to be easier to see what's going on,
  60. because if you saw a hash table with thousands of
  61. empty buckets, that would be kind of hard to print out.
  62. But the other reason for that is the way we implemented
  63. the hash table in Unit 5 was each bucket was a list.
  64. And with a list, that's a fairly expensive data structure.
  65. You've got to create all those empty lists to
  66. create your hash table.
  67. The way the Python dictionary is implemented,
  68. there's just one flat list.
  69. That means there's one space for each
  70. hash value, that if you hashed to a given bucket
  71. there's only one space there.
  72. That means if two things hash to the same bucket,
  73. you've got to do something else.
  74. And what happens with the Python dictionary implementation
  75. is you look for another free place to put it
  76. and you have a way for deciding where you look for the next one
  77. if the first one is full.
  78. This makes look-ups and adding to the table more complicated,
  79. so that's why we didn't do it.
  80. But it means you're using less memory,
  81. because you don't have all those empty lists
  82. for all the empty buckets.
  83. There's a great chapter in a book called "Beautiful Code"
  84. that's all about the Python dictionary implementation.
  85. So if you're interested in how that's actually done,
  86. I encourage you to look at that chapter.