English subtitles

← Making Lookup Faster - Intro to Computer Science

Get Embed Code
2 Languages

Showing Revision 5 created 05/24/2016 by Udacity Robot.

  1. So what do we need to do to make lookup faster?
  2. Well let's think about why it was so slow, right? The reason
  3. it was so slow is that we were doing this for loop,
  4. we were going through all the elements in order, and we're checking
  5. if they match the keyword, right? And we had to do this
  6. for the entire index, for an entry, for a keyword that's not
  7. in the index, To determine that it's not there, we had to
  8. go through the whole index. This is not how we use indexes
  9. in real life, right? If you're looking for a word in the
  10. index of a book, you don't have to look through every single entry
  11. to see if that word exists. You can jump around, and the
  12. reason that you can jump around is because the entries in the index,
  13. are sorted, they're sorted in alphabetical order, so you know where that
  14. entry would belong. You just need to find the right place and see
  15. if it's there. So we can do that with our index instead
  16. of having our index kept in arbitrary order. If we kept our index
  17. in assorted order, then we could find the place where
  18. that entry belongs and look for it. Sorting is a very
  19. interesting problem. It's something we're not actually going to talk about more
  20. in this class. We're going to do a different way of doing
  21. that. What we're going to do is find a way to
  22. find where the entry should be that doesn't require actually keeping
  23. all the entries sorted. What we want is something that will
  24. allow us, given a keyword, we're going to have some function that
  25. tells us where it belongs. We're going to call that
  26. a hash function. That tells us where in the entry
  27. to look. And so instead of having to look through
  28. the whole index, the hash function will tell us where
  29. that entry belongs. So what we need for this is
  30. some function that's going to take a key word, map
  31. it to a number And that number is the position
  32. in the index where that number belongs. We could do this
  33. lots of different ways. One simple thing would be to
  34. think, well we know alphabet. This is more like the way
  35. an index for a book would work, and we're going to
  36. have for each entry and index, we'll have based on the
  37. first letter, we'll put all the entries that start with
  38. tha tfirst letter in the same place. So, if we're looking
  39. for a key word that starts with u, that prefer hash
  40. would tell us to look in the place where all the
  41. words that start with u are. And then we'd only have to
  42. look through the words that start with u. So this would allow
  43. us to do a look up much more quickly than looking through
  44. the whole index. This isn't quite the best way to do things. If
  45. we made our places based on the letter, well, then we have
  46. a problem if we have two words with the same first letter. You
  47. generally expect to have more than one word that starts with the
  48. same letter. So instead of having just an element here for each position,
  49. we're going to have a list of elements that would be all
  50. the words that start with u So when we look up the
  51. word udacity, we look in the entry for u, and if the
  52. word that's there doesn't match then we know udacity isn't in the
  53. index. There are lots of problems with this. The first problem is
  54. well, there might be more than one word that starts with u.
  55. So we can't just have one entry here. What we need to
  56. have is a list of entries We often call this a bucket.
  57. So we need a bucket of all the entries that
  58. start with u that would be in this position. So, instead
  59. of just having one entry like the old structure of our
  60. index, now we're going to have a list of entries, and
  61. each element of the index will now be a bucket, which
  62. is a list of entries that are in the right position.
  63. This is going to be our bucket of our, all the entries
  64. that start with u, and that would have all the different
  65. entries that start with the letter u in that bucket. So this
  66. is getting better. Now, for each look up, instead of having to
  67. look through all of the words in index, we just need to
  68. find the position that starts with the right letter. That's got a
  69. bucket of all the words that start with that letter, and then
  70. we just need to look through that bucket. This works okay, but
  71. this doesn't really scale very well. At best, if we have you
  72. know, ten million words well now instead of having ten million entries
  73. to go through, we need to go through ten million divided
  74. by say 26, if we have 26 letters. It's not making
  75. things much faster. It's making things maybe at best, 26 times
  76. letter. That assumes that all of the buckets are the same size.
  77. Certainly if we make the buckets based on the first letter,
  78. that's not going to be the same size. If the words
  79. are a typical English words. We're going to have many more
  80. words that start with s or t, say than start with u.
  81. So, we want to fix those two problems. We want to
  82. be able to have more buckets. So we're not going to just
  83. use the first letter, we're going to use some function on the
  84. whole word that tells us where it belongs. And we're going to try
  85. to make that function distribute the words fairly well. So the
  86. structure that I've described is what's called a hash table. This
  87. is a very useful data structure. It's so useful that it's
  88. built into Python. There's the Python type called a dictionary. Which provides
  89. this functionality. At the end of today's unit, I'll explain
  90. how the Python dictionary works, and how to use it,
  91. and we'll modify the search engine code to use dictionary
  92. instead of the lookup table that we built, but before
  93. we do that, we are going to implement it ourselves. We
  94. are going to make sure that we understand how the hash
  95. table works by writing all the code to do it
  96. ourselves and then we'll switch to using the built-in Python [INAUDIBLE]