English subtitles

Finishing the Web Crawler - Intro to Computer Science

Showing Revision 5 created 05/24/2016 by Udacity Robot.

  1. So now we're ready to finish the Web Crawler. And remember
  2. what we want the Web Crawler to do. So, we have
  3. a seed page, and we're assuming we know some seed page
  4. to start with. And the seed page has some links on it.
  5. We want to be able to find those links, and we
  6. know how to do that now. We're going to get them into a
  7. list, and then we're going to follow those links. So we'll follow
  8. the links to new pages, and those new pages might also have
  9. links, and we want to follow those links. So
  10. in order to do this, we need to think about
  11. two things. We need to keep track of all
  12. the pages that we're waiting to crawl, and we'll introduce
  13. a variable, tocrawl, to do that. And what tocrawl
  14. will be is a list of pages left to crawl.
  15. So, initially it'll just be the seed page. Once
  16. we collect the links on the seed page, well, it
  17. will include those links as well. Once we finish crawling a
  18. page, we don't want to keep it in tocrawl anymore. And
  19. as we find new pages to crawl, they'll be added to
  20. the tocrawl list. The other variable we want is to keep track
  21. of all the pages that we've crawled. At the end of
  22. our crawl, this is the result. We want to know all the pages
  23. that we found. That will be stored in the variable we'll
  24. call crawled. So let's walk through an example of how this should
  25. work on the sample site.
  26. So I'll make the seed page, www.udacity.com/cs101x/index.html, that's
  27. this page here. That means when we start
  28. to crawl, we want tocrawl to be this
  29. index page. And I'm going to stop writing out the
  30. full URLs, just writing out the final part,
  31. because all the pages we crawl will be
  32. on our test site. So tocrawl will be
  33. the list containing just one element, the index.html page.
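The starting state of the crawl can be sketched in Python as follows. This is only an illustration: it assumes plain Python lists and uses the shortened page name from the walkthrough rather than the full URL.

```python
# Starting state of the crawl, before any page has been visited.
tocrawl = ["index.html"]  # pages still waiting to be crawled: just the seed page
crawled = []              # pages already crawled: nothing yet

print(tocrawl, crawled)
```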
  34. We haven't crawled anything yet, we're just getting started, so
  35. crawled will start out as the empty list. The next
  36. thing we're going to do is crawl this page, so
  37. we'll get all the links on this page. That means
  38. we've crawled the index page, so that will now be
  39. added to crawled. But when we looked at the links
  40. on the index page, we found three new links on
  41. that page. We found a link here, which goes to crawling.html.
  42. We found the link here, which goes to walking.html.
  43. And we found the link here that goes to flying.html.
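A minimal sketch of this first step, assuming plain Python lists. The `links` list is written out by hand here to stand in for whatever the link-finding code would return for the index page.

```python
tocrawl = ["index.html"]
crawled = []

page = tocrawl.pop()  # take index.html off the list of pages to crawl
# Hand-written stand-in for the links found on the index page.
links = ["crawling.html", "walking.html", "flying.html"]

crawled.append(page)   # index.html has now been crawled
tocrawl.extend(links)  # its three links are waiting to be crawled

print(crawled)
print(tocrawl)
```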
  44. So the new value of tocrawl, after crawling this page,
  45. will have those three links in it. The next thing
  46. we want to do is take one of those links
  47. and crawl it. The order actually matters a lot in
  48. terms of getting a good crawl. Let's assume for now
  49. that we'll just do the last one first. So,
  50. we'll do the link fly. That links to
  51. the flying page. Here is the page there. So,
  52. we are going to crawl the page flying.html. This
  53. page doesn't have any links. If you're not sure
  54. why squeamish ossifrage are the magic words, I
  55. would encourage you to DuckDuckGo or Google that. And
  56. now we've finished crawling flying, so that's going to
  57. be added to the crawled list, which already had
  58. index.html, we don't lose that. We're going to add the
  59. new one, which is flying, to that list. And
  60. we finished crawling it so we don't want to crawl
  61. it again. Let's remove it from the tocrawl list. Now,
  62. after we've crawled flying, we have two more links
  63. left in our tocrawl list. We have two links
  64. that we've crawled. So let's try another link. Let's
  65. suppose we follow the crawling.html link. When we follow crawling,
  66. we get to this page. So, to follow crawling, we're
  67. going to follow the same algorithm we did with flying. Alright, so
  68. that removes this link from the tocrawl list, adds it
  69. to the crawled list. So, we're done crawling crawling.html. And now
  70. we want to add to our tocrawl list all the
  71. links that we find on this page. Well, we found the
  72. link, kicking, which goes to the page, kicking.html. So we're going to
  73. add that to our list of pages to crawl, and now
  74. we keep going. And we're going to keep going.
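The crawling.html step can be sketched the same way. The starting values below are the state after index and flying have been crawled; using `remove` is an assumption made here because crawling.html is not the last element of the list.

```python
# State after index.html and flying.html have been crawled.
crawled = ["index.html", "flying.html"]
tocrawl = ["crawling.html", "walking.html"]

page = "crawling.html"
tocrawl.remove(page)            # it is being crawled, so it leaves tocrawl
crawled.append(page)            # and joins the crawled list
tocrawl.append("kicking.html")  # the one link found on crawling.html

print(crawled)
print(tocrawl)
```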
  75. We'll follow kicking. We find that kicking does not
  76. have any links. So that would add kicking
  77. to the crawled list and remove it from the
  78. tocrawl list. And we're going to keep going
  79. until we have no more pages to crawl. So
  80. let me describe that process a little more formally
  81. and then I'll ask you a question about it.
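Before the more formal description, here is a rough Python sketch of the whole walkthrough. It is not the course's final crawler. The dictionary `SAMPLE_LINKS` and the function name `crawl_sample` are hypothetical and stand in for fetching real pages, and the links on walking.html are not given in the walkthrough, so an empty list is assumed. This hypothetical link map has no page that links back to an earlier one, which is why the loop is sure to finish.

```python
# Hypothetical link map standing in for fetching pages from the test site.
# The walkthrough does not say what walking.html links to; no links is assumed.
SAMPLE_LINKS = {
    "index.html": ["crawling.html", "walking.html", "flying.html"],
    "crawling.html": ["kicking.html"],
    "walking.html": [],
    "flying.html": [],
    "kicking.html": [],
}


def crawl_sample(seed):
    """Crawl the hypothetical sample site, always taking the last page first."""
    tocrawl = [seed]  # pages left to crawl
    crawled = []      # pages crawled so far
    while tocrawl:    # keep going until there are no more pages to crawl
        page = tocrawl.pop()                # remove the page from tocrawl
        crawled.append(page)                # record that it has been crawled
        tocrawl.extend(SAMPLE_LINKS[page])  # queue the links found on it
    return crawled


print(crawl_sample("index.html"))
```

Because this sketch always takes the last page, it reaches walking.html before crawling.html, whereas the walkthrough picks crawling.html next. The set of pages found is the same either way.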