
Finishing the Web Crawler - Intro to Computer Science


Showing Revision 6 created 05/24/2016 by Udacity Robot.

  1. So let's remember the code we had at the end of unit 2 for crawling the web.
  2. So we used two variables. We initialized "tocrawl" to the seed, a list containing just the seed,
  3. and we're going to use "tocrawl" to keep track of the pages to crawl.
  4. We initialized "crawled" to the empty list, and we're keeping track of the pages we found using "crawled."
  5. Then we had a loop that would continue as long as there were pages left to crawl.
  6. We'd pop the last page off the "tocrawl" list.
  7. If it's not already crawled, then we'll union into "tocrawl" all the links that we can find on that page,
  8. and then we'll add that page to the list of pages we've already crawled.
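The loop described above can be sketched like this. This is a minimal recap, not the course's exact file: `get_page` and `get_all_links` are stubbed here with a tiny in-memory sample web (the `CACHE` dictionary and its URLs are hypothetical).

```python
# Tiny in-memory "web" standing in for real pages (hypothetical sample data).
CACHE = {
    'http://a.example': '<a href="http://b.example">link</a> words on page a',
    'http://b.example': '<a href="http://a.example">link</a> words on page b',
}

def get_page(url):
    # Stub: return the page source, or '' for an unknown URL.
    # A real version would issue a web request.
    return CACHE.get(url, '')

def get_all_links(page):
    # Stub: collect every <a href="..."> target in the page source.
    links = []
    while True:
        start = page.find('<a href="')
        if start == -1:
            return links
        end = page.find('"', start + 9)
        links.append(page[start + 9:end])
        page = page[end:]

def union(a, b):
    # Add each element of b to a, if it is not already in a.
    for e in b:
        if e not in a:
            a.append(e)

def crawl_web(seed):
    tocrawl = [seed]   # pages we still need to crawl
    crawled = []       # pages we have already crawled
    while tocrawl:
        page = tocrawl.pop()   # take the last page off the tocrawl list
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
```

Starting from `'http://a.example'`, this version returns the list of URLs it found, which is exactly what we are about to change.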
  9. So now we want to figure out how to change this so instead of just finding all the URLs,
  10. we're building up our index.
  11. We're looking at the actual content of the pages, and we're adding it to our index.
  12. So the first change to make, we're updating the index, and we're going to change the return result.
  13. So instead of returning "crawled" what we want to return at the end is the index.
  14. If we wanted to keep track of all the URLs crawled, we could return both "crawled" and "index,"
  15. but let's keep things simple and just return "index."
  16. That's what we really want for being able to respond to search queries.
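Those two edits can be pictured as a skeleton like this. The index-update step in the middle is still missing at this point, and `get_page`/`get_all_links` are reduced to trivial stubs just so the shape of the change is visible.

```python
def get_page(url):
    return ''    # stub; a real version would fetch the page source

def get_all_links(page):
    return []    # stub; a real version would extract the links

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []                 # new: the index we are building up
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # ... update index from the page's content here (still to come) ...
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return index               # changed: return the index, not "crawled"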
  17. So now we have one other change to make, and this is the important one.
  18. We need to find a way to update the index to reflect all the words that are found on the page that we've just crawled.
  19. I'm going to make one change before we do that.
  20. Since both "get_all_links" and what we need to do to add the words to the index depend on the page,
  21. let's introduce a new variable and store the content of the page in that variable.
  22. This will save us from having to call "get_page" twice; "get_page" is fairly expensive.
  23. It requires a web request to get the content of the page.
  24. It makes a lot more sense to store that in a new variable, and that will simplify this code.
  25. So now we just need to pass in content.
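The content-variable refactor might look like this. Here `get_page` is a stub that records each call in a `FETCHES` list (an illustration device, not part of the course code) so we can see the page is fetched only once; the index-update step is still left as a placeholder comment.

```python
FETCHES = []   # records each get_page call, to show one fetch per page

def get_page(url):
    # Stub; a real version would issue a web request for the page source.
    FETCHES.append(url)
    return ''

def get_all_links(page):
    return []    # stub; a real version would extract the links

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)   # fetch the page source once
            # ... add the words in content to index here ...
            union(tocrawl, get_all_links(content))   # reuse content; no second fetch
            crawled.append(page)
    return index
```

Because `content` is reused for both the link extraction and the (upcoming) index update, each page costs one web request instead of two.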
  26. So we have one missing statement,
  27. and I'll leave it to you to see if you can figure out what statement we need there to finish the web crawler.
  28. When it's done, the result of "crawl_web," what we return as "index"
  29. should be an index of all the content we find starting from the seed.
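One possible completion, sketched end to end: the missing statement adds every word of the fetched content to the index. The `add_to_index`/`add_page_to_index`/`lookup` helpers follow the index structure built earlier in the unit (a list of `[keyword, [urls]]` entries), `get_page`/`get_all_links` are stubbed against a tiny hypothetical sample web, and the content is indexed as-is, raw tags included.

```python
CACHE = {
    'http://a.example': '<a href="http://b.example">next</a> hello crawler',
    'http://b.example': 'hello again',
}

def get_page(url):
    # Stub; a real version would issue a web request.
    return CACHE.get(url, '')

def get_all_links(page):
    # Collect every <a href="..."> target in the page source.
    links = []
    while True:
        start = page.find('<a href="')
        if start == -1:
            return links
        end = page.find('"', start + 9)
        links.append(page[start + 9:end])
        page = page[end:]

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

def add_to_index(index, keyword, url):
    # Append url to the keyword's entry, creating the entry if needed.
    for entry in index:
        if entry[0] == keyword:
            if url not in entry[1]:
                entry[1].append(url)
            return
    index.append([keyword, [url]])

def add_page_to_index(index, url, content):
    # Add every word that appears in content to the index under this url.
    for word in content.split():
        add_to_index(index, word, url)

def lookup(index, keyword):
    # Return the list of urls where keyword was found.
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)                # fetch the page source once
            add_page_to_index(index, page, content) # the missing statement
            union(tocrawl, get_all_links(content))
            crawled.append(page)
    return index
```

With the sample web above, `lookup(crawl_web('http://a.example'), 'hello')` finds the word on both pages, which is the behavior the finished crawler should have: an index of all the content reachable from the seed.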