-
Title:
Finishing the Web Crawler - Intro to Computer Science
-
Description:
-
So let's remember the code we had at the end of unit 2 for crawling the web.
-
So we used 2 variables. We initialized "tocrawl" to the seed, a list containing just the seed,
-
and we're going to use "tocrawl" to keep track of the pages to crawl.
-
We initialized "crawled" to the empty list, and we're keeping track of the pages we found using "crawled."
-
Then we had a loop that would continue as long as there were pages left to crawl.
-
We'd pop the last page off the "tocrawl" list.
-
If it's not already crawled, then we'll union into "tocrawl" all the links that we can find on that page,
-
and then we'll add that page to the list of pages we've already crawled.
-
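For reference, here is roughly what that unit 2 procedure looked like, assuming the helper procedures "get_page", "get_all_links", and "union" from earlier in the course ("union" adds to its first list any elements of the second list that are not already there):

```python
def union(p, q):
    # add to p every element of q that is not already in p
    for e in q:
        if e not in p:
            p.append(e)

def crawl_web(seed):
    tocrawl = [seed]    # pages we still need to crawl
    crawled = []        # pages we have already crawled
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # get_page fetches the page content; get_all_links
            # extracts the URLs that appear in that content
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled
```

-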
So now we want to figure out how to change this so instead of just finding all the URLs,
-
we're building up our index.
-
We're looking at the actual content of the pages, and we're adding it to our index.
-
So the first change to make: since we're building up an index, we're going to change the return result.
-
So instead of returning "crawled" what we want to return at the end is the index.
-
If we wanted to keep track of all the URLs crawled, we could still return "crawled," returning both "crawled" and "index,"
-
but let's keep things simple and just return "index."
-
That's what we really want for being able to respond to search queries.
-
So now we have one other change to make, and this is the important one.
-
We need to find a way to update the index to reflect all the words that are found on the page that we've just crawled.
-
I'm going to make one change before we do that.
-
Since both "get_all_links" and what we need to do to add the words to the index depend on the content of the page,
-
let's introduce a new variable and store the content of the page in that variable.
-
This will save us from having to call "get_page" twice; "get_page" is fairly expensive.
-
It requires a web request to get the content of the page.
-
It makes a lot more sense to store that in a new variable, and that will simplify this code.
-
So now we just need to pass in content.
-
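Putting those changes together, the procedure now looks something like this sketch: the index starts out as an empty list (as in the earlier indexing lessons), the page content is fetched once into "content," and what we return is the index rather than the crawled list.

```python
def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []            # new: the index we are building up
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)    # fetch the page content once
            # <missing statement: add the words in content to index>
            union(tocrawl, get_all_links(content))
            crawled.append(page)
    return index          # changed: was "return crawled"
```

-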
So we have one missing statement,
-
and I'll leave it to you to see if you can figure out what statement we need there to finish the web crawler.
-
When it's done, the result of "crawl_web," what we return as "index"
-
should be an index of all the content we find starting from the seed.
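-
One way to fill in that missing statement is sketched below, assuming a procedure along the lines of "add_page_to_index(index, url, content)" like the one built in the earlier indexing lessons (the exact name is an assumption, since it isn't spelled out here). The returned index is then what a lookup procedure would use to respond to search queries.

```python
def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            # the missing statement: record every word on this page in the index
            add_page_to_index(index, page, content)
            union(tocrawl, get_all_links(content))
            crawled.append(page)
    return index
```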