English subtitles

← Urank - Intro to Computer Science

Get Embed Code
4 Languages

Showing Revision 4 created 05/25/2016 by Udacity Robot.

  1. So the goal for the rest of this unit
  2. is to modify our search engine code to implement
  3. the PageRank algorithm. We have one little problem. PageRank
  4. is a registered trademark of Google. So we're not going to
  5. call our algorithm PageRank, even though it will do
  6. the same thing. We'll call it URank. The first thing
  7. we need to be able to do, to implement
  8. this ranking algorithm, is keep track of the link graph.
  9. So our popularity of pages depends on the link structure. So that means, we need
  10. to keep track of, what pages link to what pages. So, for each link, there is
  11. a connection between pages and we can think
  12. of that, as a graph. Abstractly, a graph
  13. is just a set of nodes, we'll draw
  14. those as circles, with edges between the nodes.
  15. And because our edges go one way, just like
  16. links in a page, we call this a directed
  17. graph. So, in order to represent our web link
  18. structure we need to build the directed graph. The pages
  19. in the graph are the nodes. For each page,
  20. we need to keep track of the edges that connect
  21. that node to other nodes. And so the way
  22. that we're going to do this is to keep a dictionary.
  23. So we're going to have a dictionary where the entries
  24. in the dictionary are the node, which is the URL,
  25. that's the page. And for each URL, we'll have a
  26. list of all the pages that it links to. So
  27. if this was say node a, and these were nodes
  28. b, c, and d, our entry for node a would
  29. contain the list b, c, d. And our entry for
  30. node b, well, there are no edges out of b,
  31. so it would be an empty list. And finishing the example, C has an out link to
  32. one node, and D has no out links. So
  33. that's our goal. We want to build a structure
  34. like this that shows the structure of the
  35. webpages that we crawl, and we see that structure
  36. because we're following the links in our crawler. So
  37. our goal is to modify the crawl web procedure
  38. that we defined at the end of Unit Five. And to
  39. modify it so that instead of just producing an index, it
  40. also produces a graph. So we're going to modify crawl web. It's
  41. going to still take a seed page as its start. But what
  42. it's going to produce now is both an index and a
  43. graph. And the graph is a structure that gives a mapping
  44. from each node to the pages that it links to. So
  45. let's look at the code that we had at the end of
  46. Unit Five and see how we need to change that. So
  47. here's the code, that we had at the end of unit five,
  48. for crawling the web. And as a reminder, we're keeping track
  49. of the pages left to crawl in the list tocrawl starting with
  50. the seed page, and we're building up the index as a
  51. dictionary. And as long as there are pages left to crawl, we
  52. go through the y loop, which finds a page to crawl,
  53. popping the list of pages to crawl. As long as this one,
  54. we haven't crawled before, it gets the content from that
  55. page. It adds it to the index. It finds all the
  56. links, using get_all_links, passing in the content on the page and
  57. unions those with tocrawl to update the tocrawl list and then
  58. it appends this page to the list of pages, that have
  59. already been crawled. So to change this to build a graph,
  60. we're going to keep most of the code the same. In addition
  61. to producing just index, we're going to produce a graph. And the
  62. graph is also going to be a dictionary. And the reason the
  63. graph is a dictionary is that the mapping from nodes, which
  64. are urls, to the list of edges that go out from
  65. that node. So we'll create the graph as an empty dictionary.
  66. And as we find new pages, we're going to add them to
  67. the graph. And we're going to also change the return to return
  68. both the index and the graph. I'm going to make one more
  69. change before we give you a quiz. And the change I'm
  70. going to make is, instead of calling get_ all_links here,
  71. Since both the graph building and the tocrawl list depend
  72. on knowing all the links, we're going to create a
  73. new variable. And we'll assign the result of get_all_links(content) to
  74. that variable. That means we can use those links
  75. as the input to content. But we can also use
  76. them to build the graph. And I'm going to leave
  77. the line that we need to build a graph for
  78. you to complete. So we'll make that a quiz, to finish
  79. this code. Write the line that we need to update the graph.