1 00:00:00,000 --> 00:00:02,018 [Sebastian Thrun] So what's your take on how to build a search engine, 2 00:00:02,018 --> 00:00:03,077 you've build one before, right? 3 00:00:03,077 --> 00:00:06,008 [Sergey Brin - Co-Founder, Google] Yes. I think the most important thing 4 00:00:06,008 --> 00:00:08,013 if you're going to build a search engine 5 00:00:08,013 --> 00:00:12,051 is to have a really good corpus to start out with. 6 00:00:12,051 --> 00:00:19,020 In our case we used the world wide web, which at time was certainly smaller than it is today. 7 00:00:19,020 --> 00:00:21,036 But it was also very new and exciting. 8 00:00:21,036 --> 00:00:23,081 There were all sorts of unexpected things there. 9 00:00:23,081 --> 00:00:26,099 [David Evans] So the goal for the first three units for the course is to build that corpus. 10 00:00:27,003 --> 00:00:30,009 And we want to build the corpus for our search engine 11 00:00:30,009 --> 00:00:32,090 by crawling the web and that's what a web crawler does. 12 00:00:32,090 --> 00:00:36,038 What a web crawler is, it's a program that collects content from the web. 13 00:00:36,038 --> 00:00:40,054 If you think of a web page that you see in your browser, you have a page like this. 14 00:00:40,054 --> 00:00:43,099 And we'll use the udacity site as an example web page. 15 00:00:43,099 --> 00:00:47,097 It has lot's of content, it has some images, it has some text. 16 00:00:47,097 --> 00:00:51,038 All of this comes into your browser when you request the page. 17 00:00:51,038 --> 00:00:53,066 The important thing that it has is links. 18 00:00:53,066 --> 00:00:57,093 And what a link is, is something that goes to another page. 19 00:00:57,093 --> 00:01:00,050 So we have a link to the frequently asked questions, 20 00:01:00,050 --> 00:01:02,046 we have a link to CS 101 page. 21 00:01:02,046 --> 00:01:04,043 There's some other links on the page. 22 00:01:04,043 --> 00:01:07,054 And that link may show in you browser with an underscore, 23 00:01:07,054 --> 00:01:09,094 it may not, depending on how your browser is set. 24 00:01:09,094 --> 00:01:11,095 But the important thing that it does, 25 00:01:11,095 --> 00:01:13,088 is it's a pointer to some other web page. 26 00:01:13,088 --> 00:01:16,043 And those other web pages may also have links 27 00:01:16,043 --> 00:01:19,073 so we have another link on this page. 28 00:01:19,073 --> 00:01:23,052 Maybe it's to my name, you can follow to my home page. 29 00:01:23,052 --> 00:01:26,091 And all the pages that we can find with our web crawler 30 00:01:26,091 --> 00:01:29,009 are found by following the links. 31 00:01:29,009 --> 00:01:31,067 So it won't necessarily find every page on the web 32 00:01:31,067 --> 00:01:33,059 If we start with a good seed page 33 00:01:33,059 --> 00:01:35,003 we'll find lot's of pages, though. 34 00:01:35,003 --> 00:01:37,050 And what the crawler's gonna do is start with one page, 35 00:01:37,050 --> 00:01:41,056 find all the links on that page, follow them to find other pages 36 00:01:41,056 --> 00:01:45,013 and then on those other pages it will follow the links on those pages 37 00:01:45,013 --> 00:01:48,031 to find other pages and there will be lot's more links on those pages. 38 00:01:48,031 --> 00:01:51,043 And eventually we'll have a collection of lot's of pages on the web. 39 00:01:51,043 --> 00:01:54,007 So that's what we want to do to build a web crawler. 40 00:01:54,007 --> 00:01:56,095 We want to find some way to start from one seed page, 41 00:01:56,095 --> 00:01:59,056 extract the links on that page, 42 00:01:59,056 --> 00:02:01,078 follow those links to other pages, 43 00:02:01,078 --> 00:02:03,067 then collect the links on those other pages, 44 00:02:03,067 --> 00:02:05,024 follow them, collect all that. 45 00:02:05,024 --> 00:02:07,038 So that sounds like a lot to do. 46 00:02:07,038 --> 00:02:09,014 We're not going to all that this first class. 47 00:02:09,014 --> 00:02:12,072 What we're going to do this first unit, is just extract a link. 48 00:02:12,072 --> 00:02:14,058 So we're going to start with a bunch of text. 49 00:02:14,058 --> 00:02:17,033 It's going to have a link in it with a URL. 50 00:02:17,033 --> 00:02:19,064 What we want to find is that URL, 51 00:02:19,064 --> 00:02:21,089 so we can request the next page. 52 00:02:21,089 --> 00:02:23,082 The goal for the second unit 53 00:02:23,082 --> 00:02:25,016 is be able to keep going. 54 00:02:25,016 --> 00:02:28,049 if there's many links on one page, you will want to be able to find them all. 55 00:02:28,049 --> 00:02:30,014 So that's what we'll do in unit 2, 56 00:02:30,014 --> 00:02:32,069 is to figure out how to keep going to extract all those links. 57 00:02:32,069 --> 00:02:36,061 In unit three, well, we want to go beyond just one page. 58 00:02:36,061 --> 00:02:40,033 So by the end of unit two we can print out all the links on one page. 59 00:02:40,033 --> 00:02:44,002 For unit 3 we want to collect all those links, so we can keep going, 60 00:02:44,002 --> 00:02:47,018 end up following our crawler to collect many, many pages. 61 00:02:47,018 --> 00:02:50,013 So by the end of unit three we'll have built a web crawler. 62 00:02:50,013 --> 00:02:52,033 We'll have a way of building our corpus. 63 00:02:52,033 --> 00:02:57,079 Then the remaining three units will look at how to actually respond to queries. 64 00:02:57,079 --> 00:03:01,034 So in unit four we'll figure out how to give a good response. 65 00:03:01,034 --> 00:03:08,022 So if you search for a keyword, you want to get a response that's a list of the pages 66 00:03:08,022 --> 00:03:10,063 where that keyword appears. 67 00:03:10,063 --> 00:03:15,090 And we'll figure out in unit five a way to do that, that scales, if we have a large corpus. 68 00:03:15,090 --> 00:03:19,083 And then in unit six what we want to do is, well, we don't just want to find a list, 69 00:03:19,083 --> 00:03:21,069 we want to find the best one. 70 00:03:21,069 --> 00:03:24,084 So we'll figure out how to rank all the pages where that keyword appears. 71 00:03:24,084 --> 00:03:27,068 So we're getting a little ahead of ourselves now, 72 00:03:27,068 --> 00:03:30,035 because all we're going to do for unit one, 73 00:03:30,035 --> 00:03:32,064 is to figure out how to extract a link from the page. 74 00:03:32,064 --> 00:03:35,073 And the search engine that we'll build at the end of this 75 00:03:35,073 --> 00:03:37,034 will be a functional search engine. 76 00:03:37,034 --> 00:03:40,061 It will have the main components that a search engine like Google has. 77 00:03:40,061 --> 00:03:43,014 It certainly won't be as powerful as Google will be, 78 00:03:43,014 --> 00:03:44,029 we want to keep things simple. 79 00:03:44,029 --> 00:03:46,060 We want to have a small amount of code to write. 80 00:03:46,060 --> 00:03:48,010 And we should remember that our real goal 81 00:03:48,010 --> 00:03:50,024 is not as much to build a search engine, 82 00:03:50,024 --> 00:03:52,078 but to use the goal of building a search engine as a vehicle 83 00:03:52,078 --> 00:03:55,018 for learning about computer science 84 00:03:55,018 --> 00:03:56,075 and learning about programming 85 00:03:56,075 --> 00:03:58,018 so the things we learn by doing this 86 00:03:58,018 --> 99:59:59,999 will allow us to solve lot's and lot's of other problems.