0:00:00.000,0:00:02.018 [Sebastian Thrun] So what's your take on how to build a search engine, 0:00:02.018,0:00:03.077 you've build one before, right? 0:00:03.077,0:00:06.008 [Sergey Brin - Co-Founder, Google] Yes. I think the most important thing 0:00:06.008,0:00:08.013 if you're going to build a search engine 0:00:08.013,0:00:12.051 is to have a really good corpus to start out with. 0:00:12.051,0:00:19.020 In our case we used the world wide web, which at time was certainly smaller than it is today. 0:00:19.020,0:00:21.036 But it was also very new and exciting. 0:00:21.036,0:00:23.081 There were all sorts of unexpected things there. 0:00:23.081,0:00:26.099 [David Evans] So the goal for the first three units for the course is to build that corpus. 0:00:27.003,0:00:30.009 And we want to build the corpus for our search engine 0:00:30.009,0:00:32.090 by crawling the web and that's what a web crawler does. 0:00:32.090,0:00:36.038 What a web crawler is, it's a program that collects content from the web. 0:00:36.038,0:00:40.054 If you think of a web page that you see in your browser, you have a page like this. 0:00:40.054,0:00:43.099 And we'll use the udacity site as an example web page. 0:00:43.099,0:00:47.097 It has lot's of content, it has some images, it has some text. 0:00:47.097,0:00:51.038 All of this comes into your browser when you request the page. 0:00:51.038,0:00:53.066 The important thing that it has is links. 0:00:53.066,0:00:57.093 And what a link is, is something that goes to another page. 0:00:57.093,0:01:00.050 So we have a link to the frequently asked questions, 0:01:00.050,0:01:02.046 we have a link to CS 101 page. 0:01:02.046,0:01:04.043 There's some other links on the page. 0:01:04.043,0:01:07.054 And that link may show in you browser with an underscore, 0:01:07.054,0:01:09.094 it may not, depending on how your browser is set. 0:01:09.094,0:01:11.095 But the important thing that it does, 0:01:11.095,0:01:13.088 is it's a pointer to some other web page. 0:01:13.088,0:01:16.043 And those other web pages may also have links 0:01:16.043,0:01:19.073 so we have another link on this page. 0:01:19.073,0:01:23.052 Maybe it's to my name, you can follow to my home page. 0:01:23.052,0:01:26.091 And all the pages that we can find with our web crawler 0:01:26.091,0:01:29.009 are found by following the links. 0:01:29.009,0:01:31.067 So it won't necessarily find every page on the web 0:01:31.067,0:01:33.059 If we start with a good seed page 0:01:33.059,0:01:35.003 we'll find lot's of pages, though. 0:01:35.003,0:01:37.050 And what the crawler's gonna do is start with one page, 0:01:37.050,0:01:41.056 find all the links on that page, follow them to find other pages 0:01:41.056,0:01:45.013 and then on those other pages it will follow the links on those pages 0:01:45.013,0:01:48.031 to find other pages and there will be lot's more links on those pages. 0:01:48.031,0:01:51.043 And eventually we'll have a collection of lot's of pages on the web. 0:01:51.043,0:01:54.007 So that's what we want to do to build a web crawler. 0:01:54.007,0:01:56.095 We want to find some way to start from one seed page, 0:01:56.095,0:01:59.056 extract the links on that page, 0:01:59.056,0:02:01.078 follow those links to other pages, 0:02:01.078,0:02:03.067 then collect the links on those other pages, 0:02:03.067,0:02:05.024 follow them, collect all that. 0:02:05.024,0:02:07.038 So that sounds like a lot to do. 0:02:07.038,0:02:09.014 We're not going to all that this first class. 0:02:09.014,0:02:12.072 What we're going to do this first unit, is just extract a link. 0:02:12.072,0:02:14.058 So we're going to start with a bunch of text. 0:02:14.058,0:02:17.033 It's going to have a link in it with a URL. 0:02:17.033,0:02:19.064 What we want to find is that URL, 0:02:19.064,0:02:21.089 so we can request the next page. 0:02:21.089,0:02:23.082 The goal for the second unit 0:02:23.082,0:02:25.016 is be able to keep going. 0:02:25.016,0:02:28.049 if there's many links on one page, you will want to be able to find them all. 0:02:28.049,0:02:30.014 So that's what we'll do in unit 2, 0:02:30.014,0:02:32.069 is to figure out how to keep going to extract all those links. 0:02:32.069,0:02:36.061 In unit three, well, we want to go beyond just one page. 0:02:36.061,0:02:40.033 So by the end of unit two we can print out all the links on one page. 0:02:40.033,0:02:44.002 For unit 3 we want to collect all those links, so we can keep going, 0:02:44.002,0:02:47.018 end up following our crawler to collect many, many pages. 0:02:47.018,0:02:50.013 So by the end of unit three we'll have built a web crawler. 0:02:50.013,0:02:52.033 We'll have a way of building our corpus. 0:02:52.033,0:02:57.079 Then the remaining three units will look at how to actually respond to queries. 0:02:57.079,0:03:01.034 So in unit four we'll figure out how to give a good response. 0:03:01.034,0:03:08.022 So if you search for a keyword, you want to get a response that's a list of the pages 0:03:08.022,0:03:10.063 where that keyword appears. 0:03:10.063,0:03:15.090 And we'll figure out in unit five a way to do that, that scales, if we have a large corpus. 0:03:15.090,0:03:19.083 And then in unit six what we want to do is, well, we don't just want to find a list, 0:03:19.083,0:03:21.069 we want to find the best one. 0:03:21.069,0:03:24.084 So we'll figure out how to rank all the pages where that keyword appears. 0:03:24.084,0:03:27.068 So we're getting a little ahead of ourselves now, 0:03:27.068,0:03:30.035 because all we're going to do for unit one, 0:03:30.035,0:03:32.064 is to figure out how to extract a link from the page. 0:03:32.064,0:03:35.073 And the search engine that we'll build at the end of this 0:03:35.073,0:03:37.034 will be a functional search engine. 0:03:37.034,0:03:40.061 It will have the main components that a search engine like Google has. 0:03:40.061,0:03:43.014 It certainly won't be as powerful as Google will be, 0:03:43.014,0:03:44.029 we want to keep things simple. 0:03:44.029,0:03:46.060 We want to have a small amount of code to write. 0:03:46.060,0:03:48.010 And we should remember that our real goal 0:03:48.010,0:03:50.024 is not as much to build a search engine, 0:03:50.024,0:03:52.078 but to use the goal of building a search engine as a vehicle 0:03:52.078,0:03:55.018 for learning about computer science 0:03:55.018,0:03:56.075 and learning about programming 0:03:56.075,0:03:58.018 so the things we learn by doing this 0:03:58.018,9:59:59.000 will allow us to solve lot's and lot's of other problems.