WEBVTT 00:00:00.000 --> 00:00:02.018 [Sebastian Thrun] So what's your take on how to build a search engine, 00:00:02.018 --> 00:00:03.077 you've build one before, right? 00:00:03.077 --> 00:00:06.008 [Sergey Brin - Co-Founder, Google] Yes. I think the most important thing 00:00:06.008 --> 00:00:08.013 if you're going to build a search engine 00:00:08.013 --> 00:00:12.051 is to have a really good corpus to start out with. 00:00:12.051 --> 00:00:19.020 In our case we used the world wide web, which at time was certainly smaller than it is today. 00:00:19.020 --> 00:00:21.036 But it was also very new and exciting. 00:00:21.036 --> 00:00:23.081 There were all sorts of unexpected things there. 00:00:23.081 --> 00:00:26.099 [David Evans] So the goal for the first three units for the course is to build that corpus. 00:00:27.003 --> 00:00:30.009 And we want to build the corpus for our search engine 00:00:30.009 --> 00:00:32.090 by crawling the web and that's what a web crawler does. 00:00:32.090 --> 00:00:36.038 What a web crawler is, it's a program that collects content from the web. 00:00:36.038 --> 00:00:40.054 If you think of a web page that you see in your browser, you have a page like this. 00:00:40.054 --> 00:00:43.099 And we'll use the udacity site as an example web page. 00:00:43.099 --> 00:00:47.097 It has lot's of content, it has some images, it has some text. 00:00:47.097 --> 00:00:51.038 All of this comes into your browser when you request the page. 00:00:51.038 --> 00:00:53.066 The important thing that it has is links. 00:00:53.066 --> 00:00:57.093 And what a link is, is something that goes to another page. 00:00:57.093 --> 00:01:00.050 So we have a link to the frequently asked questions, 00:01:00.050 --> 00:01:02.046 we have a link to CS 101 page. 00:01:02.046 --> 00:01:04.043 There's some other links on the page. 00:01:04.043 --> 00:01:07.054 And that link may show in you browser with an underscore, 00:01:07.054 --> 00:01:09.094 it may not, depending on how your browser is set. 00:01:09.094 --> 00:01:11.095 But the important thing that it does, 00:01:11.095 --> 00:01:13.088 is it's a pointer to some other web page. 00:01:13.088 --> 00:01:16.043 And those other web pages may also have links 00:01:16.043 --> 00:01:19.073 so we have another link on this page. 00:01:19.073 --> 00:01:23.052 Maybe it's to my name, you can follow to my home page. 00:01:23.052 --> 00:01:26.091 And all the pages that we can find with our web crawler 00:01:26.091 --> 00:01:29.009 are found by following the links. 00:01:29.009 --> 00:01:31.067 So it won't necessarily find every page on the web 00:01:31.067 --> 00:01:33.059 If we start with a good seed page 00:01:33.059 --> 00:01:35.003 we'll find lot's of pages, though. 00:01:35.003 --> 00:01:37.050 And what the crawler's gonna do is start with one page, 00:01:37.050 --> 00:01:41.056 find all the links on that page, follow them to find other pages 00:01:41.056 --> 00:01:45.013 and then on those other pages it will follow the links on those pages 00:01:45.013 --> 00:01:48.031 to find other pages and there will be lot's more links on those pages. 00:01:48.031 --> 00:01:51.043 And eventually we'll have a collection of lot's of pages on the web. 00:01:51.043 --> 00:01:54.007 So that's what we want to do to build a web crawler. 00:01:54.007 --> 00:01:56.095 We want to find some way to start from one seed page, 00:01:56.095 --> 00:01:59.056 extract the links on that page, 00:01:59.056 --> 00:02:01.078 follow those links to other pages, 00:02:01.078 --> 00:02:03.067 then collect the links on those other pages, 00:02:03.067 --> 00:02:05.024 follow them, collect all that. 00:02:05.024 --> 00:02:07.038 So that sounds like a lot to do. 00:02:07.038 --> 00:02:09.014 We're not going to all that this first class. 00:02:09.014 --> 00:02:12.072 What we're going to do this first unit, is just extract a link. 00:02:12.072 --> 00:02:14.058 So we're going to start with a bunch of text. 00:02:14.058 --> 00:02:17.033 It's going to have a link in it with a URL. 00:02:17.033 --> 00:02:19.064 What we want to find is that URL, 00:02:19.064 --> 00:02:21.089 so we can request the next page. 00:02:21.089 --> 00:02:23.082 The goal for the second unit 00:02:23.082 --> 00:02:25.016 is be able to keep going. 00:02:25.016 --> 00:02:28.049 if there's many links on one page, you will want to be able to find them all. 00:02:28.049 --> 00:02:30.014 So that's what we'll do in unit 2, 00:02:30.014 --> 00:02:32.069 is to figure out how to keep going to extract all those links. 00:02:32.069 --> 00:02:36.061 In unit three, well, we want to go beyond just one page. 00:02:36.061 --> 00:02:40.033 So by the end of unit two we can print out all the links on one page. 00:02:40.033 --> 00:02:44.002 For unit 3 we want to collect all those links, so we can keep going, 00:02:44.002 --> 00:02:47.018 end up following our crawler to collect many, many pages. 00:02:47.018 --> 00:02:50.013 So by the end of unit three we'll have built a web crawler. 00:02:50.013 --> 00:02:52.033 We'll have a way of building our corpus. 00:02:52.033 --> 00:02:57.079 Then the remaining three units will look at how to actually respond to queries. 00:02:57.079 --> 00:03:01.034 So in unit four we'll figure out how to give a good response. 00:03:01.034 --> 00:03:08.022 So if you search for a keyword, you want to get a response that's a list of the pages 00:03:08.022 --> 00:03:10.063 where that keyword appears. 00:03:10.063 --> 00:03:15.090 And we'll figure out in unit five a way to do that, that scales, if we have a large corpus. 00:03:15.090 --> 00:03:19.083 And then in unit six what we want to do is, well, we don't just want to find a list, 00:03:19.083 --> 00:03:21.069 we want to find the best one. 00:03:21.069 --> 00:03:24.084 So we'll figure out how to rank all the pages where that keyword appears. 00:03:24.084 --> 00:03:27.068 So we're getting a little ahead of ourselves now, 00:03:27.068 --> 00:03:30.035 because all we're going to do for unit one, 00:03:30.035 --> 00:03:32.064 is to figure out how to extract a link from the page. 00:03:32.064 --> 00:03:35.073 And the search engine that we'll build at the end of this 00:03:35.073 --> 00:03:37.034 will be a functional search engine. 00:03:37.034 --> 00:03:40.061 It will have the main components that a search engine like Google has. 00:03:40.061 --> 00:03:43.014 It certainly won't be as powerful as Google will be, 00:03:43.014 --> 00:03:44.029 we want to keep things simple. 00:03:44.029 --> 00:03:46.060 We want to have a small amount of code to write. 00:03:46.060 --> 00:03:48.010 And we should remember that our real goal 00:03:48.010 --> 00:03:50.024 is not as much to build a search engine, 00:03:50.024 --> 00:03:52.078 but to use the goal of building a search engine as a vehicle 00:03:52.078 --> 00:03:55.018 for learning about computer science 00:03:55.018 --> 00:03:56.075 and learning about programming 00:03:56.075 --> 00:03:58.018 so the things we learn by doing this 00:03:58.018 --> 99:59:59.999 will allow us to solve lot's and lot's of other problems.