WEBVTT

00:00:00.000 --> 00:00:02.018
[Sebastian Thrun] So what's your take on how to build a search engine,

00:00:02.018 --> 00:00:03.077
you've built one before, right?

00:00:03.077 --> 00:00:06.008
[Sergey Brin - Co-Founder, Google] Yes. I think the most important thing

00:00:06.008 --> 00:00:08.013
if you're going to build a search engine

00:00:08.013 --> 00:00:12.051
is to have a really good corpus to start out with.

00:00:12.051 --> 00:00:19.020
In our case we used the World Wide Web, which at the time was certainly smaller than it is today.

00:00:19.020 --> 00:00:21.036
But it was also very new and exciting.

00:00:21.036 --> 00:00:23.081
There were all sorts of unexpected things there.

00:00:23.081 --> 00:00:26.099
[David Evans] So the goal for the first three units for the course is to build that corpus.

00:00:27.003 --> 00:00:30.009
And we want to build the corpus for our search engine

00:00:30.009 --> 00:00:32.090
by crawling the web and that's what a web crawler does.

00:00:32.090 --> 00:00:36.038
What a web crawler is, it's a program that collects content from the web.

00:00:36.038 --> 00:00:40.054
If you think of a web page that you see in your browser, you have a page like this.

00:00:40.054 --> 00:00:43.099
And we'll use the Udacity site as an example web page.

00:00:43.099 --> 00:00:47.097
It has lots of content, it has some images, it has some text.

00:00:47.097 --> 00:00:51.038
All of this comes into your browser when you request the page.

00:00:51.038 --> 00:00:53.066
The important thing that it has is links.

00:00:53.066 --> 00:00:57.093
And what a link is, is something that goes to another page.

00:00:57.093 --> 00:01:00.050
So we have a link to the frequently asked questions,

00:01:00.050 --> 00:01:02.046
we have a link to the CS 101 page.

00:01:02.046 --> 00:01:04.043
There's some other links on the page.

00:01:04.043 --> 00:01:07.054
And that link may show in your browser with an underline,

00:01:07.054 --> 00:01:09.094
it may not, depending on how your browser is set.

00:01:09.094 --> 00:01:11.095
But the important thing that it does,

00:01:11.095 --> 00:01:13.088
is it's a pointer to some other web page.

00:01:13.088 --> 00:01:16.043
And those other web pages may also have links

00:01:16.043 --> 00:01:19.073
so we have another link on this page.

00:01:19.073 --> 00:01:23.052
Maybe it's to my name, you can follow to my home page.

00:01:23.052 --> 00:01:26.091
And all the pages that we can find with our web crawler

00:01:26.091 --> 00:01:29.009
are found by following the links.

00:01:29.009 --> 00:01:31.067
So it won't necessarily find every page on the web.

00:01:31.067 --> 00:01:33.059
If we start with a good seed page

00:01:33.059 --> 00:01:35.003
we'll find lots of pages, though.

00:01:35.003 --> 00:01:37.050
And what the crawler's gonna do is start with one page,

00:01:37.050 --> 00:01:41.056
find all the links on that page, follow them to find other pages

00:01:41.056 --> 00:01:45.013
and then on those other pages it will follow the links on those pages

00:01:45.013 --> 00:01:48.031
to find other pages and there will be lots more links on those pages.

00:01:48.031 --> 00:01:51.043
And eventually we'll have a collection of lots of pages on the web.

00:01:51.043 --> 00:01:54.007
So that's what we want to do to build a web crawler.

00:01:54.007 --> 00:01:56.095
We want to find some way to start from one seed page,

00:01:56.095 --> 00:01:59.056
extract the links on that page,

00:01:59.056 --> 00:02:01.078
follow those links to other pages,

00:02:01.078 --> 00:02:03.067
then collect the links on those other pages,

00:02:03.067 --> 00:02:05.024
follow them, collect all that.

00:02:05.024 --> 00:02:07.038
So that sounds like a lot to do.

00:02:07.038 --> 00:02:09.014
We're not going to do all that in this first class.

00:02:09.014 --> 00:02:12.072
What we're going to do in this first unit is just extract a link.

00:02:12.072 --> 00:02:14.058
So we're going to start with a bunch of text.

00:02:14.058 --> 00:02:17.033
It's going to have a link in it with a URL.

00:02:17.033 --> 00:02:19.064
What we want to find is that URL,

00:02:19.064 --> 00:02:21.089
so we can request the next page.

00:02:21.089 --> 00:02:23.082
The goal for the second unit

00:02:23.082 --> 00:02:25.016
is to be able to keep going.

00:02:25.016 --> 00:02:28.049
If there are many links on one page, you will want to be able to find them all.

00:02:28.049 --> 00:02:30.014
So that's what we'll do in unit two,

00:02:30.014 --> 00:02:32.069
is to figure out how to keep going to extract all those links.

00:02:32.069 --> 00:02:36.061
In unit three, well, we want to go beyond just one page.

00:02:36.061 --> 00:02:40.033
So by the end of unit two we can print out all the links on one page.

00:02:40.033 --> 00:02:44.002
For unit three we want to collect all those links, so we can keep going,

00:02:44.002 --> 00:02:47.018
and end up with our crawler collecting many, many pages.

00:02:47.018 --> 00:02:50.013
So by the end of unit three we'll have built a web crawler.

00:02:50.013 --> 00:02:52.033
We'll have a way of building our corpus.

00:02:52.033 --> 00:02:57.079
Then the remaining three units will look at how to actually respond to queries.

00:02:57.079 --> 00:03:01.034
So in unit four we'll figure out how to give a good response.

00:03:01.034 --> 00:03:08.022
So if you search for a keyword, you want to get a response that's a list of the pages

00:03:08.022 --> 00:03:10.063
where that keyword appears.

00:03:10.063 --> 00:03:15.090
And we'll figure out in unit five a way to do that, that scales, if we have a large corpus.

00:03:15.090 --> 00:03:19.083
And then in unit six what we want to do is, well, we don't just want to find a list,

00:03:19.083 --> 00:03:21.069
we want to find the best one.

00:03:21.069 --> 00:03:24.084
So we'll figure out how to rank all the pages where that keyword appears.

00:03:24.084 --> 00:03:27.068
So we're getting a little ahead of ourselves now,

00:03:27.068 --> 00:03:30.035
because all we're going to do for unit one,

00:03:30.035 --> 00:03:32.064
is to figure out how to extract a link from the page.

00:03:32.064 --> 00:03:35.073
And the search engine that we'll build at the end of this

00:03:35.073 --> 00:03:37.034
will be a functional search engine.

00:03:37.034 --> 00:03:40.061
It will have the main components that a search engine like Google has.

00:03:40.061 --> 00:03:43.014
It certainly won't be as powerful as Google will be,

00:03:43.014 --> 00:03:44.029
we want to keep things simple.

00:03:44.029 --> 00:03:46.060
We want to have a small amount of code to write.

00:03:46.060 --> 00:03:48.010
And we should remember that our real goal

00:03:48.010 --> 00:03:50.024
is not as much to build a search engine,

00:03:50.024 --> 00:03:52.078
but to use the goal of building a search engine as a vehicle

00:03:52.078 --> 00:03:55.018
for learning about computer science

00:03:55.018 --> 00:03:56.075
and learning about programming

00:03:56.075 --> 00:03:58.018
so the things we learn by doing this

00:03:58.018 --> 99:59:59.999
will allow us to solve lots and lots of other problems.