Extracting Links - Intro to Computer Science

  • 0:01 - 0:03
    So now you know enough about Python to be able
  • 0:03 - 0:05
    to solve the problem that we started with at the
  • 0:05 - 0:09
    beginning of this unit, which is the problem of extracting a
  • 0:09 - 0:14
    link from its page. Before we get to the code,
  • 0:14 - 0:16
    I want to describe a little more carefully what's going
  • 0:16 - 0:19
    on in a webpage. So we've talked about strings in
  • 0:19 - 0:22
    Python and all a web page really is, is a
  • 0:22 - 0:26
    long string. When you see a web page in your browser,
  • 0:26 - 0:29
    it doesn't look like that. So here's an example web
  • 0:29 - 0:33
    page, one of my favorite XKCD comics. And hopefully, you're
  • 0:33 - 0:37
    starting to learn enough about Python to appreciate the power
  • 0:37 - 0:39
    of Python to make you fly. Probably the rest of
  • 0:39 - 0:42
    the comic, if you haven't done anything other than using
  • 0:42 - 0:45
    Python, is a little hard to relate to for now. But
  • 0:45 - 0:48
    it's making fun of other languages where there's an awful
  • 0:48 - 0:51
    lot of work to do something simple, like we've seen here,
  • 0:51 - 0:54
    just being able to print out a string. But with
  • 0:54 - 0:57
    Python, we can fly quickly, and you're going to learn to fly
  • 0:57 - 1:00
    very quickly in this class. This doesn't look like just
  • 1:00 - 1:04
    a string. We've seen that a string is just a sequence of
  • 1:04 - 1:06
    characters. When we look at a webpage like this, well
  • 1:06 - 1:10
    we see images. We see buttons. We see some text. We
  • 1:10 - 1:11
    see things that are links, and you can see from the
  • 1:11 - 1:16
    underlines that these are all links. And the browser renders the webpage
  • 1:16 - 1:20
    in a way that looks attractive. What was actually
  • 1:20 - 1:22
    there, though, started as just a stream of text.
  • 1:22 - 1:25
    If you right-click on the webpage, one of the
  • 1:25 - 1:28
    options you see is View Page Source. When you click
  • 1:28 - 1:33
    on that, you'll see the actual source code. This is what came into the browser.
  • 1:33 - 1:38
    So, your browser sent a request; the URL is what's shown in the address bar.
  • 1:38 - 1:43
    So, it sent a request to xkcd.com/353.
  • 1:43 - 1:47
    It sent that request and this is what came back. What came back is just a stream
  • 1:47 - 1:53
    of text. We can look at that text and some of it is fairly hard to understand.
  • 1:54 - 1:57
    So what's important is the links. Here's an
  • 1:57 - 2:00
    example of a link. So, the link starts with
  • 2:00 - 2:05
    a tag like this. The language HTML uses these
  • 2:05 - 2:09
    angle brackets. And the angle bracket a href equals
  • 2:09 - 2:12
    is how we start a link. That's followed by
  • 2:12 - 2:16
    a string, which is surrounded by double quotes, similar
  • 2:16 - 2:18
    to a string in Python. So, we have a
  • 2:18 - 2:22
    double quote. Between the double quotes is a URL. The
  • 2:22 - 2:25
    URL is the way of locating content on the
  • 2:25 - 2:28
    web. So here we have the URL http colon; that
  • 2:28 - 2:31
    means it's a web request. We'll talk more in
  • 2:31 - 2:34
    a later class about what http means and the protocols
  • 2:34 - 2:37
    used to request web pages. What's important now is, that's a
  • 2:37 - 2:42
    location. If we open that in a web browser, that will give
  • 2:42 - 2:46
    us another page. What I'm looking at here is the link that
  • 2:46 - 2:51
    is underneath the text for News/Blag. If we click on that link,
  • 2:54 - 2:58
    that will take us to the page blag.xkcd.com. That
  • 2:58 - 3:01
    was the page that we saw in the link
  • 3:01 - 3:05
    here; it said blag.xkcd.com. When we clicked on the
  • 3:05 - 3:08
    link, that's where we went. So to build our crawler,
  • 3:08 - 3:11
    what we want to do for each webpage is
  • 3:11 - 3:13
    to find these links in the page. We're going to keep
  • 3:13 - 3:16
    track of those links and we're going to follow them
  • 3:16 - 3:19
    to find more content on the web. This is similar
  • 3:19 - 3:21
    to what someone would do if they're browsing. If they're
  • 3:21 - 3:24
    clicking on every link of a page, following all the links
  • 3:24 - 3:27
    they find, looking at all that content. That's a really good
  • 3:27 - 3:29
    way to waste a horrendous amount of time if you do
  • 3:29 - 3:32
    that yourself. We're going to build a web crawler that can
  • 3:32 - 3:36
    do that automatically. So our goal is to take the text
  • 3:36 - 3:41
    that came back from a web request, find a link in
  • 3:41 - 3:44
    that text, which is going to be a tag that starts
  • 3:44 - 3:48
    with a href equals and then extract from that
  • 3:48 - 3:52
    tag the URL of the webpage that it links to.
  • 3:52 - 3:53
    Those are the URLs that we're going to use in
  • 3:53 - 3:57
    our crawler to make progress. So by using what we've
  • 3:57 - 4:01
    learned about strings, and what you've learned about variables,
  • 4:01 - 4:04
    you know enough to be able to do that. What
  • 4:04 - 4:07
    we want to do is find the beginning of a tag.
  • 4:07 - 4:10
    And the beginning of a tag is this text,
  • 4:10 - 4:15
    right? We're looking for something that exactly matches the a href
  • 4:15 - 4:21
    equals part. That's what the tags were here; they all start
  • 4:21 - 4:26
    with a href equals. Not all webpages have the same structure. There are lots
  • 4:26 - 4:28
    of other ways to make a tag. The A
  • 4:28 - 4:30
    could be a capital letter for example. There could
  • 4:30 - 4:33
    be more spaces between the a and the href.
  • 4:33 - 4:35
    The double quote doesn't actually need to be there.
  • 4:35 - 4:37
    For what we do now, we're going to assume that
  • 4:37 - 4:40
    all our webpages follow the same structure that we're seeing
  • 4:40 - 4:44
    here: that each link starts with an a href
  • 4:44 - 4:48
    without any funny spaces or anything else, has an equals sign,
  • 4:48 - 4:50
    has a double quote, has the URL following that,
  • 4:50 - 4:53
    and then another double quote. So that means we're looking
  • 4:53 - 4:56
    for strings like this, we're looking to find the a
  • 4:56 - 5:00
    href; that's followed by a double quote. After the double
  • 5:00 - 5:03
    quote is the URL. This is what we actually care
  • 5:03 - 5:06
    about; we want to find the URLs on the Web page.
  • 5:06 - 5:10
    That's followed by a closing double quote and then, there's
  • 5:10 - 5:13
    more that closes the tag. And there's lots of other
  • 5:13 - 5:16
    stuff on both sides of this. But this is what
  • 5:16 - 5:17
    we want to do. We want to find the tags that are
  • 5:17 - 5:21
    links and then, within the tags that are links, we
  • 5:21 - 5:25
    want to find the URLs. So we're going to assume that we
  • 5:25 - 5:30
    start with the page contents in a variable. We'll
  • 5:30 - 5:34
    call that page, and we're not going to worry today
  • 5:34 - 5:36
    about how we got those page contents. We're going to
  • 5:36 - 5:39
    provide a function that does that. For the code
  • 5:39 - 5:42
    that you have today, we're going to assume the
  • 5:42 - 5:45
    page is already initialized. That it contains the content
  • 5:45 - 5:50
    of some web page stored as a string and our goal is to find the URL of the first
  • 5:50 - 5:54
    link in the page. That's going to involve a couple
  • 5:54 - 5:58
    steps. So what we want to do is find the start
  • 5:58 - 6:00
    of the link. We want to find where we have the
  • 6:00 - 6:04
    a href equals. We can't just look for the first string
  • 6:04 - 6:06
    we find, because there's lots of other strings on the
  • 6:06 - 6:10
    page that aren't URLs. So I think you know enough to
  • 6:10 - 6:13
    do that, so we'll make it a quiz. So your goal
  • 6:13 - 6:16
    for this quiz is to write some Python code that will
  • 6:16 - 6:18
    initialize the variable start_link to be the
  • 6:18 - 6:21
    value of the position where the first a
  • 6:21 - 6:24
    href equals, that is, the first tag that starts
  • 6:24 - 6:26
    a link, occurs in page. So you should assume
  • 6:26 - 6:28
    that page starts with the content of some
  • 6:28 - 6:30
    web page, and what we're doing is looking for
  • 6:30 - 6:32
    the place where the first a href equals
  • 6:32 - 6:35
    occurs, and that's the first link on the page.
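
One way the quiz could be approached is sketched below, under the simplifying assumption stated in the transcript: every link looks exactly like `<a href="URL">`, with no capital letters, extra spaces, or missing quotes. The sample `page` string is made up for illustration (only the blag.xkcd.com address comes from the video), and the names `start_quote`, `end_quote`, and `url` are hypothetical. The last three lines go one step past what the quiz asks for, to show how the URL itself might then be pulled out.

```python
# Made-up sample standing in for the contents of a real web page.
page = ('<div id="menu"><ul>'
        '<li><a href="http://blag.xkcd.com">News/Blag</a></li>'
        '<li><a href="http://example.com/other">Other</a></li>'
        '</ul></div>')

# The quiz: position where the first '<a href=' occurs in page.
# str.find returns -1 if the text is not found at all.
start_link = page.find('<a href=')

# Possible next step (not part of the quiz): under the assumed structure,
# the URL sits between the first two double quotes after start_link.
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]

print(start_link)
print(url)
```

Searching for the double quote starting from `start_link`, rather than from the beginning of the page, matters here: the sample page has an earlier quoted string (`id="menu"`) that is not a URL, which is the situation the transcript warns about.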
Video Language:
English
Team:
Udacity
Project:
CS101 - Intro to Computer Science
Duration:
06:36