Why Upweight Rare Words - Intro to Machine Learning

  • 0:00 - 0:05
    So, let's suppose you got a bunch of emails. Some of them came from me, and
  • 0:05 - 0:06
    some of them came from Sebastian.
  • 0:06 - 0:08
    We might have a lot of overlap.
  • 0:08 - 0:11
    We might both be talking about machine learning and Udacity.
  • 0:11 - 0:13
    But some of them are going to be talking about, let's say, physics,
  • 0:13 - 0:16
    because that's what my background is in.
  • 0:16 - 0:17
    And that's going to be comparatively rare.
  • 0:17 - 0:18
    There's not going to be tons and
  • 0:18 - 0:22
    tons of emails in that corpus that are talking about physics,
  • 0:22 - 0:26
    because almost all of Sebastian's don't talk about physics at all.
  • 0:26 - 0:27
    It's only going to be mine.
  • 0:27 - 0:28
    And then likewise,
  • 0:28 - 0:31
    maybe there's a bunch of emails that talk about Stanley the robot,
  • 0:31 - 0:33
    which of course is one of his projects, but
  • 0:33 - 0:35
    not something that I'm a real expert in.
  • 0:35 - 0:40
    So, the fact that words like physics and Stanley would be rare in this corpus,
  • 0:40 - 0:44
    compared to words like Udacity or machine learning,
  • 0:44 - 0:47
    means that these might be the words that tell you the most important
  • 0:47 - 0:49
    information about what's going on:
  • 0:49 - 0:51
    who might be the author of a given message.
  • 0:51 - 0:54
    And so, another way to think about that is that this is why it's called
  • 0:54 - 0:56
    the inverse document frequency:
  • 0:56 - 1:00
    you want to weight the words by the inverse of how often they
  • 1:00 - 1:02
    appear in the corpus as a whole.
  • 1:02 - 1:07
    I'm not going to have you code an example of TF-IDF right now as a quiz.
  • 1:07 - 1:09
    But this is something that we're going to cover in
  • 1:09 - 1:12
    the mini project that's coming up very shortly, at the end of this lesson.
  • 1:12 - 1:15
    So, you will get your hands dirty with this representation a little bit.
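
The weighting idea described above can be sketched in a few lines of Python. This is only an illustration, not the mini project's code: the four toy emails are made up, the helper name `inverse_document_frequency` is hypothetical, and the plain formula log(N / df) is assumed here (libraries often use smoothed variants of it).

```python
import math
from collections import Counter

# Hypothetical toy corpus: four made-up emails, not taken from the lesson.
emails = [
    "udacity machine learning physics",
    "udacity machine learning",
    "udacity machine learning stanley",
    "udacity machine learning robot stanley",
]


def inverse_document_frequency(docs):
    """Return {word: log(N / df)}, where df counts the documents containing the word."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Use a set so a word counts at most once per document.
        doc_freq.update(set(doc.split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


idf = inverse_document_frequency(emails)

# Print words from lowest weight (common) to highest weight (rare).
for word, weight in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

In this toy corpus, "udacity" appears in every email and so gets a weight of zero, while "physics" appears in only one and gets the largest weight, which matches the intuition that rare words say the most about who wrote a message.
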
Video Language:
English
Team:
Udacity
Project:
ud120 - Intro to Machine Learning
Duration:
01:16