
Stemming to Consolidate Vocabulary - Intro to Machine Learning

  • 0:00 - 0:03
    There's another handy trick that I'm going to teach you now, and
  • 0:03 - 0:07
    it has to do with the idea that not all unique words are actually different, or
  • 0:07 - 0:09
    not very different anyway.
  • 0:09 - 0:12
    Let me show you an example of what I mean.
  • 0:12 - 0:15
    Say in my corpus I have a bunch of different versions of the word respond,
  • 0:15 - 0:19
    where the meaning changes ever so slightly based on the context or
  • 0:19 - 0:22
    based on the part of speech that the word is, but they're all talking about
  • 0:22 - 0:27
    basically the same idea, the idea of someone or something responding.
  • 0:27 - 0:31
    The idea is that if I naively put these into a bag of words,
  • 0:31 - 0:33
    they're all going to show up as different features,
  • 0:33 - 0:36
    even though they're all getting at roughly the same idea.
  • 0:36 - 0:39
    And this is going to be true for many words in a lot of languages, that they
  • 0:39 - 0:44
    have lots of different permutations that mean only slightly different things.
  • 0:44 - 0:46
    Luckily for us, there's a way of, sort of,
  • 0:46 - 0:50
    bundling these up together, and representing them as a single word, and
  • 0:50 - 0:53
    the way that that happens is using an algorithm called a stemmer.
  • 0:53 - 0:56
    So if I were to wrap up all these words and put them into a stemmer,
  • 0:56 - 1:00
    it would then apply a function to them that would strip them all down
  • 1:00 - 1:05
    until they have the same sort of root, which might be something like respon.
  • 1:05 - 1:10
    So the idea is not necessarily to make a single word out of this, because,
  • 1:10 - 1:15
    of course, respon isn't a word, but it's kind of the root of a word, or the stem
  • 1:15 - 1:21
    of a word that can then be used in any of our classifiers or our regressions.
  • 1:21 - 1:24
    And we've now taken this five-dimensional input space, and
  • 1:24 - 1:27
    turned it into one dimension without losing any real information.
  • 1:27 - 1:32
    Stemming functions can actually be kind of tricky to implement yourself.
  • 1:32 - 1:36
    There are professional linguists and computational linguists who build these
  • 1:36 - 1:41
    stemming functions to best figure out what the stem of a given word is.
  • 1:41 - 1:45
    And so, usually what we do in machine learning is we take one of
  • 1:45 - 1:49
    these stemmers off the shelf from something like NLTK, or
  • 1:49 - 1:53
    some other similar text-processing package, and we just make use of it,
  • 1:53 - 1:56
    not necessarily always going into the guts of how it works.
  • 1:56 - 1:58
    And then once we've applied the stemming,
  • 1:58 - 2:02
    of course, we have a much cleaner body of vocabulary that we can work with.
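
To make the idea above concrete, here is a minimal sketch in Python. It is not the stemmer used in the video and not a real linguistic algorithm: `toy_stem`, its hand-picked `SUFFIXES` list, and the five example words are all assumptions chosen for illustration, and the suffix list only works for this one word family. It shows five different bag-of-words features collapsing into a single `respon` feature.

```python
from collections import Counter

# Hypothetical, hand-picked suffixes (longest first). A real stemmer uses
# carefully designed linguistic rules, not a short list like this one.
SUFFIXES = ("siveness", "sivity", "sive", "ding", "ded", "ds", "se", "d")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least four characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


# Assumed example corpus: five variants of "respond".
tokens = ["response", "responsive", "responsiveness", "respond", "responding"]

# Naive bag of words: every variant becomes its own feature.
raw_counts = Counter(tokens)

# Bag of words after stemming: all variants share one feature.
stemmed_counts = Counter(toy_stem(token) for token in tokens)

print(len(raw_counts), "features before stemming")
print(dict(stemmed_counts), "after stemming")
```

In practice, as the video says, you would take a stemmer off the shelf rather than write one. With NLTK installed, for example, `nltk.stem.snowball.SnowballStemmer("english")` provides a `stem(word)` method that could stand in for `toy_stem` here, though the stems it produces may differ from the ones in this sketch.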
Title:
Stemming to Consolidate Vocabulary - Intro to Machine Learning
Video Language:
English
Team:
Udacity
Project:
ud120 - Intro to Machine Learning
Duration:
02:03

English subtitles
