TfIdf Feature Selection Solution - Intro to Machine Learning

  • 0:00 - 0:05
    This max_df argument will actually shrink down the size of my vocabulary.
  • 0:06 - 0:09
    And it does so based on the number of documents that
  • 0:09 - 0:11
    a particular word occurs in.
  • 0:11 - 0:14
    So, if there's a word that occurs in more than 50% of the documents,
  • 0:14 - 0:18
    this argument says, don't use it in the tfidf,
  • 0:18 - 0:21
    because it probably doesn't contain a lot of information,
  • 0:21 - 0:23
    since it's so common.
  • 0:23 - 0:25
    So, this is an example of another place where you could do
  • 0:25 - 0:30
    some feature reduction, some dimensionality reduction, as we also call it.
  • 0:30 - 0:30
    But of course,
  • 0:30 - 0:34
    you always have your old standby of doing something like SelectPercentile.
  • 0:35 - 0:38
    So, I hope what you found in that coding exercise underscores this point that
  • 0:38 - 0:43
    we're talking about right now, that features are not the same as information.
  • 0:43 - 0:46
    You just got rid of 90% of your text features, but
  • 0:46 - 0:50
    your classifier accuracy basically didn't suffer at all.
  • 0:50 - 0:53
    And in fact, in some ways the performance improved because it's able to run so
  • 0:53 - 0:56
    much more quickly on the smaller number of features.
  • 0:56 - 0:57
    So, this, obviously,
  • 0:57 - 0:59
    is going to be something that you want to keep in mind,
  • 0:59 - 1:02
    especially when you're working with very high-dimensionality data,
  • 1:02 - 1:04
    data that has lots and lots of features.
  • 1:04 - 1:07
    You want to be skeptical of all of those features and think,
  • 1:07 - 1:10
    which are the ones that are really going to get me the most bang for my buck?
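
The max_df behavior described at the start of the transcript can be sketched with scikit-learn's TfidfVectorizer. This is a minimal illustration, not the course's exercise code: the four-sentence corpus and the variable names are made up, and max_df=0.5 is the 50% threshold mentioned in the video.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration: "the" occurs in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang at dawn",
    "the fish swam upstream",
]

# Baseline vectorizer: keeps every token it finds.
full = TfidfVectorizer()
full.fit(docs)

# max_df=0.5: ignore any word that occurs in more than 50% of the documents.
trimmed = TfidfVectorizer(max_df=0.5)
trimmed.fit(docs)

full_vocab = set(full.vocabulary_)
trimmed_vocab = set(trimmed.vocabulary_)

# Words that were present in the full vocabulary but filtered out by max_df.
dropped = sorted(full_vocab - trimmed_vocab)

print("full vocabulary size:   ", len(full_vocab))
print("trimmed vocabulary size:", len(trimmed_vocab))
print("dropped by max_df:      ", dropped)
```

In this toy corpus "cat" occurs in exactly half of the documents and is kept, because the cutoff only removes words whose document frequency is strictly above the threshold.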
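
The "get rid of 90% of your text features" step from the transcript can be sketched with SelectPercentile. This is a hedged example under stated assumptions: the six toy documents, their labels, the f_classif scoring function, and percentile=10 are choices made here for illustration, and the transcript does not state which settings the exercise used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# Toy labeled corpus, invented for illustration (0 = office, 1 = sports).
docs = [
    "meeting schedule budget report quarterly",
    "budget forecast report finance review",
    "quarterly finance meeting agenda budget",
    "soccer match goal referee stadium",
    "stadium crowd goal striker match",
    "referee whistle match penalty goal",
]
labels = [0, 0, 0, 1, 1, 1]

# Turn the text into tf-idf features, dropping words found in >50% of docs.
vectorizer = TfidfVectorizer(max_df=0.5)
X = vectorizer.fit_transform(docs)

# Keep only the top 10% of features, ranked by the ANOVA F-score
# between each feature and the class labels.
selector = SelectPercentile(f_classif, percentile=10)
X_small = selector.fit_transform(X, labels)

# Map the surviving column indices back to the words they stand for.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
kept_terms = [index_to_term[i] for i in selector.get_support(indices=True)]

print("features before selection:", X.shape[1])
print("features after selection: ", X_small.shape[1])
print("kept terms:", kept_terms)
```

A classifier trained on X_small sees the same documents with far fewer columns, which is where the speedup mentioned in the video comes from. Whether accuracy holds up has to be checked on the real data.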
Description:

11-12 TfIdf_Feature_Selection_Solution

Video Language:
English
Team:
Udacity
Project:
ud120 - Intro to Machine Learning
Duration:
01:12
