English subtitles

21-37 Spelling Correction

Showing Revision 1 created 11/28/2012 by Amara Bot.

  1. Now let's do one more example of a probabilistic problem--this time, spelling correction.
  2. That is, given a word that is possibly misspelled,
  3. how do we come up with the best correction for that word?
  4. We're going to do the same type of analysis.
  5. We're saying we're looking for the best possible correction, c*,
  6. and that's going to be the argmax over all possible corrections c to maximize
  7. the probability of that correction given the word.
  8. So that's the definition of what it means to have the best correction.
  9. Then we can start the analysis, and we can apply Bayes rule to say
  10. that's going to be equal to the probability of the word given the correction
  11. times the probability of the correction.
  12. Of course, in Bayes rule there's a factor on the bottom, but that cancels out,
  13. because it's equal for all possible corrections.
  14. So to choose the maximum, we just have to deal with these two probabilities.
  15. Now, it may seem like we made a backwards step.
  16. Here we had one probability to estimate.
  17. Now we've applied Bayes rule and now we have two probabilities we have to estimate,
  18. but the hope is that we can come up with data that can help us with this.
  19. And certainly, these unigram statistics--what's the probability of a correction?--
  20. those we can get from our document counts, so we look at our corpus.
  21. The probability of a correct word is from the data.
  22. We just look at those counts and apply whatever smoothing we decided is best.
  23. Now, the other part--what's the probability that somebody typed the word w
  24. when they meant to type the word c--that's harder.
  25. We can't observe that directly by just looking at documents that are typed,
  26. because there we only have the words as they were typed.
  27. We don't have both the intended word and the typed word,
  28. but maybe we can look at lists of spelling corrections.
  29. So this is from spelling correction data.
  30. Now that kind of data is much harder to come by.
  31. It's easy to go out and collect billions of words of regular text and do those counts,
  32. but to find spelling correction data--that's harder to do
  33. unless you're, say, already running a spelling correction service.
  34. If you're a big company that happens to run that, then it's easy to collect the data.
  35. But bootstrapping it is hard.
  36. There are, however, some sites that will give you on the order of thousands
  37. or tens of thousands of examples of misspellings, not billions or trillions.
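
The analysis in this segment can be sketched in a few lines of Python. This is only an illustrative sketch, not the lecture's own code: the toy corpus, the toy list of (typed, intended) pairs, the add-k smoothing constants, and the function names p_correction, p_word_given_correction, and best_correction are all assumptions made up for the example. It picks the candidate c that maximizes P(w | c) * P(c), leaving out the denominator P(w) because it is the same for every candidate.

```python
from collections import Counter

# Toy stand-in for a large text corpus (assumption, not real data).
CORPUS = "the cat sat on the mat the cat ate the hat".split()

# Toy stand-in for spelling correction data: (typed word, intended word).
# Real lists of this kind are much smaller than text corpora.
ERROR_PAIRS = [("teh", "the"), ("teh", "the"), ("teh", "ten"), ("cta", "cat")]

word_counts = Counter(CORPUS)
pair_counts = Counter(ERROR_PAIRS)
intended_counts = Counter(c for _, c in ERROR_PAIRS)
typed_forms = {w for w, _ in ERROR_PAIRS}


def p_correction(c, k=1.0):
    """P(c): unigram probability from corpus counts, with add-k smoothing."""
    vocab_size = len(word_counts) + 1  # one extra slot for unseen words
    return (word_counts[c] + k) / (len(CORPUS) + k * vocab_size)


def p_word_given_correction(w, c, k=0.1):
    """P(w | c): how often w was typed when c was intended, add-k smoothed."""
    n_forms = len(typed_forms) + 1  # one extra slot for unseen typed forms
    return (pair_counts[(w, c)] + k) / (intended_counts[c] + k * n_forms)


def best_correction(w, candidates):
    """c* = argmax over candidates c of P(w | c) * P(c)."""
    return max(candidates,
               key=lambda c: p_word_given_correction(w, c) * p_correction(c))


if __name__ == "__main__":
    print(best_correction("teh", ["the", "ten"]))
```

Here the candidate list is passed in by the caller; how to generate candidates for a given typed word is a separate question that this sketch does not address.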