## 21-37 Spelling Correction



Showing Revision 1 created 11/28/2012 by Amara Bot.

1. Now let's do one more example of a probabilistic problem--this time, spelling correction.
2. That is, given a word that is possibly misspelled,
3. how do we come up with the best correction for that word?
4. We're going to do the same type of analysis.
5. We're saying we're looking for the best possible correction, c*,
6. and that's going to be the argmax over all possible corrections c to maximize
7. the probability of that correction given the word.
8. So that's the definition of what it means to have the best correction.
9. Then we can start the analysis, and we can apply Bayes' rule to say
10. that's going to be equal to the probability of the word given the correction
11. times the probability of the correction.
12. Of course, in Bayes' rule there's a factor on the bottom, but that cancels out,
13. because it's equal for all possible corrections.
14. So to choose the maximum, we just have to deal with these two probabilities.
15. Now, it may seem like we made a backwards step.
16. Here we had one probability to estimate.
17. Now we've applied Bayes' rule and now we have two probabilities we have to estimate,
18. but the hope is that we can come up with data that can help us with this.
19. And certainly, these unigram statistics--what's the probability of a correction?--
20. those we can get from our document counts, so we look at our corpus.
21. The probability of a correction comes from the data.
22. We just look at those counts and apply whatever smoothing we decided is best.
23. Now, the other part--what's the probability that somebody typed the word w
24. when they meant to type the word c--that's harder.
25. We can't observe that directly by just looking at documents that are typed,
26. because there we only have the words as they were typed.
27. We don't have both the intent and the word,
28. but maybe we can look at lists of spelling corrections.
29. So this is from spelling correction data.
30. Now that kind of data is much harder to come by.
31. It's easy to go out and collect billions of words of regular text and do those counts,
32. but to find spelling correction data--that's harder to do
33. unless you're, say, already running a spelling correction service.
34. If you're a big company that happens to run that, then it's easy to collect the data.
35. But bootstrapping it is hard.
36. There are, however, some sites that will give you on the order of thousands
37. or tens of thousands of examples of misspellings, not billions or trillions.
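
The analysis in this transcript, choosing c* as the argmax over c of P(w | c) times P(c), can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the lecture's own implementation: the tiny `CORPUS` and `MISSPELLINGS` lists are made-up stand-ins for real document counts and real spelling correction data, add-one smoothing is just one possible choice for "whatever smoothing we decided is best", and all function names are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for a large collection of regular text (assumption).
CORPUS = "the cat sat on the mat the dog ate the food".split()

# Toy (typed, intended) pairs standing in for spelling correction data (assumption).
MISSPELLINGS = [
    ("teh", "the"), ("teh", "the"), ("thw", "the"),
    ("teh", "ten"), ("cta", "cat"),
]

WORD_COUNTS = Counter(CORPUS)
VOCAB = set(WORD_COUNTS) | {intended for _, intended in MISSPELLINGS}
PAIR_COUNTS = Counter(MISSPELLINGS)
INTENDED_COUNTS = Counter(intended for _, intended in MISSPELLINGS)


def p_correction(c):
    """P(c): unigram probability from corpus counts, with add-one smoothing."""
    return (WORD_COUNTS[c] + 1) / (len(CORPUS) + len(VOCAB))


def p_typed_given(w, c):
    """P(w | c): how often w was typed when c was intended, per the pair list."""
    if INTENDED_COUNTS[c] == 0:
        return 0.0
    return PAIR_COUNTS[(w, c)] / INTENDED_COUNTS[c]


def best_correction(w):
    """Return the c maximizing P(w | c) * P(c); fall back to w if no c has support."""
    scores = {c: p_typed_given(w, c) * p_correction(c) for c in VOCAB}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else w


print(best_correction("teh"))
```

With this toy data, "teh" is corrected to "the" rather than "ten": the error model alone favors "ten" (its one recorded misspelling is "teh"), but the much larger unigram probability of "the" outweighs that. The sketch also shows why sparse spelling correction data is a problem: any typed word that never appears in the pair list gets P(w | c) = 0 for every c and is returned unchanged.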