
Title:
2137 Spelling Correction

Description:

Now let's do one more example of a probabilistic problemthis time, spelling correction.

That is, given a word that is possibly misspelled,

how do we come up with the best correction for that word?

We're going to do the same type of analysis.

We're saying we're looking for the best possible correction, C*,

and that's going to be the argmax over all possible corrections c to maximize

the probability of that correction given the word.

So that's the definition of what it means to have the best correction.

Then we can start the analysis, and we can apply Bayes rule to say

that's going to be equal to the probability of the word given the correction

times the probability of the correction.

Of course, in Bayes rule there's a factor on the bottom, but that cancels out,

because it's equal for all possible corrections.

So to choose the maximum, we just have to deal with these two probabilities.

Now, it may seem like we made a backwards step.

Here we had one probability to estimate.

Now we've applied Bayes rule and now we have two probabilities we have to estimate,

but the hope is that we can come up with data that can help us with this.

And certainly, these unigram statisticswhat's the probability of a correction?

those we can get from our document counts, so we look at our corpus.

The probability of a correct word is from the data.

We just look at those counts and apply whatever smoothing we decided is best.

Now, the other partwhat's the probability that somebody typed the word w

when they meant to type to the word cthat's harder.

We can't observe that directly by just looking at documents that are typed,

because there we only have the words where we are.

We don't have the intent and the word,

but maybe we can look at lists of spelling corrections.

So this is from spelling correction data.

Now that kind of data is much harder to come by.

It's easy to go out and collect billions of words of regular text and do those counts,

but to find spelling correction datathat's harder to do

unless you're, say, already running a spelling correction service.

If you're a big company that happens to run that, then it's easy to collect the data.

But bootstrapping it is hard.

There are, however, some sites that will give you on the order of thousands

or tens of thousands of examples of misspellings, not billions or trillions.