WEBVTT

00:00:01.560 --> 00:00:06.758
Today we're gonna talk about spelling
correction. Lots of applications make use

00:00:06.758 --> 00:00:11.628
of spelling correction. For example, word
processing, almost any modern word

00:00:11.628 --> 00:00:16.630
processor will take a misspelled word like
component with an A and give you

00:00:16.630 --> 00:00:22.077
suggestions like component with an E and
automatically replace it for you. Modern

00:00:22.077 --> 00:00:28.259
search engines will not only flag an
error. So, language spelled without a u,

00:00:28.259 --> 00:00:34.936
here. But, give you, the results, as if
you had spelled the word correctly. And,

00:00:34.936 --> 00:00:40.953
modern phones additionally will
automatically find misspelled words. Here,

00:00:40.953 --> 00:00:47.135
I typed l-a-y-r, and it replaced it
automatically, or suggests a replacement,

00:00:47.135 --> 00:00:52.260
with late. We can distinguish a number of
separate tasks and spelling correction.

00:00:52.260 --> 00:00:56.921
One is the detection of the error itself.
And then the correction of the error once

00:00:56.921 --> 00:01:01.301
you've found it. And we can think about
different kinds of correction. We might

00:01:01.301 --> 00:01:06.018
automatically correct an error if we're
positive that the error that we know the

00:01:06.018 --> 00:01:10.510
right answer for the error. So H-T-E is a
very common misspelling for the, and so

00:01:10.679 --> 00:01:14.891
many word processors automatically correct
H-T-E. We might suggest a single

00:01:14.891 --> 00:01:19.495
correction if we're, there's only one very
likely correction, or we might suggest a

00:01:19.495 --> 00:01:24.274
whole list of corrections and let the user
pick from among them. We distinguish two

00:01:24.274 --> 00:01:30.657
different classes of spelling errors. Non
word errors are errors in which the, what

00:01:30.657 --> 00:01:37.213
the user types is not a word of English.
So g-r-a-f-f-e a misspelling let's say for

00:01:37.213 --> 00:01:43.510
giraffe is not a word of English. By
contrast, real word errors. Are errors in

00:01:43.510 --> 00:01:49.587
which then the resulting. [sound]
Misspelling is actually a word of English

00:01:49.587 --> 00:01:54.867
and that makes them somewhat harder to
detect. And we can break up real word

00:01:54.867 --> 00:02:00.217
errors into ones produced by really
typographical processes. These were meant

00:02:00.217 --> 00:02:06.132
to type three. And typed [inaudible] let's
say. Or cognitive errors, where the user,

00:02:06.343 --> 00:02:11.992
meant to type a word like [inaudible] and
instead typed a homophone of a, of the

00:02:11.992 --> 00:02:16.369
word, or \u201ct-o-o\u201d instead of
[inaudible] And in both cases what, what's

00:02:16.369 --> 00:02:22.088
produced is a real word of English, but by
modeling the differences between these

00:02:22.088 --> 00:02:27.453
kind of errors, we might come up with
better ways of fixing them both. How

00:02:27.453 --> 00:02:33.523
common are spelling errors? Depends a lot
on the task. So in web queries, spelling

00:02:33.523 --> 00:02:38.951
errors are extremely common. So
practically one in four words in a web

00:02:38.951 --> 00:02:44.218
query are likely to be misspelled. But in
web processing tasks on phones it's much

00:02:44.218 --> 00:02:48.669
harder to get an accurate number. So
there's been a number of studies and most

00:02:48.669 --> 00:02:53.404
of these studies are done by retyping. You
give the user a passage to type and then

00:02:53.404 --> 00:02:57.912
you measure how well they, they type it.
And, of course, that's not quite the same

00:02:57.912 --> 00:03:02.591
user's naturally writing messages or
typing. Nonetheless If you ask users to

00:03:02.591 --> 00:03:06.984
retype and you don't let them use the
backspace key, they make about thirteen

00:03:06.984 --> 00:03:10.806
percent of the words, thirteen percent of
the words are in error. So indicating that

00:03:10.806 --> 00:03:16.028
if, that a lot of words. They correct
themselves with the backspace. If you let

00:03:16.028 --> 00:03:21.220
them correct, now we're trying to
experiment on a, on a p d a style phone

00:03:21.220 --> 00:03:25.858
site, organizer, they'll correct about
seven percent of the words themselves.

00:03:25.858 --> 00:03:30.842
They'll still leave about two percent of
the words uncorrected, on the organizer.

00:03:30.842 --> 00:03:35.964
And, similar numbers on people doing
retyping on a regular keyboard. So, the

00:03:35.964 --> 00:03:41.261
numbers are about two percent where people
typing. And probably a much higher number

00:03:41.261 --> 00:03:46.280
for web queries and probably a much higher
number for people texting. Are the kind of

00:03:46.280 --> 00:03:51.210
spelling, spelling error [inaudible] that
we see. How do we detect non word spelling

00:03:51.210 --> 00:03:56.022
errors. The traditional way is just to use
a large dictionary. Any word not in the

00:03:56.022 --> 00:04:00.596
dictionary is an error. And, the larger
the dictionary, it turns out the better

00:04:00.596 --> 00:04:04.705
this works. For correcting these non-word
spelling errors, we generate a set of

00:04:04.705 --> 00:04:08.624
candidates that's real words that are
similar to the error. And then we pick

00:04:08.624 --> 00:04:12.852
whichever one is best. And we'll talk
about the noisy-channel probability model

00:04:12.852 --> 00:04:17.029
of how to do that. And it's also related
to another method called the shortest

00:04:17.029 --> 00:04:20.948
weighted [inaudible] distance myth. So we
find the words that are not in the

00:04:20.948 --> 00:04:25.125
dictionary. For each one, we generate a
set of candidates. Those are going to be

00:04:25.125 --> 00:04:29.148
real words that are similar, we'll talk
about what similar means, to that error

00:04:29.148 --> 00:04:33.448
and then we'll pick the best one. For real
word spelling errors, the algorithm is

00:04:33.448 --> 00:04:37.650
quite similar. Again, for each word we
generate a candidate set. But now we do

00:04:37.650 --> 00:04:41.742
this for every word in a sentence, not
just the words that are not in some

00:04:41.742 --> 00:04:45.944
dictionary. So real word spelling error
correction, we don't use a dictionary

00:04:45.944 --> 00:04:50.245
because of course the errors are in a
dictionary. So that wouldn't help. So, for

00:04:50.245 --> 00:04:54.381
every word, we generate a candidate set.
So we might find candidate words with

00:04:54.381 --> 00:04:58.463
similar pronunciations, we might find
candidate words with similar spellings,

00:04:58.463 --> 00:05:02.760
and depending on the algorithm, exactly.
And it's very important that we're gonna

00:05:02.760 --> 00:05:07.164
include the word itself, in the candidate
set, because the every word might be a

00:05:07.164 --> 00:05:11.515
misspelling of some other real word, or it
might be the correct word. In fact, most

00:05:11.515 --> 00:05:15.597
words are probably correct. So, for each
candidate set of each possible error,

00:05:15.597 --> 00:05:19.732
we're gonna include the word itself. And
most of the time, in fact, we're gonna

00:05:19.732 --> 00:05:25.644
pick that. And again, how we pick the
words we might use the noisy channel

00:05:25.644 --> 00:05:32.240
model. We might use a classifier, we'll
talk about that so we'll discuss the

00:05:32.240 --> 00:05:38.428
different methods of detecting these
errors and correcting them in the next