WEBVTT 00:00:01.560 --> 00:00:06.758 Today we're gonna talk about spelling correction. Lots of applications make use 00:00:06.758 --> 00:00:11.628 of spelling correction. For example, word processing, almost any modern word 00:00:11.628 --> 00:00:16.630 processor will take a misspelled word like component with an A and give you 00:00:16.630 --> 00:00:22.077 suggestions like component with an E and automatically replace it for you. Modern 00:00:22.077 --> 00:00:28.259 search engines will not only flag an error. So, language spelled without a u, 00:00:28.259 --> 00:00:34.936 here. But, give you, the results, as if you had spelled the word correctly. And, 00:00:34.936 --> 00:00:40.953 modern phones additionally will automatically find misspelled words. Here, 00:00:40.953 --> 00:00:47.135 I typed l-a-y-r, and it replaced it automatically, or suggests a replacement, 00:00:47.135 --> 00:00:52.260 with late. We can distinguish a number of separate tasks and spelling correction. 00:00:52.260 --> 00:00:56.921 One is the detection of the error itself. And then the correction of the error once 00:00:56.921 --> 00:01:01.301 you've found it. And we can think about different kinds of correction. We might 00:01:01.301 --> 00:01:06.018 automatically correct an error if we're positive that the error that we know the 00:01:06.018 --> 00:01:10.510 right answer for the error. So H-T-E is a very common misspelling for the, and so 00:01:10.679 --> 00:01:14.891 many word processors automatically correct H-T-E. We might suggest a single 00:01:14.891 --> 00:01:19.495 correction if we're, there's only one very likely correction, or we might suggest a 00:01:19.495 --> 00:01:24.274 whole list of corrections and let the user pick from among them. We distinguish two 00:01:24.274 --> 00:01:30.657 different classes of spelling errors. Non word errors are errors in which the, what 00:01:30.657 --> 00:01:37.213 the user types is not a word of English. So g-r-a-f-f-e a misspelling let's say for 00:01:37.213 --> 00:01:43.510 giraffe is not a word of English. By contrast, real word errors. Are errors in 00:01:43.510 --> 00:01:49.587 which then the resulting. [sound] Misspelling is actually a word of English 00:01:49.587 --> 00:01:54.867 and that makes them somewhat harder to detect. And we can break up real word 00:01:54.867 --> 00:02:00.217 errors into ones produced by really typographical processes. These were meant 00:02:00.217 --> 00:02:06.132 to type three. And typed [inaudible] let's say. Or cognitive errors, where the user, 00:02:06.343 --> 00:02:11.992 meant to type a word like [inaudible] and instead typed a homophone of a, of the 00:02:11.992 --> 00:02:16.369 word, or \u201ct-o-o\u201d instead of [inaudible] And in both cases what, what's 00:02:16.369 --> 00:02:22.088 produced is a real word of English, but by modeling the differences between these 00:02:22.088 --> 00:02:27.453 kind of errors, we might come up with better ways of fixing them both. How 00:02:27.453 --> 00:02:33.523 common are spelling errors? Depends a lot on the task. So in web queries, spelling 00:02:33.523 --> 00:02:38.951 errors are extremely common. So practically one in four words in a web 00:02:38.951 --> 00:02:44.218 query are likely to be misspelled. But in web processing tasks on phones it's much 00:02:44.218 --> 00:02:48.669 harder to get an accurate number. So there's been a number of studies and most 00:02:48.669 --> 00:02:53.404 of these studies are done by retyping. You give the user a passage to type and then 00:02:53.404 --> 00:02:57.912 you measure how well they, they type it. And, of course, that's not quite the same 00:02:57.912 --> 00:03:02.591 user's naturally writing messages or typing. Nonetheless If you ask users to 00:03:02.591 --> 00:03:06.984 retype and you don't let them use the backspace key, they make about thirteen 00:03:06.984 --> 00:03:10.806 percent of the words, thirteen percent of the words are in error. So indicating that 00:03:10.806 --> 00:03:16.028 if, that a lot of words. They correct themselves with the backspace. If you let 00:03:16.028 --> 00:03:21.220 them correct, now we're trying to experiment on a, on a p d a style phone 00:03:21.220 --> 00:03:25.858 site, organizer, they'll correct about seven percent of the words themselves. 00:03:25.858 --> 00:03:30.842 They'll still leave about two percent of the words uncorrected, on the organizer. 00:03:30.842 --> 00:03:35.964 And, similar numbers on people doing retyping on a regular keyboard. So, the 00:03:35.964 --> 00:03:41.261 numbers are about two percent where people typing. And probably a much higher number 00:03:41.261 --> 00:03:46.280 for web queries and probably a much higher number for people texting. Are the kind of 00:03:46.280 --> 00:03:51.210 spelling, spelling error [inaudible] that we see. How do we detect non word spelling 00:03:51.210 --> 00:03:56.022 errors. The traditional way is just to use a large dictionary. Any word not in the 00:03:56.022 --> 00:04:00.596 dictionary is an error. And, the larger the dictionary, it turns out the better 00:04:00.596 --> 00:04:04.705 this works. For correcting these non-word spelling errors, we generate a set of 00:04:04.705 --> 00:04:08.624 candidates that's real words that are similar to the error. And then we pick 00:04:08.624 --> 00:04:12.852 whichever one is best. And we'll talk about the noisy-channel probability model 00:04:12.852 --> 00:04:17.029 of how to do that. And it's also related to another method called the shortest 00:04:17.029 --> 00:04:20.948 weighted [inaudible] distance myth. So we find the words that are not in the 00:04:20.948 --> 00:04:25.125 dictionary. For each one, we generate a set of candidates. Those are going to be 00:04:25.125 --> 00:04:29.148 real words that are similar, we'll talk about what similar means, to that error 00:04:29.148 --> 00:04:33.448 and then we'll pick the best one. For real word spelling errors, the algorithm is 00:04:33.448 --> 00:04:37.650 quite similar. Again, for each word we generate a candidate set. But now we do 00:04:37.650 --> 00:04:41.742 this for every word in a sentence, not just the words that are not in some 00:04:41.742 --> 00:04:45.944 dictionary. So real word spelling error correction, we don't use a dictionary 00:04:45.944 --> 00:04:50.245 because of course the errors are in a dictionary. So that wouldn't help. So, for 00:04:50.245 --> 00:04:54.381 every word, we generate a candidate set. So we might find candidate words with 00:04:54.381 --> 00:04:58.463 similar pronunciations, we might find candidate words with similar spellings, 00:04:58.463 --> 00:05:02.760 and depending on the algorithm, exactly. And it's very important that we're gonna 00:05:02.760 --> 00:05:07.164 include the word itself, in the candidate set, because the every word might be a 00:05:07.164 --> 00:05:11.515 misspelling of some other real word, or it might be the correct word. In fact, most 00:05:11.515 --> 00:05:15.597 words are probably correct. So, for each candidate set of each possible error, 00:05:15.597 --> 00:05:19.732 we're gonna include the word itself. And most of the time, in fact, we're gonna 00:05:19.732 --> 00:05:25.644 pick that. And again, how we pick the words we might use the noisy channel 00:05:25.644 --> 00:05:32.240 model. We might use a classifier, we'll talk about that so we'll discuss the 00:05:32.240 --> 00:05:38.428 different methods of detecting these errors and correcting them in the next