Today we're going to talk about spelling correction. Lots of applications make use of spelling correction. For example, word processing: almost any modern word processor will take a misspelled word like "component" spelled with an A, give you suggestions like "component" with an E, and automatically replace it for you. Modern search engines will not only flag an error, say "language" spelled without a U here, but give you the results as if you had spelled the word correctly. And modern phones additionally will automatically find misspelled words. Here I typed l-a-y-r, and it replaced it automatically, or suggested a replacement, with "layer".

We can distinguish a number of separate tasks in spelling correction. One is the detection of the error itself, and then the correction of the error once you've found it. And we can think about different kinds of correction. We might automatically correct an error if we're positive that we know the right answer for the error.
So h-t-e is a very common misspelling of "the", and so many word processors automatically correct h-t-e. We might suggest a single correction if there's only one very likely correction, or we might suggest a whole list of corrections and let the user pick from among them.

We distinguish two different classes of spelling errors. Non-word errors are errors in which what the user types is not a word of English. So g-r-a-f-f-e, a misspelling, let's say, for "giraffe", is not a word of English. By contrast, real-word errors are errors in which the resulting misspelling is actually a word of English, and that makes them somewhat harder to detect. And we can break real-word errors up into ones produced by purely typographical processes, where the user meant to type "three" and typed "there", let's say.
Or cognitive errors, where the user meant to type a word and instead typed a homophone of the word, like t-o-o instead of "two". In both cases, what's produced is a real word of English, but by modeling the differences between these kinds of errors, we might come up with better ways of fixing them both.

How common are spelling errors? It depends a lot on the task. In web queries, spelling errors are extremely common: practically one in four words in a web query is likely to be misspelled. But in word processing tasks, and on phones, it's much harder to get an accurate number. There have been a number of studies, and most of these studies are done by retyping: you give users a passage to type and then you measure how well they type it. Of course, that's not quite the same as users naturally writing messages or typing. Nonetheless, if you ask users to retype and you don't let them use the backspace key, about thirteen percent of the words are in error, indicating that they correct a lot of words themselves with the backspace.
If you do let them correct, in an experiment on a PDA-style phone-sized organizer, they'll correct about seven percent of the words themselves. They'll still leave about two percent of the words uncorrected on the organizer, and there are similar numbers for people retyping on a regular keyboard. So the numbers are about two percent when people are typing, and probably a much higher number for web queries, and probably a much higher number for people texting. Those are the kinds of spelling error rates that we see.

How do we detect non-word spelling errors? The traditional way is just to use a large dictionary: any word not in the dictionary is an error. And the larger the dictionary, it turns out, the better this works. For correcting these non-word spelling errors, we generate a set of candidates, that is, real words that are similar to the error, and then we pick whichever one is best. We'll talk about the noisy channel probability model of how to do that. It's also related to another method called the shortest weighted edit distance method.
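The dictionary-based detection step described here can be sketched in a few lines of Python; the tiny word set below is a toy stand-in for a real large dictionary:

```python
# Minimal sketch of dictionary-based non-word error detection.
# The tiny DICTIONARY below is an illustrative stand-in for a
# large real word list; any token not in it is flagged as an error.
DICTIONARY = {"the", "a", "giraffe", "ran", "across", "road"}

def non_word_errors(text):
    """Return the tokens of `text` that do not appear in the dictionary."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in DICTIONARY]

print(non_word_errors("the graffe ran across the road"))  # ['graffe']
```

As the transcript notes, this only handles non-word errors; a real-word error like "there" for "three" would pass the membership test untouched.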
So we find the words that are not in the dictionary. For each one, we generate a set of candidates: those are going to be real words that are similar to that error (we'll talk about what similar means), and then we'll pick the best one.

For real-word spelling errors, the algorithm is quite similar. Again, for each word we generate a candidate set, but now we do this for every word in the sentence, not just the words that are not in some dictionary. For real-word spelling error correction we don't use a dictionary, because of course the errors are in the dictionary, so that wouldn't help. So, for every word, we generate a candidate set. We might find candidate words with similar pronunciations, or candidate words with similar spellings, depending on the exact algorithm. And it's very important that we're going to include the word itself in the candidate set, because every word might be a misspelling of some other real word, or it might be the correct word. In fact, most words are probably correct. So, for each candidate set of each possible error, we're going to include the word itself.
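One common way to make "similar" concrete is spelling similarity via edits: generate every string one insert, delete, substitute, or transposition away from the typo, and keep the ones that are real words. A sketch of that candidate-generation step (the three-word dictionary in the last line is illustrative only):

```python
import string

def edits1(word):
    """All strings one edit (delete, transpose, substitute, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word, dictionary):
    """The candidate set: real words within one edit of `word`."""
    return {w for w in edits1(word) if w in dictionary}

print(candidates("graffe", {"giraffe", "grate", "graft"}))  # {'giraffe'}
```

For real-word correction, as the transcript says, you would also add the word itself to the set whenever it is in the dictionary, since it is usually correct as typed.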
And most of the time, in fact, we're going to pick that. And again, to pick among the words we might use the noisy channel model, or we might use a classifier; we'll talk about that. So we'll discuss the different methods of detecting these errors and correcting them in the next lectures.
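As a preview of the noisy channel idea, each candidate w for an observed typo x is scored by P(w) * P(x | w), a language-model prior times an error-model likelihood, and we pick the highest-scoring candidate. All the probabilities below are made-up illustrative numbers, not estimates from real data:

```python
def noisy_channel_best(observed, candidates, prior, channel):
    """Pick the candidate w maximizing P(w) * P(observed | w)."""
    return max(candidates,
               key=lambda w: prior.get(w, 0.0) * channel.get((observed, w), 0.0))

# Toy, made-up probabilities for the typo "graffe".
prior = {"giraffe": 1e-5, "grate": 1e-4}      # language model P(w)
channel = {("graffe", "giraffe"): 1e-2,       # error model P(x | w)
           ("graffe", "grate"): 1e-6}
print(noisy_channel_best("graffe", ["giraffe", "grate"], prior, channel))  # giraffe
```

For real-word errors, including the word itself among the candidates (with a high channel probability for "typed exactly what was meant") is what lets the model leave most words alone.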