Let's talk about Kneser-Ney smoothing, one of the most sophisticated forms of smoothing, but also one with a beautiful and elegant intuition.

Remember that from Good-Turing we talked about the c*'s, the discounted counts you end up with from Good-Turing. We discounted each of the counts (the count of one was discounted to .4, the count of two was discounted to 1.26, and so on) in order to save mass to replace the zero counts with some low number. And if you look at the actual values of these counts (8.25 for nine, 7.24 for eight), you'll notice that in a very large number of cases the discounted count has a very close relationship with the original count: it's really the original count minus .75, or something close to that. So, in practice, what Good-Turing often does is produce a fixed small discount from the counts.

That intuition, a fixed small discount, can be applied directly. When we do this we call it absolute discounting, and absolute discounting is a popular kind of smoothing. Here we're showing absolute discounting interpolation. Again, the intuition is just: we'll save some time and not have to compute all those complicated Good-Turing numbers; we'll just subtract .75, or maybe a different discount value for different corpora.

Now here's the equation for absolute discounting. We're doing bigrams again. The absolute-discounted probability of a word given the previous word will be a discounted bigram probability, interpolated, with some interpolation weight, with the unigram probability. So we have a unigram probability P(w), and then the bigram probability, where we just subtract a fixed amount, let's say .75, from the count, and otherwise compute the bigram probability in the normal way.
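Written out, the interpolated absolute-discounting estimate just described takes the following form (a reconstruction from the description above, with d the fixed discount, for example .75, and lambda(w_{i-1}) the interpolation weight discussed next):

P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)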
So, we have a discounted bigram probability, mixed, with some weight, with a unigram; I'll talk later about how to set this weight. And maybe we might keep a couple of extra values of d for counts one and two: counts one and two, we saw on a previous slide, weren't quite subtracting .75, so we can model this more carefully by having separate discounts for those (a small sketch of that idea follows this passage).

But the problem with absolute discounting is the unigram probability itself, and I want to talk about changing the unigram probability. That's the fundamental intuition of Kneser-Ney. In Kneser-Ney smoothing, the idea is to keep that same interpolation that we saw in absolute discounting, but use a better estimate of the probabilities of the lower-order unigrams.

For the intuition, we can go back and look at the classic Shannon game. Remember, in the Shannon game we're predicting a word from previous words, so we see a sentence: "I can't see without my reading ___." What's the most likely next word? Well, "glasses" seems pretty likely. How about, instead, the word "Francisco"? That seems very unlikely in this situation. And yet "Francisco", as just a unigram, is more common than "glasses". The reason "Francisco" seems like a bad choice after "reading" is that "Francisco" always, or at least very often, follows "San". So while "Francisco" is very frequent, it's frequent in the context of the word "San".

Now, unigrams in an interpolation model, where we're mixing a unigram and a bigram, are specifically useful exactly in the case where we haven't seen a bigram. So it's unfortunate that just in the case where we haven't seen the bigram "reading Francisco", we're trusting "Francisco"'s unigram weight, which is exactly where we shouldn't trust it.
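As a rough sketch of that aside about keeping separate discounts for counts of one and two, the lookup might look something like the following; the function name and the particular discount values are placeholders for illustration, not values estimated from data:

def discount_for(count, d1=0.5, d2=1.0, d_default=0.75):
    """Return a count-dependent discount: one value for counts of 1,
    another for counts of 2, and a default for larger counts.
    (Placeholder values, purely illustrative.)"""
    if count == 1:
        return d1
    if count == 2:
        return d2
    return d_default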
So instead of using P(w), how likely is a word, our intuition is going to be that when we're backing off to something, we should instead use the continuation probability. We're going to call it P_continuation of a word: how likely is the word to appear as a novel continuation?

How do we measure novel continuation? For each word, we'll just count the number of bigram types it completes: how many different bigrams does it create by appearing after another word? In other words, each bigram type is a novel continuation the first time we see that new bigram. So the continuation probability is going to be proportional to the cardinality of a set: the number of preceding word types w_{i-1} that occur with our word. How many words occur before this word in a bigram? How many preceding words are there? The cardinality of that set is the number we would like our continuation probability to be proportional to.

So, that's how many times w appears as a novel continuation. We need to turn that into a probability, so we just divide by the total number of word bigram types: of all word bigrams that occur more than zero times, what's the cardinality of that set? How many different word bigram types are there? We divide the two to get the probability of continuation: of all the word bigram types, how many have w as a novel continuation?

Now it turns out that there's an alternative metaphor for Kneser-Ney with the same equations. We can see the numerator as the total number of word types that precede w (how many word types can w follow?), and we normalize it by the number of word types that could precede any word: that is, the sum over all words of the number of word types that can precede the word.
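As a toy illustration of that definition (nothing here is from the lecture; the helper name and the miniature counts are made up), the continuation probability can be computed from a table of bigram counts like this:

from collections import Counter

def continuation_prob(w, bigram_counts):
    """P_continuation(w): of all distinct bigram types, what fraction
    have w as their second word?"""
    # Numerator: number of distinct word types that precede w.
    preceding_types = len({u for (u, v) in bigram_counts if v == w})
    # Denominator: total number of distinct bigram types seen.
    return preceding_types / len(bigram_counts)

# "Francisco" is frequent here but only ever follows "San", so its
# continuation probability is low; "glasses" follows several different words.
bigrams = Counter([("San", "Francisco")] * 5 +
                  [("reading", "glasses"), ("my", "glasses"), ("new", "glasses")])
print(continuation_prob("Francisco", bigrams))  # 1 preceding type / 4 types = 0.25
print(continuation_prob("glasses", bigrams))    # 3 preceding types / 4 types = 0.75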
And those two denominators are the same: the denominator in this alternative formulation and the denominator we saw on the previous slide are equal, because the number of possible bigram types is the same as the number of word types that can precede a word, summed over all words. If you think about that for a second, you'll realize it's true.

So, with this kind of Kneser-Ney model, a frequent word like "Francisco" that occurs only in one context, like "San", will have a low continuation probability.

If we put together the intuition of absolute discounting with the Kneser-Ney continuation probability for the lower-order n-gram, we have the Kneser-Ney smoothing algorithm. For the bigram itself we just have absolute discounting: we take the bigram count, we subtract some discount d, and we take the max of that and zero because, obviously, if the discount happens to be larger than the count, we don't want a negative probability. And we're going to interpolate that, with a lambda weight, with the same continuation probability we just saw, P_continuation of w_i.

Now let's talk about how to compute that lambda. The lambda is going to take all the probability mass from the normalized discounts that we took out of the higher-order probabilities, and use it to weight how much probability we should assign to the unigram; we're going to combine those. So the lambda is the amount of the discount, d, divided by the denominator there, the context count: it's the normalized discount. Then we multiply that by the total number of word types that can follow this context w_{i-1}. In other words, how many different word types did we discount, or how many times did we apply this normalized discount? We multiply those together, and that tells us how much probability mass, in total, we can now assign to the continuation of the word.

Now, this is the bigram formulation for Kneser-Ney.
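Here is a minimal sketch of that bigram formulation in Python, assuming bigram counts are stored in a dictionary keyed by (previous word, word) pairs; the function name and the default discount of 0.75 are assumptions for illustration, not the only possible choices:

def kneser_ney_bigram(w_prev, w, bigram_counts, d=0.75):
    """Interpolated Kneser-Ney estimate P_KN(w | w_prev) for a bigram model.
    bigram_counts maps (previous word, word) pairs to raw counts."""
    # Continuation probability: distinct predecessors of w over all bigram types.
    p_cont = len({u for (u, v) in bigram_counts if v == w}) / len(bigram_counts)

    # c(w_prev): total count of tokens observed after w_prev.
    context_count = sum(c for (u, _), c in bigram_counts.items() if u == w_prev)
    if context_count == 0:
        return p_cont  # unseen context: give all the mass to the continuation term

    # Discounted bigram term: max(c(w_prev, w) - d, 0) / c(w_prev).
    discounted = max(bigram_counts.get((w_prev, w), 0) - d, 0) / context_count

    # lambda(w_prev): the normalized discount d / c(w_prev), multiplied by the
    # number of distinct word types that follow w_prev, i.e. how many times
    # the discount d was actually subtracted in this context.
    follow_types = len({v for (u, v) in bigram_counts if u == w_prev})
    lam = (d / context_count) * follow_types

    return discounted + lam * p_cont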
Now in this slide we're showing the general recursive formulation for n-grams, and here we have to make a slight change to deal with the higher-order n-grams. We're showing the Kneser-Ney probability of a word given the prefix of the word, and, just like the Kneser-Ney we saw before, we're interpolating a discounted higher-order n-gram, with a lambda weight, with a lower-order probability.

But now we need to distinguish between the very first, top-level time that we use a count and the lower-order counts. We're going to use the actual count for the very highest-order n-gram, and we're going to use the continuation count that we defined earlier for all the lower-order probabilities. So we'll define this new thing, count_KN(·). Say we're doing trigrams: for the trigram it's the actual count, and then when we recurse and take the Kneser-Ney probability of the lower-order things, when we get down to the bigrams and unigrams, we'll be using the continuation count, that is, the count of single-word contexts that we defined earlier (the full recursive equation is written out after this wrap-up).

So that's Kneser-Ney smoothing, an excellent algorithm. It's very commonly used in speech recognition and machine translation, and yet it has a very beautiful and elegant intuition, and I hope you appreciate it.
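For reference, the recursive formulation described above can be written out as follows; this is a reconstruction from the description rather than a copy of the slide:

P_{\mathrm{KN}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(c_{\mathrm{KN}}(w_{i-n+1}^{i}) - d,\; 0\right)}{c_{\mathrm{KN}}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{\mathrm{KN}}(w_i \mid w_{i-n+2}^{i-1})

where c_KN(·) is the regular count for the highest-order n-gram and the continuation count (the number of distinct single-word contexts) for all lower orders.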