Let's talk about Kneser-Ney smoothing, one of the most sophisticated forms of smoothing, but also one with a beautiful and elegant intuition.

Remember that in Good-Turing we talked about the c*'s, the discounted counts you end up with from Good-Turing. We discounted each of the counts (the count of 1 was discounted to about 0.4, the count of 2 to 1.26, and so on) in order to save mass to replace the zero counts with some low number. And if you look at the actual values of these counts, 8.25 for 9 and 7.24 for 8, you'll notice that in a very large number of cases the discounted count has a very close relationship to the original count: it's really the original count minus 0.75, or something close to that. So in practice, what Good-Turing often does is produce a fixed small discount from the counts.

And that intuition, a fixed small discount, can be applied directly. When we do this we call it absolute discounting, and absolute discounting is a popular kind of smoothing. Here we're showing absolute discounting interpolation. Again, the intuition is simply that we'll save ourselves the trouble of computing all those complicated Good-Turing numbers and just subtract 0.75, or maybe a different discount value for different corpora.

Now here's the equation for absolute discounting; we're doing bigrams again. The absolute-discounted probability of a word given the previous word will be a discounted bigram probability, interpolated, with some interpolation weight, with the unigram probability. So we have a unigram probability P(w), and then for the bigram probability we just subtract a fixed amount, let's say 0.75, from the count and otherwise compute the bigram probability in the normal way. That discounted bigram probability is mixed, with some weight (I'll talk later about how to set this weight), with the unigram. And we might keep a couple of extra values of d for counts one and two: counts one and two, we saw on the previous slide, weren't quite subtracting 0.75, so we can model them more carefully with separate discounts.
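Written out, the absolute-discounting interpolation just described has the standard form, with c the bigram count, d the fixed discount (for example 0.75), and lambda the interpolation weight we still have to set:

    P_{AbsoluteDiscounting}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)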
But the problem with absolute discounting is the unigram probability itself, and I want to talk about changing that unigram probability; that's the fundamental intuition of Kneser-Ney. In Kneser-Ney smoothing, the idea is to keep the same interpolation we saw in absolute discounting, but use a better estimate of the probabilities of the lower-order unigrams.

The intuition for that comes from going back to the classic Shannon game. Remember, in the Shannon game we're predicting a word from the previous words. So we see a sentence, "I can't see without my reading ____." What's the most likely next word? Well, "glasses" seems pretty likely. How about, instead, the word "Francisco"? That seems very unlikely in this situation. And yet "Francisco", just as a unigram, is more common than "glasses". But the reason "Francisco" seems like a bad choice after "reading", one intuition we can get, is that "Francisco" always, or very often, follows "San". So while "Francisco" is very frequent, it's frequent in the context of the word "San".

Now, unigrams in an interpolation model, where we're mixing a unigram and a bigram, are specifically useful, they're very helpful, exactly in the case where we haven't seen a bigram. So it's unfortunate that just in the case where we haven't seen the bigram "reading Francisco", we're trusting "Francisco"'s unigram weight, which is exactly where we shouldn't trust it.

So instead of using P(w), how likely a word is overall, our intuition is that when we're backing off we should instead use the continuation probability. We'll call it P_CONTINUATION(w): how likely is the word to appear as a novel continuation? How do we measure novel continuation? For each word we count the number of bigram types it completes, that is, how many different bigrams it creates by appearing after another word. Each bigram type is a novel continuation the first time we see that new bigram. In other words, the continuation probability is going to be proportional to the cardinality of the set of preceding words w_{i-1} that occur with our word: how many word types occur before this word in a bigram. That cardinality is the number we'd like the continuation probability to be proportional to.

So that's how many times w appears as a novel continuation. To turn that into a probability, we divide by the total number of word bigram types: of all word bigrams that occur more than zero times, what's the cardinality of that set, how many different word bigram types are there? Dividing the two gives the probability of continuation: of all the word bigram types, how many have w as a novel continuation.
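As an equation, the continuation probability just described counts the distinct word types that can precede w and normalizes by the total number of bigram types:

    P_{CONTINUATION}(w) = \frac{\bigl|\{ w_{i-1} : c(w_{i-1}, w) > 0 \}\bigr|}{\bigl|\{ (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \}\bigr|}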
Now it turns out there's an alternative metaphor for Kneser-Ney with the same equations. Again, we can see the numerator as the total number of word types that precede w, that is, how many word types w can follow, and we normalize it by the number of word types that could precede all words: the sum, over all words, of the number of word types that can precede each word. And these two are the same: this denominator and the denominator we saw on the previous slide are the same, because the number of possible bigram types is the same as the number of word types that can precede each word, summed over all words. If you think about that for a second, you'll realize it's true. So, in other words, with this kind of Kneser-Ney model, a frequent word like "Francisco" that occurs only in one context, like "San", will have a low continuation probability.

If we put together the intuition of absolute discounting with the Kneser-Ney continuation probability for the lower-order n-gram, we have the Kneser-Ney smoothing algorithm. For the bigram itself we just have absolute discounting: we take the bigram count and subtract some discount d, and I've shown here that we take the max of that and zero because, obviously, if the discount happens to be larger than the count we don't want a negative probability. And we interpolate that with the same continuation probability we just saw, P_CONTINUATION(w_i).

Now let's talk about how to compute the lambda. The lambda is going to take all the probability mass from those normalized discounts that we took out of the higher-order probabilities and use it to weight how much probability we should assign to the unigram; we're going to combine those. So the lambda is the amount of the discount divided by the denominator there, that is, the normalized discount. We then multiply that by the total number of word types that can follow this context w_{i-1}: in other words, how many different word types we discounted, or how many times we applied this normalized discount. Multiplying those together tells us how much probability mass, in total, we can now assign to the continuation of the word. Now, this is the bigram formulation for Kneser-Ney.
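Putting those pieces together, the bigram Kneser-Ney estimate and its lambda weight, as just described, can be written as:

    P_{KN}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}, w_i) - d,\ 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{CONTINUATION}(w_i)

    \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\bigl|\{ w : c(w_{i-1}, w) > 0 \}\bigr|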
Now on this slide we're showing the general recursive formulation for n-grams in general, and here we have to make a slight change to deal with all the higher-order n-grams. Here we're showing the Kneser-Ney probability of a word given its prefix. Just like the Kneser-Ney we saw before, we're interpolating a discounted higher-order n-gram, with a lambda weight, with a lower-order probability. But now we need to distinguish between the very first, top-level time we use a count and the lower-order counts. We're going to use the actual count for the very highest-order n-gram, and the continuation count that we defined earlier for all the lower-order probabilities. So we define this new thing, the Kneser-Ney count c_KN(.), which means the actual count at the top level: say we're doing trigrams, then it's the actual count for the trigram, and when we recurse to the Kneser-Ney probability for the lower-order things, down at the bigrams and unigrams, we use the continuation count, that is, the count of single-word contexts we defined earlier.

So that's Kneser-Ney smoothing, an excellent algorithm. It's very commonly used in speech recognition and machine translation, and yet it has a very beautiful and elegant intuition, and I hope you appreciate it.
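For reference, the general recursive formulation just described can be written, using w_{i-n+1}^{i-1} for the (n-1)-word prefix, as:

    P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\bigl(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\bigr)}{\sum_{v} c_{KN}(w_{i-n+1}^{i-1}\, v)} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})

where c_KN(.) is the actual count for the highest-order n-gram and the continuation count for the lower orders.

To make the bigram case concrete, here is a minimal illustrative Python sketch (not code from the lecture): the function name and toy corpus are my own, the discount d = 0.75 is assumed, and it assumes the context word was seen in training.

    from collections import Counter, defaultdict

    def kneser_ney_bigram(tokens, d=0.75):
        # Precompute the counts needed for interpolated bigram Kneser-Ney.
        bigrams = list(zip(tokens, tokens[1:]))
        context_counts = Counter(tokens[:-1])    # c(prev): times each word serves as a bigram context
        bigram_counts = Counter(bigrams)         # c(prev, w)
        preceding_types = defaultdict(set)       # for each w, the distinct words that precede it
        following_types = defaultdict(set)       # for each prev, the distinct words that follow it
        for prev, w in bigrams:
            preceding_types[w].add(prev)
            following_types[prev].add(w)
        num_bigram_types = len(bigram_counts)    # total number of distinct bigram types

        def p(w, prev):
            # Discounted bigram term: max(c(prev, w) - d, 0) / c(prev)
            discounted = max(bigram_counts[(prev, w)] - d, 0) / context_counts[prev]
            # Lambda: normalized discount times the number of word types that follow prev
            lam = (d / context_counts[prev]) * len(following_types[prev])
            # Continuation probability: fraction of bigram types that w completes
            p_cont = len(preceding_types[w]) / num_bigram_types
            return discounted + lam * p_cont

        return p

    # Toy usage: probability of "glasses" after "reading"
    p = kneser_ney_bigram("i can not see without my reading glasses".split())
    print(p("glasses", "reading"))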