Let's talk about Kneser-Ney smoothing, one of the most sophisticated forms of smoothing, but also one with a beautiful and elegant intuition.

Remember that from Good-Turing we talked about the c*'s, the discounted counts you end up with from Good-Turing. We discounted each of the counts (the count of one was discounted to 0.4, the count of two to 1.26, and so on) in order to save mass to give the zero counts some low number. And if you look at the actual values of these counts, 8.25 for nine and 7.24 for eight, you'll notice that in a very large number of cases the discounted count has a very close relationship with the original count: it's really the original count minus 0.75, or something close to that. So in practice, what Good-Turing often does is produce a fixed small discount from the counts.

And that intuition, that of a fixed small discount, can be applied directly. When we do this, we call it absolute discounting, and absolute discounting is a popular kind of smoothing. Here we're showing absolute discounting interpolation. Again, the intuition is just: we'll save ourselves having to compute all those complicated Good-Turing numbers and simply subtract 0.75, or maybe a different discount value for different corpora.

Now here's the equation for absolute discounting. We're doing bigrams again. The probability, absolute discounted, of a word given the previous word will be a discounted bigram, interpolated, with some interpolation weight, with the unigram probability. So we have a unigram probability P(w), and then the bigram probability, where we just subtract a fixed amount, let's say 0.75, from the count and otherwise compute the bigram probability in the normal way. So we have a discounted bigram probability, mixed with some weight (I'll talk later about how to set this weight) with a unigram. And we might keep a couple of extra values of d for counts one and two: counts one and two, as we saw on a previous slide, weren't quite a matter of subtracting 0.75, so we can model them more carefully by having separate discounts for those.

But the problem with absolute discounting is the unigram probability itself, and I want to talk about changing the unigram probability. That's the fundamental intuition of Kneser-Ney.
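For reference, the absolute discounting interpolation just described can be written as:

\[
P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)
\]

The first term is the discounted bigram estimate, and \(\lambda(w_{i-1})\) is the interpolation weight on the unigram.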
So in Kneser-Ney smoothing, the idea is to keep that same interpolation that we saw in absolute discounting, but to use a better estimate of the probabilities of the lower-order unigrams. For the intuition behind that, we can go back and look at the classic Shannon game. Remember, in the Shannon game we're predicting a word from previous words. So we see a sentence: "I can't see without my reading ___." What's the most likely next word? Well, "glasses" seems pretty likely. How about, instead, the word "Francisco"? That seems very unlikely in this situation. And yet "Francisco", as just a unigram, is more common than "glasses". The reason why "Francisco" seems like a bad choice after "reading" is that "Francisco" always, or very often, follows "San". So while "Francisco" is very frequent, it's frequent in the context of the word "San".

Now, unigrams in an interpolation model, where we're mixing a unigram and a bigram, are specifically useful, they're most helpful, precisely in the case where we haven't seen the bigram. So it's unfortunate that just in the case where we haven't seen the bigram "reading Francisco", we're trusting "Francisco"'s unigram weight, which is exactly where we shouldn't trust it.

So instead of using P(w), how likely a word is overall, our intuition is that when we're backing off to something, we should instead use the continuation probability. We'll call it P_continuation of a word: how likely is the word to appear as a novel continuation? How do we measure novel continuation? For each word, we just count the number of bigram types it completes: how many different bigrams does it create by appearing after another word? In other words, each bigram type is a novel continuation the first time we see that new bigram. So the continuation probability is going to be proportional to the cardinality of a set: the number of preceding word types w_{i-1} that occur with our word. That is, how many different words occur before this word in a bigram, how many preceding word types there are. The cardinality of that set is the number we want our continuation probability to be proportional to: how many times does w appear as a novel continuation? To turn that into a probability, we just divide by the total number of word bigram types, as written out below.
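Written out, the continuation probability is:

\[
P_{\text{CONTINUATION}}(w) = \frac{\bigl|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\bigr|}{\bigl|\{\, (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \,\}\bigr|}
\]

The numerator counts the distinct word types that precede w; the denominator counts all distinct bigram types.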
So, of all word bigrams that occur more than zero times, what is the cardinality of that set? How many different word bigram types are there? We just divide the two to get the probability of continuation: of all the word bigram types, how many have w as a novel continuation?

Now, it turns out that there is an alternative metaphor for Kneser-Ney with the same equations. We can see the numerator as the total number of word types that precede w (how many word types can w follow?), and we normalize it by the number of word types that could precede all words, that is, the sum over all words of the number of word types that can precede each word. And these two are the same: this denominator and the denominator we saw on the previous slide are the same, because the number of possible bigram types is the same as the number of word types that can precede a word, summed over all words. If you think about that for a second, you'll see that it's true.

In other words, with this kind of Kneser-Ney model, a frequent word like "Francisco" that occurs only in one context, like "San", will have a low continuation probability.

If we put together the intuition of absolute discounting with the Kneser-Ney continuation probability for the lower-order n-gram, we have the Kneser-Ney smoothing algorithm. For the bigram itself we just have absolute discounting: we take the bigram count and subtract some discount d, and I've shown here that we take the max of that and zero because, obviously, if the discount happens to be higher than the count, we don't want a negative probability. And we're going to interpolate that, with a weight lambda, with the same continuation probability we just saw, P_continuation of w_i.

Now let's talk about how to compute that lambda. The lambda is going to take all the probability mass from the normalized discounts that we took out of the higher-order probabilities and use it to weight how much probability we should assign to the unigram; we're going to combine those. So the lambda is the amount of the discount divided by the denominator there, that is, the normalized discount, and then we multiply that by the total number of word types that can follow this context w_{i-1}. In other words, how many different word types did we discount, or how many times did we apply this normalized discount? The resulting bigram formula is written out below.
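Putting those pieces together, the bigram Kneser-Ney estimate and its interpolation weight are:

\[
P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}, w_i) - d,\ 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)
\]

\[
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\bigl|\{\, w : c(w_{i-1}, w) > 0 \,\}\bigr|
\]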
Multiplying those two together tells us how much probability mass, in total, we can now assign to the continuation of the word.

Now, this is the bigram formulation of Kneser-Ney. On this slide, we're showing the general recursive formulation for n-grams in general, and here we have to make a slight change to deal with all the higher-order n-grams. We're showing the Kneser-Ney probability of a word given the prefix of the word, and, just like the Kneser-Ney we saw before, we're interpolating a discounted higher-order n-gram, with a lambda weight, with a lower-order probability. But now we need to distinguish between the very first, top-level time that we use a count and these lower-order counts. We're going to use the actual count for the very highest-order n-gram, and we're going to use the continuation count that we just defined earlier for all the lower-order probabilities. So we'll define this new thing, count-Kneser-Ney of dot, which will mean the actual count, say we're doing trigrams, for the trigram; then, when we recurse and compute the Kneser-Ney probability for the lower-order things, when we get down to the bigrams and unigrams, we'll be using the continuation count, that is, the single-word-context count that we defined earlier.

So that's Kneser-Ney smoothing, an excellent algorithm. It's very commonly used in speech recognition and machine translation, and yet it has a very beautiful and elegant intuition, and I hope you appreciate it.
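To make the bigram case concrete, here is a minimal sketch of bigram Kneser-Ney estimation in Python, assuming a fixed discount of d = 0.75 and a plain list of tokens as the corpus. The function names and the fallback for an unseen history are illustrative choices, not something from the lecture.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram_model(tokens, d=0.75):
    """Bigram Kneser-Ney estimator built from a list of tokens.

    Returns a function p_kn(w, prev) giving P_KN(w | prev), using the
    fixed discount d (0.75 is the example value from the lecture).
    """
    bigram_count = Counter(zip(tokens, tokens[1:]))
    history_count = Counter(tokens[:-1])      # c(w_{i-1}): how often each word serves as a bigram history

    preceders = defaultdict(set)              # w -> distinct words seen before w
    followers = defaultdict(set)              # prev -> distinct words seen after prev
    for prev, w in bigram_count:
        preceders[w].add(prev)
        followers[prev].add(w)

    num_bigram_types = len(bigram_count)      # |{(w', w) : c(w', w) > 0}|

    def p_continuation(w):
        # Fraction of all bigram types that w completes as the second word.
        return len(preceders[w]) / num_bigram_types

    def p_kn(w, prev):
        c_prev = history_count[prev]
        if c_prev == 0:
            # Unseen history: fall back to the continuation probability alone
            # (an illustrative choice; the lecture does not cover this case).
            return p_continuation(w)
        discounted = max(bigram_count[(prev, w)] - d, 0) / c_prev
        lam = (d / c_prev) * len(followers[prev])   # normalized discount times |{w : c(prev, w) > 0}|
        return discounted + lam * p_continuation(w)

    return p_kn

# Toy usage, just to show the call pattern.
tokens = "i can not see without my reading glasses".split()
p_kn = kneser_ney_bigram_model(tokens)
print(p_kn("glasses", "reading"))   # discounted bigram term + lambda * continuation term
```

Even on a toy corpus, the result is just the two terms of the bigram formula above: the discounted bigram estimate plus the lambda-weighted continuation probability.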