In this case, the problem really comes down to the fact that the naive Bayes assumption is
a weak one, and a Markov assumption would do much better.
It wouldn't really help to have more data or to do a better job of smoothing,
because I already have good counts for words like "in" and "significant"
as well as words like "small" and "and."
They're all common enough that I have a good representation of how often each occurs
as a unigram, as a single word.
The problem is that we would like to know that the word "small" goes very well
with the word "insignificant" but does not go very well with the word "significant."
So if we had a Markov model where the probability of "insignificant" depended
on the preceding word "small," then we could catch that,
and we could get this segmentation correct.
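To make this concrete, here is a minimal sketch comparing how the two models score the competing segmentations. The counts, corpus size, and add-one smoothing below are illustrative assumptions, not values from the lecture; the point is only that a unigram model scores words independently, while a bigram Markov model conditions each word on the one before it.

```python
from math import log

# Hypothetical counts (illustrative values only, not real corpus data).
unigram = {"small": 5000, "in": 200000, "significant": 3000, "insignificant": 400}
bigram = {("small", "insignificant"): 50, ("small", "in"): 10, ("in", "significant"): 5}
N = 1_000_000  # assumed corpus size


def unigram_score(words):
    # Naive Bayes / unigram model: every word is scored independently.
    return sum(log(unigram.get(w, 1) / N) for w in words)


def bigram_score(words):
    # Markov (bigram) model: each word is conditioned on the previous word,
    # with simple add-one smoothing (an assumption for this sketch).
    score = log(unigram.get(words[0], 1) / N)
    for prev, w in zip(words, words[1:]):
        count = bigram.get((prev, w), 0) + 1
        score += log(count / (unigram.get(prev, 1) + len(unigram)))
    return score


cand_a = ["small", "in", "significant"]
cand_b = ["small", "insignificant"]

# With these counts, the unigram model prefers "small in significant",
# while the bigram model prefers "small insignificant".
print(unigram_score(cand_a), unigram_score(cand_b))
print(bigram_score(cand_a), bigram_score(cand_b))
```

Under these assumed counts, the unigram score rewards the very common word "in" and picks the wrong split, while the bigram score rewards the strong "small insignificant" pairing and picks the right one, which is exactly the point being made above.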