
Now, 2^n is a lot.

For example, if we have 30 characters in our string, then there'd be on the order of a billion possible segmentations to deal with.
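To make that count concrete: each of the n-1 gaps between characters either is or isn't a word boundary, which gives 2^(n-1) segmentations of an n-character string, the same order of magnitude as 2^n. A quick check of the arithmetic:

```python
# An n-character string has n-1 gaps between characters, and each gap
# is independently either a word break or not, giving 2**(n-1) segmentations.
n = 30
num_segmentations = 2 ** (n - 1)
print(num_segmentations)  # 536870912 -- on the order of a billion
```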

We clearly don't want to have to enumerate them all.

We'd like some way of searching through them efficiently,

without having to consider the probability of every possible segmentation.

That's one of the reasons why making this naive Bayes assumption is so helpful.

It means that there are no interrelations between the various words,

so we can consider them one at a time.

That is, here's one thing we can say.

We can say that the best segmentation is equal to the argmax

over all possible segmentations of the string into a first word and the rest of the words

of the probability of that first word times the probability of the best segmentation of the rest of the words.

And notice that this is independent.

The best segmentation of the rest of the words doesn't depend on the first word.

And so that means we don't have to consider all interactions,

and we don't need to consider all 2^n possibilities.
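The recurrence above — the best segmentation is the best choice of first word times the best segmentation of the rest — can be sketched directly as a memoized recursive function. This is only a sketch: the unigram probabilities here are made-up toy values, and a real system would estimate them from counts in a large corpus.

```python
from functools import lru_cache

# Toy unigram probabilities (hypothetical values, for illustration only).
# In practice these come from word counts over a large corpus of text.
P = {"now": 0.1, "is": 0.1, "the": 0.2, "time": 0.05, "no": 0.04}

def p_word(word):
    # Unseen words get a tiny nonzero probability rather than zero.
    return P.get(word, 1e-10)

@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, word list) for the best segmentation of text."""
    if not text:
        return (1.0, [])
    candidates = []
    # Try every way of splitting off a first word; the best segmentation
    # of the rest is independent of that choice, so it can be cached.
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, rest_words = segment(rest)
        candidates.append((p_word(first) * p_rest, [first] + rest_words))
    return max(candidates)

prob, words = segment("nowisthetime")
print(words)  # ['now', 'is', 'the', 'time']
```

Because results for each suffix are cached, each of the n suffixes is solved once, so the work is polynomial in n rather than exponential — which is exactly the payoff of the independence assumption.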

So now we have two reasons why the naive Bayes assumption is a good thing.

One is it makes this computation much more efficient,

and secondly, it makes learning easier,

because it's easy to come up with a unigram probability.

What's the probability of an individual word from our corpus of text?

It's much harder to get good estimates for sequences of multiple words.

We're going to have to do more smoothing, more guessing at what those probabilities are,

because we just won't have the counts for them.
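To illustrate how easy the unigram case is, here is a minimal sketch of estimating word probabilities from counts, using simple add-one (Laplace) smoothing so that unseen words still get a small nonzero probability. The corpus here is a toy stand-in; a real model would count over a very large body of text.

```python
from collections import Counter

# Toy corpus (a real model would use a corpus of billions of words).
corpus = "the time is now the time is here now is the time".split()
counts = Counter(corpus)
N = sum(counts.values())   # total number of word tokens
V = len(counts)            # vocabulary size (distinct word types)

def p_unigram(word, k=1):
    # Add-k smoothing: (count + k) / (N + k * V), so unseen words
    # get probability k / (N + k * V) instead of zero.
    return (counts[word] + k) / (N + k * V)

print(p_unigram("the"))          # a common word: well-estimated from counts
print(p_unigram("unseenword"))   # still nonzero, thanks to smoothing
```

For sequences of two or more words, the table of counts is vastly larger and mostly zeros, so far more of the probability mass has to come from smoothing rather than from observed counts.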