
Title:
Information.6.MutualInformation

Description:
Introduction to mutual information

so we've been talking about information

how you measure information

and of course, you measure information in bits

how you can use information to label things

for instance, with bar codes

so you can use information

to label things

and then we talked about

probability and information

and if I have probabilities for events P_i

then it says that the amount of information

that's associated with this event occurring is

minus the sum over i

p_i log to the base 2

of p_i
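as a small sketch (in Python, with made-up example probabilities), the formula minus the sum of p_i log2 p_i can be computed directly:

```python
import math

def shannon_information(probs):
    """I = -sum_i p_i * log2(p_i); terms with p_i == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# a fair coin carries exactly one bit
print(shannon_information([0.5, 0.5]))   # 1.0
# four equally likely outcomes carry two bits
print(shannon_information([0.25, 0.25, 0.25, 0.25]))   # 2.0
```

the probabilities here are illustrative, not from the lecture.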

this beautiful formula that was developed

by Maxwell, Boltzmann, and Gibbs

back in the middle of the nineteenth century

to talk about the amount of entropy

in atoms or molecules

this is often called S

for entropy, as well

and then was rediscovered

by Claude Shannon

in the 1940's

to talk about information theory in the abstract

and the mathematical theory of communication

in fact, there is a funny story

about this that

Shannon, when he came up with this formula

minus sum over i p_i log to the base 2 of p_i

he went to John Von Neumann

the famous mathematician

and he said, "what sould I call this quantity?"

and Von Neumann says

"you should call it H, because that's what Boltzmann called it "

but Von Neumann, who had a famously good memory

apparently forgot

that Boltzmann's H

from his famous "H theorem"

was the same thing but without the minus sign

so it's a negative quantity

and it gets more and more negative

as opposed to entropy

which is a positive quantity

and gets more and more positive

so, actually these fundamental formulas

about information theory

go back to the mid 19th century

a hundred and fifty years ago

so now we'd like to apply them

to ideas about communication

and to do that, I'd like to tell you

a little bit more about

probability

so, we talked about

probabilities for events

probability x

you know x equals

"it's sunny"

probability of y

y is "it's raining"

and we could look at the probability

of x

and I'm gonna use the notation

I introduced for Boolean logic before

is the probability of this thing right here

means "AND"

"probability of X AND Y"

or we can also just call

this the probability of X Y simultaneously

keep on streamlining our notation

this is the probability

that it's sunny and it's raining

now, mostly in the world

this is a pretty small probability

but here in Santa Fe

it happens all the time

and as a result, you get rather beautiful

rainbows...single, double, triple

on a daily basis

so, we have...

this is what is called

the joint probability

the joint probability that it's sunny and it's raining

the joint probability of X and Y

and what do we expect of this
joint probability?

so, we have the probability of X and Y

and this tells you the probability
that it's sunny and it's raining

we can also look at the probability
X AND NOT Y

so X AND (NOT Y)

again using our notation

introduced to us by the famous husband

of the niece of the Surveyor General
of British India

George Boole, married to Mary Everest

and we have a relationship which says

that the probability of X on its own

should be equal to the probability
of X AND Y plus the

probability of X AND (NOT Y)

and the probability of X on its own

is called the "marginal probability"

so, it's just the probability that

it's sunny on its own

so the probability that it's sunny on its own

is the probability that it's sunny and it's raining

plus the probability that it's sunny and it's not raining

I think this makes some kind of sense

why is it called the "marginal probability"?

I have no idea

so let's not even worry about it
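the relationship P(X) = P(X AND Y) + P(X AND (NOT Y)) can be checked with a quick sketch; the numbers below are hypothetical, just picked to feel Santa Fe-ish:

```python
# hypothetical joint probabilities for a day in Santa Fe
p_sunny_and_rain = 0.05      # P(X AND Y): sunny and raining
p_sunny_and_no_rain = 0.80   # P(X AND (NOT Y)): sunny and not raining

# marginal probability: P(X) = P(X AND Y) + P(X AND (NOT Y))
p_sunny = p_sunny_and_rain + p_sunny_and_no_rain
print(p_sunny)   # 0.85, up to float rounding
```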

there's a very nice picture of probabilities

in terms of set theory

I don't know about you

but I grew up in the age of "new math"

where they tried to teach us

about set theory

and unions of sets

and intersections of sets and things like that

from starting at a very early age

which means people of my generation

are completely unable to do their tax returns

but for me, dealing a lot with math

it actually has been quite helpful

for my career to learn about set theory at the age of 3 or 4

or whatever it was

so, we have a picture like this

this is the space or the set of all events

here is the set X

which is the set of events X, where
it's sunny

here is the set Y, the set of events
where it's raining

this thing right here is called

"X intersection Y"

which is the set of events

where it's both sunny and it's raining

but in contrast, if I look at

this right here

this is "X union Y"

which is the set of events

where it's either sunny or raining

and now you can kind of see

where George Boole got his funny
"cap" and "cup" notation

we can pair this with X AND Y

X AND Y, from a logical standpoint

is essentially the same as this intersection
of these sets

and similarly, X union Y
is X OR Y

so when I take the logical statement
corresponding to the set of events

that I write as X AND Y

the set of events is the intersection
of 'it's sunny' and 'it's raining'

and X OR Y is the union of the events
where it's sunny or it's raining

and you can have all kinds of you know
nice pictures

here's Z where let's say it's snowy at the
same time it's sunny

which is something that I've seen happen
here in Santa Fe

this is not so strange here

where we have X intersection Y intersection Z

which is not the empty set here in Santa Fe
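the AND/intersection and OR/union pairing maps straight onto set operations; here's a sketch with hypothetical day numbers standing in for events:

```python
# hypothetical sets of days on which each event occurred
X = {1, 2, 3, 4}   # sunny days
Y = {3, 4, 5}      # rainy days
Z = {4, 6}         # snowy days

print(X & Y)       # X intersection Y: days that are sunny AND raining
print(X | Y)       # X union Y: days that are sunny OR raining
print(X & Y & Z)   # sunny AND raining AND snowy, non-empty in Santa Fe
```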

ok, so now let's actually look

at the kinds of information that are
associated with this

suppose that I have a set of possible
events, I'll call one set labeled by i

the other set, labeled by j

and now I can look at p of i and j

so this is a case where the
first type of event

is i and the second type of event is j

and I can define

you know, I'm gonna do this
slightly different

let's call this... we'll be slightly fancier

we'll call these event x_i and event y_j

so, i labels the different events of x

and j labels the different events of y

so, for instance x_i could be two events
either it's sunny or it's not sunny

so i could be zero, and it would be
'it's not sunny'

and 1 could be it's sunny

and j could be it's either raining

or it's not raining

so there are two possible values of y

I'm just trying to make my life easier

so we have a joint probability
distribution p of x_i and y_j

this is our joint probability, as before

and now we have a joint information

which we shall call I of X and Y

this is the information

that's inherent in the joint set of events

X and Y

in our case, it being sunny and not sunny,
raining and not raining

and this just takes the same form as before

we sum over all different possibilities

sunny and raining, not sunny and raining,
sunny and not raining, not sunny and not raining

this is why one shouldn't try to enumerate these things

minus the sum over i and j of p of x_i y_j log to the base 2 of p of x_i y_j

so this is the amount of information that's

inherent with these two sets of events
together

and of course, we still have this, if you like the

marginal information, the information
of X on its own

which is now just the sum over events x
on its own

of the marginal distribution

why it's called "marginal" I don't know

it's just the probability for X on its own

p of x_i log to the base 2 of p of x_i

and similarly we can talk about

I of Y is minus the sum over j
p of Y_j log to the base 2 of

p of Y_j

this is the amount of information

inherent in whether it's sunny or not sunny

it could be up to a bit of information

if it's probability one half of being
sunny or not sunny

then there's a bit of information; let me
tell you, in Santa Fe

there's far less than a bit of information

on whether it's sunny or not

because it's sunny most of the time

similarly, raining or not raining

could be up to a bit of information

if each of these probabilities is 1/2

again we're in the high desert here

it's normally not raining

so you have far less than a bit of information

on the question whether it's raining or not raining
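the claim that a two-outcome event carries at most one bit, reached when each probability is 1/2, and much less when one outcome dominates, can be checked numerically (0.9 is a made-up "mostly sunny" probability, not a real Santa Fe figure):

```python
import math

def binary_information(p):
    """Information in a two-outcome event with probabilities p and 1 - p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(binary_information(0.5))   # fair split: exactly 1 bit
print(binary_information(0.9))   # mostly sunny: about 0.47 bits
```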

so, we have joint information

constructed out of joint probabilities

marginal information, or information on the original variables on their own,

constructed
out of marginal probabilities

and let me end this little section by defining

a very useful quantity which is called the
mutual information

the mutual information, which is defined to be

I of (X:Y) ...I normally define it with this little colon

right in the middle, because it looks nice
and symmetrical

and we'll see that this quantity is indeed symmetrical

it's the information in X plus the information in Y

minus the information in X and Y taken together

it's possible to show that this is always greater than or equal to zero

and this mutual information can be thought of as the amount of information

the variable X has about Y

if X and Y are completely uncorrelated, so it's completely
uncorrelated whether it's sunny

or not sunny or raining or not raining

then this will be zero

however, in the case of sunny and not sunny

raining and not raining, they are very correlated

in the sense that once you know that it's sunny

it's probably not raining, even though

sometimes that does happen here in Santa Fe

and so in that case, you'd expect

to find a large amount of mutual information

in most places in fact, you'll find that knowing

whether it's sunny or not sunny

gives you a very good prediction

about whether it's raining or it's not raining

mutual information measures the amount of information
that X can tell us about Y

it's symmetric, so it tells us the amount of information that
Y can tell us about X

and another way of thinking about it

is that it's the amount of information

that X and Y hold in common

which is why it's called "mutual information"