WEBVTT
00:00:09.629 --> 00:00:12.554
so we've been talking about information
00:00:12.608 --> 00:00:15.780
how you measure information
00:00:19.332 --> 00:00:20.936
and of course, you measure information in bits
00:00:21.749 --> 00:00:26.872
how you can use information to label things
00:00:27.345 --> 00:00:30.712
for instance, with bar codes
00:00:34.798 --> 00:00:37.028
so you can use information
00:00:37.063 --> 00:00:38.065
to label things
00:00:38.065 --> 00:00:40.148
and then we talked about
00:00:40.478 --> 00:00:45.490
probability and information
00:00:45.490 --> 00:00:47.249
and if I have probabilities for events p_i
00:00:47.249 --> 00:00:51.577
then it says that the amount of information
00:00:51.745 --> 00:00:53.530
that's associated with this event occurring is
00:00:54.006 --> 00:00:56.158
minus the sum over i
00:00:56.426 --> 00:00:58.259
p_i log to the base 2
00:00:58.332 --> 00:01:00.834
of p_i
00:01:01.161 --> 00:01:02.047
this beautiful formula that was developed
00:01:02.507 --> 00:01:04.031
by Maxwell, Boltzmann, and Gibbs
00:01:04.726 --> 00:01:07.707
back in the middle of the nineteenth century
00:01:11.235 --> 00:01:12.106
to talk about the amount of entropy
00:01:14.759 --> 00:01:15.026
in atoms or molecules
00:01:16.800 --> 00:01:17.050
this is often called S
00:01:19.611 --> 00:01:20.534
for entropy, as well
00:01:21.595 --> 00:01:22.760
and then was rediscovered
00:01:22.822 --> 00:01:23.666
by Claude Shannon
00:01:23.735 --> 00:01:24.624
in the 1940s
00:01:24.766 --> 00:01:26.587
to talk about information theory in the abstract
00:01:26.660 --> 00:01:27.815
and the mathematical theory of communication
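NOTE A minimal Python sketch of the formula quoted above, minus the sum over i of p_i log to the base 2 of p_i. The function name entropy_bits and the example distributions are assumptions made for illustration; they are not from the lecture.

```python
import math

def entropy_bits(probs):
    """Return -sum(p * log2(p)) over the given probabilities, in bits."""
    # Terms with p == 0 are skipped, since p * log2(p) tends to 0 as p goes to 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two equally likely outcomes, then four equally likely outcomes.
fair_coin = entropy_bits([0.5, 0.5])
four_equal = entropy_bits([0.25, 0.25, 0.25, 0.25])
print(fair_coin, four_equal)
```

Two equally likely outcomes come out at one bit and four at two bits, which matches the idea of counting the bits needed to label the outcomes.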
00:01:28.048 --> 00:01:33.006
in fact, there is a funny story
00:01:33.101 --> 00:01:34.764
about this that
00:01:35.677 --> 00:01:38.377
Shannon, when he came up with this formula
00:01:38.390 --> 00:01:41.143
minus sum over i p_i log to the base 2 of p_i
00:01:41.231 --> 00:01:43.232
he went to John von Neumann
00:01:43.423 --> 00:01:45.007
the famous mathematician
00:01:45.188 --> 00:01:48.283
and he said, "what should I call this quantity?"
00:01:48.378 --> 00:01:49.821
and von Neumann says
00:01:49.993 --> 00:01:53.641
"you should call it H, because that's what Boltzmann called it "
00:01:53.906 --> 00:01:55.447
but von Neumann, who had a famous memory
00:01:57.527 --> 00:01:57.777
apparently forgot
00:01:57.777 --> 00:01:59.854
that Boltzmann's H
00:02:00.045 --> 00:02:02.080
of his famous "H theorem"
00:02:02.353 --> 00:02:03.976
was the same thing but without the minus sign
00:02:04.028 --> 00:02:05.212
so it's a negative quantity
00:02:05.912 --> 00:02:06.371
and it gets more and more negative
00:02:06.549 --> 00:02:07.553
as opposed to entropy
00:02:07.576 --> 00:02:08.752
which is a positive quantity
00:02:08.895 --> 00:02:10.552
and gets more and more positive
00:02:10.678 --> 00:02:13.173
so, actually these fundamental formulas
00:02:13.246 --> 00:02:14.782
about information theory
00:02:14.838 --> 00:02:17.366
go back to the mid 19th century
00:02:20.236 --> 00:02:20.841
a hundred and fifty years
00:02:21.298 --> 00:02:22.627
so now we'd like to apply them
00:02:22.937 --> 00:02:25.071
to ideas about communication
00:02:25.318 --> 00:02:26.384
and to do that, I'd like to tell you
00:02:26.577 --> 00:02:27.575
a little bit more about
00:02:27.600 --> 00:02:28.548
probability
00:02:32.186 --> 00:02:33.963
so, we talked about
00:02:34.432 --> 00:02:34.967
probabilities for events
00:02:35.447 --> 00:02:35.822
probability x
00:02:37.707 --> 00:02:39.140
you know x equals
00:02:42.846 --> 00:02:43.096
"it's sunny"
00:02:43.096 --> 00:02:44.610
probability of y
00:02:45.662 --> 00:02:49.085
y is "it's raining"
00:02:52.794 --> 00:02:53.067
and we could look at the probability
00:02:53.067 --> 00:02:54.694
of x
00:02:57.032 --> 00:02:57.282
and I'm gonna use the notation
00:02:57.282 --> 00:02:59.129
I introduced for Boolean logic before
00:03:02.553 --> 00:03:02.803
is the probability of this thing right here
00:03:04.835 --> 00:03:05.085
means "AND"
00:03:05.085 --> 00:03:07.739
"probability of X AND Y"
00:03:07.838 --> 00:03:09.480
or we can also just call
00:03:13.838 --> 00:03:14.269
this the probability of X Y simultaneously
00:03:14.378 --> 00:03:16.996
keep on streamlining our notation
00:03:22.515 --> 00:03:23.549
this is the probability
00:03:23.704 --> 00:03:32.035
that it's raining... it's sunny and it's raining
00:03:36.238 --> 00:03:36.488
now, mostly in the world
00:03:36.488 --> 00:03:36.877
this is a pretty small probability
00:03:38.873 --> 00:03:39.123
but here in Santa Fe
00:03:39.767 --> 00:03:40.389
it happens all the time
00:03:41.982 --> 00:03:42.232
and as a result, you get rather beautiful
00:03:42.232 --> 00:03:43.919
rainbows...single, double, triple
00:03:44.068 --> 00:03:45.850
on a daily basis
00:03:45.933 --> 00:03:46.794
so, we have...
00:03:47.324 --> 00:03:49.636
this is what is called
00:03:49.665 --> 00:03:54.263
the joint probability
00:03:58.698 --> 00:04:01.862
the joint probability that it's sunny and it's raining
00:04:01.939 --> 00:04:05.842
the joint probability of X and Y
00:04:08.716 --> 00:04:12.795
and what do we expect of this
joint probability?
00:04:13.432 --> 00:04:16.560
so, we have the probability of X and Y
00:04:17.213 --> 00:04:22.186
and this tells you the probability
that it's sunny and it's raining
00:04:22.941 --> 00:04:25.000
we can also look at the probability
X AND NOT Y
00:04:34.664 --> 00:04:34.914
so X AND (NOT Y)
00:04:35.323 --> 00:04:36.184
again using our notation
00:04:39.564 --> 00:04:41.209
introduced to us by the famous husband
00:04:41.307 --> 00:04:41.882
of the daughter of the Surveyor General
of the British ___(?)
00:04:46.557 --> 00:04:46.867
George Boole, married to Mary Everest
00:04:46.867 --> 00:04:49.540
and we have a relationship which says
00:04:52.117 --> 00:04:52.515
that the probability of X on its own
00:04:56.203 --> 00:04:56.453
should be equal to the probability
of X AND Y plus the
00:04:56.453 --> 00:04:59.409
probability of X AND (NOT Y)
00:05:00.215 --> 00:05:02.449
and the probability of X on its own
00:05:03.082 --> 00:05:07.728
is called the "marginal probability"
00:05:10.860 --> 00:05:11.359
so, it's just the probability that
00:05:13.018 --> 00:05:13.268
it's sunny on its own
00:05:13.268 --> 00:05:15.306
so the probability that it's sunny on its own
00:05:15.574 --> 00:05:18.052
is the probability that it's sunny and it's raining
00:05:19.527 --> 00:05:19.942
plus the probability that it's sunny and it's not raining
00:05:22.199 --> 00:05:22.449
I think this makes some kind of sense
00:05:24.475 --> 00:05:24.725
why is it called the "marginal probability"?
00:05:25.383 --> 00:05:25.633
I have no idea
00:05:25.633 --> 00:05:28.432
so let's not even worry about it
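NOTE A small Python sketch of the relationship just stated: the probability of X on its own equals the probability of X AND Y plus the probability of X AND (NOT Y). The joint probabilities below are made-up numbers, and the helper name marginal_x is hypothetical.

```python
# Hypothetical joint probabilities for (sunny?, raining?); the numbers are made up.
joint = {
    ("sunny", "raining"): 0.05,
    ("sunny", "not raining"): 0.70,
    ("not sunny", "raining"): 0.15,
    ("not sunny", "not raining"): 0.10,
}

def marginal_x(joint, x):
    """P(X = x): add up the joint probability over every value of Y."""
    return sum(p for (xi, _yj), p in joint.items() if xi == x)

# P(sunny) = P(sunny AND raining) + P(sunny AND not raining)
p_sunny = marginal_x(joint, "sunny")
p_not_sunny = marginal_x(joint, "not sunny")
print(p_sunny, p_not_sunny)
```

With these assumed numbers the marginal probability of sunny is about 0.75, and the two marginals add up to one.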
00:05:28.786 --> 00:05:31.563
there's a very nice picture of probabilities
00:05:31.563 --> 00:05:38.981
in terms of set theory
00:05:38.981 --> 00:05:41.094
I don't know about you
00:05:41.094 --> 00:05:43.443
but I grew up in the age of "new math"
00:05:43.443 --> 00:05:44.805
where they tried to teach us
00:05:44.805 --> 00:05:46.580
about set theory
00:05:46.580 --> 00:05:48.024
and unions of sets
00:05:48.024 --> 00:05:50.452
and intersections of sets and things like that
00:05:50.452 --> 00:05:53.000
from starting at a very early age
00:05:53.000 --> 00:05:54.897
which means people of my generation
00:05:54.897 --> 00:05:57.575
are completely unable to do their tax returns
00:05:57.575 --> 00:06:00.073
but for me, dealing a lot with math
00:06:00.073 --> 00:06:02.258
it actually has been quite helpful
00:06:02.258 --> 00:06:04.456
for my career to learn about set theory at the age of 3 or 4
00:06:04.456 --> 00:06:06.326
or whatever it was
00:06:06.326 --> 00:06:09.255
so, we have a picture like this
00:06:09.255 --> 00:06:17.869
this is the space or the set of all events
00:06:17.869 --> 00:06:20.570
here is the set X
00:06:20.570 --> 00:06:22.312
which is the set of events X, where
it's sunny
00:06:22.312 --> 00:06:28.348
here is the set of events Y, which is
the set of events where it's raining
00:06:30.532 --> 00:06:32.432
this thing right here is called
00:06:32.489 --> 00:06:34.084
"X intersection Y"
00:06:34.995 --> 00:06:37.302
which is the set of events
00:06:37.302 --> 00:06:40.769
where it's both sunny and it's raining
00:06:40.769 --> 00:06:43.165
but in contrast, if I look at
00:06:43.165 --> 00:06:44.867
this right here
00:06:44.867 --> 00:06:47.733
this is "X union Y"
00:06:47.733 --> 00:06:49.090
which is the set of events
00:06:49.090 --> 00:06:51.657
where it's either sunny or raining
00:06:51.657 --> 00:06:52.806
and now you can kind of see
00:06:52.806 --> 00:06:59.159
where George Boole got his funny
"cap" and "cup" notation
00:06:59.159 --> 00:07:02.202
we can pair this with X AND Y
00:07:02.202 --> 00:07:04.818
X AND Y, from a logical standpoint
00:07:04.818 --> 00:07:10.088
is essentially the same as this union
of these sets
00:07:10.088 --> 00:07:15.043
and similarly, X intersection Y
is X OR Y --translator's note: professor Lloyd meant "union" when referring to OR and "intersection" when referring to AND http://www.onlinemathlearning.com/intersection-of-two-sets.html--
00:07:15.550 --> 00:07:20.035
so when I take the logical statement
corresponding to the set of events
00:07:20.035 --> 00:07:22.202
that I write it as X AND Y
00:07:22.202 --> 00:07:26.743
the set of events is the intersection
of it's sunny and it's raining
00:07:26.743 --> 00:07:32.613
X OR Y is the intersection of events
where it's sunny or it's raining
--translator's note: professor Lloyd meant "union"
when referring to OR, "intersection" refers to AND--
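NOTE A short Python sketch of the corrected correspondence: AND goes with intersection and OR goes with union. The sets of days are hypothetical and only serve to make the picture concrete.

```python
# Hypothetical days of the week, used only to illustrate the set picture.
sunny = {"Mon", "Tue", "Wed", "Thu"}
raining = {"Thu", "Fri"}

# X AND Y corresponds to the intersection of the two sets.
sunny_and_raining = sunny & raining
# X OR Y corresponds to the union of the two sets.
sunny_or_raining = sunny | raining

print(sorted(sunny_and_raining))
print(len(sunny_or_raining))
```

Here the intersection contains only the one day that is both sunny and raining, while the union contains every day that is sunny, raining, or both.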
00:07:33.124 --> 00:07:37.124
and you can have all kinds of you know
nice pictures
00:07:42.092 --> 00:07:46.784
here's Z where let's say it's snowy at the
same time it's sunny
00:07:46.784 --> 00:07:49.177
which is something that I've seen happen
here in Santa Fe
00:07:49.177 --> 00:07:50.069
this is not so strange in here
00:07:50.069 --> 00:07:54.813
where we have X intersection Y intersection Z
00:07:54.813 --> 00:07:58.900
which is not the empty set here in Santa Fe
00:07:58.900 --> 00:08:01.341
ok, so now let's actually look
00:08:01.341 --> 00:08:02.761
at the kinds of information that are
associated with this
00:08:02.761 --> 00:08:10.588
suppose that I have a set of possible
events, I'll call one set labeled by i
00:08:10.588 --> 00:08:16.182
the other set, labeled by j
00:08:16.182 --> 00:08:22.411
and now I can look at p of i and j
00:08:22.411 --> 00:08:25.981
so this is a case where the
first type of event
00:08:25.981 --> 00:08:29.517
is i and the second type of event is j
00:08:29.517 --> 00:08:31.726
and I can define
00:08:31.726 --> 00:08:33.586
you know, I'm gonna do this
slightly differently
00:08:33.586 --> 00:08:37.179
let's call this... we'll be slightly fancier
00:08:37.179 --> 00:08:41.546
we'll call these event x_i and event y_j
00:08:41.546 --> 00:08:43.782
so, i labels the different events of x
00:08:43.782 --> 00:08:46.105
and j labels the different events of y
00:08:46.105 --> 00:08:51.491
so, for instance x_i could be two events
either it's sunny or it's not sunny
00:08:51.491 --> 00:08:55.639
so i could be zero, and it would be
'it's not sunny'
00:08:55.639 --> 00:08:56.540
and 1 could be it's sunny
00:08:56.540 --> 00:08:58.352
and j could be it's either raining
00:08:58.352 --> 00:08:59.563
or it's not raining
00:08:59.563 --> 00:09:01.771
so there are two possible values of y
00:09:01.771 --> 00:09:04.044
I'm just trying to make my life easier
00:09:04.044 --> 00:09:09.612
so we have a joint probability
distribution x_i and y_j
00:09:09.612 --> 00:09:12.045
this is our joint probability, as before
00:09:12.045 --> 00:09:14.883
and now we have a joint information
00:09:14.883 --> 00:09:18.376
which we shall call I of X and Y
00:09:18.376 --> 00:09:21.023
this is the information
00:09:21.023 --> 00:09:21.273
that's inherent in the joint set of events
00:09:21.273 --> 00:09:24.233
X and Y
00:09:24.233 --> 00:09:26.386
in our case, it being sunny and not sunny,
raining and not raining
00:09:26.386 --> 00:09:29.836
and this just takes the same form as before
00:09:29.836 --> 00:09:32.274
we sum over all different possibilities
00:09:32.274 --> 00:09:42.296
sunny-raining, not sunny-raining,
sunny-not raining, not sunny-not raining
00:09:42.296 --> 00:09:45.986
this is why one shouldn't try to enumerate these things
00:09:45.986 --> 00:09:53.138
minus p of x_i y_j log to the base 2 of p of x_i y_j
00:09:53.138 --> 00:09:55.747
so this is the amount of information that's
00:09:55.747 --> 00:09:57.282
inherent with these two sets of events
together
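NOTE A Python sketch of the joint information described above, summing over all four sunny/raining combinations. The probabilities are made-up numbers and the function name joint_information is an assumption for illustration.

```python
import math

# Hypothetical joint probabilities p(x_i, y_j); the numbers are made up.
joint = {
    ("sunny", "raining"): 0.05,
    ("sunny", "not raining"): 0.70,
    ("not sunny", "raining"): 0.15,
    ("not sunny", "not raining"): 0.10,
}

def joint_information(joint):
    """I(X,Y) = -sum over (i, j) of p(x_i, y_j) * log2 p(x_i, y_j), in bits."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

# Four equally likely combinations, for comparison.
uniform = {(i, j): 0.25 for i in (0, 1) for j in (0, 1)}

info_joint = joint_information(joint)
info_uniform = joint_information(uniform)
print(info_joint, info_uniform)
```

Four equally likely combinations carry two bits; the lopsided table carries less, because some combinations are much more likely than others.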
00:09:57.282 --> 00:09:59.006
and of course, we still have this, if you like the
00:10:00.802 --> 00:10:03.887
marginal information, the information
of X on its own
00:10:10.813 --> 00:10:11.075
which is now just the sum over events x
on its own
00:10:13.468 --> 00:10:13.769
of the marginal distribution
00:10:13.769 --> 00:10:14.844
why it's called "marginal" I don't know
00:10:14.844 --> 00:10:17.196
it's just the probability for X on its own
00:10:22.078 --> 00:10:22.710
minus p of X_i log base two of p of X_i
00:10:22.710 --> 00:10:24.826
and similarly we can talk about
00:10:32.311 --> 00:10:33.431
I of Y is minus the sum over j
p of Y_j log to the base 2 of
00:10:35.829 --> 00:10:36.079
p of Y_j
00:10:38.369 --> 00:10:38.988
this is the amount of information
00:10:40.966 --> 00:10:41.216
inherent in whether it's sunny or not sunny
00:10:43.319 --> 00:10:43.630
it could be up to a bit of information
00:10:46.327 --> 00:10:47.131
if it's probability one half of being
sunny or not sunny
00:10:49.213 --> 00:10:49.678
then there's a bit of information; let me
tell you, in Santa Fe
00:10:51.676 --> 00:10:52.215
there's far less than a bit of information
00:10:52.215 --> 00:10:53.365
on whether it's sunny or not
00:10:53.365 --> 00:10:54.715
because it's sunny most of the time
00:10:54.715 --> 00:10:56.402
similarly, raining or not raining
00:10:56.530 --> 00:10:58.441
could be up to a bit of information
00:11:00.236 --> 00:11:00.486
if each of these probabilities is 1/2
00:11:02.861 --> 00:11:03.642
again we're in the high desert here
00:11:05.084 --> 00:11:05.471
it's normally not raining
00:11:05.471 --> 00:11:07.848
so, you've far less than a bit of information
00:11:09.831 --> 00:11:10.428
on the question whether it's raining or not raining
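NOTE A Python sketch of the point about "far less than a bit": a yes/no variable carries a full bit only when both answers have probability one half. The value 0.9 for "sunny" is a made-up figure, not a measured one, and binary_information is a hypothetical helper name.

```python
import math

def binary_information(p):
    """Information in bits of a yes/no variable whose 'yes' has probability p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

even_odds = binary_information(0.5)
mostly_sunny = binary_information(0.9)  # 0.9 is an assumed figure
print(even_odds, mostly_sunny)
```

Under this assumption the mostly sunny case carries a little under half a bit, compared with a full bit at even odds.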
00:11:12.768 --> 00:11:13.018
so, we have joint information
00:11:14.891 --> 00:11:16.822
constructed out of joint probabilities
00:11:19.138 --> 00:11:19.388
marginal information, or information on the original variables on their own,
00:11:21.606 --> 00:11:22.942
constructed
out of marginal probabilities
00:11:26.134 --> 00:11:26.481
and let me end this little section by defining
00:11:28.635 --> 00:11:32.481
a very useful quantity which is called the
mutual information
00:11:38.809 --> 00:11:42.431
the mutual information, which is defined to be
00:11:42.431 --> 00:11:46.651
I( X ...I normally define it with this little colon
00:11:46.651 --> 00:11:49.776
right in the middle, because it looks nice
and symmetrical
00:11:49.776 --> 00:11:54.964
and we'll see that this is symmetrical
00:11:54.964 --> 00:11:58.093
it's the information in X plus the information in Y
00:11:58.093 --> 00:12:01.211
minus the information in X and Y taken together
00:12:01.211 --> 00:12:03.961
it's possible to show that this is always greater than or equal to zero
00:12:03.961 --> 00:12:09.112
and this mutual information can be thought of as the amount of information
00:12:09.112 --> 00:12:11.981
the variable X has about Y
00:12:11.981 --> 00:12:16.546
if X and Y are completely uncorrelated, so it's completely
uncorrelated whether it's sunny
00:12:16.546 --> 00:12:18.546
or not sunny or raining or not raining
00:12:18.546 --> 00:12:22.412
then this will be zero
00:12:22.412 --> 00:12:24.719
however, in the case of sunny and not sunny
00:12:24.719 --> 00:12:28.966
raining and not raining, they are very correlated
00:12:28.966 --> 00:12:32.171
in the sense that once you know that it's sunny
00:12:32.171 --> 00:12:33.574
it's probably not raining, even though
00:12:33.574 --> 00:12:35.371
sometimes that does happen here in Santa Fe
00:12:35.371 --> 00:12:36.387
and so in that case, you'd expect
00:12:36.387 --> 00:12:37.664
to find a large amount of mutual information
00:12:37.664 --> 00:12:39.804
in most places in fact, you'll find that knowing
00:12:39.804 --> 00:12:41.148
whether it's sunny or not sunny
00:12:41.148 --> 00:12:42.357
gives you a very good prediction
00:12:42.357 --> 00:12:45.128
about whether it's raining or it's not raining
00:12:45.128 --> 00:12:49.149
mutual information measures the amount of information
that X can tell us about Y
00:12:49.149 --> 00:12:53.404
it's symmetric, so it tells us the amount of information that
Y can tell us about X
00:12:53.404 --> 00:12:55.910
and another way of thinking about it
00:12:55.910 --> 00:12:58.359
is that it's the amount of information
00:12:58.359 --> 00:13:00.042
that X and Y hold in common
00:13:00.042 --> 00:13:05.956
which is why it's called "mutual information"
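NOTE A Python sketch of the definition given above: the mutual information is the information in X plus the information in Y minus the information in X and Y taken together. The two joint distributions are hypothetical extremes chosen for illustration, and the function names are assumptions.

```python
import math

def information(probs):
    """-sum(p * log2(p)) in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X:Y) = I(X) + I(Y) - I(X,Y) for a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        # Marginals: sum the joint probability over the other variable.
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return (information(px.values()) + information(py.values())
            - information(joint.values()))

# Hypothetical extremes: X and Y unrelated, and X always the opposite of Y.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
opposite = {(1, 0): 0.5, (0, 1): 0.5}

mi_independent = mutual_information(independent)
mi_opposite = mutual_information(opposite)
print(mi_independent, mi_opposite)
```

When the two variables are unrelated the mutual information is zero; when knowing one fixes the other, it equals the full bit that each variable carries on its own. Swapping the roles of X and Y leaves the result unchanged, which is the symmetry mentioned above.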