WEBVTT

00:00:02.980 --> 00:00:09.400
Coins and dice provide a nice simple model
of how to calculate probabilities, but

00:00:09.990 --> 00:00:14.540
everyday life is a lot more complicated
and it's not taken up with gambling.

00:00:14.540 --> 00:00:17.447
At least, I hope your life is not taken up
with gambling.

00:00:18.400 --> 00:00:22.230
So in order to make probabilities more
applicable to everyday life,

00:00:22.230 --> 00:00:25.986
we need to look at slightly more
complicated methods.

00:00:26.820 --> 00:00:30.130
Now, because these methods 
are more complicated,

00:00:30.130 --> 00:00:34.230
this lecture is going to be 
an honors lecture: it's optional.

00:00:34.230 --> 00:00:35.930
It will not be on the quiz,

00:00:35.930 --> 00:00:37.745
so don't get worried about that.

00:00:38.490 --> 00:00:41.774
But it is still useful, and it's fascinating,

00:00:41.774 --> 00:00:44.432
and it'll help you avoid some mistakes

00:00:44.432 --> 00:00:47.564
that a lot of people make
and that create a lot of problems.

00:00:48.208 --> 00:00:52.620
And so I hope you'll stick with it and listen to this lecture.

00:00:52.620 --> 00:00:56.575
And there will be exercises 
to help you figure out

00:00:56.575 --> 00:00:58.502
whether you understand 
the material or not.

00:00:58.502 --> 00:01:02.653
But don't get too worried, because 
it's not going to be on the quiz.

00:01:05.461 --> 00:01:08.082
The real problem 
that we'll be facing in this lecture

00:01:08.742 --> 00:01:10.769
is the problem of tests.

00:01:10.769 --> 00:01:13.761
We use tests all the time: 
we use tests to figure out

00:01:13.761 --> 00:01:17.065
whether you have 
a certain medical condition.

00:01:17.065 --> 00:01:22.201
We use tests to predict the weather 
or to predict people's future behavior.

00:01:22.201 --> 00:01:25.087
We have certain indicators 
of how they're going to act,

00:01:25.745 --> 00:01:28.305
either commit a crime 
or not commit a crime,

00:01:28.305 --> 00:01:30.345
but also whether they're going to pass,

00:01:30.345 --> 00:01:32.405
do well in school or fail.

00:01:33.488 --> 00:01:37.660
We always use these tests 
when we don't know for certain,

00:01:38.230 --> 00:01:41.340
but we want some kind of evidence, 
or some kind of indicator.

00:01:41.900 --> 00:01:45.144
The problem is none of these tests
are perfect.

00:01:45.450 --> 00:01:48.495
They always contain errors 
of various sorts.

00:01:48.900 --> 00:01:51.911
And what we're going to have to do is to
see how to take

00:01:51.911 --> 00:01:57.957
those errors of different sorts 
and build them together into a method

00:01:57.957 --> 00:02:03.440
and then a formula for calculating 
how reliable the method is

00:02:03.440 --> 00:02:06.259
for detecting the thing that we want to detect.

00:02:07.260 --> 00:02:10.380
This problem is a lot like the problem 
we faced earlier

00:02:10.380 --> 00:02:14.977
when we were talking about applying 
generalizations to particular cases

00:02:14.977 --> 00:02:18.261
because here we're going to be applying 
probabilities to particular cases.

00:02:18.936 --> 00:02:21.995
So it'll seem familiar to you in certain parts,

00:02:21.995 --> 00:02:25.280
but you'll see that this case 
is a little trickier.

00:02:25.890 --> 00:02:28.163
The best examples occur in medicine.

00:02:28.510 --> 00:02:32.465
So just imagine that you go to your doctor
for a regular checkup.

00:02:32.820 --> 00:02:34.386
You don't have any special symptoms,

00:02:35.140 --> 00:02:37.580
but he decides to do 
a few screening tests.

00:02:38.920 --> 00:02:44.385
And unfortunately, and very worryingly, 
it turns out that you test positive

00:02:45.045 --> 00:02:51.249
on one test for a particular form of cancer,
a certain kind of medical condition.

00:02:52.470 --> 00:02:56.101
Well, what that means is that you might
have cancer.

00:02:56.870 --> 00:02:58.250
Might, great.

00:02:58.250 --> 00:03:00.376
You want to know whether you do have
cancer.

00:03:01.010 --> 00:03:04.090
But of course, finding out for sure
whether or not you have cancer

00:03:04.106 --> 00:03:06.290
is going to take further tests.

00:03:06.290 --> 00:03:10.670
And those tests might be expensive, 
they might be dangerous,

00:03:10.670 --> 00:03:13.058
they're going to be invasive 
in various ways.

00:03:13.650 --> 00:03:16.525
So you really want to know what's the
probability,

00:03:17.212 --> 00:03:20.506
given that you've tested positive 
on this one test,

00:03:21.140 --> 00:03:22.690
that you really have cancer.

00:03:23.930 --> 00:03:27.951
Now clearly that probability is going 
to depend on a number of facts

00:03:27.951 --> 00:03:31.243
about this type of cancer, 
about the type of test and so on.

00:03:31.520 --> 00:03:33.517
And I am not a doctor.

00:03:33.940 --> 00:03:36.266
I am not giving you medical advice.

00:03:36.590 --> 00:03:41.210
If you test positive on a test, 
go talk to your doctor,

00:03:41.210 --> 00:03:44.040
don't trust me, because I'm just 
making up numbers here.

00:03:44.350 --> 00:03:48.270
But let's do make up a few numbers 
and figure out

00:03:48.270 --> 00:03:53.290
what the likelihood is of having cancer,
given that you tested positive.

00:03:53.290 --> 00:03:58.890
So let's imagine that the base rate 
of this particular type of cancer

00:03:58.890 --> 00:04:06.244
in the population is 0.3%, that is, 
3 out of 1,000, or 0.003.

00:04:06.244 --> 00:04:08.430
And they say that's the base rate,

00:04:08.430 --> 00:04:12.576
or it's sometimes called the prevalence
of the condition in the population.

00:04:12.970 --> 00:04:16.820
That's simply to say that out of 1,000
people chosen randomly

00:04:16.820 --> 00:04:19.949
in the population, you'd get about 3 
that have this condition.

00:04:21.640 --> 00:04:24.880
It's just a percentage 
of the general population.

00:04:25.910 --> 00:04:28.406
So that's the condition, what about the
test?

00:04:28.790 --> 00:04:32.273
Well the first thing we want to know 
is the sensitivity of the test.

00:04:32.930 --> 00:04:37.460
The sensitivity of the test we're going to
assume is 0.99.

00:04:38.620 --> 00:04:46.040
And what that means is that out of 
100 people who have this condition,

00:04:46.500 --> 00:04:49.010
99 of them will test positive.

00:04:49.010 --> 00:04:53.570
So this test is pretty good at figuring
out,

00:04:53.570 --> 00:04:57.400
from among the people 
who have the condition, which ones do.

00:04:57.400 --> 00:05:03.018
99 of those 100 people who have the
condition will test positive.

00:05:03.210 --> 00:05:08.005
The other feature is specificity, and what
that means is

00:05:08.005 --> 00:05:13.413
the percentage of the people who don't
have the condition who will test negative.

00:05:14.390 --> 00:05:17.500
The point here is you're not going 
to get a positive result

00:05:17.500 --> 00:05:20.150
for people who don't have the condition,
right?

00:05:20.180 --> 00:05:23.665
Because you want it to be specific 
to this particular condition

00:05:23.665 --> 00:05:27.600
and not get a bunch of positives for 
people who have other types of conditions

00:05:27.600 --> 00:05:29.485
or no medical condition at all.

00:05:30.240 --> 00:05:32.442
So the specificity we're going to assume,

00:05:32.442 --> 00:05:37.106
in this particular case we're talking about, is also 99%.

00:05:39.234 --> 00:05:46.490
Now, what we want to know is the probability
that you have the cancer, the condition,

00:05:47.120 --> 00:05:50.630
given that you tested positive on the test;

00:05:50.630 --> 00:05:55.462
but notice that the sensitivity 
tells you the probability

00:05:55.462 --> 00:05:59.039
that you will test positive 
given that you have the condition.

00:05:59.339 --> 00:06:01.758
We want to know the opposite of that,

00:06:01.758 --> 00:06:04.737
the probability 
that you have the condition

00:06:04.737 --> 00:06:07.318
given that you tested positive.

00:06:08.180 --> 00:06:10.840
And that's what we have to do 
a little calculation to figure out.

00:06:10.840 --> 00:06:15.320
But before we do that calculation, 
I want you to think about these figures

00:06:15.320 --> 00:06:18.310
that I've given you:
the prevalence in the population,

00:06:18.310 --> 00:06:22.060
the sensitivity of the test,
the specificity of the test,

00:06:22.060 --> 00:06:23.487
and just make a guess.

00:06:23.900 --> 00:06:26.510
Just start out by writing down 
on a piece of paper

00:06:26.510 --> 00:06:32.430
what you think the probability is 
that you would have the cancer

00:06:32.430 --> 00:06:35.850
given that you tested positive 
on the test.

00:06:36.990 --> 00:06:40.230
Take a minute and think about it 
and write it down.

00:06:41.020 --> 00:06:45.080
But we don't want to just guess 
about medical conditions,

00:06:45.080 --> 00:06:48.340
about probabilities that really matter 
as much as this one does.

00:06:48.910 --> 00:06:53.099
Instead, we want to calculate what the
probability really is.

00:06:53.680 --> 00:06:58.690
So, let's go through it carefully and
show you how to use

00:06:58.690 --> 00:07:04.104
what I'll call the box method in order 
to calculate the real likelihood

00:07:04.104 --> 00:07:08.575
that you have the condition, given that 
you got a positive test result.

00:07:09.320 --> 00:07:15.550
What we need to do is to divide the
population into four different groups:

00:07:15.983 --> 00:07:19.859
the group that has the condition 
and tested positive,

00:07:19.859 --> 00:07:22.600
the group that has the condition 
and tested negative,

00:07:22.900 --> 00:07:25.650
the group that doesn't have the condition
and tested positive,

00:07:25.650 --> 00:07:28.660
and the group that doesn't have
the condition and tested negative.

00:07:29.490 --> 00:07:34.150
And this chart will show you a nice, 
simple way of organizing

00:07:34.150 --> 00:07:35.546
all of that information.

00:07:35.870 --> 00:07:43.590
Because this row, the top row, tells 
you all the people who tested positive.

00:07:44.479 --> 00:07:49.460
The bottom row tells you the people 
who tested negative.

00:07:50.034 --> 00:07:55.884
Then, the left column gives you the 
people who do have the medical condition,

00:07:55.884 --> 00:07:57.739
in this case, some kind of cancer.

00:07:58.520 --> 00:08:02.893
And the right column tells you the people
who do not have that condition.

00:08:03.690 --> 00:08:07.948
Now what we need to do is to start
filling it out with numbers.

00:08:08.940 --> 00:08:12.736
Now the first thing we need to specify is
the population.

00:08:13.240 --> 00:08:16.400
In this case we want to start with a big
enough population

00:08:16.400 --> 00:08:19.380
that we're not going to have a lot 
of fractions in the other boxes.

00:08:19.380 --> 00:08:22.639
So, let's just imagine that the population
is 100,000.

00:08:23.000 --> 00:08:25.490
Make it a million or 10 million,
it doesn't matter

00:08:25.490 --> 00:08:28.747
because we're going to be interested 
in the ratios with the different groups.

00:08:30.530 --> 00:08:33.500
We can use that 100,000 to fill out the
other boxes,

00:08:33.500 --> 00:08:36.380
if we know the prevalence, or the
base rate,

00:08:37.099 --> 00:08:40.245
because the base rate tells you what
percentage of that 100,000

00:08:40.245 --> 00:08:43.705
actually do have the condition and
don't have the condition.

00:08:44.700 --> 00:08:47.660
We imagined -- remember we're just 
making up numbers here --

00:08:47.660 --> 00:08:52.180
but we imagined that the prevalence 
of this condition is 0.3%.

00:08:52.180 --> 00:08:56.200
And that means out of 100,000 people,
there will be 300

00:08:56.200 --> 00:08:59.506
who do have the medical condition.

00:09:01.010 --> 00:09:04.100
Well, if there are 300 who have it and
there are 100,000 total,

00:09:04.100 --> 00:09:07.533
we can figure out how many don't have the
medical condition by just subtracting.

00:09:07.720 --> 00:09:11.876
Which means 99,700 
do not have the medical condition.

00:09:12.652 --> 00:09:13.590
Okay?

00:09:13.590 --> 00:09:17.820
Now, we've divided the population into our
two columns:

00:09:17.820 --> 00:09:20.724
the ones that do and the ones that don't
have the medical condition.

00:09:21.420 --> 00:09:25.730
The next step is to figure out how many
are going to test positive

00:09:25.730 --> 00:09:30.147
and how many are going to test negative
out of each of these groups.

00:09:30.630 --> 00:09:33.727
For that, we first need the sensitivity.

00:09:34.200 --> 00:09:38.500
The sensitivity tells us the percentage 
of the cases that have the condition

00:09:38.500 --> 00:09:40.070
who will test positive.

00:09:41.370 --> 00:09:45.260
So the people who have the condition are
the 300.

00:09:45.690 --> 00:09:50.147
The ones who test positive are going 
to go up in this area

00:09:50.879 --> 00:09:56.842
and we know from the sensitivity being 0.99 or 99%

00:09:57.112 --> 00:10:03.385
that the number in that area should be 99%
of 300, or 297.

00:10:04.861 --> 00:10:08.353
And of course, if that's the number 
that test positive,

00:10:08.693 --> 00:10:12.224
then the remainder 
are going to test negative

00:10:12.224 --> 00:10:13.874
and that means that we'll have three.

00:10:14.410 --> 00:10:18.680
Which shouldn't surprise you because if
99% of the cases that have it

00:10:18.680 --> 00:10:23.659
test positive, then 1% will test negative,
and 1% of 300 is 3.

00:10:24.330 --> 00:10:26.070
Good: so we got the first column done.

00:10:26.760 --> 00:10:31.188
Now, the next question is going to be the
specificity.

00:10:31.460 --> 00:10:37.297
We can use the specificity to figure out
what goes in that next column.

00:10:38.090 --> 00:10:43.620
If the specificity is 99% and we know

00:10:43.620 --> 00:10:50.801
that 99,700 people do not have the
condition out of our sample of 100,000,

00:10:51.756 --> 00:10:58.940
well, that means that 99% of 99,700 are
going to test negative

00:10:58.940 --> 00:11:02.938
because the specificity is the 
percentage of cases without the condition

00:11:02.938 --> 00:11:04.605
that test negative.

00:11:04.605 --> 00:11:10.865
And that means that we'll have 
98,703 among the people

00:11:10.865 --> 00:11:13.903
who do not have the condition 
who test negative.

00:11:14.823 --> 00:11:18.157
How many are going to test positive?
The rest of them.

00:11:18.467 --> 00:11:27.450
So 99,700 minus 98,703 
is going to be 997.

00:11:27.980 --> 00:11:35.100
And of course, that shouldn't be surprising 
again, because 1% of 99,700 is 997.

00:11:36.375 --> 00:11:38.612
We only got two boxes left to fill out.

00:11:39.190 --> 00:11:40.502
How do you fill out those?

00:11:41.020 --> 00:11:46.988
Well, this box in the upper right
is the total number of people

00:11:46.988 --> 00:11:50.551
in this population of 100,000 
who test positive.

00:11:51.238 --> 00:11:56.439
And so, we can get that by adding the ones
that do have the condition and test positive

00:11:56.439 --> 00:11:59.750
and the ones that don't have 
the condition and test positive.

00:11:59.750 --> 00:12:05.587
Just add them together, and you get 1,294.

00:12:06.270 --> 00:12:13.439
And you do the same on the next row, 
because that blank is the area

00:12:13.439 --> 00:12:16.194
that has all the people 
who test negative,

00:12:16.194 --> 00:12:20.174
and 3 people who have the condition 
test negative,

00:12:20.624 --> 00:12:25.326
98,703 people who do not have the
condition test negative,

00:12:25.626 --> 00:12:30.490
so the total is going to be 98,706.

00:12:30.496 --> 00:12:35.060
And we can check to make sure that
we got it right,

00:12:35.085 --> 00:12:43.576
by just adding them together:
1,294 plus 98,706 is equal to 100,000.

00:12:44.944 --> 00:12:46.526
Phew, we got it right.

00:12:46.526 --> 00:12:52.230
Okay, so now we've divided the population 
into those people who have the condition,

00:12:52.931 --> 00:12:54.774
those people who don't have the
condition,

00:12:55.069 --> 00:12:59.350
and we know how many of each 
of those groups test positive,

00:12:59.350 --> 00:13:03.188
and how many of each of those groups 
test negative.

00:13:04.000 --> 00:13:08.160
The real question is 
what's the probability

00:13:08.160 --> 00:13:12.268
that I have cancer or the medical 
condition, given that I tested positive?

00:13:12.490 --> 00:13:13.940
How do we figure that out?

00:13:14.374 --> 00:13:19.960
Well, the total number 
of positive tests was 1,294

00:13:20.817 --> 00:13:27.334
and the people who tested positive
who really had the condition was 297.

00:13:27.770 --> 00:13:34.130
So it looks like the probability of
actually having the condition,

00:13:34.748 --> 00:13:43.670
given that you tested positive, 
is 297 out of 1294 or 0.23.

00:13:44.090 --> 00:13:47.235
That's 23%, less than one in four.
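
NOTE
The box calculation just described can be sketched in Python. This is only an illustrative sketch, not part of the lecture: the function name fill_boxes and the dictionary keys are hypothetical, and the inputs are the lecture's made-up numbers. Exact fractions are assumed so that the counts come out as whole numbers.
```python
from fractions import Fraction
def fill_boxes(population, base_rate, sensitivity, specificity):
    # Columns: people who have the condition and people who do not.
    have = population * base_rate
    have_not = population - have
    # First column: sensitivity splits those who have the condition.
    true_pos = have * sensitivity
    false_neg = have - true_pos
    # Second column: specificity splits those who do not have it.
    true_neg = have_not * specificity
    false_pos = have_not - true_neg
    return {
        "true_pos": true_pos, "false_pos": false_pos, "total_pos": true_pos + false_pos,
        "false_neg": false_neg, "true_neg": true_neg, "total_neg": false_neg + true_neg,
        "have": have, "have_not": have_not, "population": population,
    }
boxes = fill_boxes(100_000, Fraction(3, 1000), Fraction(99, 100), Fraction(99, 100))
# Probability of the condition given a positive test: true positives over all positives.
posterior = boxes["true_pos"] / boxes["total_pos"]
for name, count in boxes.items():
    print(name, int(count))
print(round(float(posterior), 2))
```
With these inputs the positives row holds 297 and 997, for 1,294 in total, and the last line prints 0.23.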

00:13:47.788 --> 00:13:49.092
Is that what you guessed?

00:13:49.610 --> 00:13:54.915
Most people, including most doctors, when
they hear that the test is

00:13:54.915 --> 00:14:01.215
99% sensitive and 99% specific, will
guess a lot higher than one in four.

00:14:01.867 --> 00:14:03.095
>> Oh my gosh!

00:14:03.095 --> 00:14:06.435
I'm a doctor, and I never would have
thought that!

00:14:07.036 --> 00:14:07.920
>> Now, don't worry:

00:14:07.920 --> 00:14:11.017
she's not a physician.
She's a metaphysician.

00:14:12.390 --> 00:14:16.180
>> But in this case, the probability 
really is just one in four

00:14:16.180 --> 00:14:17.914
that you had that medical condition.

00:14:17.920 --> 00:14:19.124
Now how did that happen?

00:14:19.630 --> 00:14:23.260
The reason was that the prevalence or the
base rate was so low

00:14:23.840 --> 00:14:27.589
that even a small rate 
of false positives,

00:14:28.211 --> 00:14:33.095
given the massive numbers of people who
don't have the condition,

00:14:33.642 --> 00:14:37.152
will mean that there are more false positives,
more than 3 times as many,

00:14:37.642 --> 00:14:39.332
as there are true positives.

00:14:39.642 --> 00:14:42.744
And that's why the probability 
is just one in four,

00:14:42.744 --> 00:14:44.637
actually a little less than one in four,

00:14:44.790 --> 00:14:48.600
that you have the medical condition even
when you tested positive.
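
NOTE
To see how strongly the base rate drives this result, here is a small Python sketch. The function name posterior is hypothetical, and every base rate other than 0.003 is an assumed value chosen only for comparison; the 99% sensitivity and specificity are the lecture's made-up figures.
```python
def posterior(base_rate, sensitivity, specificity):
    # True positives and false positives as shares of the whole population.
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    # Share of all positive results that are true positives.
    return true_pos / (true_pos + false_pos)
# Only 0.003 comes from the lecture; the other base rates are hypothetical.
for base_rate in (0.003, 0.01, 0.1, 0.5):
    p = posterior(base_rate, 0.99, 0.99)
    print(f"base rate {base_rate}: probability given a positive test {p:.3f}")
```
The test itself is the same in every row; only the prevalence changes, and the probability of having the condition after a positive result rises with it.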

00:14:49.150 --> 00:14:53.330
I want to add a quick caveat here, in
order to avoid misinterpretation.

00:14:53.780 --> 00:14:58.628
The point here is that, if you
have a screening test for a condition

00:14:58.630 --> 00:15:03.600
with a very low base rate or prevalence,
and you don't have any symptoms

00:15:03.600 --> 00:15:09.780
that put you in a special category, 
then, you need to get another test

00:15:10.411 --> 00:15:14.871
before you jump to any conclusions
about having the medical condition.

00:15:15.890 --> 00:15:19.890
Because, if you have that other test, 
then the fact that you tested positive

00:15:19.890 --> 00:15:22.540
on the first test puts you in a smaller class,

00:15:22.540 --> 00:15:24.995
with a much higher base rate, or prevalence.

00:15:25.000 --> 00:15:27.872
And now, the probability's going to go up.
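
NOTE
A rough Python sketch of why a second test helps: the result of the first test becomes the base rate for the second. It rests on assumptions the lecture does not state, namely that the second test is independent of the first and has the same 99% sensitivity and specificity. The function name update is hypothetical.
```python
from fractions import Fraction
def update(prior, sensitivity, specificity):
    # Probability of the condition after one positive result, starting from `prior`.
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)
sens = spec = Fraction(99, 100)
after_first = update(Fraction(3, 1000), sens, spec)
# Assumption: the second test is independent of the first and equally accurate.
after_second = update(after_first, sens, spec)
print(round(float(after_first), 3), round(float(after_second), 3))
```
Under those assumptions the probability moves from a little under one in four after one positive result to roughly 97% after two.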

00:15:28.890 --> 00:15:32.450
Most doctors know that, and that's why,
after the first test,

00:15:32.450 --> 00:15:35.340
they don't jump to conclusions, and they
order another test,

00:15:35.340 --> 00:15:39.380
but many patients don't realize that and
they get extremely worried

00:15:39.380 --> 00:15:42.256
after a single test even when they don't
have any symptoms.

00:15:43.500 --> 00:15:45.910
So that's the mistake 
that we're trying to avoid here

00:15:45.910 --> 00:15:53.167
and that's surprising, but it actually
applies to many different areas of life.

00:15:54.850 --> 00:15:59.262
It applies, for example, to medical tests
with all kinds of other diseases.

00:15:59.911 --> 00:16:04.420
Not just cancer or colon cancer, but
pretty much every disease

00:16:04.420 --> 00:16:06.327
where the prevalence is extremely low.

00:16:07.400 --> 00:16:09.960
It applies also to drug tests.

00:16:10.670 --> 00:16:12.520
If somebody gets a positive drug test,

00:16:12.520 --> 00:16:14.548
does that mean they really 
were using drugs?

00:16:14.980 --> 00:16:20.320
Well, if it's a population where the 
base rate or prevalence of drug use

00:16:20.320 --> 00:16:23.130
is quite low, then it might not.

00:16:24.440 --> 00:16:28.290
Of course, if you assume that the 
prevalence or base rate is quite high,

00:16:28.290 --> 00:16:30.439
then you're going to believe 
that drug test.

00:16:30.840 --> 00:16:35.100
But you need to know the facts about what 
the prevalence or base rate really is

00:16:35.100 --> 00:16:39.150
in order to calculate 
accurately the probability

00:16:39.150 --> 00:16:41.926
that this person really was using drugs.

00:16:43.400 --> 00:16:47.840
Same applies to evidence in legal trials:
take eyewitnesses for example,

00:16:47.840 --> 00:16:55.110
it's very tricky, someone's trying to use 
their eyes as a test for what they see.

00:16:55.110 --> 00:16:57.740
They might identify a friend, 
or they might just say

00:16:58.230 --> 00:17:02.308
that car that did the hit-and-run accident
was a Porsche.

00:17:03.420 --> 00:17:07.716
Well, how good are they at identifying
Porsches?

00:17:09.579 --> 00:17:12.932
If they get it right most of the time, 
but not always,

00:17:12.932 --> 00:17:17.550
and sometimes they don't get it right 
when it is a Porsche,

00:17:17.550 --> 00:17:22.030
then we've got the sensitivity and 
specificity of what they identify.

00:17:22.840 --> 00:17:26.014
And we can use that to calculate 
how likely it is

00:17:26.454 --> 00:17:30.350
that their evidence in the trial 
really is reliable or not.

00:17:31.090 --> 00:17:34.150
Another example is the prediction of
future behavior.

00:17:34.950 --> 00:17:37.121
We might have some kind of marker

00:17:37.740 --> 00:17:40.550
that a certain group of people 
with that marker

00:17:41.030 --> 00:17:43.602
have a certain likelihood of
committing crimes.

00:17:44.150 --> 00:17:49.000
But if crimes are very rare 
in that community and every other,

00:17:49.000 --> 00:17:55.120
then a test which has a pretty good 
sensitivity and specificity

00:17:55.120 --> 00:17:59.839
still might not be good enough when 
we're talking about something like crime

00:18:00.160 --> 00:18:04.800
that's actually very rare and has 
a very low prevalence or base rate

00:18:04.800 --> 00:18:06.230
in most communities.

00:18:06.670 --> 00:18:08.998
And the same applies 
to failing out of school.

00:18:10.700 --> 00:18:14.070
Are SAT scores or GRE scores 
going to be

00:18:14.070 --> 00:18:16.960
good predictors of 
who's going to fail out of school?

00:18:18.320 --> 00:18:21.450
Well, if very few people fail out of
school,

00:18:21.450 --> 00:18:24.556
so that the prevalence or base rate 
is very low,

00:18:24.556 --> 00:18:27.710
then, even if they're 
pretty sensitive and specific,

00:18:27.710 --> 00:18:29.429
they might not be good predictors.

00:18:29.940 --> 00:18:35.301
So this same type of problem arises 
in a lot of different areas.

00:18:35.840 --> 00:18:38.330
And I'm not going to go through 
more examples right now,

00:18:38.330 --> 00:18:41.870
but we'll have plenty of examples in the 
exercises at the end of this chapter.

00:18:43.690 --> 00:18:45.970
I want to end, though,
by saying a few things

00:18:45.970 --> 00:18:49.261
that are a bit more technical 
about this method.

00:18:49.780 --> 00:18:52.490
First, there's a lot of terminology to
learn,

00:18:53.043 --> 00:18:57.670
because when you read about using 
this method in other areas,

00:18:57.670 --> 00:19:01.211
for other types of topics, 
then you'll run into these terms,

00:19:01.211 --> 00:19:02.690
and it's a good idea to know them.

00:19:04.044 --> 00:19:13.860
So first, the cases where the person does 
have the condition and also tests positive

00:19:13.860 --> 00:19:17.060
are called hits, or true positives.

00:19:17.060 --> 00:19:18.736
Different people use different terms.

00:19:21.540 --> 00:19:27.625
The cases where the person tests positive,
but they don't have the condition,

00:19:27.625 --> 00:19:31.389
are called false positives 
or false alarms.

00:19:33.830 --> 00:19:40.525
The cases where a person really does have
the condition, but tests negative

00:19:41.120 --> 00:19:44.291
are called misses or false negatives.

00:19:47.360 --> 00:19:51.064
And the cases where the person 
does not have the condition

00:19:51.594 --> 00:19:54.829
and the test comes out negative 
are called true negatives,

00:19:54.829 --> 00:19:57.676
because they're negative and it's true
that they don't have the condition.

00:19:59.620 --> 00:20:03.385
If we put together the false negatives,
and the true negatives,

00:20:04.250 --> 00:20:06.431
we get the total set of negatives.

00:20:07.346 --> 00:20:11.672
And if we put together the true positives 
and the false positives

00:20:12.112 --> 00:20:14.680
we get the total set of positives.

00:20:16.410 --> 00:20:18.970
And of course, we have the general
population.

00:20:19.350 --> 00:20:23.325
Within that population, 
a percentage that have the condition

00:20:23.325 --> 00:20:25.527
and a percentage 
that don't have the condition.

00:20:27.016 --> 00:20:29.412
Now, what's the base rate?

00:20:29.752 --> 00:20:35.250
The base rate in this population is simply
the set that have the condition,

00:20:35.510 --> 00:20:41.570
divided by the total population,
which is Box 7 divided by Box 9.

00:20:41.940 --> 00:20:44.815
If we use e for the evidence

00:20:45.090 --> 00:20:50.190
and h for the hypothesis being true that
the condition really does exist,

00:20:50.190 --> 00:20:52.848
then that's the probability of h,

00:20:54.385 --> 00:21:02.066
and the sensitivity is going to be 
the total number of true positives

00:21:02.066 --> 00:21:06.232
divided by the total number of people 
with the condition,

00:21:06.232 --> 00:21:11.727
because it's the percentage of the people who
have the condition who test positive.

00:21:12.760 --> 00:21:16.100
OK? So that's the probability of e given h,

00:21:16.100 --> 00:21:20.480
and it's Box 1 divided by Box 7.

00:21:21.040 --> 00:21:26.870
The specificity, in contrast, is the ratio
of the true negatives

00:21:26.870 --> 00:21:31.710
to the total number of people 
who do not have the condition, that is,

00:21:31.710 --> 00:21:35.090
the probability of not e, that is,

00:21:35.090 --> 00:21:38.951
not having the evidence 
of a positive test result,

00:21:38.951 --> 00:21:43.180
given not h, 
given that you're in the second column,

00:21:43.180 --> 00:21:47.170
where the hypothesis is false, 
because you don't have the condition.

00:21:47.170 --> 00:21:52.057
So that's Box 5 divided by Box 8.

00:21:53.730 --> 00:21:55.807
That's the specificity.

00:21:56.000 --> 00:22:00.352
So we can define all of these 
in terms of each other.

00:22:00.760 --> 00:22:06.815
The hits divided by the total with that
condition is going to be the sensitivity.

00:22:06.820 --> 00:22:11.130
And you can use this terminology to guide
your way through this box.

00:22:11.140 --> 00:22:15.048
And the big question is again going to be
what's the solution?

00:22:15.048 --> 00:22:21.660
What's the probability of the hypothesis
(having the condition), given the evidence,

00:22:21.660 --> 00:22:27.707
that is, a positive test result: 
that's going to be Box 1 divided by Box 3.

00:22:28.540 --> 00:22:31.887
And as we saw in the case that we just
went through,

00:22:31.887 --> 00:22:37.070
that gives you the probability of having
the medical condition, or colon cancer,

00:22:37.070 --> 00:22:39.232
given a positive test result.

00:22:39.480 --> 00:22:44.130
That's called the posterior probability,
or in symbols,

00:22:44.130 --> 00:22:47.550
the probability of the hypothesis, 
given the evidence.
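
NOTE
The terminology can be tied back to the numbered boxes with a short Python sketch. The lecture names Boxes 1, 3, 5, 7, 8 and 9; the numbering of Boxes 2, 4 and 6 is an assumption here, made by reading the chart row by row. The counts are the ones from the worked example.
```python
from fractions import Fraction
# Box numbers follow the lecture's chart; 2, 4 and 6 are assumed positions.
box = {
    1: 297,     # hits (true positives)
    2: 997,     # false positives (false alarms)
    3: 1294,    # all positives
    4: 3,       # misses (false negatives)
    5: 98703,   # true negatives
    6: 98706,   # all negatives
    7: 300,     # have the condition
    8: 99700,   # do not have the condition
    9: 100000,  # total population
}
base_rate = Fraction(box[7], box[9])    # P(h)
sensitivity = Fraction(box[1], box[7])  # P(e | h)
specificity = Fraction(box[5], box[8])  # P(not e | not h)
posterior = Fraction(box[1], box[3])    # P(h | e)
print(base_rate, sensitivity, specificity, posterior)
```
Each quantity is just one box divided by another, which is why the chart is enough to answer the question without a formula.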

00:22:48.210 --> 00:22:53.170
So I hope this terminology helps you
understand some of the discussions of this,

00:22:53.170 --> 00:22:55.856
if you go on and read about it 
in the literature.

00:22:56.380 --> 00:23:01.206
This procedure that we've been discussing 
is actually just an application

00:23:01.730 --> 00:23:06.570
of a famous theorem called Bayes' Theorem
after Thomas Bayes,

00:23:06.570 --> 00:23:11.270
an 18th-century English clergyman, 
who was also a mathematician

00:23:11.270 --> 00:23:15.686
and proved this extremely important 
theorem in probability theory.

00:23:16.700 --> 00:23:22.520
Now some of you out there will use the
boxes, and it'll make sense to you.

00:23:22.520 --> 00:23:26.290
But some Courserians, I assume, 
are mathematicians,

00:23:26.290 --> 00:23:28.344
and they want to see 
the mathematics behind it.

00:23:28.800 --> 00:23:32.850
So now, I want to show you how to derive
Bayes' theorem

00:23:32.850 --> 00:23:36.678
from the rules of probability 
that we learned in earlier lectures.

00:23:37.160 --> 00:23:40.345
So for all you math nerds out there, 
here goes.

00:23:41.230 --> 00:23:43.552
You start with rule 2G,

00:23:45.180 --> 00:23:50.859
apply it to the probability that the
evidence and the hypothesis are both true.

00:23:51.330 --> 00:23:56.500
And by the rule, that probability is 
equal to the probability of the evidence,

00:23:56.500 --> 00:24:00.990
times the probability of the hypothesis,
given the evidence.

00:24:02.320 --> 00:24:04.510
You have to have 
that conditional probability

00:24:04.510 --> 00:24:07.380
because they're not independent.

00:24:08.800 --> 00:24:14.310
Then you simply divide both sides of that 
by the probability of the evidence:

00:24:14.310 --> 00:24:15.790
a little simple algebra.

00:24:15.790 --> 00:24:20.221
And you end up with the probability 
of the hypothesis, given the evidence,

00:24:20.221 --> 00:24:24.790
is equal to the probability 
of the evidence and the hypothesis,

00:24:24.790 --> 00:24:28.100
divided by the probability 
of the evidence.

00:24:30.830 --> 00:24:34.528
Now we can do a little trick.
This was ingenious.

00:24:35.300 --> 00:24:39.165
Substitute for e, something 
that's logically equivalent to e,

00:24:39.165 --> 00:24:45.460
namely, the evidence AND the hypothesis
or the evidence AND NOT the hypothesis.

00:24:45.880 --> 00:24:48.324
Now if you think about it, you'll see
that those are equivalent,

00:24:48.324 --> 00:24:51.260
because either the hypothesis 
has to be true

00:24:51.260 --> 00:24:54.274
or NOT the hypothesis is true.

00:24:54.274 --> 00:24:55.920
One or the other has to be true.

00:24:56.560 --> 00:25:00.033
And that means that the evidence 
AND the hypothesis

00:25:00.033 --> 00:25:04.774
or the evidence AND NOT the hypothesis 
is going to be equivalent to e.

00:25:05.020 --> 00:25:07.960
So this is equivalent to this.

00:25:08.440 --> 00:25:11.360
And because they're equivalent,
we can substitute them

00:25:11.360 --> 00:25:15.376
within the formula for probability 
without affecting the truth values.

00:25:15.720 --> 00:25:23.400
So we just substitute this formula in 
here for the e up there.

00:25:23.840 --> 00:25:27.910
And we end up with the probability of the
hypothesis, given the evidence,

00:25:27.910 --> 00:25:32.240
is equal to the probability of the
evidence AND the hypothesis, divided by

00:25:32.240 --> 00:25:35.220
the probability of the evidence 
AND the hypothesis

00:25:35.220 --> 00:25:37.513
or the evidence AND NOT the hypothesis.

00:25:37.950 --> 00:25:41.426
Now, that's not supposed to make much
sense, but it helps with the derivation.

00:25:43.283 --> 00:25:47.600
The next step is to apply rule 3, because
we have a disjunction.

00:25:47.600 --> 00:25:51.082
And notice the disjuncts are mutually
exclusive.

00:25:51.540 --> 00:25:56.230
It cannot be true, both, that the evidence
AND the hypothesis is true,

00:25:56.230 --> 00:26:00.197
and also that the evidence 
AND NOT the hypothesis is true,

00:26:00.197 --> 00:26:04.247
because it can't be both h and not h.

00:26:05.160 --> 00:26:08.073
So we can apply the simple version 
of rule 3.

00:26:08.580 --> 00:26:14.316
And that means that the probability of
(e&h) or (e&~h)

00:26:14.316 --> 00:26:21.074
is equal to the probability of (e&h)
+ the probability of (e&~h).

00:26:21.510 --> 00:26:23.920
We're just applying 
that rule 3 for disjunction

00:26:23.920 --> 00:26:26.260
that we learned a few lectures ago.

00:26:27.150 --> 00:26:29.575
Now we apply rule 2G again,

00:26:29.575 --> 00:26:35.380
because we have the probability 
of a conjunction up in the top.

00:26:36.810 --> 00:26:41.695
And, since these are not independent of
each other

00:26:41.880 --> 00:26:44.630
-- we hope not, if it's a hypothesis 
and the evidence for it --

00:26:45.345 --> 00:26:48.315
then we have to use 
the conditional probability.

00:26:48.830 --> 00:26:53.010
And using rule 2G, we find that 
the probability of the hypothesis,

00:26:53.010 --> 00:26:55.110
given the evidence, is equal to

00:26:55.110 --> 00:26:59.870
the probability of the hypothesis, times
the probability of the evidence,

00:26:59.870 --> 00:27:05.130
given the hypothesis, divided by 
the probability of the hypothesis,

00:27:05.130 --> 00:27:08.586
times the probability of the evidence, 
given the hypothesis,

00:27:08.831 --> 00:27:12.648
plus the probability 
of the hypothesis being false,

00:27:12.648 --> 00:27:16.851
that is the probability of NOT h, 
times the probability of the evidence,

00:27:16.851 --> 00:27:21.130
given NOT h, or the hypothesis being false.
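
NOTE
The spoken derivation can be written out step by step. This is a sketch in LaTeX (assuming the amsmath package), using e for the evidence, h for the hypothesis, and \neg h for NOT h; the rule names are the ones used in the lecture.
```latex
\begin{align*}
P(e \,\&\, h) &= P(e)\,P(h \mid e) && \text{(rule 2G)} \\
P(h \mid e) &= \frac{P(e \,\&\, h)}{P(e)} && \text{(divide both sides by } P(e)\text{)} \\
&= \frac{P(e \,\&\, h)}{P\big((e \,\&\, h) \lor (e \,\&\, \neg h)\big)} && \text{(substitute an equivalent of } e\text{)} \\
&= \frac{P(e \,\&\, h)}{P(e \,\&\, h) + P(e \,\&\, \neg h)} && \text{(rule 3, exclusive disjuncts)} \\
&= \frac{P(h)\,P(e \mid h)}{P(h)\,P(e \mid h) + P(\neg h)\,P(e \mid \neg h)} && \text{(rule 2G, three times)}
\end{align*}
```
With the lecture's made-up numbers, P(h) = 0.003, P(e | h) = 0.99, and P(e | NOT h) = 1 minus the specificity = 0.01, so the formula gives 0.00297 / (0.00297 + 0.00997) = 0.00297 / 0.01294, or about 0.23, the same answer as the boxes.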

00:27:22.010 --> 00:27:23.170
And that's a mouthful

00:27:23.170 --> 00:27:27.070
and it's a long formula, 
but that's the mathematical formula

00:27:27.070 --> 00:27:33.270
that Bayes proved in the 18th century
and it provides the mathematical basis

00:27:33.500 --> 00:27:36.280
for that whole system of boxes 
that we talked about before.

00:27:37.470 --> 00:27:42.780
But if you don't like the mathematical 
proof and that's too confusing for you,

00:27:42.780 --> 00:27:44.090
then use the boxes.

00:27:44.400 --> 00:27:47.396
And if you don't like the boxes, 
use the mathematical proof.

00:27:47.860 --> 00:27:50.706
They're both going to work:
just pick the one that works for you.

00:27:50.990 --> 00:27:53.220
In fact, you don't have to pick 
either of them,

00:27:53.220 --> 00:27:57.120
because remember, this is an honors
lecture, it's optional,

00:27:57.988 --> 00:27:59.999
and it won't be on the quiz.

00:28:00.590 --> 00:28:04.270
But if you do want to try this method, 
and make sure that you understand it,

00:28:05.075 --> 00:28:08.437
we'll have a bunch of exercises for you, 
where you can test your skills.