-
Coins and dice provide a nice simple model
of how to calculate probabilities, but
-
everyday life is a lot more complicated
and it's not taken up with gambling.
-
At least, I hope your life is not taken up
with gambling.
-
So in order to make probabilities more
applicable to everyday life,
-
we need to look at slightly more
complicated methods.
-
Now, because these methods
are more complicated,
-
this lecture is going to be
an honors lecture: it's optional.
-
It will not be on the quiz,
-
so don't get worried about that.
-
But it is still useful, and it's fascinating,
-
and it'll help you avoid some mistakes
-
that a lot of people make
and that create a lot of problems.
-
And so I hope you'll stick with it and listen to this lecture.
-
And there will be exercises
to help you figure out
-
whether you understand
the material or not.
-
But don't get too worried, because
it's not going to be on the quiz.
-
The real problem
that we'll be facing in this lecture
-
is the problem of tests.
-
We use tests all the time:
we use tests to figure out
-
whether you have
a certain medical condition.
-
We use tests to predict the weather
or to predict people's future behavior.
-
We have certain indicators
of how they're going to act,
-
either commit a crime
or not commit a crime,
-
but also whether they're going to pass,
-
do well in school or fail.
-
We always use these tests
when we don't know for certain,
-
but we want some kind of evidence,
or some kind of indicator.
-
The problem is none of these tests
are perfect.
-
They always contain errors
of various sorts.
-
And what we're going to have to do is to
see how to take
-
those errors of different sorts
and build them together into a method
-
and then a formula for calculating
how reliable the method is
-
for detecting the thing that we want to detect.
-
This problem is a lot like the problem
we faced earlier
-
when we were talking about applying
generalizations to particular cases
-
because here we're going to be applying
probabilities to particular cases.
-
So it'll seem familiar to you in certain parts,
-
but you'll see that this case
is a little trickier.
-
The best examples occur in medicine.
-
So just imagine that you go to your doctor
for a regular checkup.
-
You don't have any special symptoms,
-
but he decides to do
a few screening tests.
-
And unfortunately, and very worryingly,
it turns out that you test positive
-
on one test for a particular form of cancer,
a certain kind of medical condition.
-
Well, what that means is that you might
have cancer.
-
Might, great.
-
You want to know whether you do have
cancer.
-
But of course, finding out for sure
whether or not you have cancer
-
is going to take further tests.
-
And those tests might be expensive,
they might be dangerous,
-
they're going to be invasive
in various ways.
-
So you really want to know what's the
probability,
-
given that you've tested positive
on this one test,
-
that you really have cancer.
-
Now clearly that probability is going
to depend on a number of facts
-
about this type of cancer,
about the type of test and so on.
-
And I am not a doctor.
-
I am not giving you medical advice.
-
If you test positive on a test,
go talk to your doctor,
-
don't trust me, because I'm just
making up numbers here.
-
But let's do make up a few numbers
and figure out
-
what the likelihood is of having cancer,
given that you tested positive.
-
So let's imagine that the base rate
of this particular type of cancer
-
in the population is 0.3%, that is,
3 out of 1,000, or 0.003.
-
And they say that's the base rate,
-
or it's sometimes called the prevalence
of the condition in the population.
-
That's simply to say that out of 1,000
people chosen randomly
-
in the population, you'd get about 3
that have this condition.
-
It's just a percentage
of the general population.
-
So that's the condition, what about the
test?
-
Well the first thing we want to know
is the sensitivity of the test.
-
The sensitivity of the test we're going to
assume is 0.99.
-
And what that means is that out of
100 people who have this condition,
-
99 of them will test positive.
-
So this test is pretty good at figuring
out,
-
from among the people
who have the condition, which ones do.
-
99 of those 100 people who have the
condition will test positive.
-
The other feature is specificity, and what
that means is
-
the percentage of the people who don't
have the condition who will test negative.
-
The point here is you're not going
to get a positive result
-
for people who don't have the condition,
right?
-
Because you want it to be specific
to this particular condition
-
and not get a bunch of positives for
people who have other types of conditions
-
or no medical condition at all.
-
So the specificity we're going to assume,
-
in this particular case we're talking about, is also 99%.
-
Now, what we want to know is the probability
that you have cancer, the condition,
-
given that you tested positive on the test;
-
but notice that the sensitivity
tells you the probability
-
that you will test positive
given that you have the condition.
-
We want to know the opposite of that,
-
the probability
that you have the condition
-
given that you tested positive.
-
And that's what we have to do
a little calculation to figure out.
-
But before we do that calculation,
I want you to think about these figures
-
that I've given you:
the prevalence in the population,
-
the sensitivity of the test,
the specificity of the test,
-
and just make a guess.
-
Just start out by writing down
on a piece of paper
-
what you think the probability is
that you would have the cancer
-
given that you tested positive
on the test.
-
Take a minute and think about it
and write it down.
-
But we don't want to just guess
about medical conditions,
-
about probabilities that matter
as much as this one does.
-
Instead, we want to calculate what the
probability really is.
-
So, let's go through it carefully and
show you how to use
-
what I'll call the box method in order
to calculate the real likelihood
-
that you have the condition, given that
you got a positive test result.
-
What we need to do is to divide the
population into four different groups:
-
the group that has the condition
and tested positive,
-
the group that has the condition
and tested negative,
-
the group that doesn't have the condition
and tested positive,
-
and the group that doesn't have
the condition and tested negative.
-
And this chart will show you a nice,
simple way of organizing
-
all of that information.
-
Because this row, the top row, tells
you all the people who tested positive.
-
The bottom row tells you the people
who tested negative.
-
Then, the left column gives you the
people who do have the medical condition,
-
in this case, some kind of cancer.
-
And the right column tells you the people
who do not have that condition.
-
Now what we need to do is to start
filling it out with numbers.
-
Now the first thing we need to specify is
the population.
-
In this case we want to start with a big
enough population
-
that we're not going to have a lot
of fractions in the other boxes.
-
So, let's just imagine that the population
is 100,000.
-
Make it a million or 10 million,
it doesn't matter
-
because we're going to be interested
in the ratios with the different groups.
-
We can use that 100,000 to fill out the
other boxes,
-
if we know the prevalence, or the
base rate,
-
because the base rate tells you what
percentage of that 100,000
-
actually do have the condition and
don't have the condition.
-
We imagined -- remember we're just
making up numbers here --
-
but we imagined that the prevalence
of this condition is 0.3%.
-
And that means out of 100,000 people,
there will be 300
-
who do have the medical condition.
-
Well, if there are 300 who have it and
there are 100,000 total,
-
we can figure out how many don't have the
medical condition by just subtracting.
-
Which means 99,700
do not have the medical condition.
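Here is a minimal Python sketch of that first step, splitting the population into the two columns. The variable names are illustrative, and the numbers are the made-up ones from this lecture, not real medical data.

```python
# Made-up numbers from the lecture: a population of 100,000 and a 0.3% base rate.
population = 100_000
base_rate_per_thousand = 3  # 0.3% is 3 out of every 1,000

# Integer arithmetic keeps the head counts exact.
with_condition = population * base_rate_per_thousand // 1_000
without_condition = population - with_condition

print(with_condition)     # 300
print(without_condition)  # 99700
```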
-
Okay?
-
Now, we've divided the population into our
two columns:
-
the ones that do and the ones that don't
have the medical condition.
-
The next step is to figure out how many
are going to test positive
-
and how many are going to test negative
out of each of these groups.
-
For that, we first need the sensitivity.
-
The sensitivity tells us the percentage
of the cases that have the condition
-
who will test positive.
-
So the people who have the condition are
the 300.
-
The ones who test positive are going
to go up in this area
-
and we know from the sensitivity being 0.99 or 99%
-
that the number in that area should be 99%
of 300, or 297.
-
And of course, if that's the number
that test positive,
-
then the remainder
are going to test negative
-
and that means that we'll have three.
-
Which shouldn't surprise you because if
99% of the cases that have it
-
test positive, then 1% will test negative,
and 1% of 300 is 3.
-
Good: so we got the first column done.
-
Now, the next question is going to be the
specificity.
-
We can use the specificity to figure out
what goes in that next column.
-
If the specificity is 99% and we know
-
that 99,700 people do not have the
condition out of our sample of 100,000,
-
well, that means that 99% of 99,700 are
going to test negative
-
because the specificity is the
percentage of cases without the condition
-
that test negative.
-
And that means that we'll have
98,703 among the people
-
who do not have the condition
who test negative.
-
How many are going to test positive?
The rest of them.
-
So 99,700 minus 98,703
is going to be 997.
-
And of course, that shouldn't be surprising
again, because 1% of 99,700 is 997.
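The two columns can be filled in the same way, so a small helper covers both. This is only a sketch: `split_column` is a hypothetical helper name, and the whole-number division is exact here only because the lecture's made-up numbers divide evenly. With other numbers you would get fractions of people and would have to decide how to round.

```python
def split_column(group_size, percent_correct):
    """Split one column of the box into (correct, incorrect) test results.

    percent_correct is a whole-number percentage, e.g. 99 for 99%.
    """
    correct = group_size * percent_correct // 100
    return correct, group_size - correct

# Left column: people with the condition; "correct" means testing positive.
true_positives, false_negatives = split_column(300, 99)     # sensitivity 99%
# Right column: people without it; "correct" means testing negative.
true_negatives, false_positives = split_column(99_700, 99)  # specificity 99%

print(true_positives, false_negatives)  # 297 3
print(true_negatives, false_positives)  # 98703 997
```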
-
We've only got two boxes left to fill out.
-
How do you fill out those?
-
Well, this box in the upper right
is the total number of people
-
in this population of 100,000
who test positive.
-
And so, we can get that by adding the ones
that do have the condition and test positive
-
and the ones that don't have
the condition and test positive.
-
Just add them together, and you get 1,294.
-
And you do the same on the next row,
because that blank is the area
-
that has all the people
who test negative,
-
and 3 people who have the condition
test negative,
-
98,703 people who do not have the
condition test negative,
-
so the total is going to be 98,706.
-
And we can check to make sure that
we got it right,
-
by just adding them together:
1,294 plus 98,706 is equal to 100,000.
-
Phew, we got it right.
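If you want to see the whole box at once, the sketch below prints it as a small table and repeats the check we just did. The layout and labels are one possible arrangement, assumed for illustration; the four inner counts are the made-up ones from the lecture.

```python
# The four inner boxes from the lecture's made-up example.
true_positives, false_positives = 297, 997
false_negatives, true_negatives = 3, 98_703

total_positive = true_positives + false_positives
total_negative = false_negatives + true_negatives

# Rows are test results, columns are condition / no condition / row total.
rows = [
    ("Test positive", true_positives, false_positives, total_positive),
    ("Test negative", false_negatives, true_negatives, total_negative),
    ("Total", true_positives + false_negatives,
     false_positives + true_negatives, total_positive + total_negative),
]
print(f"{'':<14}{'Condition':>10}{'No condition':>14}{'Total':>9}")
for label, has, has_not, total in rows:
    print(f"{label:<14}{has:>10,}{has_not:>14,}{total:>9,}")

# Sanity check: the row totals must add back up to the population.
assert total_positive + total_negative == 100_000
```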
-
Okay, so now we've divided the population
into those people who have the condition,
-
those people who don't have the
condition,
-
and we know how many of each
of those groups test positive,
-
and how many of each of those groups
test negative.
-
The real question is
what's the probability
-
that I have cancer or the medical
condition, given that I tested positive?
-
How do we figure that out?
-
Well, the total number
of positive tests was 1,294
-
and the people who tested positive
who really had the condition was 297.
-
So it looks like the probability of
actually having the condition,
-
given that you tested positive,
is 297 out of 1294 or 0.23.
-
That's 23%, less than one in four.
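That last division is easy to check. A minimal sketch, using Python's standard `fractions` module so the ratio stays exact until we round it:

```python
from fractions import Fraction

true_positives = 297   # have the condition and test positive
false_positives = 997  # don't have the condition but test positive

# Probability of having the condition, given a positive test:
# true positives out of everyone who tested positive.
posterior = Fraction(true_positives, true_positives + false_positives)

print(posterior)                   # 297/1294
print(round(float(posterior), 4))  # 0.2295
```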
-
Is that what you guessed?
-
Most people, including most doctors, when
they hear that the test is
-
99% sensitive and 99% specific, will
guess a lot higher than one in four.
-
>> Oh my gosh!
-
I'm a doctor, and I never would have
thought that!
-
>> Now, don't worry:
-
she's not a physician.
She's a metaphysician.
-
>> But in this case, the probability
really is just one in four
-
that you had that medical condition.
-
Now how did that happen?
-
The reason was that the prevalence or the
base rate was so low
-
that even a small rate
of false positives,
-
given the massive numbers of people who
don't have the condition,
-
will mean that there are more false positives,
more than 3 times as many (997 against 297),
-
as there are true positives.
-
And that's why the probability
is just one in four,
-
actually a little less than one in four,
-
that you have the medical condition even
when you tested positive.
-
I want to add a quick caveat here, in
order to avoid misinterpretation.
-
The point here is that, if you
have a screening test for a condition
-
with a very low base rate or prevalence,
and you don't have any symptoms
-
that put you in a special category,
then, you need to get another test
-
before you jump to any conclusions
about having the medical condition.
-
Because, if you have that other test,
then the fact that you tested positive
-
on the first test puts you in a smaller class,
-
with a much higher base rate, or prevalence.
-
And now, the probability's going to go up.
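To get a feel for how much it goes up, here is a rough sketch. It rests on assumptions the lecture does not state: that the second test is also 99% sensitive and 99% specific, and that its errors are independent of the first test's errors. Real follow-up tests may behave quite differently. The function name `posterior` is illustrative.

```python
def posterior(prior, sensitivity, specificity):
    """Chance of having the condition after one positive result:
    true positives divided by all positives, as in the box method."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# First test: the prior is the base rate in the general population.
after_first = posterior(0.003, 0.99, 0.99)

# Second test: the prior is now the result of the first test.
# Assumption: the second test errs independently of the first.
after_second = posterior(after_first, 0.99, 0.99)

print(round(after_first, 3))   # 0.23
print(round(after_second, 3))  # 0.967
```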
-
Most doctors know that, and that's why,
after the first test,
-
they don't jump to conclusions, and they
order another test,
-
but many patients don't realize that and
they get extremely worried
-
after a single test even when they don't
have any symptoms.
-
So that's the mistake
that we're trying to avoid here
-
and that's surprising, but it actually
applies to many different areas of life.
-
It applies, for example, to medical tests
with all kinds of other diseases.
-
Not just cancer or colon cancer, but
pretty much every disease
-
where the prevalence is extremely low.
-
It applies also to drug tests.
-
If somebody gets a positive drug test,
-
does that mean they really
were using drugs?
-
Well, if it's a population where the
base rate or prevalence of drug use
-
is quite low, then it might not.
-
Of course, if you assume that the
prevalence or base rate is quite high,
-
then you're going to believe
that drug test.
-
But you need to know the facts about what
the prevalence or base rate really is
-
in order to calculate
accurately the probability
-
that this person really was using drugs.
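To see how strongly the answer depends on the base rate, here is a small sketch that tries a few different ones. Every number in it is hypothetical: the lecture gives no figures for drug tests, so the 99% sensitivity and specificity and the list of base rates are assumptions chosen purely for illustration.

```python
def chance_positive_is_real(base_rate, sensitivity=0.99, specificity=0.99):
    """Chance that a positive result is a true positive."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical base rates of drug use in four different populations.
for base_rate in (0.001, 0.01, 0.1, 0.5):
    chance = chance_positive_is_real(base_rate)
    print(f"base rate {base_rate:.1%}: positive is real {chance:.0%} of the time")
```

Under these assumed numbers, a positive result is right only about 9% of the time when the base rate is 0.1%, about half the time at 1%, and about 99% of the time at 50%.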
-
Same applies to evidence in legal trials:
take eyewitnesses for example,
-
it's very tricky, someone's trying to use
their eyes as a test for what they see.
-
They might identify a friend,
or they might just say
-
that car that did the hit-and-run accident
was a Porsche.
-
Well, how good are they at identifying
Porsches?
-
If they get it right most of the time,
but not always,
-
and sometimes they don't get it right
when it is a Porsche,
-
then we've got the sensitivity and
specificity of what they identify.
-
And we can use that to calculate
how likely it is
-
that their evidence in the trial
really is reliable or not.
-
Another example is the prediction of
future behavior.
-
We might have some kind of marker
-
that a certain group of people
with that marker
-
have a certain likelihood of
committing crimes.
-
But if crimes are very rare
in that community and every other,
-
then a test which has a pretty good
sensitivity and specificity
-
still might not be good enough when
we're talking about something like crime
-
that's actually very rare and has
a very low prevalence or base rate
-
in most communities.
-
And the same applies
to failing out of school.
-
Are SAT scores or GRE scores
going to be
-
good predictors of
who's going to fail out of school?
-
Well, if very few people fail out of
school,
-
so that the prevalence or base rate
is very low,
-
then, even if they're
pretty sensitive and specific,
-
they might not be good predictors.
-
So this same type of problem arises
in a lot of different areas.
-
And I'm not going to go through
more examples right now,
-
but we'll have plenty of examples in the
exercises at the end of this chapter.
-
I want to end, though,
by saying a few things
-
that are a bit more technical
about this method.
-
First, there's a lot of terminology to
learn,
-
because when you read about using
this method in other areas,
-
for other types of topics,
then you'll run into these terms,
-
and it's a good idea to know them.
-
So first, the cases where the person does
have the condition and also tests positive
-
are called hits, or true positives.
-
Different people use different terms.
-
The cases where the person tests positive,
but they don't have the condition,
-
are called false positives
or false alarms.
-
The cases where a person really does have
the condition, but tests negative
-
are called misses or false negatives.
-
And the cases where the person
does not have the condition
-
and the test comes out negative
are called true negatives,
-
because they're negative and it's true
that they don't have the condition.
-
If we put together the false negatives,
and the true negatives,
-
we get the total set of negatives.
-
And if we put together the true positives
and the false positives
-
we get the total set of positives.
-
And of course, we have the general
population.
-
Within that population,
a percentage that have the condition
-
and a percentage
that don't have the condition.
-
Now, what's the base rate?
-
The base rate in this population is simply
the set that have the condition,
-
divided by the total population,
which is Box 7 divided by Box 9.
-
If we use e for the evidence
-
and h for the hypothesis being true that
the condition really does exist,
-
then that's the probability of h,
-
and the sensitivity is going to be
the total number of true positives
-
divided by the total number of people
with the condition,
-
because it's the percentage of people who
have the condition who test positive.
-
OK? So that's the probability of e given h,
-
and it's Box 1 divided by Box 7.
-
The specificity, in contrast, is the ratio
of the number of true negatives
-
to the total number of people
who do not have the condition, that is,
-
the probability of not e, that is,
-
not having the evidence
of a positive test result,
-
given not h,
given that you're in the second column,
-
where the hypothesis is false,
because you don't have the condition.
-
So that's Box 5 divided by Box 8.
-
That's the specificity.
-
So we can define all of these
in terms of each other.
-
The hits divided by the total with that
condition is going to be the sensitivity.
-
And you can use this terminology to guide
your way through this box.
-
And the big question is again going to be
what's the solution?
-
What's the probability of the hypothesis,
having the condition, given the evidence,
-
that is, a positive test result:
that's going to be Box 1 divided by Box 3.
-
And as we saw in the case that we just
went through,
-
that gives you the probability of having
the medical condition, or colon cancer,
-
given a positive test result.
-
That's called the posterior probability,
or in symbols,
-
the probability of the hypothesis,
given the evidence.
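Written out in symbols, with the made-up numbers from the cancer example filled in, the four ratios look like this. The box numbers are the ones named in the lecture; writing "not e" and "not h" as ~e and ~h is a notational choice.

```latex
\[
\begin{aligned}
\text{base rate}   &= P(h)                   = \frac{\text{Box 7}}{\text{Box 9}} = \frac{300}{100{,}000} = 0.003 \\
\text{sensitivity} &= P(e \mid h)            = \frac{\text{Box 1}}{\text{Box 7}} = \frac{297}{300} = 0.99 \\
\text{specificity} &= P({\sim}e \mid {\sim}h) = \frac{\text{Box 5}}{\text{Box 8}} = \frac{98{,}703}{99{,}700} = 0.99 \\
\text{posterior}   &= P(h \mid e)            = \frac{\text{Box 1}}{\text{Box 3}} = \frac{297}{1{,}294} \approx 0.23
\end{aligned}
\]
```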
-
So I hope this terminology helps you
understand some of the discussions of this,
-
if you go on and read about it
in the literature.
-
This procedure that we've been discussing
is actually just an application
-
of a famous theorem called Bayes' Theorem
after Thomas Bayes,
-
an 18th-century English clergyman,
who was also a mathematician
-
and proved this extremely important
theorem in probability theory.
-
Now some of you out there will use the
boxes, and it'll make sense to you.
-
But some Courserians, I assume,
are mathematicians,
-
and they want to see
the mathematics behind it.
-
So now, I want to show you how to derive
Bayes' theorem
-
from the rules of probability
that we learned in earlier lectures.
-
So for all you math nerds out there,
here goes.
-
You start with rule 2G,
-
apply it to the probability that the
evidence and the hypothesis are both true.
-
And by the rule, that probability is
equal to the probability of the evidence,
-
times the probability of the hypothesis,
given the evidence.
-
You have to have
that conditional probability
-
because they're not independent.
-
Then you simply divide both sides of that
by the probability of the evidence:
-
a little simple algebra.
-
And you end up with the probability
of the hypothesis, given the evidence,
-
is equal to the probability
of the evidence and the hypothesis,
-
divided by the probability
of the evidence.
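In symbols, those first two steps can be written as follows. Dividing by the probability of the evidence assumes that probability is not zero.

```latex
\[
P(e \,\&\, h) = P(e) \times P(h \mid e)
\quad\Longrightarrow\quad
P(h \mid e) = \frac{P(e \,\&\, h)}{P(e)}
\qquad \text{(assuming } P(e) > 0\text{)}
\]
```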
-
Now we can do a little trick.
This was ingenious.
-
Substitute for e, something
that's logically equivalent to e,
-
namely, the evidence AND the hypothesis
or the evidence AND NOT the hypothesis.
-
Now if you think about it, you'll see
that those are equivalent,
-
because either the hypothesis
has to be true
-
or NOT the hypothesis is true.
-
One or the other has to be true.
-
And that means that the evidence
AND the hypothesis
-
or the evidence AND NOT the hypothesis
is going to be equivalent to e.
-
So this is equivalent to this.
-
And because they're equivalent,
we can substitute them
-
within the formula for probability
without affecting the truth values.
-
So we just substitute this formula in
here for the e up there.
-
And we end up with the probability of the
hypothesis, given the evidence,
-
is equal to the probability of the
evidence AND the hypothesis, divided by
-
the probability of the evidence
AND the hypothesis
-
or the evidence AND NOT the hypothesis.
-
Now, that's not supposed to make much
sense, but it helps with the derivation.
-
The next step is to apply rule 3, because
we have a disjunction.
-
And notice the disjuncts are mutually
exclusive.
-
It cannot be true, both, that the evidence
AND the hypothesis is true,
-
and also that the evidence
AND NOT the hypothesis is true,
-
because it can't be both h and not h.
-
So we can apply the simple version
of rule 3.
-
And that means that the probability of
(e&h) or (e&~h)
-
is equal to the probability of (e&h)
+ the probability of (e&~h).
-
We're just applying
that rule 3 for disjunction
-
that we learned a few lectures ago.
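Putting the substitution and rule 3 together, the derivation so far reads as follows in symbols. Here ~h stands for "NOT the hypothesis".

```latex
\[
\begin{aligned}
e &\equiv (e \,\&\, h) \lor (e \,\&\, {\sim}h) \\
P(e) &= P(e \,\&\, h) + P(e \,\&\, {\sim}h)
  && \text{(rule 3, mutually exclusive disjuncts)} \\
P(h \mid e) &= \frac{P(e \,\&\, h)}{P(e \,\&\, h) + P(e \,\&\, {\sim}h)}
\end{aligned}
\]
```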
-
Now we apply rule 2G again,
-
because we have the probability
of a conjunction up in the top.
-
And, since these are not independent of
each other
-
-- we hope not, if it's a hypothesis
and the evidence for it --
-
then we have to use
the conditional probability.
-
And using rule 2G, we find that
the probability of the hypothesis,
-
given the evidence, is equal to
-
the probability of the hypothesis, times
the probability of the evidence,
-
given the hypothesis, divided by
the probability of the hypothesis,
-
times the probability of the evidence,
given the hypothesis,
-
plus the probability
of the hypothesis being false,
-
that is the probability of NOT h,
times the probability of the evidence,
-
given NOT h, or the hypothesis being false.
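Written as a formula, and then filled in with the made-up numbers from the cancer example, that mouthful looks like this. Note that the probability of the evidence given NOT h is the false positive rate, which is 1 minus the specificity, so 0.01 here.

```latex
\[
P(h \mid e) =
\frac{P(h)\,P(e \mid h)}
     {P(h)\,P(e \mid h) + P({\sim}h)\,P(e \mid {\sim}h)}
\]
\[
P(h \mid e) =
\frac{0.003 \times 0.99}{0.003 \times 0.99 + 0.997 \times 0.01}
= \frac{0.00297}{0.01294} \approx 0.23
\]
```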
-
And that's a mouthful
-
and it's a long formula,
but that's the mathematical formula
-
that Bayes proved in the 18th century
and it provides the mathematical basis
-
for that whole system of boxes
that we talked about before.
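A quick way to convince yourself that the formula and the boxes agree is to compute both with exact fractions. This is a sketch using the lecture's made-up numbers; the variable names are illustrative.

```python
from fractions import Fraction

p_h = Fraction(3, 1000)                  # base rate, P(h)
p_e_given_h = Fraction(99, 100)          # sensitivity, P(e | h)
p_e_given_not_h = 1 - Fraction(99, 100)  # 1 - specificity, P(e | ~h)

# Bayes' theorem: P(h | e) = P(h)P(e|h) / (P(h)P(e|h) + P(~h)P(e|~h))
numerator = p_h * p_e_given_h
bayes = numerator / (numerator + (1 - p_h) * p_e_given_not_h)

# Box method: Box 1 divided by Box 3.
boxes = Fraction(297, 1294)

print(bayes == boxes)  # True
```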
-
But if you don't like the mathematical
proof and that's too confusing for you,
-
then use the boxes.
-
And if you don't like the boxes,
use the mathematical proof.
-
They're both going to work:
just pick the one that works for you.
-
In fact, you don't have to pick
either of them,
-
because remember, this is an honors
lecture, it's optional,
-
and it won't be on the quiz.
-
But if you do want to try this method,
and make sure that you understand it,
-
we'll have a bunch of exercises for you,
where you can test your skills.