Coins and dice provide a nice, simple model of how to calculate probabilities, but everyday life is a lot more complicated, and it's not taken up with gambling. At least, I hope your life is not taken up with gambling. So in order to make probabilities more applicable to everyday life, we need to look at slightly more complicated methods. Now, because these methods are more complicated, this lecture is going to be an honors lecture: it's optional. It will not be on the quiz, so don't get worried about that. But it is still useful, and it's fascinating, and it'll help you avoid some mistakes that a lot of people make and that create a lot of problems. And so I hope you'll stick with it and listen to this lecture. And there will be exercises to help you figure out whether you understand the material or not. But don't get too worried, because it's not going to be on the quiz.

The real problem that we'll be facing in this lecture is the problem of testing. We use tests all the time: we use tests to figure out whether you have a certain medical condition. We use tests to predict the weather or to predict people's future behavior: we have certain indicators of how they're going to act, whether they're going to commit a crime or not, but also whether they're going to do well in school or fail. We always use these tests when we don't know for certain, but we want some kind of evidence, or some kind of indicator. The problem is that none of these tests is perfect. They always contain errors of various sorts. And what we're going to have to do is to see how to take those errors of different sorts and build them together into a method, and then a formula, for calculating how reliable the method is for detecting the thing that we want to detect. This problem is a lot like the problem we faced earlier when we were talking about applying generalizations to particular cases, because here we're going to be applying probabilities to particular cases. So it'll seem familiar to you in certain parts, but you'll see that this case is a little trickier.

The best examples occur in medicine. So just imagine that you go to your doctor for a regular checkup. You don't have any special symptoms, but he decides to do a few screening tests. And unfortunately, and very worryingly, it turns out that you test positive on one test for a particular form of cancer, a certain kind of medical condition. Well, what that means is that you might have cancer. Might. Great. You want to know whether you do have cancer. But of course, finding out for sure whether or not you have cancer is going to take further tests. And those tests might be expensive, they might be dangerous, and they're going to be invasive in various ways. So you really want to know what's the probability, given that you've tested positive on this one test, that you really have cancer. Now, clearly, that probability is going to depend on a number of facts about this type of cancer, about the type of test, and so on. And I am not a doctor. I am not giving you medical advice. If you test positive on a test, go talk to your doctor; don't trust me, because I'm just making up numbers here. But let's do make up a few numbers and figure out what the likelihood is of having cancer, given that you tested positive.

So let's imagine that the base rate of this particular type of cancer in the population is 0.3%, that is, 3 out of 1,000, or 0.003. That's called the base rate, or sometimes the prevalence, of the condition in the population.
That's simply to say that out of 1,000 people chosen randomly from the population, you'd get about 3 who have this condition. It's just a percentage of the general population. So that's the condition; what about the test? Well, the first thing we want to know is the sensitivity of the test. The sensitivity of the test, we're going to assume, is 0.99. And what that means is that out of 100 people who have this condition, 99 of them will test positive. So this test is pretty good at picking out, from among the people who have the condition, the ones who do: 99 of those 100 people who have the condition will test positive. The other feature is specificity, and what that means is the percentage of the people who don't have the condition who will test negative. The point here is that you're not going to get a positive result for people who don't have the condition, right? Because you want it to be specific to this particular condition, and not get a bunch of positives for people who have other types of conditions or no medical condition at all. So the specificity we're going to assume, in this particular case we're talking about, is also 99%.

Now, what we want to know is the probability that you have the cancer, the condition, given that you tested positive on the test; but notice that the sensitivity tells you the probability that you will test positive given that you have the condition. We want to know the opposite of that: the probability that you have the condition given that you tested positive. And that's what we have to do a little calculation to figure out. But before we do that calculation, I want you to think about these figures that I've given you -- the prevalence in the population, the sensitivity of the test, the specificity of the test -- and just make a guess. Just start out by writing down on a piece of paper what you think the probability is that you would have the cancer, given that you tested positive on the test. Take a minute and think about it and write it down.

But we don't want to just guess about medical conditions, about probabilities that matter as much as this one does. Instead, we want to calculate what the probability really is. So let's go through it carefully, and I'll show you how to use what I'll call the box method in order to calculate the real likelihood that you have the condition, given that you got a positive test result. What we need to do is to divide the population into four different groups: the group that has the condition and tested positive, the group that has the condition and tested negative, the group that doesn't have the condition and tested positive, and the group that doesn't have the condition and tested negative. And this chart shows you a nice, simple way of organizing all of that information. The top row tells you all the people who tested positive. The bottom row tells you the people who tested negative. Then the left column gives you the people who do have the medical condition, in this case some kind of cancer. And the right column tells you the people who do not have that condition. Now what we need to do is to start filling it in with numbers. The first thing we need to specify is the population. In this case, we want to start with a big enough population that we're not going to have a lot of fractions in the other boxes. So let's just imagine that the population is 100,000. Make it a million or 10 million; it doesn't matter, because we're going to be interested in the ratios between the different groups.
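(The chart itself appears only on the screen in the video, so here is a rough sketch of the layout being described; the totals row and column get filled in as we go.)

                      has the condition    doesn't have it    total
      tests positive
      tests negative
      total                                                   100,000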
We can use that 100,000 to fill out the other boxes if we know the prevalence, or the base rate, because the base rate tells you what percentage of that 100,000 actually do have the condition and don't have the condition. We imagined -- remember, we're just making up numbers here -- that the prevalence of this condition is 0.3%. And that means that out of 100,000 people, there will be 300 who do have the medical condition. Well, if there are 300 who have it and there are 100,000 total, we can figure out how many don't have the medical condition by just subtracting, which means 99,700 do not have the medical condition. Okay? Now we've divided the population into our two columns: the ones that do and the ones that don't have the medical condition.

The next step is to figure out how many are going to test positive and how many are going to test negative out of each of these groups. For that, we first need the sensitivity. The sensitivity tells us the percentage of the cases that have the condition who will test positive. So the people who have the condition are the 300. The ones who test positive are going to go up in the top area, and we know, from the sensitivity being 0.99 or 99%, that the number in that area should be 99% of 300, or 297. And of course, if that's the number that test positive, then the remainder are going to test negative, and that means that we'll have 3. Which shouldn't surprise you, because if 99% of the cases that have it test positive, then 1% will test negative, and 1% of 300 is 3. Good: so we've got the first column done.

Now, the next question is going to be the specificity. We can use the specificity to figure out what goes in the next column. If the specificity is 99%, and we know that 99,700 people do not have the condition out of our sample of 100,000, well, that means that 99% of 99,700 are going to test negative, because the specificity is the percentage of cases without the condition that test negative. And that means that we'll have 98,703 among the people who do not have the condition who test negative. How many are going to test positive? The rest of them. So 99,700 minus 98,703 is going to be 997. And of course, that shouldn't be surprising, again, because 1% of 99,700 is 997.

We've only got two boxes left to fill out. How do we fill those out? Well, the box in the upper right is the total number of people in this population of 100,000 who test positive. And so we can get that by adding the ones that do have the condition and test positive and the ones that don't have the condition and test positive. Just add them together, and you get 1,294. And you do the same on the next row, because that blank is the area that has all the people who test negative: 3 people who have the condition test negative, and 98,703 people who do not have the condition test negative, so the total is going to be 98,706. And we can check to make sure that we got it right by just adding them together: 1,294 plus 98,706 is equal to 100,000. Phew, we got it right. Okay, so now we've divided the population into those people who have the condition and those people who don't have the condition, and we know how many of each of those groups test positive and how many of each of those groups test negative. The real question is: what's the probability that I have the cancer, the medical condition, given that I tested positive? How do we figure that out? Well, the total number of positive tests was 1,294, and the number of people who tested positive who really had the condition was 297.
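If you'd like to check all of that arithmetic for yourself, here is a minimal sketch in Python that builds the same boxes from the three numbers we made up. The variable names are my own, and the figures are still invented, not medical fact.

    # Box method: split the population into the four condition/test groups.
    population = 100_000
    prevalence = 0.003     # base rate: 0.3% have the condition
    sensitivity = 0.99     # P(positive test | condition)
    specificity = 0.99     # P(negative test | no condition)

    have = round(population * prevalence)       # 300 with the condition
    lack = population - have                    # 99,700 without it

    true_pos = round(sensitivity * have)        # 297: have it, test positive
    false_neg = have - true_pos                 # 3: have it, test negative
    true_neg = round(specificity * lack)        # 98,703: lack it, test negative
    false_pos = lack - true_neg                 # 997: lack it, test positive

    total_pos = true_pos + false_pos            # 1,294 positive tests in all
    total_neg = false_neg + true_neg            # 98,706 negative tests in all
    assert total_pos + total_neg == population  # the same check as above

    print(true_pos / total_pos)                 # 0.2295...: about 23%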
So it looks like the probability of actually having the condition, given that you tested positive, is 297 out of 1,294, or about 0.23. That's 23%, less than one in four. Is that what you guessed? Most people, including most doctors, when they hear that the test is 99% sensitive and 99% specific, will guess a lot higher than one in four. >> Oh my gosh! I'm a doctor, and I never would have thought that! >> Now, don't worry: she's not a physician. She's a metaphysician. >> But in this case, the probability really is just one in four that you had that medical condition. Now, how did that happen? The reason is that the prevalence, or the base rate, was so low that even a small rate of false positives, given the massive number of people who don't have the condition, will mean that there are many more false positives than true positives -- 3 times as many. And that's why the probability is just one in four, actually a little less than one in four, that you have the medical condition even when you tested positive.

I want to add a quick caveat here, in order to avoid misinterpretation, because the point here is that, if you have a screening test for a condition with a very low base rate or prevalence, and you don't have any symptoms that put you in a special category, then you need to get another test before you jump to any conclusions about having the medical condition. Because, if you have that other test, then the fact that you tested positive on the first test puts you in a smaller class, with a much higher base rate, or prevalence. And now the probability's going to go up. Most doctors know that, and that's why, after the first test, they don't jump to conclusions, and they order another test. But many patients don't realize that, and they get extremely worried after a single test, even when they don't have any symptoms. So that's the mistake that we're trying to avoid here. It's surprising, but it actually applies to many different areas of life.

It applies, for example, to medical tests for all kinds of other diseases. Not just cancer or colon cancer, but pretty much every disease where the prevalence is extremely low. It applies also to drug tests. If somebody gets a positive drug test, does that mean they really were using drugs? Well, if it's a population where the base rate or prevalence of drug use is quite low, then it might not. Of course, if you assume that the prevalence or base rate is quite high, then you're going to believe that drug test. But you need to know the facts about what the prevalence or base rate really is in order to calculate accurately the probability that this person really was using drugs. The same applies to evidence in legal trials. Take eyewitnesses, for example: in effect, someone's using their eyes as a test for what they saw. They might identify a friend, or they might just say that the car that did the hit-and-run accident was a Porsche. Well, how good are they at identifying Porsches? If they usually, but not always, say it's a Porsche when it is one, and sometimes say it's a Porsche when it isn't, then we've got a sensitivity and a specificity for their identifications. And we can use those to calculate how likely it is that their evidence in the trial really is reliable or not. Another example is the prediction of future behavior. We might have some kind of marker such that people with that marker have a certain likelihood of committing crimes.
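To see in numbers why that second test changes the picture so much, here is a small sketch. The assumption that the second test is independent of the first and has the same 99% sensitivity and specificity is mine, purely for illustration.

    # After one positive test, your "base rate" is no longer 0.3% but ~23%,
    # because you now belong to the smaller class of positive testers.
    prior = 297 / 1294                    # probability after the first test
    true_pos = prior * 0.99               # have it and test positive again
    false_pos = (1 - prior) * 0.01        # don't have it, but test positive
    print(true_pos / (true_pos + false_pos))   # 0.967...: now very probable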
But if crimes of that sort are very rare in that community and every other, then a test that has pretty good sensitivity and specificity still might not be good enough, because something like crime actually has a very low prevalence or base rate in most communities. And the same applies to failing out of school. Are SAT scores or GRE scores going to be good predictors of who's going to fail out of school? Well, if very few people fail out of school, so that the prevalence or base rate is very low, then, even if those scores are pretty sensitive and specific, they might not be good predictors. So this same type of problem arises in a lot of different areas. And I'm not going to go through more examples right now, but we'll have plenty of examples in the exercises at the end of this chapter.

I want to end, though, by saying a few things that are a bit more technical about this method. First, there's a lot of terminology to learn, because when you read about using this method in other areas, for other types of topics, you'll run into these terms, and it's a good idea to know them. So first, the cases where the person does have the condition and also tests positive are called hits, or true positives. Different people use different terms. The cases where the person tests positive but doesn't have the condition are called false positives, or false alarms. The cases where the person really does have the condition but tests negative are called misses, or false negatives. And the cases where the person does not have the condition and the test comes out negative are called true negatives, because they're negative and it's true that they don't have the condition. If we put together the false negatives and the true negatives, we get the total set of negatives. And if we put together the true positives and the false positives, we get the total set of positives. And of course, we have the general population; within that population, a percentage have the condition and a percentage don't have the condition.

Now, what's the base rate? The base rate in this population is simply the set that have the condition divided by the total population, which is Box 7 divided by Box 9. If we use e for the evidence and h for the hypothesis being true, that the condition really does exist, then that's the probability of h. And the sensitivity is going to be the total number of true positives divided by the total number of people with the condition, because it's the percentage of people who have the condition and test positive. Okay? So that's the probability of e given h, and it's Box 1 divided by Box 7. The specificity, in contrast, is the ratio of true negatives to the total number of people who do not have the condition. That is the probability of not-e -- not having the evidence of a positive test result -- given not-h, that is, given that you're in the second column, where the hypothesis is false because you don't have the condition. So that's Box 5 divided by Box 8. That's the specificity. So we can define all of these notions in terms of the boxes: the hits divided by the total with the condition is going to be the sensitivity, and you can use this terminology to guide your way through the box. And the big question, again, is going to be: what's the solution? What's the probability of the hypothesis, of having the condition, given the evidence, that is, a positive test result? That's going to be Box 1 divided by Box 3.
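Putting the terminology and the box numbers together, here is my reconstruction of the grid. (The text above fixes Boxes 1, 3, 5, 7, 8, and 9; the numbering of the other three boxes is my inference from the row order.)

                      has condition         no condition           total
      tests positive  Box 1: true pos.      Box 2: false pos.      Box 3: all positives
      tests negative  Box 4: false neg.     Box 5: true neg.       Box 6: all negatives
      total           Box 7: with cond.     Box 8: without cond.   Box 9: population

      base rate   = Box 7 / Box 9 = P(h)
      sensitivity = Box 1 / Box 7 = P(e | h)
      specificity = Box 5 / Box 8 = P(~e | ~h)
      posterior   = Box 1 / Box 3 = P(h | e)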
And as we saw in the case that we just went through, that gives you the probability of having the medical condition, or colon cancer, given a positive test result. That's called the posterior probability, or, in symbols, the probability of the hypothesis given the evidence. So I hope this terminology helps you understand some of the discussions of this, if you go on and read about it in the literature.

This procedure that we've been discussing is actually just an application of a famous theorem called Bayes' Theorem, after Thomas Bayes, an 18th-century English clergyman who was also a mathematician and proved this extremely important theorem in probability theory. Now, some of you out there will use the boxes, and it'll make sense to you. But some Courserians, I assume, are mathematicians, and they want to see the mathematics behind it. So now I want to show you how to derive Bayes' Theorem from the rules of probability that we learned in earlier lectures. So, for all you math nerds out there, here goes.

You start with rule 2G and apply it to the probability that the evidence and the hypothesis are both true. By the rule, that probability is equal to the probability of the evidence times the probability of the hypothesis given the evidence. You have to have that conditional probability because they're not independent. Then you simply divide both sides of that by the probability of the evidence -- a little simple algebra -- and you end up with: the probability of the hypothesis given the evidence is equal to the probability of the evidence and the hypothesis, divided by the probability of the evidence. Now we can do a little trick. This was ingenious. Substitute for e something that's logically equivalent to e, namely: the evidence AND the hypothesis, OR the evidence AND NOT the hypothesis. Now, if you think about it, you'll see that those are equivalent, because either the hypothesis has to be true or NOT the hypothesis has to be true; one or the other. And that means that (the evidence AND the hypothesis) OR (the evidence AND NOT the hypothesis) is going to be equivalent to e. And because they're equivalent, we can substitute one for the other within the formula for probability without affecting the truth values. So we just substitute that longer formula for the e in the denominator. And we end up with: the probability of the hypothesis given the evidence is equal to the probability of the evidence AND the hypothesis, divided by the probability of (the evidence AND the hypothesis) OR (the evidence AND NOT the hypothesis). Now, that might not seem to have gained us much, but it helps with the derivation. The next step is to apply rule 3, because we have a disjunction. And notice that the disjuncts are mutually exclusive: it cannot be true both that the evidence AND the hypothesis is true and also that the evidence AND NOT the hypothesis is true, because it can't be both h and not-h. So we can apply the simple version of rule 3. And that means that the probability of (e & h) OR (e & ~h) is equal to the probability of (e & h) plus the probability of (e & ~h). We're just applying that rule 3 for disjunction that we learned a few lectures ago. Now we apply rule 2G again, because we have the probability of a conjunction up in the top. And since these are not independent of each other -- we hope not, if it's a hypothesis and the evidence for it -- we have to use the conditional probability.
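For those who'd rather see those steps on paper than hear them, here is the same derivation written out in standard notation; this just transcribes the reasoning above.

    \begin{align*}
    P(e \,\&\, h) &= P(e) \times P(h \mid e)
        && \text{rule 2G} \\
    P(h \mid e) &= \frac{P(e \,\&\, h)}{P(e)}
        && \text{divide both sides by } P(e) \\
    e &\equiv (e \,\&\, h) \vee (e \,\&\, {\sim}h)
        && \text{logical equivalence} \\
    P(e) &= P(e \,\&\, h) + P(e \,\&\, {\sim}h)
        && \text{rule 3, exclusive disjuncts} \\
    P(h \mid e) &= \frac{P(h)\,P(e \mid h)}{P(h)\,P(e \mid h) + P({\sim}h)\,P(e \mid {\sim}h)}
        && \text{rule 2G again}
    \end{align*}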
And using rule 2G, we find that the probability of the hypothesis given the evidence is equal to the probability of the hypothesis times the probability of the evidence given the hypothesis, divided by: the probability of the hypothesis times the probability of the evidence given the hypothesis, plus the probability of the hypothesis being false -- that is, the probability of NOT h -- times the probability of the evidence given NOT h, the hypothesis being false. That's a mouthful, and it's a long formula, but it's the mathematical formula that Bayes proved in the 18th century, and it provides the mathematical basis for that whole system of boxes that we talked about before. But if you don't like the mathematical proof, and it's too confusing for you, then use the boxes. And if you don't like the boxes, use the mathematical proof. They're both going to work: just pick the one that works for you. In fact, you don't have to pick either of them, because, remember, this is an honors lecture: it's optional, and it won't be on the quiz. But if you do want to try this method and make sure that you understand it, we'll have a bunch of exercises for you where you can test your skills.
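Finally, to confirm that the formula and the boxes agree, here is the theorem as a short computation, using the made-up numbers from the example (the function name is mine):

    def bayes(prior, sensitivity, specificity):
        """Bayes' Theorem: P(h | e) from P(h), P(e | h), and P(~e | ~h)."""
        true_pos = prior * sensitivity                  # P(h) * P(e | h)
        false_pos = (1 - prior) * (1 - specificity)     # P(~h) * P(e | ~h)
        return true_pos / (true_pos + false_pos)

    print(bayes(0.003, 0.99, 0.99))   # 0.2295..., matching 297/1,294 from the boxes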