WEBVTT 00:00:02.980 --> 00:00:09.400 Coins and dice provide a nice simple model of how to calculate probabilities, but 00:00:09.990 --> 00:00:14.540 everyday life is a lot more complicated and it's not taken up with gambling. 00:00:14.540 --> 00:00:17.447 At least, I hope your life is not taken up with gambling. 00:00:18.400 --> 00:00:22.230 So in order to make probabilities more applicable to everyday life, 00:00:22.230 --> 00:00:25.986 we need to look at slightly more complicated methods. 00:00:26.820 --> 00:00:30.130 Now, because these methods are more complicated, 00:00:30.130 --> 00:00:34.230 this lecture is going to be an honors lecture: it's optional. 00:00:34.230 --> 00:00:35.930 It will not be on the quiz, 00:00:35.930 --> 00:00:37.745 so don't get worried about that. 00:00:38.490 --> 00:00:41.774 But it is still useful, and it's fascinating, 00:00:41.774 --> 00:00:44.432 and it'll help you avoid some mistakes 00:00:44.432 --> 00:00:47.564 that a lot of people make and that create a lot of problems. 00:00:48.208 --> 00:00:52.620 And so I hope you'll stick with it and listen to this lecture. 00:00:52.620 --> 00:00:56.575 And there will be exercises to help you figure out 00:00:56.575 --> 00:00:58.502 whether you understand the material or not. 00:00:58.502 --> 00:01:02.653 But don't get too worried, because it's not going to be on the quiz. 00:01:05.461 --> 00:01:08.082 The real problem that we'll be facing in this lecture 00:01:08.742 --> 00:01:10.769 is the problem of tests. 00:01:10.769 --> 00:01:13.761 We use tests all the time: we use tests to figure out 00:01:13.761 --> 00:01:17.065 whether you have a certain medical condition. 00:01:17.065 --> 00:01:22.201 We use tests to predict the weather or to predict people's future behavior. 
00:01:22.201 --> 00:01:25.087 We have certain indicators of how they're going to act, 00:01:25.745 --> 00:01:28.305 either commit a crime or not commit a crime, 00:01:28.305 --> 00:01:30.345 but also whether they're going to pass, 00:01:30.345 --> 00:01:32.405 do well in school or fail. 00:01:33.488 --> 00:01:37.660 We always use these tests when we don't know for certain, 00:01:38.230 --> 00:01:41.340 but we want some kind of evidence, or some kind of indicator. 00:01:41.900 --> 00:01:45.144 The problem is none of these tests are perfect. 00:01:45.450 --> 00:01:48.495 They always contain errors of various sorts. 00:01:48.900 --> 00:01:51.911 And what we're going to have to do is to see how to take 00:01:51.911 --> 00:01:57.957 those errors of different sorts and build them together into a method 00:01:57.957 --> 00:02:03.440 and then a formula for calculating how reliable the method is 00:02:03.440 --> 00:02:06.259 for detecting the thing that we want to detect. 00:02:07.260 --> 00:02:10.380 This problem is a lot like the problem we faced earlier 00:02:10.380 --> 00:02:14.977 when we were talking about applying generalizations to particular cases 00:02:14.977 --> 00:02:18.261 because here we're going to be applying probabilities to particular cases. 00:02:18.936 --> 00:02:21.995 So it'll seem familiar to you in certain parts, 00:02:21.995 --> 00:02:25.280 but you'll see that this case is a little trickier. 00:02:25.890 --> 00:02:28.163 The best examples occur in medicine. 00:02:28.510 --> 00:02:32.465 So just imagine that you go to your doctor for a regular checkup. 00:02:32.820 --> 00:02:34.386 You don't have any special symptoms, 00:02:35.140 --> 00:02:37.580 but he decides to do a few screening tests. 00:02:38.920 --> 00:02:44.385 And unfortunately, and very worryingly, it turns out that you test positive 00:02:45.045 --> 00:02:51.249 on one test for a particular form of cancer, a certain kind of medical condition. 
00:02:52.470 --> 00:02:56.101 Well, what that means is that you might have cancer. 00:02:56.870 --> 00:02:58.250 Might, great. 00:02:58.250 --> 00:03:00.376 You want to know whether you do have cancer. 00:03:01.010 --> 00:03:04.090 But of course, finding out for sure whether or not you have cancer 00:03:04.106 --> 00:03:06.290 is going to take further tests. 00:03:06.290 --> 00:03:10.670 And those tests might be expensive, they might be dangerous, 00:03:10.670 --> 00:03:13.058 they're going to be invasive in various ways. 00:03:13.650 --> 00:03:16.525 So you really want to know what's the probability, 00:03:17.212 --> 00:03:20.506 given that you've tested positive on this one test, 00:03:21.140 --> 00:03:22.690 that you really have cancer. 00:03:23.930 --> 00:03:27.951 Now clearly that probability is going to depend on a number of facts 00:03:27.951 --> 00:03:31.243 about this type of cancer, about the type of test and so on. 00:03:31.520 --> 00:03:33.517 And I am not a doctor. 00:03:33.940 --> 00:03:36.266 I am not giving you medical advice. 00:03:36.590 --> 00:03:41.210 If you test positive on a test, go talk to your doctor, 00:03:41.210 --> 00:03:44.040 don't trust me, because I'm just making up numbers here. 00:03:44.350 --> 00:03:48.270 But let's do make up a few numbers and figure out 00:03:48.270 --> 00:03:53.290 what the likelihood is of having cancer, given that you tested positive. 00:03:53.290 --> 00:03:58.890 So let's imagine that the base rate of this particular type of cancer 00:03:58.890 --> 00:04:06.244 in the population is 0.3%, that is, 3 out of 1,000, or 0.003. 00:04:06.244 --> 00:04:08.430 And they say that's the base rate, 00:04:08.430 --> 00:04:12.576 or it's sometimes called the prevalence of the condition in the population. 00:04:12.970 --> 00:04:16.820 That's simply to say that out of 1,000 people chosen randomly 00:04:16.820 --> 00:04:19.949 in the population, you'd get about 3 that have this condition. 
00:04:21.640 --> 00:04:24.880 It's just a percentage of the general population. 00:04:25.910 --> 00:04:28.406 So that's the condition, what about the test? 00:04:28.790 --> 00:04:32.273 Well the first thing we want to know is the sensitivity of the test. 00:04:32.930 --> 00:04:37.460 The sensitivity of the test we're going to assume is 0.99. 00:04:38.620 --> 00:04:46.040 And what that means is that out of 100 people who have this condition, 00:04:46.500 --> 00:04:49.010 99 of them will test positive. 00:04:49.010 --> 00:04:53.570 So this test is pretty good at figuring out, 00:04:53.570 --> 00:04:57.400 from among the people who have the condition, which ones do. 00:04:57.400 --> 00:05:03.018 99 of those 100 people who have the condition will test positive. 00:05:03.210 --> 00:05:08.005 The other feature is specificity, and what that means is 00:05:08.005 --> 00:05:13.413 the percentage of the people who don't have the condition who will test negative. 00:05:14.390 --> 00:05:17.500 The point here is you're not going to get a positive result 00:05:17.500 --> 00:05:20.150 for people who don't have the condition, right? 00:05:20.180 --> 00:05:23.665 Because you want it to be specific to this particular condition 00:05:23.665 --> 00:05:27.600 and not get a bunch of positives for people who have other types of conditions 00:05:27.600 --> 00:05:29.485 or no medical condition at all. 00:05:30.240 --> 00:05:32.442 So the specificity we're going to assume, 00:05:32.442 --> 00:05:37.106 in this particular case we're talking about, is also 99%. 00:05:39.234 --> 00:05:46.490 Now, what we want to know is the probability that you have a cancer, a condition, 00:05:47.120 --> 00:05:50.630 given that you tested positive on the test; 00:05:50.630 --> 00:05:55.462 but notice that the sensitivity tells you the probability 00:05:55.462 --> 00:05:59.039 that you will test positive given that you have the condition. 
00:05:59.339 --> 00:06:01.758 We want to know the opposite of that, 00:06:01.758 --> 00:06:04.737 the probability that you have the condition 00:06:04.737 --> 00:06:07.318 given that you tested positive. 00:06:08.180 --> 00:06:10.840 And that's what we have to do a little calculation to figure out. 00:06:10.840 --> 00:06:15.320 But before we do that calculation, I want you to think about these figures 00:06:15.320 --> 00:06:18.310 that I've given you: the prevalence in the population, 00:06:18.310 --> 00:06:22.060 the sensitivity of the test, the specificity of the test, 00:06:22.060 --> 00:06:23.487 and just make a guess. 00:06:23.900 --> 00:06:26.510 Just start out by writing down on a piece of paper 00:06:26.510 --> 00:06:32.430 what you think the probability is that you would have the cancer 00:06:32.430 --> 00:06:35.850 given that you tested positive on the test. 00:06:36.990 --> 00:06:40.230 Take a minute and think about it and write it down. 00:06:41.020 --> 00:06:45.080 But we don't want to just guess about medical conditions, 00:06:45.080 --> 00:06:48.340 about probabilities that really matter as much as this will do. 00:06:48.910 --> 00:06:53.099 Instead, we want to calculate what the probability really is. 00:06:53.680 --> 00:06:58.690 So, let's go through it carefully and show you how to use 00:06:58.690 --> 00:07:04.104 what I'll call the box method in order to calculate the real likelihood 00:07:04.104 --> 00:07:08.575 that you have the condition, given that you got a positive test result. 00:07:09.320 --> 00:07:15.550 What we need to do is to divide the population into four different groups: 00:07:15.983 --> 00:07:19.859 the group that has the condition and tested positive, 00:07:19.859 --> 00:07:22.600 the group that has the condition and tested negative, 00:07:22.900 --> 00:07:25.650 the group that doesn't have the condition and tested positive, 00:07:25.650 --> 00:07:28.660 and the group that doesn't have the condition and tested negative. 
00:07:29.490 --> 00:07:34.150 And this chart will show you a nice, simple way of organizing 00:07:34.150 --> 00:07:35.546 all of that information. 00:07:35.870 --> 00:07:43.590 Because this row, the top row, tells you all the people who tested positive. 00:07:44.479 --> 00:07:49.460 The bottom row tells you the people who tested negative. 00:07:50.034 --> 00:07:55.884 Then, the left column gives you the people who do have the medical condition, 00:07:55.884 --> 00:07:57.739 in this case, some kind of cancer. 00:07:58.520 --> 00:08:02.893 And the right column tells you the people who do not have that condition. 00:08:03.690 --> 00:08:07.948 Now what we need to do is to start filling it out with numbers. 00:08:08.940 --> 00:08:12.736 Now the first thing we need to specify is the population. 00:08:13.240 --> 00:08:16.400 In this case we want to start with a big enough population 00:08:16.400 --> 00:08:19.380 that we're not going to have a lot of fractions in the other boxes. 00:08:19.380 --> 00:08:22.639 So, let's just imagine that the population is 100,000. 00:08:23.000 --> 00:08:25.490 Make it a million or 10 million, it doesn't matter 00:08:25.490 --> 00:08:28.747 because we're going to be interested in the ratios with the different groups. 00:08:30.530 --> 00:08:33.500 We can use that 100,000 to fill out the other boxes, 00:08:33.500 --> 00:08:36.380 if we know the prevalence, or the base rate, 00:08:37.099 --> 00:08:40.245 because the base rate tells you what percentage of that 100,000 00:08:40.245 --> 00:08:43.705 actually do have the condition and don't have the condition. 00:08:44.700 --> 00:08:47.660 We imagined -- remember we're just making up numbers here -- 00:08:47.660 --> 00:08:52.180 but we imagined that the prevalence of this condition is 0.3%. 00:08:52.180 --> 00:08:56.200 And that means out of 100,000 people, there will be 300 00:08:56.200 --> 00:08:59.506 who do have the medical condition. 
00:09:01.010 --> 00:09:04.100 Well, if there are 300 who have it and there are 100,000 total, 00:09:04.100 --> 00:09:07.533 we can figure out how many don't have the medical condition by just subtracting. 00:09:07.720 --> 00:09:11.876 Which means 99,700 do not have the medical condition. 00:09:12.652 --> 00:09:13.590 Okay? 00:09:13.590 --> 00:09:17.820 Now, we've divided the population into our two columns: 00:09:17.820 --> 00:09:20.724 the ones that do and the ones that don't have the medical condition. 00:09:21.420 --> 00:09:25.730 The next step is to figure out how many are going to test positive 00:09:25.730 --> 00:09:30.147 and how many are going to test negative out of each of these groups. 00:09:30.630 --> 00:09:33.727 For that, we first need the sensitivity. 00:09:34.200 --> 00:09:38.500 The sensitivity tells us the percentage of the cases that have the condition 00:09:38.500 --> 00:09:40.070 who will test positive. 00:09:41.370 --> 00:09:45.260 So the people who have the condition are the 300. 00:09:45.690 --> 00:09:50.147 The ones who test positive are going to go up in this area 00:09:50.879 --> 00:09:56.842 and we know from the sensitivity being 0.99 or 99% 00:09:57.112 --> 00:10:03.385 that the number in that area should be 99% of 300, or 297. 00:10:04.861 --> 00:10:08.353 And of course, if that's the number that test positive, 00:10:08.693 --> 00:10:12.224 then the remainder are going to test negative 00:10:12.224 --> 00:10:13.874 and that means that we'll have three. 00:10:14.410 --> 00:10:18.680 Which shouldn't surprise you because if 99% of the cases that have it 00:10:18.680 --> 00:10:23.659 test positive, then 1% will test negative, and 1% of 300 is 3. 00:10:24.330 --> 00:10:26.070 Good: so we got the first column done. 00:10:26.760 --> 00:10:31.188 Now, the next question is going to be the specificity. 00:10:31.460 --> 00:10:37.297 We can use the specificity to figure out what goes in that next column. 
00:10:38.090 --> 00:10:43.620 If the specificity is 99 and we know 00:10:43.620 --> 00:10:50.801 that 99,700 people do not have the condition out of our sample of 100,000, 00:10:51.756 --> 00:10:58.940 well, that means that 99% of 99,700 are going to test negative 00:10:58.940 --> 00:11:02.938 because the specificity is the percentage of cases without the condition 00:11:02.938 --> 00:11:04.605 that test negative. 00:11:04.605 --> 00:11:10.865 And that means that we'll have 98,703 among the people 00:11:10.865 --> 00:11:13.903 who do not have the condition who test negative. 00:11:14.823 --> 00:11:18.157 How many are going to test positive? The rest of them. 00:11:18.467 --> 00:11:27.450 So 99,700 minus 98,703 is going to be 997. 00:11:27.980 --> 00:11:35.100 And of course, that shouldn't be surprising again, because 1% of 99,700 is 997. 00:11:36.375 --> 00:11:38.612 We only got two boxes left to fill out. 00:11:39.190 --> 00:11:40.502 How do you fill out those? 00:11:41.020 --> 00:11:46.988 Well, this box in the upper right, is the total number of people 00:11:46.988 --> 00:11:50.551 in this population of 100,000 who test positive. 00:11:51.238 --> 00:11:56.439 And so, we can get that by adding the ones that do have the condition and test positive 00:11:56.439 --> 00:11:59.750 and the ones that don't have the condition and test positive. 00:11:59.750 --> 00:12:05.587 Just add them together, and you get 1,294. 00:12:06.270 --> 00:12:13.439 And you do the same on the next row, because that blank is the area 00:12:13.439 --> 00:12:16.194 that has all the people who test negative, 00:12:16.194 --> 00:12:20.174 and 3 people who have the condition test negative, 00:12:20.624 --> 00:12:25.326 98,703 people who do not have the condition test negative, 00:12:25.626 --> 00:12:30.490 so the total is going to be 98,706. 
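The box-filling steps described so far can be sketched in a few lines of Python. This is only an illustration using the lecture's made-up numbers; the function name `fill_boxes` and the dictionary keys are hypothetical labels chosen here, and `Fraction` is used simply so that the counts come out as exact whole numbers.

```python
from fractions import Fraction


def fill_boxes(population, base_rate, sensitivity, specificity):
    """Fill in the chart: columns = has / lacks the condition, rows = test result."""
    have = population * base_rate          # left column total
    have_not = population - have           # right column total
    true_pos = have * sensitivity          # have the condition, test positive
    false_neg = have - true_pos            # have the condition, test negative
    true_neg = have_not * specificity      # no condition, test negative
    false_pos = have_not - true_neg        # no condition, test positive
    return {
        "have_condition": have,
        "no_condition": have_not,
        "true_positive": true_pos,
        "false_negative": false_neg,
        "false_positive": false_pos,
        "true_negative": true_neg,
        "total_positive": true_pos + false_pos,
        "total_negative": false_neg + true_neg,
    }


# The lecture's made-up figures: 0.3% base rate, 99% sensitivity, 99% specificity.
boxes = fill_boxes(100_000, Fraction(3, 1000), Fraction(99, 100), Fraction(99, 100))
for name, count in boxes.items():
    print(f"{name}: {count}")

# Sanity check: the two row totals must add back up to the population.
assert boxes["total_positive"] + boxes["total_negative"] == 100_000
```

Changing the population to a million or 10 million scales every box by the same factor, which is why the choice of 100,000 does not matter for the ratios.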
00:12:30.496 --> 00:12:35.060 And we can check to make sure that we got it right, 00:12:35.085 --> 00:12:43.576 by just adding them together: 1,294 plus 98,706 is equal to 100,000. 00:12:44.944 --> 00:12:46.526 Phew, we got it right. 00:12:46.526 --> 00:12:52.230 Okay, so now we've divided the population into those people who have the condition, 00:12:52.931 --> 00:12:54.774 those people who don't have the condition, 00:12:55.069 --> 00:12:59.350 and we know how many of each of those groups test positive, 00:12:59.350 --> 00:13:03.188 and how many of each of those groups test negative. 00:13:04.000 --> 00:13:08.160 The real question is what's the probability 00:13:08.160 --> 00:13:12.268 that I have cancer or the medical condition, given that I tested positive? 00:13:12.490 --> 00:13:13.940 How do we figure that out? 00:13:14.374 --> 00:13:19.960 Well, the total number of positive tests was 1,294 00:13:20.817 --> 00:13:27.334 and the people who tested positive who really had the condition was 297. 00:13:27.770 --> 00:13:34.130 So it looks like the probability of actually having the condition, 00:13:34.748 --> 00:13:43.670 given that you tested positive, is 297 out of 1,294, or about 0.23. 00:13:44.090 --> 00:13:47.235 That's 23%, less than one in four. 00:13:47.788 --> 00:13:49.092 Is that what you guessed? 00:13:49.610 --> 00:13:54.915 Most people, including most doctors, when they hear that the test is 00:13:54.915 --> 00:14:01.215 99% sensitive and 99% specific, will guess a lot higher than one in four. 00:14:01.867 --> 00:14:03.095 >> Oh my gosh! 00:14:03.095 --> 00:14:06.435 I'm a doctor, and I never would have thought that! 00:14:07.036 --> 00:14:07.920 >> Now, don't worry: 00:14:07.920 --> 00:14:11.017 she's not a physician. She's a metaphysician. 00:14:12.390 --> 00:14:16.180 >> But in this case, the probability really is just one in four 00:14:16.180 --> 00:14:17.914 that you had that medical condition. 00:14:17.920 --> 00:14:19.124 Now how did that happen? 
00:14:19.630 --> 00:14:23.260 The reason was that the prevalence or the base rate was so low 00:14:23.840 --> 00:14:27.589 that even a small rate of false positives, 00:14:28.211 --> 00:14:33.095 given the massive numbers of people who don't have the condition, 00:14:33.642 --> 00:14:37.152 will mean that there are more false positives, more than 3 times as many, 00:14:37.642 --> 00:14:39.332 as there are true positives. 00:14:39.642 --> 00:14:42.744 And that's why the probability is just one in four, 00:14:42.744 --> 00:14:44.637 actually a little less than one in four, 00:14:44.790 --> 00:14:48.600 that you have the medical condition even when you tested positive. 00:14:49.150 --> 00:14:53.330 I want to add a quick caveat here, in order to avoid misinterpretation, 00:14:53.780 --> 00:14:58.628 because the point here is that, if you have a screening test for a condition 00:14:58.630 --> 00:15:03.600 with a very low base rate or prevalence, and you don't have any symptoms 00:15:03.600 --> 00:15:09.780 that put you in a special category, then, you need to get another test 00:15:10.411 --> 00:15:14.871 before you jump to any conclusions about having the medical condition. 00:15:15.890 --> 00:15:19.890 Because, if you have that other test, then the fact that you tested positive 00:15:19.890 --> 00:15:22.540 on the first test puts you in a smaller class, 00:15:22.540 --> 00:15:24.995 with a much higher base rate, or prevalence. 00:15:25.000 --> 00:15:27.872 And now, the probability's going to go up. 00:15:28.890 --> 00:15:32.450 Most doctors know that, and that's why, after the first test, 00:15:32.450 --> 00:15:35.340 they don't jump to conclusions, and they order another test, 00:15:35.340 --> 00:15:39.380 but many patients don't realize that and they get extremely worried 00:15:39.380 --> 00:15:42.256 after a single test even when they don't have any symptoms. 
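To see why a second test changes things so much, here is a rough Python sketch. It rests on assumptions that are not in the lecture: the second test is taken to have the same 99% sensitivity and specificity as the first, and its errors are taken to be independent of the first test's errors. The function name `posterior` is just a label chosen for this sketch.

```python
def posterior(prior, sensitivity, specificity):
    """Probability of having the condition, given one positive test result."""
    true_pos = prior * sensitivity                # share of people: condition and positive
    false_pos = (1 - prior) * (1 - specificity)   # share of people: no condition but positive
    return true_pos / (true_pos + false_pos)


# First positive test: the prior is the 0.3% base rate from the lecture.
after_first = posterior(0.003, 0.99, 0.99)

# Second positive test (hypothetical, assumed independent): the prior is now
# the probability we reached after the first test.
after_second = posterior(after_first, 0.99, 0.99)

print(f"after one positive test:  {after_first:.3f}")   # 0.230
print(f"after two positive tests: {after_second:.3f}")  # 0.967
```

Under those assumptions, a positive result on the first test moves you from a class with a 0.3% base rate into a class with roughly a 23% base rate, and that is what lets the second positive result push the probability to about 97%.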
00:15:43.500 --> 00:15:45.910 So that's the mistake that we're trying to avoid here 00:15:45.910 --> 00:15:53.167 and that's surprising, but it actually applies to many different areas of life. 00:15:54.850 --> 00:15:59.262 It applies, for example, to medical tests with all kinds of other diseases. 00:15:59.911 --> 00:16:04.420 Not just cancer or colon cancer, but pretty much every disease 00:16:04.420 --> 00:16:06.327 where the prevalence is extremely low. 00:16:07.400 --> 00:16:09.960 It applies also to drug tests. 00:16:10.670 --> 00:16:12.520 If somebody gets a positive drug test, 00:16:12.520 --> 00:16:14.548 does that mean they really were using drugs? 00:16:14.980 --> 00:16:20.320 Well, if it's a population where the base rate or prevalence of drug use 00:16:20.320 --> 00:16:23.130 is quite low, then it might not. 00:16:24.440 --> 00:16:28.290 Of course, if you assume that the prevalence or base rate is quite high, 00:16:28.290 --> 00:16:30.439 then you're going to believe that drug test. 00:16:30.840 --> 00:16:35.100 But you need to know the facts about what the prevalence or base rate really is 00:16:35.100 --> 00:16:39.150 in order to calculate accurately the probability 00:16:39.150 --> 00:16:41.926 that this person really was using drugs. 00:16:43.400 --> 00:16:47.840 Same applies to evidence in legal trials: take eyewitnesses for example, 00:16:47.840 --> 00:16:55.110 it's very tricky, someone's trying to use their eyes as a test for what they see. 00:16:55.110 --> 00:16:57.740 They might identify a friend, or they might just say 00:16:58.230 --> 00:17:02.308 that car that did the hit-and-run accident was a Porsche. 00:17:03.420 --> 00:17:07.716 Well, how good are they at identifying Porsches? 
00:17:09.579 --> 00:17:12.932 If they get it right most of the time, but not always, 00:17:12.932 --> 00:17:17.550 and sometimes they don't get it right when it is a Porsche, 00:17:17.550 --> 00:17:22.030 then we've got the sensitivity and specificity of what they identify. 00:17:22.840 --> 00:17:26.014 And we can use that to calculate how likely it is 00:17:26.454 --> 00:17:30.350 that their evidence in the trial really is reliable or not. 00:17:31.090 --> 00:17:34.150 Another example is the prediction of future behavior. 00:17:34.950 --> 00:17:37.121 We might have some kind of marker 00:17:37.740 --> 00:17:40.550 that a certain group of people with that marker 00:17:41.030 --> 00:17:43.602 have a certain likelihood of committing crimes. 00:17:44.150 --> 00:17:49.000 But if crimes are very rare in that community and every other, 00:17:49.000 --> 00:17:55.120 then a test which has a pretty good sensitivity and specificity 00:17:55.120 --> 00:17:59.839 still might not be good enough when we're talking about something like crime 00:18:00.160 --> 00:18:04.800 that's actually very rare and has a very low prevalence or base rate 00:18:04.800 --> 00:18:06.230 in most communities. 00:18:06.670 --> 00:18:08.998 And the same applies to failing out of school. 00:18:10.700 --> 00:18:14.070 Are SAT scores or GRE scores going to be 00:18:14.070 --> 00:18:16.960 good predictors of who's going to fail out of school? 00:18:18.320 --> 00:18:21.450 Well, if very few people fail out of school, 00:18:21.450 --> 00:18:24.556 so that the prevalence or base rate is very low, 00:18:24.556 --> 00:18:27.710 then, even if they're pretty sensitive and specific, 00:18:27.710 --> 00:18:29.429 they might not be good predictors. 00:18:29.940 --> 00:18:35.301 So this same type of problem arises in a lot of different areas. 
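The dependence on the base rate in all of these examples can be made concrete with a short Python sketch. Only the 0.3% base rate and the 99% sensitivity and specificity come from the lecture; the other base rates in the list are made-up values chosen purely for illustration, and the function name is a hypothetical label.

```python
def prob_true_given_positive(base_rate, sensitivity=0.99, specificity=0.99):
    """Chance that a positive result is a true positive, for a given base rate."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# 0.003 is the lecture's figure; the remaining base rates are hypothetical.
base_rates = [0.003, 0.01, 0.05, 0.2, 0.5]
results = {rate: prob_true_given_positive(rate) for rate in base_rates}

for rate, prob in results.items():
    print(f"base rate {rate:>5}: probability given a positive = {prob:.2f}")
```

The same test that is right about a positive result less than one time in four at a 0.3% base rate is right about half the time at a 1% base rate, and almost always when half the population has the condition. That is why you need the facts about the base rate before you can trust the drug test, the eyewitness, or the predictor.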
00:18:35.840 --> 00:18:38.330 And I'm not going to go through more examples right now, 00:18:38.330 --> 00:18:41.870 but we'll have plenty of examples in the exercises at the end of this chapter. 00:18:43.690 --> 00:18:45.970 I want to end, though, by saying a few things 00:18:45.970 --> 00:18:49.261 that are a bit more technical about this method. 00:18:49.780 --> 00:18:52.490 First, there's a lot of terminology to learn, 00:18:53.043 --> 00:18:57.670 because when you read about using this method in other areas, 00:18:57.670 --> 00:19:01.211 for other types of topics, then you'll run into these terms, 00:19:01.211 --> 00:19:02.690 and it's a good idea to know them. 00:19:04.044 --> 00:19:13.860 So first, the cases where the person does have the condition and also tests positive 00:19:13.860 --> 00:19:17.060 are called hits, or true positives. 00:19:17.060 --> 00:19:18.736 Different people use different terms. 00:19:21.540 --> 00:19:27.625 The cases where the person tests positive, but they don't have the condition, 00:19:27.625 --> 00:19:31.389 are called false positives or false alarms. 00:19:33.830 --> 00:19:40.525 The cases where a person really does have the condition, but tests negative 00:19:41.120 --> 00:19:44.291 are called misses or false negatives. 00:19:47.360 --> 00:19:51.064 And the cases where the person does not have the condition 00:19:51.594 --> 00:19:54.829 and the test comes out negative are called true negatives, 00:19:54.829 --> 00:19:57.676 because they're negative and it's true that they don't have the condition. 00:19:59.620 --> 00:20:03.385 If we put together the false negatives and the true negatives, 00:20:04.250 --> 00:20:06.431 we get the total set of negatives. 00:20:07.346 --> 00:20:11.672 And if we put together the true positives and the false positives 00:20:12.112 --> 00:20:14.680 we get the total set of positives. 00:20:16.410 --> 00:20:18.970 And of course, we have the general population. 
00:20:19.350 --> 00:20:23.325 Within that population, a percentage that have the condition 00:20:23.325 --> 00:20:25.527 and a percentage that don't have the condition. 00:20:27.016 --> 00:20:29.412 Now, what's the base rate? 00:20:29.752 --> 00:20:35.250 The base rate in this population is simply the set that have the condition, 00:20:35.510 --> 00:20:41.570 divided by the total population, which is Box 7 divided by Box 9. 00:20:41.940 --> 00:20:44.815 If we use e for the evidence 00:20:45.090 --> 00:20:50.190 and h for the hypothesis being true that the condition really does exist, 00:20:50.190 --> 00:20:52.848 then that's the probability of h, 00:20:54.385 --> 00:21:02.066 and the sensitivity is going to be the total number of true positives 00:21:02.066 --> 00:21:06.232 divided by the total number of people with the condition, 00:21:06.232 --> 00:21:11.727 because it's the percentage of people who have the condition and test positive. 00:21:12.760 --> 00:21:16.100 OK? So that's the probability of e given h, 00:21:16.100 --> 00:21:20.480 and it's box one divided by box 7. 00:21:21.040 --> 00:21:26.870 The specificity in contrast is the ratio of it being a true negative 00:21:26.870 --> 00:21:31.710 to the total number of people who do not have the condition, that is, 00:21:31.710 --> 00:21:35.090 the probability of not e, that is, 00:21:35.090 --> 00:21:38.951 not having the evidence of a positive test result, 00:21:38.951 --> 00:21:43.180 given not h, given that you're in the second column, 00:21:43.180 --> 00:21:47.170 where the hypothesis is false, because you don't have the condition. 00:21:47.170 --> 00:21:52.057 So that's Box 5 divided by Box 8. 00:21:53.730 --> 00:21:55.807 That's the specificity. 00:21:56.000 --> 00:22:00.352 So we can define all of these in terms of each other. 00:22:00.760 --> 00:22:06.815 The hits divided by the total with that condition is going to be the sensitivity. 
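Written out in symbols, with the lecture's made-up numbers plugged in, the three definitions just given look like this. The box numbers are the ones named in the lecture; the notation $\neg e$ and $\neg h$ for "not e" and "not h" is an assumption made here for the written form.

```latex
\begin{align*}
\text{base rate}   &= P(h)                   = \frac{\text{Box 7}}{\text{Box 9}} = \frac{300}{100{,}000} = 0.003 \\[4pt]
\text{sensitivity} &= P(e \mid h)            = \frac{\text{Box 1}}{\text{Box 7}} = \frac{297}{300}       = 0.99  \\[4pt]
\text{specificity} &= P(\neg e \mid \neg h)  = \frac{\text{Box 5}}{\text{Box 8}} = \frac{98{,}703}{99{,}700} = 0.99
\end{align*}
```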
00:22:06.820 --> 00:22:11.130 And you can use this terminology to guide your way through this box. 00:22:11.140 --> 00:22:15.048 And the big question is again going to be what's the solution? 00:22:15.048 --> 00:22:21.660 What's the probability of the hypothesis having the condition, given the evidence, 00:22:21.660 --> 00:22:27.707 that is, a positive test result: that's going to be Box 1 divided by Box 3. 00:22:28.540 --> 00:22:31.887 And as we saw in the case that we just went through, 00:22:31.887 --> 00:22:37.070 that gives you the probability of having the medical condition, or colon cancer, 00:22:37.070 --> 00:22:39.232 given a positive test result. 00:22:39.480 --> 00:22:44.130 That's called the posterior probability, or in symbols, 00:22:44.130 --> 00:22:47.550 the probability of the hypothesis, given the evidence. 00:22:48.210 --> 00:22:53.170 So I hope this terminology helps you understand some of the discussions of this, 00:22:53.170 --> 00:22:55.856 if you go on and read about it in the literature. 00:22:56.380 --> 00:23:01.206 This procedure that we've been discussing is actually just an application 00:23:01.730 --> 00:23:06.570 of a famous theorem called Bayes' Theorem after Thomas Bayes, 00:23:06.570 --> 00:23:11.270 an 18th-century English clergyman, who was also a mathematician 00:23:11.270 --> 00:23:15.686 and proved this extremely important theorem in probability theory. 00:23:16.700 --> 00:23:22.520 Now some of you out there will use the boxes, and it'll make sense to you. 00:23:22.520 --> 00:23:26.290 But some Courserians, I assume, are mathematicians, 00:23:26.290 --> 00:23:28.344 and they want to see the mathematics behind it. 00:23:28.800 --> 00:23:32.850 So now, I want to show you how to derive Bayes' theorem 00:23:32.850 --> 00:23:36.678 from the rules of probability that we learned in earlier lectures. 00:23:37.160 --> 00:23:40.345 So for all you math nerds out there, here goes. 
00:23:41.230 --> 00:23:43.552 You start with rule 2G, 00:23:45.180 --> 00:23:50.859 apply it to the probability that the evidence and the hypothesis are both true. 00:23:51.330 --> 00:23:56.500 And by the rule, that probability is equal to the probability of the evidence, 00:23:56.500 --> 00:24:00.990 times the probability of the hypothesis, given the evidence. 00:24:02.320 --> 00:24:04.510 You have to have that conditional probability 00:24:04.510 --> 00:24:07.380 because they're not independent. 00:24:08.800 --> 00:24:14.310 Then you simply divide both sides of that by the probability of the evidence: 00:24:14.310 --> 00:24:15.790 a little simple algebra. 00:24:15.790 --> 00:24:20.221 And you end up with the probability of the hypothesis, given the evidence, 00:24:20.221 --> 00:24:24.790 is equal to the probability of the evidence and the hypothesis, 00:24:24.790 --> 00:24:28.100 divided by the probability of the evidence. 00:24:30.830 --> 00:24:34.528 Now we can do a little trick. This was ingenious. 00:24:35.300 --> 00:24:39.165 Substitute for e, something that's logically equivalent to e, 00:24:39.165 --> 00:24:45.460 namely, the evidence AND the hypothesis or the evidence AND NOT the hypothesis. 00:24:45.880 --> 00:24:48.324 Now if you think about it, you'll see that those are equivalent, 00:24:48.324 --> 00:24:51.260 because either the hypothesis has to be true 00:24:51.260 --> 00:24:54.274 or NOT the hypothesis is true. 00:24:54.274 --> 00:24:55.920 One or the other has to be true. 00:24:56.560 --> 00:25:00.033 And that means that the evidence AND the hypothesis 00:25:00.033 --> 00:25:04.774 or the evidence AND NOT the hypothesis is going to be equivalent to e. 00:25:05.020 --> 00:25:07.960 So this is equivalent to this. 00:25:08.440 --> 00:25:11.360 And because they're equivalent, we can substitute them 00:25:11.360 --> 00:25:15.376 within the formula for probability without affecting the truth values. 
00:25:15.720 --> 00:25:23.400 So we just substitute this formula in here for the e up there. 00:25:23.840 --> 00:25:27.910 And we end up with the probability of the hypothesis, given the evidence, 00:25:27.910 --> 00:25:32.240 is equal to the probability of the evidence AND the hypothesis, divided by 00:25:32.240 --> 00:25:35.220 the probability of the evidence AND the hypothesis 00:25:35.220 --> 00:25:37.513 or the evidence AND NOT the hypothesis. 00:25:37.950 --> 00:25:41.426 Now, that's not supposed to make much sense, but it helps with the derivation. 00:25:43.283 --> 00:25:47.600 The next step is to apply rule 3, because we have a disjunction. 00:25:47.600 --> 00:25:51.082 And notice the disjuncts are mutually exclusive. 00:25:51.540 --> 00:25:56.230 It cannot be true, both, that the evidence AND the hypothesis is true, 00:25:56.230 --> 00:26:00.197 and also that the evidence AND NOT the hypothesis is true, 00:26:00.197 --> 00:26:04.247 because it can't be both h and not h. 00:26:05.160 --> 00:26:08.073 So we can apply the simple version of rule 3. 00:26:08.580 --> 00:26:14.316 And that means that the probability of (e&h) or (e&~h) 00:26:14.316 --> 00:26:21.074 is equal to the probability of (e&h) + the probability of (e&~h). 00:26:21.510 --> 00:26:23.920 We're just applying that rule 3 for disjunction 00:26:23.920 --> 00:26:26.260 that we learned a few lectures ago. 00:26:27.150 --> 00:26:29.575 Now we apply rule 2G again, 00:26:29.575 --> 00:26:35.380 because we have the probability of a conjunction up in the top. 00:26:36.810 --> 00:26:41.695 And, since these are not independent of each other 00:26:41.880 --> 00:26:44.630 -- we hope not, if it's a hypothesis and the evidence for it -- 00:26:45.345 --> 00:26:48.315 then we have to use the conditional probability. 
00:26:48.830 --> 00:26:53.010 And using rule 2G, we find that the probability of the hypothesis, 00:26:53.010 --> 00:26:55.110 given the evidence, is equal to 00:26:55.110 --> 00:26:59.870 the probability of the hypothesis, times the probability of the evidence, 00:26:59.870 --> 00:27:05.130 given the hypothesis, divided by the probability of the hypothesis, 00:27:05.130 --> 00:27:08.586 times the probability of the evidence, given the hypothesis, 00:27:08.831 --> 00:27:12.648 plus the probability of the hypothesis being false, 00:27:12.648 --> 00:27:16.851 that is the probability of NOT h, times the probability of the evidence, 00:27:16.851 --> 00:27:21.130 given NOT h, or the hypothesis being false. 00:27:22.010 --> 00:27:23.170 And that's a mouthful 00:27:23.170 --> 00:27:27.070 and it's a long formula, but that's the mathematical formula 00:27:27.070 --> 00:27:33.270 that Bayes proved in the 18th century and it provides the mathematical basis 00:27:33.500 --> 00:27:36.280 for that whole system of boxes that we talked about before. 00:27:37.470 --> 00:27:42.780 But if you don't like the mathematical proof and that's too confusing for you, 00:27:42.780 --> 00:27:44.090 then use the boxes. 00:27:44.400 --> 00:27:47.396 And if you don't like the boxes, use the mathematical proof. 00:27:47.860 --> 00:27:50.706 They're both going to work: just pick the one that works for you. 00:27:50.990 --> 00:27:53.220 In fact, you don't have to pick either of them, 00:27:53.220 --> 00:27:57.120 because remember, this is an honors lecture, it's optional, 00:27:57.988 --> 00:27:59.999 and it won't be on the quiz. 00:28:00.590 --> 00:28:04.270 But if you do want to try this method, and make sure that you understand it, 00:28:05.075 --> 00:28:08.437 we'll have a bunch of exercises for you, where you can test your skills.
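For anyone who would rather read the derivation than hear it, the spoken steps can be written out as follows. The symbols $\land$, $\lor$ and $\neg$ for AND, OR and NOT are a notational assumption made here; the rule names are the ones used in the lecture. The final line plugs in the lecture's made-up numbers, using the fact that the probability of a positive result without the condition is one minus the specificity, that is, $1 - 0.99 = 0.01$.

```latex
\begin{align*}
P(e \land h) &= P(e)\,P(h \mid e)
  && \text{rule 2G} \\[4pt]
P(h \mid e) &= \frac{P(e \land h)}{P(e)}
  && \text{divide both sides by } P(e) \\[4pt]
&= \frac{P(e \land h)}{P\bigl((e \land h) \lor (e \land \neg h)\bigr)}
  && \text{substitute an equivalent for } e \\[4pt]
&= \frac{P(e \land h)}{P(e \land h) + P(e \land \neg h)}
  && \text{rule 3, exclusive disjuncts} \\[4pt]
&= \frac{P(h)\,P(e \mid h)}{P(h)\,P(e \mid h) + P(\neg h)\,P(e \mid \neg h)}
  && \text{rule 2G again} \\[4pt]
&= \frac{0.003 \times 0.99}{0.003 \times 0.99 + 0.997 \times 0.01}
   = \frac{0.00297}{0.01294} \approx 0.23
\end{align*}
```

The result agrees with the boxes: 0.00297 over 0.01294 is the same ratio as 297 over 1,294.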