Lecture 9-7 - Bayes' Theorem Honors

From the Think Again: How to Reason and Argue course on Coursera

Revision 5, created 07/04/2014 by Claude Almansi.

  1. Coins and dice provide a nice simple model
    of how to calculate probabilities, but
  2. everyday life is a lot more complicated
    and it's not taken up with gambling.
  3. At least, I hope your life is not taken up
    with gambling.
  4. So in order to make probabilities more
    applicable to everyday life,
  5. we need to look at slightly more
    complicated methods.
  6. Now, because these methods
    are more complicated,
  7. this lecture is going to be
    an honors lecture: it's optional.
  8. It will not be on the quiz,
  9. so don't get worried about that.
  10. But it is still useful, and it's fascinating,
  11. and it'll help you avoid some mistakes
  12. that a lot of people make
    and that create a lot of problems.
  13. And so I hope you'll stick with it and listen to this lecture.
  14. And there will be exercises
    to help you figure out
  15. whether you understand
    the material or not.
  16. But don't get too worried, because
    it's not going to be on the quiz.
  17. The real problem
    that we'll be facing in this lecture
  18. is the problem of tests.
  19. We use tests all the time:
    we use tests to figure out
  20. whether you have
    a certain medical condition.
  21. We use tests to predict the weather
    or to predict people's future behavior.
  22. We have certain indicators
    of how they're going to act,
  23. either commit a crime
    or not commit a crime,
  24. but also whether they're going to pass,
  25. do well in school or fail.
  26. We always use these tests
    when we don't know for certain,
  27. but we want some kind of evidence,
    or some kind of indicator.
  28. The problem is none of these tests
    are perfect.
  29. They always contain errors
    of various sorts.
  30. And what we're going to have to do is to
    see how to take
  31. those errors of different sorts
    and build them together into a method
  32. and then a formula for calculating
    how reliable the method is
  33. for detecting the thing that we want to detect.
  34. This problem is a lot like the problem
    we faced earlier
  35. when we were talking about applying
    generalizations to particular cases
  36. because here we're going to be applying
    probabilities to particular cases.
  37. So it'll seem familiar to you in certain parts,
  38. but you'll see that this case
    is a little trickier.
  39. The best examples occur in medicine.
  40. So just imagine that you go to your doctor
    for a regular checkup.
  41. You don't have any special symptoms,
  42. but he decides to do
    a few screening tests.
  43. And unfortunately, and very worryingly,
    it turns out that you test positive
  44. on one test for a particular form of cancer,
    a certain kind of medical condition.
  45. Well, what that means is that you might
    have cancer.
  46. Might, great.
  47. You want to know whether you do have
    cancer.
  48. But of course, finding out for sure
    whether or not you have cancer
  49. is going to take further tests.
  50. And those tests might be expensive,
    they might be dangerous,
  51. they're going to be invasive
    in various ways.
  52. So you really want to know what's the
    probability,
  53. given that you've tested positive
    on this one test,
  54. that you really have cancer.
  55. Now clearly that probability is going
    to depend on a number of facts
  56. about this type of cancer,
    about the type of test and so on.
  57. And I am not a doctor.
  58. I am not giving you medical advice.
  59. If you test positive on a test,
    go talk to your doctor,
  60. don't trust me, because I'm just
    making up numbers here.
  61. But let's do make up a few numbers
    and figure out
  62. what the likelihood is of having cancer,
    given that you tested positive.
  63. So let's imagine that the base rate
    of this particular type of cancer
  64. in the population is 0.3%, that is,
    3 out of 1,000, or 0.003.
  65. And we say that's the base rate,
  66. or it's sometimes called the prevalence
    of the condition in the population.
  67. That's simply to say that out of 1,000
    people chosen randomly
  68. in the population, you'd get about 3
    that have this condition.
  69. It's just a percentage
    of the general population.
  70. So that's the condition, what about the
    test?
  71. Well the first thing we want to know
    is the sensitivity of the test.
  72. The sensitivity of the test we're going to
    assume is 0.99.
  73. And what that means is that out of
    100 people who have this condition,
  74. 99 of them will test positive.
  75. So this test is pretty good at figuring out,
  76. from among the people
    who have the condition, which ones do.
  77. 99 of those 100 people who have the
    condition will test positive.
  78. The other feature is specificity, and what
    that means is
  79. the percentage of the people who don't
    have the condition who will test negative.
  80. The point here is you're not going
    to get a positive result
  81. for people who don't have the condition,
  82. because you want it to be specific
    to this particular condition
  83. and not get a bunch of positives for
    people who have other types of conditions
  84. or no medical condition at all.
  85. So the specificity we're going to assume,
  86. in this particular case we're talking about, is also 99%.
  87. Now, what we want to know is the probability
    that you have a cancer, a condition,
  88. given that you tested positive on the test;
  89. but notice that the sensitivity
    tells you the probability
  90. that you will test positive
    given that you have the condition.
  91. We want to know the opposite of that,
  92. the probability
    that you have the condition
  93. given that you tested positive.
  94. And that's what we have to do
    a little calculation to figure out.
  95. But before we do that calculation,
    I want you to think about these figures
  96. that I've given you:
    the prevalence in the population,
  97. the sensitivity of the test,
    the specificity of the test,
  98. and just make a guess.
  99. Just start out by writing down
    on a piece of paper
  100. what you think the probability is
    that you would have the cancer
  101. given that you tested positive
    on the test.
  102. Take a minute and think about it
    and write it down.
  103. But we don't want to just guess
    about medical conditions,
  104. about probabilities that really matter
    as much as this one does.
  105. Instead, we want to calculate what the
    probability really is.
  106. So, let's go through it carefully and
    show you how to use
  107. what I'll call the box method in order
    to calculate the real likelihood
  108. that you have the condition, given that
    you got a positive test result.
  109. What we need to do is to divide the
    population into four different groups:
  110. the group that has the condition
    and tested positive,
  111. the group that has the condition
    and tested negative,
  112. the group that doesn't have the condition
    and tested positive,
  113. and the group that doesn't have
    the condition and tested negative.
  114. And this chart will show you a nice,
    simple way of organizing
  115. all of that information.
  116. Because this row, the top row, tells
    you all the people who tested positive.
  117. The bottom row tells you the people
    who tested negative.
  118. Then, the left column gives you the
    people who do have the medical condition,
  119. in this case, some kind of cancer.
  120. And the right column tells you the people
    who do not have that condition.
  121. Now what we need to do is to start
    filling it out with numbers.
  122. Now the first thing we need to specify is
    the population.
  123. In this case we want to start with a big
    enough population
  124. that we're not going to have a lot
    of fractions in the other boxes.
  125. So, let's just imagine that the population
    is 100,000.
  126. Make it a million or 10 million,
    it doesn't matter
  127. because we're going to be interested
    in the ratios with the different groups.
  128. We can use that 100,000 to fill out the
    other boxes,
  129. if we know the prevalence, or the
    base rate,
  130. because the base rate tells you what
    percentage of that 100,000
  131. actually do have the condition and
    don't have the condition.
  132. We imagined -- remember we're just
    making up numbers here --
  133. but we imagined that the prevalence
    of this condition is 0.3%.
  134. And that means out of 100,000 people,
    there will be 300
  135. who do have the medical condition.
  136. Well, if there are 300 who have it and
    there are 100,000 total,
  137. we can figure out how many don't have the
    medical condition by just subtracting.
  138. Which means 99,700
    do not have the medical condition.
  139. Okay?
  140. Now, we've divided the population into our
    two columns:
  141. the ones that do and the ones that don't
    have the medical condition.
  142. The next step is to figure out how many
    are going to test positive
  143. and how many are going to test negative
    out of each of these groups.
  144. For that, we first need the sensitivity.
  145. The sensitivity tells us the percentage
    of the cases that have the condition
  146. who will test positive.
  147. So the people who have the condition are
    the 300.
  148. The ones who test positive are going
    to go up in this area
  149. and we know from the sensitivity being 0.99 or 99%
  150. that the number in that area should be 99%
    of 300, or 297.
  151. And of course, if that's the number
    that test positive,
  152. then the remainder
    are going to test negative
  153. and that means that we'll have three.
  154. Which shouldn't surprise you because if
    99% of the cases that have it
  155. test positive, then 1% will test negative,
    and 1% of 300 is 3.
  156. Good: so we got the first column done.
  157. Now, the next question is going to be the
    second column.
  158. We can use the specificity to figure out
    what goes in that next column.
  159. If the specificity is 99% and we know
  160. that 99,700 people do not have the
    condition out of our sample of 100,000,
  161. well, that means that 99% of 99,700 are
    going to test negative
  162. because the specificity is the
    percentage of cases without the condition
  163. that test negative.
  164. And that means that we'll have
    98,703 among the people
  165. who do not have the condition
    who test negative.
  166. How many are going to test positive?
    The rest of them.
  167. So 99,700 minus 98,703
    is going to be 997.
  168. And of course, that shouldn't be surprising
    again, because 1% of 99,700 is 997.
  169. We've only got two boxes left to fill out.
  170. How do you fill out those?
  171. Well, this box in the upper right
    is the total number of people
  172. in this population of 100,000
    who test positive.
  173. And so, we can get that by adding the ones
    that do have the condition and test positive
  174. and the ones that don't have
    the condition and test positive.
  175. Just add them together, and you get 1,294.
  176. And you do the same on the next row,
    because that blank is the area
  177. that has all the people
    who test negative,
  178. and 3 people who have the condition
    test negative,
  179. 98,703 people who do not have the
    condition test negative,
  180. so the total is going to be 98,706.
  181. And we can check to make sure that
    we got it right,
  182. by just adding them together:
    1,294 plus 98,706 is equal to 100,000.
  183. Phew, we got it right.
  184. Okay, so now we've divided the population
    into those people who have the condition,
  185. those people who don't have the
    condition,
  186. and we know how many of each
    of those groups test positive,
  187. and how many of each of those groups
    test negative.
  188. The real question is
    what's the probability
  189. that I have cancer or the medical
    condition, given that I tested positive?
  190. How do we figure that out?
  191. Well, the total number
    of positive tests was 1,294
  192. and the people who tested positive
    who really had the condition was 297.
  193. So it looks like the probability of
    actually having the condition,
  194. given that you tested positive,
    is 297 out of 1294 or 0.23.
  195. That's 23%, less than one in four.
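The whole table the lecturer just filled in can be sketched in a few lines of Python, using the same made-up numbers from the lecture (0.3% prevalence, 99% sensitivity, 99% specificity):

```python
# The lecture's "box method": divide a population of 100,000 into the
# four groups and compute the probability of having the condition
# given a positive test.
population = 100_000
prevalence = 0.003    # base rate: 3 in 1,000
sensitivity = 0.99    # P(test positive | have condition)
specificity = 0.99    # P(test negative | don't have condition)

have = round(population * prevalence)        # 300 have the condition
have_not = population - have                 # 99,700 do not

true_pos = round(have * sensitivity)         # 297 test positive
false_neg = have - true_pos                  # 3 test negative
true_neg = round(have_not * specificity)     # 98,703 test negative
false_pos = have_not - true_neg              # 997 test positive

all_pos = true_pos + false_pos               # 1,294 positives in total
print(true_pos / all_pos)                    # ≈ 0.2295, about 23%
```

The surprising ratio falls straight out of the counts: 997 false positives swamp the 297 true ones.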
  196. Is that what you guessed?
  197. Most people, including most doctors, when
    they hear that the test is
  198. 99% sensitive and 99% specific, will
    guess a lot higher than one in four.
  199. >> Oh my gosh!
  200. I'm a doctor, and I never would have
    thought that!
  201. >> Now, don't worry:
  202. she's not a physician,
    she's a metaphysician.
  203. >> But in this case, the probability
    really is just one in four
  204. that you had that medical condition.
  205. Now how did that happen?
  206. The reason was that the prevalence or the
    base rate was so low
  207. that even a small rate
    of false positives,
  208. given the massive numbers of people who
    don't have the condition,
  209. will mean that there are more false positives,
    more than 3 times as many,
  210. as there are true positives.
  211. And that's why the probability
    is just one in four,
  212. actually a little less than one in four,
  213. that you have the medical condition even
    when you tested positive.
  214. I want to add a quick caveat here, in
    order to avoid misinterpretation.
  215. because the point here is that, if you
    have a screening test for a condition
  216. with a very low base rate or prevalence,
    and you don't have any symptoms
  217. that put you in a special category,
    then, you need to get another test
  218. before you jump to any conclusions
    about having the medical condition.
  219. Because, if you have that other test,
    then the fact that you tested positive
  220. on the first test puts you in a smaller class,
  221. with a much higher base rate, or prevalence.
  222. And now, the probability's going to go up.
  223. Most doctors know that, and that's why,
    after the first test,
  224. they don't jump to conclusions, and they
    order another test,
  225. but many patients don't realize that and
    they get extremely worried
  226. after a single test even when they don't
    have any symptoms.
  227. So that's the mistake
    that we're trying to avoid here
  228. and that's surprising, but it actually
    applies to many different areas of life.
  229. It applies, for example, to medical tests
    with all kinds of other diseases.
  230. Not just cancer or colon cancer, but
    pretty much every disease
  231. where the prevalence is extremely low.
  232. It applies also to drug tests.
  233. If somebody gets a positive drug test,
  234. does that mean they really
    were using drugs?
  235. Well, if it's a population where the
    base rate or prevalence of drug use
  236. is quite low, then it might not.
  237. Of course, if you assume that the
    prevalence or base rate is quite high,
  238. then you're going to believe
    that drug test.
  239. But you need to know the facts about what
    the prevalence or base rate really is
  240. in order to calculate
    accurately the probability
  241. that this person really was using drugs.
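The lecturer gives no numbers for the drug-test case, so the figures below are purely hypothetical, chosen only to show the same base-rate effect in miniature:

```python
# Hypothetical drug test (numbers NOT from the lecture): 95% sensitive,
# 95% specific, in a population where only 1% actually use drugs.
base_rate = 0.01
sensitivity = 0.95
specificity = 0.95

true_pos = base_rate * sensitivity                # P(positive & user)
false_pos = (1 - base_rate) * (1 - specificity)   # P(positive & non-user)
p_user_given_pos = true_pos / (true_pos + false_pos)
print(round(p_user_given_pos, 2))                 # ≈ 0.16
```

Even with a seemingly accurate test, a positive result here means only about a one-in-six chance the person was actually using drugs.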
  242. Same applies to evidence in legal trials:
    take eyewitnesses for example,
  243. it's very tricky, someone's trying to use
    their eyes as a test for what they see.
  244. They might identify a friend,
    or they might just say
  245. that car that did the hit-and-run accident
    was a Porsche.
  246. Well, how good are they at identifying
    a Porsche?
  247. If they get it right most of the time,
    but not always,
  248. and sometimes they don't get it right
    when it is a Porsche,
  249. then we've got the sensitivity and
    specificity of what they identify.
  250. And we can use that to calculate
    how likely it is
  251. that their evidence in the trial
    really is reliable or not.
  252. Another example is the prediction of
    future behavior.
  253. We might have some kind of marker
  254. that a certain group of people
    with that marker
  255. have a certain likelihood of
    committing crimes.
  256. But if crimes are very rare
    in that community and every other,
  257. then a test which has a pretty good
    sensitivity and specificity
  258. still might not be good enough when
    we're talking about something like crime
  259. that's actually very rare and has
    a very low prevalence or base rate
  260. in most communities.
  261. And the same applies
    to failing out of school.
  262. Are SAT scores or GRE scores
    going to be
  263. good predictors of
    who's going to fail out of school?
  264. Well, if very few people fail out of
    school,
  265. so that the prevalence and base rate
    is very low,
  266. then, even if they're
    pretty sensitive and specific,
  267. they might not be good predictors.
  268. So this same type of problem arises
    in a lot of different areas.
  269. And I'm not going to go through
    more examples right now,
  270. but we'll have plenty of examples in the
    exercises at the end of this chapter.
  271. I want to end, though,
    by saying a few things
  272. that are a bit more technical
    about this method.
  273. First, there's a lot of terminology to
    learn,
  274. because when you read about using
    this method in other areas,
  275. for other types of topics,
    then you'll run into these terms,
  276. and it's a good idea to know them.
  277. So first, the cases where the person does
    have the condition and also tests positive
  278. are called hits, or true positives.
  279. Different people use different terms.
  280. The cases where the person tests positive,
    but they don't have the condition,
  281. are called false positives
    or false alarms.
  282. The cases where a person really does have
    the condition, but tests negative
  283. are called misses or false negatives.
  284. And the cases where the person
    does not have the condition
  285. and the test comes out negative
    are called true negatives,
  286. because they're negative and it's true
    that they don't have the condition.
  287. If we put together the false negatives,
    and the true negatives,
  288. we get the total set of negatives.
  289. And if we put together the true positives
    and the false positives
  290. we get the total set of positives.
  291. And of course, we have the general
    population.
  292. Within that population,
    a percentage that have the condition
  293. and a percentage
    that don't have the condition.
  294. Now, what's the base rate?
  295. The base rate in this population is simply
    the set that have the condition,
  296. divided by the total population,
    which is Box 7 divided by Box 9.
  297. If we use e for the evidence
  298. and h for the hypothesis being true that
    the condition really does exist,
  299. then that's the probability of h,
  300. and the sensitivity is going to be
    the total number of true positives
  301. divided by the total number of people
    with the condition,
  302. because it's the percentage of people who
    have the condition and test positive.
  303. OK? So that's the probability of e given h,
  304. and it's Box 1 divided by Box 7.
  305. The specificity in contrast is the ratio
    of it being a true negative
  306. to the total number of people
    who do not have the condition, that is,
  307. the probability of not e, that is,
  308. not having the evidence
    of a positive test result,
  309. given not h,
    given that you're in the second column,
  310. where the hypothesis is false,
    because you don't have the condition.
  311. So that's Box 5 divided by Box 8.
  312. That's the specificity.
  313. So we can define all of these
    in terms of each other.
  314. The hits divided by the total with that
    condition is going to be the sensitivity.
  315. And you can use this terminology to guide
    your way through this box.
  316. And the big question is again going to be
    what's the solution?
  317. What's the probability of the hypothesis
    having the condition, given the evidence,
  318. that is, a positive test result:
    that's going to be Box 1 divided by Box 3.
  319. And as we saw in the case that we just
    went through,
  320. that gives you the probability of having
    the medical condition, or colon cancer,
  321. given a positive test result.
  322. That's called the posterior probability,
    or in symbols,
  323. the probability of the hypothesis,
    given the evidence.
  324. So I hope this terminology helps you
    understand some of the discussions of this,
  325. if you go on and read about it
    in the literature.
  326. This procedure that we've been discussing
    is actually just an application
  327. of a famous theorem called Bayes' Theorem
    after Thomas Bayes,
    an 18th-century English clergyman,
    who was also a mathematician
  329. and proved this extremely important
    theorem in probability theory.
  330. Now some of you out there will use the
    boxes, and it'll make sense to you.
  331. But some Courserians, I assume,
    are mathematicians,
  332. and they want to see
    the mathematics behind it.
  333. So now, I want to show you how to derive
    Bayes' theorem
  334. from the rules of probability
    that we learned in earlier lectures.
  335. So for all you math nerds out there,
    here goes.
  336. You start with rule 2G,
  337. apply it to the probability that the
    evidence and the hypothesis are both true.
  338. And by the rule, that probability is
    equal to the probability of the evidence,
  339. times the probability of the hypothesis,
    given the evidence.
  340. You have to have
    that conditional probability
  341. because they're not independent.
  342. Then you simply divide both sides of that
    by the probability of the evidence:
  343. a little simple algebra.
  344. And you end up with the probability
    of the hypothesis, given the evidence,
  345. is equal to the probability
    of the evidence and the hypothesis,
  346. divided by the probability
    of the evidence.
  347. Now we can do a little trick.
    This was ingenious.
  348. Substitute for e, something
    that's logically equivalent to e,
  349. namely, the evidence AND the hypothesis
    or the evidence AND NOT the hypothesis.
  350. Now if you think about it, you'll see
    that those are equivalent,
  351. because either the hypothesis
    has to be true
  352. or NOT the hypothesis is true.
  353. One or the other has to be true.
  354. And that means that the evidence
    AND the hypothesis
  355. or the evidence AND NOT the hypothesis
    is going to be equivalent to e.
  356. So this is equivalent to this.
  357. And because they're equivalent,
    we can substitute them
  358. within the formula for probability
    without affecting the truth values.
  359. So we just substitute this formula in
    here for the e up there.
  360. And we end up with the probability of the
    hypothesis, given the evidence,
  361. is equal to the probability of the
    evidence AND the hypothesis, divided by
  362. the probability of the evidence
    AND the hypothesis
  363. or the evidence AND NOT the hypothesis.
  364. Now, that's not supposed to make much
    sense, but it helps with the derivation.
  365. The next step is to apply rule 3, because
    we have a disjunction.
  366. And notice the disjuncts are mutually
    exclusive.
  367. It cannot be true, both, that the evidence
    AND the hypothesis is true,
  368. and also that the evidence
    AND NOT the hypothesis is true,
  369. because it can't be both h and not h.
  370. So we can apply the simple version
    of rule 3.
  371. And that means that the probability of
    (e&h) or (e&~h)
  372. is equal to the probability of (e&h)
    + the probability of (e&~h).
  373. We're just applying
    that rule 3 for disjunction
  374. that we learned a few lectures ago.
  375. Now we apply rule 2G again,
  376. because we have the probability
    of a conjunction up in the top.
  377. And, since these are not independent of
    each other
  378. -- we hope not, if it's a hypothesis
    and the evidence for it --
  379. then we have to use
    the conditional probability.
  380. And using rule 2G, we find that
    the probability of the hypothesis,
  381. given the evidence, is equal to
  382. the probability of the hypothesis, times
    the probability of the evidence,
  383. given the hypothesis, divided by
    the probability of the hypothesis,
  384. times the probability of the evidence,
    given the hypothesis,
  385. plus the probability
    of the hypothesis being false,
  386. that is the probability of NOT h,
    times the probability of the evidence,
  387. given NOT h, or the hypothesis being false.
  388. And that's a mouthful
  389. and it's a long formula,
    but that's the mathematical formula
  390. that Bayes proved in the 18th century
    and it provides the mathematical basis
  391. for that whole system of boxes
    that we talked about before.
  392. But if you don't like the mathematical
    proof and that's too confusing for you,
  393. then use the boxes.
  394. And if you don't like the boxes,
    use the mathematical proof.
  395. They're both going to work:
    just pick the one that works for you.
  396. In fact, you don't have to pick
    either of them,
  397. because remember, this is an honors
    lecture, it's optional,
  398. and it won't be on the quiz.
  399. But if you do want to try this method,
    and make sure that you understand it,
  400. we'll have a bunch of exercises for you,
    where you can test your skills.