## Lecture 9-7 - Bayes Theorem Honors


From the Think Again: How to Reason and Argue course on Coursera


Showing revision 5, created 07/04/2014 by Claude Almansi.

1. Coins and dice provide a nice simple model
of how to calculate probabilities, but
2. everyday life is a lot more complicated
and it's not taken up with gambling.
3. At least, I hope your life is not taken up
with gambling.
4. So in order to make probabilities more
applicable to everyday life,
5. we need to look at, slightly more
complicated methods.
6. Now, because these methods
are more complicated,
7. this lecture is going to be
an honors lecture: it's optional.
8. It will not be on the quiz,
9. so don't get worried about that.
10. But it is still useful, and it's fascinating,
because there are mistakes that a lot of people make
and that create a lot of problems.
13. And so I hope you'll stick with it and listen to this lecture.
14. And there will be exercises
15. so you can check whether you understand
the material or not.
16. But don't get too worried, because
it's not going to be on the quiz.
17. The real problem
that we'll be facing in this lecture
18. is the problem of tests.
19. We use tests all the time:
we use tests to figure out
20. whether you have
a certain medical condition.
21. We use tests to predict the weather
or to predict people's future behavior.
22. We have certain indicators
of how they're going to act,
23. either commit a crime
or not commit a crime,
24. but also whether they're going to pass,
25. do well in school or fail.
26. We always use these tests
when we don't know for certain,
27. but we want some kind of evidence,
or some kind of indicator.
28. The problem is none of these tests
are perfect.
29. They always contain errors
of various sorts.
30. And what we're going to have to do is to
see how to take
31. those errors of different sorts
and build them together into a method
32. and then a formula for calculating
how reliable the method is
33. for detecting the thing that we want to detect.
34. This problem is a lot like the problem
we faced earlier
35. when we were talking about applying
generalizations to particular cases
36. because here we're going to be applying
probabilities to particular cases.
37. So it'll seem familiar to you in certain parts,
38. but you'll see that this case
is a little trickier.
39. The best examples occur in medicine.
40. So just imagine that you go to your doctor
for a regular checkup.
41. You don't have any special symptoms,
42. but he decides to do
a few screening tests.
43. And unfortunately, and very worryingly,
it turns out that you test positive
44. on one test for a particular form of cancer,
a certain kind of medical condition.
45. Well, what that means is that you might
have cancer.
46. Might, great.
47. You want to know whether you do have
cancer.
48. But of course, finding out for sure
whether or not you have cancer
49. is going to take further tests.
50. And those tests might be expensive,
they might be dangerous,
51. they're going to be invasive
in various ways.
52. So you really want to know what's the
probability,
53. given that you've tested positive
on this one test,
54. that you really have cancer.
55. Now clearly that probability is going
to depend on a number of facts
about the type of test and so on.
57. And I am not a doctor.
58. I am not giving you medical advice.
59. If you test positive on a test,
60. don't trust me, because I'm just
making up numbers here.
61. But let's do make up a few numbers
and figure out
62. what the likelihood is of having cancer,
given that you tested positive.
63. So let's imagine that the base rate
of this particular type of cancer
64. in the population is 0.3%, that is,
3 out of 1,000, or 0.003.
65. And they say that's the base rate,
66. or it's sometimes called the prevalence
of the condition in the population.
67. That's simply to say that out of 1,000
people chosen randomly
68. in the population, you'd get about 3
that have this condition.
69. It's just a percentage
of the general population.
70. So that's the condition, what about the
test?
71. Well the first thing we want to know
is the sensitivity of the test.
72. The sensitivity of the test we're going to
assume is 0.99.
73. And what that means is that out of
100 people who have this condition,
74. 99 of them will test positive.
75. So this test is pretty good at figuring
out,
76. from among the people
who have the condition, which ones do.
77. 99 of those 100 people who have the
condition will test positive.
78. The other feature is specificity, and what
that means is
79. the percentage of the people who don't
have the condition who will test negative.
80. The point here is you're not going
to get a positive result
81. for people who don't have the condition,
right?
82. Because you want it to be specific
to this particular condition
83. and not get a bunch of positives for
people who have other types of conditions
84. or no medical condition at all.
85. So the specificity we're going to assume,
86. in this particular case we're talking about, is also 99%.
87. Now, what we want to know is the probability
that you have a cancer, a condition,
88. given that you tested positive on the test;
89. but notice that the sensitivity
tells you the probability
90. that you will test positive
given that you have the condition.
91. We want to know the opposite of that,
92. the probability
that you have the condition
93. given that you tested positive.
94. And that's what we have to do
a little calculation to figure out.
95. But before we do that calculation,
I want you to think about these figures
96. that I've given you:
the prevalence in the population,
97. the sensitivity of the test,
the specificity of the test,
98. and just make a guess.
99. Just start out by writing down
on a piece of paper
100. what you think the probability is
that you would have the cancer
101. given that you tested positive
on the test.
102. Take a minute and think about it
and write it down.
103. But we don't want to just guess
104. about probabilities that really matter
as much as this will do.
105. Instead, we want to calculate what the
probability really is.
106. So, let's go through it carefully and
show you how to use
107. what I'll call the box method in order
to calculate the real likelihood
108. that you have the condition, given that
you got a positive test result.
109. What we need to do is to divide the
population into four different groups:
110. the group that has the condition
and tested positive,
111. the group that has the condition
and tested negative,
112. the group that doesn't have the condition
and tested positive,
113. and the group that doesn't have
the condition and tested negative.
114. And this chart will show you a nice,
simple way of organizing
115. all of that information.
116. Because this row, the top row, tells
you all the people who tested positive.
117. The bottom row tells you the people
who tested negative.
118. Then, the left column gives you the
people who do have the medical condition,
119. in this case, some kind of cancer.
120. And the right column tells you the people
who do not have that condition.
121. Now what we need to do is to start
filling it out with numbers.
122. Now the first thing we need to specify is
the population, and we want to pick a large
enough population
124. that we're not going to have a lot
of fractions in the other boxes.
125. So, let's just imagine that the population
is 100,000.
126. Make it a million or 10 million,
it doesn't matter
127. because we're going to be interested
in the ratios with the different groups.
128. We can use that 100,000 to fill out the
other boxes,
129. if we know the prevalence, or the
base rate,
130. because the base rate tells you what
percentage of that 100,000
131. actually do have the condition and
don't have the condition.
132. We imagined -- remember we're just
making up numbers here --
133. but we imagined that the prevalence
of this condition is 0.3%.
134. And that means out of 100,000 people,
there will be 300
135. who do have the medical condition.
136. Well, if there are 300 who have it and
there are 100,000 total,
137. we can figure out how many don't have the
medical condition by just subtracting.
138. Which means 99,700
do not have the medical condition.
139. Okay?
140. Now, we've divided the population into our
two columns:
141. the ones that do and the ones that don't
have the medical condition.
142. The next step is to figure out how many
are going to test positive
143. and how many are going to test negative
out of each of these groups.
144. For that, we first need the sensitivity.
145. The sensitivity tells us the percentage
of the cases that have the condition
146. who will test positive.
147. So the people who have the condition are
the 300.
148. The ones who test positive are going
to go up in this area
149. and we know from the sensitivity being 0.99 or 99%
150. that the number in that area should be 99%
of 300, or 297.
151. And of course, if that's the number
that test positive,
152. then the remainder
are going to test negative
153. and that means that we'll have three.
154. Which shouldn't surprise you because if
99% of the cases that have it
155. test positive, then 1% will test negative,
and 1% of 300 is 3.
156. Good: so we got the first column done.
157. Now, the next question is going to be the
specificity.
158. We can use the specificity to figure out
what goes in that next column.
159. If the specificity is 99% and we know
160. that 99,700 people do not have the
condition out of our sample of 100,000,
161. well, that means that 99% of 99,700 are
going to test negative
162. because the specificity is the
percentage of cases without the condition
163. that test negative.
164. And that means that we'll have
98,703 among the people
165. who do not have the condition
who test negative.
166. How many are going to test positive?
The rest of them.
167. So 99,700 minus 98,703
is going to be 997.
168. And of course, that shouldn't be surprising
again, because 1% of 99,700 is 997.
169. We've only got two boxes left to fill out.
170. How do you fill out those?
171. Well, this box in the upper right
is the total number of people
172. in this population of 100,000
who test positive.
173. And so, we can get that by adding the ones
that do have the condition and test positive
174. and the ones that don't have
the condition and test positive.
175. Just add them together, and you get 1,294.
176. And you do the same on the next row,
because that blank is the area
177. that has all the people
who test negative,
178. and 3 people who have the condition
test negative,
179. 98,703 people who do not have the
condition test negative,
180. so the total is going to be 98,706.
181. And we can check to make sure that
we got it right,
182. by just adding them together:
1,294 plus 98,706 is equal to 100,000.
183. Phew, we got it right.
184. Okay, so now we've divided the population
into those people who have the condition,
185. those people who don't have the
condition,
186. and we know how many of each
of those groups test positive,
187. and how many of each of those groups
test negative.
188. The real question is
what's the probability
189. that I have cancer or the medical
condition, given that I tested positive?
190. How do we figure that out?
191. Well, the total number
of positive tests was 1,294
192. and the people who tested positive
who really had the condition was 297.
193. So it looks like the probability of
actually having the condition,
194. given that you tested positive,
is 297 out of 1,294, or 0.23.
195. That's 23%, less than one in four.
196. Is that what you guessed?
197. Most people, including most doctors, when
they hear that the test is
198. 99% sensitive and 99% specific, will
guess a lot higher than one in four.
199. >> Oh my gosh!
200. I'm a doctor, and I never would have
thought that!
201. >> Now, don't worry:
202. she's not a physician,
she's a metaphysician.
203. >> But in this case, the probability
really is just one in four
204. that you had that medical condition.
205. Now how did that happen?
206. The reason was that the prevalence or the
base rate was so low
207. that even a small rate
of false positives,
208. given the massive numbers of people who
don't have the condition,
209. will mean that there are more false positives,
3 times as many,
210. as there are true positives.
211. And that's why the probability
is just one in four,
212. actually a little less than one in four,
213. that you have the medical condition even
when you tested positive.
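The box-method arithmetic just walked through can be sketched in a few lines of code. The figures are the lecture's made-up numbers; the variable names are mine:

```python
# Box method with the lecture's made-up numbers:
# prevalence 0.3%, sensitivity 99%, specificity 99%.
population = 100_000
prevalence = 0.003
sensitivity = 0.99
specificity = 0.99

# Columns: people with and without the condition.
with_condition = round(population * prevalence)           # 300
without_condition = population - with_condition           # 99,700

# Rows: positive and negative test results.
true_positives = round(with_condition * sensitivity)      # 297
false_negatives = with_condition - true_positives         # 3
true_negatives = round(without_condition * specificity)   # 98,703
false_positives = without_condition - true_negatives      # 997

total_positives = true_positives + false_positives        # 1,294
posterior = true_positives / total_positives
print(f"P(condition | positive test) = {posterior:.3f}")  # about 0.230
```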
214. I want to add a quick caveat here, in
order to avoid misinterpretation,
215. because the point here is that, if you
have a screening test for a condition
216. with a very low base rate or prevalence,
and you don't have any symptoms
217. that put you in a special category,
then you need to get another test.
219. Because, if you have that other test,
then the fact that you tested positive
220. on the first test puts you in a smaller class,
221. with a much higher base rate, or prevalence.
222. And now, the probability's going to go up.
223. Most doctors know that, and that's why,
after the first test,
they order another test,
225. but many patients don't realize that and
they get extremely worried
226. after a single test even when they don't
have any symptoms.
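That follow-up logic can be sketched by running the same calculation twice, feeding the first posterior back in as the new base rate. The lecture gives no figures for the second test, so this assumes, hypothetically, an independent second test with the same 99% sensitivity and specificity:

```python
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test): true positives over all positives."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

first = posterior(0.003, 0.99, 0.99)   # one positive test
second = posterior(first, 0.99, 0.99)  # a second, independent positive test
print(f"{first:.2f} -> {second:.2f}")  # about 0.23 -> 0.97
```

The second positive test starts from a base rate of 23% rather than 0.3%, which is why the probability jumps so sharply.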
227. So that's the mistake
that we're trying to avoid here
228. and that's surprising, but it actually
applies to many different areas of life.
229. It applies, for example, to medical tests
with all kinds of other diseases.
230. Not just cancer or colon cancer, but
pretty much every disease
231. where the prevalence is extremely low.
232. It applies also to drug tests.
233. If somebody gets a positive drug test,
234. does that mean they really
were using drugs?
235. Well, if it's a population where the
base rate or prevalence of drug use
236. is quite low, then it might not.
237. Of course, if you assume that the
prevalence or base rate is quite high,
238. then you're going to believe
that drug test.
239. But you need to know the facts about what
the prevalence or base rate really is
240. in order to calculate
accurately the probability
241. that this person really was using drugs.
242. Same applies to evidence in legal trials:
take eyewitnesses for example,
243. it's very tricky: someone's trying to use
their eyes as a test for what they see.
244. They might identify a friend,
or they might just say
245. that car that did the hit-and-run accident
was a Porsche.
246. Well, how good are they at identifying
Porsches?
247. If they get it right most of the time,
but not always,
248. and sometimes they don't get it right
when it is a Porsche,
249. then we've got the sensitivity and
specificity of what they identify.
250. And we can use that to calculate
how likely it is
251. that their evidence in the trial
really is reliable or not.
252. Another example is the prediction of
future behavior.
253. We might have some kind of marker
254. that a certain group of people
with that marker
255. have a certain likelihood of
committing crimes.
256. But if crimes are very rare
in that community and every other,
257. then a test which has a pretty good
sensitivity and specificity
258. still might not be good enough when
we're talking about something like crime
259. that's actually very rare and has
a very low prevalence or base rate
260. in most communities.
261. And the same applies
to failing out of school.
262. Are SAT scores or GRE scores
going to be
263. good predictors of
who's going to fail out of school?
264. Well, if very few people fail out of
school,
265. so that the prevalence and base rate
is very low,
266. then, even if they're
pretty sensitive and specific,
267. they might not be good predictors.
268. So this same type of problem arises
in a lot of different areas.
269. And I'm not going to go through
more examples right now,
270. but we'll have plenty of examples in the
exercises at the end of this chapter.
271. I want to end, though,
by saying a few things
272. that are a bit more technical.
273. First, there's a lot of terminology to learn,
because if you encounter this method in other areas,
275. for other types of topics,
then you'll run into these terms,
276. and it's a good idea to know them.
277. So first, the cases where the person does
have the condition and also tests positive
278. are called hits, or true positives.
279. Different people use different terms.
280. The cases where the person tests positive,
but they don't have the condition,
281. are called false positives
or false alarms.
282. The cases where a person really does have
the condition, but tests negative
283. are called misses or false negatives.
284. And the cases where the person
does not have the condition
285. and the test comes out negative
are called true negatives,
286. because they're negative and it's true
that they don't have the condition.
287. If we put together the false negatives,
and the true negatives,
288. we get the total set of negatives.
289. And if we put together the true positives
and the false positives,
290. we get the total set of positives.
291. And of course, we have the general
population.
292. Within that population, there's
a percentage that have the condition
293. and a percentage
that don't have the condition.
294. Now, what's the base rate?
295. The base rate in this population is simply
the set that have the condition,
296. divided by the total population,
which is Box 7 divided by Box 9.
297. If we use e for the evidence
298. and h for the hypothesis being true that
the condition really does exist,
299. then that's the probability of h,
300. and the sensitivity is going to be
the total number of true positives
301. divided by the total number of people
with the condition,
302. because it's the percentage of people who
have the condition and test positive.
303. OK? So that's the probability of e given h,
304. and it's Box 1 divided by Box 7.
305. The specificity in contrast is the ratio
of it being a true negative
306. to the total number of people
who do not have the condition, that is,
307. the probability of not e, that is,
308. not having the evidence
of a positive test result,
309. given not h,
given that you're in the second column,
310. where the hypothesis is false,
because you don't have the condition.
311. So that's Box 5 divided by Box 8.
312. That's the specificity.
313. So we can define all of these
in terms of each other.
314. The hits divided by the total with that
condition is going to be the sensitivity.
315. And you can use this terminology to guide
you through these calculations.
316. And the big question is again going to be
what's the solution?
317. What's the probability of the hypothesis
having the condition, given the evidence,
318. that is, a positive test result:
that's going to be Box 1 divided by Box 3.
319. And as we saw in the case that we just
went through,
320. that gives you the probability of having
the medical condition, or colon cancer,
321. given a positive test result.
322. That's called the posterior probability,
or in symbols,
323. the probability of the hypothesis,
given the evidence.
324. So I hope this terminology helps you
understand some of the discussions of this
in the literature.
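As a sketch, the lecture's chart can be reconstructed from the boxes it names explicitly (Boxes 1, 3, 5, 7, 8, and 9); the numbering of Boxes 2, 4, and 6 below is my inference, not stated in the lecture:

```python
boxes = {
    1: 297,      # true positives (hits)
    2: 997,      # false positives (false alarms) -- numbering inferred
    3: 1_294,    # everyone who tests positive
    4: 3,        # false negatives (misses) -- numbering inferred
    5: 98_703,   # true negatives
    6: 98_706,   # everyone who tests negative -- numbering inferred
    7: 300,      # everyone with the condition
    8: 99_700,   # everyone without the condition
    9: 100_000,  # total population
}

base_rate   = boxes[7] / boxes[9]  # P(h)       = 0.003
sensitivity = boxes[1] / boxes[7]  # P(e | h)   = 0.99
specificity = boxes[5] / boxes[8]  # P(~e | ~h) = 0.99
solution    = boxes[1] / boxes[3]  # P(h | e)  is about 0.23
```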
326. This procedure that we've been discussing
is actually just an application
327. of a famous theorem called Bayes' Theorem
after Thomas Bayes,
328. an 18th-century English clergyman,
who was also a mathematician
329. and proved this extremely important
theorem in probability theory.
330. Now some of you out there will use the
boxes, and it'll make sense to you.
331. But some Courserians, I assume,
are mathematicians,
332. and they want to see
the mathematics behind it.
333. So now, I want to show you how to derive
Bayes' theorem
334. from the rules of probability
that we learned in earlier lectures.
335. So for all you math nerds out there,
here goes.
336. We start with rule 2G, the rule for conjunctions, and
337. apply it to the probability that the
evidence and the hypothesis are both true.
338. And by the rule, that probability is
equal to the probability of the evidence,
339. times the probability of the hypothesis,
given the evidence.
340. You have to have
that conditional probability
341. because they're not independent.
342. Then you simply divide both sides of that
by the probability of the evidence:
343. a little simple algebra.
344. And you end up with the probability
of the hypothesis, given the evidence,
345. is equal to the probability
of the evidence and the hypothesis,
346. divided by the probability
of the evidence.
347. Now we can do a little trick.
This was ingenious.
348. Substitute for e, something
that's logically equivalent to e,
349. namely, the evidence AND the hypothesis
or the evidence AND NOT the hypothesis.
350. Now if you think about it, you'll see
that those are equivalent,
351. because either the hypothesis
has to be true
352. or NOT the hypothesis is true.
353. One or the other has to be true.
354. And that means that the evidence
AND the hypothesis
355. or the evidence AND NOT the hypothesis
is going to be equivalent to e.
356. So this is equivalent to this.
357. And because they're equivalent,
we can substitute them
358. within the formula for probability
without affecting the truth values.
359. So we just substitute this formula in
here for the e up there.
360. And we end up with the probability of the
hypothesis, given the evidence,
361. is equal to the probability of the
evidence AND the hypothesis, divided by
362. the probability of the evidence
AND the hypothesis
363. or the evidence AND NOT the hypothesis.
364. Now, that's not supposed to make much
sense, but it helps with the derivation.
365. The next step is to apply rule 3, because
we have a disjunction.
366. And notice the disjuncts are mutually
exclusive.
367. It cannot be true, both, that the evidence
AND the hypothesis is true,
368. and also that the evidence
AND NOT the hypothesis is true,
369. because it can't be both h and not h.
370. So we can apply the simple version
of rule 3.
371. And that means that the probability of
(e&h) or (e&~h)
372. is equal to the probability of (e&h)
plus the probability of (e&~h).
373. We're just applying
that rule 3 for disjunction
374. that we learned a few lectures ago.
375. Now we apply rule 2G again,
376. because we have the probability
of a conjunction up in the top.
377. And, since these are not independent of
each other
378. -- we hope not, if it's a hypothesis
and the evidence for it --
379. then we have to use
the conditional probability.
380. And using rule 2G, we find that
the probability of the hypothesis,
381. given the evidence, is equal to
382. the probability of the hypothesis, times
the probability of the evidence,
383. given the hypothesis, divided by
the probability of the hypothesis,
384. times the probability of the evidence,
given the hypothesis,
385. plus the probability
of the hypothesis being false,
386. that is the probability of NOT h,
times the probability of the evidence,
387. given NOT h, or the hypothesis being false.
388. And that's a mouthful
389. and it's a long formula,
but that's the mathematical formula
390. that Bayes proved in the 18th century
and it provides the mathematical basis
391. for that whole system of boxes.
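For reference, the derivation just walked through can be written out symbolically, using the lecture's rule names (a sketch of the steps, not the lecture's own notation):

```latex
\begin{align*}
&\text{Rule 2G: } && P(e \,\&\, h) = P(e)\,P(h \mid e)\\
&\text{Divide both sides by } P(e)\text{: } && P(h \mid e) = \frac{P(e \,\&\, h)}{P(e)}\\
&\text{Since } e \equiv (e \,\&\, h) \lor (e \,\&\, \lnot h)\text{, Rule 3: } && P(e) = P(e \,\&\, h) + P(e \,\&\, \lnot h)\\
&\text{Rule 2G on each conjunction: } && P(h \mid e) = \frac{P(h)\,P(e \mid h)}{P(h)\,P(e \mid h) + P(\lnot h)\,P(e \mid \lnot h)}
\end{align*}
```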
392. But if you don't like the mathematical
proof and that's too confusing for you,
393. then use the boxes.
394. And if you don't like the boxes,
use the mathematical proof.
395. They're both going to work:
just pick the one that works for you.
396. In fact, you don't have to pick
either of them,
397. because remember, this is an honors
lecture, it's optional,
398. and it won't be on the quiz.
399. But if you do want to try this method,
and make sure that you understand it,
400. we'll have a bunch of exercises for you,
where you can test your skills.