Hanno Böck: Yeah, so many of you probably
know me from doing things around IT
security, but I'm gonna surprise you by
almost not talking about IT security today.
But I'm gonna ask the question "Can we
trust the scientific method?". I want to
start this by giving you quite a
simple example. So if we do science like
we start with the theory and then we are
trying to test if it's true, right? So I
mean I said I'm not going to talk about IT
security but I chose an example from IT
security or kind of from IT security. So
there was a post on Reddit a while ago,
a picture from some book which claimed that
if you use a Malachite crystal that can
protect you from computer viruses.
Which... to me doesn't sound very
plausible, right? Like, these are crystals and
if you put them on your computer, this book
claims this protects you from malware. But
of course if we really want to know, we
could do a study on this. And if you say
people don't do studies on crazy things:
that's wrong. I mean people do studies on
homeopathy or all kinds of crazy things
that are completely implausible. So we can
do a study on this and what we will do is
we will do a randomized controlled trial,
which is kind of the gold standard of
doing a test on these kinds of things. So
this is our question: "Do Malachite
crystals prevent malware infections?" and
how we would test that, our study design
is: ok, we take a group of maybe 20
computer users. And then we split them
randomly to two groups, and then one group
we'll give one of these crystals and tell
them: "Put them on your desk or on your
computer." Then the other group
is our control group. That's very
important because if we want to know if
they help we need another group to compare
it to. And to rule out that there are any
kinds of placebo effects, we give this
control group a fake Malachite crystal so
we can compare them against each other.
And then we wait for maybe six months and
then we check how many malware infections
they had. Now, I didn't do that study, but
I simulated it with a Python script and
given that I don't believe that this
theory is true I just simulated this as
random data. So I'm not going to go
through the whole script but I'm just like
generating, I'm assuming there can be
between 0 and 3 malware infections and
it's totally random and then I compare the
two groups. And then I calculate something
which is called a p-value which is a very
common thing in science whenever you do
statistics. A p-value is, it's a bit
technical, but it's the probability that,
if there is no effect, you would get a
result at least this extreme. Which kind of in another way
means, if you have 20 results in an
idealized world then one of them is a
false positive which means one of them
says something happens although it
doesn't. And in many fields of science
a p-value below 0.05 is considered
significant, which matches these twenty
studies. So one error in twenty studies
but as I said under idealized conditions.
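A minimal sketch of one such simulated study, for illustration only. It is not the script from the talk: the group size of 10, the function names and the choice of a permutation test to get the p-value are assumptions made here, using only the Python standard library.

```python
import random


def simulate_group(rng, size=10):
    """Random malware counts between 0 and 3 for one group (pure noise)."""
    return [rng.randint(0, 3) for _ in range(size)]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, rng, rounds=2000):
    """Two-sided permutation test on the difference of the group means.

    Assumption: the talk does not say which test was used; a permutation
    test is swapped in here because it needs no external library.
    """
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (rounds + 1)


if __name__ == "__main__":
    rng = random.Random()  # unseeded: every run is a new "study"
    crystal = simulate_group(rng)
    placebo = simulate_group(rng)
    p = permutation_p_value(crystal, placebo, rng)
    print(f"crystal mean {mean(crystal):.1f}, "
          f"fake crystal mean {mean(placebo):.1f}, p = {p:.3f}")
```

Both groups are drawn from the same random source, so any "significant" difference this prints is a fluke by construction.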
And as it's a script and I can run it
in less than a second I just did it twenty
times instead of once. So here are my 20
simulated studies and most of them look
not very interesting so of course we have
a few random variations but nothing very
significant. Except if you look at this
one study, it says the people with the
Malachite crystal had on average 1.8
malware infections and the people with the
fake crystal had 0.8. So it means actually
the crystal made it worse. But also this
result is significant because it has a
p-value of 0.03. So of course we can
publish that, assuming I really did these
studies. And the other studies we just forget
about. I mean they were not interesting
right, and who cares? Non-significant
results... Okay so you have just seen that
I created a significant result out of
random data. And that's concerning because
people in science - I mean you can really do
that. And this phenomenon is called
publication bias. So what's happening here
is that, you're doing studies and if they
get a positive result - meaning you're
seeing an effect, then you publish them
and if there's no effect you just forget
about them. We learned earlier that
this p-value of 0.05 means 1 in 20 studies
is a false positive, but you usually don't
see the studies that are not significant,
because they don't get published. And you
may wonder: "Ok, what's stopping a
scientist from doing exactly this? What's
stopping a scientist from just doing so
many experiments till one of them looks
like it's a real result although it's just
a random fluke?". And the disconcerting
answer to that is, it's usually nothing.
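To put a number on this, here is a small back-of-the-envelope sketch in Python. It assumes the idealized case from above: every study tests an effect that does not exist, the studies are independent, and each has exactly a 5% chance of coming out significant. The function names are made up for this illustration.

```python
def chance_of_at_least_one_fluke(n_studies, alpha=0.05):
    """Probability that at least one of n null studies is 'significant'."""
    return 1 - (1 - alpha) ** n_studies


def expected_studies_until_fluke(alpha=0.05):
    """Average number of null studies needed for one 'significant' result."""
    return 1 / alpha


if __name__ == "__main__":
    print(f"{chance_of_at_least_one_fluke(20):.2f}")
    print(f"{expected_studies_until_fluke():.0f}")
```

Under these assumptions, 20 attempts give roughly a 64% chance of at least one false positive, and on average it takes 20 attempts to get one.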
And this is not just a theoretical
example. I want to give you an example,
that has quite some impact and that was
researched very well, and that is
research on antidepressants, so-called
SSRIs. And in 2008 there was a study, the
interesting situation here was, that the
US Food and Drug Administration, which is
the authority that decides whether a
medical drug can be put on the market,
they had knowledge about all the studies
that had been done to register this
medication. And then some researchers
looked at that and compared it with what
has been published. And they figured out
there were 38 studies that saw that these
medications had a real effect, had real
improvements for patients. And from those
38 studies 37 got published. But then
there were 36 studies that said: "These
medications don't really have any
effect.", "They are not really better than
a placebo effect" and out of those only 14
got published. And even from those 14
there were 11, where the researchers said,
okay, they have spun the result in a way
that it sounds like these medications do
something. But there were also a bunch of
studies that were just not published
because they had a negative result. And
it's clear that if you look at the
published studies only and you ignore the
studies with a negative result that
haven't been published, then these
medications look much better than they
really are. And it's not like the earlier
example: there is a real effect from
antidepressants, but they are not as good
as people have believed in the past.
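The size of the distortion can be worked out from the figures just quoted. The short Python sketch below only does that arithmetic; the variable names are chosen for this illustration, and counting the 11 reworded studies as "positive-looking" follows the description above.

```python
# Numbers as quoted in the talk.
positive_total, positive_published = 38, 37
negative_total, negative_published = 36, 14
negative_written_up_as_positive = 11


def share(part, whole):
    return part / whole


all_studies = positive_total + negative_total
published = positive_published + negative_published
looks_positive_in_print = positive_published + negative_written_up_as_positive

true_share = share(positive_total, all_studies)
apparent_share = share(looks_positive_in_print, published)

if __name__ == "__main__":
    print(f"positive among all studies:       {true_share:.0%}")
    print(f"positive-looking among published: {apparent_share:.0%}")
```

Among all 74 studies about 51% were positive, while among the 51 published ones about 94% looked positive.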
So we've learnt in theory with publication bias
But if you're a researcher and you have a
to publish something about it, that's not
20 studies on average to get one of these
results. So there are more efficient ways
doing a study then there are a lot of
example you may have dropouts from your
to another place or they - you no longer
your study. And there are different things
corner-case results, where you're not
and how do you decide?", "How do you
be looking for different things, maybe
people, and you may control for certain
into separate?", "Do you see them
age?". So there are many decisions you can
each of these decisions has a small effect
that just by trying all the combinations
it's statistically significant, although
this term called p-hacking which means
enough, that you get a significant result.
is usually not that a scientist says: "Ok,
because I know my theory is wrong but I
subconscious process, because usually the
Honestly. They honestly think that their
will show that. So they may subconsciously
it looks a bit better so I will do this.".
themselves into getting a result that's
"What is stopping scientists from
the same: usually nothing. And I came to
scientific method it's a way to create
matter if it's true or not.". And you may
and I'm saying this even though I'm not
hacker who, whatever... But I'm not alone
famous researcher John Ioannidis, who
findings are false.". He published this in
doesn't really question that most research
reasons why this is the case. And he makes
at that many negative results don't get
bias. And it comes to a very plausible
is not even very controversial. If you ask
science on science or meta science, who
tell you: "Yeah, of course that's the
how science works, that's what we
you take this seriously, it means: if you
the default assumption should be 'that's
the opposite. And if science is a method
you can think about something really
"Does our mind have
sense things that happen in an hour?". And
and he thought that this is the case and
"feeling the future". He did a lot of
then something later happened, and he
what happened later influenced what
very plausible - based on what we know
published in a real psychology journal.
study. Basically, it's a very nice example
Daryl Bem, where he describes something
where he says that's how you do
in line with the existing standards in
people found concerning. So, if you can
can see into the future, then what else
results? And psychology has debated this a
there's a lot of talk about the
effects that psychology just thought were
to repeat these experiments, they couldn't
subfields were built on these results.
is one of the ones that is not discussed so
moral licensing. And the idea is that if
think is good, then later basically you
I already did something good now, I don't
some famous studies that had the theory,
later they become more judgmental, or less
last week someone tried to replicate this
three times with more subjects and better
couldn't find that effect. But like what
articles. I have not found a single
replicated. Maybe they will come but yeah
now I want to have a small warning for you
psychologists, that all sounds very
precognition and whatever", but maybe your
don't know about it yet because nobody
your field. And there are other fields
much worse for example the pharma company
where they said "We have tried to
research" that is stuff in a petri dish or
but what happens before you develop a drug
out of 53 studies. And these were they
have been published in the best journals.
publication because they have not
told us which studies these were that they
think they have published three of these
the dark which points to another problem
they collaborated with the original
agreeing that they would not publish the
concerning so but some fields don't have a
trying to replicate previous results I
results hold up. So what can be done about
core issue here is that the scientific
we do a study and only after that we
Or we do a study and only after we have
essentially we need to decouple the
one way of doing that is pre-registration
you start doing a study you will register
do a study like on this medication or
that's how I'm gonna do it and then later
that. And yeah that's what I said. And this
medical drug trials the summary about it
better than nothing. So, and the problem
study and then don't publish it and
are legally required to publish it. And
out, there's the AllTrials campaign which
doctor from the UK and they like demand
medication should be published. And
COMPare project and they are trying to see
later published did they do the same or
protocol and was there a reason for it or
which they otherwise wouldn't get. But then
often get a lot of attention and for good
medicine then people die, that's pretty
read about this you always have to think
they have pre-registration, most
anything like that. So whenever you hear
bias in medicine you should always think
science and usually nobody is doing
this audience I'd like to say there's
computer science want to revolutionize
these things, which in principle is ok but
very worried about this and the reason is,
have the same scientific standards as
say "Yeah, we don't really need to do
helps" and that is worrying and I come
understand that people from medicine are
that goes even further than pre-registration
is a couple of years ago some scientists
they.. that was published there and the idea
publication process upside down, so if you
would do with the registered report is, you
protocol to the journal and then the
that before they see any result, because
then you prevent the journals only publish
findings. And then you do the study and
published independent of what the result
can do to improve science, there's a lot
sharing methods because if you want to
you have access to all the details how the
say "Okay we could do large
just too small if you have a study with
reliable outcome. So maybe in many
10 teams of scientists and let them all do
reliably answer a question. And also some
statistical thresholds that p-value of
recently a paper that just argued which
the left and have 0.005 and that would
example in physics they have
zero point and then 5 zeroes and 3 or
have much higher statistical thresholds.
scientific field you might ask yourself
they pre-registered in any way and do we
an effect and we got nothing and are there
would say if you answer all these
people will do, then you're not really
alchemy of our time.
Herald: Thank you very much..
Hanno: No I have more, sorry, I have
three more slides, that was not the
finishing line. Big issue is also that
there are bad incentives in science, so a
very standard thing to evaluate the impact
of science is citation counts, where you say
"if your scientific study is cited a lot
then this is a good thing and if your
journal is cited a lot this is a good
thing" and this is for example the impact
factor but there are also other
measurements. And also universities like
publicity so if your study gets a lot of
media reports then your press department
likes you. And these incentives tend to
favor interesting results but they don't
favor correct results and this is bad
because if we are realistic most results
are not that interesting, most results
will be "Yeah we have this interesting and
counterintuitive theory and it's totally
wrong" and then there's this idea that
science is self-correcting. So if you
confront scientists with these issues with
publication bias and p-hacking surely
they will immediately change that's what
scientists do right? And I want to cite
something here with this sorry it's a bit
long but "There is some evidence that
in fields where statistical tests are commonly
used, research which yields non-significant
results is not published." That sounds
like publication bias and then it also
says: "Significant results published in
these fields are seldom verified by
independent replication" so it seems
there's a replication problem. These wise
words were said in 1959 by a
statistician called Theodore Sterling and
because science is so self-correcting in
1995 he complained that this article
presents evidence that published results of
scientific investigations are not a
representative sample of all scientific
studies. "These results also indicate that
practice leading to publication bias has
not changed over a period of 30 years" and
here we are in 2018 and publication bias
is still a problem. So if science is self-
correcting then it's pretty damn slow in
correcting itself, right? And finally I
would like to ask you, if you're prepared
for boring science, because ultimately, I
think, we have a choice between what I
would like to call TEDTalk science and
boring science..
positive and surprising results and
many citations lots of media attention and
Unfortunately usually it's not true and I
the alternative which is mostly negative
it may be closer to the truth. And I would
it's a pretty tough sell. Sorry I didn't
hear that. Yeah, thanks for listening.
Herald: Thank you.
Hanno: Two questions, or?
Herald: We don't have that much time for
questions, three minutes, three minutes
guys. Question one - shoot.
Mic: This isn't a question but I just
wanted to comment: Hanno, you missed out a
very critical topic here, which is the use
of Bayesian probability. So you did
conflate p-values with the scientific
method, which gave the rest
of your talk, I felt, a slightly unnecessary
anti-science slant. P-values aren't
the be-all and end-all of the scientific
method. So p-values are sort of calculating
the probability that your data will happen
given that the null hypothesis is true, whereas
Bayesian probability would be calculating
the probability that your hypothesis is
true given the data and more and more
scientists are slowly starting to realize
that this sort of method is probably a
better way of doing science than p-values.
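The difference between the two quantities can be sketched with Bayes' rule. The numbers below are purely illustrative assumptions, not from the talk: a prior of 10% that a tested hypothesis is true, an 80% chance that a study detects a real effect, and the usual 5% significance threshold. The function name is also made up for this sketch.

```python
def probability_true_given_significant(prior, power, alpha=0.05):
    """Bayes' rule: P(hypothesis true | significant result).

    prior: assumed share of tested hypotheses that are really true
    power: P(significant | hypothesis true)
    alpha: P(significant | no effect), the significance threshold
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    print(f"{probability_true_given_significant(prior=0.1, power=0.8):.2f}")
```

Under these assumed numbers, a significant result means the hypothesis is true with a probability of about 64%, not 95%.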
So this is probably a third alternative
to your sort of proposal boring science is
doing the other side's Bayesian
Hanno: Sorry yeah, I agree with you I
half an hour here.
like where are we going after this lecture
Hanno: I know him..
then scientists it's a little bit like the
it's like: "you scratch my back and I
Maybe two more minutes?
Please go ahead.
curious so you've raised, you know, ways
assuming people who want to do better
or willful ignorance. What do we do about
community drug companies, maybe they
incentivized by these random control
do something. How do we begin to address
or maliciously abuse the pre-reg system or
Hanno: I mean it's a big question, right?
confining you so much that there's not
and a basis and also I don't think
problem, I actually really think the
believe what they do is true.
Mic: So the value in science is often an
citations so and so on, so is it true that
described, journals of whose publications
should impose higher standards so the
bar, they should enforce publication of
etc. So is it journals who should, like,
scientists do something also? I mean you
better standards, right? There are
reports, but of course I mean as a single
you're playing in a system that has all
Herald: Okay guys that's it, we have to
better science dot-org, go there, and one
