34c3 intro
Hanno Böck: Yeah, so many of you probably
know me from doing things around IT
security, but I'm gonna surprise you by
almost not talking about IT security today.
But I'm gonna ask the question "Can we
trust the scientific method?". I want to
start this with quite a
simple example. So when we do science,
we start with a theory and then we
try to test whether it's true, right? So I
mean I said I'm not going to talk about IT
security but I chose an example from IT
security or kind of from IT security. So
there was a post on Reddit a while ago,
a picture from some book which claimed that
if you use a Malachite crystal, it can
protect you from computer viruses.
Which... to me doesn't sound very
plausible, right? Like, these are crystals and
if you put them on your computer, this book
claims this protects you from malware. But
of course if we really want to know, we
could do a study on this. And if you say
people don't do studies on crazy things:
that's wrong. I mean people do studies on
homeopathy or all kinds of crazy things
that are completely implausible. So we can
do a study on this and what we will do is
we will do a randomized controlled trial,
which is kind of the gold standard of
doing a test on these kinds of things. So
this is our question: "Do Malachite
crystals prevent malware infections?" and
how we would test that, our study design
is: ok, we take a group of maybe 20
computer users. And then we split them
randomly into two groups. One group
gets one of these crystals, and we tell
them: "Put it on your desk or on your
computer.". The other group
is our control group. That's very
important, because if we want to know if
the crystals help, we need another group
to compare against. And to rule out any
kind of placebo effect, we give this
control group a fake Malachite crystal, so
we can compare the two groups against each other.
And then we wait for maybe six months and
then we check how many malware infections
they had. Now, I didn't do that study, but
I simulated it with a Python script and
given that I don't believe that this
theory is true I just simulated this as
random data. So I'm not going to go
through the whole script, but I'm just
generating data: I'm assuming each person
can have between 0 and 3 malware infections,
totally at random, and then I compare the
two groups. And then I calculate something
which is called a p-value which is a very
common thing in science whenever you do
statistics. A p-value is, it's a bit
technical, but it's the probability that
you would get a result like this if there
is actually no effect. And in many fields
of science a p-value below 0.05 is
considered significant. Put another way:
in an idealized world, if there is no
effect, one in twenty results will still
be a false positive, meaning it says
something happens although it doesn't.
So one error in twenty studies,
but as I said, under idealized conditions.
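A minimal sketch of what such a simulation could look like, assuming infection counts between 0 and 3 per person and a simple t-test for the group comparison (illustrative only, not necessarily the exact script used in the talk):

```python
# Minimal sketch of the simulated crystal study: purely random data, no real effect.
# Assumptions: 10 users per group, 0-3 infections each, Welch's t-test for the comparison.
import random
from scipy import stats

def simulate_study(n_per_group=10):
    crystal = [random.randint(0, 3) for _ in range(n_per_group)]
    fake = [random.randint(0, 3) for _ in range(n_per_group)]
    # p-value: probability of a difference at least this large if there is no real effect
    p = stats.ttest_ind(crystal, fake, equal_var=False).pvalue
    return sum(crystal) / n_per_group, sum(fake) / n_per_group, p

mean_crystal, mean_fake, p = simulate_study()
print(f"crystal group: {mean_crystal:.1f} infections, fake crystal: {mean_fake:.1f}, p = {p:.3f}")
```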
And since it's a script and I can run it
in less than a second, I just did it twenty
times instead of once. So here are my 20
simulated studies, and most of them don't
look very interesting: of course we have
a few random variations, but nothing very
significant. Except if you look at this
one study, it says the people with the
Malachite crystal had on average 1.8
malware infections and the people with the
fake crystal had 0.8. So it means actually
the crystal made it worse. But also this
result is significant because it has a
p-value of 0.03. So of course we can
publish that, assuming I really did these
studies.
Applause
Hanno: And the other studies we just forget
about. I mean they were not interesting
right, and who cares? Non-significant
results... Okay, so you have just seen that
I created a significant result out of
random data. And that's concerning because
people in science - I mean, you can really do
that. And this phenomenon is called
publication bias. So what's happening here
is that, you're doing studies and if they
get a positive result - meaning you're
seeing an effect, then you publish them
and if there's no effect you just forget
about them. We learned earlier that this
p-value of 0.05 means 1 in 20 studies
is a false positive, but you usually don't
see the studies that are not significant,
because they don't get published. And you
may wonder: "Ok, what's stopping a
scientist from doing exactly this? What's
stopping a scientist from just doing so
many experiments till one of them looks
like it's a real result although it's just
a random fluke?". And the disconcerning
answer to that is, it's usually nothing.
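To see the mechanics of publication bias in a few lines, one can run many such simulated null studies and "publish" only the significant ones; a rough sketch, reusing the same illustrative setup as above:

```python
# Publication bias on pure noise: simulate many studies with no real effect
# and keep only those with p < 0.05. Group sizes and the t-test are the same
# illustrative assumptions as in the sketch above.
import random
from scipy import stats

def null_study(n_per_group=10):
    a = [random.randint(0, 3) for _ in range(n_per_group)]
    b = [random.randint(0, 3) for _ in range(n_per_group)]
    return stats.ttest_ind(a, b, equal_var=False).pvalue

pvalues = [null_study() for _ in range(1000)]
published = [p for p in pvalues if p < 0.05]
# Roughly 5% of these pure-noise studies look "significant" and get published;
# the other ~95% end up in the file drawer.
print(f"{len(published)} of {len(pvalues)} null studies came out significant")
```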
And this is not just a theoretical
example. I want to give you an example
that had quite some impact and that was
researched very well, and that is
research on antidepressants, so-called
SSRIs. And in 2008 there was a study; the
interesting situation here was, that the
US Food and Drug Administration, which is
the authority that decides whether a
medical drug can be put on the market,
they had knowledge about all the studies
that had been done to register this
medication. And then some researchers
looked at that and compared it with what
has been published. And they figured out
there were 38 studies that saw that these
medications had a real effect, had real
improvements for patients. And from those
38 studies 37 got published. But then
there were 36 studies that said: "These
medications don't really have any
effect.", "They are not really better than
a placebo effect" and out of those only 14
got published. And even from those 14
there were 11 where the researchers said:
okay, they have spun the result in a way
that it sounds like these medications do
something. But there were also a bunch of
studies that were just not published
because they had a negative result. And
it's clear that if you look at the
published studies only and you ignore the
studies with a negative result that
haven't been published, then these
medications look much better than they
really are. And it's not like the earlier
example: there is a real effect from
antidepressants, but they are not as good
as people have believed in the past.
So we've learnt that, in theory, with publication
bias you can create a result out of nothing.
But if you're a researcher and you have a
theory that's not true but you really want
to publish something about it, that's not
really efficient, because you have to do
20 studies on average to get one of these
random results that look like real
results. So there are more efficient ways
to get to a result from nothing. If you're
doing a study then there are a lot of
micro decisions you have to make, for
example you may have dropouts from your
study, where people, I don't know, move
to another place or you can no longer
reach them, so they are no longer part of
your study. And there are different ways
you can handle that. Then you may have
corner-case results, where you're not
entirely sure: "Is this an effect or not,
and how do you decide?", "How exactly do
you measure?". And then you may also
be looking for different things, maybe
there are different tests you can do on
people, and you may control for certain
variables, like "Do you split men and women
into separate groups and look at them
separately?" or "Do you separate them by
age?". So there are many decisions you can
make while doing a study. And of course
each of these decisions has a small effect
on the result. And it may very often be,
that just by trying all the combinations
you will get a p-value that looks like
it's statistically significant, although
there's no real effect. So there's
this term called p-hacking, which means
you're just adjusting your methods long
enough that you get a significant result.
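As a rough sketch of how this plays out: take data with no real effect, try a handful of plausible-sounding analysis choices, and keep the best p-value. The specific variants below, like excluding "dropouts" or capping the infection counts, are invented for illustration:

```python
# p-hacking sketch: pure-noise data, several "reasonable" analysis variants,
# report whichever gives the smallest p-value. All variants are invented
# for illustration; none of them is fraudulent on its own.
import random
from scipy import stats

n_per_group = 10
group = ["crystal"] * n_per_group + ["fake"] * n_per_group
infections = [random.randint(0, 3) for _ in range(2 * n_per_group)]
rows = list(zip(group, infections))

def p_value(rows):
    a = [i for g, i in rows if g == "crystal"]
    b = [i for g, i in rows if g == "fake"]
    return stats.ttest_ind(a, b, equal_var=False).pvalue

variants = {
    "all participants": p_value(rows),
    "exclude one 'dropout'": p_value(rows[:-1]),
    "exclude two 'dropouts'": p_value(rows[:-2]),
    "count only severe infections (>= 2)": p_value([(g, int(i >= 2)) for g, i in rows]),
    "cap infections at 2": p_value([(g, min(i, 2)) for g, i in rows]),
}

best = min(variants, key=variants.get)
# Each variant alone keeps the nominal 5% false-positive rate, but picking the
# best of several inflates the chance of a "significant" result well beyond 5%.
print(f"best-looking analysis: {best}, p = {variants[best]:.3f}")
```

Run it on fresh random data a few times and the best-looking variant dips below 0.05 noticeably more often than the nominal one in twenty.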
And I'd like to point out here, that this
is usually not that a scientist says: "Ok,
today I'm going to p-hack my result,
because I know my theory is wrong but I
want to show it's true.". But it's a
subconscious process, because usually the
scientists believe in their theories.
Honestly. They honestly think that their
theory is true and that their research
will show that. So they may subconsciously
say: "Ok, if I analyze my data like this
it looks a bit better so I will do this.".
So subconsciously, they may p-hack
themselves into getting a result that's
not really there. And again we can ask:
"What is stopping scientists from
p-hacking?". And the concerning answer is
the same: usually nothing. And so I came to
this conclusion: "Ok, the
scientific method is a way to create
evidence for whatever theory you like,
no matter if it's true or not.". And you may
say: "That's a pretty bold thing to say.".
And I'm saying this even though I'm not
even a scientist. I'm just like some
hacker who, whatever... But I'm not alone
in this: there's a paper from a
famous researcher, John Ioannidis,
titled "Why most published research
findings are false". He published this in
2005 and if you look at the title, he
doesn't really question that most research
findings are false. He only wants to give
reasons why this is the case. And he makes
some very plausible assumptions, for example
that many negative results don't get
published and that you will have some
bias, and he comes to a very plausible
conclusion: that this is the case. And this
is not even very controversial. If you ask
people who are doing what you can call
science on science or meta science, who
look at scientific methodology, they will
tell you: "Yeah, of course that's the
case.". Some will even say: "Yeah, that's
how science works, that's what we
expect.". But I find it concerning. And if
you take this seriously, it means: if you
read about a study, like in a newspaper,
the default assumption should be 'that's
not true' - while we might usually think
the opposite. And if science is a method
to create evidence for whatever you like,
you can think about something really
crazy, like "Can people see into the future?",
"Does our mind have
some extra perception where we can
sense things that happen in an hour?". And
there was a psychologist called Daryl Bem
and he thought that this is the case and
he published a study on it. It was titled
"feeling the future". He did a lot of
experiments where he did something, and
then something later happened, and he
thought he had statistical evidence that
what happened later influenced what
happened earlier. So, I don't think that's
very plausible - based on what we know
about the universe, but yeah... and it was
published in a real psychology journal.
And a lot of things were wrong with this
study. Basically, it's a very nice example
of p-hacking, and there's even a book by
Daryl Bem where he describes something
which basically looks like p-hacking
and says that's how you do
psychology. But the study was absolutely
in line with the existing standards in
experimental psychology. And that a lot of
people found concerning. So, if you can
show that precognition is real, that you
can see into the future, then what else
can you show and how can we trust our
results? And psychology has debated this a
lot in the past couple of years. So
there's a lot of talk about the
replication crisis in psychology. And for
many effects that psychologists just
thought were true, they figured out that,
if they try to repeat these experiments,
they can't get these results, even though
entire subfields were built on them.
And I want to show you an example, which
is one of the ones that is not discussed so
much. So there's a theory which is called
moral licensing. And the idea is that if
you do something good, or something you
think is good, then later basically you
behave like an asshole. Because you think
I already did something good now, I don't
have to be so nice anymore. And there were
some famous studies with the theory
that when people consume organic food,
they later become more judgmental, or less
social, less nice to their peers. But just
last week someone tried to replicate these
original experiments. And they tried it
three times, with more subjects and better
research methodology, and they totally
couldn't find that effect. But what
you've seen here is lots of media
articles. I have not found a single
article reporting that this could not be
replicated. Maybe they will come, but yeah,
that's just a very recent example. But
now I want to have a small warning for you
because you may think now "yeah these
psychologists, that all sounds very
fishy and they even believe in
precognition and whatever", but maybe your
field is not much better maybe you just
don't know about it yet because nobody
else has started replicating studies in
your field. And there are other fields
that have replication problems and some
much worse for example the pharma company
Amgen in 2012 they published something
where they said "We have tried to
replicate cancer research and preclinical
research" that is stuff in a petri dish or
animal experiments so not drugs on humans
but what happens before you develop a drug
and they were only able to replicate 47
out of 53 studies. And these were they
said landmark studies, so studies that
have been published in the best journals.
Now there are a few problems with this
publication, because they have not
published their replications, they have not
told us which studies these were that they
could not replicate. In the meantime I
think they have published three of these
replications, but most of it is a bit in
the dark, which points to another problem:
they say they did this in collaboration
with the original researchers, and they
only got that by agreeing that they would
not publish the results. But it still
sounds very concerning. And some fields
only don't have a replication problem
because just nobody is trying to replicate
previous results; I mean, then you will
never know if your results hold up.
So what can be done about
all this? Fundamentally, I think the
core issue here is that the scientific
process is tied together with its results:
we do a study, and only after that do we
decide whether it's going to be published.
Or we do a study, and only after we have
the data do we decide how to analyze it. So
essentially we need to decouple the
scientific process from its results, and
one way of doing that is pre-registration.
What you're doing there is that before
you start doing a study, you register
it in a public register and say: "I'm gonna
do a study on this medication or
on this psychological effect, and
that's how I'm gonna do it", and then later
on people can check if you really did
that. And this is more or less standard
practice in medical drug trials; the
summary about it is that it does not work
very well, but it's better than nothing.
The problem is mostly enforcement: people
register a study and then don't publish it,
and nothing happens to them, even though
they are legally required to publish it. And
there are two campaigns I'd like to point
out. There's the AllTrials campaign, which
was started by Ben Goldacre, a
doctor from the UK, and they demand
that every trial that's done on a
medication should be published. And
there's also a project by the same guy, the
COMPare project, and they are trying to see,
when a medical trial has been registered and
later published, whether they did the same
thing or changed something in their
protocol, and whether there was a reason for
it or they just changed it to get a result
which they otherwise wouldn't get. But then
again, these issues in medicine often
get a lot of attention, and for good
reasons, because if we have bad science in
medicine then people die; that's pretty
immediate and pretty massive. But if you
read about this you always have to keep in
mind that drug trials at least
have pre-registration; most
scientific fields don't bother doing
anything like that. So whenever you hear
something, maybe about publication
bias in medicine, you should always think:
the same thing happens in many fields of
science, and usually nobody is doing
anything about it. And particularly to
this audience I'd like to say there's
currently a big trend that people from
computer science want to revolutionize
medicine: big data and machine learning,
these things, which in principle is ok.
But I know a lot of people in medicine are
very worried about this, and the reason is
that these computer science people don't
have the same scientific standards that
people in medicine expect, and might
say: "Yeah, we don't really need to do
a study on this, it's obvious that this
helps". And that is worrying; I come
from computer science and I understand
very well that people from medicine are
worried about this. So there's an idea
that goes even further than pre-registration,
and it's called registered reports. A
couple of years ago some scientists
wrote an open letter that was published
in the Guardian, and the idea
there is that you turn the scientific
publication process upside down: if you
want to do a study, the first thing you
do with a registered report is
submit your study design, your
protocol, to the journal, and then the
journal decides whether they will publish
it before they see any result. That way
you can prevent publication bias, and
you prevent the journals from only publishing
the nice findings and ignoring the negative
findings. And then you do the study, and
it gets published, but it gets
published independent of what the result
was. And there are of course other things you
can do to improve science, there's a lot
of talk about sharing data, sharing code,
sharing methods, because if you want to
replicate a study, it's of course easier if
you have access to all the details of how the
original study was done. Then you could
say: "Okay, we could do large
collaborations", because many studies are
just too small; if you have a study with
twenty people, you just don't get a very
reliable outcome. So maybe in many
situations it would be better to get together
10 teams of scientists and let them all do
a big study together, and then you can
reliably answer a question. And also some
people propose stricter
statistical thresholds, because that p-value
of 0.05 means practically nothing. There was
recently a paper that argued for
just putting the dot one more place to
the left, at 0.005, and that would
already solve a lot of problems. And in
physics, for example, they have
something called five sigma, which is, I think,
zero point, then five zeroes and a three, or
something like that, so in physics they
have much stricter statistical thresholds.
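These thresholds can be compared directly in a couple of lines; a small sketch, taking the five-sigma value as the one-sided tail probability of a normal distribution beyond five standard deviations:

```python
# Comparing significance thresholds: the common 0.05, the proposed 0.005,
# and the "five sigma" convention from particle physics (one-sided normal tail).
from scipy.stats import norm

thresholds = {
    "common threshold": 0.05,
    "proposed stricter threshold": 0.005,
    "five sigma (one-sided)": norm.sf(5),  # about 0.0000003
}
for name, p in thresholds.items():
    print(f"{name}: p < {p:.7f} (roughly 1 in {round(1 / p):,})")
```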
Now, whatever scientific field you're
working in, you might ask yourself:
"If we have statistical results, are
they pre-registered in any way? Do we
publish negative results, like 'we tested
an effect and we got nothing'? And are there
replications of all relevant results?". And I
would say, if you answer all these
questions with "no", which I think many
people will, then you're not really
doing science; what you're doing is the
alchemy of our time.
Applause
Thanks.
Herald: Thank you very much..
Hanno: No I have more, sorry, I have
three more slides, that was not the
finishing line. A big issue is also that
there are bad incentives in science. A
very standard thing to evaluate the impact
of science is citation counts, where you say:
"if your scientific study is cited a lot
then this is a good thing, and if your
journal is cited a lot this is a good
thing". That's for example the impact
factor, but there are also other
measurements. And also, universities like
publicity, so if your study gets a lot of
media reports, then your press department
likes you. And these incentives tend to
favor interesting results, but they don't
favor correct results. And this is bad,
because if we are realistic, most results
are not that interesting; most results
will be "Yeah, we have this interesting and
counterintuitive theory and it's totally
wrong". And then there's this idea that
science is self-correcting. So if you
confront scientists with these issues,
with publication bias and p-hacking, surely
they will immediately change that; that's
what scientists do, right? And I want to cite
something here, sorry, it's a bit
long: "There is some evidence that in fields
where statistical tests of significance are
commonly used, research which yields
nonsignificant results is not published."
That sounds like publication bias.
And then it also
says: "Significant results published in
these fields are seldom verified by
independent replication", so it seems
there's a replication problem. These wise
words were said in 1959 by a
statistician called Theodore Sterling. And
because science is so self-correcting, in
1995 he complained that this article
"presents evidence that published results of
scientific investigations are not a
representative sample of all scientific
studies" and that "these results also
indicate that practices leading to publication
bias have not changed over a period of 30 years", and
here we are in 2018 and publication bias
is still a problem. So if science is self-
correcting then it's pretty damn slow in
correcting itself, right? And finally I
would like to ask you, if you're prepared
for boring science, because ultimately, I
think, we have a choice between what I
would like to call TED talk science and
boring science..
Applause
.. so with TED talk science we get mostly
positive and surprising and
interesting results, we have large effects,
many citations, lots of media attention, and
you may get a TED talk about it.
Unfortunately usually it's not true and I
would like to propose boring science as
the alternative which is mostly negative
results, pretty boring, small effects but
it may be closer to the truth. And I would
like to have boring science but I know
it's a pretty tough sell. Sorry I didn't
hear that. Yeah, thanks for listening.
Applause
Herald: Thank you.
Hanno: Two questions, or?
Herald: We don't have that much time for
questions, three minutes, three minutes
guys. Question one - shoot.
Mic: This isn't a question, but I just
wanted to comment: Hanno, you missed out a
very critical topic here, which is the use
of Bayesian probability. So you did
conflate p-values with the scientific
method, which gave the rest
of your talk, I felt, a slightly unnecessary
anti-science slant. P-values aren't
the be-all and end-all of the scientific
method. A p-value is sort of calculating
the probability that your data would happen
given that the null hypothesis is true, whereas
Bayesian probability would be calculating
the probability that your hypothesis is
true given the data, and more and more
scientists are slowly starting to realize
that this sort of method is probably a
better way of doing science than p-values.
So this is probably a third alternative
to your proposal of boring science:
doing it the other way around, with Bayesian
probability.
Hanno: Sorry, yeah, I agree with you,
unfortunately I only had
half an hour here.
Herald: Where are you going after this,
like, where are we going after this lecture,
can they find you somewhere, in the bar?
Hanno: I know him..
Herald: You know, science is broken, but
then, scientists... it's a little bit like the
next lecture actually that's waiting there,
it's like: "you scratch my back and I
scratch yours for publication".
Hanno: Maybe two more minutes?
Herald: One minute.
Please go ahead.
Mic: Yeah, hi, thank you for your talk. I'm
curious: so you've raised, you know, ways
we can address this assuming good actors,
assuming people who want to do better
science, where this happens out of ignorance
or willful ignorance. What do we do about
bad actors? So for example in the medical
community, drug companies: maybe they
really like the idea of being profitably
incentivized by these randomized controlled
trials to make essentially a placebo
do something. How do we begin to address
them currently trying to maliciously p-hack
or maliciously abuse the pre-registration
system or something like that?
Hanno: I mean it's a big question, right?
But I think if the standards are
confining you so much that there's not
much room to cheat, that's a way out, right?
And also, I don't think
deliberate cheating is that much of a
problem; I actually think the
bigger problem is people honestly
believing that what they do is true.
Herald: Okay one last, you sir, please?
Mic: So the value in science is often a
count of publications, right? A count of
citations and so on. So is it true that,
to improve this situation you've
described, it's the journals who should
impose higher standards, the journals are
the ones who must raise the
bar, they should enforce publication of
protocols before accepting, etc.
etc.? So is it the journals who should, like,
do the work on that, or can we regular
scientists do something also?
Hanno: I mean, you
can publish in the journals that have
better standards, right? There are
journals that have these registered
reports, but of course, I mean, as a single
scientist it's always difficult because
you're playing in a system that has all
these wrong incentives.
Herald: Okay guys that's it, we have to
shut down. Please. There is a reference,
better science dot-org, go there, and one
last request: give a really warm applause!
Applause
34c3 outro
subtitles created by c3subtitles.de
in the year 2018. Join, and help us!