Music
Herald: The next talk is about how risky
the software you use is. You may have heard
about Trump versus a Russian security
company. We won't judge this, we won't
comment on it, but we dislike the
prejudgments in this case. Tim Carstens
and Parker Thompson will tell you a little
bit more about how risky the software you
use is. Tim Carstens is CITL's Acting
Director and Parker Thompson is CITL's
lead engineer. Please welcome with a very,
very warm applause: Tim and Parker!
Thanks.
Applause
Tim Carstens: Howdy, howdy. So my name is
Tim Carstens. I'm the acting director of
the Cyber Independent Testing Lab. It's
four words there; we'll talk about all four
today, especially cyber. With me today is
our lead engineer Parker Thompson. Not on
stage are our other collaborators: Patrick
Stach, Sarah Zatko, and, present in the
room but not on stage, Mudge. So today
we're going to be talking about our work.
The lead-in, the introduction that was
given, was phrased in terms of Kaspersky and
all of that; I'm not gonna be speaking
about Kaspersky and I guarantee you I'm
not gonna be speaking about my president.
Right, yeah? Okay. Thank you.
Applause
All right, so why don't we go ahead and
kick off. I'll mention now: parts of this
presentation are going to be quite
technical. Not most of it, and I will
always include analogies and all these
other things if you are here in security
but are not a bit-twiddler. But if you do want
to be able to review some of the technical
material - if I go through it too fast, if
you like to read, if you're a mathematician
or a computer scientist - our slides
are already available for download at this
site here. We thank our pals, our partners,
for getting that set up for
us. Let's get started on the real
material here. Alright, so we are CITL: a
nonprofit organization based in the United
States founded by our chief scientist
Sarah Zatko and our board chair Mudge. And
our mission is a public good mission - we
are hackers but our mission here is
actually to look out for people who do not
know very much about machines
or as much as the other hackers do.
Specifically, we seek to improve the state
of software security by providing the
public with accurate reporting on the
security of popular software, right? And
so that was a mouthful for you. But no
doubt, no doubt, every single one of you
has received questions of the form: what
do I run on my phone, what do I do with
this, what do I do with that, how do I
protect myself - all these other things
lots of people in the general public
looking for agency in computing. No one's
offering it to them, and so we're trying
to go ahead and provide a forcing function
on the software field in order to, you
know, again be able to enable consumers
and users and all these things. Our social
good work is funded largely by charitable
monies from the Ford Foundation whom we
thank a great deal, but we also have major
partnerships with Consumer Reports, which
is a major organization in the United
States that generally, broadly, looks at
consumer goods for safety and performance.
We are also partners in The Digital
Standard, which probably would be of great
interest to many people here at Congress
as it is a holistic standard for
protecting user rights. We'll talk about
some of the work that goes into those
things here in a bit, but first I want to
give the big picture of what it is we're
really trying to do in one short
little sentence. Something like this but
for security, right? What are the
important facts, how does it rate, you
know, is it easy to consume, is it easy to
go ahead and look and say: this thing is
good, this thing is not good. Something
like this, but for software security.
Sounds hard doesn't it? So I want to talk
a little bit about what I mean by
something like this.
There are lots of consumer advocacy and
watchdog and protection groups - some
private, some government - which are
looking to do this for various things that
are not software security. And you can
see some examples here that are big in the
United States - I happen to not like these
as much as some of the newer consumer
labels coming out from the EU. But
nonetheless they are examples of the kinds
of things people have done in other
fields, fields that are not security to
try to achieve that same end. And when
these things work well, it is for three
reasons: One, it has to contain the
relevant information. Two: it has to be
based in fact, we're not talking opinions,
this is not a book club or something like
that. And three: it has to be actionable,
it has to be actionable - you have to be
able to know how to make a decision based
on it. How do you do that for software
security? How do you do that for
software security? So the rest of the talk
is going to go in three parts.
First, we're going to give a bit of an
overview of more of the consumer-facing
side of things that we do: look at
some data that we have reported on earlier,
and all these other kinds of good things.
We're then going to go ahead and get
terrifyingly, terrifyingly technical. And
then after that we'll talk about tools to
actually implement all this stuff. The
technical part comes before the tools - so
that tells you how terrifyingly
technical we're gonna get. It's gonna be
fun, right? So how do you do this for
software security: a consumer version. So,
if you set forth to the task of trying to
measure software security, right, many
people here probably do work in the
security field perhaps as consultants
doing reviews; certainly I used to. Then
probably what you're thinking to yourself
right now is that there are lots and lots
and lots and lots of things that affect
the security of a piece of software. Some
of which are, mmm, you're only gonna see
them if you go reversing. And some of
which are just you know kicking around on
the ground waiting for you to notice,
right. So we're going to talk about both
of those kinds of things that you might
measure. But here you see these giant
charts. On the left we have Microsoft Excel
on OS X; on the right, Google Chrome for OS
X. This is a couple years old at this point,
maybe one and a half years old. I'm not
expecting you to be able to
read these - the real point is to say: look
at all of the different things you can
measure very easily.
How do you distill it, how do you boil it
down, right? So this is the opposite of
a good consumer safety label. This is,
um, if you've ever done any consulting, the
kind of report you hand a client to
tell them how good their software is,
right? It's the opposite of consumer
grade. But the reason I'm showing it here
is because, you know, I'm gonna call out
some things, and maybe you can't process
all of this because it's too much
material, you know. But I'm gonna call out
some things, and once I call them out, just
like an NP problem, you're gonna recognize them
instantly. So for example, Excel, at the
time of this review - look at this column
of dots. What are these dots telling you?
They're telling you: look at all these
libraries - all of them are 32-bit only.
Not 64 bits, not 64 bits. Take a look at
Chrome - exact opposite, exact opposite:
64-bit binary, right? What are some other
things? Excel, again, on OS X: maybe you can
see these danger warning signs that go
straight up the whole thing.
That's the absence of major heap
protection flags in the binary headers.
We'll talk about what that means
exactly in a bit. But also, if you hop over
here, you'll see - yeah, yeah, yeah -
Chrome has all the different heap
protections that a binary might enable, on
OS X that is, but it also has more dots in
this column here off to the right. And
what do those dots represent?
Those dots represent functions - functions
that historically have been a source of
trouble, functions that are
very hard to call correctly. If you're a
C programmer, the "gets" function is a good
example, but there are lots of them. And
you can see here that Chrome doesn't mind:
it uses them all a bunch. And Excel, not so
much. And if you know the history of
Microsoft and the trusted computing
initiative and the SDL and all of that, you
will know that a very long time ago
Microsoft made a decision and said:
we're gonna start purging some of these
risky functions from our code bases,
because we think it's easier to ban them
than to teach our devs to use them correctly.
And you see that reverberating out in
their software. Google, on the other hand,
says: yeah, yeah, yeah, those functions can be
dangerous to use, but if you know how to
use them they can be very good, and so
they're permitted. The point all of this
is building to is that if you start by
just measuring every little thing that
like your static analyzers can detect in a
piece of software, two things happen. One: you
wind up with way more data than you can
show in a slide. And two: the engineering
process, the software development life
cycle that went into the software, will
leave behind artifacts that tell you
something about the decisions that went
into designing that engineering process.
And so, you know, Google, for example, is
quite rigorous as far as hitting, you know,
"GCC, dash, enable all of the
compiler protections." Microsoft may be
less good at that, but much more rigorous in
things that were very popular ideas when
they introduced trusted computing,
alright. So the big takeaway from this
material is that again the software
engineering process results in artifacts
in the software that people can find.
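As an illustration of how such an artifact can be detected automatically - this is a minimal sketch using the pyelftools library, not CITL's actual tooling, and the banned-function list here is a tiny made-up subset - scanning an ELF binary's symbol tables for historically risky libc imports might look like this:

```python
from elftools.elf.elffile import ELFFile
from elftools.elf.sections import SymbolTableSection

# Illustrative subset only; real banned-function lists are much longer.
RISKY_FUNCTIONS = {"gets", "strcpy", "strcat", "sprintf", "vsprintf", "scanf"}

def risky_imports(path):
    """Return the risky libc functions referenced in the binary's symbol tables."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        found = set()
        for section in elf.iter_sections():
            if isinstance(section, SymbolTableSection):  # .symtab / .dynsym
                found.update(sym.name for sym in section.iter_symbols()
                             if sym.name in RISKY_FUNCTIONS)
        return found

# Example: risky_imports("/bin/ls") -> set of risky function names it references
```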
Alright. Okay, so that's a whole
bunch of data; certainly it's not a
consumer-friendly label. So how do you
start to get in towards the consumer zone?
Well, the main defect of the big reports
that we just saw is that it's too much
information. It's very dense on data, but
it's very hard to distill it to the "so
what" of it, right?
And so this here is one of our earlier
attempts to go ahead and do that
distillation. What are these charts? How
did we come up with them? Well, on the
previous slide we saw all these
different factors that you can analyze in
software; basically, here's how
we arrive at this. For each of those
things: pick a weight. Go ahead and
compute a score, average against the
weights: tada, now you have some number.
You can do that for each of the libraries
in the piece of software. And if you do
that for each of the libraries in the
software, you can then go ahead and produce
these histograms to show, you know:
this percentage of the DLLs had a score in
this range - boom, there's a bar, right.
How do you pick those weights? We'll talk
about that in a sec - it's very technical.
But the takeaway, though, is
that you wind up with these charts. Now
I've obscured the labels, I've obscured
the labels and the reason I've done that
is because I don't really care that much
about the actual counts. I want to talk
about the shapes, the shapes of these
charts: it's a qualitative thing.
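As a rough sketch of the "pick a weight, compute a score, build a histogram" step just described - the observable names, weights, and bucket count here are hypothetical, not CITL's actual model - the computation looks something like this:

```python
from collections import Counter

# Hypothetical per-observable weights; each observable score is assumed to be in [0, 1].
WEIGHTS = {"aslr": 3.0, "dep": 2.0, "relro": 1.5,
           "stack_guard": 2.5, "avoids_risky_functions": 1.0}

def library_score(observables):
    """Weighted average of a library's observable scores."""
    total_weight = sum(WEIGHTS.values())
    return sum(w * observables.get(name, 0.0) for name, w in WEIGHTS.items()) / total_weight

def score_histogram(libraries, bins=10):
    """One bucket count per score range; each bar of the charts is one bucket."""
    counts = Counter()
    for observables in libraries:
        bucket = min(int(library_score(observables) * bins), bins - 1)
        counts[bucket] += 1
    return [counts[b] for b in range(bins)]
```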
So here: good scores appear on the right,
bad scores appear on the left. The
histogram measures all the libraries and
components, and so a very secure piece of
software in this model manifests as a tall
bar far to the right. And you can see a
clear example in our custom Gentoo
build. Anyone here who is a Gentoo fan knows:
hey, I'm going to install this thing, I
think I'm going to go ahead and turn on
every single one of those flags - and lo
and behold, if you do that, yeah, you wind up
with a tall bar far to the right. Here is
Ubuntu 16 - I bet it's 16.04 but I don't
recall exactly, 16 LTS. Here you see a lot
of tall bars to the right - not quite as
consolidated as a custom Gentoo build, but
that makes sense, doesn't it, right? Because,
you know, you don't compile your whole
Ubuntu build yourself. Now I want to contrast. I
want to contrast. So over here on the
right we see in the same model, an
analysis of the firmware obtained from two
smart televisions. Last year's models from
Samsung and LG. And here are the model
numbers. We did this work in concert with
Consumer Reports. And what do you notice
about these histograms, right. Are the
bars tall and to the right? No, they look
almost normal, not quite, but that doesn't
really matter. The main thing that matters
is that this is the shape you would expect
to get if you were playing a random game
basically to decide what security features
to enable in your software. This is the
shape of not having a security program, is
my bet. That's my bet. And so what do you
see? You see heavy concentration here in
the middle, right, that seems fair, and
like it tails off. On the Samsung nothing
scored all that great, same on the LG.
Both of them are you know running their
respective operating systems and they're
basically just inheriting whatever
security came from whatever open source
thing they forked, right.
So this is the kind of message,
this right here is the kind of thing that
we exist for. This is us
producing charts showing that the current
practices in the not-so consumer-friendly
space of running your own Linux distros
far exceed the products being delivered,
certainly in this case in the smart TV
market. But I think you might agree with
me, it's much worse than this. So let's
dig into that a little bit more, I have a
different point that I want to make about
that same data set. So this table here
is again looking at the LG,
Samsung, and Gentoo Linux installations.
And on this table we're just pulling out
some of the easy to identify security
features you might enable in a binary
right. So percentage of binaries with
address space layout randomization, right?
Let's talk about that on our Gentoo build
it's over 99%. That also holds for the
Amazon Linux AMI - it holds in Ubuntu.
ASLR is incredibly common in modern Linux.
And despite that, fewer than 70 percent of
the binaries on the LG television had it
enabled. Fewer than 70 percent. And the
Samsung was doing, you know, better than
that, I guess, but 80 percent is pretty
disappointing when a default
install of a mainstream Linux distro
is going to get you 99, right? And it only
gets worse, it only gets worse right you
know?
RELRO support: if you don't know what that
is, that's okay, but if you do, look at this
abysmal coverage
coming out of these IoT devices. Very sad.
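For concreteness, checking two of the observables in this table - ASLR support (a position-independent executable) and RELRO - for a single ELF binary can be done roughly like this. This is a simplified sketch using the pyelftools library, not CITL's tooling:

```python
from elftools.elf.elffile import ELFFile

def aslr_and_relro(path):
    with open(path, "rb") as f:
        elf = ELFFile(f)
        # ET_DYN means the image is relocatable (PIE executable or shared
        # library), which is what lets ASLR randomize its load address.
        pie = elf.header["e_type"] == "ET_DYN"
        # A PT_GNU_RELRO segment marks relocation data to be mapped read-only.
        relro = any(seg["p_type"] == "PT_GNU_RELRO" for seg in elf.iter_segments())
        # Full RELRO additionally binds all symbols at load time; a real check
        # would also inspect DT_FLAGS / DT_FLAGS_1 for the BIND_NOW flag.
        dynamic = elf.get_section_by_name(".dynamic")
        bind_now = dynamic is not None and any(
            tag.entry.d_tag == "DT_BIND_NOW" for tag in dynamic.iter_tags())
        return {"pie": pie, "relro": relro, "full_relro_hint": relro and bind_now}
```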
And you see it over and over and over
again. I'm showing this because some
people in this room or watching this video
ship software - and I have a message, I
have a message to those people who ship
software who aren't working on say Chrome
or any of the other big-name Pwn2Own kinds
of targets. Look at this: you can be
leading the pack by mastering the
fundamentals. You can be leading the pack
by mastering the fundamentals. This is a
point that we, as a security field,
really need to be driving home. You know,
one of the things that we're seeing here
in our data is that if you're the vendor
who is shipping the product everyone has
heard of in the security field, then maybe
your game is pretty decent, right? - if
you're shipping, say, Windows, or if you're
shipping Firefox or whatever. But if
you're doing one of these things
where people are just kind of beating you
up for default passwords, then your
problems go way further than just default
passwords, right? Like the house: the
house is messy, it needs to be cleaned,
needs to be cleaned. So the rest of the
talk, like I said, we're going to be
discussing a lot of other things that
amount to getting, you know, a peek behind
the curtain - where some of these things
come from - and getting very specific about
how this business works. But if you're
interested in more of the high-level
material - especially if you're interested
in interesting results and insights, some
of which I'm going to have here later -
I really encourage you to take a
look at the talk from this past summer by
our chief scientist Sarah Zatko, which is
predominantly on the topic of surprising
results in the data.
Today, though, this being our first time
presenting here in Europe, we figured we
would take more of an overarching kind of
view. What we're doing and why we're
excited about it and where it's headed. So
we're about to move into a little bit of
the underlying theory, you know. Why do I
think it's reasonable to even try to
measure the security of software from a
technical perspective. But before we can
get into that I need to talk a little bit
about our goals, so that the decisions and
the theory - the motivation - are clear,
right. Our goals are really simple; it's a
very easy organization to run because of
that. Goal number one: remain independent
of vendor influence. We are not the first
organization to purport to be looking out
for the consumer. But unlike many of our
predecessors, we are not taking money from
the people we review, right? Seems like
some basic stuff. Seems like some basic
stuff right? Thank you, okay.
Two: automated, comparable, quantitative
analysis. Why automated? Well, we need our
test results to be reproducible. "Tim
goes in, opens up your software in IDA, and
finds a bunch of stuff that appalls
him" - that's not a very repeatable kind
of a standard for things. And so we're
interested in things which are automated.
We'll talk about that - maybe a few hackers in
here know how hard that is. And then, last, we are,
well, acting as a watchdog:
protecting the interests of the user, the
consumer, however you would like to look
at it. But we also have three non-goals,
three non-goals that are equally
important. One: we have a non-goal of
finding and disclosing vulnerabilities. I
reserve the right to find and disclose
vulnerabilities. But that's not my goal,
it's not my goal. Another non-goal is to
tell software vendors what to do. If a
vendor asks me how to remediate their
terrible score, I will tell them what we
are measuring but I'm not there to help
them remediate. It's on them to be able to
ship a secure product without me holding
their hand. We'll see. And then three:
non-goal, perform free security testing
for vendors. Our testing happens after you
release. Because when you release your
software you are telling people it is
ready to be used. Is it really though, is
it really though, right?
Applause
Yeah, thank you. Yeah, so we are not there
to give you a preview of what your score
will be. There is no sum of money you can
hand me that will get you an early preview
of what your score is - you can try me,
you can try me: there's a fee for trying
me. There's a fee for trying me. But I'm
not gonna look at your stuff until I'm
ready to drop it, right. Yeah - you're welcome - yeah.
All right. So moving into this theory
territory. Three big questions, three big
questions that need to be addressed if you
want to do our work efficiently. One: what
works, what works for improving security -
what are the things that you need, or
really want, to see in software. Two: how
do you recognize when it's being done?
It's no good if someone hands you a piece
of software and says, "I've done all the
latest things" and it's a complete black
box. If you can't check the claim, the
claim is as good as false, in practical
terms, period, right. Software has to be
reviewable, or, a priori, I'll think you're
full of it. And then three: who's doing it
- of all the things that work, that you
can recognize, who's actually doing it.
You know, let's go ahead - our field is
famous for ruining people's holidays and
weekends over Friday bug disclosures, you
know New Year's Eve bug disclosures. I
would like us to also be famous for
calling out those teams and those software
organizations which are being as good as
the bad guys are being bad, yeah? So,
give people an incentive to maybe be
happy to see us for a change, right. Okay,
so thank you. Yeah, all right. So how do
we actually pull these things off; the
basic idea. So, I'm going to get into some
deeper theory: if you're not a theorist I
want you to focus on this slide.
And I'm gonna bring it back, it's not all
theory from here on out after this but if
you're not a theorist I really want you to
focus on this slide. The basic motivation,
the basic motivation behind what we're
doing; the technical motivation - why we
think that it's possible to measure and
report on security. It all boils down to
this right. So we start with a thought
experiment, a Gedankenexperiment, right? Given a
piece of software we can ask: overall, how
secure is it? Kind of a vague question but
you could imagine you know there's
versions of that question. And two: what
are its vulnerabilities. Maybe you want to
nitpick with me about what the word
vulnerability means but broadly you know
this is a much more specific question
right. And here's the enticing
thing: the first question appears to ask
for less information than the second
question. And maybe if we were taking bets
I would put my money on, yes, it actually
does ask for less information. What do I
mean by that what do I mean by that? Well,
let's say that someone told you all of the
vulnerabilities in a system right? They
said, "Hey, I got them all", right? You're
like all right that's cool, that's cool.
And if someone asks you hey how secure is
this system you can give them a very
precise answer. You can say it has N
vulnerabilities, and they're of this kind
and like all this stuff right so certainly
the second question is enough to answer
the first. But, is the reverse true?
Namely, if someone were to tell you, for
example, "hey, this piece of software has
exactly 32 vulnerabilities in it." Does
that make it easier to find any of them?
Right, there's room to maybe do that
using some algorithms that are not yet in
existence.
Certainly the computer scientists in here
are saying, "well, you know, yeah maybe
counting the number of SAT solutions
doesn't help you practically find
solutions. But it might and we just don't
know." Okay fine fine fine. Maybe these
things are the same, but my experience
in security, and the experience of many
others perhaps is that they probably
aren't the same question. And this
motivates what I'm calling here Zatko's
question, which is basically asking for an
algorithm that demonstrates that the first
question is easier than the second
question, right. So, Zatko's question:
develop a heuristic which can
efficiently answer question one, but not
necessarily question two. If you're looking for a
metaphor, if you want to know why I care
about this distinction, I want you to
think about some certain controversial
technologies: maybe think about say
nuclear technology, right. An algorithm
that answers one, but not two, is a very
safe algorithm to publish. A very safe
algorithm to publish indeed. Okay, Claude
Shannon would like more information - happy
to oblige. Let's take a look at this
question from a different perspective
maybe a more hands-on perspective: the
hacker perspective, right? If you're a
hacker and you're watching me up here and
I'm waving my hands around and I'm showing
you charts maybe you're thinking to
yourself yeah boy, what do you got? Right,
how does this actually go. And maybe what
you're thinking to yourself is that, you
know, finding good vulns: that's an
artisan craft, right? You're in IDA, you
know, you're reversing all day, you're doing
all these things - I don't
know, all that stuff. And like, you know,
this kind of clever game; cleverness is
not like this thing that feels very
automatable. But you know on the other
hand there are a lot of tools that do
automate things and so it's not completely
not automatable.
And if you're into fuzzing then perhaps
you are aware of this very simple
observation, which is that if your harness
is perfect, if you really know what you're
doing, if you have a decent fuzzer, then in
principle fuzzing can find every single
problem. You have to be able to look for
it, you have to be able to harness for it, but
in principle it will, right. So the hacker
perspective on Zatko's question is maybe
of two minds: on the one hand, assessing
security is a game of cleverness; but on
the other hand, we're kind of right now at
the cusp of having some game-changing tech
really go. Maybe you're saying
fuzzing is not at the cusp - I promise, it's
just at the cusp. We haven't seen all that
fuzzing has to offer, right, and so maybe
there's room maybe there's room for some
automation to be possible in pursuit of
Zatko's question. Of course, there are
many challenges still in, you know, using
existing hacker technology. Mostly of the
form of various open questions. For
example if you're into fuzzing, you know,
hey: identifying unique crashes. There's
an open question. We'll talk about some of
those, we'll talk about some of those. But
I'm going to offer another perspective
here: so maybe you're not in the business
of doing software reviews but you know a
little computer science. And maybe that
computer science has you wondering what's
this guy talking about, right. I'm here to
acknowledge that. So whatever you think
the word security means: I've got a list
of questions up here. Whatever you think
the word security means, probably, some of
these questions are relevant to your
definition. Right.
Does the software have a hidden backdoor
or any kind of hidden functionality, does
it handle crypto material correctly, etc,
so forth. Anyone in here who knows some
computability theory knows that
every single one of these questions, and
many others like them, is undecidable due
to reasons essentially no different than
the reason the halting problem is
undecidable - which is to say, due to
reasons essentially first identified and
studied by Alan Turing a long time before
we had microarchitectures and all these
other things. And so the computability
perspective says that, you know, whatever
your definition of security is, ultimately
you have this recognizability problem - a
fancy way of saying that algorithms won't
be able to recognize secure software,
because of these undecidability
issues. The takeaway, the takeaway, is that
the computability angle on all of this
says: anyone who's in the business that
we're in has to use heuristics. You have
to, you have to.
All right, this guy gets it. All right, so
on the tech side our last technical
perspective that we're going to take now
is certainly the most abstract which is
the Bayesian perspective, right. So if
you're a frequentist, you need to get with
the times, you know - it's all
Bayesian now. So, let's talk about this
for a bit. Only two slides of math, I
promise, only two! So, let's say that I
have some corpus of software. Perhaps it's
a collection of all modern browsers,
perhaps it's the collection of all the
packages in the Debian repository, perhaps
it's everything on github that builds on
this system, perhaps it's a hard drive
full of warez that some guy mailed you,
right? You have some corpus of software
and for a random program in that corpus we
can consider this probability: the
probability distribution of which software
is secure versus which is not. For reasons
described in the computability
perspective, this number is not a
computable number for any reasonable
definition of security. So that's neat,
and so, in practical terms, if you want
to do some probabilistic reasoning, you
need some surrogate for it, and so we
consider this here. So, instead of
considering the probability that a piece
of software is secure, a non computable
non verifiable claim, we take a look here
at this indexed collection of
probabilities. This is an infinite
countable family of probability
distributions, basically P sub h,k is just
the probability that for a random piece of
software in the corpus, h work units of
fuzzing will find no more than k unique
crashes, right. And why is this relevant?
Well, at the bottom we have this analytic
observation, which is that in the limit as
h goes to infinity you're basically
saying: "Hey, you know, if I fuzz this
thing for infinity times, you know, what's
that look like?" And, essentially, here we
have analytically that this should
converge. The P sub h,1 should converge to
the probability that a piece of software
just simply cannot be made to crash. Not
the same thing as being secure, but
certainly not a small concern relevant to
security. So, none of that stuff actually
was Bayesian yet, so we need to get there.
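Written out - this is a reconstruction from the spoken description above, not the slide itself - the family of distributions is:

```latex
% For a random program s drawn from the corpus S:
P_{h,k} \;=\; \Pr_{s \in S}\!\left[\, h \text{ work units of fuzzing } s \text{ find at most } k \text{ unique crashes} \,\right]

% The analytic observation: with an unbounded fuzzing budget and a threshold of
% one unique crash, this converges to the probability that s cannot be made to crash.
\lim_{h \to \infty} P_{h,1} \;=\; \Pr_{s \in S}\!\left[\, s \text{ cannot be made to crash} \,\right]
```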
And so here we go, right: so, the previous
slide described a probability distribution
measured based on fuzzing. But fuzzing is
expensive and it is also not an answer to
Zatko's question because it finds
vulnerabilities, it doesn't measure
security in the general sense and so
here's where we make the jump to
conditional probabilities. Let M be some
observable property of software: has ASLR,
has RELRO, calls these functions, doesn't
call those functions... take your pick.
For random s in S we now consider these
conditional probability distributions. This
is the same kind of probability as we
had on the previous slide, but conditioned
on this observable being true, and this
leads to the refined, CITL
variant of Zatko's question:
which observable properties of software
satisfy that, when the software has
property M, the probability of fuzzing
being hard is very high? That's what this
version of the question asks, and here
we want that to hold even for large log(h)/k - in
other words, with exponentially more fuzzing
than you would expect to need to find bugs. So this is
the technical version of what we're after.
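In the same notation (again a reconstruction of the spoken description), the conditional version and the criterion are:

```latex
% Condition on an observable property M (has ASLR, has RELRO, calls or avoids
% certain functions, ...):
P_{h,k \mid M} \;=\; \Pr_{s \in S}\!\left[\, h \text{ work units of fuzzing } s \text{ find at most } k \text{ unique crashes} \;\middle|\; s \text{ has property } M \,\right]

% The CITL variant of Zatko's question: find observables M such that
% P_{h,k | M} remains close to 1 even when \log(h)/k is large, i.e. with
% exponentially more fuzzing than one would expect to need to find bugs.
```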
All of this can be explored, you can
brute-force your way to finding all of
this stuff, and that's exactly what we're
doing. So we're looking for all kinds of
things, we're looking for all kinds of
things that correlate with fuzzing having
low yield on a piece of software, and
there's a lot of ways in which that can
happen. It could be that you are looking
at a feature of software that literally
prevents crashes. Maybe it's the never
crash flag, I don't know. But most of the
things I've talked about - ASLR, RELRO, etc. -
don't prevent crashes. In fact, ASLR can
take non-crashing programs and make them
crash. It's the number one reason
vendors don't enable it, right? So why am
I talking about ASLR? Why am I talking
about RELRO? Why am I talking about all
these things that have nothing to do with
stopping crashes and I'm claiming I'm
measuring crashes? This is because, in the
Bayesian perspective, correlation is not
the same thing as causation, right?
Correlation is not the same thing as
causation. It could be that M's presence
literally prevents crashes, but it could
also be that, by some underlying
coincidence, the things we're looking for
are mostly only found in software that's
robust against crashing.
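On the empirical side, the "brute-force" search for such correlations boils down to tabulating fuzzing outcomes against observables. A toy sketch - the record format and crash threshold are hypothetical, not CITL's model:

```python
def low_yield_rate_by_observable(corpus, max_crashes=1):
    """corpus: one record per program, e.g.
         {"observables": {"aslr": True, "relro": False}, "unique_crashes": 3}
       where unique_crashes was measured under some fixed fuzzing budget h.
       Returns, per observable, the fraction of programs carrying it whose
       fuzzing run stayed at or below `max_crashes` unique crashes - an
       empirical stand-in for the conditional probability P_{h,k|M}."""
    tallies = {}  # observable name -> (low_yield_count, total_count)
    for record in corpus:
        low_yield = record["unique_crashes"] <= max_crashes
        for name, present in record["observables"].items():
            if present:
                hits, total = tallies.get(name, (0, 0))
                tallies[name] = (hits + int(low_yield), total + 1)
    return {name: hits / total for name, (hits, total) in tallies.items()}
```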
If you're looking for security, I submit
to you that the difference doesn't matter.
Okay, end of my math - thank you. I will now go
ahead and do this really nice analogy
for all those things that I just described,
right. So we're looking for indicators of
a piece of software being secure enough to
be good for consumers, right. So here's an
analogy. Let's say you're a geologist, you
study minerals and all of that and you're
looking for diamonds. Who isn't, right?
Want those diamonds! And like how do you
find diamonds? Even in places that are
rich in diamonds, diamonds are not common.
You don't just go walking around in your
boots, kicking until your toe stubs on a
diamond. You don't do that. Instead, you
look for other minerals that are mostly
only found near diamonds but are much more
abundant in those locations than the
diamonds. So, this is mineral science 101,
I guess, I don't know. So, for example,
you want to go find diamonds: put on your
boots and go kicking until you find some
chromite, look for some diopside, you
know, look for some garnet. None of these
things turn into diamonds, none of these
things cause diamonds but if you're
finding good concentrations of these
things, then, statistically, there's
probably diamonds nearby. That's what
we're doing. We're not looking for the
things that cause good security per se.
Rather, we're looking for the indicators
that you have put the effort into your
software, right? How's that working out
for us? How's that working out for us?
Well, we're still doing studies. It's, you
know, early to say exactly but we do have
the following interesting coincidence. And
so, presented here, I have a collection of
prices that people pay for so-
called underground exploits. And I can
tell you these prices are maybe a little
low these days, but if you work in that
business, if you go to SyScan, if you do
that kind of stuff, maybe you know that
this is ballpark, it's ballpark.
Alright, and, just a coincidence, maybe it
means we're on the right track, I don't
know, but it's an encouraging sign: When
we run these programs through our
analysis, our rankings more or less
correspond to the actual prices that you
encounter in the wild for access via these
applications. Up above, I have one of our
histogram charts. You can see here that
Chrome and Edge in this particular model
scored very close to the same and it's a
test model, so, let's say they're
basically the same.
Firefox, you know, behind there a little
bit. I don't have Safari on this chart,
because these are all Windows applications,
but the Safari score falls in between.
So, lots of theory, lots of theory, lots
of theory and then we have this. So, we're
going to go ahead now and hand off to our
lead engineer, Parker, who is going to
talk about some of the concrete stuff, the
non-chalkboard stuff, the software stuff
that actually makes this work.
Thompson: Yeah, so I want to talk about
the process of actually doing it. Building
the tooling that's required to collect
these observables. Effectively: how do you
go mining for indicator
minerals? But first, the progression of
where we are and where we're going. We
initially broke this out into three major
tracks of our technology. We have our
static analysis engine, which started as a
prototype, and we have now recently
completed a much more mature and solid
engine that's allowing us to be much more
extensible, to dig deeper into
programs, and to provide much deeper
observables. Then, we have the data
collection and data reporting. Tim showed
some of our early stabs at this. We're
right now in the process of building new
engines to make the data more accessible
and easy to work with and hopefully more
of that will be available soon. Finally,
we have our fuzzer track. We needed to get
some early data, so we played with some
existing off-the-shelf fuzzers, including
AFL, and, while that was fun,
unfortunately it's a lot of work to
manually instrument a lot of fuzzers for
hundreds of binaries.
So, we then built an automated solution
that started to get us closer to having a
fuzzing harness that could autogenerate
itself, depending on the software, the
software's behavior. But, right now,
unfortunately that technology showed us
more deficiencies than it showed
successes. So, we are now working on a
much more mature fuzzer that will allow us
to dig deeper into programs as we're
running and collect very specific things
that we need for our model and our
analysis. But on to our analytic pipeline
today. This is one of the most concrete
components of our engine and one of the
most fun!
We effectively wanted some type of
software hopper, where you could just pour
programs in - installers - and then, out the
other end, come reports: fully annotated
and actionable information that we can
present to people. So, we went about the
process of building a large-scale engine.
It starts off with a simple REST API,
where we can push software in, which then
gets moved over to our computation cluster
that effectively provides us a fabric to
work with. It is made up of a lot of
different software suites, starting off
with our data processing, which is done by
Apache Spark, and then moves over into
data handling and data analysis in Spark,
and then we have a common HDFS layer to
provide a place for the data to be stored,
and then a resource manager in YARN. All
of that is backed by our compute and data
nodes, which scale out linearly. That then
moves into our data science engine, which
is effectively Spark with Apache Zeppelin,
which provides us a really fun interface
where we can work with the data in an
interactive manner but be kicking off
large-scale jobs into the cluster. And
finally, this goes into our report
generation engine. What this bought us
was the ability to linearly scale and make
that hopper bigger and bigger as we need,
but also provide us a way to process data
that doesn't fit in a single machine's
RAM. You can push the instance sizes as
large as you want, but we have
datasets that blow away any single host's
RAM. So this allows us to work with
really large collections of observables.
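To make the data-science layer concrete, a job of the kind described here - load per-binary observables and roll them up per product - might look like the following PySpark sketch; the schema and HDFS path are assumptions for illustration, not CITL's actual data model:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("observable-rollup").getOrCreate()

# One JSON record per analyzed binary, as emitted by the static-analysis stage.
observables = spark.read.json("hdfs:///data/observables/")  # hypothetical path

per_product = (
    observables.groupBy("product")
    .agg(
        F.count("*").alias("binaries"),
        F.avg(F.col("aslr").cast("double")).alias("aslr_fraction"),
        F.avg(F.col("relro").cast("double")).alias("relro_fraction"),
    )
)
per_product.show()
```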
I want to dive down now into our actual
static analysis. But first we have to
explore the problem space, because it's a
nasty one. Effectively, CITL's mission
is to process as much software as
possible. Hopefully all of it, but it's
hard to get your hands on all the binaries
that are out there. When you start to look
at that problem, you understand there are a
lot of combinations: there are a lot of CPU
architectures, a lot of operating
systems, a lot of file formats,
a lot of environments the software
gets deployed into, and every single one
of them has its own app-
armoring features. And a feature can be
available for one combination
but not for another, and you don't want to
penalize a developer for not turning on a
feature they never had access to turn
on. So, effectively, we need to solve this
in a much more generic way. And so what we
did is our static analysis engine
effectively looks like a gigantic
collection of abstraction libraries to
handle binary programs. You take in some
type of input file - be it ELF, PE, or Mach-O -
and then the pipeline splits. It goes off
into two major analyzer classes. First, our
format analyzers, which look at the
software much like how a linker or loader
would look at it: we want to understand how
it's going to be loaded up, what type of
armoring features are going to be applied, and
then we can run analyzers over that. In
order to achieve that, we need abstraction
libraries that can provide us an abstract
memory map, a symbol resolver, generic
section properties. So all that feeds in,
and then we run over a collection of
analyzers to collect data and observables.
Next we have our code analyzers, these are
the analyzers that run over the code
itself. I need to be able to look at every
possible executable path. In order to do
that we need to do function discovery,
feed that into a control flow recovery
engine, and then as a post-processing step
dig through all of the possible metadata
in the software, such as like a switch
table, or something like that to get even
deeper into the software. Then this
provides us a basic list of basic blocks,
functions, instruction ranges. And does so
in an efficient manner so we can process a
lot of software as it goes. Then all that
gets fed over into the main modular
analyzers. Finally, all of this comes
together and gets put into a gigantic blob
of observables and fed up to the pipeline.
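The "collection of modular analyzers" pattern described here can be pictured roughly as follows; the interface, the `binary` abstraction (a header dict plus an imported-symbol list), and the two example analyzers are illustrative assumptions, not CITL's actual engine:

```python
from abc import ABC, abstractmethod

class Analyzer(ABC):
    """A format or code analyzer: consumes an abstracted binary, emits observables."""
    @abstractmethod
    def run(self, binary) -> dict: ...

class AslrAnalyzer(Analyzer):
    def run(self, binary):
        # Assumes the abstraction layer exposes parsed header flags as a dict.
        return {"aslr": bool(binary.header.get("pie", False))}

class RiskyCallAnalyzer(Analyzer):
    RISKY = {"gets", "strcpy", "sprintf"}
    def run(self, binary):
        # Assumes the abstraction layer exposes the imported symbol names.
        return {"risky_calls": sorted(self.RISKY & set(binary.imported_symbols))}

def analyze(binary, analyzers):
    """Run every analyzer and merge their observables into one blob."""
    observables = {}
    for analyzer in analyzers:
        observables.update(analyzer.run(binary))
    return observables
```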
We really want to thank the Ford
Foundation for supporting our work in
this, because the pipeline and the static
analysis have been a massive boon for our
project and we're only beginning now to
really get our engine running and we're
having a great time with it. So, digging
into the observables themselves: what are
we looking at? Let's break them apart.
So, the format structure components: things
like ASLR, DEP, RELRO -
basic app armoring that is
going to be enabled at the OS
layer when the software gets loaded up or linked.
And we also collect other metadata about
the program, such as: "What libraries
are linked in?", "What does its dependency
tree look like - completely?", "How did
those libraries
score?", because that can affect your main
software. Interesting example on Linux, if
you link a library that requires an
executable stack, guess what your software
now has an executable stack, even if you
didn't mark that. So we need to be able to
understand what ecosystem the software
is gonna live in. And the code structure
analyzers look at things like
functionality: "What's the software
doing?", "What type of app armoring is
getting injected into the code?". A great
example of that is something like stack
guards or fortify source. These are
armoring features that only really apply, and
can only be observed, inside of the control flow
or inside of the actual instructions
themselves. This is why control
flow graphs are key.
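Two of the format-level observables just mentioned - whether the binary asks for an executable stack, and which libraries it links (whose own scores then matter) - can be read straight out of an ELF file. A simplified pyelftools sketch, again not CITL's engine:

```python
from elftools.elf.elffile import ELFFile

def stack_and_dependencies(path):
    with open(path, "rb") as f:
        elf = ELFFile(f)
        # The PT_GNU_STACK program header carries the requested stack
        # permissions; the PF_X bit (0x1) means an executable stack.
        exec_stack = False
        for seg in elf.iter_segments():
            if seg["p_type"] == "PT_GNU_STACK":
                exec_stack = bool(seg["p_flags"] & 0x1)
        # DT_NEEDED entries name the directly linked libraries - the first
        # level of the dependency tree that would be scored recursively.
        needed = []
        dynamic = elf.get_section_by_name(".dynamic")
        if dynamic is not None:
            needed = [tag.needed for tag in dynamic.iter_tags()
                      if tag.entry.d_tag == "DT_NEEDED"]
        return {"executable_stack": exec_stack, "needed_libraries": needed}
```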
We played around with a number of
different ways of analyzing software that
we could scale out, and ultimately we
came down to working with control
flow graphs. Provided here is a basic
visualization of what I'm talking about
with a control flow graph, provided by
Binary Ninja, which has wonderful visualization
tools - hence this picture, and not our
engine, because we don't build very
many visualization engines. But you
basically have a function that's broken up
into basic blocks, which is broken up into
instructions, and then you have basic flow
between them. Having this as an iterable
structure that we can work with, allows us
to walk over that and walk every single
instruction, understand the references,
understand where code and data is being
referenced, and how it is being
referenced,
and then what type of functionality is
being used. So this is a great way to find
something like whether or not your stack
guards are being applied on every function
that needs them, how deep they are being
applied, and whether the compiler is possibly
introducing errors into your armoring
features - which are interesting side
studies. The other reason we did this is because
we want to push the concept of what types
of observables there are even farther. Take
this example: you want to be able to
build instruction abstractions. Say that
for all major architectures you can break
instructions up into major categories, be it
arithmetic instructions, data manipulation
instructions like loads and stores, and then
control flow instructions. Then, with these
basic fundamental building blocks you can
make artifacts. Think of them like a unit
of functionality: has some type of input,
some type of output, it provides some type
of operation on it. And then with these
little units of functionality, you can
link them together and think of these
artifacts as maybe sub-basic-block, or
crossing a few basic blocks - a
different way to break up the software.
Because a basic block is just a branch
break, but we want to look at
functionality breaks, because these
artifacts can provide the basic,
fundamental building blocks of the
software itself. This becomes more important when
we want to start doing symbolic lifting,
so that we can lift the entire piece of software up
into a generic representation that we can
slice and dice as needed.
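A toy version of the instruction-abstraction idea: bucket instructions into coarse categories and group runs of them into "artifacts". The mnemonic lists below are a tiny, x86-flavored illustration, not a complete classification and not CITL's:

```python
CATEGORIES = {
    "arithmetic": {"add", "sub", "mul", "imul", "xor", "and", "or", "shl", "shr"},
    "data":       {"mov", "movzx", "lea", "push", "pop", "load", "store"},
    "control":    {"jmp", "je", "jne", "call", "ret", "jg", "jl"},
}

def categorize(mnemonic):
    for category, mnemonics in CATEGORIES.items():
        if mnemonic in mnemonics:
            return category
    return "other"

def artifacts(instruction_stream):
    """Group a linear stream of mnemonics into (category, length) runs -
    little 'units of functionality' that may span or subdivide basic blocks."""
    runs = []
    for mnemonic in instruction_stream:
        category = categorize(mnemonic)
        if runs and runs[-1][0] == category:
            runs[-1] = (category, runs[-1][1] + 1)
        else:
            runs.append((category, 1))
    return runs

# Example: artifacts(["mov", "mov", "add", "xor", "jne"])
# -> [("data", 2), ("arithmetic", 2), ("control", 1)]
```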
Moving from there, I want to talk about
fuzzing a little bit more. Fuzzing is
effectively at the heart of our project.
It provides us the rich dataset that we
can use to derive a model. It also
provides us awesome other metadata on the
side. But why? Why do we care about
fuzzing? Why is fuzzing the metric that
you build an engine around, that you build a
model around, that you derive some type of reasoning
from? So, think of the set of bugs,
vulnerabilities, and exploitable
vulnerabilities. In an ideal world you'd
want to just have a machine that pulls out
exploitable vulnerabilities.
Unfortunately, this is exceedingly costly,
because of a series of decision problems that sit
between these sets. So now consider the
superset of bugs or faults. A fuzzer can
easily recognize, or other software can
easily recognize faults, but if you want
to move down the sets you unfortunately
need to jump through a lot of decision
hoops. For example, if you want to move to
a vulnerability you have to understand:
Does the attacker have some type of
control? Is there a trust boundary being
crossed? Is this software configured in
the right way for this to be vulnerable
right now? So there are human factors that
are not deducible from the outside. You
then amplify this decision problem even
worse going to exploitable
vulnerabilities. So if we collect the
superset of bugs, we know that there
is some proportion of those subsets in there.
And this provides us a dataset that is easily
recognizable and that we can collect in a cost-
efficient manner. Finally, fuzzing is key,
and we're investing a lot of our time
right now and working on a new fuzzing
engine, because there are some key things
we want to do.
We want to be able to understand all of
the different paths the software could be
taking, and as you're fuzzing you're
effectively driving the software down as
many unique paths while referencing as
many unique data manipulations as
possible. So if we save off every path,
annotate the ones that are faulting, we
now have this beautiful rich data set of
exactly where the software went as we were
driving it in specific ways. Then we feed
that back into our static analysis engine
and begin to generate, out of those traces,
those instruction abstractions,
those artifacts. And with that, imagine we
have these gigantic traces of instruction
abstractions. From there we can then begin
to train the model to explore around the
fault location and begin to understand and
try and study the fundamental building
blocks of what a bug looks like in an
abstract, instruction-agnostic way. This is
why we're spending a lot of time on our
fuzzing engine right now. But hopefully
soon we'll be able to talk about that more,
and maybe in a tech track and not the policy
track.
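To make the shape of that data concrete: a very small sketch of the "run the target, record a path identifier, annotate the faulting runs" loop. This is not CITL's fuzzer - the "path id" here is just a hash of the program's output, where a real engine would record coverage or an instruction trace:

```python
import subprocess, hashlib

def run_case(target, data, timeout=5):
    """Run `target` with `data` (bytes) on stdin; report whether it crashed
    (killed by a signal) and a crude stand-in for the path it took."""
    try:
        proc = subprocess.run([target], input=data, capture_output=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"crashed": False, "timed_out": True, "path_id": None}
    crashed = proc.returncode < 0          # negative return code = killed by signal
    path_id = hashlib.sha1(proc.stdout + proc.stderr).hexdigest()
    return {"crashed": crashed, "timed_out": False, "path_id": path_id}

def fuzz(target, corpus):
    """Collect (input, path, crashed) records - the raw material that would
    get fed back into the static-analysis side."""
    return [dict(run_case(target, data), input=data) for data in corpus]
```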
C: Yeah, so from then on when anything
went wrong with the computer we said it
had bugs in it. laughs All right, I
promised you a technical journey, I
promised you a technical journey into the
dark abyss of as deep as you want to get
with it. So let's go ahead and bring it
up. Let's wrap it up and bring it up a
little bit here. We've talked a great deal
today about some theory. We've talked
about development in our tooling and
everything else and so I figured I should
end with some things that are not in
progress but in fact are done and are
yesterday's news - just to go ahead and
share that here with Europe. So in
the midst of all of our development we
have been discovering and reporting bugs,
again, this is not our primary purpose, really.
But you know you can't help but do it. You
know how computers are these days. You
find bugs just for turning them on, right?
So we disclosed all of that a
little while ago. At DEFCON and Black Hat,
our chief scientist Sarah together with
Mudge went ahead and dropped this
bombshell on the Firefox team which is
that for some period of time they had ASLR
disabled on OS X. When we first found it
we assumed it was a bug in our tools. When
we first mentioned it in a talk, they came
to us and said it's definitely a bug in
your tools - or it might be - or expressed some level of
surprise. And then people started looking
into it, and in fact at one point it had
been enabled and then temporarily
disabled. No one knew; everyone thought it
was on. It takes someone looking to notice
that kind of stuff, right. Major shout out
though, they fixed it immediately despite
our full disclosure on stage and
everything. So very impressed, but in
addition to popping surprises on people
we've also been doing the usual process of
submitting patches and bugs, particularly
to LLVM and QEMU, and if you work in
software analysis you could probably guess
why.
Incidentally, if you're looking for a
target to fuzz, if you want to go home from
CCC and find a ton of findings:
LLVM comes with a bunch of parsers. You
should fuzz them, you should fuzz them, and
I say that because I know for a fact you
are gonna get a bunch of findings, and it'd
be really nice - I would appreciate it - if I
didn't have to pay people to fix them.
So if you wouldn't mind disclosing them,
that would help. But besides these bug
reports and all these other things we've
also been working with lots of others. You
know we gave a talk earlier this summer,
Sarah gave a talk earlier this summer,
about these things and she presented
findings on comparing some of these base
scores of different Linux distributions.
And based on those findings there was a
person on the Fedora Red Team, Jason
Calloway, who sat there and, well, I can't
read his mind but I'm sure that he was
thinking to himself: golly it would be
nice to not, you know, be surprised at the
next one of these talks. They score very
well by the way. They were leading in
many, many of our metrics. Well, in any
case, he left Vegas and he went back home
and he and his colleagues have been
working on essentially re-implementing
much of our tooling so that they can check
the stuff that we check before they
release. Before they release. Looking for
security before you release. So that would
be a good thing for others to do and I'm
hoping that that idea really catches on.
laughs Yeah, yeah right, that would be
nice. That would be nice.
But in addition to that, in addition to
that our mission really is to get results
out to the public and so in order to
achieve that, we have broad partnerships
with Consumer Reports and the digital
standard. Especially if you're into cyber
policy, I really encourage you to take a
look at the proposed digital standard,
which encompasses the things we
look for and so much more: URLs,
data traffic in motion, cryptography,
update mechanisms, and all that good stuff.
So, where we are and where we're going -
the big takeaways here, if you're
looking for the "so what" - three points
for you. One: we are building the tooling
necessary to do larger and larger and
larger studies regarding these surrogate
security scores. My hope is that in some
period of the not-too-distant future, I
would like to be able to, with my
colleagues, publish some really nice
findings about what are the things that
you can observe in software, which have a
suspiciously high correlation with the
software being good. Right, nobody really
knows right now. It's an empirical
question. As far as I know, the study
hasn't been done. We've been running it on
the small scale. We're building the
tooling to do it on a much larger scale.
We are hoping that this winds up being a
useful field in security as that
technology develops. In the meantime our
static analyzers are already making
surprising discoveries: hit YouTube and
take a look for Sarah Zatko's recent talks
at DEFCON/Black Hat. Lots of fun findings
in there. Lots of things that anyone who
looked would have found. Lots of that.
And then lastly, if you were in the
business of shipping software and you are
thinking to yourself.. okay so these guys,
someone gave them some money to mess up my
day and you're wondering: what can I do to
not have my day messed up? One simple
piece of advice, one simple piece of
advice: make sure your software employs
every exploit mitigation technique Mudge
has ever heard of or will ever hear of. And he's
heard of a lot of them. You know
all of that - turn all those things
on. And if you don't know anything about
that stuff, if nobody on your team knows
anything about that stuff... I don't
even know why I'm saying this: if you're here, you
know about that stuff, so do that. If
you're not here, then you should be here.
Thank you, thank you.
Herald Angel: Thank you, Tim and Parker.
Do we have any questions from the
audience? It's really hard to see you with
that bright light in my face. I think the
signal angel has a question. Signal Angel:
So the IRC channel was impressed by your
tools and your models that you wrote. And
they are wondering what's going to happen
to that, because you do have funding from
the Ford Foundation now, and so what are
your plans with this? Do you plan on
commercializing this or is it going to be
open source or how do we get our hands on
this?
C: It's an excellent question. So for the
time being the money that we are receiving
is to develop the tooling, pay for the AWS
instances, pay for the engineers and all
that stuff. As for the direction that we, as an
organization, would like to take
things: I have no interest in running a
monopoly. That sounds like a fantastic
amount of work and I really don't want to
do it. However, I have a great deal of
interest in taking the gains that we are
making in the technology and releasing the
data so that other competent researchers
can go through and find useful things that
we may not have noticed ourselves. So
we're not at a point where we are
releasing data in bulk just yet, but that
is simply a matter of engineering - our
tools are still in flux.
When we do that, we want to make sure the
data is correct and so our software has to
have its own low bug counts and all these
other things. But ultimately there is the
scientific aspect of our mission - though
the science is not our primary mission;
our primary mission is to apply it to help
consumers. At the same time, it is our
belief that an opaque model is as good as
crap. No one should trust an opaque model:
if somebody is telling you that they have
some statistics and they do not provide
you with any underlying data and it is not
reproducible you should ignore them.
Consequently what we are working towards
right now is getting to a point where we
will be able to share all of those
findings. The surrogate scores, the
interesting correlations between
observables and fuzzing. All that will be
public as the material comes online.
Signal Angel: Thank you.
C: Thank you.
Herald Angel: Thank you. And microphone
number three please.
Mic3: Hi, thanks - some really
interesting work you presented here. So
there's something I'm not sure I
understand about the approach that you're
taking. If you are evaluating the security
of say a library function or the
implementation of a network protocol for
example you know there'd be a precise
specification you could check that
against. And the techniques you're using
would make sense to me. But the goal that
you've set for yourself is to evaluate the
security of consumer software, and it's not
clear to me whether it's fair to call
these results security scores in the
absence of a threat model. So my
question is, you know, how is it
meaningful to make a claim that a piece of
software is secure if you don't have a
threat model for it?
C: This is an excellent question, and
anyone who disagrees is
wrong. Security without a threat model is
not security at all. It's absolutely a
true point. So the things that we are
looking for, most of them are things that
you will already find present in your
threat model. And so for example we were
reporting on the presence of things like
ASLR and lots of other things that get to
the heart of exploitability of a piece of
software. So for example if we are
reviewing a piece of software, that has no
attack surface
then it is canonically not in the threat
model and in that sense it makes no sense
to report on its overall security. On the
other hand, if we're talking about
software like say a word processor, a
browser, anything on your phone, anything
that talks on the network, we're talking
about those kinds of applications then I
would argue that exploit mitigations and
the other things that we are measuring are
almost certainly very relevant. So there's
a sense in which what we are measuring is
the lowest common denominator among what
we imagine are the dominant threat models
for the applications. A hand-wavy
answer, but I promised heuristics, so there
you go.
Mic3: Thanks.
C: Thank you.
Herald Angel: Any questions? No raised
hands, okay. Then the herald can ask a
question, because I rarely get to. So the
question is: you mentioned earlier these
security labels - what
institution, for example, could give out the security
labels? Because obviously the vendor
has no interest in IT security.
C: Yes it's a very good question. So our
partnership with Consumer Reports. I don't
know if you're familiar with them, but in
the United States Consumer Reports is a
major consumer watchdog organization.
They test the safety of automobiles, they
test you know lots of consumer appliances.
All kinds of things both to see if they
function more or less as advertised but
most importantly they're checking for
quality, reliability and safety. So our
partnership with Consumer Reports is all
about us doing our work and then
publishing that. And so for example the
televisions that we presented the data on
all of that was collected and published in
partnership with Consumer Reports.
Herald: Thank you.
C: Thank you.
Herald: Any other questions from the stream? I
hear a no. Well, in this case, people, thank
you.
Thank Tim and Parker for their nice talk
and please give them a very, very warm
round of applause.
applause
C: Thank you. T: Thank you.
subtitles created by c3subtitles.de
in the year 2017. Join, and help us!