Music
Herald: The next talk is about how risky
the software you use is. You may have heard
about Trump versus a Russian security
company. We won't judge this, we won't
comment on it, but we dislike the
prejudgments in this case. Tim Carstens
and Parker Thompson will tell you a little
bit more about how risky the software you
use is. Tim Carstens is CITL's Acting
Director and Parker Thompson is CITL's
lead engineer. Please welcome with a very,
very warm applause: Tim and Parker!
Thanks.
Applause
Tim Carstens: Howdy, howdy. So my name is
Tim Carstens. I'm the acting director of
the Cyber Independent Testing Lab. It's
four words there; we'll talk about all four
today, especially cyber. With me today is
our lead engineer Parker Thompson. Not on
stage are our other collaborators: Patrick
Stach, Sarah Zatko, and, present in the
room but not on stage, Mudge. So today
we're going to be talking about our work.
The lead-in, the introduction that was
given, was phrased in terms of Kaspersky and
all of that; I'm not gonna be speaking
about Kaspersky and I guarantee you I'm
not gonna be speaking about my president.
Right, yeah? Okay. Thank you.
Applause
All right, so why don't we go ahead and
kick off. I'll mention now: parts of this
presentation are going to be quite
technical. Not most of it, and I will
always include analogies and all these
other things if you are here in security
but are not a bit-twiddler. But if you do want
to be able to review some of the technical
material - if I go through it too fast, if
you like to read, if you're a mathematician
or a computer scientist - our slides
are already available for download at this
site here. We thank our pals, our partners,
for getting that set up for
us. Let's get started on the real
material here. Alright, so we are CITL: a
nonprofit organization based in the United
States founded by our chief scientist
Sarah Zatko and our board chair Mudge. And
our mission is a public good mission - we
are hackers but our mission here is
actually to look out for people who do not
know very much about machines
or as much as the other hackers do.
Specifically, we seek to improve the state
of software security by providing the
public with accurate reporting on the
security of popular software, right? And
so that was a mouthful for you. But no
doubt, no doubt, every single one of you
has received questions of the form: what
do I run on my phone, what do I do with
this, what do I do with that, how do I
protect myself - all these other things
lots of people in the general public
looking for agency in computing. No one's
offering it to them, and so we're trying
to go ahead and provide a forcing function
on the software field in order to, you
know, again be able to enable consumers
and users and all these things. Our social
good work is funded largely by charitable
monies from the Ford Foundation whom we
thank a great deal, but we also have major
partnerships with Consumer Reports, which
is a major organization in the United
States that generally, broadly, looks at
consumer goods for safety and performance.
We are also partners in The Digital
Standard, which probably would be of great
interest to many people here at Congress
as it is a holistic standard for
protecting user rights. We'll talk about
some of the work that goes into those
things here in a bit, but first I want to
give the big picture of what it is we're
really trying to do in one short
little sentence. Something like this but
for security, right? What are the
important facts, how does it rate, you
know, is it easy to consume, is it easy to
go ahead and look and say: this thing is
good, this thing is not good. Something
like this, but for software security.
Sounds hard doesn't it? So I want to talk
a little bit about what I mean by
something like this.
There are lots of consumer advocacy and
watchdog and protection groups - some
private, some government - which are
looking to do this for various things that
are not software security. And you can
see some examples here that are big in the
United States - I happen to not like these
as much as some of the newer consumer
labels coming out from the EU. But
nonetheless they are examples of the kinds
of things people have done in other
fields, fields that are not security to
try to achieve that same end. And when
these things work well, it is for three
reasons: One, it has to contain the
relevant information. Two: it has to be
based in fact, we're not talking opinions,
this is not a book club or something like
that. And three: it has to be actionable,
it has to be actionable - you have to be
able to know how to make a decision based
on it. How do you do that for software
security? How do you do that for
software security? So the rest of the talk
is going to go in three parts.
First, we're going to give a bit of an
overview of more of the consumer-facing
side of things that we do: look at
some data that we have reported on earlier,
and all these other kinds of good things.
We're then going to go ahead and get
terrifyingly, terrifyingly technical. And
then after that we'll talk about tools to
actually implement all this stuff. The
technical part comes before the tools - so
that tells you how terrifyingly
technical we're gonna get. It's gonna be
fun, right? So how do you do this for
software security: a consumer version. So,
if you set forth to the task of trying to
measure software security, right, many
people here probably do work in the
security field perhaps as consultants
doing reviews; certainly I used to. Then
probably what you're thinking to yourself
right now is that there are lots and lots
and lots and lots of things that affect
the security of a piece of software. Some
of which are, mmm, you're only gonna see
them if you go reversing. And some of
which are just you know kicking around on
the ground waiting for you to notice,
right. So we're going to talk about both
of those kinds of things that you might
measure. But here you see these giant
charts. On the left we have Microsoft Excel
on OS X; on the right, Google Chrome for OS
X. This is a couple years old at this point,
maybe one and a half years old. I'm not
expecting you to be able to
read these - the real point is to say: look
at all of the different things you can
measure very easily.
How do you distill it, how do you boil it
down, right? So this is the opposite of
a good consumer safety label. This is,
um, if you've ever done any consulting, the
kind of report you hand a client to
tell them how good their software is,
right? It's the opposite of consumer
grade. But the reason I'm showing it here
is because, you know, I'm gonna call out
some things, and maybe you can't process
all of this because it's too much
material, you know. But I'm gonna call out
some things, and once I call them out, just
like an NP problem, you're gonna recognize them
instantly. So for example, Excel, at the
time of this review - look at this column
of dots. What are these dots telling you?
They're telling you: look at all these
libraries - all of them are 32-bit only.
Not 64 bits, not 64 bits. Take a look at
Chrome - exact opposite, exact opposite:
64-bit binary, right? What are some other
things? Excel, again, on OS X: maybe you can
see these danger warning signs that go
straight up the whole thing.
That's the absence of major heap
protection flags in the binary headers.
We'll talk about what that means
exactly in a bit. But also, if you hop over
here, you'll see - yeah, yeah, yeah -
Chrome has all the different heap
protections that a binary might enable, on
OS X that is, but it also has more dots in
this column here off to the right. And
what do those dots represent?
Those dots represent functions - functions
that historically have been a source of
trouble, functions that are
very hard to call correctly. If you're a
C programmer, the "gets" function is a good
example, but there are lots of them. And
you can see here that Chrome doesn't mind:
it uses them all a bunch. And Excel, not so
much. And if you know the history of
Microsoft and the trusted computing
initiative and the SDL and all of that, you
will know that a very long time ago
Microsoft made a decision and said:
we're gonna start purging some of these
risky functions from our code bases,
because we think it's easier to ban them
than to teach our devs to use them correctly.
And you see that reverberating out in
their software. Google, on the other hand,
says: yeah, yeah, yeah, those functions can be
dangerous to use, but if you know how to
use them they can be very good, and so
they're permitted. The point all of this
is building to is that if you start by
just measuring every little thing that
like your static analyzers can detect in a
piece of software, two things happen. One: you
wind up with way more data than you can
show in a slide. And two: the engineering
process, the software development life
cycle that went into the software, will
leave behind artifacts that tell you
something about the decisions that went
into designing that engineering process.
And so, you know, Google, for example, is
quite rigorous as far as hitting, you know,
"GCC, dash, enable all of the
compiler protections." Microsoft may be
less good at that, but much more rigorous in
things that were very popular ideas when
they introduced trusted computing,
alright. So the big takeaway from this
material is that again the software
engineering process results in artifacts
in the software that people can find.
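As an illustration of how such an artifact can be detected automatically - this is a minimal sketch using the pyelftools library, not CITL's actual tooling, and the banned-function list here is a tiny made-up subset - scanning an ELF binary's symbol tables for historically risky libc imports might look like this:

```python
from elftools.elf.elffile import ELFFile
from elftools.elf.sections import SymbolTableSection

# Illustrative subset only; real banned-function lists are much longer.
RISKY_FUNCTIONS = {"gets", "strcpy", "strcat", "sprintf", "vsprintf", "scanf"}

def risky_imports(path):
    """Return the risky libc functions referenced in the binary's symbol tables."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        found = set()
        for section in elf.iter_sections():
            if isinstance(section, SymbolTableSection):  # .symtab / .dynsym
                found.update(sym.name for sym in section.iter_symbols()
                             if sym.name in RISKY_FUNCTIONS)
        return found

# Example: risky_imports("/bin/ls") -> set of risky function names it references
```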
Alright. Okay, so that's a whole
bunch of data; certainly it's not a
consumer-friendly label. So how do you
start to get in towards the consumer zone?
Well, the main defect of the big reports
that we just saw is that it's too much
information. It's very dense on data, but
it's very hard to distill it to the "so
what" of it, right?
And so this here is one of our earlier
attempts to go ahead and do that
distillation. What are these charts? How
did we come up with them? Well, on the
previous slide we saw all these
different factors that you can analyze in
software; basically, here's how
we arrive at this. For each of those
things: pick a weight. Go ahead and
compute a score, average against the
weights: tada, now you have some number.
You can do that for each of the libraries
in the piece of software. And if you do
that for each of the libraries in the
software, you can then go ahead and produce
these histograms to show, you know:
this percentage of the DLLs had a score in
this range - boom, there's a bar, right.
How do you pick those weights? We'll talk
about that in a sec - it's very technical.
But the takeaway, though, is
that you wind up with these charts. Now
I've obscured the labels, I've obscured
the labels and the reason I've done that
is because I don't really care that much
about the actual counts. I want to talk
about the shapes, the shapes of these
charts: it's a qualitative thing.
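As a rough sketch of the "pick a weight, compute a score, build a histogram" step just described - the observable names, weights, and bucket count here are hypothetical, not CITL's actual model - the computation looks something like this:

```python
from collections import Counter

# Hypothetical per-observable weights; each observable score is assumed to be in [0, 1].
WEIGHTS = {"aslr": 3.0, "dep": 2.0, "relro": 1.5,
           "stack_guard": 2.5, "avoids_risky_functions": 1.0}

def library_score(observables):
    """Weighted average of a library's observable scores."""
    total_weight = sum(WEIGHTS.values())
    return sum(w * observables.get(name, 0.0) for name, w in WEIGHTS.items()) / total_weight

def score_histogram(libraries, bins=10):
    """One bucket count per score range; each bar of the charts is one bucket."""
    counts = Counter()
    for observables in libraries:
        bucket = min(int(library_score(observables) * bins), bins - 1)
        counts[bucket] += 1
    return [counts[b] for b in range(bins)]
```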
So here: good scores appear on the right,
bad scores appear on the left. The
histogram measures all the libraries and
components, and so a very secure piece of
software in this model manifests as a tall
bar far to the right. And you can see a
clear example in our custom Gentoo
build. Anyone here who is a Gentoo fan knows:
hey, I'm going to install this thing, I
think I'm going to go ahead and turn on
every single one of those flags - and lo
and behold, if you do that, yeah, you wind up
with a tall bar far to the right. Here is
Ubuntu 16 - I bet it's 16.04 but I don't
recall exactly, 16 LTS. Here you see a lot
of tall bars to the right - not quite as
consolidated as a custom Gentoo build, but
that makes sense, doesn't it, right? Because,
you know, you don't compile your whole
Ubuntu build yourself. Now I want to contrast. I
want to contrast. So over here on the
right we see in the same model, an
analysis of the firmware obtained from two
smart televisions. Last year's models from
Samsung and LG. And here are the model
numbers. We did this work in concert with
Consumer Reports. And what do you notice
about these histograms, right. Are the
bars tall and to the right? No, they look
almost normal, not quite, but that doesn't
really matter. The main thing that matters
is that this is the shape you would expect
to get if you were playing a random game
basically to decide what security features
to enable in your software. This is the
shape of not having a security program, is
my bet. That's my bet. And so what do you
see? You see heavy concentration here in
the middle, right, that seems fair, and
like it tails off. On the Samsung nothing
scored all that great, same on the LG.
Both of them are you know running their
respective operating systems and they're
basically just inheriting whatever
security came from whatever open source
thing they forked, right.
So this is the kind of message,
this right here is the kind of thing that
we exist for. This is us
producing charts showing that the current
practices in the not-so consumer-friendly
space of running your own Linux distros
far exceed the products being delivered,
certainly in this case in the smart TV
market. But I think you might agree with
me, it's much worse than this. So let's
dig into that a little bit more, I have a
different point that I want to make about
that same data set. So this table here
is again looking at the LG,
Samsung, and Gentoo Linux installations.
And on this table we're just pulling out
some of the easy to identify security
features you might enable in a binary
right. So percentage of binaries with
address space layout randomization, right?
Let's talk about that on our Gentoo build
it's over 99%. That also holds for the
Amazon Linux AMI - it holds in Ubuntu.
ASLR is incredibly common in modern Linux.
And despite that, fewer than 70 percent of
the binaries on the LG television had it
enabled. Fewer than 70 percent. And the
Samsung was doing, you know, better than
that, I guess, but 80 percent is pretty
disappointing when a default
install of a mainstream Linux distro
is going to get you 99, right? And it only
gets worse, it only gets worse right you
know?
RELRO support: if you don't know what that
is, that's okay, but if you do, look at this
abysmal coverage
coming out of these IoT devices. Very sad.
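For concreteness, checking two of the observables in this table - ASLR support (a position-independent executable) and RELRO - for a single ELF binary can be done roughly like this. This is a simplified sketch using the pyelftools library, not CITL's tooling:

```python
from elftools.elf.elffile import ELFFile

def aslr_and_relro(path):
    with open(path, "rb") as f:
        elf = ELFFile(f)
        # ET_DYN means the image is relocatable (PIE executable or shared
        # library), which is what lets ASLR randomize its load address.
        pie = elf.header["e_type"] == "ET_DYN"
        # A PT_GNU_RELRO segment marks relocation data to be mapped read-only.
        relro = any(seg["p_type"] == "PT_GNU_RELRO" for seg in elf.iter_segments())
        # Full RELRO additionally binds all symbols at load time; a real check
        # would also inspect DT_FLAGS / DT_FLAGS_1 for the BIND_NOW flag.
        dynamic = elf.get_section_by_name(".dynamic")
        bind_now = dynamic is not None and any(
            tag.entry.d_tag == "DT_BIND_NOW" for tag in dynamic.iter_tags())
        return {"pie": pie, "relro": relro, "full_relro_hint": relro and bind_now}
```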
And you see it over and over and over
again. I'm showing this because some
people in this room or watching this video
ship software - and I have a message, I
have a message to those people who ship
software who aren't working on say Chrome
or any of the other big-name Pwn2Own kinds
of targets. Look at this: you can be
leading the pack by mastering the
fundamentals. You can be leading the pack
by mastering the fundamentals. This is a
point that we, as a security field,
really need to be driving home. You know,
one of the things that we're seeing here
in our data is that if you're the vendor
who is shipping the product everyone has
heard of in the security field, then maybe
your game is pretty decent, right? - if
you're shipping, say, Windows, or if you're
shipping Firefox or whatever. But if
you're doing one of these things
where people are just kind of beating you
up for default passwords, then your
problems go way further than just default
passwords, right? Like the house: the
house is messy, it needs to be cleaned,
needs to be cleaned. So the rest of the
talk, like I said, we're going to be
discussing a lot of other things that
amount to getting, you know, a peek behind
the curtain - where some of these things
come from - and getting very specific about
how this business works. But if you're
interested in more of the high-level
material - especially if you're interested
in interesting results and insights, some
of which I'm going to have here later -
I really encourage you to take a
look at the talk from this past summer by
our chief scientist Sarah Zatko, which is
predominantly on the topic of surprising
results in the data.
Today, though, this being our first time
presenting here in Europe, we figured we
would take more of an overarching kind of
view. What we're doing and why we're
excited about it and where it's headed. So
we're about to move into a little bit of
the underlying theory, you know. Why do I
think it's reasonable to even try to
measure the security of software from a
technical perspective. But before we can
get into that I need to talk a little bit
about our goals, so that the decisions and
the theory - the motivation - are clear,
right. Our goals are really simple; it's a
very easy organization to run because of
that. Goal number one: remain independent
of vendor influence. We are not the first
organization to purport to be looking out
for the consumer. But unlike many of our
predecessors, we are not taking money from
the people we review, right? Seems like
some basic stuff. Seems like some basic
stuff right? Thank you, okay.
Two: automated, comparable, quantitative
analysis. Why automated? Well, we need our
test results to be reproducible. "Tim
goes in, opens up your software in IDA, and
finds a bunch of stuff that appalls
him" - that's not a very repeatable kind
of a standard for things. And so we're
interested in things which are automated.
We'll talk about that - maybe a few hackers in
here know how hard that is. And then, last, we are,
well, acting as a watchdog:
protecting the interests of the user, the
consumer, however you would like to look
at it. But we also have three non-goals,
three non-goals that are equally
important. One: we have a non-goal of
finding and disclosing vulnerabilities. I
reserve the right to find and disclose
vulnerabilities. But that's not my goal,
it's not my goal. Another non-goal is to
tell software vendors what to do. If a
vendor asks me how to remediate their
terrible score, I will tell them what we
are measuring but I'm not there to help
them remediate. It's on them to be able to
ship a secure product without me holding
their hand. We'll see. And then three:
non-goal, perform free security testing
for vendors. Our testing happens after you
release. Because when you release your
software you are telling people it is
ready to be used. Is it really though, is
it really though, right?
Applause
Yeah, thank you. Yeah, so we are not there
to give you a preview of what your score
will be. There is no sum of money you can
hand me that will get you an early preview
of what your score is - you can try me,
you can try me: there's a fee for trying
me. There's a fee for trying me. But I'm
not gonna look at your stuff until I'm
ready to drop it, right. Yeah - you're welcome - yeah.
All right. So moving into this theory
territory. Three big questions, three big
questions that need to be addressed if you
want to do our work efficiently. One: what
works, what works for improving security -
what are the things that you need, or
really want, to see in software. Two: how
do you recognize when it's being done?
It's no good if someone hands you a piece
of software and says, "I've done all the
latest things" and it's a complete black
box. If you can't check the claim, the
claim is as good as false, in practical
terms, period, right. Software has to be
reviewable, or, a priori, I'll think you're
full of it. And then three: who's doing it
- of all the things that work, that you
can recognize, who's actually doing it.
You know, let's go ahead - our field is
famous for ruining people's holidays and
weekends over Friday bug disclosures, you
know New Year's Eve bug disclosures. I
would like us to also be famous for
calling out those teams and those software
organizations which are being as good as
the bad guys are being bad, yeah? So,
give people an incentive to maybe be
happy to see us for a change, right. Okay,
so thank you. Yeah, all right. So how do
we actually pull these things off; the
basic idea. So, I'm going to get into some
deeper theory: if you're not a theorist I
want you to focus on this slide.
And I'm gonna bring it back, it's not all
theory from here on out after this but if
you're not a theorist I really want you to
focus on this slide. The basic motivation,
the basic motivation behind what we're
doing; the technical motivation - why we
think that it's possible to measure and
report on security. It all boils down to
this right. So we start with a thought
experiment, a Gedankenexperiment, right? Given a
piece of software we can ask: overall, how
secure is it? Kind of a vague question but
you could imagine you know there's
versions of that question. And two: what
are its vulnerabilities. Maybe you want to
nitpick with me about what the word
vulnerability means but broadly you know
this is a much more specific question
right. And here's the enticing
thing: the first question appears to ask
for less information than the second
question. And maybe if we were taking bets
I would put my money on, yes, it actually
does ask for less information. What do I
mean by that what do I mean by that? Well,
let's say that someone told you all of the
vulnerabilities in a system right? They
said, "Hey, I got them all", right? You're
like all right that's cool, that's cool.
And if someone asks you hey how secure is
this system you can give them a very
precise answer. You can say it has N
vulnerabilities, and they're of this kind
and like all this stuff right so certainly
the second question is enough to answer
the first. But, is the reverse true?
Namely, if someone were to tell you, for
example, "hey, this piece of software has
exactly 32 vulnerabilities in it." Does
that make it easier to find any of them?
Right, there's room to maybe do that
using some algorithms that are not yet in
existence.
Certainly the computer scientists in here
are saying, "well, you know, yeah maybe
counting the number of SAT solutions
doesn't help you practically find
solutions. But it might and we just don't
know." Okay fine fine fine. Maybe these
things are the same, but my experience
in security, and the experience of many
others perhaps is that they probably
aren't the same question. And this
motivates what I'm calling here Zatko's
question, which is basically asking for an
algorithm that demonstrates that the first
question is easier than the second
question, right. So, Zatko's question:
develop a heuristic which can
efficiently answer question one, but not
necessarily question two. If you're looking for a
metaphor, if you want to know why I care
about this distinction, I want you to
think about some certain controversial
technologies: maybe think about say
nuclear technology, right. An algorithm
that answers one, but not two, is a very
safe algorithm to publish. A very safe
algorithm to publish indeed. Okay, Claude
Shannon would like more information - happy
to oblige. Let's take a look at this
question from a different perspective
maybe a more hands-on perspective: the
hacker perspective, right? If you're a
hacker and you're watching me up here and
I'm waving my hands around and I'm showing
you charts maybe you're thinking to
yourself yeah boy, what do you got? Right,
how does this actually go. And maybe what
you're thinking to yourself is that, you
know, finding good vulns: that's an
artisan craft, right? You're in IDA, you
know, you're reversing all day, you're doing
all these things - I don't
know, all that stuff. And like, you know,
this kind of clever game; cleverness is
not like this thing that feels very
automatable. But you know on the other
hand there are a lot of tools that do
automate things and so it's not completely
not automatable.
And if you're into fuzzing then perhaps
you are aware of this very simple
observation, which is that if your harness
is perfect, if you really know what you're
doing, if you have a decent fuzzer, then in
principle fuzzing can find every single
problem. You have to be able to look for
it, you have to be able to harness for it, but
in principle it will, right. So the hacker
perspective on Zatko's question is maybe
of two minds: on the one hand, assessing
security is a game of cleverness; but on
the other hand, we're kind of right now at
the cusp of having some game-changing tech
really go. Maybe you're saying
fuzzing is not at the cusp - I promise, it's
just at the cusp. We haven't seen all that
fuzzing has to offer, right, and so maybe
there's room maybe there's room for some
automation to be possible in pursuit of
Zatko's question. Of course, there are
many challenges still in, you know, using
existing hacker technology. Mostly of the
form of various open questions. For
example if you're into fuzzing, you know,
hey: identifying unique crashes. There's
an open question. We'll talk about some of
those, we'll talk about some of those. But
I'm going to offer another perspective
here: so maybe you're not in the business
of doing software reviews but you know a
little computer science. And maybe that
computer science has you wondering what's
this guy talking about, right. I'm here to
acknowledge that. So whatever you think
the word security means: I've got a list
of questions up here. Whatever you think
the word security means, probably, some of
these questions are relevant to your
definition. Right.
Does the software have a hidden backdoor
or any kind of hidden functionality, does
it handle crypto material correctly, etc,
so forth. Anyone in here who knows some
computability theory knows that
every single one of these questions, and
many others like them, is undecidable due
to reasons essentially no different than
the reason the halting problem is
undecidable - which is to say, due to
reasons essentially first identified and
studied by Alan Turing a long time before
we had microarchitectures and all these
other things. And so the computability
perspective says that, you know, whatever
your definition of security is, ultimately
you have this recognizability problem - a
fancy way of saying that algorithms won't
be able to recognize secure software,
because of these undecidability
issues. The takeaway, the takeaway, is that
the computability angle on all of this
says: anyone who's in the business that
we're in has to use heuristics. You have
to, you have to.
All right, this guy gets it. All right, so
on the tech side our last technical
perspective that we're going to take now
is certainly the most abstract which is
the Bayesian perspective, right. So if
you're a frequentist, you need to get with
the times, you know - it's all
Bayesian now. So, let's talk about this
for a bit. Only two slides of math, I
promise, only two! So, let's say that I
have some corpus of software. Perhaps it's
a collection of all modern browsers,
perhaps it's the collection of all the
packages in the Debian repository, perhaps
it's everything on github that builds on
this system, perhaps it's a hard drive
full of warez that some guy mailed you,
right? You have some corpus of software
and for a random program in that corpus we
can consider this probability: the
probability distribution of which software
is secure versus which is not. For reasons
described in the computability
perspective, this number is not a
computable number for any reasonable
definition of security. So that's neat,
and so, in practical terms, if you want
to do some probabilistic reasoning, you
need some surrogate for it, and so we
consider this here. So, instead of
considering the probability that a piece
of software is secure, a non computable
non verifiable claim, we take a look here
at this indexed collection of
probabilities. This is an infinite
countable family of probability
distributions, basically P sub h,k is just
the probability that for a random piece of
software in the corpus, h work units of
fuzzing will find no more than k unique
crashes, right. And why is this relevant?
Well, at the bottom we have this analytic
observation, which is that in the limit as
h goes to infinity you're basically
saying: "Hey, you know, if I fuzz this
thing for infinity times, you know, what's
that look like?" And, essentially, here we
have analytically that this should
converge. The P sub h,1 should converge to
the probability that a piece of software
just simply cannot be made to crash. Not
the same thing as being secure, but
certainly not a small concern relevant to
security. So, none of that stuff actually
was Bayesian yet, so we need to get there.
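Written out - this is a reconstruction from the spoken description above, not the slide itself - the family of distributions is:

```latex
% For a random program s drawn from the corpus S:
P_{h,k} \;=\; \Pr_{s \in S}\!\left[\, h \text{ work units of fuzzing } s \text{ find at most } k \text{ unique crashes} \,\right]

% The analytic observation: with an unbounded fuzzing budget and a threshold of
% one unique crash, this converges to the probability that s cannot be made to crash.
\lim_{h \to \infty} P_{h,1} \;=\; \Pr_{s \in S}\!\left[\, s \text{ cannot be made to crash} \,\right]
```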
And so here we go, right: so, the previous
slide described a probability distribution
measured based on fuzzing. But fuzzing is
expensive and it is also not an answer to
Zatko's question because it finds
vulnerabilities, it doesn't measure
security in the general sense and so
here's where we make the jump to
conditional probabilities. Let M be some
observable property of software: has ASLR,
has RELRO, calls these functions, doesn't
call those functions... take your pick.
For random s in S we now consider these
conditional probability distributions. This
is the same kind of probability as we
had on the previous slide, but conditioned
on this observable being true, and this
leads to the refined, CITL
variant of Zatko's question:
which observable properties of software
satisfy that, when the software has
property M, the probability of fuzzing
being hard is very high? That's what this
version of the question asks, and here
we want that to hold even for large log(h)/k - in
other words, with exponentially more fuzzing
than you would expect to need to find bugs. So this is
the technical version of what we're after.
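In the same notation (again a reconstruction of the spoken description), the conditional version and the criterion are:

```latex
% Condition on an observable property M (has ASLR, has RELRO, calls or avoids
% certain functions, ...):
P_{h,k \mid M} \;=\; \Pr_{s \in S}\!\left[\, h \text{ work units of fuzzing } s \text{ find at most } k \text{ unique crashes} \;\middle|\; s \text{ has property } M \,\right]

% The CITL variant of Zatko's question: find observables M such that
% P_{h,k | M} remains close to 1 even when \log(h)/k is large, i.e. with
% exponentially more fuzzing than one would expect to need to find bugs.
```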
All of this can be explored, you can
brute-force your way to finding all of
this stuff, and that's exactly what we're
doing. So we're looking for all kinds of
things, we're looking for all kinds of
things that correlate with fuzzing having
low yield on a piece of software, and
there's a lot of ways in which that can
happen. It could be that you are looking
at a feature of software that literally
prevents crashes. Maybe it's the never
crash flag, I don't know. But most of the
things I've talked about - ASLR, RELRO, etc. -
don't prevent crashes. In fact, ASLR can
take non-crashing programs and make them
crash. It's the number one reason
vendors don't enable it, right? So why am
I talking about ASLR? Why am I talking
about RELRO? Why am I talking about all
these things that have nothing to do with
stopping crashes and I'm claiming I'm
measuring crashes? This is because, in the
Bayesian perspective, correlation is not
the same thing as causation, right?
Correlation is not the same thing as
causation. It could be that M's presence
literally prevents crashes, but it could
also be that, by some underlying
coincidence, the things we're looking for
are mostly only found in software that's
robust against crashing.
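On the empirical side, the "brute-force" search for such correlations boils down to tabulating fuzzing outcomes against observables. A toy sketch - the record format and crash threshold are hypothetical, not CITL's model:

```python
def low_yield_rate_by_observable(corpus, max_crashes=1):
    """corpus: one record per program, e.g.
         {"observables": {"aslr": True, "relro": False}, "unique_crashes": 3}
       where unique_crashes was measured under some fixed fuzzing budget h.
       Returns, per observable, the fraction of programs carrying it whose
       fuzzing run stayed at or below `max_crashes` unique crashes - an
       empirical stand-in for the conditional probability P_{h,k|M}."""
    tallies = {}  # observable name -> (low_yield_count, total_count)
    for record in corpus:
        low_yield = record["unique_crashes"] <= max_crashes
        for name, present in record["observables"].items():
            if present:
                hits, total = tallies.get(name, (0, 0))
                tallies[name] = (hits + int(low_yield), total + 1)
    return {name: hits / total for name, (hits, total) in tallies.items()}
```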
If you're looking for security, I submit
to you that the difference doesn't matter.
Okay, end of my math - thank you. I will now go
ahead and do this really nice analogy
for all those things that I just described,
right. So we're looking for indicators of
a piece of software being secure enough to
be good for consumers, right. So here's an
analogy. Let's say you're a geologist, you
study minerals and all of that and you're
looking for diamonds. Who isn't, right?
Want those diamonds! And like how do you
find diamonds? Even in places that are
rich in diamonds, diamonds are not common.
You don't just go walking around in your
boots, kicking until your toe stubs on a
diamond. You don't do that. Instead, you
look for other minerals that are mostly
only found near diamonds but are much more
abundant in those locations than the
diamonds. So, this is mineral science 101,
I guess, I don't know. So, for example,
you want to go find diamonds: put on your
boots and go kicking until you find some
chromite, look for some diopside, you
know, look for some garnet. None of these
things turn into diamonds, none of these
things cause diamonds but if you're
finding good concentrations of these
things, then, statistically, there's
probably diamonds nearby. That's what
we're doing. We're not looking for the
things that cause good security per se.
Rather, we're looking for the indicators
that you have put the effort into your
software, right? How's that working out
for us? How's that working out for us?
Well, we're still doing studies. It's, you
know, early to say exactly but we do have
the following interesting coincidence. And
so, presented here, I have a collection of
prices that people pay for so-
called underground exploits. And I can
tell you these prices are maybe a little
low these days, but if you work in that
business, if you go to SyScan, if you do
that kind of stuff, maybe you know that
this is ballpark, it's ballpark.
Alright, and, just a coincidence, maybe it
means we're on the right track, I don't
know, but it's an encouraging sign: When
we run these programs through our
analysis, our rankings more or less
correspond to the actual prices that you
encounter in the wild for access via these
applications. Up above, I have one of our
histogram charts. You can see here that
Chrome and Edge in this particular model
scored very close to the same and it's a
test model, so, let's say they're
basically the same.
Firefox, you know, behind there a little
bit. I don't have Safari on this chart,
because these are all Windows applications,
but the Safari score falls in between.
So, lots of theory, lots of theory, lots
of theory and then we have this. So, we're
going to go ahead now and hand off to our
lead engineer, Parker, who is going to
talk about some of the concrete stuff, the
non-chalkboard stuff, the software stuff
that actually makes this work.
Thompson: Yeah, so I want to talk about
the process of actually doing it. Building
the tooling that's required to collect
these observables. Effectively: how do you
go mining for indicator
minerals? But first, the progression of
where we are and where we're going. We
initially broke this out into three major
tracks of our technology. We have our
static analysis engine, which started as a
prototype, and we have now recently
completed a much more mature and solid
engine that's allowing us to be much more
extensible, to dig deeper into
programs, and to provide much deeper
observables. Then, we have the data
collection and data reporting. Tim showed
some of our early stabs at this. We're
right now in the process of building new
engines to make the data more accessible
and easy to work with and hopefully more
of that will be available soon. Finally,
we have our fuzzer track. We needed to get
some early data, so we played with some
existing off-the-shelf fuzzers, including
AFL, and, while that was fun,
unfortunately it's a lot of work to
manually instrument a lot of fuzzers for
hundreds of binaries.
So, we then built an automated solution
that started to get us closer to having a
fuzzing harness that could autogenerate
itself, depending on the software, the
software's behavior. But, right now,
unfortunately that technology showed us
more deficiencies than it showed
successes. So, we are now working on a
much more mature fuzzer that will allow us
to dig deeper into programs as we're
running and collect very specific things
that we need for our model and our
analysis. But on to our analytic pipeline
today. This is one of the most concrete
components of our engine and one of the
most fun!
We effectively wanted some type of
software hopper, where you could just pour
programs in - installers - and then, out the
other end, come reports: fully annotated
and actionable information that we can
present to people. So, we went about the
process of building a large-scale engine.
It starts off with a simple REST API,
where we can push software in, which then
gets moved over to our computation cluster
that effectively provides us a fabric to
work with. It is made up of a lot of
different software suites, starting off
with our data processing, which is done by
Apache Spark, and then moves over into
data handling and data analysis in Spark,
and then we have a common HDFS layer to
provide a place for the data to be stored,
and then a resource manager in YARN. All
of that is backed by our compute and data
nodes, which scale out linearly. That then
moves into our data science engine, which
is effectively Spark with Apache Zeppelin,
which provides us a really fun interface
where we can work with the data in an
interactive manner but be kicking off
large-scale jobs into the cluster. And
finally, this goes into our report
generation engine. What this bought us
was the ability to linearly scale and make
that hopper bigger and bigger as we need,
but also provide us a way to process data
that doesn't fit in a single machine's
RAM. You can push the instance sizes as
large as you want, but we have
datasets that blow away any single host's
RAM. So this allows us to work with
really large collections of observables.
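To make the data-science layer concrete, a job of the kind described here - load per-binary observables and roll them up per product - might look like the following PySpark sketch; the schema and HDFS path are assumptions for illustration, not CITL's actual data model:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("observable-rollup").getOrCreate()

# One JSON record per analyzed binary, as emitted by the static-analysis stage.
observables = spark.read.json("hdfs:///data/observables/")  # hypothetical path

per_product = (
    observables.groupBy("product")
    .agg(
        F.count("*").alias("binaries"),
        F.avg(F.col("aslr").cast("double")).alias("aslr_fraction"),
        F.avg(F.col("relro").cast("double")).alias("relro_fraction"),
    )
)
per_product.show()
```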
I want to dive down now into our actual
static analysis. But first we have to
explore the problem space, because it's a
nasty one. Effectively, CITL's mission
is to process as much software as
possible. Hopefully all of it, but it's
hard to get your hands on all the binaries
that are out there. When you start to look
at that problem, you understand there are a
lot of combinations: there are a lot of CPU
architectures, a lot of operating
systems, a lot of file formats,
a lot of environments the software
gets deployed into, and every single one
of them has its own app-
armoring features. And a feature can be
available for one combination
but not for another, and you don't want to
penalize a developer for not turning on a
feature they never had access to turn
on. So, effectively, we need to solve this
in a much more generic way. And so what we
did is our static analysis engine
effectively looks like a gigantic
collection of abstraction libraries to
handle binary programs. You take in some
type of input file - be it ELF, PE, or Mach-O -
and then the pipeline splits. It goes off
into two major analyzer classes. First, our
format analyzers, which look at the
software much like how a linker or loader
would look at it: we want to understand how
it's going to be loaded up, what type of
armoring features are going to be applied, and
then we can run analyzers over that. In
order to achieve that, we need abstraction
libraries that can provide us an abstract
memory map, a symbol resolver, generic
section properties. So all that feeds in,
and then we run over a collection of
analyzers to collect data and observables.
Next we have our code analyzers, these are
the analyzers that run over the code
itself. I need to be able to look at every
possible executable path. In order to do
that we need to do function discovery,
feed that into a control flow recovery
engine, and then as a post-processing step
dig through all of the possible metadata
in the software, such as like a switch
table, or something like that to get even
deeper into the software. Then this
provides us a basic list of basic blocks,
functions, instruction ranges. And does so
in an efficient manner so we can process a
lot of software as it goes. Then all that
gets fed over into the main modular
analyzers. Finally, all of this comes
together and gets put into a gigantic blob
of observables and fed up to the pipeline.
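The "collection of modular analyzers" pattern described here can be pictured roughly as follows; the interface, the `binary` abstraction (a header dict plus an imported-symbol list), and the two example analyzers are illustrative assumptions, not CITL's actual engine:

```python
from abc import ABC, abstractmethod

class Analyzer(ABC):
    """A format or code analyzer: consumes an abstracted binary, emits observables."""
    @abstractmethod
    def run(self, binary) -> dict: ...

class AslrAnalyzer(Analyzer):
    def run(self, binary):
        # Assumes the abstraction layer exposes parsed header flags as a dict.
        return {"aslr": bool(binary.header.get("pie", False))}

class RiskyCallAnalyzer(Analyzer):
    RISKY = {"gets", "strcpy", "sprintf"}
    def run(self, binary):
        # Assumes the abstraction layer exposes the imported symbol names.
        return {"risky_calls": sorted(self.RISKY & set(binary.imported_symbols))}

def analyze(binary, analyzers):
    """Run every analyzer and merge their observables into one blob."""
    observables = {}
    for analyzer in analyzers:
        observables.update(analyzer.run(binary))
    return observables
```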
We really want to thank the Ford
Foundation for supporting our work in
this, because the pipeline and the static
analysis have been a massive boon for our
project and we're only beginning now to
really get our engine running and we're
having a great time with it. So, digging
into the observables themselves: what are
we looking at? Let's break them apart.
So, the format structure components: things
like ASLR, DEP, RELRO -
basic app armoring that is
going to be enabled at the OS
layer when the software gets loaded up or linked.
And we also collect other metadata about
the program, such as: "What libraries
are linked in?", "What does its dependency
tree look like - completely?", "How did
those libraries
score?", because that can affect your main
software. Interesting example on Linux, if
you link a library that requires an
executable stack, guess what your software
now has an executable stack, even if you
didn't mark that. So we need to be able to
understand what ecosystem the software
is gonna live in. And the code structure
analyzers look at things like
functionality: "What's the software
doing?", "What type of app armoring is
getting injected into the code?". A great
example of that is something like stack
guards or fortify source. These are
armoring features that only really apply, and
can only be observed, inside of the control flow
or inside of the actual instructions
themselves. This is why control
flow graphs are key.
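Two of the format-level observables just mentioned - whether the binary asks for an executable stack, and which libraries it links (whose own scores then matter) - can be read straight out of an ELF file. A simplified pyelftools sketch, again not CITL's engine:

```python
from elftools.elf.elffile import ELFFile

def stack_and_dependencies(path):
    with open(path, "rb") as f:
        elf = ELFFile(f)
        # The PT_GNU_STACK program header carries the requested stack
        # permissions; the PF_X bit (0x1) means an executable stack.
        exec_stack = False
        for seg in elf.iter_segments():
            if seg["p_type"] == "PT_GNU_STACK":
                exec_stack = bool(seg["p_flags"] & 0x1)
        # DT_NEEDED entries name the directly linked libraries - the first
        # level of the dependency tree that would be scored recursively.
        needed = []
        dynamic = elf.get_section_by_name(".dynamic")
        if dynamic is not None:
            needed = [tag.needed for tag in dynamic.iter_tags()
                      if tag.entry.d_tag == "DT_NEEDED"]
        return {"executable_stack": exec_stack, "needed_libraries": needed}
```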
We played around with a number of
different ways of analyzing software that
we could scale out, and ultimately we
came down to working with control
flow graphs. Provided here is a basic
visualization of what I'm talking about
with a control flow graph, provided by
Binary Ninja, which has wonderful visualization
tools - hence this picture, and not our
engine, because we don't build very
many visualization engines. But you
basically have a function that's broken up
into basic blocks, which is broken up into
instructions, and then you have basic flow
between them. Having this as an iterable
structure that we can work with, allows us
to walk over that and walk every single
instruction, understand the references,
understand where code and data is being
referenced, and how it is being
referenced,
and then what type of functionality is
being used. So this is a great way to find
something like whether or not your stack
guards are being applied on every function
that needs them, how deep they are being
applied, and whether the compiler is possibly
introducing errors into your armoring
features - which are interesting side
studies. The other reason we did this is because
we want to push the concept of what types
of observables there are even farther. Take
this example: you want to be able to
build instruction abstractions. Say that
for all major architectures you can break
instructions up into major categories, be it
arithmetic instructions, data manipulation
instructions like loads and stores, and then
control flow instructions. Then, with these
basic fundamental building blocks you can
make artifacts. Think of them like a unit
of functionality: has some type of input,
some type of output, it provides some type
of operation on it. And then with these
little units of functionality, you can
link them together and think of these
artifacts as maybe sub-basic-block, or
crossing a few basic blocks - a
different way to break up the software.
Because a basic block is just a branch
break, but we want to look at
functionality breaks, because these
artifacts can provide the basic,
fundamental building blocks of the
software itself. This becomes more important when
we want to start doing symbolic lifting,
so that we can lift the entire piece of software up
into a generic representation that we can
slice and dice as needed.
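A toy version of the instruction-abstraction idea: bucket instructions into coarse categories and group runs of them into "artifacts". The mnemonic lists below are a tiny, x86-flavored illustration, not a complete classification and not CITL's:

```python
CATEGORIES = {
    "arithmetic": {"add", "sub", "mul", "imul", "xor", "and", "or", "shl", "shr"},
    "data":       {"mov", "movzx", "lea", "push", "pop", "load", "store"},
    "control":    {"jmp", "je", "jne", "call", "ret", "jg", "jl"},
}

def categorize(mnemonic):
    for category, mnemonics in CATEGORIES.items():
        if mnemonic in mnemonics:
            return category
    return "other"

def artifacts(instruction_stream):
    """Group a linear stream of mnemonics into (category, length) runs -
    little 'units of functionality' that may span or subdivide basic blocks."""
    runs = []
    for mnemonic in instruction_stream:
        category = categorize(mnemonic)
        if runs and runs[-1][0] == category:
            runs[-1] = (category, runs[-1][1] + 1)
        else:
            runs.append((category, 1))
    return runs

# Example: artifacts(["mov", "mov", "add", "xor", "jne"])
# -> [("data", 2), ("arithmetic", 2), ("control", 1)]
```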
Moving from there, I want to talk about
fuzzing a little bit more. Fuzzing is
effectively at the heart of our project.
It provides us the rich dataset that we
can use to derive a model. It also
provides us awesome other metadata on the
side. But why? Why do we care about
fuzzing? Why is fuzzing the metric that
you build an engine around, that you build a
model around, that you derive some type of reasoning
from? So, think of the set of bugs,
vulnerabilities, and exploitable
vulnerabilities. In an ideal world you'd
want to just have a machine that pulls out
exploitable vulnerabilities.
Unfortunately, this is exceedingly costly,
because of a series of decision problems that sit
between these sets. So now consider the
superset of bugs or faults. A fuzzer can
easily recognize, or other software can
easily recognize faults, but if you want
to move down the sets you unfortunately
need to jump through a lot of decision
hoops. For example, if you want to move to
a vulnerability you have to understand:
Does the attacker have some type of
control? Is there a trust boundary being
crossed? Is this software configured in
the right way for this to be vulnerable
right now? So there are human factors that
are not deducible from the outside. You
then amplify this decision problem even
worse going to exploitable
vulnerabilities. So if we collect the
superset of bugs, we know that there
is some proportion of those subsets in there.
And this provides us a dataset that is easily
recognizable and that we can collect in a cost-
efficient manner. Finally, fuzzing is key,
and we're investing a lot of our time
right now and working on a new fuzzing
engine, because there are some key things
we want to do.
We want to be able to understand all of
the different paths the software could be
taking, and as you're fuzzing you're
effectively driving the software down as
many unique paths while referencing as
many unique data manipulations as
possible. So if we save off every path,
annotate the ones that are faulting, we
now have this beautiful rich data set of
exactly where the software went as we were
driving it in specific ways. Then we feed
that back into our static analysis engine
and begin to generate, out of those traces,
those instruction abstractions,
those artifacts. And with that, imagine we
have these gigantic traces of instruction
abstractions. From there we can then begin
to train the model to explore around the
fault location and begin to understand and
try and study the fundamental building
blocks of what a bug looks like in an
abstract, instruction-agnostic way. This is
why we're spending a lot of time on our
fuzzing engine right now. But hopefully
soon we'll be able to talk about that more,
and maybe in a tech track and not the policy
track.
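To make the shape of that data concrete: a very small sketch of the "run the target, record a path identifier, annotate the faulting runs" loop. This is not CITL's fuzzer - the "path id" here is just a hash of the program's output, where a real engine would record coverage or an instruction trace:

```python
import subprocess, hashlib

def run_case(target, data, timeout=5):
    """Run `target` with `data` (bytes) on stdin; report whether it crashed
    (killed by a signal) and a crude stand-in for the path it took."""
    try:
        proc = subprocess.run([target], input=data, capture_output=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"crashed": False, "timed_out": True, "path_id": None}
    crashed = proc.returncode < 0          # negative return code = killed by signal
    path_id = hashlib.sha1(proc.stdout + proc.stderr).hexdigest()
    return {"crashed": crashed, "timed_out": False, "path_id": path_id}

def fuzz(target, corpus):
    """Collect (input, path, crashed) records - the raw material that would
    get fed back into the static-analysis side."""
    return [dict(run_case(target, data), input=data) for data in corpus]
```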
C: Yeah, so from then on when anything
went wrong with the computer we said it
had bugs in it. laughs All right, I
promised you a technical journey, I
promised you a technical journey into the
dark abyss of as deep as you want to get
with it. So let's go ahead and bring it
up. Let's wrap it up and bring it up a
little bit here. We've talked a great deal
today about some theory. We've talked
about development in our tooling and
everything else and so I figured I should
end with some things that are not in
progress but in fact are done and are
yesterday's news - just to go ahead and
share that here with Europe. So in
the midst of all of our development we
have been discovering and reporting bugs,
again, this is not our primary purpose, really.
But you know you can't help but do it. You
know how computers are these days. You
find bugs just for turning them on, right?
So we disclosed all of that a
little while ago. At DEFCON and Black Hat,
our chief scientist Sarah together with
Mudge went ahead and dropped this
bombshell on the Firefox team which is
that for some period of time they had ASLR
disabled on OS X. When we first found it
we assumed it was a bug in our tools. When
we first mentioned it in a talk, they came
to us and said it's definitely a bug in
your tools - or it might be - or expressed some level of
surprise. And then people started looking
into it, and in fact at one point it had
been enabled and then temporarily
disabled. No one knew; everyone thought it
was on. It takes someone looking to notice
that kind of stuff, right. Major shout out
though, they fixed it immediately despite
our full disclosure on stage and
everything. So very impressed, but in
addition to popping surprises on people
we've also been doing the usual process of
submitting patches and bugs, particularly
to LLVM and QEMU, and if you work in
software analysis you could probably guess
why.
Incidentally, if you're looking for a
target to fuzz, if you want to go home from
CCC and find a ton of findings:
LLVM comes with a bunch of parsers. You
should fuzz them, you should fuzz them, and
I say that because I know for a fact you
are gonna get a bunch of findings, and it'd
be really nice - I would appreciate it - if I
didn't have to pay people to fix them.
So if you wouldn't mind disclosing them,
that would help. But besides these bug
reports and all these other things we've
also been working with lots of others. You
know we gave a talk earlier this summer,
Sarah gave a talk earlier this summer,
about these things and she presented
findings on comparing some of these base
scores of different Linux distributions.
And based on those findings there was a
person on the Fedora Red Team, Jason
Calloway, who sat there and, well, I can't
read his mind but I'm sure that he was
thinking to himself: golly it would be
nice to not, you know, be surprised at the
next one of these talks. They score very
well by the way. They were leading in
many, many of our metrics. Well, in any
case, he left Vegas and he went back home
and he and his colleagues have been
working on essentially re-implementing
much of our tooling so that they can check
the stuff that we check before they
release. Before they release. Looking for
security before you release. So that would
be a good thing for others to do and I'm
hoping that that idea really catches on.
laughs Yeah, yeah right, that would be
nice. That would be nice.
But in addition to that, in addition to
that our mission really is to get results
out to the public and so in order to
achieve that, we have broad partnerships
with Consumer Reports and the digital
standard. Especially if you're into cyber
policy, I really encourage you to take a
look at the proposed digital standard,
which encompasses the things we
look for and so much more: URLs,
data traffic in motion, cryptography,
update mechanisms, and all that good stuff.
So, where we are and where we're going -
the big takeaways here, if you're
looking for the "so what" - three points
for you. One: we are building the tooling
necessary to do larger and larger and
larger studies regarding these surrogate
security scores. My hope is that in some
period of the not-too-distant future, I
would like to be able to, with my
colleagues, publish some really nice
findings about what are the things that
you can observe in software, which have a
suspiciously high correlation with the
software being good. Right, nobody really
knows right now. It's an empirical
question. As far as I know, the study
hasn't been done. We've been running it on
the small scale. We're building the
tooling to do it on a much larger scale.
We are hoping that this winds up being a
useful field in security as that
technology develops. In the meantime our
static analyzers are already making
surprising discoveries: hit YouTube and
take a look for Sarah Zatko's recent talks
at DEFCON/Black Hat. Lots of fun findings
in there. Lots of things that anyone who
looked would have found. Lots of that.
And then lastly, if you were in the
business of shipping software and you are
thinking to yourself.. okay so these guys,
someone gave them some money to mess up my
day and you're wondering: what can I do to
not have my day messed up? One simple
piece of advice, one simple piece of
advice: make sure your software employs
every exploit mitigation technique Mudge
has ever heard of or will ever hear of. And he's
heard of a lot of them. You know
all of that - turn all those things
on. And if you don't know anything about
that stuff, if nobody on your team knows
anything about that stuff... I don't
even know why I'm saying this: if you're here, you
know about that stuff, so do that. If
you're not here, then you should be here.
Thank you, thank you.
Herald Angel: Thank you, Tim and Parker.
Do we have any questions from the
audience? It's really hard to see you with
that bright light in my face. I think the
signal angel has a question. Signal Angel:
So the IRC channel was impressed by your
tools and your models that you wrote. And
they are wondering what's going to happen
to that, because you do have funding from
the Ford Foundation now, and so what are
your plans with this? Do you plan on
commercializing this or is it going to be
open source or how do we get our hands on
this?
C: It's an excellent question. So for the
time being the money that we are receiving
is to develop the tooling, pay for the AWS
instances, pay for the engineers and all
that stuff. As for the direction that we, as an
organization, would like to take
things: I have no interest in running a
monopoly. That sounds like a fantastic
amount of work and I really don't want to
do it. However, I have a great deal of
interest in taking the gains that we are
making in the technology and releasing the
data so that other competent researchers
can go through and find useful things that
we may not have noticed ourselves. So
we're not at a point where we are
releasing data in bulk just yet, but that
is simply a matter of engineering - our
tools are still in flux.
When we do that, we want to make sure the
data is correct and so our software has to
have its own low bug counts and all these
other things. But ultimately there is the
scientific aspect of our mission - though
the science is not our primary mission;
our primary mission is to apply it to help
consumers. At the same time, it is our
belief that an opaque model is as good as
crap. No one should trust an opaque model:
if somebody is telling you that they have
some statistics and they do not provide
you with any underlying data and it is not
reproducible you should ignore them.
Consequently what we are working towards
right now is getting to a point where we
will be able to share all of those
findings. The surrogate scores, the
interesting correlations between
observables and fuzzing. All that will be
public as the material comes online.
Signal Angel: Thank you.
C: Thank you.
Herald Angel: Thank you. And microphone
number three please.
Mic3: Hi, thanks - some really
interesting work you presented here. So
there's something I'm not sure I
understand about the approach that you're
taking. If you are evaluating the security
of say a library function or the
implementation of a network protocol for
example you know there'd be a precise
specification you could check that
against. And the techniques you're using
would make sense to me. But the goal that
you've set for yourself is to evaluate the
security of consumer software, and it's not
clear to me whether it's fair to call
these results security scores in the
absence of a threat model. So my
question is, you know, how is it
meaningful to make a claim that a piece of
software is secure if you don't have a
threat model for it?
C: This is an excellent question, and
anyone who disagrees is
wrong. Security without a threat model is
not security at all. It's absolutely a
true point. So the things that we are
looking for, most of them are things that
you will already find present in your
threat model. And so for example we were
reporting on the presence of things like
ASLR and lots of other things that get to
the heart of exploitability of a piece of
software. So for example if we are
reviewing a piece of software, that has no
attack surface
then it is canonically not in the threat
model and in that sense it makes no sense
to report on its overall security. On the
other hand, if we're talking about
software like say a word processor, a
browser, anything on your phone, anything
that talks on the network, we're talking
about those kinds of applications then I
would argue that exploit mitigations and
the other things that we are measuring are
almost certainly very relevant. So there's
a sense in which what we are measuring is
the lowest common denominator among what
we imagine are the dominant threat models
for the applications. A hand-wavy
answer, but I promised heuristics, so there
you go.
Mic3: Thanks.
C: Thank you.
Herald Angel: Any questions? No raised
hands, okay. Then the herald can ask a
question, because I rarely get to. So the
question is: you mentioned earlier these
security labels - what
institution, for example, could give out the security
labels? Because obviously the vendor
has no interest in IT security.
C: Yes it's a very good question. So our
partnership with Consumer Reports. I don't
know if you're familiar with them, but in
the United States Consumer Reports is a
major consumer watchdog organization.
They test the safety of automobiles, they
test you know lots of consumer appliances.
All kinds of things both to see if they
function more or less as advertised but
most importantly they're checking for
quality, reliability and safety. So our
partnership with Consumer Reports is all
about us doing our work and then
publishing that. And so for example the
televisions that we presented the data on
all of that was collected and published in
partnership with Consumer Reports.
Herald: Thank you.
C: Thank you.
Herald: Any other questions from the stream? I
hear a no. Well, in this case, people, thank
you.
Thank Tim and Parker for their nice talk
and please give them a very, very warm
round of applause.
applause
C: Thank you. T: Thank you.
subtitles created by c3subtitles.de
in the year 2017. Join, and help us!