Music

Herald: The next talk is about how risky the software you use is. You may have heard about Trump versus a Russian security company. We won't judge it, we won't comment on it, but we dislike the prejudgments in that case. Tim Carstens and Parker Thompson will tell you a little bit more about how risky the software you use is. Tim Carstens is CITL's Acting Director and Parker Thompson is CITL's lead engineer. Please welcome, with a very, very warm applause: Tim and Parker! Thanks.

Applause

Tim Carstens: Howdy, howdy. So my name is Tim Carstens. I'm the acting director of the Cyber Independent Testing Lab. That's four words; we'll talk about all four today, especially "cyber". With me today is our lead engineer, Parker Thompson. Not on stage are our other collaborators, Patrick Stach and Sarah Zatko, and present in the room but not on stage: Mudge. So today we're going to be talking about our work. The lead-in, the introduction that was given, was phrased in terms of Kaspersky and all of that. I'm not going to be speaking about Kaspersky, and I guarantee you I'm not going to be speaking about my president. Right, yeah? Okay. Thank you.

Applause

All right, so why don't we go ahead and kick off. I'll mention now that parts of this presentation are going to be quite technical. Not most of it, and I will always include analogies and all these other things if you are here for security but are not a bit-twiddler. But if you do want to review some of the technical material - if I go through it too fast, if you like to read, if you're a mathematician or a computer scientist - our slides are already available for download at this site here. We thank our partners at power door for getting that set up for us. Let's get started on the real material here. All right, so we are CITL: a nonprofit organization based in the United States, founded by our chief scientist Sarah Zatko and our board chair Mudge. And our mission is a public-good mission. We are hackers, but our mission here is actually to look out for people who do not know very much about machines, or at least not as much as the other hackers do. Specifically, we seek to improve the state of software security by providing the public with accurate reporting on the security of popular software, right? And so there was a mouthful for you. But no doubt, no doubt, every single one of you has received questions of the form: what do I run on my phone, what do I do with this, what do I do with that, how do I protect myself, all these other things. Lots of people in the general public are looking for agency in computing. No one's offering it to them, and so we're trying to provide a forcing function on the software field in order to, you know, again, enable consumers and users and all these things. Our social-good work is funded largely by charitable monies from the Ford Foundation, whom we thank a great deal, but we also have major partnerships with Consumer Reports, which is a major organization in the United States that generally, broadly, looks at consumer goods for safety and performance. We also partner on The Digital Standard, which probably would be of great interest to many people here at Congress, as it is a holistic standard for protecting user rights. We'll talk about some of the work that goes into those things in a bit, but first I want to give the big picture of what it is we're really trying to do, in one short little sentence. Something like this, but for security, right?
What are the important facts, how does it rate, you know, is it easy to consume, is it easy to look at it and say this thing is good, this thing is not good? Something like this, but for software security. Sounds hard, doesn't it? So I want to talk a little bit about what I mean by "something like this". There are lots of consumer outlook and watchdog and protection groups, some private, some government, which are looking to do this for various things that are not software security. And you can see some examples here that are big in the United States. I happen to not like these as much as some of the newer consumer labels coming out of the EU, but nonetheless they are examples of the kinds of things people have done in other fields, fields that are not security, to try to achieve that same end. And when these things work well, it is for three reasons. One: it has to contain the relevant information. Two: it has to be based in fact. We're not talking opinions; this is not a book club or something like that. And three: it has to be actionable, it has to be actionable - you have to be able to know how to make a decision based on it. How do you do that for software security? How do you do that for software security? So the rest of the talk is going to go in three parts. First, we're going to give a bit of an overview of the more consumer-facing side of what we do: look at some data that we have reported on earlier and all these other kinds of good things. We're then going to get terrifyingly, terrifyingly technical. And then after that we'll talk about the tools to actually implement all this stuff. The technical part comes before the tools, so that tells you how terrifyingly technical we're going to get. It's going to be fun, right? So how do you do this for software security: the consumer version. So, if you set forth on the task of trying to measure software security - many people here probably work in the security field, perhaps as consultants doing reviews; certainly I used to - then probably what you're thinking to yourself right now is that there are lots and lots and lots of things that affect the security of a piece of software. Some of which, mmm, you're only going to see if you go reversing. And some of which are just, you know, kicking around on the ground waiting for you to notice, right? So we're going to talk about both of those kinds of things that you might measure. But here you see these giant charts. On the left we have Microsoft Excel on OS X, on the right Google Chrome for OS X. This is a couple of years old at this point, maybe one and a half years old. I'm not expecting you to be able to read these; the real point is to say: look at all of the different things you can measure very easily. How do you distill it, how do you boil it down, right? So this is the opposite of a good consumer safety label. This is just, um, if you've ever done any consulting, the kind of report you hand a client to tell them how good their software is, right? It's the opposite of consumer grade. But the reason I'm showing it here is because, you know, I'm going to call out some things, and maybe you can't process all of this because it's too much material. But I'm going to call out some things, and once I call them out, just like an NP problem, you're going to recognize them instantly. So for example, Excel, at the time of this review: look at this column of dots. What are these dots telling you?
It's telling you: look at all these libraries. All of them are 32-bit only. Not 64 bits, not 64 bits. Take a look at Chrome: exact opposite, exact opposite, 64-bit binary, right? What are some other things? Excel, again, on OS X: maybe you can see these danger warning signs that go straight up the whole thing. That's the absence of major exploit mitigation flags in the binary headers. We'll talk about what that means exactly in a bit. But also, if you hop over here, you'll see: yeah, yeah, Chrome has all the different exploit protections that a binary might enable - on OS X, that is - but it also has more dots in this column here off to the right. And what do those dots represent? Those dots represent functions, functions that historically have been a source of trouble - functions that are very hard to call correctly. If you're a C programmer, the "gets" function is a good example, but there are lots of them. And you can see here that Chrome doesn't mind; it uses them all a bunch. And Excel, not so much. And if you know the history of Microsoft and the Trusted Computing initiative and the SDL and all of that, you will know that a very long time ago Microsoft made a decision and said: we're going to start purging some of these risky functions from our code bases, because we think it's easier to ban them than to teach our devs to use them correctly. And you see that reverberating out in their software. Google, on the other hand, says: yeah, yeah, those functions can be dangerous to use, but if you know how to use them they can be very good, and so they're permitted. The point all of this is building to is that if you start by just measuring every little thing that your static analyzers can detect in a piece of software, two things happen. One: you wind up with way more data than you can show in a slide. And two: the engineering process, the software development life cycle that went into the software, will leave behind artifacts that tell you something about the decisions that went into designing that engineering process. And so, you know, Google, for example: quite rigorous as far as hitting, you know, GCC with all of the compiler protections enabled. Microsoft, maybe less good at that, but much more rigid about the things that were very popular ideas when they introduced Trusted Computing, all right. So the big takeaway from this material is that, again, the software engineering process results in artifacts in the software that people can find. All right. Okay, so that's a whole bunch of data; certainly it's not a consumer-friendly label. So how do you start to get into the consumer zone? Well, the main defect of the big reports that we just saw is that it's too much information. It's very dense on data, but it's very hard to distill it to the "so what" of it, right? And so this here is one of our earlier attempts to do that distillation. What are these charts, how did we come up with them? Well, on the previous slide we saw all these different factors that you can analyze in software; basically, here's how we arrive at this. For each of those things: pick a weight. Go ahead and compute a score, a weighted average: tada, now you have a number. You can do that for each of the libraries in the piece of software, and if you do that for each of the libraries in the software, you can then produce these histograms to show, you know, this percentage of the DLLs had a score in this range. Boom, there's a bar, right?
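To make that weighting-and-binning step concrete, here is a toy sketch in Python. The feature names and weights are made up for illustration; they are not CITL's actual model.

```python
# Toy illustration of collapsing per-library observables into one score and binning
# the scores into a histogram. Feature names and weights are invented, not CITL's.
WEIGHTS = {"aslr": 3.0, "dep": 3.0, "relro": 2.0,
           "stack_guard": 2.0, "no_risky_calls": 1.0, "is_64bit": 1.0}

def score(observables):
    """observables: dict mapping feature name -> True/False for one library."""
    total = sum(WEIGHTS.values())
    return sum(w for name, w in WEIGHTS.items() if observables.get(name)) / total

def histogram(scores, bins=10):
    """Count how many libraries land in each score bucket between 0.0 and 1.0."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return counts

libraries = [
    {"aslr": True, "dep": True, "relro": True,
     "stack_guard": True, "no_risky_calls": False, "is_64bit": True},
    {"aslr": False, "dep": True, "relro": False,
     "stack_guard": False, "no_risky_calls": True, "is_64bit": False},
]
print(histogram([score(lib) for lib in libraries]))
```

In this toy model, a "tall bar far to the right" simply means most libraries scored near 1.0.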
How do you pick those weights? We'll talk about that in a sec; it's very technical. But the takeaway is that you wind up with these charts. Now, I've obscured the labels, and the reason I've done that is because I don't really care that much about the actual counts. I want to talk about the shapes, the shapes of these charts: it's a qualitative thing. So here, good scores appear on the right, bad scores appear on the left. The histogram measures all the libraries and components, and so a very secure piece of software in this model manifests as a tall bar far to the right. And you can see a clear example in our custom Gentoo build. Anyone here who is a Gentoo fan knows: hey, I'm going to install this thing, I think I'm going to turn on every single one of those flags - and lo and behold, if you do that, yeah, you wind up with a tall bar far to the right. Here's Ubuntu 16 - I bet it's 16.04, but I don't recall exactly, 16 LTS. Here you see a lot of tall bars to the right, not quite as consolidated as a custom Gentoo build, but that makes sense, doesn't it? Because you don't do your whole Ubuntu build yourself, right? Now I want to contrast. I want to contrast. So over here on the right we see, in the same model, an analysis of the firmware obtained from two smart televisions, last year's models from Samsung and LG, and here are the model numbers. We did this work in concert with Consumer Reports. And what do you notice about these histograms, right? Are the bars tall and to the right? No, they look almost normal - not quite, but that doesn't really matter. The main thing that matters is that this is the shape you would expect to get if you were basically playing a random game to decide what security features to enable in your software. This is the shape of not having a security program, is my bet. That's my bet. And so what do you see? You see heavy concentration here in the middle, right, that seems fair, and it tails off. On the Samsung nothing scored all that great; same on the LG. Both of them are, you know, running their respective operating systems, and they're basically just inheriting whatever security came from whatever open source thing they forked, right? So this is the kind of message, this right here is the kind of thing that we exist for. This is us producing charts showing that the current practices in the not-so-consumer-friendly space of running your own Linux distro far exceed the products being delivered, certainly in this case, in the smart TV market. But I think you might agree with me: it's much worse than this. So let's dig into that a little bit more. I have a different point that I want to make about that same data set. So this table here, this table is again looking at the LG, Samsung, and Gentoo Linux installations. And in this table we're just pulling out some of the easy-to-identify security features you might enable in a binary, right? So: percentage of binaries with address space layout randomization, right? Let's talk about that. On our Gentoo build it's over 99%. That also holds for the Amazon Linux AMI; it holds in Ubuntu. ASLR is incredibly common in modern Linux. And despite that, fewer than 70 percent of the binaries on the LG television had it enabled. Fewer than 70 percent. And the Samsung was doing, you know, better than that, I guess, but 80 percent is pretty disappointing when a default install of a mainstream Linux distro is going to get you 99, right?
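As a rough sketch of how a number like "percentage of binaries with ASLR" can be gathered from a firmware image - this is illustrative only, not CITL's tooling - you can walk an unpacked filesystem and check each ELF binary with pyelftools. Being built as ET_DYN (PIE) is used here as a crude proxy for ASLR of the binary's own image, and the presence of a GNU_RELRO segment as a proxy for RELRO:

```python
# Rough sketch, not CITL's tooling: walk an unpacked firmware tree and report the
# percentage of ELF binaries built as PIE and with a GNU_RELRO segment.
import os, sys
from elftools.elf.elffile import ELFFile
from elftools.common.exceptions import ELFError

def check(path):
    with open(path, "rb") as f:
        elf = ELFFile(f)
        pie = elf.header["e_type"] == "ET_DYN"          # crude PIE heuristic
        relro = any(seg["p_type"] == "PT_GNU_RELRO"
                    for seg in elf.iter_segments())
        return pie, relro

total = pie_count = relro_count = 0
for root, _, files in os.walk(sys.argv[1]):             # argv[1]: unpacked firmware root
    for name in files:
        path = os.path.join(root, name)
        try:
            pie, relro = check(path)
        except (ELFError, OSError, ValueError):
            continue                                     # not a readable ELF file
        total += 1
        pie_count += pie
        relro_count += relro

print(f"{total} ELF binaries: {100 * pie_count / max(total, 1):.1f}% PIE, "
      f"{100 * relro_count / max(total, 1):.1f}% with a RELRO segment")
```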
And it only gets worse, it only gets worse, right? RELRO support - if you don't know what that is, that's okay, but if you do: look at this abysmal coverage, look at this abysmal coverage coming out of these IoT devices. Very sad. And you see it over and over and over again. I'm showing this because some people in this room, or watching this video, ship software, and I have a message, I have a message for those people who ship software and aren't working on, say, Chrome or any of the other big-name Pwn2Own kinds of targets. Look at this: you can be leading the pack by mastering the fundamentals. You can be leading the pack by mastering the fundamentals. This is a point that we as a security field really need to be driving home. You know, one of the things that we're seeing in our data is that if you're the vendor who is shipping the product everyone has heard of in the security field, then maybe your game is pretty decent, right? If you're shipping, say, Windows, or if you're shipping Firefox or whatever. But if you're doing one of these things where people are just kind of beating you up for default passwords, then your problems go way further than just default passwords, right? Like the house, the house is messy; it needs to be cleaned, needs to be cleaned. So for the rest of the talk, like I said, we're going to be discussing a lot of other things that amount to getting a peek behind the curtain: where some of these things come from, and getting very specific about how this business works. But if you're interested in more of the high-level material, especially if you're interested in interesting results and insights, some of which I'm going to have here later, I really encourage you to take a look at the talk from this past summer by our chief scientist Sarah Zatko, which is predominantly on the topic of surprising results in the data. Today, though, this being our first time presenting here in Europe, we figured we would take more of an overarching view: what we're doing, why we're excited about it, and where it's headed. So we're about to move into a little bit of the underlying theory, you know, why I think it's reasonable to even try to measure the security of software from a technical perspective. But before we can get into that, I need to talk a little bit about our goals, so that the decisions and the theory - the motivation - are clear, right? Our goals are really simple; it's a very easy organization to run because of that. Goal number one: remain independent of vendor influence. We are not the first organization to purport to be looking out for the consumer. But unlike many of our predecessors, we are not taking money from the people we review, right? Seems like some basic stuff. Seems like some basic stuff, right? Thank you, okay. Two: automated, comparable, quantitative analysis. Why automated? Well, we need our test results to be reproducible. If Tim goes and opens up your software in IDA and finds a bunch of stuff, that's not a very repeatable kind of standard for things. And so we're interested in things which are automated. We'll talk about that - maybe a few hackers in here know how hard that is. We'll talk about that. And then last, we're acting as a watchdog: we're protecting the interests of the user, the consumer, however you would like to look at it. But we also have three non-goals, three non-goals that are equally important. One: we have a non-goal of finding and disclosing vulnerabilities.
I reserve the right to find and disclose vulnerabilities. But that's not my goal, it's not my goal. Another non-goal is telling software vendors what to do. If a vendor asks me how to remediate their terrible score, I will tell them what we are measuring, but I'm not there to help them remediate. It's on them to be able to ship a secure product without me holding their hand. We'll see. And then the third non-goal: performing free security testing for vendors. Our testing happens after you release. Because when you release your software, you are telling people it is ready to be used. Is it really, though? Is it really, though, right?

Applause

Yeah, thank you. Yeah, so we are not there to give you a preview of what your score will be. There is no sum of money you can hand me that will get you an early preview of your score. You can try me, you can try me: there's a fee for trying me. There's a fee for trying me. But I'm not going to look at your stuff until I'm ready to drop it, right? Yeah, you're welcome, yeah. All right. So, moving into theory territory. Three big questions, three big questions that need to be addressed if you want to do our work efficiently. One: what works? What works for improving security, what are the things that you really need or want to see in software? Two: how do you recognize when it's being done? It's no good if someone hands you a piece of software and says, "I've done all the latest things," and it's a complete black box. If you can't check the claim, the claim is as good as false, in practical terms, period, right? Software has to be reviewable, or a priori I'll think you're full of it. And then three: who's doing it? Of all the things that work, that you can recognize, who's actually doing it? You know, our field is famous for ruining people's holidays and weekends over Friday bug disclosures, New Year's Eve bug disclosures. I would like us to also be famous for calling out those teams and those software organizations which are being as good as the bad guys are being bad, yeah? So provide someone an incentive to maybe be happy to see us for a change, right? Okay, so thank you. Yeah, all right. So how do we actually pull these things off? The basic idea. So, I'm going to get into some deeper theory. If you're not a theorist, I want you to focus on this slide. And I'm going to bring it back - it's not all theory from here on out after this - but if you're not a theorist, I really want you to focus on this slide. The basic motivation, the basic motivation behind what we're doing, the technical motivation, why we think it's possible to measure and report on security: it all boils down to this, right? So we start with a thought experiment, a Gedankenexperiment, right? Given a piece of software, we can ask, one: overall, how secure is it? Kind of a vague question, but you can imagine there are versions of that question. And two: what are its vulnerabilities? Maybe you want to nitpick with me about what the word "vulnerability" means, but broadly, you know, this is a much more specific question, right? And here's the enticing thing: the first question appears to ask for less information than the second question. And maybe, if we were taking bets, I would put my money on: yes, it actually does ask for less information. What do I mean by that, what do I mean by that? Well, let's say that someone told you all of the vulnerabilities in a system, right? They said, "Hey, I got them all," right? You're like, all right, that's cool, that's cool.
And if someone asks you, hey, how secure is this system, you can give them a very precise answer. You can say it has N vulnerabilities, and they're of this kind, and all this stuff, right? So certainly the second question is enough to answer the first. But is the reverse true? Namely, if someone were to tell you, for example, "Hey, this piece of software has exactly 32 vulnerabilities in it," does that make it easier to find any of them? Right, there's room to maybe do that using some algorithms that are not yet in existence. Certainly the computer scientists in here are saying: "Well, you know, maybe counting the number of SAT solutions doesn't help you practically find solutions. But it might, and we just don't know." Okay, fine, fine, fine. Maybe these things are the same, but my experience in security, and the experience of many others perhaps, is that they probably aren't the same question. And this motivates what I'm calling here Zatko's question, which is basically asking for an algorithm that demonstrates that the first question is easier than the second, right? So, Zatko's question: develop a heuristic which can efficiently answer one, but not necessarily two. If you're looking for a metaphor, if you want to know why I care about this distinction, I want you to think about certain controversial technologies: maybe think about, say, nuclear technology, right? An algorithm that answers one, but not two, is a very safe algorithm to publish. A very safe algorithm to publish indeed. Okay, Claude Shannon would like more information. Happy to oblige. Let's take a look at this question from a different perspective, maybe a more hands-on perspective: the hacker perspective, right? If you're a hacker and you're watching me up here, and I'm waving my hands around and showing you charts, maybe you're thinking to yourself: yeah, boy, what do you got? Right, how does this actually go? And maybe what you're thinking to yourself is that, you know, finding good vulns, that's an artisan craft, right? You're in IDA, you're reversing away, you're doing all these things, I don't know, all that stuff. And this kind of clever game, cleverness, is not a thing that feels very automatable. But, you know, on the other hand, there are a lot of tools that do automate things, and so it's not completely not automatable. And if you're into fuzzing, then perhaps you are aware of this very simple observation, which is that if your harness is perfect, if you really know what you're doing, if you have a decent fuzzer, then in principle fuzzing can find every single problem. You have to be able to look for it, you have to be able to harness for it, but in principle it will, right? So the hacker perspective on Zatko's question is maybe of two minds: on the one hand, assessing security is a game of cleverness, but on the other hand, we're kind of right now at the cusp of having some game-changing tech really land. Maybe you're saying fuzzing is not at the cusp; I promise it's just at the cusp. We haven't seen all that fuzzing has to offer, right? And so maybe there's room, maybe there's room for some automation to be possible in pursuit of Zatko's question. Of course, there are many challenges still in using existing hacker technology, mostly in the form of various open questions. For example, if you're into fuzzing, you know, hey: identifying unique crashes. There's an open question. We'll talk about some of those, we'll talk about some of those.
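To make that open question concrete: one common heuristic for approximating "unique crashes" - a sketch of a widely used approach, not necessarily the one CITL uses - is to bucket crashes by a hash of the top few symbolized stack frames, so that thousands of fuzzer crashes collapse into a handful of apparently distinct bugs. The frame names below are invented for illustration.

```python
# Bucket crashes by hashing the innermost few stack frames. A crude but common
# deduplication heuristic; frame names here are made up for the example.
import hashlib

def crash_bucket(stack_frames, depth=5):
    """stack_frames: list of symbolized frames, innermost first."""
    top = "|".join(stack_frames[:depth])
    return hashlib.sha256(top.encode()).hexdigest()[:16]

crashes = [
    ["strlen", "parse_header", "handle_request", "main"],
    ["strlen", "parse_header", "handle_request", "main"],   # same bug, different input
    ["memcpy", "decode_frame", "handle_request", "main"],
]
unique = {crash_bucket(frames) for frames in crashes}
print(f"{len(crashes)} crashes, {len(unique)} unique buckets")
```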
But I'm going to offer another perspective here. So maybe you're not in the business of doing software reviews, but you know a little computer science. And maybe that computer science has you wondering: what's this guy talking about, right? I'm here to acknowledge that. So, whatever you think the word "security" means, I've got a list of questions up here. Whatever you think the word security means, probably some of these questions are relevant to your definition. Right? Does the software have a hidden backdoor or any kind of hidden functionality, does it handle crypto material correctly, et cetera, so forth. Anyone in here who knows some computability theory knows that every single one of these questions, and many others like them, is undecidable, due to reasons essentially no different than the reason the halting problem is undecidable - which is to say, due to reasons essentially first identified and studied by Alan Turing, a long time before we had microarchitectures and all these other things. And so the computability perspective says that, whatever your definition of security is, ultimately you have this recognizability problem: a fancy way of saying that algorithms won't be able to recognize secure software, because of these undecidability issues. The takeaway, the takeaway, is that the computability angle on all of this says: anyone who's in the business that we're in has to use heuristics. You have to, you have to. All right, this guy gets it. All right, so on the tech side, the last technical perspective that we're going to take is certainly the most abstract, which is the Bayesian perspective, right? So if you're a frequentist, you need to get with the times; it's all Bayesian now. So, let's talk about this for a bit. Only two slides of math, I promise, only two! So, let's say that I have some corpus of software. Perhaps it's a collection of all modern browsers, perhaps it's the collection of all the packages in the Debian repository, perhaps it's everything on GitHub that builds on your system, perhaps it's a hard drive full of warez that some guy mailed you, right? You have some corpus of software, and for a random program in that corpus we can consider this probability: the probability distribution of which software is secure versus which is not. For reasons described in the computability perspective, this number is not a computable number for any reasonable definition of security. So that's neat, and so, in practical terms, if you want to do some probabilistic reasoning, you need some surrogate for that, and so we consider this here. Instead of considering the probability that a piece of software is secure - a non-computable, non-verifiable claim - we take a look at this indexed collection of probabilities. This is a countably infinite family of probability distributions: basically, P sub h,k is just the probability that, for a random piece of software in the corpus, h work units of fuzzing will find no more than k unique crashes, right? And why is this relevant? Well, at the bottom we have this analytic observation, which is that in the limit as h goes to infinity, you're basically saying: "Hey, you know, if I fuzz this thing forever, what does that look like?" And, essentially, we have analytically that this should converge: P sub h,1 should converge to the probability that a piece of software simply cannot be made to crash. Not the same thing as being secure, but certainly not a small concern relevant to security.
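For readers who want the slide's math written out, here is a reconstruction of the definitions as described. The exact indexing convention is my assumption: reading P_{h,k} as "fewer than k unique crashes" makes the stated limit with k = 1 come out exactly. The conditional variant discussed just below is included for completeness.

```latex
P_{h,k} \;=\; \Pr_{s \in S}\!\left[\, h \text{ work units of fuzzing } s \text{ yield fewer than } k \text{ unique crashes} \,\right]

\lim_{h \to \infty} P_{h,1} \;=\; \Pr_{s \in S}\!\left[\, s \text{ cannot be made to crash} \,\right]

P_{h,k \mid M} \;=\; \Pr_{s \in S}\!\left[\, \text{fewer than } k \text{ unique crashes in } h \text{ work units} \;\middle|\; s \text{ has observable property } M \,\right]
```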
So, none of that stuff was actually Bayesian yet, so we need to get there. And so here we go: the previous slide described a probability distribution measured based on fuzzing. But fuzzing is expensive, and it is also not an answer to Zatko's question, because it finds vulnerabilities; it doesn't measure security in the general sense. And so here's where we make the jump to conditional probabilities. Let M be some observable property of software: has ASLR, has RELRO, calls these functions, doesn't call those functions... take your pick. For random s in S we now consider these conditional probability distributions. This is the same kind of probability as we had on the previous slide, but conditioned on this observable being true, and this leads to the refined, CITL variant of Zatko's question: which observable properties of software satisfy that, when the software has property M, the probability of fuzzing being hard is very high? That's what this version of the question asks, and here we quantify "hard" as log(h)/k being large - in other words, exponentially more fuzzing than you would expect to need to find bugs. So this is the technical version of what we're after. All of this can be explored, you can brute-force your way to finding all of this stuff, and that's exactly what we're doing. So we're looking for all kinds of things, all kinds of things that correlate with fuzzing having low yield on a piece of software, and there are a lot of ways in which that can happen. It could be that you are looking at a feature of software that literally prevents crashes. Maybe it's the never-crash flag, I don't know. But most of the things I've talked about - ASLR, RELRO, etc. - don't prevent crashes. In fact, ASLR can take non-crashing programs and make them crash. It's the number one reason vendors don't enable it, right? So why am I talking about ASLR? Why am I talking about RELRO? Why am I talking about all these things that have nothing to do with stopping crashes, while claiming I'm measuring crashes? This is because, in the Bayesian perspective, correlation is not the same thing as causation, right? Correlation is not the same thing as causation. It could be that M's presence literally prevents crashes, but it could also be that, by some underlying coincidence, the things we're looking for are mostly only found in software that's robust against crashing. If you're looking for security, I submit to you that the difference doesn't matter. Okay, end of my math, thank you. I will now give a really nice analogy for all those things I just described, right? So we're looking for indicators of a piece of software being secure enough to be good for consumers, right? So here's an analogy. Let's say you're a geologist; you study minerals and all of that, and you're looking for diamonds. Who isn't, right? Want those diamonds! And how do you find diamonds? Even in places that are rich in diamonds, diamonds are not common. You don't just go walking around in your boots, kicking until your toe stubs on a diamond. You don't do that. Instead, you look for other minerals that are mostly only found near diamonds but are much more abundant in those locations than the diamonds. This is mineral science 101, I guess, I don't know. So, for example, you want to go find diamonds: put on your boots and go kicking until you find some chromite, look for some diopside, you know, look for some garnet.
None of these things turn into diamonds, none of these things cause diamonds, but if you're finding good concentrations of them, then, statistically, there are probably diamonds nearby. That's what we're doing. We're not looking for the things that cause good security per se. Rather, we're looking for the indicators that you have put the effort into your software, right? How's that working out for us? How's that working out for us? Well, we're still doing studies. It's, you know, early to say exactly, but we do have the following interesting coincidence. Here I have a collection of prices that somebody might pay for so-called underground exploits. And I can tell you these prices are maybe a little low these days, but if you work in that business, if you go to Cyscin, if you do that kind of stuff, maybe you know that this is ballpark, it's ballpark. All right, and, just a coincidence, maybe it means we're on the right track, I don't know, but it's an encouraging sign: when we run these programs through our analysis, our rankings more or less correspond to the actual prices that you encounter in the wild for access via these applications. Up above, I have one of our histogram charts. You can see here that Chrome and Edge in this particular model scored very close to the same - and it's a test model, so let's say they're basically the same. Firefox, you know, is behind by a little bit. I don't have Safari on this chart, because these are all Windows applications, but the Safari score falls in between. So, lots of theory, lots of theory, lots of theory, and then we have this. So, we're going to go ahead now and hand off to our lead engineer, Parker, who is going to talk about some of the concrete stuff, the non-chalkboard stuff, the software stuff that actually makes this work.

Thompson: Yeah, so I want to talk about the process of actually doing it: building the tooling that's required to collect these observables. Effectively, how do you go mining for indicator minerals? But first, the progression of where we are and where we're going. We initially broke this out into three major tracks of our technology. We have our static analysis engine, which started as a prototype; we have now recently completed a much more mature and solid engine that allows us to be much more extensible, dig deeper into programs, and provide much deeper observables. Then we have the data collection and data reporting. Tim showed some of our early stabs at this. We're right now in the process of building new engines to make the data more accessible and easier to work with, and hopefully more of that will be available soon. Finally, we have our fuzzer track. We needed to get some early data, so we played with some existing off-the-shelf fuzzers, including AFL, and, while that was fun, unfortunately it's a lot of work to manually instrument a lot of fuzzers for hundreds of binaries. So we then built an automated solution that started to get us closer to having a fuzzing harness that could autogenerate itself depending on the software and the software's behavior. But, right now, unfortunately, that technology showed us more deficiencies than successes. So we are now working on a much more mature fuzzer that will allow us to dig deeper into programs as we're running them and collect very specific things that we need for our model and our analysis. But on to our analytic pipeline today. This is one of the most concrete components of our engine and one of the most fun!
We effectively wanted some type of software hopper, where you could just pour programs in - installers - and then, on the other end, out come reports: fully annotated and actionable information that we can present to people. So we went about the process of building a large-scale engine. It starts off with a simple REST API, where we can push software in, which then gets moved over to our computation cluster, which effectively provides us a fabric to work with. It is made up of a lot of different software suites, starting off with our data processing, which is done in Apache Spark, then moves over into data handling and data analysis in Spark; we have a common HDFS layer to provide a place for the data to be stored, with YARN as the resource manager. All of that is backed by our compute and data nodes, which scale out linearly. That then moves into our data science engine, which is effectively Spark with Apache Zeppelin, which provides us a really fun interface where we can work with the data in an interactive manner while kicking off large-scale jobs into the cluster. And finally, this goes into our report generation engine. What this bought us was the ability to scale linearly and make that hopper bigger and bigger as we need, but also a way to process data that doesn't fit in a single machine's RAM. You can push the instance sizes as large as you want, but we have datasets that blow away any single host's RAM. So this allows us to work with really large collections of observables. I want to dive down now into our actual static analysis. But first we have to explore the problem space, because it's a nasty one. Effectively, CITL's mission is to process as much software as possible - hopefully all of it, but it's hard to get your hands on all the binaries that are out there. When you start to look at that problem, you understand there are a lot of combinations: there are a lot of CPU architectures, a lot of operating systems, a lot of file formats, a lot of environments the software gets deployed into, and every single one of them has its own app armoring features. And a feature can be available for one combination but not another, and you don't want to penalize a developer for not turning on a feature they never had access to turn on. So effectively we need to solve this in a much more generic way. And so what we did is: our static analysis engine effectively looks like a gigantic collection of abstraction libraries for handling binary programs. You take in some type of input file, be it ELF, PE, or Mach-O, and then the pipeline splits. It goes off into two major analyzer classes. Our format analyzers look at the software much like a linker or loader would look at it: we want to understand how it's going to be loaded up and what type of armoring features are going to be applied, and then we can run analyzers over that. In order to achieve that, we need abstraction libraries that can provide us an abstract memory map, a symbol resolver, generic section properties. So all that feeds in, and then we run a collection of analyzers over it to collect data and observables. Next we have our code analyzers; these are the analyzers that run over the code itself. We need to be able to look at every possible executable path.
In order to do that, we need to do function discovery, feed that into a control flow recovery engine, and then, as a post-processing step, dig through all of the possible metadata in the software - such as a switch table or something like that - to get even deeper into the software. This provides us a basic list of basic blocks, functions, and instruction ranges, and does so in an efficient manner, so we can process a lot of software as it goes. Then all of that gets fed over into the main modular analyzers. Finally, all of this comes together, gets put into a gigantic blob of observables, and is fed up to the pipeline. We really want to thank the Ford Foundation for supporting our work on this, because the pipeline and the static analysis have been a massive boon for our project, and we're only now beginning to really get our engine running, and we're having a great time with it. So, digging into the observables themselves: what are we looking at? Let's break them apart. The format structure components are things like ASLR, DEP, RELRO: basic app armoring that is going to be enabled at the OS layer when the software gets loaded up or linked. We also collect other metadata about the program, such as: what libraries are linked in? What does its dependency tree look like, completely? How did those libraries score? Because that can affect your main software. An interesting example on Linux: if you link a library that requires an executable stack, guess what, your software now has an executable stack, even if you didn't mark it that way. So we need to be able to understand what ecosystem the software is going to live in. The code structure analyzers look at things like functionality: what is the software doing, and what type of armoring is getting injected into the code? A great example of that is something like stack guards or FORTIFY_SOURCE. These are features that can only really be observed inside the control flow, or inside the actual instructions themselves. This is why control flow graphs are key. We played around with a number of different ways of analyzing software that we could scale out, and ultimately we had to come down to working with control flow graphs. Provided here is a basic visualization of what I'm talking about with a control flow graph, provided by Binary Ninja, which has wonderful visualization tools - hence this screenshot, and not our engine, because we don't build very many visualization tools. You basically have a function that's broken up into basic blocks, which are broken up into instructions, and then you have flow between them. Having this as an iterable structure that we can work with allows us to walk over it, walk every single instruction, understand the references - where code and data are being referenced, and how they are being referenced - and then what type of functionality is being used. So this is a great way to find things like whether or not your stack guards are being applied on every function that needs them, how deeply they are being applied, and whether the compiler is possibly introducing errors into your armoring features - which makes for interesting side studies. Another reason we did this is because we want to push the concept of what counts as an observable even farther. Take this example: you want to be able to work with instruction abstractions. For all major architectures, you can break instructions up into major categories.
Be it arithmetic instructions, data manipulation instructions like loads and stores, and then control flow instructions. With these basic fundamental building blocks you can make artifacts. Think of an artifact as a unit of functionality: it has some type of input, some type of output, and it performs some type of operation. And with these little units of functionality, you can link them together; think of these artifacts as maybe sub-basic-block, or crossing a few basic blocks, but a different way to break up the software. Because a basic block is just a branch break, and we want to look at functionality breaks, because these artifacts can provide the basic fundamental building blocks of the software itself. This becomes more important when we want to start doing symbolic lifting, so that we can lift the entire piece of software up into a generic representation that we can slice and dice as needed. Moving on from there, I want to talk about fuzzing a little bit more. Fuzzing is effectively at the heart of our project. It provides the rich dataset that we can use to derive a model. It also provides awesome other metadata on the side. But why? Why do we care about fuzzing? Why is fuzzing the metric that you build an engine around, that you build a model around, that you derive some type of reasoning from? So think of the sets of bugs, vulnerabilities, and exploitable vulnerabilities. In an ideal world you'd want a machine that just pulls out exploitable vulnerabilities. Unfortunately, that is exceedingly costly, because of a series of decision problems that sit between these sets. So now consider the superset: bugs, or faults. A fuzzer can easily recognize faults - other software can easily recognize faults too - but if you want to move down the sets, you unfortunately need to jump through a lot of decision hoops. For example, if you want to move to a vulnerability, you have to understand: does the attacker have some type of control? Is there a trust boundary being crossed? Is this software configured in the right way for this to be vulnerable right now? So there are human factors that are not deducible from the outside. You then amplify this decision problem even further going to exploitable vulnerabilities. So if we collect the superset of bugs, we know that some proportion of those subsets is in there. And this gives us a dataset that is easily recognizable and that we can collect in a cost-efficient manner. Finally, fuzzing is key, and we're investing a lot of our time right now in working on a new fuzzing engine, because there are some key things we want to do. We want to be able to understand all of the different paths the software could be taking; as you're fuzzing, you're effectively driving the software down as many unique paths, while touching as many unique data manipulations, as possible. So if we save off every path and annotate the ones that are faulting, we now have this beautiful, rich data set of exactly where the software went as we were driving it in specific ways. Then we feed that back into our static analysis engine and begin to generate those instruction abstractions, those artifacts, out of the traces. And with that, imagine we have these gigantic traces of instruction abstractions. From there we can begin to train a model to explore around the fault location, and begin to understand, and try to study, the fundamental building blocks of what a bug looks like in an abstract, instruction-agnostic way. This is why we're spending a lot of time on our fuzzing engine right now.
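As a toy illustration of those instruction-category buckets - this is not CITL's engine, and it only separates control flow from everything else, using Capstone's instruction groups; the example bytes are a trivial x86-64 snippet:

```python
# Toy sketch: bucket disassembled instructions into coarse categories using Capstone
# semantic groups. A real engine would further split arithmetic vs. data movement.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET

CONTROL_FLOW_GROUPS = {CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET}

def categorize(code: bytes, base_addr: int = 0x1000):
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True                      # required so insn.groups is populated
    counts = {"control_flow": 0, "data_or_arithmetic": 0}
    for insn in md.disasm(code, base_addr):
        if CONTROL_FLOW_GROUPS & set(insn.groups):
            counts["control_flow"] += 1
        else:
            counts["data_or_arithmetic"] += 1
    return counts

# Example bytes: mov eax, 1; add eax, 2; ret
print(categorize(b"\xb8\x01\x00\x00\x00\x83\xc0\x02\xc3"))
```

In the pipeline Parker describes, the fuzzing traces are what get mapped onto abstractions like these.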
But hopefully soon we'll be able to talk about that more, maybe in a tech track and not the policy track.

C: Yeah, "so from then on, when anything went wrong with the computer, we said it had bugs in it."

laughs

All right, I promised you a technical journey, I promised you a technical journey into the dark abyss, as deep as you want to get with it. So let's wrap it up and bring it back up a little bit here. We've talked a great deal today about some theory. We've talked about developments in our tooling and everything else, and so I figured I should end with some things that are not in progress, but in fact are done - yesterday's news - just to share that here with Europe. So, in the midst of all of our development, we have been discovering and reporting bugs. Again, this is not our primary purpose, really, but you can't help but do it. You know how computers are these days: you find bugs just from turning them on, right? So we've been disclosing. A little while ago, at DEF CON and Black Hat, our chief scientist Sarah, together with Mudge, dropped this bombshell on the Firefox team, which is that for some period of time they had ASLR disabled on OS X. When we first found it, we assumed it was a bug in our tools. When we first mentioned it in a talk, they came to us and said it's definitely a bug in our tools - or might be - there was some level of surprise. And then people started looking into it, and in fact at one point it had been enabled and then temporarily disabled. No one knew; everyone thought it was on. It takes someone looking to notice that kind of stuff, right? Major shout-out, though: they fixed it immediately, despite our full disclosure on stage and everything. So, very impressed. But in addition to popping surprises on people, we've also been doing the usual process of submitting patches and bugs, particularly to LLVM and QEMU, and if you work in software analysis you can probably guess why. Incidentally, if you're looking for a target to fuzz, if you want to go home from CCC and find a ton of findings: LLVM comes with a bunch of parsers. You should fuzz them, you should fuzz them. And I say that because I know for a fact you are going to get a bunch of findings, and it'd be really nice - I would appreciate it if I didn't have to pay people to fix them. So if you wouldn't mind disclosing them, that would help. But besides these bug reports and all these other things, we've also been working with lots of others. You know, Sarah gave a talk earlier this summer about these things, and she presented findings comparing the base scores of different Linux distributions. And based on those findings, there was a person on the Fedora red team, Jason Calloway, who sat there - and well, I can't read his mind, but I'm sure he was thinking to himself: golly, it would be nice to not, you know, be surprised at the next one of these talks. They score very well, by the way; they were leading in many, many of our metrics. Well, in any case, he left Vegas and went back home, and he and his colleagues have been working on essentially re-implementing much of our tooling so that they can check the stuff that we check before they release. Before they release. Looking for security before you release. So that would be a good thing for others to do, and I'm hoping that idea really catches on. laughs Yeah, yeah, right, that would be nice. That would be nice.
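In the spirit of that "check before you release" idea, here is a minimal sketch of a release gate - my own illustration, not the Fedora team's or CITL's tooling - that fails a build if any ELF binary in the output tree lacks PIE or a GNU_RELRO segment, using binutils' readelf:

```python
# Minimal pre-release gate (illustrative only): refuse to ship if any ELF binary in
# the build output lacks PIE or a GNU_RELRO segment. Relies on binutils' readelf.
import os, subprocess, sys

def readelf(args, path):
    return subprocess.run(["readelf"] + args + [path],
                          capture_output=True, text=True).stdout

def is_elf(path):
    try:
        with open(path, "rb") as f:
            return f.read(4) == b"\x7fELF"
    except OSError:
        return False

failures = []
for root, _, files in os.walk(sys.argv[1]):              # argv[1]: build output directory
    for name in files:
        path = os.path.join(root, name)
        if not is_elf(path):
            continue
        pie = "DYN" in readelf(["-h"], path)              # crude: ET_DYN implies PIE-capable
        relro = "GNU_RELRO" in readelf(["-l"], path)
        if not (pie and relro):
            failures.append((path, pie, relro))

for path, pie, relro in failures:
    print(f"FAIL {path}: PIE={pie} RELRO={relro}")
sys.exit(1 if failures else 0)
```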
But in addition to that, in addition to that, our mission really is to get results out to the public, and so, in order to achieve that, we have broad partnerships with Consumer Reports and The Digital Standard. Especially if you're into cyber policy, I really encourage you to take a look at the proposed Digital Standard, which encompasses the things we look for and so much more: URLs, data traffic in motion, cryptography, update mechanisms, and all that good stuff. So, where we are and where we're going - the big takeaways, if you're looking for the "so what". Three points for you. One: we are building the tooling necessary to do larger and larger studies regarding these surrogate security scores. My hope is that, at some point in the not-too-distant future, I will be able, with my colleagues, to publish some really nice findings about which things you can observe in software have a suspiciously high correlation with the software being good. Right now, nobody really knows. It's an empirical question. As far as I know, the study hasn't been done. We've been running it on a small scale; we're building the tooling to do it on a much larger scale. We are hoping that this winds up being a useful field in security as that technology develops. In the meantime, our static analyzers are already making surprising discoveries: hit YouTube and take a look for Sarah Zatko's recent talks at DEF CON and Black Hat. Lots of fun findings in there. Lots of things that anyone who looked would have found. Lots of that. And then lastly, if you are in the business of shipping software, and you are thinking to yourself: okay, so these guys, someone gave them some money to mess up my day - and you're wondering what you can do to not have your day messed up. One simple piece of advice, one simple piece of advice: make sure your software employs every exploit mitigation technique Mudge has ever heard of or will ever hear of. And he's heard of a lot of them. Turn all those things on. And if you don't know anything about that stuff, if nobody on your team knows anything about that stuff - I don't even know why I'm saying this; if you're here, you know about that stuff - so do that. And if you're not here, then you should be here. Thank you, thank you.

Herald Angel: Thank you, Tim and Parker. Do we have any questions from the audience? It's really hard to see you with that bright light in my face. I think the signal angel has a question.

Signal Angel: So the IRC channel was impressed by your tools and the models that you wrote, and they are wondering what's going to happen to them, because you do have funding from the Ford Foundation now. So what are your plans with this? Do you plan on commercializing this, or is it going to be open source, or how do we get our hands on this?

C: It's an excellent question. So, for the time being, the money that we are receiving is to develop the tooling, pay for the AWS instances, pay for the engineers, and all that stuff. As for the direction we would like to take things as an organization: I have no interest in running a monopoly. That sounds like a fantastic amount of work, and I really don't want to do it. However, I have a great deal of interest in taking the gains that we are making in the technology and releasing the data, so that other competent researchers can go through and find useful things that we may not have noticed ourselves.
So we're not at a point where we are releasing data in bulk just yet, but that is simply a matter of engineering; our tools are still in flux. When we do that, we want to make sure the data is correct, and so our software has to have its own low bug counts and all these other things. But that is ultimately the scientific aspect of our mission - though the science is not our primary mission; our primary mission is to apply it to help consumers. At the same time, it is our belief that an opaque model is as good as crap. No one should trust an opaque model: if somebody tells you that they have some statistics, and they do not provide you with any underlying data, and it is not reproducible, you should ignore them. Consequently, what we are working towards right now is getting to a point where we will be able to share all of those findings: the surrogate scores, the interesting correlations between observables and fuzzing. All of that will be public as the material comes online.

Signal Angel: Thank you.

C: Thank you.

Herald Angel: Thank you. And microphone number three, please.

Mic3: Hi, thanks, some really interesting work you presented here. So there's something I'm not sure I understand about the approach that you're taking. If you are evaluating the security of, say, a library function or the implementation of a network protocol, for example, there would be a precise specification you could check against, and the techniques you're using would make sense to me. But it's not so clear, since the goal you've set for yourself is to evaluate the security of consumer software. It's not clear to me whether it's fair to call these results security scores in the absence of a threat model. So my question is: how is it meaningful to claim that a piece of software is secure if you don't have a threat model for it?

C: This is an excellent question, and anyone who disagrees is in the wrong. Security without a threat model is not security at all; that's absolutely a true point. So, the things that we are looking for: most of them are things that you will already find present in your threat model. For example, we were reporting on the presence of things like ASLR and lots of other things that get to the heart of the exploitability of a piece of software. So, for example, if we are reviewing a piece of software that has no attack surface, then it is canonically not in the threat model, and in that sense it makes no sense to report on its overall security. On the other hand, if we're talking about software like, say, a word processor, a browser, anything on your phone, anything that talks on the network - if we're talking about those kinds of applications, then I would argue that exploit mitigations and the other things we are measuring are almost certainly very relevant. So there's a sense in which what we are measuring is the lowest common denominator among what we imagine are the dominant threat models for these applications. A hand-wavy answer, but I promised heuristics, so there you go.

Mic3: Thanks.

C: Thank you.

Herald Angel: Any questions? No raised hands, okay. Then the herald can ask a question, because I never can. So the question is: you mentioned these security labels earlier - what institution could give out such security labels? Because obviously the vendor has no interest in IT security.

C: Yes, it's a very good question. So, our partnership with Consumer Reports.
I don't know if you're familiar with them, but in the United States, Consumer Reports is a huge consumer watchdog organization. They test the safety of automobiles, they test lots of consumer appliances, all kinds of things, both to see if they function more or less as advertised, but most importantly they're checking for quality, reliability, and safety. So our partnership with Consumer Reports is all about us doing our work and then publishing it. And so, for example, the televisions that we presented the data on: all of that was collected and published in partnership with Consumer Reports.

Herald: Thank you.

C: Thank you.

Herald: Any other questions from the stream? I hear a no. Well, in this case, people: thank Tim and Parker for their nice talk and please give them a very, very warm round of applause.

applause

C: Thank you.

T: Thank you.

subtitles created by c3subtitles.de in the year 2017. Join, and help us!