34c3 intro Herald: The next talk will be about embedded systems security and Pascal, the speaker, will explain how you can hijack debug components for embedded security in ARM processors. Pascal is not only an embedded software security engineer but also a researcher in his spare time. Please give a very very warm welcoming good morning applause to Pascal. applause Pascal: OK, thanks for the introduction. As it was said, I'm an engineer by day in a French company where I work as an embedded system security engineer. But this talk is mainly about my spare-time activity, which is being a researcher, hacker or whatever you want to call it. This is because I work with a PhD student called Muhammad Abdul Wahab. He's a third-year PhD student in a French lab. So, this talk will mainly be a presentation of his work on embedded systems security and especially the debug components available in ARM processors. Don't worry about the link. At the end, there will also be a link with all the slides, documentation and everything. Before the congress, I didn't know what kind of background you would have for my talk. So, I put some references there to talks where you will find all the vocabulary needed to understand at least some parts of my talk. About computer architecture and embedded system security, I hope you attended the talk by Alastair about the formal verification of software and also the talk by Keegan about Trusted Execution Environments (TEEs such as TrustZone). In this talk, I will also talk about FPGA stuff. About FPGAs, there was a talk on day 2 about FPGA reverse engineering. And, if you don't know about FPGAs, I hope that you had some time to go to the OpenFPGA assembly because these guys are doing a great job on open-source FPGA tools. When you see this slide, the first question is why I put "TrustZone is not enough". Just a quick reminder about what TrustZone is.
TrustZone is about separating a system between a non-secure world, in red, and a secure world, in green. When we want to use the TrustZone framework, we have lots of hardware components and lots of software components allowing us to, let's say, run a secure OS and a non-secure OS separately. In our case, what we wanted to do is use the debug components (you can see them on the left side of the picture) to see if we can do some security with them. Furthermore, we wanted to use something other than TrustZone because, if you attended the talk about the security of the Nintendo Switch, you saw that the TrustZone framework can be bypassed in specific cases. This talk is also quite complementary because we will do something at a lower level, at the processor architecture level. I will talk in a later part of my talk about what we can do between TrustZone and the approach developed in this work. So, basically, the presentation will be a quick introduction; then I will talk about some works aiming to use debug components for security; then I will talk about ARMHEx, which is the name of the system we developed to use the debug components of a hardcore processor; and, finally, some results and a conclusion. In the context of our project, we are working with System-on-Chips. System-on-Chips are a kind of device where, in the green part, we have a processor. It can be a single-core, dual-core or even quad-core processor. Another interesting part, in yellow in the image, is the programmable logic, which is also called an FPGA in this case. In this kind of System-on-Chip, you have the hardcore processor, the FPGA and some links between those two units. You can see here, in the red rectangle, one of the two processors. This picture is an image of a System-on-Chip called Zynq, provided by Xilinx, which is also an FPGA provider. In this kind of chip, we usually have 2 Cortex-A9 processors and some FPGA logic to work with.
What we want to do with the debug components is to work on Dynamic Information Flow Tracking. Basically, what is information flow? Information flow is the transfer of information from an information container C1 to C2 given a process P. In other words, if we take the simple code over there with 4 variables (for instance, a, b, w and x), the idea is that if you have some metadata in a, the metadata will be transmitted to w. So what kind of information will we transmit through the code? Basically, the information I'm talking about in the first block is "OK, this data is private, this data is public", and we should not mix public and private data together. The information can be binary, "public or private", but of course we can also have several levels of information. In the following parts, this information will be called taint or tags and, to keep things simple, we will use colors to say "OK, my tag is red or green", just to say whether it's private or public data. As I said, if the tag contained in a is red, the data contained in w will be red as well. Same thing for b and x. Now a quick example with a buffer overflow. In the upper part of the slide you have the assembly code and, in the lower part, the green columns are the colors of the tags. On the right side of these columns you have the status of the different registers. This code basically says: when my input is red at the beginning, the tainted input goes into the index variable. Register r2, which contains the idx variable, will be red as well. Then, when we access buffer[idx], which is the second line in the C code at the beginning, the information we have there will be red as well. And, of course, the result of the operation, which is x, will be red as well.
Basically, that means that if there is a tainted input at the beginning, we must be able to transmit this information up to the return address of this code, just to say "OK, if this tainted input is private, the return address at the end of the code should be private as well". What can we do with that? There is a simple code over there. It says: if you are a normal user, you just open the welcome file; otherwise, if you are the root user, you open the password file. The point is that if we open the welcome file, this is public information: you can do whatever you want with it. Otherwise, for the root user, the password file may contain, for instance, a cryptographic key, and we should not reach the printf function at the end of this code. The idea is to check whether the fs variable containing the data of the file is private or public. There are mainly three steps for that. First of all, the compilation gives us the assembly code. Then, we must modify system calls to send the tags. The tags are, as I said before, the private or public information about my fs variable. I will talk a bit about that later: maybe, in future works, the idea is to build, or at least compile, an Operating System with integrated support for DIFT. There were already some works about Dynamic Information Flow Tracking. This kind of information flow tracking can be done at several levels. The first one is the application level, working at the Java or Android level. Some works also propose solutions at the OS level: for instance, KBlare. But what we wanted to do here is to work at a lower level, so not at the application or OS level but at the hardware level or, at least, at the processor architecture level.
If you want more information about the OS-level implementations of information flow tracking, you can go to blare-ids.org where you have implementations of an Android port and a Java port of intrusion detection systems. In the rest of my talk, I will just go through the existing works and see what we can do about that. When we talk about dynamic information flow tracking at a low level, there are mainly three approaches. The first one is on the left side of this slide. In the upper part of this figure, we have the normal processor pipeline: basically, decode stage, register file and Arithmetic & Logic Unit. The basic idea is that when we want to process tags or taints, we just duplicate the processor pipeline (the grey pipeline under the normal one) to process them. This implies two things. First of all, we must have the source code of the processor itself in order to duplicate the pipeline and build the DIFT pipeline. This is quite inconvenient because getting the source code of the processor is not really easy sometimes. On the other hand, the main advantage of this approach is that we can do nearly anything we want because we have access to everything: we can pull all the wires we need from the processor to get the information we need. The second approach (right side of the picture) is a bit different: instead of having a single processor doing the normal application flow plus the information flow tracking, we separate the normal execution and the information flow tracking. This approach is not satisfying either because you will have one core running the normal application while core #2 is only able to do the DIFT checks. Basically, it's a shame to use a whole processor just for DIFT checks. The best compromise is to have a dedicated coprocessor that only does the information flow tracking processing.
Basically, the most interesting design in this topic has a main core running the normal application and a dedicated coprocessor doing the IFT checks, with some communications between those two cores. Now, a quick comparison between different works. If you run the dynamic information flow control in pure software (I will talk about that in the next slide), it is really painful in terms of time overhead: you will see that the time to do information flow tracking in pure software is really unacceptable. Regarding hardware-assisted approaches, the main advantage in all cases is the low overhead in terms of silicon area: it means that, on this slide, the overhead between the main core alone and the main core plus the coprocessor is not so important. We will see that, in the case of my talk, a dedicated DIFT coprocessor also makes it easier to support different security policies. As I said, the pure software solution (the first line of this table) basically relies on instrumentation. As was said on day 2, instrumentation is the transformation of a program into its own measurement tool. It means that we put some sensors in all parts of the code just to monitor its activity and gather some information from it. If we measure the impact of instrumentation on the execution time of an application, you can see in this diagram the normal execution time, which is normalized to 1. When we add instrumentation, the minimal overhead we get is about 75%, and most of the time the execution with instrumentation will be more than twice as long as the normal execution. This is completely unacceptable because it just makes your application slower. Basically, the main concern of my talk is reducing the overhead of software instrumentation.
I will also talk a bit about the security of the DIFT coprocessor because we can't include a DIFT coprocessor without taking care of its security. To my knowledge, this is the first work about DIFT in ARM-based system-on-chips. In the talk about the security of the Nintendo Switch, the speaker said that black-box testing is fun ... except that it isn't. In our case, we only have a black box because we can't modify the structure of the processor; we must do our job without, let's say, decapping the processor and so on. This is an overall schematic of our architecture. On the left side, in light green, you have the ARM processor. In this case, this is a simplified version with only one core. And, on the right side, you have the structure of the coprocessor we implemented in the FPGA. You can notice two things. The first is that you have some links between the FPGA and the CPU. These links already exist in the system-on-chip. And you can see another thing regarding the memory: you have separate memory for the processor and the FPGA. We will see later that we can use TrustZone to add a layer of security, just to be sure that we won't mix the memory between the CPU and the FPGA. Basically, when we want to work with ARM processors, we must read ARM datasheets. First of all, don't be afraid of the length of ARM datasheets: in my case, I used to work with the ARMv7 technical manual, which is already 2000 pages; the ARMv8 manual is about 6000 pages. Anyway. What is also difficult is that the information is split between different documents. When we want to use the debug components in the ARM case, we have this register over there, which is called DBGOSLAR. The manual says that writing the key value 0xC5A-blabla to this field locks the debug registers, and if you write any other value, it will just unlock those debug registers.
So that was basically the first step to enable the debug components: just writing a random value to this register to unlock my debug components. Here is again a schematic of the overall system-on-chip. As you see, you have the two processors and, on the top part, you have what are called CoreSight components. These are the famous debug components I will talk about in the second part of my talk. Here is a simplified view of the debug components we have in Zynq SoCs. On the left side, we have the two processors (CPU0 and CPU1), and the CoreSight components are: the PTM, the one in the red rectangle; the ECT, which is the Embedded Cross Trigger; and the ITM, which is the Instrumentation Trace Macrocell. Basically, when we want to extract some data from the CoreSight components, the basic path is to use the PTM, go through the Funnel and, at this step, we have two choices to store the information taken from the debug components. The first one is the Embedded Trace Buffer, which is a small memory embedded in the processor. Unfortunately, this memory is really small: it's only about 4 KBytes as far as I remember. The other possibility is to export the data to the Trace Packet Output, and this is what we will use to export data to the coprocessor implemented in the FPGA. Basically, what is the PTM able to do? The first thing the PTM can do is trace whatever is in your memory. For instance, you can trace all your code: basically, all the blue sections. But you can also trace only specific regions of the code: you can say, OK, I just want to trace the code in my section 1 or section 2 or section N. Then the PTM is also able to do Branch Broadcasting. Branch Broadcasting support was not present in the Linux kernel, so we submitted a patch, which was accepted, to manage Branch Broadcasting in the PTM. And we can do some timestamping and other things to be able to store the information in the traces.
Basically, what does a trace look like? Here is the simplest code we could have: it's just a for loop doing nothing. The assembly code is over there. And the trace will look like this. In the first 5 bytes, we have a kind of start packet, called the A-sync packet, which says "OK, this is the beginning of the trace". In the green part, we have the address which corresponds to the beginning of the loop. And, in the orange part, we have the Branch Address Packet. You can see that there are 10 iterations of this Branch Address Packet because we have 10 iterations of the for loop. This is just to show the general structure of a trace. Here is the corresponding control flow graph. Of course, if we had another loop at the end of this control flow graph, the trace would just get a bit longer to carry the information about the second loop and so on. Once we have all these traces, the next step is: I have my tags, but how do I define the rules to propagate them? This is where we use static analysis. Basically, in this example, we have the instruction "add register1 + register2 and put the result in register0". Static analysis allows us to say that the tag associated with register0 will be the tag of register1 or the tag of register2. The static analysis is done before running my code, so that I have the rules for all the lines of my code; it is performed in the LLVM backend. Now that we have the trace and we know how to propagate the tags all over the code, the final step is about instrumentation. As I said before, we can recover all the memory addresses we need through instrumentation; alternatively, we can get only the register-relative memory addresses through instrumentation.
In the first case, on this simple code, we can instrument all the code, but the main drawback of this solution is that it makes the execution time completely excessive. Instead, what we can do with the store instruction over there is get data from the trace: basically, we use the Program Counter from the trace. Then, for the Stack Pointer, we use static analysis to get the offset relative to the Stack Pointer. And, finally, we need only one instrumented instruction at the end. If I go back to this system, the communication overhead is the main drawback, as I said before: with the processor and the FPGA running in different parts, the main problem is how to transmit data in real time or, at least, at the highest speed we can between the processor and the FPGA. This is the time overhead with the CoreSight components enabled or disabled. In blue, we have the baseline execution time with traces disabled. We can see that, when we enable the traces, the time overhead is nearly negligible. Regarding instrumentation time, we can see that with strategy 2, which uses the CoreSight components plus static analysis and instrumentation, we can lower the instrumentation overhead from 53% down to 5%. We still have some overhead due to instrumentation, but it's really low compared to the related works where all the code was instrumented. This overview shows, in the grey lines, the overhead of related works with full instrumentation; you can see that, with our approach (the green lines over there), the time overhead of our code is much, much smaller. Basically, how can we use TrustZone with this? This is just an overview of our system. We can use TrustZone to separate the CPU from the FPGA coprocessor.
If we make a comparison with related works, we can see that, compared to the first works, we are able to do information flow control with a hardcore processor, which was not the case with the first two works in this table. It means you can use a basic ARM processor to do the information flow tracking instead of having a specific processor. And, of course, the area overhead, which is another important topic, is much, much smaller compared to the existing works. It's time for the conclusion. As I presented in this talk, we are able to use the PTM component to obtain runtime information about the application. This is non-intrusive tracing, with negligible performance overhead. And we also improved software security because we took care of the security of the coprocessor itself. The future perspective of this work is mainly to work with multicore processors and to see if we can use the same approach for Intel and maybe ST microcontrollers, to see if we can also do information flow tracking in those cases. That was my talk. Thanks for listening. applause Herald: Thank you very much for this talk. Unfortunately, we don't have time for Q&A, so please, if you leave the room, take your trash with you; that makes the angels happy. Pascal: I was a bit long, sorry. Herald: Another round of applause for Pascal. applause 34c3 outro subtitles created by c3subtitles.de in the year 2020. Join, and help us!