34c3 intro

Herald: The next talk will be about embedded systems security, and Pascal, the speaker, will explain how you can hijack debug components for embedded security in ARM processors. Pascal is not only an embedded software security engineer but also a researcher in his spare time. Please give a very, very warm welcoming good-morning applause to Pascal.

applause
Pascal: OK, thanks for the introduction. As was said, I'm an engineer by day at a French company, where I work as an embedded systems security engineer. But this talk is mainly about my spare-time activity, which is researcher, hacker, or whatever you want to call it. This is because I work with a PhD student called Muhammad Abdul Wahab. He's a third-year PhD student in a French lab, so this talk will mainly be a presentation of his work on embedded systems security and especially the debug components available in ARM processors. Don't worry about the link: at the end, there will be a link with all the slides, documentation and everything.

Before the congress, I didn't know what kind of background you would need for my talk, so I put some links there, I mean some references to talks where you will find all the vocabulary needed to understand at least some parts of my talk. Regarding computer architecture and embedded systems security, I hope you attended the talk by Alastair about formal verification of software and also the talk by Keegan about Trusted Execution Environments (TEEs such as TrustZone). In this talk, I will also talk about FPGA stuff. About FPGAs, there was a talk on day 2 about FPGA reverse engineering. And, if you don't know about FPGAs, I hope you had some time to go to the OpenFPGA assembly, because these guys are doing a great job on open-source FPGA tools.
When you see this slide, the first question is: why did I put "TrustZone is not enough"? Just a quick reminder about what TrustZone is. TrustZone is about separating a system into a non-secure world, in red, and a secure world, in green. When we want to use the TrustZone framework, we have lots of hardware components and lots of software components allowing us to, let's say, run a secure OS and a non-secure OS separately. In our case, what we wanted to do is use the debug components (you can see them on the left side of the picture) to see if we can do some security with them. Furthermore, we wanted to use something other than TrustZone because, if you attended the talk about the security of the Nintendo Switch, you saw that the TrustZone framework can be bypassed in specific cases. Furthermore, this talk is quite complementary, because we will do something at a lower level, at the processor architecture level. I will talk in a later part of my talk about what we can do between TrustZone and the approach developed in this work.

So, basically, the presentation will be a quick introduction; then I will talk about some works aiming to use debug components for security. Then, I will talk about ARMHEx, which is the name of the system we developed to use the debug components of a hardcore processor. And, finally, some results and a conclusion.
In the context of our project, we are working with Systems-on-Chip. Systems-on-Chip are a kind of device where we have, in the green part, a processor; it can be a single-core, dual-core or even quad-core processor. Another interesting part, which is in yellow in the image, is the programmable logic, which is also called an FPGA in this case. In this kind of System-on-Chip, you have the hardcore processor, the FPGA and some links between those two units. You can see here, in the red rectangle, one of the two processors. This picture shows a System-on-Chip called Zynq, provided by Xilinx, which is also an FPGA vendor. In this kind of chip, we usually have two Cortex-A9 processors and some FPGA logic to work with.
What we want to do with the debug components is to work on Dynamic Information Flow Tracking. Basically, what is information flow? Information flow is the transfer of information from an information container C1 to a container C2 given a process P. In other words, if we take the simple code over there, with four variables (for instance a, b, w and x), the idea is that if you have some metadata in a, the metadata will be transmitted to w. In other words, what kind of information will we transmit through the code? Basically, the information I'm talking about in the first block is "OK, this data is private, this data is public", and we should not mix data which are public and private together. Basically, the information can be binary ("public or private"), but of course we will also be able to have several levels of information. In the following parts, this information will be called taint or tags and, to keep things simple, we will use colors to say "OK, my tag is red or green", just to say whether it's private or public data. As I said, if the tag contained in a is red, the data contained in w will be red as well. Same thing for b and x.
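[Editor's note: the code on the slide is not visible in the transcript; the following is a minimal C sketch of the idea described above. The statements and the explicit tag variables are illustrative, not taken from the talk.]

    /* Each variable gets a companion tag that DIFT keeps in sync with the
       data flow. RED marks private data, GREEN marks public data. */
    #include <stdio.h>

    enum tag { GREEN = 0, RED = 1 };

    int main(void) {
        int a = 1, b = 2, w, x;
        enum tag tag_a = RED, tag_b = GREEN, tag_w, tag_x;

        w = a * 2;       /* data flows from a to w ...        */
        tag_w = tag_a;   /* ... so w inherits a's tag (red)   */
        x = b + 1;       /* data flows from b to x ...        */
        tag_x = tag_b;   /* ... so x inherits b's tag (green) */

        printf("w=%d (tag %d), x=%d (tag %d)\n", w, tag_w, x, tag_x);
        return 0;
    }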
If we look at a quick example over there: a buffer overflow. In the upper part of the slide you have the assembly code and, in the lower part, the green columns are the colors of the tags; on the right side of these columns you have the status of the different registers. This code basically says: my input is red at the beginning, and we just use this tainted input as the index variable. The register which contains the idx variable (register 2) will therefore be red as well. Then, when we want to access buffer[idx], which is the second line of the C code at the beginning, the information we read there will be red as well. And, of course, the result of the operation, which is x, will be red as well. Basically, that means that if there is a tainted input at the beginning, we must be able to propagate this information up to the return address of this code, just to say: OK, if this tainted input is private, the return address at the end of the code should be private as well.
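[Editor's note: a hedged C reconstruction of the kind of code being described; the names buffer, idx and x come from the talk, everything else is illustrative.]

    /* A tainted (externally controlled) index flows into a buffer access:
       if idx carries a red tag, then buffer[idx] and x become red too, and
       an out-of-bounds idx can reach the saved return address. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        char buffer[16] = "data";
        /* idx comes from outside the program, so it is a tainted input */
        int idx = (argc > 1) ? atoi(argv[1]) : 0;

        /* no bounds check: with a large idx this is a buffer overflow */
        char x = buffer[idx];   /* x inherits the taint of idx */

        printf("%d\n", x);
        return 0;
    }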
What can we do with that? There is a simple code over there. This code says: if you are a normal user, you just open the welcome file; otherwise, if you are the root user, you open the password file. That is to say: if we open the welcome file, this is public information and you can do whatever you want with it. Otherwise, if it's the root user, the password file may contain, for instance, a cryptographic key, and we should not reach the printf function at the end of this code.
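[Editor's note: again a hedged reconstruction of the code on the slide; the branching on the user, the welcome/password files, the fs variable and the final printf come from the talk, the rest is illustrative.]

    /* If the process runs as root, fs holds private data (the password
       file) which must not flow into the public printf at the end. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        FILE *fs;
        char line[128] = {0};

        if (getuid() != 0)
            fs = fopen("welcome", "r");    /* public data            */
        else
            fs = fopen("password", "r");   /* private data (tainted) */

        if (fs && fgets(line, sizeof line, fs))
            printf("%s", line);            /* leak if line is tainted */

        if (fs)
            fclose(fs);
        return 0;
    }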
The idea behind that is to check whether the fs variable containing the data of the file is private or public. There are mainly three steps for that. First of all, the compilation gives us the assembly code. Then, we must modify the system calls to send the tags; the tags are, as I said before, the private or public information about my fs variable. I will talk a bit more about that later: maybe, in future works, the idea is to build, or at least to compile, an operating system with integrated support for DIFT.
There have already been some works on Dynamic Information Flow Tracking. We can do this kind of information flow tracking in two manners. The first one is at the application level, working at the Java or Android level. Some works also propose solutions at the OS level: for instance, KBlare. But what we wanted to do here is to work at a lower level, so not at the application or OS level, but rather at the hardware level or, at least, at the processor architecture level. If you want some information about the OS-level implementations of information flow tracking, you can go to blare-ids.org, where you have implementations of an Android port and a Java port of intrusion detection systems. In the rest of my talk, I will just go through the existing works and see what we can do about that.
When we talk about dynamic information flow tracking at a low level, there are mainly three approaches. The first one is on the left side of this slide. In the upper part of the figure, we have the normal processor pipeline: basically, decode stage, register file and arithmetic & logic unit. The basic idea is that, when we want to process tags or taints, we just duplicate the processor pipeline (the grey pipeline under the normal one) to process the tags. This implies two things. First of all, we must have the source code of the processor itself, just to duplicate the processor pipeline and build the DIFT pipeline. This is quite inconvenient, because getting the source code of the processor is not really easy sometimes. On the other hand, the main advantage of this approach is that we can do nearly anything we want, because we have access to all the code: we can pull out all the wires we need from the processor to get the information we need.

The second approach (right side of the picture) is a bit different: instead of having a single processor doing the normal application flow plus the information flow tracking, we separate the normal execution and the information flow tracking. This approach is not satisfying either, because you will have one core running the normal application, while core #2 will only be able to do the DIFT controls. Basically, it's a shame to use a full processor just to do DIFT controls. The best compromise is to build a dedicated coprocessor that only does the information flow tracking processing. Basically, the most interesting work on this topic has a main core running the normal application and a dedicated coprocessor doing the IFT controls, with some communication between those two cores.
Let's make a quick comparison between the different works. If you run dynamic information flow control in pure software (I will talk about that in the next slide), it is really painful in terms of time overhead: you will see that the time needed to do information flow tracking in pure software is really unacceptable. Regarding hardware-assisted approaches, the main advantage in all cases is a low overhead in terms of silicon area: it means that, on this slide, the overhead between the main core alone and the main core plus the coprocessor is not so important. We will also see that, in the case of my talk, the dedicated DIFT coprocessor makes it easier to implement different security policies. As I said, in the pure software solution (the first line of this table), the basic idea is to use instrumentation. If you were there on day 2: instrumentation is the transformation of a program into its own measurement tool. It means that we put some sensors in all parts of the code, just to monitor its activity and gather some information from it.
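[Editor's note: a tiny, hypothetical illustration of what such software "sensors" can look like; the shadow tag map and the dift_copy_tag helper are made up for this sketch and are not from the talk or from a real tool.]

    /* A toy shadow-memory tag map plus the kind of extra call that
       software instrumentation inserts before a memory access. */
    #include <stdint.h>

    static uint8_t shadow[1 << 16];                /* one tag byte per address (toy map) */
    #define TAG_OF(p) shadow[(uintptr_t)(p) & 0xFFFF]

    static void dift_copy_tag(const void *dst, const void *src) {
        TAG_OF(dst) = TAG_OF(src);                 /* destination inherits source tag */
    }

    char buffer[16];
    int  idx;
    char x;

    void instrumented_access(void) {
        dift_copy_tag(&x, &buffer[idx]);           /* inserted by the instrumentation */
        x = buffer[idx];                           /* original program statement      */
    }

Adding one such call per load/store is what makes pure-software DIFT so slow.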
If we measure the impact of instrumentation on the execution time of an application, you can see in this diagram the normal execution time, which is normalized to 1. When we use instrumentation, the minimum overhead we get is about 75%, and most of the time the execution time with instrumentation is more than twice the normal execution time. This is completely unacceptable, because it just makes your application run slower. So, as I said, the main concern of my talk is reducing the overhead of software instrumentation. I will also talk a bit about the security of the DIFT coprocessor, because we can't include a DIFT coprocessor without taking care of its security. To my knowledge, this is the first work on DIFT in ARM-based Systems-on-Chip. In the talk about the security of the Nintendo Switch, the speaker said that black-box testing is fun... except that it isn't. In our case, we only have a black box, because we can't modify the structure of the processor; we must do our job without, let's say, decapping the processor and so on.
This is an overall schematic of our architecture. On the left side, in light green, you have the ARM processor; in this case, this is a simplified version with only one core. And, on the right side, you have the structure of the coprocessor we implemented in the FPGA. You can notice, for the moment, two things. The first is that you have some links between the FPGA and the CPU; these links already exist in the System-on-Chip. The other thing concerns the memory: you have separate memory for the processor and for the FPGA, and we will see later that we can use TrustZone to add a layer of security, just to be sure that we won't mix the memory between the CPU and the FPGA.

Basically, when we want to work with ARM processors, we must use ARM datasheets, we must read ARM datasheets. First of all, don't be afraid of the length of ARM datasheets: in my case, I used to work with the ARMv7 technical manual, which is already 2000 pages, and the ARMv8 manual is about 6000 pages. Anyway. What is also difficult is that the information is split between different documents.
Anyway, when we want to use the debug components in the ARM case, we have this register over there, which is called DBGOSLAR. We can see that, for this register, writing the key value 0xC5A... to this field locks the debug registers, and if you write any other value, it just unlocks those debug registers. So that was basically the first step to enable the debug components: just writing a random value to this register to unlock my debug components.
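[Editor's note: a hedged sketch of what that unlock step can look like from Linux user space, assuming the SoC exposes the debug registers through /dev/mem. DEBUG_BASE and OSLAR_OFFSET are placeholders to be taken from the SoC documentation; they are not values given in the talk.]

    /* Write a value other than the lock key to DBGOSLAR (OS Lock Access
       Register) so that the debug registers become accessible. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DEBUG_BASE   0x00000000UL  /* placeholder: debug register block base */
    #define OSLAR_OFFSET 0x000UL       /* placeholder: DBGOSLAR offset in block  */

    int main(void) {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *dbg = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, DEBUG_BASE);
        if (dbg == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        dbg[OSLAR_OFFSET / 4] = 0;     /* any value other than the key unlocks */

        munmap((void *)dbg, 0x1000);
        close(fd);
        return 0;
    }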
Here is again a schematic of the overall System-on-Chip. As you can see, you have the two processors and, in the top part, what are called Coresight components. These are the famous debug components I will talk about in the second part of my talk. Here is a simplified view of the debug components we have in Zynq SoCs. On the left side, we have the two processors (CPU0 and CPU1), and the Coresight components are: the PTM, the one in the red rectangle; the ECT, which is the Embedded Cross Trigger; and the ITM, which is the Instrumentation Trace Macrocell. Basically, when we want to extract some data from the Coresight components, the basic path is to use the PTM, go through the Funnel and, at this step, we have two choices to store the information taken from the debug components. The first one is the Embedded Trace Buffer, which is a small memory embedded in the processor; unfortunately, this memory is really small, only about 4 KBytes as far as I remember. The other possibility is to export the data through the Trace Packet Output, and this is what we will use to export the data to the coprocessor implemented in the FPGA.
Basically, what is the PTM able to do? The first thing the PTM can do is trace whatever is in your memory. For instance, you can trace all your code: basically, all the blue sections. But you can also, let's say, trace specific regions of the code: you can say, OK, I just want to trace the code in my section 1, or section 2, or section N. Then, the PTM is also able to do Branch Broadcasting. That is something that was not supported in the Linux kernel, so we submitted a patch, which was accepted, to manage Branch Broadcasting in the PTM. And we can do some timestamping and other things, just to be able to store additional information in the traces.
Basically, what does a trace look like? Here is the simplest code we could have: it's just a for loop doing nothing; the assembly code is over there, and the trace will look like this. In the first 5 bytes, there is a kind of start packet, called the A-sync packet, just to say "OK, this is the beginning of the trace". In the green part, we have the address which corresponds to the beginning of the loop. And, in the orange part, we have the Branch Address Packets. You can see that you have 10 occurrences of this Branch Address Packet, because we have 10 iterations of the for loop. This is just to show the general structure of a trace. Here is a control flow graph, just to show what we could get from this. Of course, if we have another loop at the end of this control flow graph, the trace just gets a bit longer, to carry the information about the second loop, and so on.
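[Editor's note: the traced program itself is not shown in the transcript; a minimal C loop matching the description above (an empty for loop with 10 iterations, hence 10 Branch Address Packets) could look like this.]

    /* Simplest possible traced program: an empty loop. Each taken branch at
       the end of an iteration shows up as one Branch Address Packet in the
       PTM trace, so 10 iterations give 10 such packets. */
    int main(void) {
        for (volatile int i = 0; i < 10; i++) {
            /* empty body; volatile keeps the compiler from removing the loop */
        }
        return 0;
    }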
Once we have all these traces, the next step is: I have my tags, but how do I define the rules to propagate them? This is where we use static analysis. Basically, in this example, if we have the instruction "add register1 and register2 and put the result in register0", static analysis allows us to say that the tag associated with register0 will be the tag of register1 OR the tag of register2. The static analysis is done before running my code, so that I have the rules for all the lines of my code.
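[Editor's note: a minimal sketch of such a propagation rule, assuming a one-bit tag per register; the data structures are illustrative, not the ones used in ARMHEx.]

    #include <stdint.h>

    #define NUM_REGS 16
    typedef uint8_t tag_t;         /* 0 = public (green), 1 = private (red) */
    tag_t reg_tag[NUM_REGS];

    /* Rule produced by static analysis for "add r0, r1, r2":
       the destination tag is the OR of the two source tags. */
    void rule_add_r0_r1_r2(void) {
        reg_tag[0] = reg_tag[1] | reg_tag[2];
    }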
Now that we have the trace and we know how to propagate the tags across the code, the static analysis is implemented in the LLVM backend, and the final step is instrumentation. As I said before, we could recover all the memory addresses we need through instrumentation; alternatively, we can get only the register-relative memory addresses through instrumentation. In the first case, on this simple code, we can instrument all the code, but the main drawback of this solution is that it completely blows up the execution time. What we can do instead, with the store instruction over there, is get data from the trace: basically, we use the Program Counter from the trace; then, for the Stack Pointer, we use static analysis to get the stack-pointer-relative information; and, finally, we only need one instrumented instruction at the end. If I go back to this system, the communication overhead is the main drawback, as I said before: since the processor and the FPGA run in different parts, the main problem is how we can transmit data in real time or, at least, at the highest speed we can between the processor and the FPGA.
This is the time overhead with the Coresight components enabled or not. In blue, we have the baseline execution time when traces are disabled, and we can see that, when we enable the traces, the time overhead is nearly negligible. Regarding instrumentation time, we can see that with strategy 2, which uses the Coresight components, the static analysis and the instrumentation, we can lower the instrumentation overhead from 53% down to 5%. We still have some overhead due to instrumentation, but it's really low compared to the related works where all the code was instrumented. This is an overview showing, in the grey lines, the overhead of related works with full instrumentation, and we can see that, with our approach (the green lines over there), the time overhead of our code is much, much smaller.
Basically, how can we use TrustZone with this? This is just an overview of our system: we can use TrustZone to separate the CPU from the FPGA coprocessor. If we make a comparison with related works, we can see that, compared to the first works, we are able to do information flow control with a hardcore processor, which was not the case for the first two works in this table. It means you can use a standard ARM processor to do the information flow tracking, instead of having a specific processor. And, of course, the area overhead, which is another important topic, is much, much smaller compared to the existing works.
It's time for the conclusion. As I presented in this talk, we are able to use the PTM component to obtain runtime information about an application. This is non-intrusive tracing, since we still have a negligible performance overhead. We also improved software security, because we were able to add some security on the coprocessor side. The future perspective of this work is mainly to work with multicore processors and to see if we can use the same approach for Intel and maybe ST microcontrollers, to see if we can also do information flow tracking in those cases. That was my talk. Thanks for listening.
applause

Herald: Thank you very much for this talk. Unfortunately, we don't have time for Q&A, so please, when you leave the room, take your trash with you; that makes the angels happy.

Pascal: I was a bit long, sorry.

Herald: Another round of applause for Pascal.

applause

34c3 outro

subtitles created by c3subtitles.de in the year 2020. Join, and help us!