34c3 intro
Herald: The next talk will be about
embedded systems security and Pascal, the
speaker, will explain how you can hijack
debug components for embedded security in
ARM processors. Pascal is not only an
embedded software security engineer but
also a researcher in his spare time.
Please give a very, very warm,
welcoming good morning applause to
Pascal.
applause
Pascal: OK, thanks for the introduction.
As it was said, I'm an engineer by day in
a French company where I work as an
embedded system security engineer. But
this talk is mainly about my spare-time
activity as a researcher, hacker or
whatever you want to call it. I work with
a PhD student called Muhammad Abdul Wahab.
He's a third-year PhD student in a French
lab. So, this talk will mainly be a
presentation of his work about
embedded systems security and especially
debug components available in ARM
processors. Don't worry about the link. At
the end, there will be also the link with
all the slides, documentations and
everything. So, before the congress, I
didn't know what kind of background you
would need for my talk. So, I put there
some links, I mean some references to
talks where you will find all the
vocabulary needed to understand at least
some parts of my talk.
About computer architecture and embedded
system security, I hope you attended
the talk by Alastair about the formal
verification of software and also the talk
by Keegan about Trusted Execution
Environments (TEEs such as TrustZone).
And, in this talk, I will also talk about
FPGA stuff. About FPGAs, there was a talk
on day 2 about FPGA reverse engineering.
And, if you don't know about FPGAs, I hope
that you had some time to go to the
OpenFPGA assembly because these guys are
doing a great job about FPGA open-source
tools. When you see this slide, the first
question is why I put "TrustZone is
not enough"? Just a quick reminder about
what TrustZone is. TrustZone is about
separating a system between a non-secure
world in red and a secure world in green.
When we want to use the TrustZone
framework, we have lots of hardware
components, lots of software components
allowing us to, let's say, run separately
a secure OS and a non-secure OS. In our
case, what we wanted to do is to use the
debug components (you can see it on the
left side of the picture) to see if we can
make some security with it. Furthermore,
we wanted to use something else than
TrustZone because if you have attended the
talk about the security in the Nintendo
Switch, you can see that the TrustZone
framework can be bypassed under specific
cases. Furthermore, this talk is
quite complementary because we will do
something at a lower level, at the
processor architecture level. I will talk
in a later part of my talk about what we
can do between TrustZone and the approach
developed in this work. So, basically, the
presentation will be a quick introduction.
I will talk about some works aiming to use
debug components to make some security.
Then, I will talk about ARMHEx which
is the name of the system we developed to
use the debug components in a hardcore
processor. And, finally, some results and
a conclusion. In the context of our
project, we are working with System-on-
Chips. So, System-on-Chips are a kind of
device where we have, in the green part, a
processor. So it can be a single core,
dual core or even quad core processor.
And another interesting part which is in
yellow in the image is the programmable
logic, which is also called an FPGA
in this case. And in this kind of
System-on-Chip, you have the hardcore processor,
the FPGA and some links between those two
units. You can see here, in the red
rectangle, one of the two processors. This
picture is an image of a System-on-Chip
called Zynq provided by Xilinx which is
also an FPGA provider. In this kind of
chip, we usually have 2 Cortex-A9
processors and some FPGA logic to work
with. What we want to do with the debug
components is to work on Dynamic
Information Flow Tracking. Basically, what
is information flow? Information flow is
the transfer of information from an
information container C1 to C2 given a
process P. In other words, if we take this
simple code over there: if you have 4
variables (for instance, a, b, w and x),
the idea is that if you have some metadata
in a, the metadata will be transmitted to
w. In other words, what kind of
information will we transmit into the
code? Basically, the information I'm
talking about in the first block is "OK, this
data is private, this data is public" and
we should not mix data which are public
and private together. Basically we can say
that the information can be binary
information which is "public or private"
but of course we'll be able to have
several levels of information. In the
following parts, this information will be
called taint or even tags and to be a bit
more simple we will use some colors to
say "OK, my tag is red or green" just to
say if it's private or public data. As I
said, if the tag contained in a is red,
the data contained in w will be red as
well. Same thing for b and x. If we have a
quick example over there, if we look at a
buffer overflow. In the upper part of the
slide you have the assembly code and on
the lower part, the green columns will be
the color of the tags. On the right side
of these columns you have the status of
the different registers. This code is
basically: OK, when my input is red at the
beginning, we just store the tainted input
into the index variable. The register 2
which contains the idx variable will be
red as well. Then, when we want to access
buffer[idx] which is the second line in
the C code at the beginning, the
information we have there will be red as
well. And, of course, the result of the
operation which is x will be red as well.
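To make this concrete, here is a toy Python sketch of the tag propagation just described. This is an illustration written for this transcript, not the talk's actual tooling; the register names and instruction list only mimic the buffer-overflow example, and the propagation rule (destination tag = union of the source tags) is the one static analysis produces later in the talk.

```python
# Toy DIFT model of the buffer-overflow example above: a tainted
# index propagates its tag ("color") through loads and arithmetic.
# Purely illustrative; not the talk's real assembly or hardware.

RED, GREEN = "red", "green"   # private / public tags

def propagate(instrs, tags):
    for op, dst, *srcs in instrs:
        # Propagation rule: the destination tag is the union ("or")
        # of the source tags.
        tags[dst] = RED if any(tags[s] == RED for s in srcs) else GREEN
    return tags

tags = {"input": RED, "r2": GREEN, "r3": GREEN, "x": GREEN}
trace = [
    ("mov", "r2", "input"),   # idx = tainted input
    ("ldr", "r3", "r2"),      # r3 = buffer[idx], tainted address
    ("add", "x",  "r3"),      # x becomes tainted as well
]
propagate(trace, tags)
print(tags["x"])   # -> red
```

The point of the sketch is only the rule itself: one red source is enough to turn the destination red.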
Basically, that means that if there is a
tainted input at the beginning, we must
be able to transmit this information until
the return address of this code just to
say "OK, if this tainted input is private,
the return address at the end of the code
should be private as well". What can we do
with that? There is a simple code over
there. This is a simple code saying: if you
are a normal user, you
just have to open the welcome file.
Otherwise, if you are a root user, you
must open the password file. So this is to
say if we want to open the welcome file,
this is a public information: you can do
whatever you want with it. Otherwise, if
it's a root user, maybe the password will
contain for instance a cryptographic key
and we should not go to the printf
function at the end of this code. The idea
behind that is to check that the fs
variable containing the data of the file
is private or public. There are mainly
three steps for that. First of all, the
compilation will give us the assembly
code. Then, we must modify system calls to
send the tags. The tags will be, as I said
before, the private or public information
about my fs variable. I will talk a bit
about that later: maybe, in future works,
the idea is to make or at least to compile
an Operating System with integrated
support for DIFT. There were already some
works about Dynamic Information Flow
Tracking. We can do this kind of
information flow tracking in two ways.
The first one is at the application level,
working at the Java or Android level. Some
works also propose some solutions at the
OS level: for instance, KBlare. But what
we wanted to do here is to work at a lower
level so this is not at the application or
the OS level but more at the hardware level
or, at least, at the processor
architecture level. If you want to have
some information about the OS level
implementations of information flow
tracking, you can go to blare-ids.org
where you have some implementations of an
Android port and a Java port of intrusion
detection systems. In the rest of my talk,
I will just go through the existing works
and see what we can do about that. When we
talk about dynamic information flow
tracking at a low level, there are mainly
three approaches. The first one is the
one on the left side of this slide. The idea is
that in the upper-side of this figure, we
have the normal processor pipeline:
basically, decode stage, register file and
Arithmetic & Logic Unit. The basic idea is
that when we want to process with tags or
taints, we just duplicate the processor
pipeline (the grey pipeline under the
normal one) just to process data. And, it
implies two things: First of all, we must
have the source code of the processor
itself just to duplicate the processor
pipeline and to make the DIFT pipeline.
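As a rough sketch of this first approach (a toy model that assumes nothing about any real processor's internals): every result computed by the normal pipeline is mirrored, in lockstep, by a tag computation in the duplicated DIFT pipeline.

```python
# Toy model of approach 1: the main pipeline computes values while a
# duplicated "DIFT pipeline" computes tags in lockstep. Purely
# illustrative; no real processor source code involved.

regs = {"r0": 0, "r1": 5, "r2": 7}
tags = {"r0": 0, "r1": 1, "r2": 0}   # 1 = tainted

def execute_add(dst, src1, src2):
    regs[dst] = regs[src1] + regs[src2]   # normal pipeline
    tags[dst] = tags[src1] | tags[src2]   # shadow DIFT pipeline

execute_add("r0", "r1", "r2")
print(regs["r0"], tags["r0"])   # -> 12 1
```

In a real implementation this lockstep mirroring is exactly why the processor's source code is needed: the grey pipeline has to see every write the normal pipeline makes.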
This is quite inconvenient because we
must have the source code of the processor
which is not really easy sometimes.
On the other hand, the main advantage of this
approach is that we can do nearly anything
we want because we have access to all
codes. So, we can pull all wires we need
from the processor just to get the
information we need. On the second
approach (right side of the picture),
there is something a bit different:
instead of having a single processor
aiming to do the normal application flow +
the information flow tracking, we can
separate the normal execution and the
information flow tracking (this is the
second approach over there). This approach
is not satisfying either because you will
have one core running the normal
application but core #2 will be just able
to make DIFT controls. Basically, it's a
shame to use a processor just to make DIFT
controls. The best compromise we can do is
to make a dedicated coprocessor just to
make the information flow tracking
processing. Basically, the most
interesting work in this topic is to have
a main core processor aiming to run the
normal application and a dedicated
coprocessor to make the IFT controls. You
will have some communications between
those two cores. If we want to make a
quick comparison between different works:
if you want to run the dynamic information
flow control in pure software (I will talk
about that in the slide after), this is
really painful in terms of time overhead
because you will see that the time to do
information flow tracking in pure software
is really unacceptable. Regarding
hardware-assisted approaches, the best
advantage in all cases is that we have a
low overhead in terms of silicon area: it
means that, on this slide, the overhead
between the main core and the main core +
the coprocessor is not so important. We
will see that, in the case of my talk, the
dedicated DIFT coprocessor also makes it
easier to support different security policies. As I
said in the pure software solution (the
first line of this table), the basic idea
behind that is to use instrumentation. If
you were there on day 2, the
instrumentation is the transformation of a
program into its own measurement tool. It
means that we will put some sensors in all
parts of my code just to monitor its
activity and gather some information from
it. If we want to measure the impact of
instrumentation on the execution time of
an application, you can see in this
diagram over there, the normal application
execution time which is normalized to 1. When we
want to use instrumentation with it, the
minimal overhead we have is about 75%. The
time with instrumentation will be most of
the time twice as high as the normal
execution time. This is completely
unacceptable because it will just make
your application run slower. Basically, as I
said, the main concern about my talk is
about reducing the overhead of software
instrumentation. I will talk also a bit
about the security of the DIFT coprocessor
because we can't include a DIFT
coprocessor without taking care of its
security. To my knowledge, this
is the first work about DIFT in ARM-based
system-on-chips. In the talk about the
security of the Nintendo Switch, the
speaker said that black-box testing is fun
... except that it isn't. In our case, we
have only a black-box because we can't
modify the structure of the processor, we
must make our job without, let's say,
decapping the processor and so on. This is
an overall schematic of our architecture.
On the left side, in light green, you have
the ARM processor. In this case, this is a
simplified version with only one core.
And, on the right side, you have the
structure of the coprocessor we
implemented in the FPGA. You can notice
two things. The first is that you have some
links between the FPGA and the CPU. These
links already exist in the
system-on-chip. And you can see another thing
regarding the memory: you have separate
memory for the processor and the FPGA. And
we will see later that we can use
TrustZone to add a layer of security, just
to be sure that we won't mix the memory
between the CPU and the FPGA. Basically,
when we want to work with ARM processors,
we must use ARM datasheets, we must read
ARM datasheets. First of all, don't be
afraid of the length of ARM datasheets
because, in my case, I used to work with
the ARMv7 technical manual which is
already 2000 pages. The ARMv8 manual is
about 6000 pages. Anyway. Of course, what
is also difficult is that the information
is split between different documents.
Anyway, when we want to use debug
components in the case of ARM, we just
have this register over there which is
called DBGOSLAR. We can see that, in this
register, we can say that writing the key
value 0xC5A-blabla to this field locks the
debug registers. And if your write any
other value, it will just unlock those
debug registers. So that was basically the
first step to enable the debug components:
just writing a random value to this register
to unlock my debug components. Here
is again a schematic of the overall
system-on-chip. As you see, you have the
two processors and, on the top part, you
have what are called Coresight components.
These are the famous debug components I
will talk about in the second part of my talk.
Here is a simplified view of the debug
components we have in Zynq SoCs. On the
left side, we have the two processors
(CPU0 and CPU1) and all the Coresight
components are: PTM, the one which is in
the red rectangle; and also the ECT which
is the Embedded Cross Trigger; and the ITM
which is the Instrumentation Trace
Macrocell. Basically, when we want to
extract some data from the Coresight
components, the basic path we use is the
PTM, go through the Funnel and, at this
step, we have two choices to store the
information taken from debug components.
The first one is the Embedded Trace Buffer
which is a small memory embedded in the
processor. Unfortunately, this memory is
really small because it's only about
4KBytes as far as I remember. But the other
possibility is just to export some data to
the Trace Packet Output and this is what
we will use just to export some data to
the coprocessor implemented in the FPGA.
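Coming back to the DBGOSLAR step mentioned a moment ago, here is a toy Python model of that unlock behaviour. The register and component names are real CoreSight ones, and 0xC5ACCE55 is the key value given in the ARMv7 Architecture Reference Manual, but the class itself is purely illustrative, not driver code.

```python
# Toy model of the CoreSight OS Lock described earlier: writing the
# key value to DBGOSLAR locks the debug registers, and writing any
# other value unlocks them.

OSLAR_KEY = 0xC5ACCE55  # key value from the ARMv7 reference manual

class CoresightModel:
    def __init__(self):
        self.os_locked = True      # debug registers start locked
        self.trace_enabled = False

    def write_dbgoslar(self, value):
        # The key value locks; any other value unlocks.
        self.os_locked = (value == OSLAR_KEY)

    def enable_ptm(self):
        if self.os_locked:
            raise PermissionError("unlock DBGOSLAR first")
        self.trace_enabled = True

dbg = CoresightModel()
dbg.write_dbgoslar(0x0)   # "a random value" unlocks, as in the talk
dbg.enable_ptm()
print(dbg.trace_enabled)  # -> True
```

This mirrors the setup order from the talk: unlock the debug registers first, then configure the PTM and the rest of the trace path.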
Basically, what is the PTM able to do? The
first thing that PTM can do is to trace
whatever is in your memory. For instance, you
can trace all your code. Basically, all
the blue sections. But, you can also let's
say trace specific regions of the code:
You can say OK I just want to trace the
code in my section 1 or section 2 or
section N. Then the PTM is also able to
make some Branch Broadcasting. That is
something that was not present in the
Linux kernel. So, we already submitted a
patch that was accepted to manage the
Branch Broadcasting into the PTM. And we
can do some timestamping and other things
just to be able to store the information
in the traces. Basically, what does a trace
look like? Here is the simplest
code we could have: it's just a for loop
doing nothing. The assembly code is over
there. And the trace will look like this.
In the first 5 bytes, we have some kind of
start packet which is called the A-sync packet
just to say "OK, this is the beginning of
the trace". In the green part, we'll have
the address which corresponds to the
beginning of the loop. And, in the orange
part, we will have the Branch Address
Packet. You can see that you have 10
iterations of this Branch Address Packet
because we have 10 iterations of the for
loop. This is just to show the
general structure of a trace. This is just
a control flow graph to show what we
could obtain from this. Of course, if we
have another loop at the end of this
control flow graph, we'll just make the
trace a bit longer just to have the
information about the second loop and so
on. Once we have all these traces, the
next step is to say: I have my tags, but how
do I define the rules to transmit my
tags? And this is where we will use static
analysis. Basically, in this
example, if we have the instruction "add
register1 + register2 and put the result
in register0". For this, we will use
static analysis which allows us to say that
the tag associated with register0 will be
the tag of register1 or the tag of
register2. Static analysis will be done
before running my code just to say I have
all the rules for all the lines of my
code. Now that we have the trace, we know
how to transmit the tags all over my code,
the next step will be just to make the
static analysis in the LLVM backend. The
final step will be about instrumentation.
As I said before, we can recover all the
memory addresses we need through
instrumentation. Alternatively, we can
only get the register-relative memory
addresses through instrumentation. In this
first case, on this simple code, we can
instrument all the code but the main
drawback of this solution is that it will
drastically increase the execution
time. Alternatively, what we can do is
that with the store instruction over
there, we can get data from the trace:
basically, we will use the Program Counter
from the trace. Then, for the Stack
Pointer, we will use static analysis to
get information from the Stack Pointer.
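A tiny sketch of this address-recovery idea (all PCs, offsets and the SP value below are made-up numbers, and the real ARMHEx system does this in hardware, not in Python): the PC comes from the trace, the SP-relative offset comes from static analysis, and only the SP value itself needs instrumentation.

```python
# Hedged sketch: reconstruct store addresses by combining the
# CoreSight trace (which PCs executed) with static analysis (the
# SP-relative offset of each store). Only the SP value needs to be
# recovered by a single instrumented instruction.

static_offsets = {0x8000: 8, 0x8004: 12}  # PC -> SP-relative offset
trace_pcs = [0x8000, 0x8004]              # decoded from the trace
sp = 0x7FF00000                           # from instrumentation

for pc in trace_pcs:
    addr = sp + static_offsets[pc]
    print(f"store at PC {pc:#x} touches {addr:#x}")
```

The payoff is the ratio: one instrumented instruction per function instead of one per memory access.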
And, finally, we can use only one
instrumented instruction at the end. If I
go back to this system, the communication
overhead will be the main drawback as I
said before because if we have over there
the processor and the FPGA running in
different parts, the main problem will be
how we can transmit data in real-time or,
at least, at the highest speed we can
between the processor and the FPGA. This
is the time overhead when we enable
Coresight components or not. In blue, we
have the basic time overhead when the
traces are disabled. And we can see that,
when we enable traces, the time overhead
is nearly negligible. Regarding
instrumentation time, we can see that with
strategy 2, which is using the
Coresight components, using the static
analysis and the instrumentation, we can
lower the instrumentation overhead from
53% down to 5%. We still have some
overhead due to instrumentation but it's
really low compared to the related works
where all the code was instrumented. This
is an overview that shows, in the
grey lines, the overhead of related works
with full instrumentation, and we can see
that, in our approach (with the green
lines over there), the time overhead with
our code is much much smaller. Basically,
how can we use TrustZone with this? This
is just an overview of our system. And we
can say we can use TrustZone just to
separate the CPU from the FPGA
coprocessor. If we make a comparison with
related works, we can see that compared to
the first works, we are able to make some
information flow control with a hardcore
processor, which was not the case with the
first two works in this table. It means
you can use a basic ARM processor just to
make the information flow tracking instead
of having a specific processor. And, of
course, the area overhead, which is
another important topic, is much much
smaller compared to the existing works.
It's time for the conclusion. As I
presented in this talk, we are able to use
the PTM component just to obtain runtime
information about my application. This is
non-intrusive tracing because we still
have negligible performance overhead.
And we also improve the software security
because we were able to secure the
coprocessor itself. The future
perspective of that work is mainly to work
with multicore processors and see if we can
use the same approach for Intel and maybe
ST microcontrollers to see if we can also
do information flow tracking in this case.
That was my talk. Thanks for listening.
applause
Herald: Thank you very much for this talk.
Unfortunately, we don't have time for Q&A,
so please, when you leave the room, take
your trash with you; that makes the angels
happy.
Pascal: I was a bit long, sorry.
Herald: Another round
of applause for Pascal.
applause
34c3 outro
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!