34c3 intro Herald: The next talk will be about embedded systems security and Pascal, the speaker, will explain how you can hijack debug components for embedded security in ARM processors. Pascal is not only an embedded software security engineer but also a researcher in his spare time. Please give a very very warm welcoming good morning applause to Pascal. applause Pascal: OK, thanks for the introduction. As it was said, I'm an engineer by day in a French company where I work as an embedded system security engineer. But this talk is mainly about my spare-time activity, which is being a researcher, hacker or whatever you want to call it. This is because I work with a PhD student called Muhammad Abdul Wahab. He's a third-year PhD student in a French lab. So, this talk will mainly be a presentation of his work on embedded systems security and especially the debug components available in ARM processors. Don't worry about the link. At the end, there will also be a link with all the slides, documentation and everything. Before the congress, I didn't know what kind of background you would have for my talk. So, I put some references there to talks where you will find all the vocabulary needed to understand at least some parts of my talk. About computer architecture and embedded system security, I hope you attended the talk by Alastair about the formal verification of software and also the talk by Keegan about Trusted Execution Environments (TEEs such as TrustZone). In this talk, I will also talk about FPGA stuff. About FPGAs, there was a talk on day 2 about FPGA reverse engineering. And, if you don't know about FPGAs, I hope that you had some time to go to the OpenFPGA assembly because these guys are doing a great job on open-source FPGA tools. When you see this slide, the first question is why I put "TrustZone is not enough". Just a quick reminder about what TrustZone is.
TrustZone is about separating a system between a non-secure world, in red, and a secure world, in green. When we want to use the TrustZone framework, we have lots of hardware components and lots of software components allowing us to, let's say, run a secure OS and a non-secure OS separately. In our case, what we wanted to do is use the debug components (you can see them on the left side of the picture) to see if we can do some security with them. Furthermore, we wanted to use something other than TrustZone because, if you attended the talk about the security of the Nintendo Switch, you saw that the TrustZone framework can be bypassed in specific cases. This talk is also quite complementary because we will do something at a lower level, at the processor architecture level. I will talk in a later part of my talk about what we can do between TrustZone and the approach developed in this work. So, basically, the presentation will be a quick introduction; then I will talk about some works aiming to use debug components for security; then I will talk about ARMHEx, which is the name of the system we developed to use the debug components of a hardcore processor; and, finally, some results and a conclusion. In the context of our project, we are working with System-on-Chips. System-on-Chips are a kind of device where, in the green part, we have a processor. It can be a single-core, dual-core or even quad-core processor. Another interesting part, in yellow in the image, is the programmable logic, which is also called an FPGA in this case. In this kind of System-on-Chip, you have the hardcore processor, the FPGA and some links between those two units. You can see here, in the red rectangle, one of the two processors. This picture is an image of a System-on-Chip called Zynq, provided by Xilinx, which is also an FPGA provider. In this kind of chip, we usually have 2 Cortex-A9 processors and some FPGA logic to work with.
What we want to do with the debug components is to work on Dynamic Information Flow Tracking. Basically, what is information flow? Information flow is the transfer of information from an information container C1 to C2 given a process P. In other words, if we take the simple code over there with 4 variables (for instance, a, b, w and x), the idea is that if you have some metadata in a, the metadata will be transmitted to w. So what kind of information will we transmit through the code? Basically, the information I'm talking about in the first block is "OK, this data is private, this data is public", and we should not mix public and private data together. The information can be binary, "public or private", but of course we can also have several levels of information. In the following parts, this information will be called taint or tags and, to keep things simple, we will use colors to say "OK, my tag is red or green", just to say whether it's private or public data. As I said, if the tag contained in a is red, the data contained in w will be red as well. Same thing for b and x. Now a quick example with a buffer overflow. In the upper part of the slide you have the assembly code and, in the lower part, the green columns are the colors of the tags. On the right side of these columns you have the status of the different registers. This code basically says: when my input is red at the beginning, the tainted input goes into the index variable. Register r2, which contains the idx variable, will be red as well. Then, when we access buffer[idx], which is the second line in the C code at the beginning, the information we have there will be red as well. And, of course, the result of the operation, which is x, will be red as well.
Basically, that means that if there is a tainted input at the beginning, we must be able to transmit this information up to the return address of this code, just to say "OK, if this tainted input is private, the return address at the end of the code should be private as well". What can we do with that? There is a simple code over there. It says: if you are a normal user, you just open the welcome file; otherwise, if you are the root user, you open the password file. The point is that if we open the welcome file, this is public information: you can do whatever you want with it. Otherwise, for the root user, the password file may contain, for instance, a cryptographic key, and we should not reach the printf function at the end of this code. The idea is to check whether the fs variable containing the data of the file is private or public. There are mainly three steps for that. First of all, the compilation gives us the assembly code. Then, we must modify system calls to send the tags. The tags are, as I said before, the private or public information about my fs variable. I will talk a bit about that later: maybe, in future works, the idea is to build, or at least compile, an Operating System with integrated support for DIFT. There were already some works about Dynamic Information Flow Tracking. This kind of information flow tracking can be done at several levels. The first one is the application level, working at the Java or Android level. Some works also propose solutions at the OS level: for instance, KBlare. But what we wanted to do here is to work at a lower level, so not at the application or OS level but at the hardware level or, at least, at the processor architecture level.
If you want more information about the OS-level implementations of information flow tracking, you can go to blare-ids.org where you have implementations of an Android port and a Java port of intrusion detection systems. In the rest of my talk, I will just go through the existing works and see what we can do about that. When we talk about dynamic information flow tracking at a low level, there are mainly three approaches. The first one is on the left side of this slide. In the upper part of this figure, we have the normal processor pipeline: basically, decode stage, register file and Arithmetic & Logic Unit. The basic idea is that when we want to process tags or taints, we just duplicate the processor pipeline (the grey pipeline under the normal one) to process them. This implies two things. First of all, we must have the source code of the processor itself in order to duplicate the pipeline and build the DIFT pipeline. This is quite inconvenient because getting the source code of the processor is not really easy sometimes. On the other hand, the main advantage of this approach is that we can do nearly anything we want because we have access to everything: we can pull all the wires we need from the processor to get the information we need. The second approach (right side of the picture) is a bit different: instead of having a single processor doing the normal application flow plus the information flow tracking, we separate the normal execution and the information flow tracking. This approach is not satisfying either because you will have one core running the normal application while core #2 is only able to do the DIFT checks. Basically, it's a shame to use a whole processor just for DIFT checks. The best compromise is to have a dedicated coprocessor that only does the information flow tracking processing.
Basically, the most interesting design in this topic has a main core running the normal application and a dedicated coprocessor doing the IFT checks, with some communications between those two cores. Now, a quick comparison between different works. If you run the dynamic information flow control in pure software (I will talk about that in the next slide), it is really painful in terms of time overhead: you will see that the time to do information flow tracking in pure software is really unacceptable. Regarding hardware-assisted approaches, the main advantage in all cases is the low overhead in terms of silicon area: it means that, on this slide, the overhead between the main core alone and the main core plus the coprocessor is not so important. We will see that, in the case of my talk, a dedicated DIFT coprocessor also makes it easier to support different security policies. As I said, the pure software solution (the first line of this table) basically relies on instrumentation. As was said on day 2, instrumentation is the transformation of a program into its own measurement tool. It means that we put some sensors in all parts of the code just to monitor its activity and gather some information from it. If we measure the impact of instrumentation on the execution time of an application, you can see in this diagram the normal execution time, which is normalized to 1. When we add instrumentation, the minimal overhead we get is about 75%, and most of the time the execution with instrumentation will be more than twice as long as the normal execution. This is completely unacceptable because it just makes your application slower. Basically, the main concern of my talk is reducing the overhead of software instrumentation.
I will also talk a bit about the security of the DIFT coprocessor because we can't include a DIFT coprocessor without taking care of its security. To my knowledge, this is the first work about DIFT in ARM-based system-on-chips. In the talk about the security of the Nintendo Switch, the speaker said that black-box testing is fun ... except that it isn't. In our case, we only have a black box because we can't modify the structure of the processor; we must do our job without, let's say, decapping the processor and so on. This is an overall schematic of our architecture. On the left side, in light green, you have the ARM processor. In this case, this is a simplified version with only one core. And, on the right side, you have the structure of the coprocessor we implemented in the FPGA. You can notice two things. The first is that you have some links between the FPGA and the CPU. These links already exist in the system-on-chip. And you can see another thing regarding the memory: you have separate memory for the processor and the FPGA. We will see later that we can use TrustZone to add a layer of security, just to be sure that we won't mix the memory between the CPU and the FPGA. Basically, when we want to work with ARM processors, we must read ARM datasheets. First of all, don't be afraid of the length of ARM datasheets: in my case, I used to work with the ARMv7 technical manual, which is already 2000 pages; the ARMv8 manual is about 6000 pages. Anyway. What is also difficult is that the information is split between different documents. When we want to use the debug components in the ARM case, we have this register over there, which is called DBGOSLAR. The manual says that writing the key value 0xC5A-blabla to this field locks the debug registers, and if you write any other value, it will just unlock those debug registers.
So that was basically the first step to enable the debug components: just writing a random value to this register to unlock my debug components. Here is again a schematic of the overall system-on-chip. As you see, you have the two processors and, on the top part, you have what are called CoreSight components. These are the famous debug components I will talk about in the second part of my talk. Here is a simplified view of the debug components we have in Zynq SoCs. On the left side, we have the two processors (CPU0 and CPU1), and the CoreSight components are: the PTM, the one in the red rectangle; the ECT, which is the Embedded Cross Trigger; and the ITM, which is the Instrumentation Trace Macrocell. Basically, when we want to extract some data from the CoreSight components, the basic path is to use the PTM, go through the Funnel and, at this step, we have two choices to store the information taken from the debug components. The first one is the Embedded Trace Buffer, which is a small memory embedded in the processor. Unfortunately, this memory is really small: it's only about 4 KBytes as far as I remember. The other possibility is to export the data to the Trace Packet Output, and this is what we will use to export data to the coprocessor implemented in the FPGA. Basically, what is the PTM able to do? The first thing the PTM can do is trace whatever is in your memory. For instance, you can trace all your code: basically, all the blue sections. But you can also trace only specific regions of the code: you can say, OK, I just want to trace the code in my section 1 or section 2 or section N. Then the PTM is also able to do Branch Broadcasting. Branch Broadcasting support was not present in the Linux kernel, so we submitted a patch, which was accepted, to manage Branch Broadcasting in the PTM. And we can do some timestamping and other things to be able to store the information in the traces.
Basically, what does a trace look like? Here is the simplest code we could have: it's just a for loop doing nothing. The assembly code is over there. And the trace will look like this. In the first 5 bytes, we have a kind of start packet, called the A-sync packet, which says "OK, this is the beginning of the trace". In the green part, we have the address which corresponds to the beginning of the loop. And, in the orange part, we have the Branch Address Packet. You can see that there are 10 iterations of this Branch Address Packet because we have 10 iterations of the for loop. This is just to show the general structure of a trace. Here is the corresponding control flow graph. Of course, if we had another loop at the end of this control flow graph, the trace would just get a bit longer to carry the information about the second loop and so on. Once we have all these traces, the next step is: I have my tags, but how do I define the rules to propagate them? This is where we use static analysis. Basically, in this example, we have the instruction "add register1 + register2 and put the result in register0". Static analysis allows us to say that the tag associated with register0 will be the tag of register1 or the tag of register2. The static analysis is done before running my code, so that I have the rules for all the lines of my code; it is performed in the LLVM backend. Now that we have the trace and we know how to propagate the tags all over the code, the final step is about instrumentation. As I said before, we can recover all the memory addresses we need through instrumentation; alternatively, we can get only the register-relative memory addresses through instrumentation.
In the first case, on this simple code, we can instrument all the code, but the main drawback of this solution is that it makes the execution time completely excessive. Instead, what we can do with the store instruction over there is get data from the trace: basically, we use the Program Counter from the trace. Then, for the Stack Pointer, we use static analysis to get the offset relative to the Stack Pointer. And, finally, we need only one instrumented instruction at the end. If I go back to this system, the communication overhead is the main drawback, as I said before: with the processor and the FPGA running in different parts, the main problem is how to transmit data in real time or, at least, at the highest speed we can between the processor and the FPGA. This is the time overhead with the CoreSight components enabled or disabled. In blue, we have the baseline execution time with traces disabled. We can see that, when we enable the traces, the time overhead is nearly negligible. Regarding instrumentation time, we can see that with strategy 2, which uses the CoreSight components plus static analysis and instrumentation, we can lower the instrumentation overhead from 53% down to 5%. We still have some overhead due to instrumentation, but it's really low compared to the related works where all the code was instrumented. This overview shows, in the grey lines, the overhead of related works with full instrumentation; you can see that, with our approach (the green lines over there), the time overhead of our code is much, much smaller. Basically, how can we use TrustZone with this? This is just an overview of our system. We can use TrustZone to separate the CPU from the FPGA coprocessor.
If we make a comparison with related works, we can see that, compared to the first works, we are able to do information flow control with a hardcore processor, which was not the case with the first two works in this table. It means you can use a basic ARM processor to do the information flow tracking instead of having a specific processor. And, of course, the area overhead, which is another important topic, is much, much smaller compared to the existing works. It's time for the conclusion. As I presented in this talk, we are able to use the PTM component to obtain runtime information about the application. This is non-intrusive tracing, with negligible performance overhead. And we also improved software security because we took care of the security of the coprocessor itself. The future perspective of this work is mainly to work with multicore processors and to see if we can use the same approach for Intel and maybe ST microcontrollers, to see if we can also do information flow tracking in those cases. That was my talk. Thanks for listening. applause Herald: Thank you very much for this talk. Unfortunately, we don't have time for Q&A, so please, if you leave the room, take your trash with you; that makes the angels happy. Pascal: I was a bit long, sorry. Herald: Another round of applause for Pascal. applause 34c3 outro subtitles created by c3subtitles.de in the year 2020. Join, and help us!