34c3 intro

Herald: The next talk will be about embedded systems security, and Pascal, the speaker, will explain how you can hijack debug components for embedded security in ARM processors. Pascal is not only an embedded software security engineer but also a researcher in his spare time. Please give a very, very warm welcoming good-morning applause to Pascal.

applause
Pascal: OK, thanks for the introduction. As was said, I'm an engineer by day at a French company, where I work as an embedded systems security engineer. But this talk is mainly about my spare-time activity, which is researcher, hacker, or whatever you want to call it. This is because I work with a PhD student called Muhammad Abdul Wahab. He's a third-year PhD student in a French lab, so this talk will mainly be a presentation of his work on embedded systems security and especially the debug components available in ARM processors. Don't worry about the link: at the end, there will be a link with all the slides, documentation and everything.

Before the congress, I didn't know what kind of background you would need for my talk, so I put some links there, I mean some references to talks where you will find all the vocabulary needed to understand at least some parts of my talk. Regarding computer architecture and embedded systems security, I hope you attended the talk by Alastair about formal verification of software and also the talk by Keegan about Trusted Execution Environments (TEEs such as TrustZone). In this talk, I will also talk about FPGA stuff. About FPGAs, there was a talk on day 2 about FPGA reverse engineering. And, if you don't know about FPGAs, I hope you had some time to go to the OpenFPGA assembly, because these guys are doing a great job on open-source FPGA tools.
When you see this slide, the first question is: why did I put "TrustZone is not enough"? Just a quick reminder about what TrustZone is. TrustZone is about separating a system into a non-secure world, in red, and a secure world, in green. When we want to use the TrustZone framework, we have lots of hardware components and lots of software components allowing us to, let's say, run a secure OS and a non-secure OS separately. In our case, what we wanted to do is use the debug components (you can see them on the left side of the picture) to see if we can do some security with them. Furthermore, we wanted to use something other than TrustZone because, if you attended the talk about the security of the Nintendo Switch, you saw that the TrustZone framework can be bypassed in specific cases. Furthermore, this talk is quite complementary, because we will do something at a lower level, at the processor architecture level. I will talk in a later part of my talk about what we can do between TrustZone and the approach developed in this work.

So, basically, the presentation will be a quick introduction; then I will talk about some works aiming to use debug components for security. Then, I will talk about ARMHEx, which is the name of the system we developed to use the debug components of a hardcore processor. And, finally, some results and a conclusion.
In the context of our project, we are working with Systems-on-Chip. Systems-on-Chip are a kind of device where we have, in the green part, a processor; it can be a single-core, dual-core or even quad-core processor. Another interesting part, which is in yellow in the image, is the programmable logic, which is also called an FPGA in this case. In this kind of System-on-Chip, you have the hardcore processor, the FPGA and some links between those two units. You can see here, in the red rectangle, one of the two processors. This picture shows a System-on-Chip called Zynq, provided by Xilinx, which is also an FPGA vendor. In this kind of chip, we usually have two Cortex-A9 processors and some FPGA logic to work with.
What we want to do with the debug components is to work on Dynamic Information Flow Tracking. Basically, what is information flow? Information flow is the transfer of information from an information container C1 to a container C2 given a process P. In other words, if we take the simple code over there, with four variables (for instance a, b, w and x), the idea is that if you have some metadata in a, the metadata will be transmitted to w. In other words, what kind of information will we transmit through the code? Basically, the information I'm talking about in the first block is "OK, this data is private, this data is public", and we should not mix data which are public and private together. Basically, the information can be binary ("public or private"), but of course we will also be able to have several levels of information. In the following parts, this information will be called taint or tags and, to keep things simple, we will use colors to say "OK, my tag is red or green", just to say whether it's private or public data. As I said, if the tag contained in a is red, the data contained in w will be red as well. Same thing for b and x.
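[Editor's note: the code on the slide is not visible in the transcript; the following is a minimal C sketch of the idea described above. The statements and the explicit tag variables are illustrative, not taken from the talk.]

    /* Each variable gets a companion tag that DIFT keeps in sync with the
       data flow. RED marks private data, GREEN marks public data. */
    #include <stdio.h>

    enum tag { GREEN = 0, RED = 1 };

    int main(void) {
        int a = 1, b = 2, w, x;
        enum tag tag_a = RED, tag_b = GREEN, tag_w, tag_x;

        w = a * 2;       /* data flows from a to w ...        */
        tag_w = tag_a;   /* ... so w inherits a's tag (red)   */
        x = b + 1;       /* data flows from b to x ...        */
        tag_x = tag_b;   /* ... so x inherits b's tag (green) */

        printf("w=%d (tag %d), x=%d (tag %d)\n", w, tag_w, x, tag_x);
        return 0;
    }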
If we look at a quick example over there: a buffer overflow. In the upper part of the slide you have the assembly code and, in the lower part, the green columns are the colors of the tags; on the right side of these columns you have the status of the different registers. This code basically says: my input is red at the beginning, and we just use this tainted input as the index variable. The register which contains the idx variable (register 2) will therefore be red as well. Then, when we want to access buffer[idx], which is the second line of the C code at the beginning, the information we read there will be red as well. And, of course, the result of the operation, which is x, will be red as well. Basically, that means that if there is a tainted input at the beginning, we must be able to propagate this information up to the return address of this code, just to say: OK, if this tainted input is private, the return address at the end of the code should be private as well.
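[Editor's note: a hedged C reconstruction of the kind of code being described; the names buffer, idx and x come from the talk, everything else is illustrative.]

    /* A tainted (externally controlled) index flows into a buffer access:
       if idx carries a red tag, then buffer[idx] and x become red too, and
       an out-of-bounds idx can reach the saved return address. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        char buffer[16] = "data";
        /* idx comes from outside the program, so it is a tainted input */
        int idx = (argc > 1) ? atoi(argv[1]) : 0;

        /* no bounds check: with a large idx this is a buffer overflow */
        char x = buffer[idx];   /* x inherits the taint of idx */

        printf("%d\n", x);
        return 0;
    }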
What can we do with that? There is a simple code over there. This code says: if you are a normal user, you just open the welcome file; otherwise, if you are the root user, you open the password file. That is to say: if we open the welcome file, this is public information and you can do whatever you want with it. Otherwise, if it's the root user, the password file may contain, for instance, a cryptographic key, and we should not reach the printf function at the end of this code.
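[Editor's note: again a hedged reconstruction of the code on the slide; the branching on the user, the welcome/password files, the fs variable and the final printf come from the talk, the rest is illustrative.]

    /* If the process runs as root, fs holds private data (the password
       file) which must not flow into the public printf at the end. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        FILE *fs;
        char line[128] = {0};

        if (getuid() != 0)
            fs = fopen("welcome", "r");    /* public data            */
        else
            fs = fopen("password", "r");   /* private data (tainted) */

        if (fs && fgets(line, sizeof line, fs))
            printf("%s", line);            /* leak if line is tainted */

        if (fs)
            fclose(fs);
        return 0;
    }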
The idea behind that is to check whether the fs variable containing the data of the file is private or public. There are mainly three steps for that. First of all, the compilation gives us the assembly code. Then, we must modify the system calls to send the tags; the tags are, as I said before, the private or public information about my fs variable. I will talk a bit more about that later: maybe, in future works, the idea is to build, or at least to compile, an operating system with integrated support for DIFT.
There have already been some works on Dynamic Information Flow Tracking. We can do this kind of information flow tracking in two manners. The first one is at the application level, working at the Java or Android level. Some works also propose solutions at the OS level: for instance, KBlare. But what we wanted to do here is to work at a lower level, so not at the application or OS level, but rather at the hardware level or, at least, at the processor architecture level. If you want some information about the OS-level implementations of information flow tracking, you can go to blare-ids.org, where you have implementations of an Android port and a Java port of intrusion detection systems. In the rest of my talk, I will just go through the existing works and see what we can do about that.
When we talk about dynamic information flow tracking at a low level, there are mainly three approaches. The first one is on the left side of this slide. In the upper part of the figure, we have the normal processor pipeline: basically, decode stage, register file and arithmetic & logic unit. The basic idea is that, when we want to process tags or taints, we just duplicate the processor pipeline (the grey pipeline under the normal one) to process the tags. This implies two things. First of all, we must have the source code of the processor itself, just to duplicate the processor pipeline and build the DIFT pipeline. This is quite inconvenient, because getting the source code of the processor is not really easy sometimes. On the other hand, the main advantage of this approach is that we can do nearly anything we want, because we have access to all the code: we can pull out all the wires we need from the processor to get the information we need.

The second approach (right side of the picture) is a bit different: instead of having a single processor doing the normal application flow plus the information flow tracking, we separate the normal execution and the information flow tracking. This approach is not satisfying either, because you will have one core running the normal application, while core #2 will only be able to do the DIFT controls. Basically, it's a shame to use a full processor just to do DIFT controls. The best compromise is to build a dedicated coprocessor that only does the information flow tracking processing. Basically, the most interesting work on this topic has a main core running the normal application and a dedicated coprocessor doing the IFT controls, with some communication between those two cores.
Let's make a quick comparison between the different works. If you run dynamic information flow control in pure software (I will talk about that in the next slide), it is really painful in terms of time overhead: you will see that the time needed to do information flow tracking in pure software is really unacceptable. Regarding hardware-assisted approaches, the main advantage in all cases is a low overhead in terms of silicon area: it means that, on this slide, the overhead between the main core alone and the main core plus the coprocessor is not so important. We will also see that, in the case of my talk, the dedicated DIFT coprocessor makes it easier to implement different security policies. As I said, in the pure software solution (the first line of this table), the basic idea is to use instrumentation. If you were there on day 2: instrumentation is the transformation of a program into its own measurement tool. It means that we put some sensors in all parts of the code, just to monitor its activity and gather some information from it.
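[Editor's note: a tiny, hypothetical illustration of what such software "sensors" can look like; the shadow tag map and the dift_copy_tag helper are made up for this sketch and are not from the talk or from a real tool.]

    /* A toy shadow-memory tag map plus the kind of extra call that
       software instrumentation inserts before a memory access. */
    #include <stdint.h>

    static uint8_t shadow[1 << 16];                /* one tag byte per address (toy map) */
    #define TAG_OF(p) shadow[(uintptr_t)(p) & 0xFFFF]

    static void dift_copy_tag(const void *dst, const void *src) {
        TAG_OF(dst) = TAG_OF(src);                 /* destination inherits source tag */
    }

    char buffer[16];
    int  idx;
    char x;

    void instrumented_access(void) {
        dift_copy_tag(&x, &buffer[idx]);           /* inserted by the instrumentation */
        x = buffer[idx];                           /* original program statement      */
    }

Adding one such call per load/store is what makes pure-software DIFT so slow.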
If we measure the impact of instrumentation on the execution time of an application, you can see in this diagram the normal execution time, which is normalized to 1. When we use instrumentation, the minimum overhead we get is about 75%, and most of the time the execution time with instrumentation is more than twice the normal execution time. This is completely unacceptable, because it just makes your application run slower. So, as I said, the main concern of my talk is reducing the overhead of software instrumentation. I will also talk a bit about the security of the DIFT coprocessor, because we can't include a DIFT coprocessor without taking care of its security. To my knowledge, this is the first work on DIFT in ARM-based Systems-on-Chip. In the talk about the security of the Nintendo Switch, the speaker said that black-box testing is fun... except that it isn't. In our case, we only have a black box, because we can't modify the structure of the processor; we must do our job without, let's say, decapping the processor and so on.
This is an overall schematic of our architecture. On the left side, in light green, you have the ARM processor; in this case, this is a simplified version with only one core. And, on the right side, you have the structure of the coprocessor we implemented in the FPGA. You can notice, for the moment, two things. The first is that you have some links between the FPGA and the CPU; these links already exist in the System-on-Chip. The other thing concerns the memory: you have separate memory for the processor and for the FPGA, and we will see later that we can use TrustZone to add a layer of security, just to be sure that we won't mix the memory between the CPU and the FPGA.

Basically, when we want to work with ARM processors, we must use ARM datasheets, we must read ARM datasheets. First of all, don't be afraid of the length of ARM datasheets: in my case, I used to work with the ARMv7 technical manual, which is already 2000 pages, and the ARMv8 manual is about 6000 pages. Anyway. What is also difficult is that the information is split between different documents.
Anyway, when we want to use the debug components in the ARM case, we have this register over there, which is called DBGOSLAR. We can see that, for this register, writing the key value 0xC5A... to this field locks the debug registers, and if you write any other value, it just unlocks those debug registers. So that was basically the first step to enable the debug components: just writing a random value to this register to unlock my debug components.
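[Editor's note: a hedged sketch of what that unlock step can look like from Linux user space, assuming the SoC exposes the debug registers through /dev/mem. DEBUG_BASE and OSLAR_OFFSET are placeholders to be taken from the SoC documentation; they are not values given in the talk.]

    /* Write a value other than the lock key to DBGOSLAR (OS Lock Access
       Register) so that the debug registers become accessible. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DEBUG_BASE   0x00000000UL  /* placeholder: debug register block base */
    #define OSLAR_OFFSET 0x000UL       /* placeholder: DBGOSLAR offset in block  */

    int main(void) {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *dbg = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, DEBUG_BASE);
        if (dbg == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        dbg[OSLAR_OFFSET / 4] = 0;     /* any value other than the key unlocks */

        munmap((void *)dbg, 0x1000);
        close(fd);
        return 0;
    }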
Here is again a schematic of the overall System-on-Chip. As you can see, you have the two processors and, in the top part, what are called Coresight components. These are the famous debug components I will talk about in the second part of my talk. Here is a simplified view of the debug components we have in Zynq SoCs. On the left side, we have the two processors (CPU0 and CPU1), and the Coresight components are: the PTM, the one in the red rectangle; the ECT, which is the Embedded Cross Trigger; and the ITM, which is the Instrumentation Trace Macrocell. Basically, when we want to extract some data from the Coresight components, the basic path is to use the PTM, go through the Funnel and, at this step, we have two choices to store the information taken from the debug components. The first one is the Embedded Trace Buffer, which is a small memory embedded in the processor; unfortunately, this memory is really small, only about 4 KBytes as far as I remember. The other possibility is to export the data through the Trace Packet Output, and this is what we will use to export the data to the coprocessor implemented in the FPGA.
Basically, what is the PTM able to do? The first thing the PTM can do is trace whatever is in your memory. For instance, you can trace all your code: basically, all the blue sections. But you can also, let's say, trace specific regions of the code: you can say, OK, I just want to trace the code in my section 1, or section 2, or section N. Then, the PTM is also able to do Branch Broadcasting. That is something that was not supported in the Linux kernel, so we submitted a patch, which was accepted, to manage Branch Broadcasting in the PTM. And we can do some timestamping and other things, just to be able to store additional information in the traces.
Basically, what does a trace look like? Here is the simplest code we could have: it's just a for loop doing nothing; the assembly code is over there, and the trace will look like this. In the first 5 bytes, there is a kind of start packet, called the A-sync packet, just to say "OK, this is the beginning of the trace". In the green part, we have the address which corresponds to the beginning of the loop. And, in the orange part, we have the Branch Address Packets. You can see that you have 10 occurrences of this Branch Address Packet, because we have 10 iterations of the for loop. This is just to show the general structure of a trace. Here is a control flow graph, just to show what we could get from this. Of course, if we have another loop at the end of this control flow graph, the trace just gets a bit longer, to carry the information about the second loop, and so on.
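[Editor's note: the traced program itself is not shown in the transcript; a minimal C loop matching the description above (an empty for loop with 10 iterations, hence 10 Branch Address Packets) could look like this.]

    /* Simplest possible traced program: an empty loop. Each taken branch at
       the end of an iteration shows up as one Branch Address Packet in the
       PTM trace, so 10 iterations give 10 such packets. */
    int main(void) {
        for (volatile int i = 0; i < 10; i++) {
            /* empty body; volatile keeps the compiler from removing the loop */
        }
        return 0;
    }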
Once we have all these traces, the next step is: I have my tags, but how do I define the rules to propagate them? This is where we use static analysis. Basically, in this example, if we have the instruction "add register1 and register2 and put the result in register0", static analysis allows us to say that the tag associated with register0 will be the tag of register1 OR the tag of register2. The static analysis is done before running my code, so that I have the rules for all the lines of my code.
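[Editor's note: a minimal sketch of such a propagation rule, assuming a one-bit tag per register; the data structures are illustrative, not the ones used in ARMHEx.]

    #include <stdint.h>

    #define NUM_REGS 16
    typedef uint8_t tag_t;         /* 0 = public (green), 1 = private (red) */
    tag_t reg_tag[NUM_REGS];

    /* Rule produced by static analysis for "add r0, r1, r2":
       the destination tag is the OR of the two source tags. */
    void rule_add_r0_r1_r2(void) {
        reg_tag[0] = reg_tag[1] | reg_tag[2];
    }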
Now that we have the trace and we know how to propagate the tags across the code, the static analysis is implemented in the LLVM backend, and the final step is instrumentation. As I said before, we could recover all the memory addresses we need through instrumentation; alternatively, we can get only the register-relative memory addresses through instrumentation. In the first case, on this simple code, we can instrument all the code, but the main drawback of this solution is that it completely blows up the execution time. What we can do instead, with the store instruction over there, is get data from the trace: basically, we use the Program Counter from the trace; then, for the Stack Pointer, we use static analysis to get the stack-pointer-relative information; and, finally, we only need one instrumented instruction at the end. If I go back to this system, the communication overhead is the main drawback, as I said before: since the processor and the FPGA run in different parts, the main problem is how we can transmit data in real time or, at least, at the highest speed we can between the processor and the FPGA.
This is the time overhead with the Coresight components enabled or not. In blue, we have the baseline execution time when traces are disabled, and we can see that, when we enable the traces, the time overhead is nearly negligible. Regarding instrumentation time, we can see that with strategy 2, which uses the Coresight components, the static analysis and the instrumentation, we can lower the instrumentation overhead from 53% down to 5%. We still have some overhead due to instrumentation, but it's really low compared to the related works where all the code was instrumented. This is an overview showing, in the grey lines, the overhead of related works with full instrumentation, and we can see that, with our approach (the green lines over there), the time overhead of our code is much, much smaller.
Basically, how can we use TrustZone with this? This is just an overview of our system: we can use TrustZone to separate the CPU from the FPGA coprocessor. If we make a comparison with related works, we can see that, compared to the first works, we are able to do information flow control with a hardcore processor, which was not the case for the first two works in this table. It means you can use a standard ARM processor to do the information flow tracking, instead of having a specific processor. And, of course, the area overhead, which is another important topic, is much, much smaller compared to the existing works.
It's time for the conclusion. As I presented in this talk, we are able to use the PTM component to obtain runtime information about an application. This is non-intrusive tracing, since we still have a negligible performance overhead. We also improved software security, because we were able to add some security on the coprocessor side. The future perspective of this work is mainly to work with multicore processors and to see if we can use the same approach for Intel and maybe ST microcontrollers, to see if we can also do information flow tracking in those cases. That was my talk. Thanks for listening.
applause

Herald: Thank you very much for this talk. Unfortunately, we don't have time for Q&A, so please, when you leave the room, take your trash with you; that makes the angels happy.

Pascal: I was a bit long, sorry.

Herald: Another round of applause for Pascal.

applause

34c3 outro

subtitles created by c3subtitles.de in the year 2020. Join, and help us!