34C3 preroll music
Herald: Hello fellow creatures.
Welcome and
I wanna start with a question.
Another one: Who do we trust?
Do we trust the TrustZones
on our smartphones?
Well, Keegan Ryan, we're really
fortunate to have him here, and
he was inspired by another talk from the
CCC before - I think it was 29C3 and his
research on smartphones and systems on a
chip used in smartphones will answer the
question of whether you can trust those
Trusted Execution Environments. Please
give a warm round of applause
to Keegan and enjoy!
Applause
Keegan Ryan: All right, thank you! So I'm
Keegan Ryan, I'm a consultant with NCC
group, and this is microarchitectural
attacks on Trusted Execution Environments.
So, in order to understand what a Trusted
Execution Environment is we need to go
back into processor security, specifically
on x86. So as many of you are probably
aware there are a couple different modes
which we can execute code under in x86
processors and that includes ring 3, which
is the user code and the applications, and
also ring 0 which is the kernel code. Now
there's also a ring 1 and ring 2 that are
supposedly used for drivers or guest
operating systems but really it just boils
down to ring 0 and ring 3. And in this
diagram we have here we see that privilege
increases as we go up the diagram, so ring
0 is the most privileged ring and ring 3
is the least privileged ring. So all of
our secrets, all of our sensitive
information, all of the attacker's goals
are in ring 0 and the attacker is trying
to access those from the unprivileged
world of ring 3. Now you may have a
question what if I want to add a processor
feature that I don't want ring 0 to be
able to access? Well then you add ring -1
which is often used for a hypervisor. Now
the hypervisor has all the secrets and the
hypervisor can manage different guest
operating systems and each of these guest
operating systems can execute in ring 0
without having any idea of the other
operating systems. So this way now the
secrets are all in ring -1 so now the
attacker's goals have shifted from ring 0
to ring -1. The attacker has to attack
ring -1 from a less privileged ring and
try to access those secrets. But what if
you want to add a processor feature that
you don't want ring -1 to be able to
access? So you add ring -2 which is System
Management Mode and that's capable of
monitoring power, directly interfacing
with firmware and other chips on a
motherboard and it's able to access and do
a lot of things that the hypervisor is not
able to and now all of your secrets and
all of your attacker's goals are in ring -2
and the attacker has to attack those from
a less privileged ring. Now maybe you want
to add something to your processor that
you don't want ring -2 to be able to access,
so you add ring -3 and I think you get the
picture now. And we just keep on adding
more and more privilege rings and keep
putting our secrets and our attacker's
goals in these higher and higher
privileged rings but what if we're
thinking about it wrong? What if instead
we want to put all the secrets in the
least privileged ring? So this is sort of
the idea behind SGX, and it's useful for
things like DRM, where you want to run
ring 3 code but still keep sensitive
secrets or signing capabilities in
ring 3. But this picture is getting a
little bit complicated, this diagram is a
little bit complex so let's simplify it a
little bit. We'll only be looking at ring
0 through ring 3 which is the kernel, the
userland and the SGX enclave which also
executes in ring 3. Now when you're
executing code in the SGX enclave you
first load the code into the enclave and
then from that point on you trust the
execution of whatever's going on in that
enclave. You trust that the other elements
(the kernel, the userland, the other rings)
are not going to be able to access what's
in that enclave, so you've made your
Trusted Execution Environment. This is a
bit of a weird model because now your
attacker is in the ring 0 kernel and your
target victim here is in ring 3. So
instead of the attacker trying to move up
the privilege chain, the attacker is
trying to move down. Which is pretty
strange and you might have some questions
like "under this model who handles memory
management?" because traditionally that's
something that ring 0 would manage and
ring 0 would be responsible for paging
memory in and out for the different
processes and code executing in
ring 3. But on the other hand you don't
want that to happen with the SGX enclave
because what if the malicious ring 0 adds
a page to the enclave that the enclave
doesn't expect? So in order to solve this
problem, SGX does allow ring 0 to handle
page faults. But simultaneously and in
parallel it verifies every memory load to
make sure that no access violations are
made so that all the SGX memory is safe.
So it allows ring 0 to do its job but it
sort of watches over at the same time to
make sure that nothing is messed up. So
it's a bit of a weird convoluted solution
to a strange inverted problem but it works
and that's essentially how SGX works and
the idea behind SGX. Now we can look at
x86 and we can see that ARMv8 is
constructed in a similar way but it
improves on x86 in a couple key ways. So
first of all ARMv8 gets rid of ring 1 and
ring 2 so you don't have to worry about
those and it just has different privilege
levels for userland and the kernel. And
these different privilege levels are
called exception levels in the ARM
terminology. And the second thing that ARM
gets right compared to x86 is that instead
of starting at 3 and counting down as
privilege goes up, ARM starts at 0 and
counts up so we don't have to worry about
negative numbers anymore. Now when we add
the next privilege level, the hypervisor,
we call it exception level 2, and the next
one after that is the monitor, in exception
level 3. So at this point we still want to
have the ability to run trusted code in
exception level 0 the least privileged
level of the ARMv8 processor. So in order
to support this we need to separate this
diagram into two different sections. In
ARMv8 these are called the secure world
and the non-secure world. So we have the
non-secure world on the left in blue that
consists of the userland, the kernel and
the hypervisor and we have the secure
world on the right which consists of the
monitor in exception level 3, a trusted
operating system in exception level 1 and
trusted applications in exception level 0.
So the idea is that if you run anything in
the secure world, it should not be
accessible or modifiable by anything in
the non-secure world. And that's what our
attacker is trying to get at. The
attacker has access to the non-secure
kernel, which is often Linux, and they're
trying to go after the trusted apps. So
once again we have this weird inversion
where we're trying to go from a more
privileged level to a less privileged
level and trying to extract secrets in
that way. So the question that arises when
using these Trusted Execution Environments
that are implemented in SGX and TrustZone
in ARM is "can we use these privilege
modes in our privilege access in order to
attack these Trusted Execution
Environments?". Now, with that question in
mind, we can start looking at a few
different research papers. The first one
that I want to go into is one called
CLKSCREW and it's an attack on TrustZone.
So throughout this presentation I'm going
to go through a few different papers and
just to make it clear which papers have
already been published and which ones are
new, I'll include the citations in the
upper right-hand corner, so that you
can tell what's old and what's new. And as
far as papers go this CLKSCREW paper is
relatively new. It was released in 2017.
And the way CLKSCREW works is it takes
advantage of the energy management
features of a processor. So a non-secure
operating system has the ability to manage
the energy consumption of the different
cores. So if a certain target core doesn't
have much scheduled to do then the
operating system is able to scale back
that voltage or dial down the frequency on
that core so that core uses less energy
which is a great thing for performance: it
really extends battery life, it makes the
cores last longer, and it gives better
performance overall. But the problem here
is: what if you have two separate cores,
and one of your cores is running this
non-trusted operating system while the
other core is running code in the secure
world? It's running that trusted code,
those trusted applications, yet the
non-secure operating system can still dial
down the voltage and change the frequency,
and those changes will affect
the secure-world code. So what the
CLKSCREW attack does is: the non-secure
operating system will dial down the
voltage and overclock the frequency
on the target secure-world core in order
to induce faults, to make the
computation on that core fail in some way.
And when that computation fails, you get
certain cryptographic errors that the
attacker can use to infer things like
secret AES keys, and to bypass code
signing implemented in the secure world.
So it's a very powerful attack that's made
possible because the non-secure operating
system is privileged enough in order to
use these energy management features. Now
CLKSCREW is an example of an active attack
where the attacker is actively changing
the outcome of the victim code of that
code in the secure world. But what about
passive attacks? So in a passive attack,
the attacker does not modify the actual
outcome of the process. The attacker just
tries to monitor that process and infer
what's going on, and that is the sort of attack
that we'll be considering for the rest of
the presentation. So in a lot of SGX and
TrustZone implementations, the trusted and
the non-trusted code both share the same
hardware and this shared hardware could be
a shared cache, it could be a branch
predictor, it could be a TLB. The point is
that they share the same hardware so that
the changes made by the secure code may be
reflected in the behavior of the non-
secure code. So the trusted code might
execute, change the state of that shared
cache for example and then the untrusted
code may be able to go in, see the changes
in that cache and infer information about
the behavior of the secure code. So that's
essentially how our side channel attacks
are going to work. If the non-secure code
is going to monitor these shared hardware
resources for state changes that reflect
the behavior of the secure code. Now, we
already talked about how Intel's SGX addresses
the problem of memory management and who's
responsible for making sure that those
attacks don't work on SGX. So what do they
have to say on how they protect against
these side channel attacks and attacks on
this shared cache hardware? They don't...
at all. They essentially say "we do not
consider this part of our threat model. It
is up to the developer to implement the
protections needed to protect against
these side-channel attacks". Which is
great news for us because these side
channel attacks can be very powerful and
if there aren't any hardware features
necessarily stopping us from being
able to accomplish our goal, it makes us
that much more likely to succeed. So with that
we can sort of take a step back from
TrustZone and SGX and just take a look at
cache attacks to make sure that we all
have the same understanding of how the
cache attacks will be applied to these
Trusted Execution Environments. To start
that let's go over a brief recap of how a
cache works. So caches are necessary in
processors because accessing the main
memory is slow. When you try to access
something from the main memory it takes a
while to be read into the processor. So the
cache exists as sort of a layer to
remember what that information is, so if
the processor ever needs information from
that same address it just reloads it from
the cache and that access is going to be
fast. So it really speeds up the memory
access for repeated accesses to the same
address. And then if we try to access a
different address then that will also be
read into the cache, slowly at first but
then quickly for repeated accesses and so
on and so forth. Now as you can probably
tell from all of these examples the memory
blocks have been moving horizontally
they've always been staying in the same
row. And that is reflective of the idea of
sets in a cache. So there are a number of
different set IDs and that corresponds to
the different rows in this diagram. So for
our example there are four different set
IDs and each address in the main memory
maps to a different set ID. So that
address in main memory will only go into
that location in the cache with the same
set ID so it will only travel along those
rows. So that means if you have two
different blocks of memory that map to
different set IDs, they're not going to
interfere with each other in the cache.
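As a toy illustration of this mapping (a sketch only: the 64-byte line size and four sets are chosen to match the diagram, not any particular processor), the set ID is just taken from the address bits directly above the line offset:

```python
LINE_SIZE = 64  # bytes per cache line (an assumed, typical value)
NUM_SETS = 4    # four sets, matching the four rows in the diagram

def set_id(addr):
    """An address maps to a set via the bits just above the line offset."""
    return (addr // LINE_SIZE) % NUM_SETS

# Addresses a whole stride (NUM_SETS * LINE_SIZE bytes) apart land in the
# same set and can interfere; a neighboring line goes to a different set.
print(set_id(0x000), set_id(0x100), set_id(0x040))  # → 0 0 1
```

So 0x000 and 0x100 compete for the same row of the cache, while 0x040 lives in a different row and never interferes with them.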
But that raises the question "what about
two memory blocks that do map to the same
set ID?". Well if there's room in the
cache then the same thing will happen as
before: those memory contents will be
loaded into the cache and then retrieved
from the cache for future accesses. And
the number of possible entries for a
particular set ID within a cache is called
the associativity. And on this diagram
that's represented by the number of
columns in the cache. So we will call our
cache in this example a 2-way set-
associative cache. Now the next question
is "what happens if you try to read a
memory address that maps to the same set
ID, but all of the entries for that set ID
within the cache are full?". Well one of
those entries is chosen, it's evicted from
the cache, the new memory is read in and
then that's fed to the processor. It
doesn't really matter how the evicted
cache entry is chosen; for the purposes of
this presentation you can just
assume that it's random. But the important
thing is that if you try to access that
same memory that was evicted before, you're
now going to have to wait out the time
penalty for it to be reloaded into the
cache and read into the processor. So with
caches in a nutshell, and set-associative
caches in particular, we can begin
looking at the different types of cache
attacks. So for a cache attack we have two
different processes: an attacker
process and a victim process. For this
type of attack that we're considering both
of them share the same underlying code so
they're trying to access the same
resources which could be the case if you
have page deduplication in virtual
machines or if you have copy-on-write
mechanisms for shared code and shared
libraries. But the point is that they
share the same underlying memory. Now the
Flush and Reload Attack works in two
stages for the attacker. The attacker
first starts by flushing out the cache.
They flush each and every address in the
cache, so the cache is just empty. Then the
attacker lets the victim execute for a
small amount of time, so the victim might
read an address from main memory,
loading that into the cache and then the
second stage of the attack is the reload
phase. In the reload phase the attacker
tries to load different memory addresses
from main memory and see if those entries
are in the cache or not. Here the attacker
will first try to load address 0 and see
that because it takes a long time to read
the contents of address 0 the attacker can
infer that address 0 was not part of the
cache which makes sense because the
attacker flushed it from the cache in the
first stage. The attacker then tries to
read the memory at address 1 and sees that
this operation is fast so the attacker
infers that the contents of address 1 are
in the cache and because the attacker
flushed everything from the cache before
the victim executed, the attacker then
concludes that the victim is responsible
for bringing address 1 into the cache.
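The two stages just described can be sketched with a toy timing model (all numbers are invented; a real attack uses a flush instruction and a cycle counter rather than a Python set):

```python
HIT, MISS = 40, 200   # illustrative load latencies in "cycles"
THRESHOLD = 120       # fast/slow cutoff used in the reload phase

cache = set()         # models which shared addresses are currently cached

def flush(addr):
    cache.discard(addr)          # stage 1: evict the line

def load(addr):
    latency = HIT if addr in cache else MISS
    cache.add(addr)              # loading brings the line into the cache
    return latency

for addr in range(4):            # flush phase: empty the cache
    flush(addr)

load(1)                          # victim's time slice: it touches address 1

# Reload phase: fast loads reveal what the victim brought into the cache.
accessed = [a for a in range(4) if load(a) < THRESHOLD]
print(accessed)  # → [1]
```

Only address 1 reloads quickly, so the attacker concludes the victim touched it during that time slice.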
This Flush+Reload attack reveals which
memory addresses the victim accesses
during that small slice of time. Then
after that reload phase, the attack
repeats: the attacker flushes again,
lets the victim execute, reloads again,
and so on. There's also a variant on the
Flush+Reload attack that's called the
Flush+Flush attack which I'm not going to
go into the details of, but essentially
it's the same idea. But instead of using
load instructions to determine whether or
not a piece of memory is in the cache or
not, it uses flush instructions because
flush instructions will take longer if
something is in the cache already. The
important thing is that both the
Flush+Reload attack and the Flush+Flush
attack rely on the attacker and the victim
sharing the same memory. But this isn't
always the case so we need to consider
what happens when the attacker and the
victim do not share memory. For this we
have the Prime+Probe attack. The
Prime+Probe attack once again works in two
separate stages. In the first stage the
attacker primes the cache by reading all
the attacker memory into the cache and
then the attacker lets the victim execute
for a small amount of time. So no matter
what the victim accesses from main memory
since the cache is full of the attacker
data, one of those attacker entries will
be replaced by a victim entry. Then in the
second phase of the attack, during the
probe phase, the attacker checks the
different cache entries for particular set
IDs and sees if all of the attacker
entries are still in the cache. So maybe
our attacker is curious about the last set
ID, the bottom row, so the attacker first
tries to load the memory at address 3 and
because this operation is fast the
attacker knows that address 3 is in the
cache. The attacker tries the same thing
with address 7, sees that this operation
is slow and infers that at some point
address 7 was evicted from the cache. So
the attacker knows that something had to
be evicted from the cache and it had to be
by the victim, so the attacker concludes
that the victim accessed something in that
last set ID, that bottom row. The
attacker doesn't know if it was the
contents of address 11 or the contents of
address 15 or even what those contents
are, but the attacker has a good idea of
which set ID it was. So, the important
things to remember about cache attacks are
that caches are crucial for performance
on processors, they give a huge speed
boost and there's a huge time difference
between having a cache and not having a
cache for your executables. But the
downside to this is that big time
difference also allows the attacker to
infer information about how the victim is
using the cache. We're able to use these
cache attacks in two different scenarios:
where memory is shared, in the case of the
Flush+Reload and Flush+Flush attacks, and
where memory is not shared, in the case of
the Prime+Probe attack. And finally the
important thing to keep in mind is that,
for these cache attacks, we know where the
victim is looking, but we don't know what
they see. So we don't know the contents of
the memory that the victim is actually
seeing, we just know the location and the
addresses. So, what does an example trace
of these attacks look like? Well, there's
an easy way to represent these as
two-dimensional images. So in this image, we
have our horizontal axis as time, so each
column in this image represents a
different time slice, a different
iteration of the prime and probe steps.
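How such an image gets assembled can be mocked up in a few lines; the victim's access pattern below is invented purely for illustration:

```python
NUM_SETS, TIME_SLICES = 4, 8

# Hypothetical result of repeated prime/probe rounds: the set ID the
# victim touched in each time slice.
victim_sets = [0, 2, 1, 2, 0, 2, 1, 2]

# One column per round, one row per set ID; mark a cell when the probe
# step found an attacker line evicted from that set.
trace = [[0] * TIME_SLICES for _ in range(NUM_SETS)]
for t, s in enumerate(victim_sets):
    trace[s][t] = 1

for row in trace:                # '#' plays the role of a white pixel
    print("".join("#" if cell else "." for cell in row))
```

This prints a tiny version of such a trace image, with the victim's repeating pattern visible as you read the columns left to right.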
So, then we also have the vertical axis
which is the different set IDs, which is
the location that's accessed by the victim
process, and then here a pixel is white if
the victim accessed that set ID during
that time slice. So, as you look from left
to right as time moves forward, you can
sort of see the changes in the patterns of
the memory accesses made by the victim
process. Now, for this particular example
the trace is captured on an execution of
AES repeated several times, an AES
encryption repeated about 20 times. And
you can tell that this is a repeated
action because you see the same repeated
memory access patterns in the data, you
see the same structures repeated over and
over. So, you know that this is reflecting
what's going on throughout time, but
what does it have to do with AES itself?
Well, if we take the same trace with the
same settings, but a different key, we see
that there is a different memory access
pattern with different repetition within
the trace. So, only the key changed, the
code didn't change. So, even though we're
not able to read the contents of the key
directly using this cache attack, we know
that the key is changing these memory
access patterns, and if we can see these
memory access patterns, then we can infer
the key. So, that's the essential idea: we
want to make these images as clear as
possible and as descriptive as possible so
we have the best chance of learning what
those secrets are. And we can define the
metrics for what makes these cache attacks
powerful in a few different ways. So, the
three ways we'll be looking at are spatial
resolution, temporal resolution and noise.
So, spatial resolution refers to how
accurately we can determine the "where". If
we know the memory address the victim
accessed to within 1,000 bytes, that's
obviously not as powerful as knowing where
they accessed to within 512 bytes. Temporal
resolution is similar, where we want to
know the order of what accesses the victim
made. So if that time slice during our
attack is 1 millisecond, we're going to
get much better ordering information on
those memory accesses than we would get if
we only saw all the memory accesses over
the course of one second. So the shorter
that time slice, the better the temporal
resolution, the longer our picture will be
on the horizontal axis, and the clearer
an image of the cache we'll see.
And the last metric to evaluate our
attacks on is noise and that reflects how
accurately our measurements reflect the
true state of the cache. So, right now
we've been using timing data to infer
whether or not an item was in the cache,
but this is a little bit noisy. It's
possible that we'll have false positives
or false negatives, so we want to keep
that in mind as we look at the different
attacks. So, that's essentially cache
attacks in a nutshell, and that's all you
really need in order to understand these
attacks as they've been implemented on
Trusted Execution Environments. And the first
particular attack that we're going to be
looking at is called a Controlled-Channel
Attack on SGX, and this attack isn't
necessarily a cache attack, but we can
analyze it in the same way that we analyze
the cache attacks. So, it's still useful
to look at. Now, if you remember how
memory management occurs with SGX, we know
that if a page fault occurs during SGX
Enclave code execution, that page fault is
handled by the kernel. So, the kernel has
to know which page the enclave needs
paged in, which means the kernel already
gets some information about what the
enclave is looking at. Now, in the Controlled-Channel
attack, what the attacker does
from the non-trusted OS is page almost
every other page of the enclave out of
memory. So no matter what page the
enclave tries to access, it's very
likely to cause a page
fault, which will be redirected to the
non-trusted OS, where the non-trusted OS
can record it, page out any other pages
and continue execution. So, the OS
essentially gets a list of the sequential
page accesses made by the SGX enclave, all
by capturing page faults in its page fault
handler. This is
a very general attack, you don't need to
know what's going on in the Enclave in
order to pull this off. You just load up
an arbitrary Enclave and you're able to
see which pages that Enclave is trying to
access. So, how does it do on our metrics?
First of all, the spatial resolution is
not great. We can only see where the
victim is accessing within 4096 bytes or
the size of a full page because SGX
obscures the offset into the page where
the page fault occurs. The temporal
resolution is good but not great, because
even though we're able to see any
sequential accesses to different pages
we're not able to see sequential accesses
to the same page because we need to keep
that same page paged-in while we let our
SGX Enclave run for that small time slice.
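Both properties, page-granular addresses and collapsed repeats, show up in a toy model of the channel (the enclave's access sequence here is invented; dividing by the 4096-byte page size models SGX hiding the page offset):

```python
PAGE_SIZE = 4096

# Hypothetical sequence of byte addresses touched by the enclave.
accesses = [0x1000, 0x1040, 0x1080, 0x2010, 0x2020, 0x1000]

# The fault handler only learns page numbers, and a page that is already
# resident raises no further faults, so back-to-back repeats collapse.
fault_trace = []
for addr in accesses:
    page = addr // PAGE_SIZE     # the low 12 bits are never revealed
    if not fault_trace or fault_trace[-1] != page:
        fault_trace.append(page)

print(fault_trace)  # → [1, 2, 1]
```

Six accesses become a three-entry page trace: the attacker sees each page transition, but nothing about where within a page, and nothing about repeated hits to the resident page.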
So temporal resolution is good but not
perfect. But the noise, there is no
noise in this attack, because no matter
where the page fault occurs, the untrusted
operating system is going to capture that
page fault and is going to handle it. So,
it's very low noise, not great spatial
resolution but overall still a powerful
attack. But we still want to improve on
that spatial resolution; we want to be
able to see what the enclave is doing at a
resolution finer than one
four-kilobyte page. So that's exactly what the
CacheZoom paper does, and instead of
interrupting the SGX Enclave execution
with page faults, it uses timer
interrupts. The untrusted
operating system is able to schedule when
timer interrupts occur, and it's able to
schedule them at very tight intervals, so
it gets that small and tight
temporal resolution. Essentially what
happens is: the timer
interrupt fires, the untrusted operating
system runs the Prime+Probe attack code in
this case, resumes execution of the
enclave process, and this repeats. So this
is a Prime+Probe attack on the L1 data
cache. So, this attack lets you see what
data the enclave is looking at. Now, this
attack could be easily modified to use the
L1 instruction cache, in which case you
learn which instructions the enclave is
executing. And overall this is an even
more powerful attack than the Control-
Channel attack. If we look at the metrics,
we can see that the spatial resolution is
a lot better, now we're looking at spatial
resolution of 64 bytes, the size of an
individual cache line. The temporal resolution
is very good, it's "almost unlimited", to
quote the paper, because the untrusted
operating system has the privilege to keep
scheduling those timer interrupts closer
and closer together until it's able to
capture very small time slices of the
victim process. And the noise itself is
low: we're still using a cycle counter to
measure the time it takes to load memory
in and out of the cache, but it's usable;
the chances of having a false
positive or false negative are low, so the
noise is low as well. Now, we can also
look at TrustZone attacks, because so far
the attacks that we've looked at, the
passive attacks, have been against SGX and
those attacks on SGX have been pretty
powerful. So, what are the published
attacks on TrustZone? Well, there's one
called TruSpy, which is kind of similar in
concept to the CacheZoom attack that we
just looked at on SGX. It's once again a
Prime+Probe attack on the L1 data cache,
and the difference here is that instead of
interrupting the victim code execution
multiple times, the TruSpy attack does the
prime step, does the full AES encryption,
and then does the probe step. And the
reason they do this, is because as they
say, the secure world is protected, and is
not interruptible in the same way that SGX
is. But even despite this,
with just one measurement per execution,
the TruSpy authors were able to use some
statistics to still recover the AES key
despite that noise. And their methods were
so powerful that they were able to do this
from an unprivileged application in
userland, so they don't even need to be
running within the kernel in order to be
able to pull off
this attack. So, how does this attack
measure up? The spatial resolution is once
again 64 bytes because that's the size of
a cache line on this processor, and the
temporal resolution is pretty poor
here, because we only get one measurement
per execution of the AES encryption. This
is also a particularly noisy attack
because we're making the measurements from
the user land, but even if we make the
measurements from the kernel, we're still
going to have the same issues of false
positives and false negatives associated
with using a cycle counter to measure
membership in a cache. So, we'd like to
improve this a little bit. We'd like to
improve the temporal resolution, so that
the power of the cache attack on TrustZone
comes a little bit closer to what it is on
SGX. Let's dig into that
statement a little bit, that the secure
world is protected and not interruptable.
And to do this, we go back to this diagram
of ARMv8 and how that TrustZone is set up.
So, it is true that when an interrupt
occurs, it is directed to the monitor and,
because the monitor operates in the secure
world, if we interrupt secure code that's
running in exception level 0, we're just
going to end up running secure code in
exception level 3. So, this doesn't
necessarily get us anything. I think
that's what the authors mean by saying
that it's protected. Just by
setting an interrupt, we don't have a
way to redirect our flow to the non-
trusted code. At least that's how it works
in theory. In practice, the Linux
operating system, running in exception
level 1 in the non-secure world, kind of
needs interrupts in order to be able to
work, so if an interrupt occurs and it's
being sent to the monitor, the monitor
will just forward it right to the non-
secure operating system. So, we have
interrupts just the same way as we did in
CacheZoom. And we can improve the
TrustZone attacks by using this idea: We
have 2 cores, where one core is running
the secure code, the other core is running
the non-secure code, and the non-secure
code is sending interrupts to the
secure-world core, and that will give us
the interleaving of attacker process and
victim process that allows us to have a
powerful Prime+Probe attack. So, what
does this look like? We have the attack
core and the victim core. The attack core
sends an interrupt to the victim core.
This interrupt is captured by the monitor,
which passes it to the non-secure
operating system. The non-secure operating
system transfers this to our attack code,
which runs the Prime+Probe attack.
Then we return from the interrupt,
execution of the victim code in the
secure world resumes, and we just repeat
this over and over. So, now we have that
interleaving of the attacker and victim
processes. So, now,
instead of having a temporal resolution of
one measurement per execution, we once
again have almost unlimited temporal
resolution, because we can just schedule
when we send those interrupts from the
attacker core. Now, we'd also like to
reduce the noise in the measurements,
because if we can reduce the noise, we'll
get clearer pictures and we'll be able to
infer those secrets more clearly. So, we
can get some improvement by moving the
measurements from userland into the
kernel, but we're still relying on
the cycle counters. So, what if, instead
of using the cycle counter to measure
whether or not something is in the cache,
we use the other performance counters?
Because on ARMv8 platforms, there is a way
to use performance counters to measure
different events, such as cache hits and
cache misses. So, these events and these
performance monitors require privileged
access to use, which, for this
attack, we do have. Now, in a typical
cache attack scenario we wouldn't have
access to these performance monitors,
which is why they haven't really been
explored before, but in this weird
scenario where we're attacking the less
privileged code from the more privileged
code, we do have access to these
performance monitors and we can use these
monitors during the probe step to get a
very accurate count of whether or not a
certain memory load caused a cache miss or
a cache hit. So, we're able to essentially
get rid of the different levels of noise.
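Why an exact event count beats a noisy cycle count can be sketched as follows (the latency and jitter figures are invented for illustration):

```python
import random

HIT_BASE, MISS_BASE = 40, 200    # illustrative latencies in cycles

def timed_probe(is_hit, jitter=200):
    """Cycle-counter probe: true latency plus measurement noise."""
    return (HIT_BASE if is_hit else MISS_BASE) + random.randint(0, jitter)

def pmu_probe(misses_before, misses_after):
    """Performance-counter probe: the PMU reports any miss exactly."""
    return misses_after == misses_before   # no new miss means it was a hit

# With enough jitter the latency ranges overlap, so no timing threshold
# can separate hits from misses without occasional errors...
slowest_hit, fastest_miss = HIT_BASE + 200, MISS_BASE
print(slowest_hit >= fastest_miss)        # → True

# ...whereas the event count is unambiguous.
print(pmu_probe(5, 5), pmu_probe(5, 6))   # → True False
```

The timing-based classifier can be wrong whenever noise pushes a hit past the threshold; reading the miss counter before and after the probe load removes that ambiguity entirely.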
Now, one thing to point out is that maybe
we'd like to use these ARMv8 performance
counters in order to count the different
events that are occurring in the secure
world code. So, maybe we start the
performance counters from the non-secure
world, let the secure world run and then,
when the secure world exits, we use the
non-secure world to read these performance
counters and maybe we'd like to see how
many instructions the secure world
executed or how many branch instructions
or how many arithmetic instructions or how
many cache misses there were. But
unfortunately, ARMv8 took this into
account and by default, performance
counters that are started in the non-
secure world will not measure events that
happen in the secure world, which is
smart; which is how it should be. And the
only reason I bring this up is because
that's not how it is on ARMv7. You could
do a whole different talk on that,
just exploring the different implications
of what that means, but I want to focus on
ARMv8, because that's the newest of
the new. So, we'll keep looking at that.
So, we instrument the prime-and-probe attack
to use these performance counters, so we
can get a clear picture of what is and
what is not in the cache. And instead of
having noisy measurements based on time,
we have virtually no noise at all, because
we get the truth straight from the
processor itself, whether or not we
experience a cache miss. So, how do we
implement these attacks, where do we go
from here? We have all these ideas; we
have ways to make these TrustZone attacks
more powerful, but that's not worthwhile,
unless we actually implement them. So, the
goal here is to implement these attacks on
TrustZone and since typically the non-
secure world operating system is based on
Linux, we'll take that into account when
making our implementation. So, we'll write
a kernel module that uses these
performance counters and these inter-
processor interrupts, in order to actually
accomplish these attacks; and we'll write
it in such a way that it's very
generalizable. So you can take this kernel
module that was written for one device
-- in my case I focused most of my
attention on the Nexus 5X -- and it's very
easy to transfer this module to any other
Linux-based device that has a TrustZone and
these shared caches, so it should be very
easy to port this over and to perform
these same powerful cache attacks on
different platforms. We can also do clever
things based on the Linux operating
system, so that we limit that collection
window to just when we're executing within
the secure world, so we can align our
traces a lot more easily that way. And the
end result is having a synchronized trace
for each different attack, because, since
we've written it in a modular way, we're
able to run different attacks simultaneously.
So, maybe we're running one prime-and-
probe attack on the L1 data cache, to
learn where the victim is accessing
memory, and we're simultaneously running
an attack on the L1 instruction cache, so
we can see what instructions the victim is
executing. And these can be aligned. So,
the tool that I've written is a
combination of a kernel module which
actually performs this attack, a userland
binary which schedules these processes to
different cores, and a GUI that will allow
you to interact with this kernel module
and rapidly start doing these cache
attacks for yourself and perform them
against different processes and secure-
world code. So, the intention behind this
tool is to be very generalizable, to make
it very easy to use this platform on
different devices, and to allow people to,
once again, quickly develop these attacks;
and also to see if their own code is
vulnerable to these cache attacks, to see
if their code has these secret-dependent
memory accesses.
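To make concrete what such a secret-dependent memory access looks like, here is a small hypothetical example in the style of a table-based cipher; the table base address and values are made up for illustration. The cache line the lookup touches depends on the secret, which is exactly what a prime-and-probe trace reveals.

```python
# A lookup table indexed by secret data: the classic pattern these
# cache attacks exploit. The base address and sizes are hypothetical,
# chosen only to illustrate the leak.

LINE_SIZE = 64        # bytes per cache line: the spatial resolution
TABLE_BASE = 0x1000   # made-up address where the table is loaded

def leaked_line(secret_byte):
    # The address loaded is TABLE_BASE + secret_byte, so the 64-byte
    # cache line it falls in reveals the upper bits of the secret.
    return (TABLE_BASE + secret_byte) // LINE_SIZE

print(leaked_line(0x00))  # -> 64: secrets 0x00-0x3F share this line
print(leaked_line(0x3F))  # -> 64: low 6 bits stay invisible...
print(leaked_line(0x40))  # -> 65: ...but the top 2 bits leak
```

An attacker watching which line is hot thus learns two bits of each secret byte per lookup; a constant-time rewrite would avoid indexing memory with secret data at all.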
So, can we get even better... spatial
resolution? Right now, we're down to 64
bytes and that's the size of a cache line,
which is the size of our shared hardware.
And on SGX, we actually can get better
than 64 bytes, based on something called a
branch-shadowing attack. So, a branch-
shadowing attack takes advantage of
something called the branch target buffer.
And the branch target buffer is a
structure that's used for branch
prediction. It's similar to a cache, but
there's a key difference: the branch
target buffer doesn't compare the full
address when checking whether something is
already in the cache or not; it doesn't
compare all of the upper address bits. So,
that means that it's possible that two
different addresses will experience a
collision, and the same entry from that
BTB cache will be read out for an improper
address. Now, since this is just for
branch prediction, the worst that can
happen is, you'll get a misprediction and
a small time penalty, but that's about it.
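A minimal sketch of that collision, under assumed parameters: if the BTB indexes entries with only the low address bits and never compares the upper ones, two branches far apart in memory land in the same entry. The exact indexing function here is an assumption for illustration; the 2048 sets and 16-byte granularity are the Nexus 5X figures the talk gives later, reused for concreteness.

```python
# Sketch of a branch-target-buffer index that ignores upper address
# bits. The indexing function is an assumption for illustration; the
# 2048 sets and 16-byte granularity are the Nexus 5X figures from the
# talk, reused here for concreteness.

NUM_BTB_SETS = 2048
GRANULARITY = 16   # bytes: branches this close are indistinguishable

def btb_set(addr):
    # Only the low bits select the entry; upper bits are not compared.
    return (addr // GRANULARITY) % NUM_BTB_SETS

victim_branch = 0x0F00_4A40    # hypothetical branch inside the enclave
shadow_branch = 0x7FF0_4A40    # attacker branch with the same low bits

# Different addresses, same BTB entry: the attacker's "shadow" branch
# collides with the victim's, so the victim's prediction state shows
# up as a (mis)prediction of the attacker's branch.
print(btb_set(victim_branch) == btb_set(shadow_branch))  # -> True
print(victim_branch == shadow_branch)                    # -> False
```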
The idea behind the branch-shadowing
attack is leveraging the small difference
in this overlapping and this collision of
addresses in order to sort of execute a
shared-code-style flush-and-reload attack
on the branch target buffer. So, here what
goes on is, during the attack the attacker
modifies the SGX Enclave to make sure that
the branches that are within the Enclave
will collide with branches that are not in
the Enclave. The attacker executes the
Enclave code and then the attacker
executes their own code, and based on the
outcome of the victim code's branch in
that cache, the attacker code may or may
not experience a branch misprediction. So, the
attacker is able to tell the outcome of a
branch, because of this overlap and this
collision, like there would be in a flush-
and-reload attack, where those memories
overlap between the attacker and the
victim. So here, our spatial resolution is
fantastic: We can tell down to individual
branch instructions in SGX; we can tell
exactly, which branches were executed and
which directions they were taken, in the
case of conditional branches. The temporal
resolution is also, once again, almost
unlimited, because we can use the same
timer interrupts in order to schedule our
process, our attacker process. And the
noise is, once again, very low, because we
can, once again, use the same sort of
branch-misprediction counters that exist
in the Intel world in order to make this
measurement. So, does anything of that
apply to the TrustZone attacks? Well, in
this case the victim and attacker don't
share entries in the branch target buffer,
because the attacker is not able to map
the virtual address of the victim process.
But this is kind of reminiscent of our
earlier cache attacks: our flush-and-
reload attack only worked when the attacker
and the victim shared that memory, but we
still have the prime-and-probe attack for
when they don't. So, what if we use a
prime-and-probe-style attack on the branch
target buffer cache in ARM processors? So,
essentially what we do here is, we prime
the branch target buffer by executing many
attacker branches to sort of fill up this
BTB cache with the attacker's branch
prediction data; we let the victim execute
a branch which will evict an attacker BTB
entry; and then we have the attacker re-
execute those branches and see if there
have been any mispredictions. So now, the
cool thing about this attack is, the
structure of the BTB cache is different
from that of the L1 caches. So, instead of
having 256 different sets in the L1 cache,
the BTB cache has 2048 different sets, so
we can tell which branch the victim
executed, based on which one of the 2048
different set IDs it falls into. And even more
than that, on the ARM platform, at least
on the Nexus 5x that I was working with,
the granularity is no longer 64 bytes,
which is the size of the line, it's now 16
bytes. So, we can see which branches the
trusted code within TrustZone is
executing within 16 bytes. So, what does
this look like? So, previously with the
TruSpy attack, this is sort of the
outcome of our prime-and-probe attack: We
get 1 measurement for those 256 different
set IDs. When we added those interrupts,
we're able to get that time resolution,
and it looks something like this. Now,
maybe you can see a little bit at the top
of the screen, how there's these repeated
sections of little white blocks, and you
can sort of use that to infer, maybe
there's the same cache line and cache
instructions that are called over and
over. So, just looking at this L1-I cache
attack, you can tell some information
about how the process went. Now, let's
compare that to the BTB attack. And I
don't know if you can see it too clearly --
it's a bit too high a resolution
right now -- so let's just focus in on one
small part of this overall trace. And this
is what it looks like. So, each of those
white pixels represents a branch that was
taken by that secure-world code and we can
see repeated patterns, we can see maybe
different functions that were called, we
can see different loops. And just by
looking at this 1 trace, we can infer a
lot of information on how that secure
world executed. So, it's incredibly
powerful and all of those secrets are just
waiting to be uncovered using these new
tools. So, where do we go from here? What
sort of countermeasures do we have? Well,
first of all I think, the long term
solution is going to be moving to no more
shared hardware. We need to have separate
hardware and no more shared caches in
order to fully get rid of these different
cache attacks. And we've already seen this
trend in different cell phones. So, for
example, in Apple SoCs for a long time now
-- I think since the Apple A7 -- the
Secure Enclave, which runs the secure
code, has its own cache. So, these cache
attacks can't be accomplished from code
outside of that secure Enclave. So, just
by using that separate hardware, it knocks
out a whole class of different potential
side-channel and microarchitectural
attacks. And just recently, the Pixel 2 is
moving in the same direction. The Pixel 2
now includes a hardware security module
that performs cryptographic operations;
and that chip also has its own memory and
its own caches, so now we can no longer
use this attack to extract information
about what's going on in this external
hardware security module. But even then,
using this separate hardware, that doesn't
solve all of our problems. Because we
still have the question of "What do we
include in this separate hardware?" On the
one hand, we want to include more code in
that separate hardware, so we're less
vulnerable to these side-channel attacks,
but on the other hand, we don't want to
expand the attack surface any further.
Because the more code we include in these
secure environments, the more likely that
a vulnerability will be found and the
attacker will be able to get a foothold
within the secure, trusted environment.
So, there's going to be a balance between
what you choose to include in the
separate hardware and what you don't. So,
do you include DRM code? Do you include
cryptographic code? It's still an open
question. And that's sort of the long-term
approach. In the short term, you just kind
of have to write side-channel-free
software: Just be very careful about what
your process does, if there are any
secret-dependent memory accesses,
secret-dependent branches, or secret-
dependent function calls, because any of
those can leak the secrets out of your
trusted execution environment. So, here
are the things that, if you are a
developer of trusted execution environment
code, that I want you to keep in mind:
First of all, performance is very often at
odds with security. We've seen over and
over that the performance enhancements to
these processors open up the ability for
these microarchitectural attacks to be
more efficient. Additionally, these
trusted execution environments don't
protect against everything; there are
still these side-channel attacks and these
microarchitectural attacks that these
systems are vulnerable to. These attacks
are very powerful; they can be
accomplished simply; and with the
publication of the code that I've written,
it should be very simple to get set up and
to analyze your own code to see "Am I
vulnerable, do I expose information in the
same way?" And lastly, it only takes 1
small error, 1 tiny leak from your trusted
and secure code, in order to extract the
entire secret, in order to bring the whole
thing down. So, what I want to leave you
with is: I want you to remember that you
are responsible for making sure that your
program is not vulnerable to these
microarchitectural attacks, because if you
do not take responsibility for this, who
will? Thank you!
Applause
Herald: Thank you very much. Please, if
you want to leave the hall, please do it
quietly and take all your belongings with
you and respect the speaker. We have
plenty of time, 16, 17 minutes for Q&A, so
please line up on the microphones. No
questions from the signal angel, all
right. So, we can start with microphone 6,
please.
Mic 6: Okay. There was a symbol of secure
OSes in the ARM TrustZone. What is the
idea of them, if the non-secure OS gets
all the interrupts? What is
the secure OS for?
Keegan: Yeah so, in the ARMv8 there are a
couple different kinds of interrupts. So,
I think -- if I'm remembering the
terminology correctly -- there is an IRQ
and an FIQ interrupt. So, the non-secure
mode handles the IRQ interrupts and the
secure mode handles the FIQ interrupts.
So, which one you send will determine in
which direction the monitor
directs that interrupt.
Mic 6: Thank you.
Herald: Okay, thank you. Microphone number
7, please.
Mic 7: Do any of your presented attacks on
TrustZone also apply to the AMD
implementation of TrustZone or are you
looking into it?
Keegan: I haven't looked into AMD too
much, because, as far as I can tell,
that's not used as commonly, but there are
many different types of trusted execution
environments. The two that I focused on were
SGX and TrustZone, because those are the
most common examples that I've seen.
Herald: Thank you. Microphone
number 8, please.
Mic 8: When TrustZone is moved to
dedicated hardware, dedicated memory,
couldn't you replicate the userspace
attacks by loading your own trusted
userspace app and use it as an
oracle of some sorts?
Keegan: If you can load your own trusted
code, then yes, you could do that. But in
many of the models I've seen today, that's
not possible. So, that's why you have
things like code signing, which prevent
the arbitrary user from running their own
code in the trusted OS... or in the
trusted environment.
Herald: All right. Microphone number 1.
Mic 1: So, these attacks are more powerful
against code that's running in trusted
execution environments than similar
attacks would be against ring-3 code or,
in general, untrusted code. Does that mean
that trusted execution environments are
basically an attractive nuisance that we
shouldn't use?
Keegan: There's still a large benefit to
using these trusted execution
environments. The point I want to get
across is that, although they add a lot of
features, they don't protect against
everything, so you should keep in mind
that these side-channel attacks do still
exist and you still need to protect
against them. But overall, these are good
things and worthwhile to include.
Herald: Thank you. Microphone number 1
again, please
Mic 1: So, AMD is doing something with
encrypting memory, and I'm not sure if
they encrypt addresses too, but would that
be a defense against such attacks?
Keegan: So, I'm not too familiar with AMD,
but SGX also encrypts memory. It encrypts
it in between the lowest-level cache and
the main memory. But that doesn't really
have an impact on the actual operation,
because the memory is encrypted at the
cache-line level and, as the attacker, we don't
care what that data is within that cache
line, we only care which cache line is
being accessed.
Mic 1: If you encrypt addresses, wouldn't
that help against that?
Keegan: I'm not sure how you would
encrypt the addresses yourself. As long as
the attacker's addresses map into the same
set IDs that the victim's map into, then
the attacker could still pull off the same
style of attacks.
Herald: Great. We have a question from the
internet, please.
Signal Angel: The question is "Does the
secure enclave on the Samsung Exynos
distinguish the receiver of the message, so
that if the user application asked to
decode an AES message, can one sniff on
the value that the secure
enclave returns?"
Keegan: So, that sounds like it's asking
about the TruSpy-style attack, where
it's calling to the secure world to
encrypt something with AES. I think that
would all depend on the particular
implementation: As long as it's encrypting
with a certain key and it's able to do
that repeatably, then the attack,
assuming a vulnerable AES implementation,
would be able to extract that key.
Herald: Cool. Microphone number 2, please.
Mic 2: Do you recommend a reference to
understand how these cache line attacks
and branch oracles actually lead to key
recovery?
Keegan: Yeah. So, I will flip through
these pages which include a lot of the
references for the attacks that I've
mentioned, so if you're watching the
video, you can see these right away or
just access the slides. And a lot of these
contain good starting points. So, I didn't
go into a lot of the details on how, for
example, the TruSpy attack recovered
that AES key, but that paper does have a
lot of good links on how those leaks can
lead to key recovery. Same thing with the
CLKSCREW attack, how the different fault
injection can lead to key recovery.
Herald: Microphone number 6, please.
Mic 6: I think my question might have been
almost the same thing: How hard is
it actually to recover the keys? Is this
like a massive machine learning problem or
is this something that you can do
practically on a single machine?
Keegan: It varies entirely by the
implementation. So, for all these attacks
to work, you need to have some sort of
vulnerable implementation and some
implementations leak more data than
others. In the case of a lot of the AES
attacks, where you're doing the passive
attacks, those are very easy to do on just
your own computer. For the AES fault
injection attack, I think that one
required more brute force, in the CLKSCREW
paper, so that one required more computing
resources, but still, it was entirely
practical to do in a realistic setting.
Herald: Cool, thank you. So, we have one
more: Microphone number 1, please.
Mic 1: So, I hope it's not too naive a
question, but I was wondering, since all
these attacks are based on cache hit and
misses, isn't it possible to forcibly
flush or invalidate or insert noise in
cache after each operation in this trusted
environment, in order to mess up the
guesswork of the attacker? So, discarding
optimization and performance for
additional security benefits.
Keegan: Yeah, and that is absolutely
possible and you are absolutely right: It
does lead to a performance degradation,
because if you always flush the entire
cache every time you do a context switch,
that will be a huge performance hit. So
again, that comes down to the question of
the performance and security trade-off:
Which one do you end up going with? And it
seems historically the choice has been
more in the direction of performance.
Mic 1: Thank you.
Herald: But we have one more: Microphone
number 1, please.
Mic 1: So, I have more of a moral
question: How well should we really
protect against attacks which need some
ring-0 cooperation? Because, basically,
when we use TrustZone for a purpose we
would see as clearly good, like protecting
the browser from the outside world, then
we are basically using the secure
execution environment for sandboxing
the process. But once an attack needs some
cooperation from the kernel, such
attacks, in fact, empower the user
instead of the hardware producer.
Keegan: Yeah, and you're right. It
depends entirely on what your application
is and what your threat model is that
you're looking at. So, if you're using
these trusted execution environments to do
DRM, for example, then maybe you would
be worried about that ring-0 attack or
that privileged attacker who has their
phone rooted and is trying to recover
these media encryption keys from this
execution environment. But maybe there are
other scenarios where you're not as
worried about having an attack with a
compromised ring 0. So, it entirely
depends on context.
Herald: Alright, thank you. So, we have
one more: Microphone number 1, again.
Mic 1: Hey there. Great talk, thank you
very much.
Keegan: Thank you.
Mic 1: Just a short question: Do you have
any success stories about attacking the
TrustZone and the different
implementations of TE with some vendors
like some OEMs creating phones and stuff?
Keegan: Not that I'm announcing
at this time.
Herald: So, thank you very much. Please,
again a warm round of applause for Keegan!
Applause
34c3 postroll music
subtitles created by c3subtitles.de
in the year 2018. Join, and help us!