WEBVTT
00:00:00.000 --> 00:00:13.245
Music
00:00:13.245 --> 00:00:17.060
Herald Angel: We are here with a motto,
and the motto of this year is "Works For
00:00:17.060 --> 00:00:21.670
Me" and I think, who many people, how many
people in here are programmmers? Raise
00:00:21.670 --> 00:00:28.700
your hands or shout or... Whoa, that's a
lot. Okay. So I think many of you will
00:00:28.700 --> 00:00:38.990
work on x86. Yeah. And I think you assume
that it works, and that everything works
00:00:38.990 --> 00:00:48.150
as intended. And I mean: What could go
wrong? Our next talk, the first one today,
00:00:48.150 --> 00:00:52.290
will be by Clémentine Maurice, who
previously was here with RowhammerJS,
00:00:52.290 --> 00:01:01.740
something I would call scary, and Moritz
Lipp, who has worked on the Armageddon
00:01:01.740 --> 00:01:09.820
exploit, back, what is it? Okay, so the
next... I would like to hear a really warm
00:01:09.820 --> 00:01:14.460
applause for the speakers for the talk
"What could what could possibly go wrong
00:01:14.460 --> 00:01:17.280
with insert x86 instruction here?"
00:01:17.280 --> 00:01:18.375
thank you.
00:01:18.375 --> 00:01:28.290
Applause
00:01:28.290 --> 00:01:32.530
Clémentine Maurice (CM): Well, thank you
all for being here this morning. Yes, this
00:01:32.530 --> 00:01:38.080
is our talk "What could possibly go wrong
with insert x86 instructions here". So
00:01:38.080 --> 00:01:42.850
just a few words about ourselves: So I'm
Clémentine Maurice, I got my PhD last year
00:01:42.850 --> 00:01:47.119
in computer science and I'm now working as
a postdoc at Graz University of Technology
00:01:47.119 --> 00:01:52.090
in Austria. You can reach me on Twitter or
by email, but there's also, I think, a lot
00:01:52.090 --> 00:01:56.670
of time before the Congress is over.
Moritz Lipp (ML): Hi and my name is Moritz
00:01:56.670 --> 00:02:01.520
Lipp, I'm a PhD student at Graz University
of Technology and you can also reach me on
00:02:01.520 --> 00:02:06.679
Twitter or just after our talk and in the
next days.
00:02:06.679 --> 00:02:10.860
CM: So, about this talk: So, the title
says this is a talk about x86
00:02:10.860 --> 00:02:17.720
instructions, but this is not a talk about
software. Don't leave yet! I'm actually
00:02:17.720 --> 00:02:22.440
even assuming safe software and the point
that we want to make is that safe software
00:02:22.440 --> 00:02:27.390
does not mean safe execution and we have
information leakage because of the
00:02:27.390 --> 00:02:32.560
underlying hardware and this is what we're
going to talk about today. So we'll be
00:02:32.560 --> 00:02:36.819
talking about cache attacks, what are
they, what can we do with that and also a
00:02:36.819 --> 00:02:41.510
special kind of cache attack that we found
this year. So... doing cache attacks
00:02:41.510 --> 00:02:48.590
without memory accesses and how to use
that even to bypass kernel ASLR.
00:02:48.590 --> 00:02:53.129
So again, the talk says is to talk about
x86 instructions but this is even more
00:02:53.129 --> 00:02:58.209
global than that. We can also mount these
cache attacks on ARM and not only on the
00:02:58.209 --> 00:03:07.050
x86. So some of the examples that you will
see also apply to ARM. So today we'll
00:03:07.050 --> 00:03:11.420
have a bit of background, but actually
most of the background will be given along
00:03:11.420 --> 00:03:19.251
the way, because this covers a really huge
chunk of our research, and we'll see
00:03:19.251 --> 00:03:24.209
mainly three instructions: So "mov" and
how we can perform these cache attacks,
00:03:24.209 --> 00:03:29.430
what are they... The instruction
"clflush", so here we'll be doing cache
00:03:29.430 --> 00:03:36.370
attacks without any memory accesses. Then
we'll see "prefetch" and how we can bypass
00:03:36.370 --> 00:03:43.420
kernel ASLR and lots of translation
levels, and then there's even a bonus
00:03:43.420 --> 00:03:48.950
track, so this will not be our own
work, but even more instructions and even
00:03:48.950 --> 00:03:54.210
more attacks.
Okay, so let's start with a bit of an
00:03:54.210 --> 00:04:01.190
introduction. So we will be mainly
focusing on Intel CPUs, and this is
00:04:01.190 --> 00:04:05.599
roughly in terms of cores and caches, how
it looks like today. So we have different
00:04:05.599 --> 00:04:09.440
cores
so here four cores, and different levels
00:04:09.440 --> 00:04:14.220
of caches. So here usually we have three
levels of caches. We have level 1 and
00:04:14.220 --> 00:04:18.269
level 2 that are private to each core,
which means that core 0 can only access
00:04:18.269 --> 00:04:24.520
its level 1 and its level 2 and not level
1 and level 2 of, for example, core 3, and
00:04:24.520 --> 00:04:30.130
we have the last level cache... so here if
you can see the pointer... So this one is
00:04:30.130 --> 00:04:36.289
divided in slices so we have as many
slices as cores, so here 4 slices, but all
00:04:36.289 --> 00:04:40.659
the slices are shared across cores, so core
0 can access the whole last level cache,
00:04:40.659 --> 00:04:48.669
that's slices 0, 1, 2, and 3. We also have a nice
property on Intel CPUs, which is that the last
00:04:48.669 --> 00:04:52.280
level cache is inclusive, and what it means
is that everything that is contained in
00:04:52.280 --> 00:04:56.889
level 1 and level 2 will also be contained
in the last level cache, and this will
00:04:56.889 --> 00:05:01.439
prove to be quite useful for cache
attacks.
00:05:01.439 --> 00:05:08.430
So today we mostly have set associative
caches. What it means is that we have data
00:05:08.430 --> 00:05:13.249
that is loaded in specific sets and that
depends only on its address. So we have
00:05:13.249 --> 00:05:18.900
some bits of the address that give us the
index and that says "Ok the line is going
00:05:18.900 --> 00:05:24.610
to be loaded in this cache set", so this
is a cache set. Then we have several ways
00:05:24.610 --> 00:05:30.629 line:1
per set so here we have 4 ways and the
cache line is going to be loaded in a
00:05:30.629 --> 00:05:35.270
specific way and that will only depend on
the replacement policy and not on the
00:05:35.270 --> 00:05:40.800
address itself, so when you load a line
into the cache, usually the cache is
00:05:40.800 --> 00:05:44.830
already full and you have to make room for
a new line. So this is where the
00:05:44.830 --> 00:05:49.729
replacement policy comes in: it says okay,
I'm going to
00:05:49.729 --> 00:05:57.779
remove this line to make room for the next
line. So for today we're going to see only
00:05:57.779 --> 00:06:01.960
three instructions, as I've been telling
you. So the mov instruction: it does a
00:06:01.960 --> 00:06:06.610
lot of things but the only aspect that
we're interested in is that it can
00:06:06.610 --> 00:06:12.809
access data in the main memory.
We're going to see clflush: what it does
00:06:12.809 --> 00:06:18.349
is that it removes a cache line from the
cache, from the whole cache. And we're
00:06:18.349 --> 00:06:25.569
going to see prefetch, it prefetches a
cache line for future use. So we're going
00:06:25.569 --> 00:06:30.520
to see what they do and the kind of side
effects that they have and all the attacks
00:06:30.520 --> 00:06:34.800
that we can do with them. And that's
basically all the examples you need for
00:06:34.800 --> 00:06:39.830
today so even if you're not an expert of
x86 don't worry it's not just slides full
00:06:39.830 --> 00:06:44.899
of assembly and stuff. Okay so on to the
first one.
00:06:44.899 --> 00:06:49.940
ML: So we will first start with the 'mov'
instruction and actually the first slide
00:06:49.940 --> 00:06:57.809
is full of code, however as you can see
the mov instruction is used to move data
00:06:57.809 --> 00:07:02.629
from registers to registers, from the main
memory and back to the main memory and as
00:07:02.629 --> 00:07:07.240
you can see there are many moves you can
use but basically it's just to move data
00:07:07.240 --> 00:07:12.589
and that's all we need to know. In
addition, a lot of exceptions can occur so
00:07:12.589 --> 00:07:18.139
we can assume that those restrictions are
so tight that nothing can go wrong when
00:07:18.139 --> 00:07:22.210
you just move data because moving data is
simple.
00:07:22.210 --> 00:07:27.879
However while there are a lot of
exceptions the data that is accessed is
00:07:27.879 --> 00:07:35.009
always loaded into the cache, so data is
in the cache and this is transparent to
00:07:35.009 --> 00:07:40.870
the program that is running. However,
there are side-effects when you run these
00:07:40.870 --> 00:07:46.219
instructions, and we will see what they
look like with the mov instruction. So you
00:07:46.219 --> 00:07:51.289
probably all know that data can either be
in CPU registers, in the different levels
00:07:51.289 --> 00:07:56.029
of the cache that Clementine showed to you
earlier, in the main memory, or on the
00:07:56.029 --> 00:08:02.219
disk, and depending on where the memory
and the data is located it needs a longer
00:08:02.219 --> 00:08:09.689
time to be loaded back to the CPU, and
this is what we can see in this plot. So
00:08:09.689 --> 00:08:15.739
we try here to measure the access time of
an address over and over again, assuming
00:08:15.739 --> 00:08:21.759
that when we access it more often, it is
already stored in the cache. So around 70
00:08:21.759 --> 00:08:27.289
cycles, most of the time we can assume
when we load an address and it takes 70
00:08:27.289 --> 00:08:34.809
cycles, it's loaded into the cache.
However, when we assume that the data is
00:08:34.809 --> 00:08:39.659
loaded from the main memory, we can
clearly see that it needs a much longer
00:08:39.659 --> 00:08:46.720
time like a bit more than 200 cycles. So
depending on when we measure the time it
00:08:46.720 --> 00:08:51.470
takes to load the address we can say the
data has been loaded to the cache or the
00:08:51.470 --> 00:08:58.339
data is still located in the main memory.
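The measurement they describe can be sketched in C. This is our own illustrative sketch, not the speakers' code: the function name is ours, and the `__rdtscp` intrinsic assumes GCC/Clang on an x86 CPU.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp (GCC/Clang, x86) */

/* Illustrative only: time one load of *addr in CPU cycles.
 * rdtscp waits for earlier instructions to retire; a hardened
 * attack would add extra fencing around the access. */
static inline uint64_t time_access(volatile uint8_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                       /* the mov being measured */
    return __rdtscp(&aux) - start;
}
```

A cache hit then shows up as a small delta (around 70 cycles in their plot) and a main-memory access as a much larger one (over 200 cycles); the exact numbers are machine-specific.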
And this property is what we can exploit
00:08:58.339 --> 00:09:05.339
using cache attacks. So we measure the
timing differences on memory accesses. And
00:09:05.339 --> 00:09:09.940
what an attacker does is monitor the
cache lines, but he has no way to know
00:09:09.940 --> 00:09:14.459
what's actually the content of the cache
line. So we can only monitor that this
00:09:14.459 --> 00:09:20.099
cache line has been accessed and not
what's actually stored in the cache line.
00:09:20.099 --> 00:09:24.411
And what you can do with this is you can
implement covert channels, so you can
00:09:24.411 --> 00:09:29.580
allow two processes to communicate with
each other evading the permission system
00:09:29.580 --> 00:09:35.060
which we will see later on. In addition you
can also do side channel attacks, so you
00:09:35.060 --> 00:09:40.600
can spy with a malicious attacking
application on benign processes, and you
00:09:40.600 --> 00:09:46.140
can use this to steal cryptographic keys
or to spy on keystrokes.
00:09:46.140 --> 00:09:53.649
And basically we have different types of
cache attacks and I want to explain the
00:09:53.649 --> 00:09:58.810
most popular one, the "Flush+Reload"
attack, in the beginning. So on the left,
00:09:58.810 --> 00:10:03.110
you have the address space of the victim,
and on the right you have the address
00:10:03.110 --> 00:10:08.560
space of the attacker who maps a shared
library—an executable—that the victim is
00:10:08.560 --> 00:10:14.899
using into its own address space, like
the red rectangle. And this means that
00:10:14.899 --> 00:10:22.760
when this data is stored in the cache,
it's cached for both processes. Now the
00:10:22.760 --> 00:10:28.170
attacker can use the flush instruction to
remove the data out of the cache, so it's
00:10:28.170 --> 00:10:34.420
not in the cache anymore, so it's also not
cached for the victim. Now the attacker
00:10:34.420 --> 00:10:39.100
can schedule the victim and if the victim
decides "yeah, I need this data", it will
00:10:39.100 --> 00:10:44.970
be loaded back into the cache. And now the
attacker can reload the data, measure the
00:10:44.970 --> 00:10:49.661
time how long it took, and then decide
"okay, the victim has accessed the data in
00:10:49.661 --> 00:10:54.179
the meantime" or "the victim has not
accessed the data in the meantime." And by
00:10:54.179 --> 00:10:58.959
that you can spy if this address has been
used.
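One round of the Flush+Reload loop described above might look like this in C (a sketch with a function name and threshold of our own choosing; `_mm_clflush` and `__rdtscp` assume GCC/Clang on x86):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_clflush */

/* Assumed, machine-specific hit/miss boundary between the
 * ~70-cycle and ~200-cycle clusters seen in the histogram. */
#define CACHE_HIT_THRESHOLD 150

/* One Flush+Reload round on an address shared with the victim:
 * reload and time it, then flush again for the next round.
 * Returns 1 if the victim touched the line since the last flush. */
static int flush_reload_probe(volatile uint8_t *shared)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*shared;                       /* reload */
    uint64_t dt = __rdtscp(&aux) - t0;
    _mm_clflush((void *)shared);         /* flush for the next round */
    return dt < CACHE_HIT_THRESHOLD;     /* fast reload => cache hit */
}
```

The threshold has to be calibrated per machine from a histogram like the one shown earlier.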
00:10:58.959 --> 00:11:03.240
The second type of attack is called
"Prime+Probe" and it does not rely on the
00:11:03.240 --> 00:11:08.971
shared memory like the "Flush+Reload"
attack, and it works as following: Instead
00:11:08.971 --> 00:11:16.139
of mapping anything into its own address
space, the attacker loads a lot of data
00:11:16.139 --> 00:11:24.589
into one cache set, here, and fills the
cache. Now he again schedules the victim
00:11:24.589 --> 00:11:31.820
and the victim can access data that maps
to the same cache set.
00:11:31.820 --> 00:11:38.050
So the cache set is used by the attacker
and the victim at the same time. Now the
00:11:38.050 --> 00:11:43.050
attacker can start measuring the access
time to the addresses he loaded into the
00:11:43.050 --> 00:11:49.050
cache before, and when he accesses an
address that is still in the cache it's
00:11:49.050 --> 00:11:55.649
faster, so he measures a lower time. And
if it's not in the cache anymore it has to
00:11:55.649 --> 00:12:01.279
be reloaded into the cache so it takes a
longer time. He can sum this up and detect
00:12:01.279 --> 00:12:07.870
if the victim has loaded data into the
cache as well. So the first thing we want
00:12:07.870 --> 00:12:11.900
to show you is what you can do with cache
attacks is you can implement a covert
00:12:11.900 --> 00:12:17.439
channel and this could be happening in the
following scenario.
00:12:17.439 --> 00:12:23.610
You install an app on your phone to view
your favorite images you take, to apply
00:12:23.610 --> 00:12:28.630
some filters, and in the end you don't
know that it's malicious because the only
00:12:28.630 --> 00:12:33.609
permission it requires is to access your
images which makes sense. So you can
00:12:33.609 --> 00:12:38.700
easily install it without any fear. In
addition you want to know what the weather
00:12:38.700 --> 00:12:43.040
is outside, so you install a nice little
weather widget, and the only permission it
00:12:43.040 --> 00:12:48.230
has is to access the internet because it
has to load the information from
00:12:48.230 --> 00:12:55.569
somewhere. So what happens if you're able
to implement a covert channel between
00:12:55.569 --> 00:12:59.779
these two applications, without any
permissions and privileges so they can
00:12:59.779 --> 00:13:05.060
communicate with each other without using
any mechanisms provided by the operating
00:13:05.060 --> 00:13:11.149
system, so it's hidden. It can happen that
now the gallery app can send the image to
00:13:11.149 --> 00:13:18.680
the internet, it will be uploaded and
exposed for everyone. So maybe you don't
00:13:18.680 --> 00:13:25.610
want to see the cat picture everywhere.
While we can do this with both
00:13:25.610 --> 00:13:30.219
Prime+Probe and Flush+Reload attacks, we will
discuss a covert channel using
00:13:30.219 --> 00:13:35.690
Prime+Probe. So how can we transmit this
data? We need to transmit ones and zeros
00:13:35.690 --> 00:13:40.980
at some point. So the sender and the
receiver agree on one cache set that they
00:13:40.980 --> 00:13:49.319
both use. The receiver probes the set all
the time. When the sender wants to
00:13:49.319 --> 00:13:57.529
transmit a zero he just does nothing, so
the lines of the receiver are in the cache
00:13:57.529 --> 00:14:01.809
all the time, and he knows "okay, he's
sending nothing", so it's a zero.
00:14:01.809 --> 00:14:05.940
On the other hand if the sender wants to
transmit a one, he starts accessing
00:14:05.940 --> 00:14:10.800
addresses that map to the same cache set
so it will take a longer time for the
00:14:10.800 --> 00:14:16.540
receiver to access its addresses again,
and he knows "okay, the sender just sent
00:14:16.540 --> 00:14:23.059
me a one", and Clementine will show you
what you can do with this covert channel.
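A minimal sketch of that sender/receiver protocol, assuming both sides have already found an eviction set for the agreed cache set (finding one is its own problem and not shown here; the names, associativity, and threshold are ours, not from the talk):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp */

#define WAYS 16                 /* assumed associativity of the shared set */
#define PROBE_THRESHOLD 2000    /* assumed total-cycle cutoff, machine-specific */

/* ev[] holds WAYS addresses that all map to the agreed cache set. */

/* Sender: transmit one bit.  For a 1, thrash the set; for a 0, stay idle. */
static void send_bit(volatile uint8_t **ev, int bit)
{
    if (bit)
        for (int i = 0; i < WAYS; i++)
            (void)*ev[i];       /* evict the receiver's lines */
}

/* Receiver: probe the set and time the sweep.  A slow sweep means
 * the sender evicted us, i.e. a 1 was sent; the sweep also re-primes
 * the set for the next bit. */
static int receive_bit(volatile uint8_t **ev)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (int i = 0; i < WAYS; i++)
        (void)*ev[i];
    return (__rdtscp(&aux) - t0) > PROBE_THRESHOLD;
}
```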
00:14:23.059 --> 00:14:25.180
CM: So the really nice thing about
00:14:25.180 --> 00:14:28.959
Prime+Probe is that it has really low
requirements. It doesn't need any kind of
00:14:28.959 --> 00:14:34.349
shared memory. For example if you have two
virtual machines you could have some
00:14:34.349 --> 00:14:38.700
shared memory via memory deduplication.
The thing is that this is highly insecure,
00:14:38.700 --> 00:14:43.969
so cloud providers like Amazon EC2, they
disable that. Now we can still use
00:14:43.969 --> 00:14:50.429
Prime+Probe because it doesn't need this
shared memory. Another problem with cache
00:14:50.429 --> 00:14:54.999
covert channels is that they are quite
noisy. So when you have other applications
00:14:54.999 --> 00:14:59.259
that are also running on the system, they
are all competing for the cache and they
00:14:59.259 --> 00:15:03.009
might, like, evict some cache lines,
especially if it's an application that is
00:15:03.009 --> 00:15:08.749
very memory intensive. And you also have
noise due to the fact that the sender and
00:15:08.749 --> 00:15:12.770
the receiver might not be scheduled at the
same time. So if you have your sender that
00:15:12.770 --> 00:15:16.649
sends all the things and the receiver is
not scheduled then some part of the
00:15:16.649 --> 00:15:22.539
transmission can get lost. So what we did
is we tried to build an error-free covert
00:15:22.539 --> 00:15:30.829
channel. We took care of all these noise
issues by using some error detection to
00:15:30.829 --> 00:15:36.470
resynchronize the sender and the receiver
and then we use some error correction to
00:15:36.470 --> 00:15:40.779
correct the remaining errors.
So we managed to have a completely error-
00:15:40.779 --> 00:15:46.069
free covert channel even if you have a lot
of noise, so let's say another virtual
00:15:46.069 --> 00:15:54.119
machine also on the machine serving files
through a web server, also doing lots of
00:15:54.119 --> 00:16:01.600
memory-intensive tasks at the same time,
and the covert channel stayed completely
00:16:01.600 --> 00:16:07.610
error-free, and around 40 to 75 kilobytes
per second, which is still quite a lot.
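The talk doesn't show the exact protocol (the NDSS paper has the details), but as a rough illustration of the error-detection part, packets can carry a sequence number and a checksum so the receiver notices corrupted packets (noise) and gaps (descheduling) and can resynchronize; all names and sizes here are assumptions of ours:

```c
#include <stdint.h>
#include <stddef.h>

/* Rough illustration only -- the actual protocol differs. */
struct packet {
    uint8_t seq;                /* detects missed packets */
    uint8_t payload[26];
    uint8_t checksum;           /* XOR of seq and payload bytes */
};

static uint8_t checksum(const struct packet *p)
{
    uint8_t c = p->seq;
    for (size_t i = 0; i < sizeof p->payload; i++)
        c ^= p->payload[i];
    return c;
}

/* Receiver-side check: accept only the expected, uncorrupted packet. */
static int packet_ok(const struct packet *p, uint8_t expected_seq)
{
    return p->seq == expected_seq && checksum(p) == p->checksum;
}
```

On a failed check the receiver can signal the sender to retransmit, and a forward error-correcting code then mops up the remaining bit errors.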
00:16:07.610 --> 00:16:14.470
All of this is between virtual machines on
Amazon EC2. And the really neat thing—we
00:16:14.470 --> 00:16:19.389
wanted to do something with that—and
basically we managed to create an SSH
00:16:19.389 --> 00:16:27.060
connection really over the cache. So they
don't have any network between
00:16:27.060 --> 00:16:31.439
them, but just we are sending the zeros
and the ones and we have an SSH connection
00:16:31.439 --> 00:16:36.839
between them. So you could say that cache
covert channels are nothing, but I think
00:16:36.839 --> 00:16:43.079
it's a real threat. And if you want to
have more details about this work in
00:16:43.079 --> 00:16:49.220
particular, it will be published soon at
NDSS.
00:16:49.220 --> 00:16:54.040
So the second application that we wanted
to show you is that we can attack crypto
00:16:54.040 --> 00:17:01.340
with cache attacks. In particular we are
going to show an attack on AES and a
00:17:01.340 --> 00:17:04.990
special implementation of AES that uses
T-Tables. So that's the fast software
00:17:04.990 --> 00:17:11.650
implementation because it uses some
precomputed lookup tables. It's known to
00:17:11.650 --> 00:17:17.490
be vulnerable to side-channel attacks
since 2006 by Osvik et al, and it's a one-
00:17:17.490 --> 00:17:24.110
round known plaintext attack, so you have
p—or plaintext—and k, your secret key. And
00:17:24.110 --> 00:17:29.570
the AES algorithm, what it does is compute
an intermediate state at each round r.
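Written out, the first-round property is just an XOR, so observing an accessed index gives the key byte back directly (a toy sketch with a name of our own choosing; in practice Flush+Reload only sees which 64-byte line was hit, i.e. the upper four bits of the index):

```c
#include <stdint.h>

/* First round of T-table AES: the table index for byte i is
 * p[i] ^ k[i].  An attacker who observes the accessed index
 * recovers the key byte by XOR with the known plaintext byte. */
static uint8_t recover_key_byte(uint8_t p, uint8_t observed_index)
{
    return p ^ observed_index;  /* k = p ^ (p ^ k) */
}
```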
00:17:29.570 --> 00:17:38.559
And in the first round, the accessed table
indices are just p XOR k. Now it's a known
00:17:38.559 --> 00:17:43.500
plaintext attack, what this means is that
if you can recover the accessed table
00:17:43.500 --> 00:17:49.460
indices you've also managed to recover the
key because it's just XOR. So that would
00:17:49.460 --> 00:17:55.450
be bad, right, if we could recover these
accessed table indices. Well we can, with
00:17:55.450 --> 00:18:00.510
cache attacks! So we did that with
Flush+Reload and with Prime+Probe. On the
00:18:00.510 --> 00:18:05.809
x-axis you have the plaintext byte values
and on the y-axis you have the addresses
00:18:05.809 --> 00:18:15.529
which are essentially the T-table entries.
So a black cell means that we've monitored
00:18:15.529 --> 00:18:19.970
the cache line, and we've seen a lot of
cache hits. So basically the blacker it
00:18:19.970 --> 00:18:25.650
is, the more certain we are that the
T-Table entry has been accessed. And here
00:18:25.650 --> 00:18:31.779
it's a toy example, the key is all-zeros,
but you would basically just have a
00:18:31.779 --> 00:18:35.700
different pattern if the key was not all-
zeros, and as long as you can see this
00:18:35.700 --> 00:18:43.409
nice diagonal or a pattern then you have
recovered the key. So it's an old attack,
00:18:43.409 --> 00:18:48.890
2006, it's been 10 years, everything
should be fixed by now, and you see where
00:18:48.890 --> 00:18:56.880
I'm going: it's not. So on Android the
Bouncy Castle implementation uses by
00:18:56.880 --> 00:19:03.360
default the T-tables, so that's bad. Also
many implementations that you can find
00:19:03.360 --> 00:19:11.380
online use pre-computed values, so maybe
be wary of this kind of attack. The
00:19:11.380 --> 00:19:17.240
last application we wanted to show you is
how we can spy on keystrokes.
00:19:17.240 --> 00:19:21.480
So for that we will use Flush+Reload
because it's a really fine grained
00:19:21.480 --> 00:19:26.309
attack. We can see very precisely which
cache line has been accessed, and a cache
00:19:26.309 --> 00:19:31.440
line is only 64 bytes so it's really not a
lot and we're going to use that to spy on
00:19:31.440 --> 00:19:37.690
keystrokes and we even have a small demo
for you.
00:19:40.110 --> 00:19:45.640
ML: So what you can see on the screen, this
is not on Intel x86, it's on a smartphone,
00:19:45.640 --> 00:19:50.330
on the Galaxy S6, but you can also apply
these cache attacks there so that's what
00:19:50.330 --> 00:19:53.850
we want to emphasize.
So on the left you see the screen and on
00:19:53.850 --> 00:19:57.960
the right we have connected a shell with
no privileges and permissions, so it can
00:19:57.960 --> 00:20:00.799
basically be an app that you install
glass bottle falling
00:20:00.799 --> 00:20:09.480
from the App Store and on the right we are
going to start our spy tool, and on the
00:20:09.480 --> 00:20:14.110
left we just open the messenger app and
whenever the user hits any key on the
00:20:14.110 --> 00:20:19.690
keyboard our spy tool takes care of that
and notices that. Also if he presses the
00:20:19.690 --> 00:20:26.120
spacebar we can also measure that. If the
user decides "ok, I want to delete the
00:20:26.120 --> 00:20:30.880
word" because he changed his mind, we can
also register if the user pressed the
00:20:30.880 --> 00:20:37.929
backspace button, so in the end we can see
exactly how long the words were, the user
00:20:37.929 --> 00:20:45.630
typed into his phone without any
permissions and privileges, which is bad.
00:20:45.630 --> 00:20:55.250
laughs
applause
00:20:55.250 --> 00:21:00.320
ML: So enough about the mov instruction,
let's head to clflush.
00:21:00.320 --> 00:21:07.230
CM: So the clflush instruction: What it
does is that it invalidates from every
00:21:07.230 --> 00:21:12.309
level the cache line that contains the
address that you pass to this instruction.
00:21:12.309 --> 00:21:16.990
So in itself it's kind of bad because it
enables the Flush+Reload attacks that we
00:21:16.990 --> 00:21:21.300
showed earlier, that was just flush,
reload, and the flush part is done with
00:21:21.300 --> 00:21:29.140
clflush. But there's actually more to it,
how wonderful. So there's a first timing
00:21:29.140 --> 00:21:33.320
leakage with it, so we're going to see
that the clflush instruction has a
00:21:33.320 --> 00:21:37.890
different timing depending on whether the
data that you that you pass to it is
00:21:37.890 --> 00:21:44.710
cached or not. So imagine you have a cache
line that is in the level 1 cache.
00:21:44.710 --> 00:21:50.299
With the inclusion property it has to be
also in the last level cache. Now this is
00:21:50.299 --> 00:21:54.350
quite convenient and this is also why we
have this inclusion property for
00:21:54.350 --> 00:22:00.019
performance reason on Intel CPUs, if you
want to see if a line is present at all in
00:22:00.019 --> 00:22:04.209
the cache you just have to look in the
last level cache. So this is basically
00:22:04.209 --> 00:22:08.010
what the clflush instruction does. It goes
to the last level cache, sees "ok
00:22:08.010 --> 00:22:12.890
there's a line, I'm going to flush this
one" and then there's something that tells
00:22:12.890 --> 00:22:18.950
ok the line is also present somewhere else
so then it flushes the line in level 1
00:22:18.950 --> 00:22:26.390
and/or level 2. So that's slow. Now if you
perform clflush on some data that is not
00:22:26.390 --> 00:22:32.240
cached, basically it does the same, goes
to the last level cache, sees that there's
00:22:32.240 --> 00:22:36.659
no line and there can't be any... This
data can't be anywhere else in the cache
00:22:36.659 --> 00:22:41.269
because it would be in the last level
cache if it was anywhere, so it does
00:22:41.269 --> 00:22:47.430
nothing and it stops there. So that's fast.
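That timing difference is enough for a probe that never accesses memory. A sketch of one Flush+Flush round (our own names; the threshold is an assumed, machine-specific value, and since the gap is only a few cycles a real attack averages over noise):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_clflush */

/* Assumed cutoff between "flushed a cached line" (slow) and
 * "line was not cached" (fast). */
#define CLFLUSH_THRESHOLD 140

/* One Flush+Flush round: flush and time the flush itself.  A slow
 * clflush means the line was cached, i.e. the victim accessed it --
 * and the line is already flushed for the next round. */
static int flush_flush_probe(void *shared)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    _mm_clflush(shared);
    return (__rdtscp(&aux) - t0) > CLFLUSH_THRESHOLD;
}
```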
So how exactly fast and slow am I talking
00:22:47.430 --> 00:22:53.760
about? So it's actually only a very few
cycles, so we did these experiments on
00:22:53.760 --> 00:22:59.072
different microarchitectures, so Sandy
Bridge, Ivy Bridge, and Haswell.
00:22:59.072 --> 00:23:03.250
So the different colors correspond to the
different microarchitectures. So the first
00:23:03.250 --> 00:23:07.880
thing that is already... kinda funny is
that you can see that you can distinguish
00:23:07.880 --> 00:23:14.649
the microarchitectures quite nicely with
this, but the real point is that you have
00:23:14.649 --> 00:23:20.280
really different zones.
The solid line is when we performed the
00:23:20.280 --> 00:23:25.200
measurement on clflush with the line that
was already in the cache, and the dashed
00:23:25.200 --> 00:23:30.840
line is when the line was not in the
cache, and in all microarchitectures you
00:23:30.840 --> 00:23:36.539
can see a difference: It's
only a few cycles, it's a bit noisy, so
00:23:36.539 --> 00:23:43.250
what could go wrong? Okay, so exploiting
these few cycles, we still managed to
00:23:43.250 --> 00:23:47.029
perform a new cache attack that we call
"Flush+Flush", so I'm going to explain
00:23:47.029 --> 00:23:52.220
that to you: So basically everything that
we could do with "Flush+Reload", we can
00:23:52.220 --> 00:23:56.899
also do with "Flush+Flush". We can perform
covert channels and side-channel attacks.
00:23:56.899 --> 00:24:01.090
It's stealthier than previous cache
attacks, and I'm going to come back to this one,
00:24:01.090 --> 00:24:07.220
and it's also faster than previous cache
attacks. So how does it work exactly? So
00:24:07.220 --> 00:24:12.210
the principle is a bit similar to
"Flush+Reload": So we have the attacker
00:24:12.210 --> 00:24:16.131
and the victim that have some kind of
shared memory, let's say a shared library.
00:24:16.131 --> 00:24:21.340
It will be shared in the cache. The
attacker will start by flushing the cache
00:24:21.340 --> 00:24:26.510
line, then lets the victim perform
whatever it does, let's say encryption,
00:24:26.510 --> 00:24:32.120
the victim will load some data into the
cache, automatically, and now the attacker
00:24:32.120 --> 00:24:36.720
wants to know again if the victim accessed
this precise cache line and instead of
00:24:36.720 --> 00:24:43.540
reloading it, it is going to flush it again.
And since we have this timing difference
00:24:43.540 --> 00:24:47.040
depending on whether the data is in the
cache or not, it gives us the same
00:24:47.040 --> 00:24:54.889
information as if we reloaded it, except
it's way faster. So I talked about
00:24:54.889 --> 00:24:59.690
stealthiness. So the thing is that
basically these cache attacks and that
00:24:59.690 --> 00:25:06.340
also applies to "Rowhammer": They are
already stealthy in themselves, because
00:25:06.340 --> 00:25:10.470
there's no antivirus today that can detect
them. But some people thought that we
00:25:10.470 --> 00:25:14.351
could detect them with performance
counters because they do a lot of cache
00:25:14.351 --> 00:25:18.549
misses and cache references that happen
when the data is flushed and when you
00:25:18.549 --> 00:25:26.090
reaccess memory. Now what we thought is:
yeah, but those are not the only
00:25:26.090 --> 00:25:31.269
programs that lead to lots of cache misses and
cache references, so we would like to have
00:25:31.269 --> 00:25:38.120
a slightly better metric. So these cache
attacks have a very heavy activity on
00:25:38.120 --> 00:25:43.840
the cache, but they're also very particular
because they are very short loops of code:
00:25:43.840 --> 00:25:48.610
if you take Flush+Reload, this just
flushes one line, reloads the line, and then
00:25:48.610 --> 00:25:53.750
again flush, reload. That's a very short loop
and that creates a very low pressure on
00:25:53.750 --> 00:26:01.490
the instruction TLB, which is kind of
particular for cache attacks. So what we
00:26:01.490 --> 00:26:05.380
decided to do is normalizing the cache
events, so the cache misses and cache
00:26:05.380 --> 00:26:10.720
references, by events that have to do with
the instruction TLB, and with that we
00:26:10.720 --> 00:26:19.360
managed to detect cache attacks and
Rowhammer without having false positives.
00:26:19.360 --> 00:26:24.510
So this is the metric that I'm going to use
when I talk about stealthiness. So we
00:26:24.510 --> 00:26:29.750
started by creating a covert channel. First
we wanted to have it as fast as possible
00:26:29.750 --> 00:26:36.160
so we created a protocol to evaluate all
the kinds of cache attacks that we had, so
00:26:36.160 --> 00:26:40.540
Flush+Flush, Flush+Reload, and
Prime+Probe, and we started with a
00:26:40.540 --> 00:26:47.010
packet size of 28, doesn't really matter.
We measured the capacity of our covert
00:26:47.010 --> 00:26:52.799
channel and Flush+Flush is around
500 kB/s whereas Flush+Reload
00:26:52.799 --> 00:26:56.340
was only 300 kB/s
so Flush+Flush is already quite an
00:26:56.340 --> 00:27:00.740
improvement on the speed.
Then we measured the stealthiness: at this
00:27:00.740 --> 00:27:06.100
speed, only Flush+Flush was stealthy. And
now the thing is that Flush+Flush and
00:27:06.100 --> 00:27:10.200
Flush+Reload, as you've seen, have
some similarities, so for a covert channel
00:27:10.200 --> 00:27:15.309
they also share the same sender, only the
receivers are different, and for this one the
00:27:15.309 --> 00:27:20.000
sender was not stealthy for both of them.
Anyway, if you want a fast covert channel,
00:27:20.000 --> 00:27:26.640
then just try Flush+Flush, that works.
Now let's try to make it stealthy
00:27:26.640 --> 00:27:30.639
completely stealthy, because if we have a
sender that is not stealthy, maybe we
00:27:30.639 --> 00:27:36.440
give away the whole attack. So we said okay,
maybe if we just slow down all the attacks
00:27:36.440 --> 00:27:41.240
then there will be less cache hits,
cache misses and then maybe all
00:27:41.240 --> 00:27:48.070
the attacks are actually stealthy why not?
So we tried that we slowed down everything
00:27:48.070 --> 00:27:52.889
so Flush+Reload and Flush+Flush
are around 50 kB/s now
00:27:52.889 --> 00:27:55.829
Prime+Probe is a bit slower because it
takes more time
00:27:55.829 --> 00:28:01.330
to prime and probe everything, but still
00:28:01.330 --> 00:28:09.419
even with this slowdown, only Flush+Flush
has its receiver stealthy, and we also
00:28:09.419 --> 00:28:14.769
managed to have the sender stealthy now. So
basically, whether you want a fast covert
00:28:14.769 --> 00:28:20.450
channel or a stealthy covert channel,
Flush+Flush is really great.
00:28:20.450 --> 00:28:26.500
Now we wanted to also evaluate if it
wasn't too noisy to perform some side
00:28:26.500 --> 00:28:30.740
channel attacks. So we did these side
channels on the AES t-table implementation,
00:28:30.740 --> 00:28:35.910
the attacks that we have shown you
earlier. So we computed the number of
00:28:35.910 --> 00:28:41.820
encryptions that we needed to determine the
upper four bits of a key byte. So here, the
00:28:41.820 --> 00:28:48.870
lower, the better the attack. And
Flush+Reload is a bit better, so we need only 250
00:28:48.870 --> 00:28:55.029
encryptions to recover these bits, but
Flush+Flush comes quite
00:28:55.029 --> 00:29:00.570
close with 350. And Prime+Probe is
actually the noisiest of them all; it needs
00:29:00.570 --> 00:29:06.101
close to 5000 encryptions. So we have
around the same performance for
00:29:06.101 --> 00:29:13.520
Flush+Flush and Flush+Reload.
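The first-round t-table attack just described can be simulated in a few lines. The sketch below only models the table index p XOR k and a 64-byte cache line holding 16 four-byte entries; the key byte is invented, and a real attack observes cache hits with Flush+Reload or Flush+Flush instead of computing the index directly:

```python
import random

random.seed(1234)
SECRET_KEY_BYTE = 0x7A  # what the attacker wants to recover

def ttable_line(p, k):
    # The first AES round accesses T[p ^ k]; one 64-byte line holds
    # 16 four-byte entries, so a probe on one line only reveals the
    # upper four bits of the index.
    return (p ^ k) >> 4

# Monitor line 0 of the table and record the plaintext bytes whose
# encryption produced a hit there.
plaintexts = [random.randrange(256) for _ in range(500)]
hits = [p for p in plaintexts if ttable_line(p, SECRET_KEY_BYTE) == 0]

# p ^ k falling into line 0 means p and k share their upper nibble.
print({p >> 4 for p in hits})  # -> {7}, the upper four bits of 0x7A
```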
Now let's evaluate the stealthiness again.
00:29:13.520 --> 00:29:19.320
So what we did here is we performed 256
billion encryptions in a synchronous
00:29:19.320 --> 00:29:25.740
attack, so we really had the spy and the
victim synchronized, and we evaluated the
00:29:25.740 --> 00:29:31.409
stealthiness of them all, and here only
Flush+Flush again is stealthy. And while
00:29:31.409 --> 00:29:36.279
you can always slow down a covert channel
you can't actually slow down a side
00:29:36.279 --> 00:29:40.700
channel because, in a real-life scenario,
you're not going to say "Hey victim,
00:29:40.700 --> 00:29:47.179
wait for me a bit, I am trying to do an
attack here." That won't work.
00:29:47.179 --> 00:29:51.429
So there's even more to it but I will need
again a bit of background before
00:29:51.429 --> 00:29:56.910
continuing. So I've shown you the
different levels of caches and here I'm
00:29:56.910 --> 00:30:04.009
going to focus more on the last-level
cache. So we have here our four slices so
00:30:04.009 --> 00:30:09.830
this is the last-level cache and we have
some bits of the address here that
00:30:09.830 --> 00:30:14.330
correspond to the set. But more
importantly, we need to know in
00:30:14.330 --> 00:30:19.899
which slice an address is going to be.
And that is given by some
00:30:19.899 --> 00:30:23.850
bits of the set and the tag of the
address that are passed into a function
00:30:23.850 --> 00:30:27.960
that says in which slice the line is going
to be.
00:30:27.960 --> 00:30:32.460
Now the thing is that this hash function
is undocumented by Intel. Wouldn't be fun
00:30:32.460 --> 00:30:39.250
otherwise. So we have this: as many slices
as cores, an undocumented hash function
00:30:39.250 --> 00:30:43.980
that maps a physical address to a slice,
and while it's actually a bit of a pain
00:30:43.980 --> 00:30:48.710
for attacks, it was not designed
for security originally but for
00:30:48.710 --> 00:30:53.570
performance, because you want all the
accesses to be evenly distributed in the
00:30:53.570 --> 00:31:00.399
different slices, for performance reasons.
So what the hash function basically does is it
00:31:00.399 --> 00:31:05.279
takes some bits of the physical address
and outputs k bits of slice, so just one
00:31:05.279 --> 00:31:09.309
bit if you have a two-core machine, two
bits if you have a four-core machine and
00:31:09.309 --> 00:31:16.830
so on. Now let's go back to clflush and see
what's the relation with that.
00:31:16.830 --> 00:31:21.169
So the thing that we noticed is that
clflush is actually faster to reach a line
00:31:21.169 --> 00:31:28.549
on the local slice.
So if you're flushing always
00:31:28.549 --> 00:31:33.340
one line and you run your program on core
zero, core one, core two and core three,
00:31:33.340 --> 00:31:37.899
you will observe that on one core in
particular, when you run the program on
00:31:37.899 --> 00:31:44.632
that core, the clflush is faster. And so
here this is on core one, and you can see
00:31:44.632 --> 00:31:51.139
that on cores zero, two, and three it's
a bit slower. And here we can deduce that:
00:31:51.139 --> 00:31:55.320
so we run the program on core one and we
flush always the same line and we can
00:31:55.320 --> 00:32:01.850
deduce that the line belongs to slice one.
And what we can do with that is that we
00:32:01.850 --> 00:32:06.500
can map physical addresses to slices.
And that's one way to reverse-engineer
00:32:06.500 --> 00:32:10.639
this addressing function that was not
documented.
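Both steps can be sketched in code. The XOR masks below are invented purely for illustration (the genuine Intel functions had to be reverse-engineered, as described), and the clflush timings are fake numbers standing in for averaged rdtsc measurements:

```python
# Illustrative slice hash: each of the k slice bits is the parity
# (XOR) of a subset of physical address bits. These masks are made
# up; the real Intel functions are undocumented.

def parity(x):
    return bin(x).count("1") & 1

def slice_of(phys_addr, masks):
    s = 0
    for i, mask in enumerate(masks):
        s |= parity(phys_addr & mask) << i
    return s

MASKS = [0b1010100 << 6, 0b0110010 << 6]  # 2 output bits -> 4 slices
print(slice_of(0x12345678, MASKS))        # -> 2

# Deducing a line's slice from clflush: run the flush loop pinned to
# each core in turn; the core with the fastest timing owns the slice.
timings = {0: 144, 1: 131, 2: 143, 3: 145}  # fake cycle averages
print(min(timings, key=timings.get))        # -> 1, so slice 1
```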
00:32:10.639 --> 00:32:15.880
Funnily enough that's not the only way:
What I did before that was using the
00:32:15.880 --> 00:32:21.229
performance counters to reverse-engineer
this function, but that's actually a whole
00:32:21.229 --> 00:32:27.770
other story, and if you want more details on
that, there's also an article on that.
00:32:27.770 --> 00:32:30.139
ML: So the next instruction we want to
00:32:30.139 --> 00:32:35.110
talk about is the prefetch instruction.
And the prefetch instruction is used to
00:32:35.110 --> 00:32:40.841
tell the CPU: "Okay, please load the data
I need later on, into the cache, if you
00:32:40.841 --> 00:32:45.968
have some time." And in the end there are
actually six different prefetch
00:32:45.968 --> 00:32:52.929
instructions: prefetcht0 to t2 which
means: "CPU, please load the data into the
00:32:52.929 --> 00:32:58.640
first-level cache", or in the last-level
cache, whatever you want to use, but we
00:32:58.640 --> 00:33:02.250
spare you the details because it's not so
interesting in the end.
00:33:02.250 --> 00:33:06.940
However, what's more interesting is when
we take a look at the Intel manual and
00:33:06.940 --> 00:33:11.880
what it says there. So, "Using the
PREFETCH instruction is recommended only
00:33:11.880 --> 00:33:17.049
if data does not fit in the cache." So you
can tell the CPU: "Please load data I want
00:33:17.049 --> 00:33:23.210
to stream into the cache, so it's more
performant." "Use of software prefetch
00:33:23.210 --> 00:33:27.740
should be limited to memory addresses that
are managed or owned within the
00:33:27.740 --> 00:33:33.620
application context."
So one might wonder what happens if this
00:33:33.620 --> 00:33:40.940
address is not managed by myself. Sounds
interesting. "Prefetching to addresses
00:33:40.940 --> 00:33:46.289
that are not mapped to physical pages can
experience non-deterministic performance
00:33:46.289 --> 00:33:52.030
penalty. For example specifying a NULL
pointer as an address for prefetch can
00:33:52.030 --> 00:33:56.000
cause long delays."
So we don't want to do that because our
00:33:56.000 --> 00:34:02.919
program will be slow. So, let's take a
look what they mean with non-deterministic
00:34:02.919 --> 00:34:08.889
performance penalty, because we want to
write good software, right? But before
00:34:08.889 --> 00:34:12.510
that, we have to take a look at a little
bit more background information to
00:34:12.510 --> 00:34:17.710
understand the attacks.
So on modern operating systems, every
00:34:17.710 --> 00:34:22.850
application has its own virtual address
space. So at some point, the CPU needs to
00:34:22.850 --> 00:34:27.479
translate these addresses to the physical
addresses actually in the DRAM. And for
00:34:27.479 --> 00:34:33.690
that we have this very complex-looking
data structure. So we have a 48-bit
00:34:33.690 --> 00:34:40.409
virtual address, and some of those bits
map to a table, like the page map level 4 (PML4)
00:34:40.409 --> 00:34:47.760
table, with 512 entries. So depending on
those bits, the CPU knows at which entry it
00:34:47.760 --> 00:34:51.520
has to look.
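The indexing just described is a fixed slicing of the 48-bit virtual address: 9 bits per level (2^9 = 512 entries per table) plus a 12-bit offset into a 4-kilobyte page. A small sketch, with an arbitrary example address:

```python
# Split a 48-bit virtual address into the four table indices and the
# page offset used by the 4-level page-table walk described above.

def walk_indices(vaddr):
    return {
        "pml4":   (vaddr >> 39) & 0x1FF,  # page map level 4
        "pdpt":   (vaddr >> 30) & 0x1FF,  # page directory pointer table
        "pd":     (vaddr >> 21) & 0x1FF,  # page directory
        "pt":     (vaddr >> 12) & 0x1FF,  # page table
        "offset": vaddr & 0xFFF,          # byte offset in the 4 KB page
    }

print(walk_indices(0x7FFFDEADB000))
```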
And if there is data there, because the
00:34:51.520 --> 00:34:56.900
address is mapped, it can proceed and look
at the page directory pointer table,
00:34:56.900 --> 00:35:04.620
and so on down the levels. So everything
is the same for each level until you come
00:35:04.620 --> 00:35:09.130
to your page table, where you have
4-kilobyte pages. So it's in the end not
00:35:09.130 --> 00:35:13.851
that complicated, but it's a bit
confusing, because you want to know a
00:35:13.851 --> 00:35:20.310
physical address, so you have to look it
up somewhere in the main memory
00:35:20.310 --> 00:35:25.420
with physical addresses to translate your
virtual addresses. And if you have to go
00:35:25.420 --> 00:35:31.890
through all those levels, it takes a long
time. So we can do better than that, and
00:35:31.890 --> 00:35:39.160
that's why Intel introduced additional
caches, also for all of those levels. So,
00:35:39.160 --> 00:35:45.560
if you want to translate an address, you
take a look at the ITLB for instructions,
00:35:45.560 --> 00:35:51.150
and the data TLB for data. If it's there,
you can stop, otherwise you go down all
00:35:51.150 --> 00:35:58.700
those levels and if it's not in any cache
you have to look it up in the DRAM. In
00:35:58.700 --> 00:36:03.300
addition, the address space you have is
shared, because you have, on the one hand,
00:36:03.300 --> 00:36:07.470
the user memory and, on the other hand,
you have mapped the kernel for convenience
00:36:07.470 --> 00:36:12.870
and performance also in the address space.
And if your user program wants to access
00:36:12.870 --> 00:36:18.310
some kernel functionality like reading a
file, it will switch to the kernel memory
00:36:18.310 --> 00:36:23.880
there's a privilege escalation, and then
you can read the file, and so on. So,
00:36:23.880 --> 00:36:30.420
that's it. However, you have drivers in
the kernel, and if you know the addresses
00:36:30.420 --> 00:36:35.771
of those drivers, you can do code-reuse
attacks, and as a countermeasure, they
00:36:35.771 --> 00:36:40.150
introduced address-space layout
randomization, also for the kernel.
00:36:40.150 --> 00:36:47.040
And this means that when you have your
program running, the kernel is mapped at
00:36:47.040 --> 00:36:51.630
one address and if you reboot the machine
it's not on the same address anymore but
00:36:51.630 --> 00:36:58.390
somewhere else. So if there is a way to
find out at which address the kernel is
00:36:58.390 --> 00:37:04.450
loaded, you have circumvented this
countermeasure and defeated kernel address
00:37:04.450 --> 00:37:11.060
space layout randomization. So this would
be nice for some attacks. In addition,
00:37:11.060 --> 00:37:16.947
there's also the kernel direct physical
map. And what does this mean? It's
00:37:16.947 --> 00:37:23.320
implemented on many operating systems like
OS X, Linux, also on the Xen hypervisor
00:37:23.320 --> 00:37:27.860
and
BSD, but not on Windows. But what it means
00:37:27.860 --> 00:37:33.870
is that the complete physical memory is
mapped additionally in the kernel
00:37:33.870 --> 00:37:40.460
memory at a fixed offset. So, for every
page that is mapped in the user space,
00:37:40.460 --> 00:37:45.160
there's something like a twin page in the
kernel memory, which you can't access
00:37:45.160 --> 00:37:50.371
because it's in the kernel memory.
However, we will need it later, because
00:37:50.371 --> 00:37:58.230
now we go back to prefetch and see what we
can do with that. So, prefetch is not a
00:37:58.230 --> 00:38:04.150
usual instruction, because it just tells
the CPU "I might need that data later on.
00:38:04.150 --> 00:38:10.000
If you have time, load it for me," if not,
the CPU can ignore it because it's busy
00:38:10.000 --> 00:38:15.810
with other stuff. So, there's no necessity
that this instruction is really executed,
00:38:15.810 --> 00:38:22.070
but most of the time it is. And a nice,
interesting thing is that it generates no
00:38:22.070 --> 00:38:29.000
faults, so whatever you pass to this
instruction, your program won't crash, and
00:38:29.000 --> 00:38:33.990
it does not check any privileges, so I can
also pass a kernel address to it and it
00:38:33.990 --> 00:38:37.510
won't say "No, stop, you accessed an
address that you are not allowed to
00:38:37.510 --> 00:38:45.530
access, so I crash," it just continues,
which is nice.
00:38:45.530 --> 00:38:49.810
The second interesting thing is that the
operand is a virtual address, so every
00:38:49.810 --> 00:38:55.534
time you execute this instruction, the CPU
has to go and check "OK, what physical
00:38:55.534 --> 00:38:59.600
address does this virtual address
correspond to?" So it has to do the lookup
00:38:59.600 --> 00:39:05.750
with all those tables we've seen earlier,
and as you probably have guessed already,
00:39:05.750 --> 00:39:10.370
the execution time varies also for the
prefetch instruction and we will see later
00:39:10.370 --> 00:39:16.090
on what we can do with that.
So, let's get back to the direct physical
00:39:16.090 --> 00:39:22.870
map. Because we can create an oracle for
address translation, so we can find out
00:39:22.870 --> 00:39:27.540
what physical address belongs to the
virtual address. Because nowadays you
00:39:27.540 --> 00:39:31.990
don't want the user to know that, because
you can craft nice Rowhammer attacks with
00:39:31.990 --> 00:39:37.520
that information, and more advanced cache
attacks, so you restrict this information
00:39:37.520 --> 00:39:44.270
to the user. But let's check if we find a
way to still get this information. So, as
00:39:44.270 --> 00:39:50.150
I've told you earlier, if you have a
page mapped in the user space,
00:39:50.150 --> 00:39:54.505
you have the twin page in the kernel
space, and if it's cached,
00:39:54.505 --> 00:39:56.710
it's cached for both of them again.
00:39:56.710 --> 00:40:03.170
So, the attack now works as the following:
From the attacker you flush your user
00:40:03.170 --> 00:40:09.760
space page, so it's not in the cache for
the... also for the kernel memory, and
00:40:09.760 --> 00:40:15.850
then you call prefetch on the address of
the kernel, because as I told you, you
00:40:15.850 --> 00:40:22.050
still can do that because it doesn't
create any faults. So, you tell the CPU
00:40:22.050 --> 00:40:28.310
"Please load me this data into the cache
even if I don't have access to this data
00:40:28.310 --> 00:40:32.550
normally."
And if we now measure on our user space
00:40:32.550 --> 00:40:37.100
page the address again, and we measure a
cache hit, because it has been loaded by
00:40:37.100 --> 00:40:42.630
the CPU into the cache, we know exactly
which address, since we passed the
00:40:42.630 --> 00:40:48.250
address to the instruction, this page
corresponds to. And because this is at a
00:40:48.250 --> 00:40:53.280
fixed offset, we can just do a simple
subtraction and know the physical address
00:40:53.280 --> 00:40:59.180
again. So we have a nice way to find
physical addresses for virtual addresses.
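The oracle loop can be sketched as follows. This is a pure simulation: prefetch, flush, and the cache itself are mocked with a Python set, and the direct-map base and the "secret" physical address are invented, but the control flow matches the flush, prefetch, reload steps just described:

```python
# Simulated prefetch translation oracle. flush/prefetch stand in for
# the real instructions; the cache is a set of physical addresses.

DIRECT_MAP_BASE = 0xFFFF880000000000  # illustrative fixed offset
PAGE = 4096
TRUE_PHYS = 0x3A7000   # secret physical address of our user page

cache = set()

def flush(phys):
    cache.discard(phys)

def prefetch(vaddr):
    # The CPU loads the data regardless of privileges; the user page
    # and its kernel twin share cache state via the physical page.
    if vaddr >= DIRECT_MAP_BASE:
        cache.add(vaddr - DIRECT_MAP_BASE)

def find_physical(max_phys):
    for candidate in range(0, max_phys, PAGE):
        flush(TRUE_PHYS)                       # 1. flush our user page
        prefetch(DIRECT_MAP_BASE + candidate)  # 2. prefetch kernel twin
        if TRUE_PHYS in cache:                 # 3. reload: cache hit?
            return candidate                   # same physical page
    return None

print(hex(find_physical(0x400000)))  # -> 0x3a7000
```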
00:40:59.180 --> 00:41:04.390
And in practice this looks like the
following plot. So, it's pretty simple,
00:41:04.390 --> 00:41:08.910
because we just do this for every address,
and at some point we measure a cache hit.
00:41:08.910 --> 00:41:14.260
So, there's a huge difference. And exactly
at this point we know this physical
00:41:14.260 --> 00:41:20.140
address corresponds to our virtual
address. The second thing is that we can
00:41:20.140 --> 00:41:27.070
exploit the timing differences of
the prefetch instruction. Because, as
00:41:27.070 --> 00:41:31.850
I told you, when you go down these cache
levels, at some point you see "it's here"
00:41:31.850 --> 00:41:37.500
or "it's not here," so it can abort early.
And with that we can know exactly
00:41:37.500 --> 00:41:41.800
when the prefetch
instruction aborted, and know how the
00:41:41.800 --> 00:41:48.070
pages are mapped into the address space.
So, the timing depends on where the
00:41:48.070 --> 00:41:57.090
translation stops. And using those two
properties and that information, we can
00:41:57.090 --> 00:42:02.227
do the following: On the one hand, we can
build variants of cache attacks. So,
00:42:02.227 --> 00:42:07.444
instead of Flush+Reload, we can do
Flush+Prefetch, for instance. We can
00:42:07.444 --> 00:42:12.060
also use prefetch to mount rowhammer
attacks on privileged addresses, because
00:42:12.060 --> 00:42:18.069
it doesn't do any faults when we pass
those addresses, and it works as well. In
00:42:18.069 --> 00:42:23.330
addition, we can use it to recover the
translation levels of a process, which you
00:42:23.330 --> 00:42:27.870
could do earlier with the pagemap file,
but as I told you it's now privileged, so
00:42:27.870 --> 00:42:32.890
you don't have access to that, and by
doing that you can bypass address space
00:42:32.890 --> 00:42:38.170
layout randomization. In addition, as I
told you, you can translate virtual
00:42:38.170 --> 00:42:43.530
addresses to physical addresses, which is
now also privileged with the pagemap
00:42:43.530 --> 00:42:48.790
file, and using that it re-enables ret2dir
exploits, which have been
00:42:48.790 --> 00:42:55.550
demonstrated last year. On top of that, we
can also use this to locate kernel
00:42:55.550 --> 00:43:00.850
drivers, as I told you. It would be nice
if we can circumvent KASLR as well, and I
00:43:00.850 --> 00:43:08.380
will show you now how this is possible.
So, with the first oracle we find out all
00:43:08.380 --> 00:43:15.430
the pages that are mapped, and for each of
those pages, we evict the translation
00:43:15.430 --> 00:43:18.210
caches, and we can do that by either
calling sleep,
00:43:18.210 --> 00:43:24.450
which schedules another program, or by
accessing a large memory buffer. Then, we
00:43:24.450 --> 00:43:28.260
perform a syscall to the driver. So,
the driver's code is executed and
00:43:28.260 --> 00:43:33.540
loaded into the cache, and then we just
measure the time prefetch takes on this
00:43:33.540 --> 00:43:40.840
address. And in the end, the page with the
fastest average access time is the driver page.
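The last step of this KASLR attack reduces to picking the candidate page with the fastest average prefetch time after the syscall; the addresses and timings below are invented for illustration:

```python
# Fake prefetch timings (cycles) per candidate driver page, measured
# after a syscall into the driver primed its translation caches.
measurements = {
    0xFFFFF80000000000: [240, 238, 241],
    0xFFFFF80000200000: [239, 242, 240],
    0xFFFFF80000400000: [180, 178, 181],  # the driver page
    0xFFFFF80000600000: [241, 240, 239],
}

def locate_driver(meas):
    averages = {page: sum(t) / len(t) for page, t in meas.items()}
    return min(averages, key=averages.get)

print(hex(locate_driver(measurements)))  # -> 0xfffff80000400000
```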
00:43:40.840 --> 00:43:46.770
So, we can mount this attack on Windows 10
in less than 12 seconds. So, we can defeat
00:43:46.770 --> 00:43:52.110
KASLR in less than 12 seconds, which is
very nice. And in practice, the
00:43:52.110 --> 00:43:58.330
measurements look like the following: So,
we have a lot of long measurements, and at
00:43:58.330 --> 00:44:05.060
some point you have a low one, and you
know exactly that this is the driver region and
00:44:05.060 --> 00:44:09.930
the address where the driver is located. And
you can mount those ret2dir
00:44:09.930 --> 00:44:16.210
attacks again. However, that's not
everything, because there are more
00:44:16.210 --> 00:44:20.795
instructions on Intel CPUs.
CM: Yeah, so, the following is not our
00:44:20.795 --> 00:44:24.350
work, but we thought that would be
interesting, because it's basically more
00:44:24.350 --> 00:44:30.740
instructions, more attacks, more fun. So
there's the RDSEED instruction, and what
00:44:30.740 --> 00:44:35.340
it does is request a random seed from
the hardware random number generator. So,
00:44:35.340 --> 00:44:39.310
the thing is that there is a fixed number
of precomputed random bits, and it takes
00:44:39.310 --> 00:44:44.320
time to regenerate them. So, as everything
that takes time, you can create a covert
00:44:44.320 --> 00:44:50.180
channel with that. There is also FADD and
FMUL, which are floating point operations.
00:44:50.180 --> 00:44:56.740
Here, the running time of this instruction
depends on the operands. Some people
00:44:56.740 --> 00:45:01.530
managed to bypass Firefox's same origin
policy with an SVG filter timing attack
00:45:01.530 --> 00:45:08.540
with that. There are also the JMP
instructions. So, in modern CPUs you have
00:45:08.540 --> 00:45:14.520
branch prediction, and branch target
prediction. With that, it's actually been
00:45:14.520 --> 00:45:18.250
studied a lot, you can create a covert
channel. You can do side-channel attacks
00:45:18.250 --> 00:45:26.028
on crypto. You can also bypass KASLR, and
finally, there are TSX instructions, which
00:45:26.028 --> 00:45:31.010
is an extension for hardware transactional
memory support, which has also been used
00:45:31.010 --> 00:45:37.150
to bypass KASLR. So, in case you're not
sure, KASLR is dead. You have lots of
00:45:37.150 --> 00:45:45.650
different things to read. Okay, so, on to the
conclusion now. So, as you've seen, it's
00:45:45.650 --> 00:45:50.190
actually more a problem of CPU design,
than really the instruction set
00:45:50.190 --> 00:45:55.720
architecture. The thing is that all these
issues are really hard to patch. They
00:45:55.720 --> 00:45:59.966
are all linked to performance
optimizations, and we are not getting rid
00:45:59.966 --> 00:46:03.890
of performance optimization. That's
basically a trade-off between performance
00:46:03.890 --> 00:46:11.530
and security, and performance seems to
always win. There have been some
00:46:11.530 --> 00:46:20.922
propositions against cache attacks,
let's say removing the clflush
00:46:20.922 --> 00:46:26.640
instruction. The thing is that all these
quick fixes won't work, because we always
00:46:26.640 --> 00:46:31.450
find new ways to do the same thing without
these precise instructions and also, we
00:46:31.450 --> 00:46:37.410
keep finding new instructions that leak
information. So, it's really, let's say
00:46:37.410 --> 00:46:43.740
quite a big topic, and we have to fix
this. So, thank you very much for your
00:46:43.740 --> 00:46:47.046
attention. If you have any questions we'd
be happy to answer them.
00:46:47.046 --> 00:46:52.728
applause
00:46:52.728 --> 00:47:01.510
applause
Herald: Okay. Thank you very much again
00:47:01.510 --> 00:47:06.571
for your talk, and now we will have a Q&A,
and we have, I think, about 15 minutes, so
00:47:06.571 --> 00:47:11.330
you can start lining up behind the
microphones. They are in the gangways in
00:47:11.330 --> 00:47:18.130
the middle. Except, I think that one...
oh, no, it's back up, so it will work. And
00:47:18.130 --> 00:47:22.180
while we wait, I think we will take
questions from our signal angel, if there
00:47:22.180 --> 00:47:28.810
are any. Okay, there aren't any, so...
microphone questions. I think, you in
00:47:28.810 --> 00:47:33.440
front.
Microphone: Hi. Can you hear me?
00:47:33.440 --> 00:47:40.050
Herald: Try again.
Microphone: Okay. Can you hear me now?
00:47:40.050 --> 00:47:46.480
Okay. Yeah, I'd like to know what exactly
was your stealthiness metric? Was it that
00:47:46.480 --> 00:47:51.310
you can't distinguish it from a normal
process, or...?
00:47:51.310 --> 00:47:56.500
CM: So...
Herald: Wait a second. We have still Q&A,
00:47:56.500 --> 00:47:59.780
so could you quiet down a bit? That would
be nice.
00:47:59.780 --> 00:48:08.180
CM: So, the question was about the
stealthiness metric. Basically, we use the
00:48:08.180 --> 00:48:14.320
metric with cache misses and cache
references, normalized by the instructions
00:48:14.320 --> 00:48:21.080
TLB events, and we
just found the threshold under which
00:48:21.080 --> 00:48:25.820
pretty much every benign application was
below this, and rowhammer and cache
00:48:25.820 --> 00:48:30.520
attacks were after that. So we fixed the
threshold, basically.
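As a sketch, the detection rule described here is just a ratio check against a fixed threshold; the counter values and the threshold below are made up for illustration:

```python
# Stealthiness check as described: cache events normalized by
# instruction-TLB events, compared against a fixed threshold.

THRESHOLD = 2.35  # hypothetical value separating benign programs

def is_stealthy(cache_refs, cache_misses, itlb_events):
    return (cache_refs + cache_misses) / itlb_events <= THRESHOLD

print(is_stealthy(12_000, 800, 9_000))            # benign-like
print(is_stealthy(9_000_000, 4_000_000, 10_000))  # attack-like
```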
00:48:30.520 --> 00:48:35.520
H: That microphone.
Microphone: Hello. Thanks for your talk.
00:48:35.520 --> 00:48:42.760
It was great. First question: Did you
inform Intel before doing this talk?
00:48:42.760 --> 00:48:47.520
CM: Nope.
Microphone: Okay. The second question:
00:48:47.520 --> 00:48:51.050
What's your future plans?
CM: Sorry?
00:48:51.050 --> 00:48:55.780
M: What's your future plans?
CM: Ah, future plans. Well, what I did,
00:48:55.780 --> 00:49:01.220
that is interesting, is that we keep
finding these more or less by accident, or
00:49:01.220 --> 00:49:06.440
manually, so having a good idea of what's
the attack surface here would be a good
00:49:06.440 --> 00:49:10.050
thing, and doing that automatically would
be even better.
00:49:10.050 --> 00:49:14.170
M: Great, thanks.
H: Okay, the microphone in the back,
00:49:14.170 --> 00:49:18.770
over there. The guy in white.
M: Hi. One question. If you have,
00:49:18.770 --> 00:49:24.410
like, a daemon that randomly invalidates
some cache lines, would that be a better
00:49:24.410 --> 00:49:31.120
countermeasure than disabling the caches?
ML: What was the question?
00:49:31.120 --> 00:49:39.580
CM: If invalidating cache lines would be
better than disabling the whole cache. So,
00:49:39.580 --> 00:49:42.680
I'm...
ML: If you know which cache lines have
00:49:42.680 --> 00:49:47.300
been accessed by the process, you can
invalidate those cache lines before you
00:49:47.300 --> 00:49:52.820
swap those processes, but it's also a
trade-off between performance. Like, you
00:49:52.820 --> 00:49:57.940
can also, if you switch processes, flush
the whole cache, and then it's empty, and
00:49:57.940 --> 00:50:01.900
then you don't see any activity anymore,
but there's also the trade-off of
00:50:01.900 --> 00:50:07.510
performance with this.
M: Okay, maybe a second question. If you,
00:50:07.510 --> 00:50:12.240
there are some ARM architectures
that have random cache line invalidations.
00:50:12.240 --> 00:50:16.010
Did you try those, if you can see a
[unintelligible] channel there.
00:50:16.010 --> 00:50:21.960
ML: If they're truly random, but probably
you just have to make more measurements
00:50:21.960 --> 00:50:27.180
and more measurements, and then you can
average out the noise, and then you can do
00:50:27.180 --> 00:50:30.350
these attacks again. It's like, with prime
and probe, where you need more
00:50:30.350 --> 00:50:34.080
measurements, because it's much more
noisy, so in the end you will just need
00:50:34.080 --> 00:50:37.870
much more measurements.
CM: So, on ARM, it's supposed to be pretty
00:50:37.870 --> 00:50:43.260
random. At least it's in the manual, but
we actually found nice ways to evict cache
00:50:43.260 --> 00:50:47.230
lines, that we really wanted to evict, so
it's not actually that pseudo-random.
00:50:47.230 --> 00:50:51.960
So, even... let's say, if something is
truly random, it might be nice, but then
00:50:51.960 --> 00:50:57.170
it's also quite complicated to implement.
I mean, you probably don't want a random
00:50:57.170 --> 00:51:01.480
number generator just for the cache.
M: Okay. Thanks.
00:51:01.480 --> 00:51:05.980
H: Okay, and then the three guys here on
the microphone in the front.
00:51:05.980 --> 00:51:13.450
M: My question is about a detail with the
keylogger. You could distinguish between
00:51:13.450 --> 00:51:18.150
space, backspace and alphabet, which is
quite interesting. But could you also
00:51:18.150 --> 00:51:22.320
figure out the specific keys that were
pressed, and if so, how?
00:51:22.320 --> 00:51:25.650
ML: Yeah, that depends on the
implementation of the keyboard. But what
00:51:25.650 --> 00:51:29.310
we did, we used the Android stock
keyboard, which is shipped with the
00:51:29.310 --> 00:51:34.520
Samsung, so it's pre-installed. And if you
have a table somewhere in your code, which
00:51:34.520 --> 00:51:39.540
says "Okay, if you press this exact
location or this image, it's an A or it's
00:51:39.540 --> 00:51:44.450
an B", then you can also do a more
sophisticated attack. So, if you find any
00:51:44.450 --> 00:51:49.050
functions or data in the code, which
directly tells you "Okay, this is this
00:51:49.050 --> 00:51:54.520
character," you can also spy on the actual
key characters on the keyboard.
00:51:54.520 --> 00:52:02.900
M: Thank you.
M: Hi. Thank you for your talk. My first
00:52:02.900 --> 00:52:08.570
question is: What can we actually do now,
to mitigate this kind of attack? By, for
00:52:08.570 --> 00:52:11.980
example switching off TSX or using ECC
RAM.
00:52:11.980 --> 00:52:17.410
CM: So, I think the very important thing
to protect would be, like crypto, and the
00:52:17.410 --> 00:52:20.840
good thing is that today we know how to
build crypto that is resistant to side-
00:52:20.840 --> 00:52:24.490
channel attacks. So the good thing would
be to stop improving implementation that
00:52:24.490 --> 00:52:31.360
are known to be vulnerable for 10 years.
Then things like keystrokes is way harder
00:52:31.360 --> 00:52:36.830
to protect, so let's say crypto is
manageable; the whole system is clearly
00:52:36.830 --> 00:52:41.490
another problem. And you can have
different types of countermeasure on the
00:52:41.490 --> 00:52:45.780
hardware side, but it would mean that
Intel and ARM actually want to fix that,
00:52:45.780 --> 00:52:48.560
and that they know how to fix that. I
don't even know how to fix that in
00:52:48.560 --> 00:52:55.500
hardware. Then on the system side, if you
prevent some kind of memory sharing, you
00:52:55.500 --> 00:52:58.540
don't have Flush+Reload involved anymore,
and Prime+Probe probably is much
00:52:58.540 --> 00:53:04.880
noisier, so it would be an improvement.
M: Thank you.
00:53:04.880 --> 00:53:11.880
H: Do we have signal angel questions? No.
OK, then more microphone.
00:53:11.880 --> 00:53:16.630
M: Hi, thank you. I wanted to ask about
the way you establish the side-channel
00:53:16.630 --> 00:53:23.280
between the two processes, because it
would obviously have to be timed in a way to
00:53:23.280 --> 00:53:28.511
transmit information between one process
to the other. Is there anywhere that you
00:53:28.511 --> 00:53:32.970
documented the whole thing? You know, it's
actually almost like the seven layers or
00:53:32.970 --> 00:53:36.580
something like that. There are any ways
that you documented that? It would be
00:53:36.580 --> 00:53:40.260
really interesting to know how it worked.
ML: You can find this information in the
00:53:40.260 --> 00:53:46.120
paper because there are several papers on
covert channels using that, so the NDSS
00:53:46.120 --> 00:53:51.300
paper will be published in February, I guess,
but the Armageddon paper also includes
00:53:51.300 --> 00:53:55.670
a covert channel, and you can
find more information about how the
00:53:55.670 --> 00:53:59.320
packets look like and how the
synchronization works in the paper.
00:53:59.320 --> 00:54:04.020
M: Thank you.
H: One last question?
00:54:04.020 --> 00:54:09.750
M: Hi! You mentioned that you used Osvik's
attack for the AES side-channel attack.
00:54:09.750 --> 00:54:17.350
Did you solve the AES round detection, or
did you do it with some scheduler
00:54:17.350 --> 00:54:21.441
manipulation?
CM: So on this one I think we only did
00:54:21.441 --> 00:54:24.280
some synchronous attack, so we already
knew when
00:54:24.280 --> 00:54:27.770
the victim is going to be scheduled and
we didn't have anything to do with
00:54:27.770 --> 00:54:32.930
schedulers.
M: Alright, thank you.
00:54:32.930 --> 00:54:37.140
H: Are there any more questions? No, I
don't see anyone. Then, thank you very
00:54:37.140 --> 00:54:39.132
much again to our speakers.
00:54:39.132 --> 00:54:42.162
applause
00:54:42.162 --> 00:54:58.970
music
00:54:58.970 --> 00:55:06.000
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!