1 00:00:00,000 --> 00:00:13,245 Music 2 00:00:13,245 --> 00:00:17,060 Herald Angel: We are here with a motto, and the motto of this year is "Works For 3 00:00:17,060 --> 00:00:21,670 Me" and I think, who many people, how many people in here are programmmers? Raise 4 00:00:21,670 --> 00:00:28,700 your hands or shout or... Whoa, that's a lot. Okay. So I think many of you will 5 00:00:28,700 --> 00:00:38,990 work on x86. Yeah. And I think you assume that it works, and that everything works 6 00:00:38,990 --> 00:00:48,150 as intended And I mean: What could go wrong? Our next talk, the first one today, 7 00:00:48,150 --> 00:00:52,290 will be by Clémentine Maurice, who previously was here with RowhammerJS, 8 00:00:52,290 --> 00:01:01,740 something I would call scary, and Moritz Lipp, who has worked on the Armageddon 9 00:01:01,740 --> 00:01:09,820 exploit, back, what is it? Okay, so the next... I would like to hear a really warm 10 00:01:09,820 --> 00:01:14,460 applause for the speakers for the talk "What could what could possibly go wrong 11 00:01:14,460 --> 00:01:17,280 with insert x86 instruction here?" 12 00:01:17,280 --> 00:01:18,375 thank you. 13 00:01:18,375 --> 00:01:28,290 Applause 14 00:01:28,290 --> 00:01:32,530 Clémentine Maurice (CM): Well, thank you all for being here this morning. Yes, this 15 00:01:32,530 --> 00:01:38,080 is our talk "What could possibly go wrong with insert x86 instructions here". So 16 00:01:38,080 --> 00:01:42,850 just a few words about ourselves: So I'm Clémentine Maurice, I got my PhD last year 17 00:01:42,850 --> 00:01:47,119 in computer science and I'm now working as a postdoc at Graz University of Technology 18 00:01:47,119 --> 00:01:52,090 in Austria. You can reach me on Twitter or by email but there's also I think a lots 19 00:01:52,090 --> 00:01:56,670 of time before the Congress is over. Moritz Lipp (ML): Hi and my name is Moritz 20 00:01:56,670 --> 00:02:01,520 Lipp, I'm a PhD student at Graz University of Technology and you can also reach me on 21 00:02:01,520 --> 00:02:06,679 Twitter or just after our talk and in the next days. 22 00:02:06,679 --> 00:02:10,860 CM: So, about this talk: So, the title says this is a talk about x86 23 00:02:10,860 --> 00:02:17,720 instructions, but this is not a talk about software. Don't leave yet! I'm actually 24 00:02:17,720 --> 00:02:22,440 even assuming safe software and the point that we want to make is that safe software 25 00:02:22,440 --> 00:02:27,390 does not mean safe execution and we have information leakage because of the 26 00:02:27,390 --> 00:02:32,560 underlying hardware and this is what we're going to talk about today. So we'll be 27 00:02:32,560 --> 00:02:36,819 talking about cache attacks, what are they, what can we do with that and also a 28 00:02:36,819 --> 00:02:41,510 special kind of cache attack that we found this year. So... doing cache attacks 29 00:02:41,510 --> 00:02:48,590 without memory accesses and how to use that even to bypass kernel ASLR. 30 00:02:48,590 --> 00:02:53,129 So again, the talk says is to talk about x86 instructions but this is even more 31 00:02:53,129 --> 00:02:58,209 global than that. We can also mount these cache attacks on ARM and not only on the 32 00:02:58,209 --> 00:03:07,050 x86. So some of the examples that you will see also applies to ARM. So today we'll do 33 00:03:07,050 --> 00:03:11,420 have a bit of background, but actually most of the background will be along the 34 00:03:11,420 --> 00:03:19,251 lines because this covers really a huge chunk of our research, and we'll see 35 00:03:19,251 --> 00:03:24,209 mainly three instructions: So "mov" and how we can perform these cache attacks, 36 00:03:24,209 --> 00:03:29,430 what are they... The instruction "clflush", so here we'll be doing cache 37 00:03:29,430 --> 00:03:36,370 attacks without any memory accesses. Then we'll see "prefetch" and how we can bypass 38 00:03:36,370 --> 00:03:43,420 kernel ASLR and lots of translations levels, and then there's even a bonus 39 00:03:43,420 --> 00:03:48,950 track, so it's this this will be not our works, but even more instructions and even 40 00:03:48,950 --> 00:03:54,210 more text. Okay, so let's start with a bit of an 41 00:03:54,210 --> 00:04:01,190 introduction. So we will be mainly focusing on Intel CPUs, and this is 42 00:04:01,190 --> 00:04:05,599 roughly in terms of cores and caches, how it looks like today. So we have different 43 00:04:05,599 --> 00:04:09,440 levels of cores ...uh... different cores so here four cores, and different levels 44 00:04:09,440 --> 00:04:14,220 of caches. So here usually we have three levels of caches. We have level 1 and 45 00:04:14,220 --> 00:04:18,269 level 2 that are private to each call, which means that core 0 can only access 46 00:04:18,269 --> 00:04:24,520 its level 1 and its level 2 and not level 1 and level 2 of, for example, core 3, and 47 00:04:24,520 --> 00:04:30,130 we have the last level cache... so here if you can see the pointer... So this one is 48 00:04:30,130 --> 00:04:36,289 divided in slices so we have as many slices as cores, so here 4 slices, but all 49 00:04:36,289 --> 00:04:40,659 the slices are shared across core so core 0 can access the whole last level cache, 50 00:04:40,659 --> 00:04:48,669 that's 0 1 2 & 3. We also have a nice property on Intel CPUs is that this level 51 00:04:48,669 --> 00:04:52,280 of cache is inclusive, and what it means is that everything that is contained in 52 00:04:52,280 --> 00:04:56,889 level 1 and level 2 will also be contained in the last level cache, and this will 53 00:04:56,889 --> 00:05:01,439 prove to be quite useful for cache attacks. 54 00:05:01,439 --> 00:05:08,430 So today we mostly have set associative caches. What it means is that we have data 55 00:05:08,430 --> 00:05:13,249 that is loaded in specific sets and that depends only on its address. So we have 56 00:05:13,249 --> 00:05:18,900 some bits of the address that gives us the index and that says "Ok the line is going 57 00:05:18,900 --> 00:05:24,610 to be loaded in this cache set", so this is a cache set. Then we have several ways 58 00:05:24,610 --> 00:05:30,629 per set so here we have 4 ways and the cacheine is going to be loaded in a 59 00:05:30,629 --> 00:05:35,270 specific way and that will only depend on the replacement policy and not on the 60 00:05:35,270 --> 00:05:40,800 address itself, so when you load a line into the cache, usually the cache is 61 00:05:40,800 --> 00:05:44,830 already full and you have to make room for a new line. So this is where the 62 00:05:44,830 --> 00:05:49,729 replacement replacement policy—this is what it does—it says ok I'm going to 63 00:05:49,729 --> 00:05:57,779 remove this line to make room for the next line. So for today we're going to see only 64 00:05:57,779 --> 00:06:01,960 three instruction as I've been telling you. So the move instruction, it does a 65 00:06:01,960 --> 00:06:06,610 lot of things but the only aspect that we're interested in about it that can 66 00:06:06,610 --> 00:06:12,809 access data in the main memory. We're going to see a clflush what it does 67 00:06:12,809 --> 00:06:18,349 is that it removes a cache line from the cache, from the whole cache. And we're 68 00:06:18,349 --> 00:06:25,569 going to see prefetch, it prefetches a cache line for future use. So we're going 69 00:06:25,569 --> 00:06:30,520 to see what they do and the kind of side effects that they have and all the attacks 70 00:06:30,520 --> 00:06:34,800 that we can do with them. And that's basically all the example you need for 71 00:06:34,800 --> 00:06:39,830 today so even if you're not an expert of x86 don't worry it's not just slides full 72 00:06:39,830 --> 00:06:44,899 of assembly and stuff. Okay so on to the first one. 73 00:06:44,899 --> 00:06:49,940 ML: So we will first start with the 'mov' instruction and actually the first slide 74 00:06:49,940 --> 00:06:57,809 is full of code, however as you can see the mov instruction is used to move data 75 00:06:57,809 --> 00:07:02,629 from registers to registers, from the main memory and back to the main memory and as 76 00:07:02,629 --> 00:07:07,240 you can see there are many moves you can use but basically it's just to move data 77 00:07:07,240 --> 00:07:12,589 and that's all we need to know. In addition, a lot of exceptions can occur so 78 00:07:12,589 --> 00:07:18,139 we can assume that those restrictions are so tight that nothing can go wrong when 79 00:07:18,139 --> 00:07:22,210 you just move data because moving data is simple. 80 00:07:22,210 --> 00:07:27,879 However while there are a lot of exceptions the data that is accessed is 81 00:07:27,879 --> 00:07:35,009 always loaded into the cache, so data is in the cache and this is transparent to 82 00:07:35,009 --> 00:07:40,870 the program that is running. However, there are side-effects when you run these 83 00:07:40,870 --> 00:07:46,219 instructions, and we will see how they look like with the mov instruction. So you 84 00:07:46,219 --> 00:07:51,289 probably all know that data can either be in CPU registers, in the different levels 85 00:07:51,289 --> 00:07:56,029 of the cache that Clementine showed to you earlier, in the main memory, or on the 86 00:07:56,029 --> 00:08:02,219 disk, and depending on where the memory and the data is located it needs a longer 87 00:08:02,219 --> 00:08:09,689 time to be loaded back to the CPU, and this is what we can see in this plot. So 88 00:08:09,689 --> 00:08:15,739 we try here to measure the access time of an address over and over again, assuming 89 00:08:15,739 --> 00:08:21,759 that when we access it more often, it is already stored in the cache. So around 70 90 00:08:21,759 --> 00:08:27,289 cycles, most of the time we can assume when we load an address and it takes 70 91 00:08:27,289 --> 00:08:34,809 cycles, it's loaded into the cache. However, when we assume that the data is 92 00:08:34,809 --> 00:08:39,659 loaded from the main memory, we can clearly see that it needs a much longer 93 00:08:39,659 --> 00:08:46,720 time like a bit more than 200 cycles. So depending when we measure the time it 94 00:08:46,720 --> 00:08:51,470 takes to load the address we can say the data has been loaded to the cache or the 95 00:08:51,470 --> 00:08:58,339 data is still located in the main memory. And this property is what we can exploit 96 00:08:58,339 --> 00:09:05,339 using cache attacks. So we measure the timing differences on memory accesses. And 97 00:09:05,339 --> 00:09:09,940 what an attacker does he monitors the cache lines, but he has no way to know 98 00:09:09,940 --> 00:09:14,459 what's actually the content of the cache line. So we can only monitor that this 99 00:09:14,459 --> 00:09:20,099 cache line has been accessed and not what's actually stored in the cache line. 100 00:09:20,099 --> 00:09:24,411 And what you can do with this is you can implement covert channels, so you can 101 00:09:24,411 --> 00:09:29,580 allow two processes to communicate with each other evading the permission system 102 00:09:29,580 --> 00:09:35,060 what we will see later on. In addition you can also do side channel attacks, so you 103 00:09:35,060 --> 00:09:40,600 can spy with a malicious attacking application on benign processes, and you 104 00:09:40,600 --> 00:09:46,140 can use this to steal cryptographic keys or to spy on keystrokes. 105 00:09:46,140 --> 00:09:53,649 And basically we have different types of cache attacks and I want to explain the 106 00:09:53,649 --> 00:09:58,810 most popular one, the "Flush+Reload" attack, in the beginning. So on the left, 107 00:09:58,810 --> 00:10:03,110 you have the address space of the victim, and on the right you have the address 108 00:10:03,110 --> 00:10:08,560 space of the attacker who maps a shared library—an executable—that the victim is 109 00:10:08,560 --> 00:10:14,899 using in to its own address space, like the red rectangle. And this means that 110 00:10:14,899 --> 00:10:22,760 when this data is stored in the cache, it's cached for both processes. Now the 111 00:10:22,760 --> 00:10:28,170 attacker can use the flush instruction to remove the data out of the cache, so it's 112 00:10:28,170 --> 00:10:34,420 not in the cache anymore, so it's also not cached for the victim. Now the attacker 113 00:10:34,420 --> 00:10:39,100 can schedule the victim and if the victim decides "yeah, I need this data", it will 114 00:10:39,100 --> 00:10:44,970 be loaded back into the cache. And now the attacker can reload the data, measure the 115 00:10:44,970 --> 00:10:49,661 time how long it took, and then decide "okay, the victim has accessed the data in 116 00:10:49,661 --> 00:10:54,179 the meantime" or "the victim has not accessed the data in the meantime." And by 117 00:10:54,179 --> 00:10:58,959 that you can spy if this address has been used. 118 00:10:58,959 --> 00:11:03,240 The second type of attack is called "Prime+Probe" and it does not rely on the 119 00:11:03,240 --> 00:11:08,971 shared memory like the "Flush+Reload" attack, and it works as following: Instead 120 00:11:08,971 --> 00:11:16,139 of mapping anything into its own address space, the attacker loads a lot of data 121 00:11:16,139 --> 00:11:24,589 into one cache line, here, and fills the cache. Now he again schedules the victim 122 00:11:24,589 --> 00:11:31,820 and the schedule can access data that maps to the same cache set. 123 00:11:31,820 --> 00:11:38,050 So the cache set is used by the attacker and the victim at the same time. Now the 124 00:11:38,050 --> 00:11:43,050 attacker can start measuring the access time to the addresses he loaded into the 125 00:11:43,050 --> 00:11:49,050 cache before, and when he accesses an address that is still in the cache it's 126 00:11:49,050 --> 00:11:55,649 faster so he measures the lower time. And if it's not in the cache anymore it has to 127 00:11:55,649 --> 00:12:01,279 be reloaded into the cache so it takes a longer time. He can sum this up and detect 128 00:12:01,279 --> 00:12:07,870 if the victim has loaded data into the cache as well. So the first thing we want 129 00:12:07,870 --> 00:12:11,900 to show you is what you can do with cache attacks is you can implement a covert 130 00:12:11,900 --> 00:12:17,439 channel and this could be happening in the following scenario. 131 00:12:17,439 --> 00:12:23,610 You install an app on your phone to view your favorite images you take, to apply 132 00:12:23,610 --> 00:12:28,630 some filters, and in the end you don't know that it's malicious because the only 133 00:12:28,630 --> 00:12:33,609 permission it requires is to access your images which makes sense. So you can 134 00:12:33,609 --> 00:12:38,700 easily install it without any fear. In addition you want to know what the weather 135 00:12:38,700 --> 00:12:43,040 is outside, so you install a nice little weather widget, and the only permission it 136 00:12:43,040 --> 00:12:48,230 has is to access the internet because it has to load the information from 137 00:12:48,230 --> 00:12:55,569 somewhere. So what happens if you're able to implement a covert channel between two 138 00:12:55,569 --> 00:12:59,779 these two applications, without any permissions and privileges so they can 139 00:12:59,779 --> 00:13:05,060 communicate with each other without using any mechanisms provided by the operating 140 00:13:05,060 --> 00:13:11,149 system, so it's hidden. It can happen that now the gallery app can send the image to 141 00:13:11,149 --> 00:13:18,680 the internet, it will be uploaded and exposed for everyone. So maybe you don't 142 00:13:18,680 --> 00:13:25,610 want to see the cat picture everywhere. While we can do this with those 143 00:13:25,610 --> 00:13:30,219 Prime+Probe/ Flush+Reload attacks, we will discuss a covert channel using 144 00:13:30,219 --> 00:13:35,690 Prime+Probe. So how can we transmit this data? We need to transmit ones and zeros 145 00:13:35,690 --> 00:13:40,980 at some point. So the sender and the receiver agree on one cache set that they 146 00:13:40,980 --> 00:13:49,319 both use. The receiver probes the set all the time. When the sender wants to 147 00:13:49,319 --> 00:13:57,529 transmit a zero he just does nothing, so the lines of the receiver are in the cache 148 00:13:57,529 --> 00:14:01,809 all the time, and he knows "okay, he's sending nothing", so it's a zero. 149 00:14:01,809 --> 00:14:05,940 On the other hand if the sender wants to transmit a one, he starts accessing 150 00:14:05,940 --> 00:14:10,800 addresses that map to the same cache set so it will take a longer time for the 151 00:14:10,800 --> 00:14:16,540 receiver to access its addresses again, and he knows "okay, the sender just sent 152 00:14:16,540 --> 00:14:23,059 me a one", and Clementine will show you what you can do with this covert channel. 153 00:14:23,059 --> 00:14:25,180 CM: So the really nice thing about 154 00:14:25,180 --> 00:14:28,959 Prime+Probe is that it has really low requirements. It doesn't need any kind of 155 00:14:28,959 --> 00:14:34,349 shared memory. For example if you have two virtual machines you could have some 156 00:14:34,349 --> 00:14:38,700 shared memory via memory deduplication. The thing is that this is highly insecure, 157 00:14:38,700 --> 00:14:43,969 so cloud providers like Amazon ec2, they disable that. Now we can still use 158 00:14:43,969 --> 00:14:50,429 Prime+Probe because it doesn't need this shared memory. Another problem with cache 159 00:14:50,429 --> 00:14:54,999 covert channels is that they are quite noisy. So when you have other applications 160 00:14:54,999 --> 00:14:59,259 that are also running on the system, they are all competing for the cache and they 161 00:14:59,259 --> 00:15:03,009 might, like, evict some cache lines, especially if it's an application that is 162 00:15:03,009 --> 00:15:08,749 very memory intensive. And you also have noise due to the fact that the sender and 163 00:15:08,749 --> 00:15:12,770 the receiver might not be scheduled at the same time. So if you have your sender that 164 00:15:12,770 --> 00:15:16,649 sends all the things and the receiver is not scheduled then some part of the 165 00:15:16,649 --> 00:15:22,539 transmission can get lost. So what we did is we tried to build an error-free covert 166 00:15:22,539 --> 00:15:30,829 channel. We took care of all these noise issues by using some error detection to 167 00:15:30,829 --> 00:15:36,470 resynchronize the sender and the receiver and then we use some error correction to 168 00:15:36,470 --> 00:15:40,779 correct the remaining errors. So we managed to have a completely error- 169 00:15:40,779 --> 00:15:46,069 free covert channel even if you have a lot of noise, so let's say another virtual 170 00:15:46,069 --> 00:15:54,119 machine also on the machine serving files through a web server, also doing lots of 171 00:15:54,119 --> 00:16:01,600 memory-intensive tasks at the same time, and the covert channel stayed completely 172 00:16:01,600 --> 00:16:07,610 error-free, and around 40 to 75 kilobytes per second, which is still quite a lot. 173 00:16:07,610 --> 00:16:14,470 All of this is between virtual machines on Amazon ec2. And the really neat thing—we 174 00:16:14,470 --> 00:16:19,389 wanted to do something with that—and basically we managed to create an SSH 175 00:16:19,389 --> 00:16:27,060 connection really over the cache. So they don't have any network between 176 00:16:27,060 --> 00:16:31,439 them, but just we are sending the zeros and the ones and we have an SSH connection 177 00:16:31,439 --> 00:16:36,839 between them. So you could say that cache covert channels are nothing, but I think 178 00:16:36,839 --> 00:16:43,079 it's a real threat. And if you want to have more details about this work in 179 00:16:43,079 --> 00:16:49,220 particular, it will be published soon at NDSS. 180 00:16:49,220 --> 00:16:54,040 So the second application that we wanted to show you is that we can attack crypto 181 00:16:54,040 --> 00:17:01,340 with cache attacks. In particular we are going to show an attack on AES and a 182 00:17:01,340 --> 00:17:04,990 special implementation of AES that uses T-Tables. so that's the fast software 183 00:17:04,990 --> 00:17:11,650 implementation because it uses some precomputed lookup tables. It's known to 184 00:17:11,650 --> 00:17:17,490 be vulnerable to side-channel attacks since 2006 by Osvik et al, and it's a one- 185 00:17:17,490 --> 00:17:24,110 round known plaintext attack, so you have p—or plaintext—and k, your secret key. And 186 00:17:24,110 --> 00:17:29,570 the AES algorithm, what it does is compute an intermediate state at each round r. 187 00:17:29,570 --> 00:17:38,559 And in the first round, the accessed table indices are just p XOR k. Now it's a known 188 00:17:38,559 --> 00:17:43,500 plaintext attack, what this means is that if you can recover the accessed table 189 00:17:43,500 --> 00:17:49,460 indices you've also managed to recover the key because it's just XOR. So that would 190 00:17:49,460 --> 00:17:55,450 be bad, right, if we could recover these accessed table indices. Well we can, with 191 00:17:55,450 --> 00:18:00,510 cache attacks! So we did that with Flush+Reload and with Prime+Probe. On the 192 00:18:00,510 --> 00:18:05,809 x-axis you have the plaintext byte values and on the y-axis you have the addresses 193 00:18:05,809 --> 00:18:15,529 which are essentially the T table entries. So a black cell means that we've monitored 194 00:18:15,529 --> 00:18:19,970 the cache line, and we've seen a lot of cache hits. So basically the blacker it 195 00:18:19,970 --> 00:18:25,650 is, the more certain we are that the T-Table entry has been accessed. And here 196 00:18:25,650 --> 00:18:31,779 it's a toy example, the key is all-zeros, but you would basically just have a 197 00:18:31,779 --> 00:18:35,700 different pattern if the key was not all- zeros, and as long as you can see this 198 00:18:35,700 --> 00:18:43,409 nice diagonal or a pattern then you have recovered the key. So it's an old attack, 199 00:18:43,409 --> 00:18:48,890 2006, it's been 10 years, everything should be fixed by now, and you see where 200 00:18:48,890 --> 00:18:56,880 I'm going: it's not. So on Android the bouncy castle implementation it uses by 201 00:18:56,880 --> 00:19:03,360 default the T-table, so that's bad. Also many implementations that you can find 202 00:19:03,360 --> 00:19:11,380 online uses pre-computed values, so maybe be wary about this kind of attacks. The 203 00:19:11,380 --> 00:19:17,240 last application we wanted to show you is how we can spy on keystrokes. 204 00:19:17,240 --> 00:19:21,480 So for that we will use flush and reload because it's a really fine grained 205 00:19:21,480 --> 00:19:26,309 attack. We can see very precisely which cache line has been accessed, and a cache 206 00:19:26,309 --> 00:19:31,440 line is only 64 bytes so it's really not a lot and we're going to use that to spy on 207 00:19:31,440 --> 00:19:37,690 keystrokes and we even have a small demo for you. 208 00:19:40,110 --> 00:19:45,640 ML: So what you can see on the screen this is not on Intel x86 it's on a smartphone, 209 00:19:45,640 --> 00:19:50,330 on the Galaxy S6, but you can also apply these cache attacks there so that's what 210 00:19:50,330 --> 00:19:53,850 we want to emphasize. So on the left you see the screen and on 211 00:19:53,850 --> 00:19:57,960 the right we have connected a shell with no privileges and permissions, so it can 212 00:19:57,960 --> 00:20:00,799 basically be an app that you install glass bottle falling 213 00:20:00,799 --> 00:20:09,480 from the App Store and on the right we are going to start our spy tool, and on the 214 00:20:09,480 --> 00:20:14,110 left we just open the messenger app and whenever the user hits any key on the 215 00:20:14,110 --> 00:20:19,690 keyboard our spy tool takes care of that and notices that. Also if he presses the 216 00:20:19,690 --> 00:20:26,120 spacebar we can also measure that. If the user decides "ok, I want to delete the 217 00:20:26,120 --> 00:20:30,880 word" because he changed his mind, we can also register if the user pressed the 218 00:20:30,880 --> 00:20:37,929 backspace button, so in the end we can see exactly how long the words were, the user 219 00:20:37,929 --> 00:20:45,630 typed into his phone without any permissions and privileges, which is bad. 220 00:20:45,630 --> 00:20:55,250 laughs applause 221 00:20:55,250 --> 00:21:00,320 ML: so enough about the mov instruction, let's head to clflush. 222 00:21:00,320 --> 00:21:07,230 CM: So the clflush instruction: What it does is that it invalidates from every 223 00:21:07,230 --> 00:21:12,309 level the cache line that contains the address that you pass to this instruction. 224 00:21:12,309 --> 00:21:16,990 So in itself it's kind of bad because it enables the Flush+Reload attacks that we 225 00:21:16,990 --> 00:21:21,300 showed earlier, that was just flush, reload, and the flush part is done with 226 00:21:21,300 --> 00:21:29,140 clflush. But there's actually more to it, how wonderful. So there's a first timing 227 00:21:29,140 --> 00:21:33,320 leakage with it, so we're going to see that the clflush instruction has a 228 00:21:33,320 --> 00:21:37,890 different timing depending on whether the data that you that you pass to it is 229 00:21:37,890 --> 00:21:44,710 cached or not. So imagine you have a cache line that is on the level 1 by inclu... 230 00:21:44,710 --> 00:21:50,299 With the inclusion property it has to be also in the last level cache. Now this is 231 00:21:50,299 --> 00:21:54,350 quite convenient and this is also why we have this inclusion property for 232 00:21:54,350 --> 00:22:00,019 performance reason on Intel CPUs, if you want to see if a line is present at all in 233 00:22:00,019 --> 00:22:04,209 the cache you just have to look in the last level cache. So this is basically 234 00:22:04,209 --> 00:22:08,010 what the clflush instruction does. It goes to the last last level cache, sees "ok 235 00:22:08,010 --> 00:22:12,890 there's a line, I'm going to flush this one" and then there's something that tells 236 00:22:12,890 --> 00:22:18,950 ok the line is also present somewhere else so then it flushes the line in level 1 237 00:22:18,950 --> 00:22:26,390 and/or level 2. So that's slow. Now if you perform clflush on some data that is not 238 00:22:26,390 --> 00:22:32,240 cached, basically it does the same, goes to the last level cache, see that there's 239 00:22:32,240 --> 00:22:36,659 no line and there can't be any... This data can't be anywhere else in the cache 240 00:22:36,659 --> 00:22:41,269 because it would be in the last level cache if it was anywhere, so it does 241 00:22:41,269 --> 00:22:47,430 nothing and it stop there. So that's fast. So how exactly fast and slow am I talking 242 00:22:47,430 --> 00:22:53,760 about? So it's actually only a very few cycles so we did this experiments on 243 00:22:53,760 --> 00:22:59,072 different microarchitecture so Center Bridge, Ivy Bridge, and Haswell and... 244 00:22:59,072 --> 00:23:03,250 So it different colors correspond to the different microarchitecture. So first 245 00:23:03,250 --> 00:23:07,880 thing that is already... kinda funny is that you can see that you can distinguish 246 00:23:07,880 --> 00:23:14,649 the micro architecture quite nicely with this, but the real point is that you have 247 00:23:14,649 --> 00:23:20,280 really a different zones. The solids... The solid line is when we performed the 248 00:23:20,280 --> 00:23:25,200 measurement on clflush with the line that was already in the cache, and the dashed 249 00:23:25,200 --> 00:23:30,840 line is when the line was not in the cache, and in all microarchitectures you 250 00:23:30,840 --> 00:23:36,539 can see that we can see a difference: It's only a few cycles, it's a bit noisy, so 251 00:23:36,539 --> 00:23:43,250 what could go wrong? Okay, so exploiting these few cycles, we still managed to 252 00:23:43,250 --> 00:23:47,029 perform a new cache attacks that we call "Flush+Flush", so I'm going to explain 253 00:23:47,029 --> 00:23:52,220 that to you: So basically everything that we could do with "Flush+Reload", we can 254 00:23:52,220 --> 00:23:56,899 also do with "Flush+Flush". We can perform cover channels and sidechannel attacks. 255 00:23:56,899 --> 00:24:01,090 It's stealthier than previous cache attacks, I'm going to go back on this one, 256 00:24:01,090 --> 00:24:07,220 and it's also faster than previous cache attacks. So how does it work exactly? So 257 00:24:07,220 --> 00:24:12,210 the principle is a bit similar to "Flush+Reload": So we have the attacker 258 00:24:12,210 --> 00:24:16,131 and the victim that have some kind of shared memory, let's say a shared library. 259 00:24:16,131 --> 00:24:21,340 It will be shared in the cache The attacker will start by flushing the cache 260 00:24:21,340 --> 00:24:26,510 line then let's the victim perform whatever it does, let's say encryption, 261 00:24:26,510 --> 00:24:32,120 the victim will load some data into the cache, automatically, and now the attacker 262 00:24:32,120 --> 00:24:36,720 wants to know again if the victim accessed this precise cache line and instead of 263 00:24:36,720 --> 00:24:43,540 reloading it is going to flush it again. And since we have this timing difference 264 00:24:43,540 --> 00:24:47,040 depending on whether the data is in the cache or not, it gives us the same 265 00:24:47,040 --> 00:24:54,889 information as if we reloaded it, except it's way faster. So I talked about 266 00:24:54,889 --> 00:24:59,690 stealthiness. So the thing is that basically these cache attacks and that 267 00:24:59,690 --> 00:25:06,340 also applies to "Rowhammer": They are already stealth in themselves, because 268 00:25:06,340 --> 00:25:10,470 there's no antivirus today that can detect them. but some people thought that we 269 00:25:10,470 --> 00:25:14,351 could detect them with performance counters because they do a lot of cache 270 00:25:14,351 --> 00:25:18,549 misses and cache references that happen when the data is flushed and when you 271 00:25:18,549 --> 00:25:26,090 reaccess memory. now what we thought is that yeah but that also not the only 272 00:25:26,090 --> 00:25:31,269 program steps to lots of cache misses and cache references so we would like to have 273 00:25:31,269 --> 00:25:38,120 a slightly parametric. So these cache attacks they have a very heavy activity on 274 00:25:38,120 --> 00:25:43,840 the cache but they're also very particular because there are very short loops of code 275 00:25:43,840 --> 00:25:48,610 if you take flush and reload this just flush one line reload the line and then 276 00:25:48,610 --> 00:25:53,750 again flush reload that's very short loop and that creates a very low pressure on 277 00:25:53,750 --> 00:26:01,490 the instruction therapy which is kind of particular for of cache attacks so what we 278 00:26:01,490 --> 00:26:05,380 decided to do is normalizing the cache even so the cache misses and cache 279 00:26:05,380 --> 00:26:10,720 references by events that have to do with the instruction TLB and there we could 280 00:26:10,720 --> 00:26:19,360 manage to detect cache attacks and Rowhammer without having false positives 281 00:26:19,360 --> 00:26:24,510 so the system metric that I'm going to use when I talk about stealthiness so we 282 00:26:24,510 --> 00:26:29,750 started by creating a cover channel. First we wanted to have it as fast as possible 283 00:26:29,750 --> 00:26:36,160 so we created a protocol to evaluates all the kind of cache attack that we had so 284 00:26:36,160 --> 00:26:40,540 flush+flush, flush+reload, and prime+probe and we started with a 285 00:26:40,540 --> 00:26:47,010 packet side of 28 doesn't really matter. We measured the capacity of our covert 286 00:26:47,010 --> 00:26:52,799 channel and flush+flush is around 500 kB/s whereas Flush+Reload 287 00:26:52,799 --> 00:26:56,340 was only 300 kB/s so Flush+Flush is already quite an 288 00:26:56,340 --> 00:27:00,740 improvement on the speed. Then we measured the stealth zone at this 289 00:27:00,740 --> 00:27:06,100 speed only Flush+Flush was stealth and now the thing is that Flush+Flush and 290 00:27:06,100 --> 00:27:10,200 Flush+Reload as you've seen there are some similarities so for a covert channel 291 00:27:10,200 --> 00:27:15,309 they also share the same center on it is receivers different and for this one the 292 00:27:15,309 --> 00:27:20,000 center was not stealth for both of them anyway if you want a fast covert channel 293 00:27:20,000 --> 00:27:26,640 then just try flush+flush that works. Now let's try to make it stealthy 294 00:27:26,640 --> 00:27:30,639 completely stealthy because if I have the standard that is not stealth maybe that we 295 00:27:30,639 --> 00:27:36,440 give away the whole attack so we said okay maybe if we just slow down all the attacks 296 00:27:36,440 --> 00:27:41,240 then there will be less cache hits, cache misses and then maybe all 297 00:27:41,240 --> 00:27:48,070 the attacks are actually stealthy why not? So we tried that we slowed down everything 298 00:27:48,070 --> 00:27:52,889 so Flush+Reload and Flash+Flash are around 50 kB/s now 299 00:27:52,889 --> 00:27:55,829 Prime+Probe is a bit slower because it takes more time 300 00:27:55,829 --> 00:28:01,330 to prime and probe anything but still 301 00:28:01,330 --> 00:28:09,419 even with this slow down only Flush+Flush has its receiver stealth and we also 302 00:28:09,419 --> 00:28:14,769 managed to have the sender stealth now so basically whether you want a fast covert 303 00:28:14,769 --> 00:28:20,450 channel or a stealth covert channel Flush+Flush is really great. 304 00:28:20,450 --> 00:28:26,500 Now we wanted to also evaluate if it wasn't too noisy to perform some side 305 00:28:26,500 --> 00:28:30,740 channel attack so we did these side channels on the AES t-table implementation 306 00:28:30,740 --> 00:28:35,910 the attacks that we have shown you earlier, so we computed a lot of 307 00:28:35,910 --> 00:28:41,820 encryption that we needed to determine the upper four bits of a key bytes so here the 308 00:28:41,820 --> 00:28:48,870 lower the better the attack and Flush + Reload is a bit better so we need only 250 309 00:28:48,870 --> 00:28:55,029 encryptions to recover these bits but Flush+Flush comes quite, comes quite 310 00:28:55,029 --> 00:29:00,570 close with 350 and Prime+Probe is actually the most noisy of them all, needs 311 00:29:00,570 --> 00:29:06,101 5... close to 5000 encryptions so we have around the same performance for 312 00:29:06,101 --> 00:29:13,520 Flush+Flush and Flush+Reload. Now let's evaluate the stealthiness again. 313 00:29:13,520 --> 00:29:19,320 So what we did here is we perform 256 billion encryptions in a synchronous 314 00:29:19,320 --> 00:29:25,740 attack so we really had the spy and the victim scattered and we evaluated the 315 00:29:25,740 --> 00:29:31,409 stealthiness of them all and here only Flush+Flush again is stealth. And while 316 00:29:31,409 --> 00:29:36,279 you can always slow down a covert channel you can't actually slow down a side 317 00:29:36,279 --> 00:29:40,700 channel because, in a real-life scenario, you're not going to say "Hey victim, him 318 00:29:40,700 --> 00:29:47,179 wait for me a bit, I am trying to do an attack here." That won't work. 319 00:29:47,179 --> 00:29:51,429 So there's even more to it but I will need again a bit of background before 320 00:29:51,429 --> 00:29:56,910 continuing. So I've shown you the different levels of caches and here I'm 321 00:29:56,910 --> 00:30:04,009 going to focus more on the last-level cache. So we have here our four slices so 322 00:30:04,009 --> 00:30:09,830 this is the last-level cache and we have some bits of the address here that 323 00:30:09,830 --> 00:30:14,330 corresponds to the set, but more importantly, we need to know where in 324 00:30:14,330 --> 00:30:19,899 which slice and address is going to be. And that is given, that is given by some 325 00:30:19,899 --> 00:30:23,850 bits of the set and the type of the address that are passed into a function 326 00:30:23,850 --> 00:30:27,960 that says in which slice the line is going to be. 327 00:30:27,960 --> 00:30:32,460 Now the thing is that this hash function is undocumented by Intel. Wouldn't be fun 328 00:30:32,460 --> 00:30:39,250 otherwise. So we have this: As many slices as core, an undocumented hash function 329 00:30:39,250 --> 00:30:43,980 that maps a physical address to a slice, and while it's actually a bit of a pain 330 00:30:43,980 --> 00:30:48,710 for attacks, it has, it was not designed for security originally but for 331 00:30:48,710 --> 00:30:53,570 performance, because you want all the access to be evenly distributed in the 332 00:30:53,570 --> 00:31:00,399 different slices, for performance reasons. So the hash function basically does, it 333 00:31:00,399 --> 00:31:05,279 takes some bits of the physical address and output k bits of slice, so just one 334 00:31:05,279 --> 00:31:09,309 bits if you have a two-core machine, two bits if you have a four-core machine and 335 00:31:09,309 --> 00:31:16,830 so on. Now let's go back to clflush, see what's the relation with that. 336 00:31:16,830 --> 00:31:21,169 So the thing that we noticed is that clflush is actually faster to reach a line 337 00:31:21,169 --> 00:31:28,549 on the local slice. So if you have, if you're flushing always 338 00:31:28,549 --> 00:31:33,340 one line and you run your program on core zero, core one, core two and core three, 339 00:31:33,340 --> 00:31:37,899 you will observe that one core in particular on, when you run the program on 340 00:31:37,899 --> 00:31:44,632 one core, the clflush is faster. And so here this is on core one, and you can see 341 00:31:44,632 --> 00:31:51,139 that core zero, two, and three it's, it's a bit slower and here we can deduce that, 342 00:31:51,139 --> 00:31:55,320 so we run the program on core one and we flush always the same line and we can 343 00:31:55,320 --> 00:32:01,850 deduce that the line belong to slice one. And what we can do with that is that we 344 00:32:01,850 --> 00:32:06,500 can map physical address to slices. And that's one way to reverse-engineer 345 00:32:06,500 --> 00:32:10,639 this addressing function that was not documented. 346 00:32:10,639 --> 00:32:15,880 Funnily enough that's not the only way: What I did before that was using the 347 00:32:15,880 --> 00:32:21,229 performance counters to reverse-engineer this function, but that's actually a whole 348 00:32:21,229 --> 00:32:27,770 other story and if you want more detail on that, there's also an article on that. 349 00:32:27,770 --> 00:32:30,139 ML: So the next instruction we want to 350 00:32:30,139 --> 00:32:35,110 talk about is the prefetch instruction. And the prefetch instruction is used to 351 00:32:35,110 --> 00:32:40,841 tell the CPU: "Okay, please load the data I need later on, into the cache, if you 352 00:32:40,841 --> 00:32:45,968 have some time." And in the end there are actually six different prefetch 353 00:32:45,968 --> 00:32:52,929 instructions: prefetcht0 to t2 which means: "CPU, please load the data into the 354 00:32:52,929 --> 00:32:58,640 first-level cache", or in the last-level cache, whatever you want to use, but we 355 00:32:58,640 --> 00:33:02,250 spare you the details because it's not so interesting in the end. 356 00:33:02,250 --> 00:33:06,940 However, what's more interesting is when we take a look at the Intel manual and 357 00:33:06,940 --> 00:33:11,880 what it says there. So, "Using the PREFETCH instruction is recommended only 358 00:33:11,880 --> 00:33:17,049 if data does not fit in the cache." So you can tell the CPU: "Please load data I want 359 00:33:17,049 --> 00:33:23,210 to stream into the cache, so it's more performant." "Use of software prefetch 360 00:33:23,210 --> 00:33:27,740 should be limited to memory addresses that are managed or owned within the 361 00:33:27,740 --> 00:33:33,620 application context." So one might wonder what happens if this 362 00:33:33,620 --> 00:33:40,940 address is not managed by myself. Sounds interesting. "Prefetching to addresses 363 00:33:40,940 --> 00:33:46,289 that are not mapped to physical pages can experience non-deterministic performance 364 00:33:46,289 --> 00:33:52,030 penalty. For example specifying a NULL pointer as an address for prefetch can 365 00:33:52,030 --> 00:33:56,000 cause long delays." So we don't want to do that because our 366 00:33:56,000 --> 00:34:02,919 program will be slow. So, let's take a look what they mean with non-deterministic 367 00:34:02,919 --> 00:34:08,889 performance penalty, because we want to write good software, right? But before 368 00:34:08,889 --> 00:34:12,510 that, we have to take a look at a little bit more background information to 369 00:34:12,510 --> 00:34:17,710 understand the attacks. So on modern operating systems, every 370 00:34:17,710 --> 00:34:22,850 application has its own virtual address space. So at some point, the CPU needs to 371 00:34:22,850 --> 00:34:27,479 translate these addresses to the physical addresses actually in the DRAM. And for 372 00:34:27,479 --> 00:34:33,690 that we have this very complex-looking data structure. So we have a 48-bit 373 00:34:33,690 --> 00:34:40,409 virtual address, and some of those bits mapped to a table, like the PM level 4 374 00:34:40,409 --> 00:34:47,760 table, with 512 entries, so depending on those bits the CPU knows, at which line he 375 00:34:47,760 --> 00:34:51,520 has to look. And if there is data there, because the 376 00:34:51,520 --> 00:34:56,900 address is mapped, he can proceed and look at the page directory, point the table, 377 00:34:56,900 --> 00:35:04,620 and so on for the town. So is everything, is the same for each level until you come 378 00:35:04,620 --> 00:35:09,130 to your page table, where you have 4-kilobyte pages. So it's in the end not 379 00:35:09,130 --> 00:35:13,851 that complicated, but it's a bit confusing, because you want to know a 380 00:35:13,851 --> 00:35:20,310 physical address, so you have to look it up somewhere in the, in the main memory 381 00:35:20,310 --> 00:35:25,420 with physical addresses to translate your virtual addresses. And if you have to go 382 00:35:25,420 --> 00:35:31,890 through all those levels, it needs a long time, so we can do better than that and 383 00:35:31,890 --> 00:35:39,160 that's why Intel introduced additional caches, also for all of those levels. So, 384 00:35:39,160 --> 00:35:45,560 if you want to translate an address, you take a look at the ITLB for instructions, 385 00:35:45,560 --> 00:35:51,150 and the data TLB for data. If it's there, you can stop, otherwise you go down all 386 00:35:51,150 --> 00:35:58,700 those levels and if it's not in any cache you have to look it up in the DRAM. In 387 00:35:58,700 --> 00:36:03,300 addition, the address space you have is shared, because you have, on the one hand, 388 00:36:03,300 --> 00:36:07,470 the user memory and, on the other hand, you have mapped the kernel for convenience 389 00:36:07,470 --> 00:36:12,870 and performance also in the address space. And if your user program wants to access 390 00:36:12,870 --> 00:36:18,310 some kernel functionality like reading a file, it will switch to the kernel memory 391 00:36:18,310 --> 00:36:23,880 there's a privilege escalation, and then you can read the file, and so on. So, 392 00:36:23,880 --> 00:36:30,420 that's it. However, you have drivers in the kernel, and if you know the addresses 393 00:36:30,420 --> 00:36:35,771 of those drivers, you can do code-reuse attacks, and as a countermeasure, they 394 00:36:35,771 --> 00:36:40,150 introduced address-space layout randomization, also for the kernel. 395 00:36:40,150 --> 00:36:47,040 And this means that when you have your program running, the kernel is mapped at 396 00:36:47,040 --> 00:36:51,630 one address and if you reboot the machine it's not on the same address anymore but 397 00:36:51,630 --> 00:36:58,390 somewhere else. So if there is a way to find out at which address the kernel is 398 00:36:58,390 --> 00:37:04,450 loaded, you have circumvented this countermeasure and defeated kernel address 399 00:37:04,450 --> 00:37:11,060 space layout randomization. So this would be nice for some attacks. In addition, 400 00:37:11,060 --> 00:37:16,947 there's also the kernel direct physical map. And what does this mean? It's 401 00:37:16,947 --> 00:37:23,320 implemented on many operating systems like OS X, Linux, also on the Xen hypervisor 402 00:37:23,320 --> 00:37:27,860 and BSD, but not on Windows. But what it means 403 00:37:27,860 --> 00:37:33,870 is that the complete physical memory is mapped in additionally in the kernel 404 00:37:33,870 --> 00:37:40,460 memory at a fixed offset. So, for every page that is mapped in the user space, 405 00:37:40,460 --> 00:37:45,160 there's something like a twin page in the kernel memory, which you can't access 406 00:37:45,160 --> 00:37:50,371 because it's in the kernel memory. However, we will need it later, because 407 00:37:50,371 --> 00:37:58,230 now we go back to prefetch and see what we can do with that. So, prefetch is not a 408 00:37:58,230 --> 00:38:04,150 usual instruction, because it just tells the CPU "I might need that data later on. 409 00:38:04,150 --> 00:38:10,000 If you have time, load for me," if not, the CPU can ignore it because it's busy 410 00:38:10,000 --> 00:38:15,810 with other stuff. So, there's no necessity that this instruction is really executed, 411 00:38:15,810 --> 00:38:22,070 but most of the time it is. And a nice, interesting thing is that it generates no 412 00:38:22,070 --> 00:38:29,000 faults, so whatever you pass to this instruction, your program won't crash, and 413 00:38:29,000 --> 00:38:33,990 it does not check any privileges, so I can also pass an kernel address to it and it 414 00:38:33,990 --> 00:38:37,510 won't say "No, stop, you accessed an address that you are not allowed to 415 00:38:37,510 --> 00:38:45,530 access, so I crash," it just continues, which is nice. 416 00:38:45,530 --> 00:38:49,810 The second interesting thing is that the operand is a virtual address, so every 417 00:38:49,810 --> 00:38:55,534 time you execute this instruction, the CPU has to go and check "OK, what physical 418 00:38:55,534 --> 00:38:59,600 address does this virtual address correspond to?" So it has to do the lookup 419 00:38:59,600 --> 00:39:05,750 with all those tables we've seen earlier, and as you probably have guessed already, 420 00:39:05,750 --> 00:39:10,370 the execution time varies also for the prefetch instruction and we will see later 421 00:39:10,370 --> 00:39:16,090 on what we can do with that. So, let's get back to the direct physical 422 00:39:16,090 --> 00:39:22,870 map. Because we can create an oracle for address translation, so we can find out 423 00:39:22,870 --> 00:39:27,540 what physical address belongs to the virtual address. Because nowadays you 424 00:39:27,540 --> 00:39:31,990 don't want that the user to know, because you can craft nice rowhammer attacks with 425 00:39:31,990 --> 00:39:37,520 that information, and more advanced cache attacks, so you restrict this information 426 00:39:37,520 --> 00:39:44,270 to the user. But let's check if we find a way to still get this information. So, as 427 00:39:44,270 --> 00:39:50,150 I've told you earlier, if you have a paired page in the user space map, 428 00:39:50,150 --> 00:39:54,505 you have the twin page in the kernel space, and if it's cached, 429 00:39:54,505 --> 00:39:56,710 its cached for both of them again. 430 00:39:56,710 --> 00:40:03,170 So, the attack now works as the following: From the attacker you flush your user 431 00:40:03,170 --> 00:40:09,760 space page, so it's not in the cache for the... also for the kernel memory, and 432 00:40:09,760 --> 00:40:15,850 then you call prefetch on the address of the kernel, because as I told you, you 433 00:40:15,850 --> 00:40:22,050 still can do that because it doesn't create any faults. So, you tell the CPU 434 00:40:22,050 --> 00:40:28,310 "Please load me this data into the cache even if I don't have access to this data 435 00:40:28,310 --> 00:40:32,550 normally." And if we now measure on our user space 436 00:40:32,550 --> 00:40:37,100 page the address again, and we measure a cache hit, because it has been loaded by 437 00:40:37,100 --> 00:40:42,630 the CPU into the cache, we know exactly at which position, since we passed the 438 00:40:42,630 --> 00:40:48,250 address to the function, this address corresponds to. And because this is at a 439 00:40:48,250 --> 00:40:53,280 fixed offset, we can just do a simple subtraction and know the physical address 440 00:40:53,280 --> 00:40:59,180 again. So we have a nice way to find physical addresses for virtual addresses. 441 00:40:59,180 --> 00:41:04,390 And in practice this looks like this following plot. So, it's pretty simple, 442 00:41:04,390 --> 00:41:08,910 because we just do this for every address, and at some point we measure a cache hit. 443 00:41:08,910 --> 00:41:14,260 So, there's a huge difference. And exactly at this point we know this physical 444 00:41:14,260 --> 00:41:20,140 address corresponds to our virtual address. The second thing is that we can 445 00:41:20,140 --> 00:41:27,070 exploit the timing differences it needs for the prefetch instruction. Because, as 446 00:41:27,070 --> 00:41:31,850 I told you, when you go down this cache levels, at some point you see "it's here" 447 00:41:31,850 --> 00:41:37,500 or "it's not here," so it can abort early. And with that we can know exactly 448 00:41:37,500 --> 00:41:41,800 when the prefetch instruction aborted, and know how the 449 00:41:41,800 --> 00:41:48,070 pages are mapped into the address space. So, the timing depends on where the 450 00:41:48,070 --> 00:41:57,090 translation stops. And using those two properties and those information, we can 451 00:41:57,090 --> 00:42:02,227 do the following: On the one hand, we can build variants of cache attacks. So, 452 00:42:02,227 --> 00:42:07,444 instead of Flush+Reload, we can do Flush+Prefetch, for instance. We can 453 00:42:07,444 --> 00:42:12,060 also use prefetch to mount rowhammer attacks on privileged addresses, because 454 00:42:12,060 --> 00:42:18,069 it doesn't do any faults when we pass those addresses, and it works as well. In 455 00:42:18,069 --> 00:42:23,330 addition, we can use it to recover the translation levels of a process, which you 456 00:42:23,330 --> 00:42:27,870 could do earlier with the page map file, but as I told you it's now privileged, so 457 00:42:27,870 --> 00:42:32,890 you don't have access to that, and by doing that you can bypass address space 458 00:42:32,890 --> 00:42:38,170 layout randomization. In addition, as I told you, you can translate virtual 459 00:42:38,170 --> 00:42:43,530 addresses to physical addresses, which is now also privileged with the page map 460 00:42:43,530 --> 00:42:48,790 file, and using that it reenables return to direct exploits, which have been 461 00:42:48,790 --> 00:42:55,550 demonstrated last year. On top of that, we can also use this to locate kernel 462 00:42:55,550 --> 00:43:00,850 drivers, as I told you. It would be nice if we can circumvent KSLR as well, and I 463 00:43:00,850 --> 00:43:08,380 will show you now how this is possible. So, with the first oracle we find out all 464 00:43:08,380 --> 00:43:15,430 the pages that are mapped, and for each of those pages, we evict the translation 465 00:43:15,430 --> 00:43:18,210 caches, and we can do that by either calling sleep, 466 00:43:18,210 --> 00:43:24,450 which schedules another program, or access just a large memory buffer. Then, we 467 00:43:24,450 --> 00:43:28,260 perform a syscall to the driver. So, there's code of the driver executed and 468 00:43:28,260 --> 00:43:33,540 loaded into the cache, and then we just measure the time prefetch takes on this 469 00:43:33,540 --> 00:43:40,840 address. And in the end, the fastest average access time is the driver page. 470 00:43:40,840 --> 00:43:46,770 So, we can mount this attack on Windows 10 in less than 12 seconds. So, we can defeat 471 00:43:46,770 --> 00:43:52,110 KSLR in less than 12 seconds, which is very nice. And in practice, the 472 00:43:52,110 --> 00:43:58,330 measurements looks like the following: So, we have a lot of long measurements, and at 473 00:43:58,330 --> 00:44:05,060 some point you have a low one, and you know exactly that this driver region and 474 00:44:05,060 --> 00:44:09,930 the address the driver is located. And you can mount those read to direct 475 00:44:09,930 --> 00:44:16,210 attacks again. However, that's not everything, because there are more 476 00:44:16,210 --> 00:44:20,795 instructions in Intel. CM: Yeah, so, the following is not our 477 00:44:20,795 --> 00:44:24,350 work, but we thought that would be interesting, because it's basically more 478 00:44:24,350 --> 00:44:30,740 instructions, more attacks, more fun. So there's the RDSEED instruction, and what 479 00:44:30,740 --> 00:44:35,340 it does, that is request a random seed to the hardware random number generator. So, 480 00:44:35,340 --> 00:44:39,310 the thing is that there is a fixed number of precomputed random bits, and that takes 481 00:44:39,310 --> 00:44:44,320 time to regenerate them. So, as everything that takes time, you can create a covert 482 00:44:44,320 --> 00:44:50,180 channel with that. There is also FADD and FMUL, which are floating point operations. 483 00:44:50,180 --> 00:44:56,740 Here, the running time of this instruction depends on the operands. Some people 484 00:44:56,740 --> 00:45:01,530 managed to bypass Firefox's same origin policy with an SVG filter timing attack 485 00:45:01,530 --> 00:45:08,540 with that. There's also the JMP instructions. So, in modern CPUs you have 486 00:45:08,540 --> 00:45:14,520 branch prediction, and branch target prediction. With that, it's actually been 487 00:45:14,520 --> 00:45:18,250 studied a lot, you can create a covert channel. You can do side-channel attacks 488 00:45:18,250 --> 00:45:26,028 on crypto. You can also bypass KSLR, and finally, there are TSX instructions, which 489 00:45:26,028 --> 00:45:31,010 is an extension for hardware transactional memory support, which has also been used 490 00:45:31,010 --> 00:45:37,150 to bypass KSLR. So, in case you're not sure, KSLR is dead. You have lots of 491 00:45:37,150 --> 00:45:45,650 different things to read. Okay, so, on the conclusion now. So, as you've seen, it's 492 00:45:45,650 --> 00:45:50,190 actually more a problem of CPU design, than really the instruction sets 493 00:45:50,190 --> 00:45:55,720 architecture. The thing is that all these issues are really hard to patch. They 494 00:45:55,720 --> 00:45:59,966 are all linked to performance optimizations, and we are not getting rid 495 00:45:59,966 --> 00:46:03,890 of performance optimization. That's basically a trade-off between performance 496 00:46:03,890 --> 00:46:11,530 and security, and performance seems to always win. There has been some 497 00:46:11,530 --> 00:46:20,922 propositions to... against cache attacks, to... let's say remove the CLFLUSH 498 00:46:20,922 --> 00:46:26,640 instructions. The thing is that all these quick fix won't work, because we always 499 00:46:26,640 --> 00:46:31,450 find new ways to do the same thing without these precise instructions and also, we 500 00:46:31,450 --> 00:46:37,410 keep finding new instruction that leak information. So, it's really, let's say 501 00:46:37,410 --> 00:46:43,740 quite a big topic that we have to fix this. So, thank you very much for your 502 00:46:43,740 --> 00:46:47,046 attention. If you have any questions we'd be happy to answer them. 503 00:46:47,046 --> 00:46:52,728 applause 504 00:46:52,728 --> 00:47:01,510 applause Herald: Okay. Thank you very much again 505 00:47:01,510 --> 00:47:06,571 for your talk, and now we will have a Q&A, and we have, I think, about 15 minutes, so 506 00:47:06,571 --> 00:47:11,330 you can start lining up behind the microphones. They are in the gangways in 507 00:47:11,330 --> 00:47:18,130 the middle. Except, I think that one... oh, no, it's back up, so it will work. And 508 00:47:18,130 --> 00:47:22,180 while we wait, I think we will take questions from our signal angel, if there 509 00:47:22,180 --> 00:47:28,810 are any. Okay, there aren't any, so... microphone questions. I think, you in 510 00:47:28,810 --> 00:47:33,440 front. Microphone: Hi. Can you hear me? 511 00:47:33,440 --> 00:47:40,050 Herald: Try again. Microphone: Okay. Can you hear me now? 512 00:47:40,050 --> 00:47:46,480 Okay. Yeah, I'd like to know what exactly was your stealthiness metric? Was it that 513 00:47:46,480 --> 00:47:51,310 you can't distinguish it from a normal process, or...? 514 00:47:51,310 --> 00:47:56,500 CM: So... Herald: Wait a second. We have still Q&A, 515 00:47:56,500 --> 00:47:59,780 so could you quiet down a bit? That would be nice. 516 00:47:59,780 --> 00:48:08,180 CM: So, the question was about the stealthiness metric. Basically, we use the 517 00:48:08,180 --> 00:48:14,320 metric with cache misses and cache references, normalized by the instructions 518 00:48:14,320 --> 00:48:21,080 TLB events, and we just found the threshold under which 519 00:48:21,080 --> 00:48:25,820 pretty much every benign application was below this, and rowhammer and cache 520 00:48:25,820 --> 00:48:30,520 attacks were after that. So we fixed the threshold, basically. 521 00:48:30,520 --> 00:48:35,520 H: That microphone. Microphone: Hello. Thanks for your talk. 522 00:48:35,520 --> 00:48:42,760 It was great. First question: Did you inform Intel before doing this talk? 523 00:48:42,760 --> 00:48:47,520 CM: Nope. Microphone: Okay. The second question: 524 00:48:47,520 --> 00:48:51,050 What's your future plans? CM: Sorry? 525 00:48:51,050 --> 00:48:55,780 M: What's your future plans? CM: Ah, future plans. Well, what I did, 526 00:48:55,780 --> 00:49:01,220 that is interesting, is that we keep finding these more or less by accident, or 527 00:49:01,220 --> 00:49:06,440 manually, so having a good idea of what's the attack surface here would be a good 528 00:49:06,440 --> 00:49:10,050 thing, and doing that automatically would be even better. 529 00:49:10,050 --> 00:49:14,170 M: Great, thanks. H: Okay, the microphone in the back, 530 00:49:14,170 --> 00:49:18,770 over there. The guy in white. M: Hi. One question. If you have, 531 00:49:18,770 --> 00:49:24,410 like, a demon, that randomly invalidates some cache lines, would that be a better 532 00:49:24,410 --> 00:49:31,120 countermeasure than disabling the caches? ML: What was the question? 533 00:49:31,120 --> 00:49:39,580 CM: If invalidating cache lines would be better than disabling the whole cache. So, 534 00:49:39,580 --> 00:49:42,680 I'm... ML: If you know which cache lines have 535 00:49:42,680 --> 00:49:47,300 been accessed by the process, you can invalidate those cache lines before you 536 00:49:47,300 --> 00:49:52,820 swap those processes, but it's also a trade-off between performance. Like, you 537 00:49:52,820 --> 00:49:57,940 can also, if you switch processes, flush the whole cache, and then it's empty, and 538 00:49:57,940 --> 00:50:01,900 then you don't see any activity anymore, but there's also the trade-off of 539 00:50:01,900 --> 00:50:07,510 performance with this. M: Okay, maybe a second question. If you, 540 00:50:07,510 --> 00:50:12,240 there are some ARM architectures that have random cache line invalidations. 541 00:50:12,240 --> 00:50:16,010 Did you try those, if you can see a [unintelligible] channel there. 542 00:50:16,010 --> 00:50:21,960 ML: If they're truly random, but probably you just have to make more measurements 543 00:50:21,960 --> 00:50:27,180 and more measurements, and then you can average out the noise, and then you can do 544 00:50:27,180 --> 00:50:30,350 these attacks again. It's like, with prime and probe, where you need more 545 00:50:30,350 --> 00:50:34,080 measurements, because it's much more noisy, so in the end you will just need 546 00:50:34,080 --> 00:50:37,870 much more measurements. CM: So, on ARM, it's supposed to be pretty 547 00:50:37,870 --> 00:50:43,260 random. At least it's in the manual, but we actually found nice ways to evict cache 548 00:50:43,260 --> 00:50:47,230 lines, that we really wanted to evict, so it's not actually that pseudo-random. 549 00:50:47,230 --> 00:50:51,960 So, even... let's say, if something is truly random, it might be nice, but then 550 00:50:51,960 --> 00:50:57,170 it's also quite complicated to implement. I mean, you probably don't want a random 551 00:50:57,170 --> 00:51:01,480 number generator just for the cache. M: Okay. Thanks. 552 00:51:01,480 --> 00:51:05,980 H: Okay, and then the three guys here on the microphone in the front. 553 00:51:05,980 --> 00:51:13,450 M: My question is about a detail with the keylogger. You could distinguish between 554 00:51:13,450 --> 00:51:18,150 space, backspace and alphabet, which is quite interesting. But could you also 555 00:51:18,150 --> 00:51:22,320 figure out the specific keys that were pressed, and if so, how? 556 00:51:22,320 --> 00:51:25,650 ML: Yeah, that depends on the implementation of the keyboard. But what 557 00:51:25,650 --> 00:51:29,310 we did, we used the Android stock keyboard, which is shipped with the 558 00:51:29,310 --> 00:51:34,520 Samsung, so it's pre-installed. And if you have a table somewhere in your code, which 559 00:51:34,520 --> 00:51:39,540 says "Okay, if you press this exact location or this image, it's an A or it's 560 00:51:39,540 --> 00:51:44,450 an B", then you can also do a more sophisticated attack. So, if you find any 561 00:51:44,450 --> 00:51:49,050 functions or data in the code, which directly tells you "Okay, this is this 562 00:51:49,050 --> 00:51:54,520 character," you can also spy on the actual key characters on the keyboard. 563 00:51:54,520 --> 00:52:02,900 M: Thank you. M: Hi. Thank you for your talk. My first 564 00:52:02,900 --> 00:52:08,570 question is: What can we actually do now, to mitigate this kind of attack? By, for 565 00:52:08,570 --> 00:52:11,980 example switching off TSX or using ECC RAM. 566 00:52:11,980 --> 00:52:17,410 CM: So, I think the very important thing to protect would be, like crypto, and the 567 00:52:17,410 --> 00:52:20,840 good thing is that today we know how to build crypto that is resistant to side- 568 00:52:20,840 --> 00:52:24,490 channel attacks. So the good thing would be to stop improving implementation that 569 00:52:24,490 --> 00:52:31,360 are known to be vulnerable for 10 years. Then things like keystrokes is way harder 570 00:52:31,360 --> 00:52:36,830 to protect, so let's say crypto is manageable; the whole system is clearly 571 00:52:36,830 --> 00:52:41,490 another problem. And you can have different types of countermeasure on the 572 00:52:41,490 --> 00:52:45,780 hardware side but it does would mean that Intel an ARM actually want to fix that, 573 00:52:45,780 --> 00:52:48,560 and that they know how to fix that. I don't even know how to fix that in 574 00:52:48,560 --> 00:52:55,500 hardware. Then on the system side, if you prevent some kind of memory sharing, you 575 00:52:55,500 --> 00:52:58,540 don't have flush involved anymore and primum (?) probably is much more 576 00:52:58,540 --> 00:53:04,880 noisier, so it would be an improvement. M: Thank you. 577 00:53:04,880 --> 00:53:11,880 H: Do we have signal angel questions? No. OK, then more microphone. 578 00:53:11,880 --> 00:53:16,630 M: Hi, thank you. I wanted to ask about the way you establish the side-channel 579 00:53:16,630 --> 00:53:23,280 between the two processors, because it would obviously have to be timed in a way to 580 00:53:23,280 --> 00:53:28,511 transmit information between one process to the other. Is there anywhere that you 581 00:53:28,511 --> 00:53:32,970 documented the whole? You know, it's actually almost like the seven layers or 582 00:53:32,970 --> 00:53:36,580 something like that. There are any ways that you documented that? It would be 583 00:53:36,580 --> 00:53:40,260 really interesting to know how it worked. ML: You can find this information in the 584 00:53:40,260 --> 00:53:46,120 paper because there are several papers on covered channels using that, so the NDSS 585 00:53:46,120 --> 00:53:51,300 paper is published in February I guess, but the Armageddon paper also includes 586 00:53:51,300 --> 00:53:55,670 a cover channel, and you can find more information about how the 587 00:53:55,670 --> 00:53:59,320 packets look like and how the synchronization works in the paper. 588 00:53:59,320 --> 00:54:04,020 M: Thank you. H: One last question? 589 00:54:04,020 --> 00:54:09,750 M: Hi! You mentioned that you used Osvik's attack for the AES side-channel attack. 590 00:54:09,750 --> 00:54:17,350 Did you solve the AES round detection and is it different to some scheduler 591 00:54:17,350 --> 00:54:21,441 manipulation? CM: So on this one I think we only did 592 00:54:21,441 --> 00:54:24,280 some synchronous attack, so we already knew when 593 00:54:24,280 --> 00:54:27,770 the victim is going to be scheduled and we didn't have anything to do with 594 00:54:27,770 --> 00:54:32,930 schedulers. M: Alright, thank you. 595 00:54:32,930 --> 00:54:37,140 H: Are there any more questions? No, I don't see anyone. Then, thank you very 596 00:54:37,140 --> 00:54:39,132 much again to our speakers. 597 00:54:39,132 --> 00:54:42,162 applause 598 00:54:42,162 --> 00:54:58,970 music 599 00:54:58,970 --> 00:55:06,000 subtitles created by c3subtitles.de in the year 2020. Join, and help us!