WEBVTT 00:00:00.000 --> 00:00:13.245 Music 00:00:13.245 --> 00:00:17.060 Herald Angel: We are here with a motto, and the motto of this year is "Works For 00:00:17.060 --> 00:00:21.670 Me" and I think, who many people, how many people in here are programmers? Raise 00:00:21.670 --> 00:00:28.700 your hands or shout or... Whoa, that's a lot. Okay. So I think many of you will 00:00:28.700 --> 00:00:38.990 work on x86. Yeah. And I think you assume that it works, and that everything works 00:00:38.990 --> 00:00:48.150 as intended. And I mean: What could go wrong? Our next talk, the first one today, 00:00:48.150 --> 00:00:52.290 will be by Clémentine Maurice, who previously was here with RowhammerJS, 00:00:52.290 --> 00:01:01.740 something I would call scary, and Moritz Lipp, who has worked on the ARMageddon 00:01:01.740 --> 00:01:09.820 exploit, back, what is it? Okay, so the next... I would like to hear a really warm 00:01:09.820 --> 00:01:14.460 applause for the speakers for the talk "What could what could possibly go wrong 00:01:14.460 --> 00:01:17.280 with insert x86 instruction here?" 00:01:17.280 --> 00:01:18.375 Thank you. 00:01:18.375 --> 00:01:28.290 Applause 00:01:28.290 --> 00:01:32.530 Clémentine Maurice (CM): Well, thank you all for being here this morning. Yes, this 00:01:32.530 --> 00:01:38.080 is our talk "What could possibly go wrong with insert x86 instructions here". So 00:01:38.080 --> 00:01:42.850 just a few words about ourselves: So I'm Clémentine Maurice, I got my PhD last year 00:01:42.850 --> 00:01:47.119 in computer science and I'm now working as a postdoc at Graz University of Technology 00:01:47.119 --> 00:01:52.090 in Austria. You can reach me on Twitter or by email but there's also I think a lot 00:01:52.090 --> 00:01:56.670 of time before the Congress is over.
Moritz Lipp (ML): Hi and my name is Moritz 00:01:56.670 --> 00:02:01.520 Lipp, I'm a PhD student at Graz University of Technology and you can also reach me on 00:02:01.520 --> 00:02:06.679 Twitter or just after our talk and in the next days. 00:02:06.679 --> 00:02:10.860 CM: So, about this talk: So, the title says this is a talk about x86 00:02:10.860 --> 00:02:17.720 instructions, but this is not a talk about software. Don't leave yet! I'm actually 00:02:17.720 --> 00:02:22.440 even assuming safe software and the point that we want to make is that safe software 00:02:22.440 --> 00:02:27.390 does not mean safe execution and we have information leakage because of the 00:02:27.390 --> 00:02:32.560 underlying hardware and this is what we're going to talk about today. So we'll be 00:02:32.560 --> 00:02:36.819 talking about cache attacks, what are they, what can we do with that and also a 00:02:36.819 --> 00:02:41.510 special kind of cache attack that we found this year. So... doing cache attacks 00:02:41.510 --> 00:02:48.590 without memory accesses and how to use that even to bypass kernel ASLR. 00:02:48.590 --> 00:02:53.129 So again, the title says this is a talk about x86 instructions but this is even more 00:02:53.129 --> 00:02:58.209 global than that. We can also mount these cache attacks on ARM and not only on the 00:02:58.209 --> 00:03:07.050 x86. So some of the examples that you will see also apply to ARM. So today we'll 00:03:07.050 --> 00:03:11.420 have a bit of background, but actually most of the background will be along the 00:03:11.420 --> 00:03:19.251 way, because this covers really a huge chunk of our research, and we'll see 00:03:19.251 --> 00:03:24.209 mainly three instructions: So "mov" and how we can perform these cache attacks, 00:03:24.209 --> 00:03:29.430 what are they... The instruction "clflush", so here we'll be doing cache 00:03:29.430 --> 00:03:36.370 attacks without any memory accesses.
Then we'll see "prefetch" and how we can bypass 00:03:36.370 --> 00:03:43.420 kernel ASLR and lots of translation levels, and then there's even a bonus 00:03:43.420 --> 00:03:48.950 track, so this will not be our work, but even more instructions and even 00:03:48.950 --> 00:03:54.210 more attacks. Okay, so let's start with a bit of an 00:03:54.210 --> 00:04:01.190 introduction. So we will be mainly focusing on Intel CPUs, and this is 00:04:01.190 --> 00:04:05.599 roughly, in terms of cores and caches, how it looks like today. So we have different 00:04:05.599 --> 00:04:09.440 levels of cores ...uh... different cores so here four cores, and different levels 00:04:09.440 --> 00:04:14.220 of caches. So here usually we have three levels of caches. We have level 1 and 00:04:14.220 --> 00:04:18.269 level 2 that are private to each core, which means that core 0 can only access 00:04:18.269 --> 00:04:24.520 its level 1 and its level 2 and not level 1 and level 2 of, for example, core 3, and 00:04:24.520 --> 00:04:30.130 we have the last level cache... so here if you can see the pointer... So this one is 00:04:30.130 --> 00:04:36.289 divided in slices so we have as many slices as cores, so here 4 slices, but all 00:04:36.289 --> 00:04:40.659 the slices are shared across cores so core 0 can access the whole last level cache, 00:04:40.659 --> 00:04:48.669 that's slices 0, 1, 2 and 3. We also have a nice property on Intel CPUs, which is that this level 00:04:48.669 --> 00:04:52.280 of cache is inclusive, and what it means is that everything that is contained in 00:04:52.280 --> 00:04:56.889 level 1 and level 2 will also be contained in the last level cache, and this will 00:04:56.889 --> 00:05:01.439 prove to be quite useful for cache attacks. 00:05:01.439 --> 00:05:08.430 So today we mostly have set associative caches. What it means is that we have data 00:05:08.430 --> 00:05:13.249 that is loaded in specific sets and that depends only on its address.
So we have 00:05:13.249 --> 00:05:18.900 some bits of the address that give us the index and that say "Ok the line is going 00:05:18.900 --> 00:05:24.610 to be loaded in this cache set", so this is a cache set. Then we have several ways 00:05:24.610 --> 00:05:30.629 line:1 per set so here we have 4 ways and the cache line is going to be loaded in a 00:05:30.629 --> 00:05:35.270 specific way and that will only depend on the replacement policy and not on the 00:05:35.270 --> 00:05:40.800 address itself. So when you load a line into the cache, usually the cache is 00:05:40.800 --> 00:05:44.830 already full and you have to make room for a new line. So this is where the 00:05:44.830 --> 00:05:49.729 replacement policy comes in—this is what it does—it says ok I'm going to 00:05:49.729 --> 00:05:57.779 remove this line to make room for the next line. So for today we're going to see only 00:05:57.779 --> 00:06:01.960 three instructions as I've been telling you. So the mov instruction, it does a 00:06:01.960 --> 00:06:06.610 lot of things but the only aspect that we're interested in about it is that it can 00:06:06.610 --> 00:06:12.809 access data in the main memory. We're going to see clflush: what it does 00:06:12.809 --> 00:06:18.349 is that it removes a cache line from the cache, from the whole cache. And we're 00:06:18.349 --> 00:06:25.569 going to see prefetch, it prefetches a cache line for future use. So we're going 00:06:25.569 --> 00:06:30.520 to see what they do and the kind of side effects that they have and all the attacks 00:06:30.520 --> 00:06:34.800 that we can do with them. And that's basically all the examples you need for 00:06:34.800 --> 00:06:39.830 today so even if you're not an expert of x86 don't worry it's not just slides full 00:06:39.830 --> 00:06:44.899 of assembly and stuff. Okay so on to the first one.
00:06:44.899 --> 00:06:49.940 ML: So we will first start with the 'mov' instruction and actually the first slide 00:06:49.940 --> 00:06:57.809 is full of code, however as you can see the mov instruction is used to move data 00:06:57.809 --> 00:07:02.629 from registers to registers, from the main memory and back to the main memory and as 00:07:02.629 --> 00:07:07.240 you can see there are many moves you can use but basically it's just to move data 00:07:07.240 --> 00:07:12.589 and that's all we need to know. In addition, a lot of exceptions can occur so 00:07:12.589 --> 00:07:18.139 we can assume that those restrictions are so tight that nothing can go wrong when 00:07:18.139 --> 00:07:22.210 you just move data because moving data is simple. 00:07:22.210 --> 00:07:27.879 However while there are a lot of exceptions the data that is accessed is 00:07:27.879 --> 00:07:35.009 always loaded into the cache, so data is in the cache and this is transparent to 00:07:35.009 --> 00:07:40.870 the program that is running. However, there are side-effects when you run these 00:07:40.870 --> 00:07:46.219 instructions, and we will see how they look like with the mov instruction. So you 00:07:46.219 --> 00:07:51.289 probably all know that data can either be in CPU registers, in the different levels 00:07:51.289 --> 00:07:56.029 of the cache that Clementine showed to you earlier, in the main memory, or on the 00:07:56.029 --> 00:08:02.219 disk, and depending on where the memory and the data is located it needs a longer 00:08:02.219 --> 00:08:09.689 time to be loaded back to the CPU, and this is what we can see in this plot. So 00:08:09.689 --> 00:08:15.739 we try here to measure the access time of an address over and over again, assuming 00:08:15.739 --> 00:08:21.759 that when we access it more often, it is already stored in the cache. 
So around 70 00:08:21.759 --> 00:08:27.289 cycles, most of the time we can assume when we load an address and it takes 70 00:08:27.289 --> 00:08:34.809 cycles, it's loaded into the cache. However, when we assume that the data is 00:08:34.809 --> 00:08:39.659 loaded from the main memory, we can clearly see that it needs a much longer 00:08:39.659 --> 00:08:46.720 time like a bit more than 200 cycles. So depending when we measure the time it 00:08:46.720 --> 00:08:51.470 takes to load the address we can say the data has been loaded to the cache or the 00:08:51.470 --> 00:08:58.339 data is still located in the main memory. And this property is what we can exploit 00:08:58.339 --> 00:09:05.339 using cache attacks. So we measure the timing differences on memory accesses. And 00:09:05.339 --> 00:09:09.940 what an attacker does he monitors the cache lines, but he has no way to know 00:09:09.940 --> 00:09:14.459 what's actually the content of the cache line. So we can only monitor that this 00:09:14.459 --> 00:09:20.099 cache line has been accessed and not what's actually stored in the cache line. 00:09:20.099 --> 00:09:24.411 And what you can do with this is you can implement covert channels, so you can 00:09:24.411 --> 00:09:29.580 allow two processes to communicate with each other evading the permission system 00:09:29.580 --> 00:09:35.060 what we will see later on. In addition you can also do side channel attacks, so you 00:09:35.060 --> 00:09:40.600 can spy with a malicious attacking application on benign processes, and you 00:09:40.600 --> 00:09:46.140 can use this to steal cryptographic keys or to spy on keystrokes. 00:09:46.140 --> 00:09:53.649 And basically we have different types of cache attacks and I want to explain the 00:09:53.649 --> 00:09:58.810 most popular one, the "Flush+Reload" attack, in the beginning. 
So on the left, 00:09:58.810 --> 00:10:03.110 you have the address space of the victim, and on the right you have the address 00:10:03.110 --> 00:10:08.560 space of the attacker who maps a shared library—an executable—that the victim is 00:10:08.560 --> 00:10:14.899 using into its own address space, like the red rectangle. And this means that 00:10:14.899 --> 00:10:22.760 when this data is stored in the cache, it's cached for both processes. Now the 00:10:22.760 --> 00:10:28.170 attacker can use the flush instruction to remove the data out of the cache, so it's 00:10:28.170 --> 00:10:34.420 not in the cache anymore, so it's also not cached for the victim. Now the attacker 00:10:34.420 --> 00:10:39.100 can schedule the victim and if the victim decides "yeah, I need this data", it will 00:10:39.100 --> 00:10:44.970 be loaded back into the cache. And now the attacker can reload the data, measure the 00:10:44.970 --> 00:10:49.661 time how long it took, and then decide "okay, the victim has accessed the data in 00:10:49.661 --> 00:10:54.179 the meantime" or "the victim has not accessed the data in the meantime." And by 00:10:54.179 --> 00:10:58.959 that you can spy if this address has been used. 00:10:58.959 --> 00:11:03.240 The second type of attack is called "Prime+Probe" and it does not rely on the 00:11:03.240 --> 00:11:08.971 shared memory like the "Flush+Reload" attack, and it works as following: Instead 00:11:08.971 --> 00:11:16.139 of mapping anything into its own address space, the attacker loads a lot of data 00:11:16.139 --> 00:11:24.589 into one cache set, here, and fills the cache. Now he again schedules the victim 00:11:24.589 --> 00:11:31.820 and the victim can access data that maps to the same cache set. 00:11:31.820 --> 00:11:38.050 So the cache set is used by the attacker and the victim at the same time.
Now the 00:11:38.050 --> 00:11:43.050 attacker can start measuring the access time to the addresses he loaded into the 00:11:43.050 --> 00:11:49.050 cache before, and when he accesses an address that is still in the cache it's 00:11:49.050 --> 00:11:55.649 faster so he measures the lower time. And if it's not in the cache anymore it has to 00:11:55.649 --> 00:12:01.279 be reloaded into the cache so it takes a longer time. He can sum this up and detect 00:12:01.279 --> 00:12:07.870 if the victim has loaded data into the cache as well. So the first thing we want 00:12:07.870 --> 00:12:11.900 to show you is what you can do with cache attacks is you can implement a covert 00:12:11.900 --> 00:12:17.439 channel and this could be happening in the following scenario. 00:12:17.439 --> 00:12:23.610 You install an app on your phone to view your favorite images you take, to apply 00:12:23.610 --> 00:12:28.630 some filters, and in the end you don't know that it's malicious because the only 00:12:28.630 --> 00:12:33.609 permission it requires is to access your images which makes sense. So you can 00:12:33.609 --> 00:12:38.700 easily install it without any fear. In addition you want to know what the weather 00:12:38.700 --> 00:12:43.040 is outside, so you install a nice little weather widget, and the only permission it 00:12:43.040 --> 00:12:48.230 has is to access the internet because it has to load the information from 00:12:48.230 --> 00:12:55.569 somewhere. So what happens if you're able to implement a covert channel between two 00:12:55.569 --> 00:12:59.779 these two applications, without any permissions and privileges so they can 00:12:59.779 --> 00:13:05.060 communicate with each other without using any mechanisms provided by the operating 00:13:05.060 --> 00:13:11.149 system, so it's hidden. It can happen that now the gallery app can send the image to 00:13:11.149 --> 00:13:18.680 the internet, it will be uploaded and exposed for everyone. 
So maybe you don't 00:13:18.680 --> 00:13:25.610 want to see the cat picture everywhere. While we can do this with those 00:13:25.610 --> 00:13:30.219 Prime+Probe/ Flush+Reload attacks, we will discuss a covert channel using 00:13:30.219 --> 00:13:35.690 Prime+Probe. So how can we transmit this data? We need to transmit ones and zeros 00:13:35.690 --> 00:13:40.980 at some point. So the sender and the receiver agree on one cache set that they 00:13:40.980 --> 00:13:49.319 both use. The receiver probes the set all the time. When the sender wants to 00:13:49.319 --> 00:13:57.529 transmit a zero he just does nothing, so the lines of the receiver are in the cache 00:13:57.529 --> 00:14:01.809 all the time, and he knows "okay, he's sending nothing", so it's a zero. 00:14:01.809 --> 00:14:05.940 On the other hand if the sender wants to transmit a one, he starts accessing 00:14:05.940 --> 00:14:10.800 addresses that map to the same cache set so it will take a longer time for the 00:14:10.800 --> 00:14:16.540 receiver to access its addresses again, and he knows "okay, the sender just sent 00:14:16.540 --> 00:14:23.059 me a one", and Clementine will show you what you can do with this covert channel. 00:14:23.059 --> 00:14:25.180 CM: So the really nice thing about 00:14:25.180 --> 00:14:28.959 Prime+Probe is that it has really low requirements. It doesn't need any kind of 00:14:28.959 --> 00:14:34.349 shared memory. For example if you have two virtual machines you could have some 00:14:34.349 --> 00:14:38.700 shared memory via memory deduplication. The thing is that this is highly insecure, 00:14:38.700 --> 00:14:43.969 so cloud providers like Amazon ec2, they disable that. Now we can still use 00:14:43.969 --> 00:14:50.429 Prime+Probe because it doesn't need this shared memory. Another problem with cache 00:14:50.429 --> 00:14:54.999 covert channels is that they are quite noisy. 
So when you have other applications 00:14:54.999 --> 00:14:59.259 that are also running on the system, they are all competing for the cache and they 00:14:59.259 --> 00:15:03.009 might, like, evict some cache lines, especially if it's an application that is 00:15:03.009 --> 00:15:08.749 very memory intensive. And you also have noise due to the fact that the sender and 00:15:08.749 --> 00:15:12.770 the receiver might not be scheduled at the same time. So if you have your sender that 00:15:12.770 --> 00:15:16.649 sends all the things and the receiver is not scheduled then some part of the 00:15:16.649 --> 00:15:22.539 transmission can get lost. So what we did is we tried to build an error-free covert 00:15:22.539 --> 00:15:30.829 channel. We took care of all these noise issues by using some error detection to 00:15:30.829 --> 00:15:36.470 resynchronize the sender and the receiver and then we use some error correction to 00:15:36.470 --> 00:15:40.779 correct the remaining errors. So we managed to have a completely error- 00:15:40.779 --> 00:15:46.069 free covert channel even if you have a lot of noise, so let's say another virtual 00:15:46.069 --> 00:15:54.119 machine also on the machine serving files through a web server, also doing lots of 00:15:54.119 --> 00:16:01.600 memory-intensive tasks at the same time, and the covert channel stayed completely 00:16:01.600 --> 00:16:07.610 error-free, and around 40 to 75 kilobytes per second, which is still quite a lot. 00:16:07.610 --> 00:16:14.470 All of this is between virtual machines on Amazon ec2. And the really neat thing—we 00:16:14.470 --> 00:16:19.389 wanted to do something with that—and basically we managed to create an SSH 00:16:19.389 --> 00:16:27.060 connection really over the cache. So they don't have any network between 00:16:27.060 --> 00:16:31.439 them, but just we are sending the zeros and the ones and we have an SSH connection 00:16:31.439 --> 00:16:36.839 between them. 
So you could say that cache covert channels are nothing, but I think 00:16:36.839 --> 00:16:43.079 it's a real threat. And if you want to have more details about this work in 00:16:43.079 --> 00:16:49.220 particular, it will be published soon at NDSS. 00:16:49.220 --> 00:16:54.040 So the second application that we wanted to show you is that we can attack crypto 00:16:54.040 --> 00:17:01.340 with cache attacks. In particular we are going to show an attack on AES and a 00:17:01.340 --> 00:17:04.990 special implementation of AES that uses T-Tables. so that's the fast software 00:17:04.990 --> 00:17:11.650 implementation because it uses some precomputed lookup tables. It's known to 00:17:11.650 --> 00:17:17.490 be vulnerable to side-channel attacks since 2006 by Osvik et al, and it's a one- 00:17:17.490 --> 00:17:24.110 round known plaintext attack, so you have p—or plaintext—and k, your secret key. And 00:17:24.110 --> 00:17:29.570 the AES algorithm, what it does is compute an intermediate state at each round r. 00:17:29.570 --> 00:17:38.559 And in the first round, the accessed table indices are just p XOR k. Now it's a known 00:17:38.559 --> 00:17:43.500 plaintext attack, what this means is that if you can recover the accessed table 00:17:43.500 --> 00:17:49.460 indices you've also managed to recover the key because it's just XOR. So that would 00:17:49.460 --> 00:17:55.450 be bad, right, if we could recover these accessed table indices. Well we can, with 00:17:55.450 --> 00:18:00.510 cache attacks! So we did that with Flush+Reload and with Prime+Probe. On the 00:18:00.510 --> 00:18:05.809 x-axis you have the plaintext byte values and on the y-axis you have the addresses 00:18:05.809 --> 00:18:15.529 which are essentially the T table entries. So a black cell means that we've monitored 00:18:15.529 --> 00:18:19.970 the cache line, and we've seen a lot of cache hits. 
So basically the blacker it 00:18:19.970 --> 00:18:25.650 is, the more certain we are that the T-Table entry has been accessed. And here 00:18:25.650 --> 00:18:31.779 it's a toy example, the key is all-zeros, but you would basically just have a 00:18:31.779 --> 00:18:35.700 different pattern if the key was not all- zeros, and as long as you can see this 00:18:35.700 --> 00:18:43.409 nice diagonal or a pattern then you have recovered the key. So it's an old attack, 00:18:43.409 --> 00:18:48.890 2006, it's been 10 years, everything should be fixed by now, and you see where 00:18:48.890 --> 00:18:56.880 I'm going: it's not. So on Android the bouncy castle implementation it uses by 00:18:56.880 --> 00:19:03.360 default the T-table, so that's bad. Also many implementations that you can find 00:19:03.360 --> 00:19:11.380 online uses pre-computed values, so maybe be wary about this kind of attacks. The 00:19:11.380 --> 00:19:17.240 last application we wanted to show you is how we can spy on keystrokes. 00:19:17.240 --> 00:19:21.480 So for that we will use flush and reload because it's a really fine grained 00:19:21.480 --> 00:19:26.309 attack. We can see very precisely which cache line has been accessed, and a cache 00:19:26.309 --> 00:19:31.440 line is only 64 bytes so it's really not a lot and we're going to use that to spy on 00:19:31.440 --> 00:19:37.690 keystrokes and we even have a small demo for you. 00:19:40.110 --> 00:19:45.640 ML: So what you can see on the screen this is not on Intel x86 it's on a smartphone, 00:19:45.640 --> 00:19:50.330 on the Galaxy S6, but you can also apply these cache attacks there so that's what 00:19:50.330 --> 00:19:53.850 we want to emphasize. 
So on the left you see the screen and on 00:19:53.850 --> 00:19:57.960 the right we have connected a shell with no privileges and permissions, so it can 00:19:57.960 --> 00:20:00.799 basically be an app that you install glass bottle falling 00:20:00.799 --> 00:20:09.480 from the App Store and on the right we are going to start our spy tool, and on the 00:20:09.480 --> 00:20:14.110 left we just open the messenger app and whenever the user hits any key on the 00:20:14.110 --> 00:20:19.690 keyboard our spy tool takes care of that and notices that. Also if he presses the 00:20:19.690 --> 00:20:26.120 spacebar we can also measure that. If the user decides "ok, I want to delete the 00:20:26.120 --> 00:20:30.880 word" because he changed his mind, we can also register if the user pressed the 00:20:30.880 --> 00:20:37.929 backspace button, so in the end we can see exactly how long the words were, the user 00:20:37.929 --> 00:20:45.630 typed into his phone without any permissions and privileges, which is bad. 00:20:45.630 --> 00:20:55.250 laughs applause 00:20:55.250 --> 00:21:00.320 ML: so enough about the mov instruction, let's head to clflush. 00:21:00.320 --> 00:21:07.230 CM: So the clflush instruction: What it does is that it invalidates from every 00:21:07.230 --> 00:21:12.309 level the cache line that contains the address that you pass to this instruction. 00:21:12.309 --> 00:21:16.990 So in itself it's kind of bad because it enables the Flush+Reload attacks that we 00:21:16.990 --> 00:21:21.300 showed earlier, that was just flush, reload, and the flush part is done with 00:21:21.300 --> 00:21:29.140 clflush. But there's actually more to it, how wonderful. So there's a first timing 00:21:29.140 --> 00:21:33.320 leakage with it, so we're going to see that the clflush instruction has a 00:21:33.320 --> 00:21:37.890 different timing depending on whether the data that you that you pass to it is 00:21:37.890 --> 00:21:44.710 cached or not. 
So imagine you have a cache line that is on the level 1 by inclu... 00:21:44.710 --> 00:21:50.299 With the inclusion property it has to be also in the last level cache. Now this is 00:21:50.299 --> 00:21:54.350 quite convenient and this is also why we have this inclusion property for 00:21:54.350 --> 00:22:00.019 performance reasons on Intel CPUs, if you want to see if a line is present at all in 00:22:00.019 --> 00:22:04.209 the cache you just have to look in the last level cache. So this is basically 00:22:04.209 --> 00:22:08.010 what the clflush instruction does. It goes to the last level cache, sees "ok 00:22:08.010 --> 00:22:12.890 there's a line, I'm going to flush this one" and then there's something that tells 00:22:12.890 --> 00:22:18.950 ok the line is also present somewhere else so then it flushes the line in level 1 00:22:18.950 --> 00:22:26.390 and/or level 2. So that's slow. Now if you perform clflush on some data that is not 00:22:26.390 --> 00:22:32.240 cached, basically it does the same, goes to the last level cache, sees that there's 00:22:32.240 --> 00:22:36.659 no line and there can't be any... This data can't be anywhere else in the cache 00:22:36.659 --> 00:22:41.269 because it would be in the last level cache if it was anywhere, so it does 00:22:41.269 --> 00:22:47.430 nothing and it stops there. So that's fast. So how exactly fast and slow am I talking 00:22:47.430 --> 00:22:53.760 about? So it's actually only a very few cycles so we did these experiments on 00:22:53.760 --> 00:22:59.072 different microarchitectures so Sandy Bridge, Ivy Bridge, and Haswell and... 00:22:59.072 --> 00:23:03.250 So the different colors correspond to the different microarchitectures. So first 00:23:03.250 --> 00:23:07.880 thing that is already...
kinda funny is that you can see that you can distinguish 00:23:07.880 --> 00:23:14.649 the microarchitectures quite nicely with this, but the real point is that you have 00:23:14.649 --> 00:23:20.280 really different zones. The solids... The solid line is when we performed the 00:23:20.280 --> 00:23:25.200 measurement on clflush with the line that was already in the cache, and the dashed 00:23:25.200 --> 00:23:30.840 line is when the line was not in the cache, and in all microarchitectures you 00:23:30.840 --> 00:23:36.539 can see that we can see a difference: It's only a few cycles, it's a bit noisy, so 00:23:36.539 --> 00:23:43.250 what could go wrong? Okay, so exploiting these few cycles, we still managed to 00:23:43.250 --> 00:23:47.029 perform a new cache attack that we call "Flush+Flush", so I'm going to explain 00:23:47.029 --> 00:23:52.220 that to you: So basically everything that we could do with "Flush+Reload", we can 00:23:52.220 --> 00:23:56.899 also do with "Flush+Flush". We can perform covert channels and side-channel attacks. 00:23:56.899 --> 00:24:01.090 It's stealthier than previous cache attacks, I'm going to go back on this one, 00:24:01.090 --> 00:24:07.220 and it's also faster than previous cache attacks. So how does it work exactly? So 00:24:07.220 --> 00:24:12.210 the principle is a bit similar to "Flush+Reload": So we have the attacker 00:24:12.210 --> 00:24:16.131 and the victim that have some kind of shared memory, let's say a shared library.
00:24:16.131 --> 00:24:21.340 It will be shared in the cache. The attacker will start by flushing the cache 00:24:21.340 --> 00:24:26.510 line, then lets the victim perform whatever it does, let's say encryption, 00:24:26.510 --> 00:24:32.120 the victim will load some data into the cache, automatically, and now the attacker 00:24:32.120 --> 00:24:36.720 wants to know again if the victim accessed this precise cache line and instead of 00:24:36.720 --> 00:24:43.540 reloading it, he is going to flush it again. And since we have this timing difference 00:24:43.540 --> 00:24:47.040 depending on whether the data is in the cache or not, it gives us the same 00:24:47.040 --> 00:24:54.889 information as if we reloaded it, except it's way faster. So I talked about 00:24:54.889 --> 00:24:59.690 stealthiness. So the thing is that basically these cache attacks, and that 00:24:59.690 --> 00:25:06.340 also applies to "Rowhammer": They are already stealthy in themselves, because 00:25:06.340 --> 00:25:10.470 there's no antivirus today that can detect them. But some people thought that we 00:25:10.470 --> 00:25:14.351 could detect them with performance counters because they do a lot of cache 00:25:14.351 --> 00:25:18.549 misses and cache references that happen when the data is flushed and when you 00:25:18.549 --> 00:25:26.090 reaccess memory. Now what we thought is: yeah, but that's not the only kind of 00:25:26.090 --> 00:25:31.269 program that leads to lots of cache misses and cache references, so we would like to have 00:25:31.269 --> 00:25:38.120 a slightly better metric.
So these cache attacks, they have a very heavy activity on 00:25:38.120 --> 00:25:43.840 the cache but they're also very particular because they are very short loops of code: 00:25:43.840 --> 00:25:48.610 if you take Flush+Reload, this is just flush one line, reload the line and then 00:25:48.610 --> 00:25:53.750 again flush, reload. That's a very short loop and that creates a very low pressure on 00:25:53.750 --> 00:26:01.490 the instruction TLB, which is kind of particular for cache attacks. So what we 00:26:01.490 --> 00:26:05.380 decided to do is normalizing the cache events, so the cache misses and cache 00:26:05.380 --> 00:26:10.720 references, by events that have to do with the instruction TLB, and there we could 00:26:10.720 --> 00:26:19.360 manage to detect cache attacks and Rowhammer without having false positives. 00:26:19.360 --> 00:26:24.510 So this is the metric that I'm going to use when I talk about stealthiness. So we 00:26:24.510 --> 00:26:29.750 started by creating a covert channel. First we wanted to have it as fast as possible 00:26:29.750 --> 00:26:36.160 so we created a protocol to evaluate all the kinds of cache attacks that we had, so 00:26:36.160 --> 00:26:40.540 Flush+Flush, Flush+Reload, and Prime+Probe, and we started with a 00:26:40.540 --> 00:26:47.010 packet size of 28, doesn't really matter. We measured the capacity of our covert 00:26:47.010 --> 00:26:52.799 channel and Flush+Flush is around 500 kB/s whereas Flush+Reload 00:26:52.799 --> 00:26:56.340 was only 300 kB/s so Flush+Flush is already quite an 00:26:56.340 --> 00:27:00.740 improvement on the speed.
Then we measured the stealthiness. At this 00:27:00.740 --> 00:27:06.100 speed only Flush+Flush was stealthy, and now the thing is that Flush+Flush and 00:27:06.100 --> 00:27:10.200 Flush+Reload, as you've seen, there are some similarities, so for a covert channel 00:27:10.200 --> 00:27:15.309 they also share the same sender, only the receiver is different, and for this one 00:27:15.309 --> 00:27:20.000 the sender was not stealthy for both of them. Anyway, if you want a fast covert channel 00:27:20.000 --> 00:27:26.640 then just try Flush+Flush, that works. Now let's try to make it 00:27:26.640 --> 00:27:30.639 completely stealthy, because if I have the sender that is not stealthy maybe we 00:27:30.639 --> 00:27:36.440 give away the whole attack, so we said okay maybe if we just slow down all the attacks 00:27:36.440 --> 00:27:41.240 then there will be less cache hits, cache misses and then maybe all 00:27:41.240 --> 00:27:48.070 the attacks are actually stealthy, why not? So we tried that, we slowed down everything, 00:27:48.070 --> 00:27:52.889 so Flush+Reload and Flush+Flush are around 50 kB/s now, 00:27:52.889 --> 00:27:55.829 Prime+Probe is a bit slower because it takes more time 00:27:55.829 --> 00:28:01.330 to prime and probe anything, but still 00:28:01.330 --> 00:28:09.419 even with this slowdown only Flush+Flush has its receiver stealthy and we also 00:28:09.419 --> 00:28:14.769 managed to have the sender stealthy now, so basically whether you want a fast covert 00:28:14.769 --> 00:28:20.450 channel or a stealthy covert channel, Flush+Flush is really great.
00:28:20.450 --> 00:28:26.500 Now we wanted to also evaluate if it wasn't too noisy to perform some side 00:28:26.500 --> 00:28:30.740 channel attack, so we did these side channels on the AES t-table implementation, 00:28:30.740 --> 00:28:35.910 the attack that we have shown you earlier. So we computed the number of 00:28:35.910 --> 00:28:41.820 encryptions that we needed to determine the upper four bits of a key byte, so here the 00:28:41.820 --> 00:28:48.870 lower the better the attack, and Flush+Reload is a bit better, so we need only 250 00:28:48.870 --> 00:28:55.029 encryptions to recover these bits, but Flush+Flush comes quite, comes quite 00:28:55.029 --> 00:29:00.570 close with 350, and Prime+Probe is actually the most noisy of them all, needs 00:29:00.570 --> 00:29:06.101 5... close to 5000 encryptions. So we have around the same performance for 00:29:06.101 --> 00:29:13.520 Flush+Flush and Flush+Reload. Now let's evaluate the stealthiness again. 00:29:13.520 --> 00:29:19.320 So what we did here is we performed 256 billion encryptions in a synchronous 00:29:19.320 --> 00:29:25.740 attack, so we really had the spy and the victim synchronized, and we evaluated the 00:29:25.740 --> 00:29:31.409 stealthiness of them all, and here only Flush+Flush again is stealthy. And while 00:29:31.409 --> 00:29:36.279 you can always slow down a covert channel, you can't actually slow down a side 00:29:36.279 --> 00:29:40.700 channel because, in a real-life scenario, you're not going to say "Hey victim, 00:29:40.700 --> 00:29:47.179 wait for me a bit, I am trying to do an attack here." That won't work. 00:29:47.179 --> 00:29:51.429 So there's even more to it, but I will need again a bit of background before 00:29:51.429 --> 00:29:56.910 continuing. So I've shown you the different levels of caches, and here I'm 00:29:56.910 --> 00:30:04.009 going to focus more on the last-level cache.
So we have here our four slices, so 00:30:04.009 --> 00:30:09.830 this is the last-level cache, and we have some bits of the address here that 00:30:09.830 --> 00:30:14.330 correspond to the set, but more importantly, we need to know in 00:30:14.330 --> 00:30:19.899 which slice an address is going to be. And that is given, that is given by some 00:30:19.899 --> 00:30:23.850 bits of the set and the tag of the address that are passed into a function 00:30:23.850 --> 00:30:27.960 that says in which slice the line is going to be. 00:30:27.960 --> 00:30:32.460 Now the thing is that this hash function is undocumented by Intel. Wouldn't be fun 00:30:32.460 --> 00:30:39.250 otherwise. So we have this: as many slices as cores, an undocumented hash function 00:30:39.250 --> 00:30:43.980 that maps a physical address to a slice, and while it's actually a bit of a pain 00:30:43.980 --> 00:30:48.710 for attacks, it was not designed for security originally but for 00:30:48.710 --> 00:30:53.570 performance, because you want all the accesses to be evenly distributed across the 00:30:53.570 --> 00:31:00.399 different slices, for performance reasons. So what the hash function basically does is, it 00:31:00.399 --> 00:31:05.279 takes some bits of the physical address and outputs k bits of slice, so just one 00:31:05.279 --> 00:31:09.309 bit if you have a two-core machine, two bits if you have a four-core machine, and 00:31:09.309 --> 00:31:16.830 so on. Now let's go back to clflush, see what's the relation with that. 00:31:16.830 --> 00:31:21.169 So the thing that we noticed is that clflush is actually faster to reach a line 00:31:21.169 --> 00:31:28.549 on the local slice.
So if you have, if you're flushing always 00:31:28.549 --> 00:31:33.340 one line and you run your program on core zero, core one, core two and core three, 00:31:33.340 --> 00:31:37.899 you will observe that on one core in particular, when you run the program on 00:31:37.899 --> 00:31:44.632 that core, the clflush is faster. And so here this is on core one, and you can see 00:31:44.632 --> 00:31:51.139 that on cores zero, two, and three it's a bit slower, and here we can deduce that, 00:31:51.139 --> 00:31:55.320 so we run the program on core one and we flush always the same line, and we can 00:31:55.320 --> 00:32:01.850 deduce that the line belongs to slice one. And what we can do with that is that we 00:32:01.850 --> 00:32:06.500 can map physical addresses to slices. And that's one way to reverse-engineer 00:32:06.500 --> 00:32:10.639 this addressing function that was not documented. 00:32:10.639 --> 00:32:15.880 Funnily enough, that's not the only way: what I did before that was using the 00:32:15.880 --> 00:32:21.229 performance counters to reverse-engineer this function, but that's actually a whole 00:32:21.229 --> 00:32:27.770 other story, and if you want more detail on that, there's also an article on that. 00:32:27.770 --> 00:32:30.139 ML: So the next instruction we want to 00:32:30.139 --> 00:32:35.110 talk about is the prefetch instruction. And the prefetch instruction is used to 00:32:35.110 --> 00:32:40.841 tell the CPU: "Okay, please load the data I need later on, into the cache, if you 00:32:40.841 --> 00:32:45.968 have some time." And in the end there are actually six different prefetch 00:32:45.968 --> 00:32:52.929 instructions: prefetcht0 to t2, which mean: "CPU, please load the data into the 00:32:52.929 --> 00:32:58.640 first-level cache", or into the last-level cache, whatever you want to use, but we 00:32:58.640 --> 00:33:02.250 spare you the details because it's not so interesting in the end.
00:33:02.250 --> 00:33:06.940 However, what's more interesting is when we take a look at the Intel manual and 00:33:06.940 --> 00:33:11.880 what it says there. So, "Using the PREFETCH instruction is recommended only 00:33:11.880 --> 00:33:17.049 if data does not fit in the cache." So you can tell the CPU: "Please load data I want 00:33:17.049 --> 00:33:23.210 to stream into the cache, so it's more performant." "Use of software prefetch 00:33:23.210 --> 00:33:27.740 should be limited to memory addresses that are managed or owned within the 00:33:27.740 --> 00:33:33.620 application context." So one might wonder what happens if this 00:33:33.620 --> 00:33:40.940 address is not managed by myself. Sounds interesting. "Prefetching to addresses 00:33:40.940 --> 00:33:46.289 that are not mapped to physical pages can experience non-deterministic performance 00:33:46.289 --> 00:33:52.030 penalty. For example specifying a NULL pointer as an address for prefetch can 00:33:52.030 --> 00:33:56.000 cause long delays." So we don't want to do that because our 00:33:56.000 --> 00:34:02.919 program will be slow. So, let's take a look what they mean with non-deterministic 00:34:02.919 --> 00:34:08.889 performance penalty, because we want to write good software, right? But before 00:34:08.889 --> 00:34:12.510 that, we have to take a look at a little bit more background information to 00:34:12.510 --> 00:34:17.710 understand the attacks. So on modern operating systems, every 00:34:17.710 --> 00:34:22.850 application has its own virtual address space. So at some point, the CPU needs to 00:34:22.850 --> 00:34:27.479 translate these addresses to the physical addresses actually in the DRAM. And for 00:34:27.479 --> 00:34:33.690 that we have this very complex-looking data structure. 
So we have a 48-bit 00:34:33.690 --> 00:34:40.409 virtual address, and some of those bits map to a table, like the PML4 00:34:40.409 --> 00:34:47.760 (page map level 4) table, with 512 entries, so depending on those bits the CPU knows at which entry it 00:34:47.760 --> 00:34:51.520 has to look. And if there is data there, because the 00:34:51.520 --> 00:34:56.900 address is mapped, it can proceed and look at the page directory pointer table, 00:34:56.900 --> 00:35:04.620 and so on down the levels. So everything is the same for each level until you come 00:35:04.620 --> 00:35:09.130 to your page table, where you have 4-kilobyte pages. So it's in the end not 00:35:09.130 --> 00:35:13.851 that complicated, but it's a bit confusing, because you want to know a 00:35:13.851 --> 00:35:20.310 physical address, so you have to look it up somewhere in the main memory, 00:35:20.310 --> 00:35:25.420 with physical addresses, to translate your virtual addresses. And if you have to go 00:35:25.420 --> 00:35:31.890 through all those levels, it takes a long time, so we can do better than that, and 00:35:31.890 --> 00:35:39.160 that's why Intel introduced additional caches, also for all of those levels. So, 00:35:39.160 --> 00:35:45.560 if you want to translate an address, you take a look at the ITLB for instructions, 00:35:45.560 --> 00:35:51.150 and the data TLB for data. If it's there, you can stop, otherwise you go down all 00:35:51.150 --> 00:35:58.700 those levels, and if it's not in any cache you have to look it up in the DRAM. In 00:35:58.700 --> 00:36:03.300 addition, the address space you have is shared, because you have, on the one hand, 00:36:03.300 --> 00:36:07.470 the user memory and, on the other hand, you have mapped the kernel, for convenience 00:36:07.470 --> 00:36:12.870 and performance, also in the address space.
And if your user program wants to access 00:36:12.870 --> 00:36:18.310 some kernel functionality, like reading a file, it will switch to the kernel memory, 00:36:18.310 --> 00:36:23.880 there's a privilege escalation, and then you can read the file, and so on. So, 00:36:23.880 --> 00:36:30.420 that's it. However, you have drivers in the kernel, and if you know the addresses 00:36:30.420 --> 00:36:35.771 of those drivers, you can do code-reuse attacks, and as a countermeasure, they 00:36:35.771 --> 00:36:40.150 introduced address-space layout randomization, also for the kernel. 00:36:40.150 --> 00:36:47.040 And this means that when you have your program running, the kernel is mapped at 00:36:47.040 --> 00:36:51.630 one address, and if you reboot the machine it's not at the same address anymore but 00:36:51.630 --> 00:36:58.390 somewhere else. So if there is a way to find out at which address the kernel is 00:36:58.390 --> 00:37:04.450 loaded, you have circumvented this countermeasure and defeated kernel address 00:37:04.450 --> 00:37:11.060 space layout randomization. So this would be nice for some attacks. In addition, 00:37:11.060 --> 00:37:16.947 there's also the kernel direct physical map. And what does this mean? It's 00:37:16.947 --> 00:37:23.320 implemented on many operating systems like OS X, Linux, also on the Xen hypervisor 00:37:23.320 --> 00:37:27.860 and BSD, but not on Windows. But what it means 00:37:27.860 --> 00:37:33.870 is that the complete physical memory is additionally mapped in the kernel 00:37:33.870 --> 00:37:40.460 memory at a fixed offset. So, for every page that is mapped in the user space, 00:37:40.460 --> 00:37:45.160 there's something like a twin page in the kernel memory, which you can't access 00:37:45.160 --> 00:37:50.371 because it's in the kernel memory. However, we will need it later, because 00:37:50.371 --> 00:37:58.230 now we go back to prefetch and see what we can do with that.
So, prefetch is not a 00:37:58.230 --> 00:38:04.150 usual instruction, because it just tells the CPU "I might need that data later on. 00:38:04.150 --> 00:38:10.000 If you have time, load it for me," if not, the CPU can ignore it because it's busy 00:38:10.000 --> 00:38:15.810 with other stuff. So, there's no necessity that this instruction is really executed, 00:38:15.810 --> 00:38:22.070 but most of the time it is. And an interesting thing is that it generates no 00:38:22.070 --> 00:38:29.000 faults, so whatever you pass to this instruction, your program won't crash, and 00:38:29.000 --> 00:38:33.990 it does not check any privileges, so I can also pass a kernel address to it and it 00:38:33.990 --> 00:38:37.510 won't say "No, stop, you accessed an address that you are not allowed to 00:38:37.510 --> 00:38:45.530 access, so I crash," it just continues, which is nice. 00:38:45.530 --> 00:38:49.810 The second interesting thing is that the operand is a virtual address, so every 00:38:49.810 --> 00:38:55.534 time you execute this instruction, the CPU has to go and check "OK, what physical 00:38:55.534 --> 00:38:59.600 address does this virtual address correspond to?" So it has to do the lookup 00:38:59.600 --> 00:39:05.750 with all those tables we've seen earlier, and as you probably have guessed already, 00:39:05.750 --> 00:39:10.370 the execution time varies also for the prefetch instruction, and we will see later 00:39:10.370 --> 00:39:16.090 on what we can do with that. So, let's get back to the direct physical 00:39:16.090 --> 00:39:22.870 map. Because we can create an oracle for address translation, so we can find out 00:39:22.870 --> 00:39:27.540 what physical address belongs to a virtual address.
Because nowadays you 00:39:27.540 --> 00:39:31.990 don't want the user to know that, because you can craft nice Rowhammer attacks with 00:39:31.990 --> 00:39:37.520 that information, and more advanced cache attacks, so you restrict this information 00:39:37.520 --> 00:39:44.270 from the user. But let's check if we find a way to still get this information. So, as 00:39:44.270 --> 00:39:50.150 I've told you earlier, if you have a page mapped in the user space, 00:39:50.150 --> 00:39:54.505 you have the twin page in the kernel space, and if it's cached, 00:39:54.505 --> 00:39:56.710 it's cached for both of them. 00:39:56.710 --> 00:40:03.170 So, the attack now works as follows: as the attacker, you flush your user- 00:40:03.170 --> 00:40:09.760 space page, so it's not in the cache, also for the kernel memory, and 00:40:09.760 --> 00:40:15.850 then you call prefetch on the address of the kernel, because, as I told you, you 00:40:15.850 --> 00:40:22.050 still can do that because it doesn't create any faults. So, you tell the CPU 00:40:22.050 --> 00:40:28.310 "Please load me this data into the cache, even if I don't have access to this data 00:40:28.310 --> 00:40:32.550 normally." And if we now measure the access to our user-space 00:40:32.550 --> 00:40:37.100 page again, and we measure a cache hit, because it has been loaded by 00:40:37.100 --> 00:40:42.630 the CPU into the cache, we know exactly which kernel address, since we passed the 00:40:42.630 --> 00:40:48.250 address to the instruction, this page corresponds to. And because this is at a 00:40:48.250 --> 00:40:53.280 fixed offset, we can just do a simple subtraction and know the physical address 00:40:53.280 --> 00:40:59.180 again. So we have a nice way to find physical addresses for virtual addresses. 00:40:59.180 --> 00:41:04.390 And in practice this looks like the following plot.
So, it's pretty simple, 00:41:04.390 --> 00:41:08.910 because we just do this for every address, and at some point we measure a cache hit. 00:41:08.910 --> 00:41:14.260 So, there's a huge difference. And exactly at this point we know this physical 00:41:14.260 --> 00:41:20.140 address corresponds to our virtual address. The second thing is that we can 00:41:20.140 --> 00:41:27.070 exploit the timing differences of the prefetch instruction. Because, as 00:41:27.070 --> 00:41:31.850 I told you, when you go down those translation levels, at some point you see "it's here" 00:41:31.850 --> 00:41:37.500 or "it's not here," so it can abort early. And with that we can know exactly 00:41:37.500 --> 00:41:41.800 where the prefetch instruction aborted, and know how the 00:41:41.800 --> 00:41:48.070 pages are mapped into the address space. So, the timing depends on where the 00:41:48.070 --> 00:41:57.090 translation stops. And using those two properties and this information, we can 00:41:57.090 --> 00:42:02.227 do the following: on the one hand, we can build variants of cache attacks. So, 00:42:02.227 --> 00:42:07.444 instead of Flush+Reload, we can do Flush+Prefetch, for instance. We can 00:42:07.444 --> 00:42:12.060 also use prefetch to mount Rowhammer attacks on privileged addresses, because 00:42:12.060 --> 00:42:18.069 it doesn't raise any faults when we pass those addresses, and it works as well. In 00:42:18.069 --> 00:42:23.330 addition, we can use it to recover the translation levels of a process, which you 00:42:23.330 --> 00:42:27.870 could do earlier with the pagemap file, but as I told you, it's now privileged, so 00:42:27.870 --> 00:42:32.890 you don't have access to that, and by doing that you can bypass address space 00:42:32.890 --> 00:42:38.170 layout randomization.
In addition, as I told you, you can translate virtual 00:42:38.170 --> 00:42:43.530 addresses to physical addresses, which is now also privileged with the pagemap 00:42:43.530 --> 00:42:48.790 file, and using that it re-enables ret2dir (return-to-direct-mapped-memory) exploits, which have been 00:42:48.790 --> 00:42:55.550 demonstrated last year. On top of that, we can also use this to locate kernel 00:42:55.550 --> 00:43:00.850 drivers, as I told you. It would be nice if we can circumvent KASLR as well, and I 00:43:00.850 --> 00:43:08.380 will show you now how this is possible. So, with the first oracle we find out all 00:43:08.380 --> 00:43:15.430 the pages that are mapped, and for each of those pages, we evict the translation 00:43:15.430 --> 00:43:18.210 caches, and we can do that by either calling sleep, 00:43:18.210 --> 00:43:24.450 which schedules another program, or accessing just a large memory buffer. Then, we 00:43:24.450 --> 00:43:28.260 perform a syscall to the driver. So, there's code of the driver executed and 00:43:28.260 --> 00:43:33.540 loaded into the cache, and then we just measure the time prefetch takes on this 00:43:33.540 --> 00:43:40.840 address. And in the end, the page with the fastest average access time is the driver page. 00:43:40.840 --> 00:43:46.770 So, we can mount this attack on Windows 10 in less than 12 seconds. So, we can defeat 00:43:46.770 --> 00:43:52.110 KASLR in less than 12 seconds, which is very nice. And in practice, the 00:43:52.110 --> 00:43:58.330 measurements look like the following: so, we have a lot of long measurements, and at 00:43:58.330 --> 00:44:05.060 some point you have a low one, and you know exactly that this is the driver region and 00:44:05.060 --> 00:44:09.930 the address where the driver is located. And you can mount those ret2dir 00:44:09.930 --> 00:44:16.210 attacks again. However, that's not everything, because there are more 00:44:16.210 --> 00:44:20.795 instructions on Intel CPUs.
CM: Yeah, so, the following is not our 00:44:20.795 --> 00:44:24.350 work, but we thought that it would be interesting, because it's basically more 00:44:24.350 --> 00:44:30.740 instructions, more attacks, more fun. So there's the RDSEED instruction, and what 00:44:30.740 --> 00:44:35.340 it does is request a random seed from the hardware random number generator. So, 00:44:35.340 --> 00:44:39.310 the thing is that there is a fixed number of precomputed random bits, and it takes 00:44:39.310 --> 00:44:44.320 time to regenerate them. So, as with everything that takes time, you can create a covert 00:44:44.320 --> 00:44:50.180 channel with that. There are also FADD and FMUL, which are floating-point operations. 00:44:50.180 --> 00:44:56.740 Here, the running time of these instructions depends on the operands. Some people 00:44:56.740 --> 00:45:01.530 managed to bypass Firefox's same-origin policy with an SVG filter timing attack 00:45:01.530 --> 00:45:08.540 with that. There are also the JMP instructions. So, in modern CPUs you have 00:45:08.540 --> 00:45:14.520 branch prediction, and branch target prediction. With that, and it's actually been 00:45:14.520 --> 00:45:18.250 studied a lot, you can create a covert channel. You can do side-channel attacks 00:45:18.250 --> 00:45:26.028 on crypto. You can also bypass KASLR. And finally, there are the TSX instructions, which 00:45:26.028 --> 00:45:31.010 are an extension for hardware transactional memory support, which has also been used 00:45:31.010 --> 00:45:37.150 to bypass KASLR. So, in case you're not sure, KASLR is dead. You have lots of 00:45:37.150 --> 00:45:45.650 different things to read. Okay, so, on to the conclusion now. So, as you've seen, it's 00:45:45.650 --> 00:45:50.190 actually more a problem of CPU design than really of the instruction-set 00:45:50.190 --> 00:45:55.720 architecture. The thing is that all these issues are really hard to patch.
They 00:45:55.720 --> 00:45:59.966 are all linked to performance optimizations, and we are not getting rid 00:45:59.966 --> 00:46:03.890 of performance optimizations. That's basically a trade-off between performance 00:46:03.890 --> 00:46:11.530 and security, and performance seems to always win. There have been some 00:46:11.530 --> 00:46:20.922 propositions against cache attacks, to, let's say, remove the CLFLUSH 00:46:20.922 --> 00:46:26.640 instruction. The thing is that all these quick fixes won't work, because we always 00:46:26.640 --> 00:46:31.450 find new ways to do the same thing without these precise instructions, and also, we 00:46:31.450 --> 00:46:37.410 keep finding new instructions that leak information. So, it's really, let's say, 00:46:37.410 --> 00:46:43.740 quite a big topic that we have to fix this. So, thank you very much for your 00:46:43.740 --> 00:46:47.046 attention. If you have any questions, we'd be happy to answer them. 00:46:47.046 --> 00:46:52.728 applause 00:46:52.728 --> 00:47:01.510 applause Herald: Okay. Thank you very much again 00:47:01.510 --> 00:47:06.571 for your talk, and now we will have a Q&A, and we have, I think, about 15 minutes, so 00:47:06.571 --> 00:47:11.330 you can start lining up behind the microphones. They are in the gangways in 00:47:11.330 --> 00:47:18.130 the middle. Except, I think that one... oh, no, it's back up, so it will work. And 00:47:18.130 --> 00:47:22.180 while we wait, I think we will take questions from our signal angel, if there 00:47:22.180 --> 00:47:28.810 are any. Okay, there aren't any, so... microphone questions. I think, you in 00:47:28.810 --> 00:47:33.440 front. Microphone: Hi. Can you hear me? 00:47:33.440 --> 00:47:40.050 Herald: Try again. Microphone: Okay. Can you hear me now? 00:47:40.050 --> 00:47:46.480 Okay. Yeah, I'd like to know what exactly was your stealthiness metric? Was it that 00:47:46.480 --> 00:47:51.310 you can't distinguish it from a normal process, or...?
00:47:51.310 --> 00:47:56.500 CM: So... Herald: Wait a second. We still have Q&A, 00:47:56.500 --> 00:47:59.780 so could you quiet down a bit? That would be nice. 00:47:59.780 --> 00:48:08.180 CM: So, the question was about the stealthiness metric. Basically, we use the 00:48:08.180 --> 00:48:14.320 metric with cache misses and cache references, normalized by the instruction 00:48:14.320 --> 00:48:21.080 TLB events, and we just found a threshold: 00:48:21.080 --> 00:48:25.820 pretty much every benign application was below it, and Rowhammer and cache 00:48:25.820 --> 00:48:30.520 attacks were above it. So we fixed the threshold, basically. 00:48:30.520 --> 00:48:35.520 H: That microphone. Microphone: Hello. Thanks for your talk. 00:48:35.520 --> 00:48:42.760 It was great. First question: Did you inform Intel before doing this talk? 00:48:42.760 --> 00:48:47.520 CM: Nope. Microphone: Okay. The second question: 00:48:47.520 --> 00:48:51.050 What are your future plans? CM: Sorry? 00:48:51.050 --> 00:48:55.780 M: What are your future plans? CM: Ah, future plans. Well, something 00:48:55.780 --> 00:49:01.220 that would be interesting: we keep finding these more or less by accident, or 00:49:01.220 --> 00:49:06.440 manually, so having a good idea of what's the attack surface here would be a good 00:49:06.440 --> 00:49:10.050 thing, and doing that automatically would be even better. 00:49:10.050 --> 00:49:14.170 M: Great, thanks. H: Okay, the microphone in the back, 00:49:14.170 --> 00:49:18.770 over there. The guy in white. M: Hi. One question. If you have, 00:49:18.770 --> 00:49:24.410 like, a daemon that randomly invalidates some cache lines, would that be a better 00:49:24.410 --> 00:49:31.120 countermeasure than disabling the caches? ML: What was the question? 00:49:31.120 --> 00:49:39.580 CM: If invalidating cache lines would be better than disabling the whole cache. So, 00:49:39.580 --> 00:49:42.680 I'm...
ML: If you know which cache lines have 00:49:42.680 --> 00:49:47.300 been accessed by the process, you can invalidate those cache lines before you 00:49:47.300 --> 00:49:52.820 swap those processes, but it's also a trade-off with performance. Like, you 00:49:52.820 --> 00:49:57.940 can also, if you switch processes, flush the whole cache, and then it's empty, and 00:49:57.940 --> 00:50:01.900 then you don't see any activity anymore, but there's also the trade-off of 00:50:01.900 --> 00:50:07.510 performance with this. M: Okay, maybe a second question. 00:50:07.510 --> 00:50:12.240 There are some ARM architectures that have random cache-line invalidations. 00:50:12.240 --> 00:50:16.010 Did you try those, if you can see a [unintelligible] channel there? 00:50:16.010 --> 00:50:21.960 ML: If they're truly random... but probably you just have to make more measurements 00:50:21.960 --> 00:50:27.180 and more measurements, and then you can average out the noise, and then you can do 00:50:27.180 --> 00:50:30.350 these attacks again. It's like with Prime+Probe, where you need more 00:50:30.350 --> 00:50:34.080 measurements, because it's much more noisy, so in the end you will just need 00:50:34.080 --> 00:50:37.870 many more measurements. CM: So, on ARM, it's supposed to be pretty 00:50:37.870 --> 00:50:43.260 random. At least it's in the manual, but we actually found nice ways to evict cache 00:50:43.260 --> 00:50:47.230 lines that we really wanted to evict, so it's not actually that pseudo-random. 00:50:47.230 --> 00:50:51.960 So, even... let's say, if something is truly random, it might be nice, but then 00:50:51.960 --> 00:50:57.170 it's also quite complicated to implement. I mean, you probably don't want a random 00:50:57.170 --> 00:51:01.480 number generator just for the cache. M: Okay. Thanks. 00:51:01.480 --> 00:51:05.980 H: Okay, and then the three guys here on the microphone in the front.
00:51:05.980 --> 00:51:13.450 M: My question is about a detail of the keylogger. You could distinguish between 00:51:13.450 --> 00:51:18.150 space, backspace and alphabet, which is quite interesting. But could you also 00:51:18.150 --> 00:51:22.320 figure out the specific keys that were pressed, and if so, how? 00:51:22.320 --> 00:51:25.650 ML: Yeah, that depends on the implementation of the keyboard. But what 00:51:25.650 --> 00:51:29.310 we did, we used the Android stock keyboard, which is shipped with the 00:51:29.310 --> 00:51:34.520 Samsung, so it's pre-installed. And if you have a table somewhere in your code which 00:51:34.520 --> 00:51:39.540 says "Okay, if you press this exact location or this image, it's an A or it's 00:51:39.540 --> 00:51:44.450 a B", then you can also do a more sophisticated attack. So, if you find any 00:51:44.450 --> 00:51:49.050 functions or data in the code which directly tell you "Okay, this is this 00:51:49.050 --> 00:51:54.520 character," you can also spy on the actual key characters on the keyboard. 00:51:54.520 --> 00:52:02.900 M: Thank you. M: Hi. Thank you for your talk. My first 00:52:02.900 --> 00:52:08.570 question is: What can we actually do now to mitigate this kind of attack? By, for 00:52:08.570 --> 00:52:11.980 example, switching off TSX or using ECC RAM? 00:52:11.980 --> 00:52:17.410 CM: So, I think the very important thing to protect would be, like, crypto, and the 00:52:17.410 --> 00:52:20.840 good thing is that today we know how to build crypto that is resistant to side- 00:52:20.840 --> 00:52:24.490 channel attacks. So the good thing would be to stop improving implementations that 00:52:24.490 --> 00:52:31.360 have been known to be vulnerable for 10 years. Then things like keystrokes are way harder 00:52:31.360 --> 00:52:36.830 to protect, so let's say crypto is manageable; the whole system is clearly 00:52:36.830 --> 00:52:41.490 another problem.
And you can have different types of countermeasures on the 00:52:41.490 --> 00:52:45.780 hardware side, but that would mean that Intel and ARM actually want to fix that, 00:52:45.780 --> 00:52:48.560 and that they know how to fix that. I don't even know how to fix that in 00:52:48.560 --> 00:52:55.500 hardware. Then on the system side, if you prevent some kinds of memory sharing, you 00:52:55.500 --> 00:52:58.540 don't have Flush+Reload anymore, and Prime+Probe is much 00:52:58.540 --> 00:53:04.880 noisier, so it would be an improvement. M: Thank you. 00:53:04.880 --> 00:53:11.880 H: Do we have signal angel questions? No. OK, then more microphone. 00:53:11.880 --> 00:53:16.630 M: Hi, thank you. I wanted to ask about the way you establish the covert channel 00:53:16.630 --> 00:53:23.280 between the two processes, because it would obviously have to be timed in a way to 00:53:23.280 --> 00:53:28.511 transmit information from one process to the other. Is there anywhere that you 00:53:28.511 --> 00:53:32.970 documented the whole thing? You know, it's actually almost like the seven layers or 00:53:32.970 --> 00:53:36.580 something like that. Is there any way that you documented that? It would be 00:53:36.580 --> 00:53:40.260 really interesting to know how it worked. ML: You can find this information in the 00:53:40.260 --> 00:53:46.120 papers, because there are several papers on covert channels using that, so the NDSS 00:53:46.120 --> 00:53:51.300 paper is published in February, I guess, but the ARMageddon paper also includes 00:53:51.300 --> 00:53:55.670 a covert channel, and you can find more information about what the 00:53:55.670 --> 00:53:59.320 packets look like and how the synchronization works in the paper. 00:53:59.320 --> 00:54:04.020 M: Thank you. H: One last question? 00:54:04.020 --> 00:54:09.750 M: Hi! You mentioned that you used Osvik's attack for the AES side-channel attack.
00:54:09.750 --> 00:54:17.350 Did you solve the AES round detection, and did it require some scheduler 00:54:17.350 --> 00:54:21.441 manipulation? CM: So on this one, I think we only did 00:54:21.441 --> 00:54:24.280 a synchronous attack, so we already knew when 00:54:24.280 --> 00:54:27.770 the victim was going to be scheduled, and we didn't have to do anything with 00:54:27.770 --> 00:54:32.930 schedulers. M: Alright, thank you. 00:54:32.930 --> 00:54:37.140 H: Are there any more questions? No, I don't see anyone. Then, thank you very 00:54:37.140 --> 00:54:39.132 much again to our speakers. 00:54:39.132 --> 00:54:42.162 applause 00:54:42.162 --> 00:54:58.970 music 00:54:58.970 --> 00:55:06.000 subtitles created by c3subtitles.de in the year 2020. Join, and help us!