WEBVTT 00:00:03.129 --> 00:00:07.360 35C3 preroll music 00:00:18.780 --> 00:00:23.869 Herald: So the next talk is by Benjamin Kollenda and Philipp Koppe - they will refresh our 00:00:23.869 --> 00:00:30.529 memories because they already had a talk at 34C3 where they talked about the microcode 00:00:30.529 --> 00:00:37.580 ROM and today they're gonna give us more insights on how microcode works. And 00:00:37.580 --> 00:00:44.320 more details on the ROM itself. Benjamin is a PhD student and has a focus on 00:00:44.320 --> 00:00:51.280 software attacks and defenses and together with Philipp they will now abuse AMD 00:00:51.280 --> 00:00:55.190 microcode for fun and security. Please enjoy. 00:00:55.190 --> 00:00:58.730 Applause 00:01:01.320 --> 00:01:06.260 Benjamin: Thank you. So as mentioned we were able to reverse engineer the AMD 00:01:06.260 --> 00:01:11.599 microcode and the AMD microcode ROM and I'm going to talk about our journey. What 00:01:11.599 --> 00:01:16.369 we learned on the way and how we did it. So this is joint work with my colleagues at 00:01:16.369 --> 00:01:20.799 Ruhr-Universität Bochum. And a quick outline of how we are going to do it. We're going to 00:01:20.799 --> 00:01:25.380 start with a quick crash course on microarchitectural basics and what microcode 00:01:25.380 --> 00:01:28.350 actually is. Then I talk about how we reconstructed the 00:01:28.350 --> 00:01:30.330 microcode ROM and what we learned 00:01:30.330 --> 00:01:35.389 along the way. Then I quickly give some examples of the applications we 00:01:35.389 --> 00:01:41.430 implemented with the knowledge we gained from the second step. And lastly I talk about 00:01:41.430 --> 00:01:47.649 a framework we used. How it works and what we can do with it. And also this framework 00:01:47.649 --> 00:01:51.899 is available on GitHub along with some other tools so you're free to continue our 00:01:51.899 --> 00:01:57.189 work. OK.
So when I'm talking about microcode you can think of it essentially 00:01:57.189 --> 00:02:02.331 as a firmware for your processor. It handles multiple purposes for example 00:02:02.331 --> 00:02:06.440 you can use it to fix CPU bugs that you have in silicon and you want to fix later 00:02:06.440 --> 00:02:11.971 in the design phase. It is used for instruction decoding - I cover this one a 00:02:11.971 --> 00:02:17.970 bit more. It is also used for exception handling. For example, if an exception or 00:02:17.970 --> 00:02:22.200 interrupt is raised, microcode has a first chance of modifying this interrupt 00:02:22.200 --> 00:02:27.110 ignoring it or just passing it along to the operating system. It's also used for 00:02:27.110 --> 00:02:31.790 power management and some other complex features like Intel SGX. And most 00:02:31.790 --> 00:02:37.318 importantly for us microcode is updatable. This used to patch errors in the field. 00:02:37.318 --> 00:02:40.975 Everyone remembers Spectre / Meltdown patches and there's 00:02:40.975 --> 00:02:44.210 a microcode update. So your 00:02:44.210 --> 00:02:50.830 x86 CPU takes multiple steps to execute an instruction. The first step is decoding 00:02:50.830 --> 00:02:55.022 a x86 instruction into multiple smaller micro ops. 00:02:55.022 --> 00:02:57.150 These are then scheduled into the pipeline 00:02:57.150 --> 00:03:01.632 From there, they are dispatched to the different functional units 00:03:01.632 --> 00:03:03.532 like your ALU / AGU 00:03:03.532 --> 00:03:06.392 multiplication division units 00:03:06.392 --> 00:03:08.355 For our purposes the decode step is the 00:03:08.355 --> 00:03:12.190 most interesting one. In the decode step you have a instruction buffer that feeds 00:03:12.190 --> 00:03:17.030 instructions to some decoders. You have short decoders that handle really simple 00:03:17.030 --> 00:03:21.100 instructions. There are long decoders that can handle some more advance instructions. 
00:03:21.100 --> 00:03:25.260 And finally, the vector decoder. The vector decoder handles the most complex 00:03:25.260 --> 00:03:29.690 instructions with the help of microcode. So the microcode engine is essentially the 00:03:29.690 --> 00:03:31.247 vector decoder. 00:03:32.458 --> 00:03:36.570 The microcode engine in essence is composed of a microcode 00:03:36.570 --> 00:03:40.770 ROM that stores the instructions for the microcode engine. Think of it as your 00:03:40.770 --> 00:03:48.190 standard instructions. Then there is also a writable memory, the microcode RAM. This 00:03:48.190 --> 00:03:52.520 is where the microcode updates end up when you apply microcode updates. And of course 00:03:52.520 --> 00:03:57.310 around the storage there is a whole lot of things that make it actually run. For this 00:03:57.310 --> 00:04:00.860 talk, you only need to know what Match Registers are. Match Registers are 00:04:00.860 --> 00:04:05.650 essentially breakpoint registers. So if we write an address from inside the microcode 00:04:05.650 --> 00:04:10.670 ROM into a Match Register, whenever this address is fetched, execution control is 00:04:10.670 --> 00:04:17.570 transferred to the microcode RAM so our patch gets executed. And the microcode 00:04:17.570 --> 00:04:23.060 updates are usually loaded by the BIOS or by the kernel. Linux has an update driver, 00:04:23.060 --> 00:04:28.340 sometimes the BIOS updates it with a pre-installed version and they have a 00:04:28.340 --> 00:04:32.120 pretty simple structure: a partially documented header, followed by the 00:04:32.120 --> 00:04:37.730 actual microcode that is loaded inside the CPU. And so microcode is organized in 00:04:37.730 --> 00:04:42.650 something called triads. Each triad has three operations, essentially x86 00:04:42.650 --> 00:04:48.230 instructions, but with some differences. And lastly, you have a sequence word.
The 00:04:48.230 --> 00:04:52.025 sequence word indicates which microcode instructions should be executed next. We 00:04:52.025 --> 00:04:57.950 have options of executing just the next triad, executing another one by branching 00:04:57.950 --> 00:05:01.936 to it, or just saying OK, I'm done with decoding this instruction continue with 00:05:01.936 --> 00:05:07.490 x86 code. These updates are protected by some weak authentication which we were 00:05:07.490 --> 00:05:13.260 able to break so we can create our own. We can analyze existing ones and we can apply 00:05:13.260 --> 00:05:20.620 these to your standard laptop and desktop. However there can only ever be one update 00:05:20.620 --> 00:05:26.534 loaded at the time and when you reboot your machine this update will be gone. 00:05:28.490 --> 00:05:32.990 Also for the talk we are going to look at some microcode and we will present this 00:05:32.990 --> 00:05:38.150 microcode using a register transfer language. It is heavily based on x86. I'm 00:05:38.150 --> 00:05:43.290 just going to cover the differences between these two. Most importantly the 00:05:43.290 --> 00:05:48.650 microcode can have three operands for an instruction in comparison to x86 which 00:05:48.650 --> 00:05:53.640 usually only has two. So you can specify a destination and two source operands. 00:05:55.618 --> 00:05:56.446 Also, 00:05:57.210 --> 00:06:02.240 microcode has some certain bit flags that need to be set and these we do we see with 00:06:02.240 --> 00:06:07.449 these annotations for example ".C" means says instruction also updates a carry flag 00:06:07.449 --> 00:06:14.050 based on the result. Then you have the instruction "jcc" which is a conditional 00:06:14.050 --> 00:06:19.570 branch and the first operand denotes the condition up on which this branch is 00:06:19.570 --> 00:06:24.100 taken. 
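As an aside on the triad layout, sequence words and match registers described above, the following is a minimal Python sketch of how such an engine could step through triads. It is a toy model, not the real hardware: the class names, the single shared address space for ROM and RAM, and the use of plain strings as micro-operations are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Seq(Enum):
    NEXT = "next"          # continue with the following triad
    BRANCH = "branch"      # continue at an explicit triad address
    COMPLETE = "complete"  # instruction is decoded, return to x86 code


@dataclass
class Triad:
    ops: tuple             # three micro-operations (plain strings in this toy model)
    seq: Seq               # the sequence word
    target: int = 0        # only meaningful when seq is Seq.BRANCH


def run(triads, match_registers, entry, limit=100):
    """Return the micro-ops executed when decoding starts at `entry`.

    `triads` maps addresses to Triad objects (ROM and RAM share one
    hypothetical address space); `match_registers` maps a ROM address
    to the RAM address of the patch that should run instead.
    """
    trace, addr = [], entry
    for _ in range(limit):
        # A match-register hit redirects the fetch to the patch RAM.
        addr = match_registers.get(addr, addr)
        triad = triads[addr]
        trace.extend(triad.ops)
        if triad.seq is Seq.COMPLETE:
            return trace
        addr = triad.target if triad.seq is Seq.BRANCH else addr + 1
    raise RuntimeError("triad limit exceeded")


triads = {
    0x10: Triad(("a", "b", "c"), Seq.NEXT),
    0x11: Triad(("d", "e", "f"), Seq.COMPLETE),
    0x1000: Triad(("patched", "nop", "nop"), Seq.BRANCH, target=0x11),
}
unpatched = run(triads, {}, 0x10)
patched = run(triads, {0x10: 0x1000}, 0x10)
```

In this model, setting one match register is enough to replace the first ROM triad with the patch while the rest of the ROM routine still runs.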
In this case branch if the carry flag is one and [the] second operand 00:06:24.100 --> 00:06:30.300 indicates the offset to add to the instruction pointer. Then we also have 00:06:30.300 --> 00:06:35.760 some sequence word annotations: "next", "complete", and "branch". Also it should 00:06:35.760 --> 00:06:39.958 be noted that the internal microcode architecture is a load-store architecture. 00:06:39.958 --> 00:06:45.350 You can't use memory operands in other instructions like you can on x86 you 00:06:45.350 --> 00:06:48.310 always need to load and store memory explicitly. 00:06:49.190 --> 00:06:51.710 Now we are going to talk about 00:06:51.710 --> 00:06:58.710 how we manage to recover the microcode ROM. The microcode ROM is baked into your 00:06:58.710 --> 00:07:06.860 CPU, you can't change it anymore. It is defined in the silicon during the 00:07:06.860 --> 00:07:12.930 fabrication process and in this picture you can see a die shot taken with a 00:07:12.930 --> 00:07:16.840 electron microscope and this is one of three regions that contains the bits for 00:07:16.840 --> 00:07:23.240 the microcode operations. And if you zoom in a bit more, each of these regions 00:07:23.240 --> 00:07:30.050 consist out of four arrays and these are further subdivided into blocks. Really 00:07:30.050 --> 00:07:34.660 interesting is "Array 2" which is a bit smaller than the other ones but it has 00:07:34.660 --> 00:07:42.160 some structures above it which are of a different visual layout. This is SRAM 00:07:42.160 --> 00:07:47.050 which stores the microcode update. So this is one-time reprogrammable memory that is 00:07:47.050 --> 00:07:53.860 still pretty fast. So the microcode RAM is located right next to the microcode ROM 00:07:53.860 --> 00:07:57.645 which also makes sense from a design standpoint. 00:08:00.445 --> 00:08:02.010 Just an overview of how we 00:08:02.010 --> 00:08:06.930 went ahead and how we went about. 
We started with pictures and then we used 00:08:06.930 --> 00:08:11.456 some OCR-like process to transform them into bitstrings which we can then further 00:08:11.456 --> 00:08:17.169 process. These bitstrings were then arranged into triads. We could already 00:08:17.169 --> 00:08:22.050 gather that we got individual triads right because there were data dependencies 00:08:22.050 --> 00:08:27.550 all over the place, but between triads, there were no or very few data 00:08:27.550 --> 00:08:33.699 dependencies, so the ordering of the triads was still wrong. This was the 00:08:33.699 --> 00:08:38.860 major problem as we went ahead, and what we had to reverse engineer is the 00:08:38.860 --> 00:08:43.870 mapping of a certain physical address of a triad that we gathered from the ROM 00:08:43.870 --> 00:08:48.050 readout to a virtual address that is used inside the microcode update or the 00:08:48.050 --> 00:08:53.690 microcode ROM. But after reverse engineering this, you can just do a linear sweep 00:08:53.690 --> 00:08:59.020 disassembly of the microcode ROM and arrive at human-readable output. But this 00:08:59.020 --> 00:09:04.870 recovery was a bit tricky because we required physical-virtual address pairs. 00:09:04.870 --> 00:09:09.520 But gathering these is a bit harder because we worked through the 00:09:09.520 --> 00:09:14.040 available updates, but we could only find two pairs of them. These pairs were 00:09:14.040 --> 00:09:18.520 actually easy to find because every update replaces a certain triad inside your 00:09:18.520 --> 00:09:24.580 microcode ROM and this triad is usually also placed in the microcode update. So by 00:09:24.580 --> 00:09:31.260 matching the address this update replaces with the microcode ROM readout, you can just 00:09:31.260 --> 00:09:38.000 get your two data points.
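The pairing step just described can be sketched roughly as follows. This is a simplified illustration in Python, assuming the ROM readout is a list of triad bitstrings in physical order and each update has already been reduced to a (virtual address, replaced triad bits) tuple; the function name and data layout are hypothetical and not taken from the speakers' tooling.

```python
def address_pairs(rom_readout, updates):
    """Map virtual addresses to physical indices by exact triad match.

    rom_readout: list of triad bitstrings, indexed by physical position.
    updates: iterable of (virtual_address, triad_bits) tuples, where
             triad_bits is the copy of the replaced ROM triad.
    """
    # Index every bit pattern by the physical positions where it occurs.
    index = {}
    for phys, bits in enumerate(rom_readout):
        index.setdefault(bits, []).append(phys)

    pairs = {}
    for virt, bits in updates:
        hits = index.get(bits, [])
        if len(hits) == 1:  # skip missing or ambiguous matches
            pairs[virt] = hits[0]
    return pairs


readout = ["0101", "1111", "0011", "1111"]
found = address_pairs(readout, [(0x7E5, "0011"), (0x420, "1111"), (0x100, "0000")])
```

Only unambiguous matches are kept, since a bit pattern that occurs several times in the readout cannot pin down a single physical position.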
But we had to get more data points so we generated these 00:09:38.000 --> 00:09:42.630 mappings by matching semantics of triads in the microcode ROM readout and the 00:09:42.630 --> 00:09:47.779 semantics when we force execution of a certain microcode address. And gathering 00:09:47.779 --> 00:09:52.330 the semantics of the read-out microcode, we implemented a simple microcode 00:09:52.330 --> 00:09:58.820 simulator. Essentially it works on triad level, so you give it an input state and a 00:09:58.820 --> 00:10:03.430 triad and it calculates the output state of it. Input and output state are 00:10:03.430 --> 00:10:08.460 comprised out of the x86-state which is your standard registers and also the 00:10:08.460 --> 00:10:12.320 internal microcode registers. There are multiple temporary registers that get 00:10:12.320 --> 00:10:18.350 reset for every new x86 instruction that is executed, but they can also be modified 00:10:18.350 --> 00:10:24.130 by microcode of course. Our emulator supports all known arithmetic operations 00:10:24.130 --> 00:10:29.230 and we have a white-list of operations that do not form or produce any observable 00:10:29.230 --> 00:10:32.950 change in state just so that we could process more triades and give them more 00:10:32.950 --> 00:10:41.310 data points. In total we gathered 54 additional data-address pairs which turned 00:10:41.310 --> 00:10:46.649 out to be enough to recover the whole mapping. This mapping, essentially you 00:10:46.649 --> 00:10:50.820 have the four different arrays that map to individual blocks and these blocks in 00:10:50.820 --> 00:10:56.750 these arrays or then again permuted a bit and then the triads inside these blocks 00:10:56.750 --> 00:11:02.330 have some table-based permutations. So this is not an obfuscation. 
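The semantic matching described above can be illustrated with a very small triad-level simulator. This is a sketch under strong assumptions: operations are written as (op, destination, source1, source2) tuples, only a handful of arithmetic operations exist, registers are 64 bits wide, and the "observed" state stands in for what would be measured on real hardware. None of these names come from the actual framework.

```python
MASK = (1 << 64) - 1  # assumption: 64-bit registers

OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}


def simulate(triad, state):
    """Apply the three operations of one triad to a copy of `state`."""
    out = dict(state)
    for op, dst, src1, src2 in triad:
        # Operands are register names (str) or immediates (int).
        a = out[src1] if isinstance(src1, str) else src1
        b = out[src2] if isinstance(src2, str) else src2
        out[dst] = OPS[op](a, b)
    return out


def matching_triads(rom_readout, state_in, observed_out):
    """Physical indices whose simulated output equals the observed state."""
    return [phys for phys, triad in enumerate(rom_readout)
            if simulate(triad, state_in) == observed_out]


readout = [
    [("add", "t1", "rax", "rbx"), ("xor", "t2", "t1", 0xFF), ("or", "rax", "t2", 0)],
    [("sub", "t1", "rax", "rbx"), ("and", "t2", "t1", 0xFF), ("or", "rax", "t2", 0)],
]
state = {"rax": 5, "rbx": 3, "t1": 0, "t2": 0}
observed = simulate(readout[1], state)  # placeholder for a hardware measurement
```

A virtual address whose forced execution yields `observed` would then be paired with every physical index returned by `matching_triads`; a unique hit gives one more data point for the mapping.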
This is just 00:11:02.330 --> 00:11:07.680 from a hardware design standpoint it can make sense to reroute it a bit differently 00:11:09.330 --> 00:11:14.629 Also now that we can actually map a certain address to the microcode ROM 00:11:14.629 --> 00:11:19.093 readout and we know the addresses of different x86 instructions from our 00:11:19.093 --> 00:11:24.240 earlier experiments, we can look at the implementation of instructions. So let's 00:11:24.240 --> 00:11:29.130 start with a pretty simple one. Shift- Right-Double which essentially takes a 00:11:29.130 --> 00:11:33.250 register, shift it by a given amount and shifts in bits from another register. So 00:11:33.250 --> 00:11:38.180 of course you would expect a lot of shifts and rolls in its implementation and this 00:11:38.180 --> 00:11:45.338 is exactly what we're seeing here. You have two shift-right operands and you can 00:11:45.338 --> 00:11:50.830 see regmd6 and regmd4. These are place holders. The microcode engine can 00:11:50.830 --> 00:11:55.630 replace certain bit combinations with the registers that are used in the x86 00:11:55.630 --> 00:12:01.560 operation. For example this one would be replaced by ECX or EAX depending on what 00:12:01.560 --> 00:12:08.339 you wrote in x86. And at this point we can also already gather more information about 00:12:08.339 --> 00:12:13.601 microcodes than we previously knew because we know "OK, so this is source, this is 00:12:13.601 --> 00:12:18.529 also a source and this is a destination". But this source which indicates the shift 00:12:18.529 --> 00:12:22.750 amount, this one was previously unknown, because it is a high temporary microcode 00:12:22.750 --> 00:12:28.279 register and we found out that these usually implement specific different 00:12:28.279 --> 00:12:31.800 purpose. They are not - if you write to them, sometimes the CPU behaves 00:12:31.800 --> 00:12:35.890 erratically, sometimes it crashes, sometimes nothing happens. 
But in this 00:12:35.890 --> 00:12:40.300 case, this seems to be the shift count, and the shift count is given by a third 00:12:40.300 --> 00:12:45.279 operand in the instruction. So in this case, we already learned "OK, if you want 00:12:45.279 --> 00:12:51.380 to read the third operand of an instruction, we need to read t41". And 00:12:51.380 --> 00:12:56.236 this is how we went about recovering more and more information about microcode. The 00:12:56.236 --> 00:13:00.160 rest of the implementation is essentially concerned with implementing the rest of 00:13:00.160 --> 00:13:05.721 the semantics of the x86 instruction and updating the flags correctly. OK, so now 00:13:05.721 --> 00:13:11.980 let's look at a instruction set that is a bit more complicated. If you check out 00:13:11.980 --> 00:13:19.620 rdtsc. rdtsc returns a internal cycle counter in EDX and EAX, so the upper part 00:13:19.620 --> 00:13:25.520 ends up in EDX, lower part in EAX. So in the end we want to see writes to these 00:13:25.520 --> 00:13:30.760 registers, potentially with a shift somewhere in there. But somewhere the CPU 00:13:30.760 --> 00:13:37.570 needs to gather the cycle counter. So in the beginning we have two load-style 00:13:37.570 --> 00:13:41.410 operations. This one is a proper load which we identified and this one is 00:13:41.410 --> 00:13:48.569 unknown. But despite that we do not know the instruction, we know the target 00:13:48.569 --> 00:13:52.720 because the result of this instruction will end up in t9 and the result of this 00:13:52.720 --> 00:13:58.060 instruction will end up in t10, so we can follow the uses of these two registers. So 00:13:58.060 --> 00:14:04.450 for simplicity I'm going to start with t10 and t10, which we later found out, this is 00:14:04.450 --> 00:14:09.730 another register which essentially denotes a specific internal register. And if you 00:14:09.730 --> 00:14:15.450 play around with these bits you notice that this combination encodes cr4. 
The x86 00:14:15.450 --> 00:14:22.987 will just see cr4. You can also address cr1 and cr2. And if you look further, t10 00:14:22.987 --> 00:14:29.160 is then ANDed with this bit mask and if you look in the manual you find out that 00:14:29.160 --> 00:14:34.930 this bit in cr4 denotes the bit that determines whether rdtsc is 00:14:34.930 --> 00:14:40.019 available from user space or not. So this is the check if this instruction should be 00:14:40.019 --> 00:14:48.170 executed. So now let's just keep in mind that t9 holds some other loaded value from 00:14:48.170 --> 00:14:53.930 some other internal register and we will come back to this one a bit later. For 00:14:53.930 --> 00:14:58.848 now, let's follow execution. This triad is essentially a padding triad. It is a 00:14:58.848 --> 00:15:04.885 common pattern we see. So let's look at where this branch takes us. 00:15:05.895 --> 00:15:07.180 And this branch 00:15:07.180 --> 00:15:15.959 takes us to a conditional branch triad. And if you look a bit up, this AND 00:15:15.959 --> 00:15:21.740 instruction actually updated this flag. So this is a conditional branch that 00:15:21.740 --> 00:15:26.360 determines whether this check was successful or not. So it branches toward 00:15:26.360 --> 00:15:32.570 the error triad or the success triad. But here we already see the exit. We see a 00:15:32.570 --> 00:15:41.170 write to RDX or EDX in this case with a shift from t9 by 32 bits, which is exactly 00:15:41.170 --> 00:15:45.910 what you would expect in order to write the upper 32 bits of the 00:15:45.910 --> 00:15:50.829 time stamp counter to edx. And you have an unknown instruction, but we know, okay, we 00:15:50.829 --> 00:15:57.877 move something from t9 to eax, which is the lower 32 bits. But we're not done 00:15:57.877 --> 00:16:02.690 here, because we can still look at the error path that is taken if the access is 00:16:02.690 --> 00:16:09.210 denied.
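The rdtsc flow walked through above can be summarized in a short Python sketch. It models the architectural behaviour rather than the actual microcode: the cr4 bit is CR4.TSD (bit 2), and the privilege-level parameter `cpl` and the exception class are assumptions added so that the example is self-contained. Vector 13 is the x86 general protection fault.

```python
CR4_TSD = 1 << 2   # CR4.TSD: when set, rdtsc is restricted to ring 0
GP_FAULT = 13      # x86 general protection fault vector


class X86Fault(Exception):
    """Hypothetical stand-in for raising an x86 exception."""

    def __init__(self, vector):
        super().__init__(f"fault vector {vector}")
        self.vector = vector


def rdtsc(tsc, cr4, cpl):
    """Return (edx, eax) for a 64-bit time stamp counter value."""
    # Error path: access from user space while CR4.TSD is set.
    if cr4 & CR4_TSD and cpl != 0:
        raise X86Fault(GP_FAULT)
    edx = (tsc >> 32) & 0xFFFFFFFF  # upper half, shifted right by 32 bits
    eax = tsc & 0xFFFFFFFF          # lower half
    return edx, eax


edx, eax = rdtsc(0x1122334455667788, cr4=0, cpl=3)
```

The success path is just the shift and the truncation seen in the exit triad; everything else is the permission check.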
So if you scroll a bit down we can see a move of an immediate into a certain 00:16:09.210 --> 00:16:14.530 internal register. And this is immediate actually encodes a general protection 00:16:14.530 --> 00:16:21.790 fault interrupt code. D denotes to the exception handler that this was a general 00:16:21.790 --> 00:16:28.680 protection fault. And later this triad branches to this address, and if you look 00:16:28.680 --> 00:16:34.013 at the uses of this address we can find other immediates that also correspond on 00:16:34.013 --> 00:16:36.962 to x86 instructions. So now we learned 00:16:36.962 --> 00:16:39.947 how we can actually raise our own interrupts. We 00:16:39.947 --> 00:16:46.100 just need to load the code we want into the specific register and branch to this 00:16:46.100 --> 00:16:52.820 address. And now we learned a lot about how we can actually write microcode, but 00:16:52.820 --> 00:16:57.000 it's also interesting to see how certain instructions are implemented. So let's 00:16:57.000 --> 00:17:03.671 look at a pretty complicated one: wrmsr (Write MSR). wrmsr essentially writes some 00:17:03.671 --> 00:17:08.449 data it is given to a machine specific register. This machine specific register 00:17:08.449 --> 00:17:12.980 differs between CPUs, between vendors, sometimes between revisions. And these 00:17:12.980 --> 00:17:17.910 implement non-standard extensions or pretty complex features. For example, you 00:17:17.910 --> 00:17:23.949 trigger a microcode update by writing to a machine specific register. The register 00:17:23.949 --> 00:17:30.570 addresses you want to write to is given in ecx. And now we can see ecx is read and 00:17:30.570 --> 00:17:39.679 it is shifted by sixteen bits to t10. So again, we follow uses of t10 and we see 00:17:39.679 --> 00:17:46.070 it as XOR'd with a certain bitmask. And this bitmask is C000, which actually 00:17:46.070 --> 00:17:52.429 denotes a namespace of the model specific registers. 
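The namespace check at the start of wrmsr can be sketched in Python as follows. This is only an illustration of the shift-and-XOR pattern: the handler table and function names are hypothetical, and the 0xC001 namespace, which comes up a little further on in the talk, is included because it is compared the same way. The sample inputs 0xC0000080 (EFER) and 0xC0010020 (the AMD microcode patch loader MSR) are well-known register numbers.

```python
def msr_namespace(ecx):
    """Upper 16 bits of the MSR number, as held in t10 after the shift."""
    return (ecx >> 16) & 0xFFFF


def dispatch(ecx, handlers):
    """Pick a handler by XORing the namespace with each known value."""
    t10 = msr_namespace(ecx)
    for namespace, handler in handlers.items():
        if t10 ^ namespace == 0:  # XOR is zero exactly on a match
            return handler(ecx)
    # Placeholder: the real routine keeps dispatching on further bits.
    raise LookupError(f"no handler for namespace {t10:#06x}")


handlers = {
    0xC000: lambda ecx: ("namespace C000", ecx & 0xFFFF),
    0xC001: lambda ecx: ("namespace C001", ecx & 0xFFFF),
}
efer = dispatch(0xC0000080, handlers)
patch_loader = dispatch(0xC0010020, handlers)
```

Using XOR instead of a compare fits the flow in the talk: the operation sets the zero flag on a match and a conditional branch then selects the handler.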
In this case this should be an 00:17:52.429 --> 00:17:58.450 AMD-specific namespace. And, of course, this one again sets some flags, and you 00:17:58.450 --> 00:18:04.240 can see your conditional branch depending on these flags to what should be the 00:18:04.240 --> 00:18:06.235 handler for this namespace. 00:18:06.695 --> 00:18:10.770 Next one: We have another XOR that uses a different bit 00:18:10.770 --> 00:18:16.890 mask — in this case C001. C001 is the namespace where the microcode update 00:18:16.890 --> 00:18:25.050 routine is actually located in. So again, we branch to this handler. And if you just 00:18:25.050 --> 00:18:31.010 continue on, there are more operations on rcx, followed by more branches, and this 00:18:31.010 --> 00:18:35.790 continues until everything is dispatched to the correct handler. And this is how, 00:18:35.790 --> 00:18:40.340 internally, wrmsr is implemented, and also Read MSR is going to be implemented pretty 00:18:40.340 --> 00:18:43.640 similar, because it implements some kind of similar thing. 00:18:47.750 --> 00:18:49.190 OK, so now I showed you 00:18:49.190 --> 00:18:52.470 how we actually went ahead of reconstructing the knowledge we 00:18:52.470 --> 00:18:57.939 currently have. And now I'm going to show you what we can actually do with it. And 00:18:57.939 --> 00:19:02.440 for this I am going to quickly cover what applications we wrote in microcode. We 00:19:02.440 --> 00:19:04.940 wrote a simple configurable rdtsc precision. 00:19:04.940 --> 00:19:07.710 This means a certain bit mask is AND'd to 00:19:07.710 --> 00:19:11.890 the result of rdtsc, so you can reduce the accuracy of it, which can 00:19:11.890 --> 00:19:18.284 sometimes prevent timing attacks. We also implemented microcode-assisted address 00:19:18.284 --> 00:19:23.260 sanitizer, which I'll cover quickly in a second. We also have some basic microcode 00:19:23.260 --> 00:19:29.070 instruction set randomization. Some microcode-assisted instrumentation. 
What 00:19:29.070 --> 00:19:33.520 this means is, you can write a filter for your instrumentation in microcode itself. 00:19:33.520 --> 00:19:37.580 So instead of hooking an instruction, instead of debugging your code or 00:19:37.580 --> 00:19:42.160 emulating it, you can just say whenever the instruction is executed filter if this 00:19:42.160 --> 00:19:47.180 is relevant for me, and if it is, call my x86 handler — entirely in microcode, 00:19:47.180 --> 00:19:52.470 without changing the instruction in the RAM. We also implemented some basic 00:19:52.470 --> 00:20:00.000 authenticated microcode updates. The usual update mechanism is weak — that's how we 00:20:00.000 --> 00:20:05.430 got our foot in the door in the first place. So we improved upon it a bit. Also 00:20:05.430 --> 00:20:09.799 line:1 we found out that microcode actually has some enclave-like features because once 00:20:09.799 --> 00:20:13.730 we're executing in Microcode, your kernel can't interupt you, your hypervisor can't 00:20:13.730 --> 00:20:18.610 interrupt you and any state you want visible to the outside world. You actually 00:20:18.610 --> 00:20:22.840 need to write explicitly. So all these microcode internal registers are not 00:20:22.840 --> 00:20:26.600 accessible from the outside world. So any computation you perform in micro code 00:20:26.600 --> 00:20:30.360 cannot be interfered with. So you can implement a simple enclave on top of this 00:20:30.360 --> 00:20:37.039 one. So our hardware-assisted address sanitizer variant is based on the work by 00:20:37.039 --> 00:20:41.970 the original authors and address sanitizer is a software instrumentation that detects 00:20:41.970 --> 00:20:47.070 invalid memory access by using a shadow map shadow memory to just say which memory 00:20:47.070 --> 00:20:50.746 is valid to be read and written to. 
00:20:50.746 --> 00:20:53.840 The authors proposed hardware address sanitizer 00:20:53.840 --> 00:20:59.011 which is essentially doing the same checks but using a new instruction. And the 00:20:59.011 --> 00:21:03.940 instruction should raise a fault if an invalid access is detected. This algorithm 00:21:03.940 --> 00:21:07.670 they proposed - The details are not important. What is important is in 00:21:07.670 --> 00:21:12.080 essence: It's pretty simple. You load from a certain address, perform some operations 00:21:12.080 --> 00:21:18.816 on it and if the shadow check fails after these operations you just report a bug. 00:21:18.816 --> 00:21:24.910 Advantages of hardware address sanitizer are for example you get better performance 00:21:24.910 --> 00:21:29.170 out of it. Because you only have a single instruction maybe you can do some fancy 00:21:29.170 --> 00:21:34.450 tricks inside your CPU that are faster than using x86 instructions, you get more 00:21:34.450 --> 00:21:38.880 compact code and you have the possibility of one-time configuration which is a bit 00:21:38.880 --> 00:21:45.210 hard with software address sanitizer. We implemented our hardware address sanitizer 00:21:45.210 --> 00:21:49.270 variant by replacing the bound instruction. Bound is an old instruction that is no 00:21:49.270 --> 00:21:54.870 longer used by compilers because in fact it is slower to use bound instead of 00:21:54.870 --> 00:21:58.901 performing the checks with multiple x86 instructions. We changed the interface. 00:21:58.901 --> 00:22:04.090 The first argument is the register which holds the address you want to access. And 00:22:04.090 --> 00:22:07.835 the second argument holds the size you want this access to be. 00:22:07.835 --> 00:22:11.050 So, 1 byte, 2 bytes and so on. 00:22:11.050 --> 00:22:14.950 This instruction is a no-op if the check succeeds. So if there is no bug it 00:22:14.950 --> 00:22:19.980 just continues on like nothing happened.
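For readers who want to see what such a check looks like, here is a Python sketch of the standard AddressSanitizer shadow-memory test, taking the same two inputs as the modified instruction (address and access size). It is the textbook software check, assumed here for illustration, and not necessarily the exact sequence implemented in microcode. The shadow map is modelled as a dictionary, and accesses are assumed to be at most 8 bytes and not to cross an 8-byte granule.

```python
SHADOW_SCALE = 3  # one shadow byte describes 8 application bytes


def is_invalid_access(addr, size, shadow, shadow_offset=0):
    """Return True if the access should be reported as a bug.

    Shadow byte meaning (AddressSanitizer convention):
      0        all 8 bytes of the granule are addressable
      1..7     only the first k bytes are addressable
      negative the whole granule is poisoned
    """
    k = shadow.get((addr >> SHADOW_SCALE) + shadow_offset, 0)
    if k == 0:
        return False  # fast path: nothing to report
    # Last byte touched within the granule must lie below k.
    return (addr & 7) + size - 1 >= k


shadow = {0x100 >> 3: 4, 0x108 >> 3: -1}  # 4 valid bytes, then a poisoned granule
ok = is_invalid_access(0x100, 4, shadow)
overflow = is_invalid_access(0x100, 8, shadow)
```

When the function returns False the instruction would behave as a no-op; a True result corresponds to the point where the fault or handler is triggered.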
However if we detect an invalid access we 00:22:19.980 --> 00:22:25.359 can take a configurable action, we can for example just raise your normal page fault 00:22:25.359 --> 00:22:29.630 or we can raise a bound interrupt, which is a custom interrupt, that only denotes 00:22:29.630 --> 00:22:34.299 this one or we can branch to an x86 handler that either performs additional 00:22:34.299 --> 00:22:39.760 checking, for example whitelisting, or it generates a pretty error report for you. 00:22:41.340 --> 00:22:47.480 Most importantly this is a single instruction. We also do not dirty any x86 00:22:47.480 --> 00:22:52.690 registers because they are some intermediate results. You need to store 00:22:52.690 --> 00:22:56.360 these somewhere and this you usually do in the x86 registers. So you increase 00:22:56.360 --> 00:23:00.010 register pressure. Maybe you cause spilling. So overall your performance gets 00:23:00.010 --> 00:23:07.230 worse. We also found out that we are actually faster than doing the checking 00:23:07.230 --> 00:23:12.390 using x86 instructions. So just by moving the implementation from x86 level to 00:23:12.390 --> 00:23:16.805 microcode, which in some way is still kind of like software, we already improved the 00:23:16.805 --> 00:23:22.160 performance. Also on top of this you get better cache utilization because you have 00:23:22.160 --> 00:23:27.020 less instructions, there are less bytes in the cache, so we get fuller cache lines. 00:23:27.020 --> 00:23:31.630 And also it is really easy to tell which is testing code and which is your actual 00:23:31.630 --> 00:23:40.080 program code. Lastly I'm going to show you just a rough overview of our framework 00:23:40.080 --> 00:23:45.920 which we used during our development and which you can also find on GitHub. 
Early 00:23:45.920 --> 00:23:50.079 line:1 on we found out that we are probably going to need to test a lot of microcode 00:23:50.079 --> 00:23:55.640 line:1 updates, because in the beginning you just throw everything at the CPU and see how it 00:23:55.640 --> 00:24:01.400 line:1 behaves and we wanted to do this in parallel. So we developed a small custom 00:24:01.400 --> 00:24:07.180 OS called "Angry OS" and deployed it to mainboards. These mainboards are just old 00:24:07.180 --> 00:24:13.270 AMD mainboards. All these mainboards were hooked up via serial for communication and 00:24:13.270 --> 00:24:19.400 GPIO to a Raspberry Pi. With the GPIO you can reset, support power on, power down 00:24:19.400 --> 00:24:23.890 and just have remote control of this mainboard and then you can connect to that 00:24:23.890 --> 00:24:28.719 Raspberry Pi from anywhere on earth and just deploy and play around with it. 00:24:28.719 --> 00:24:30.640 This was the first version. 00:24:30.640 --> 00:24:34.490 In the beginning we didn't really know much about electronics 00:24:34.490 --> 00:24:38.520 so we used one Raspberry Pi per mainboard. And it turns out Raspberry Pis are more 00:24:38.520 --> 00:24:43.970 expensive than these old mainboards, but we improved upon this and now we're down 00:24:43.970 --> 00:24:48.007 to one Raspberry Pi for four / five setups. 00:24:48.007 --> 00:24:51.587 For example you only need 3 GPIO ports per 00:24:51.587 --> 00:24:57.358 mainboard. You connect each of these to optocouplers just to separate the voltage 00:24:57.358 --> 00:25:01.860 levels and then you connect one side of the optocoupler to the GPIO the other side 00:25:01.860 --> 00:25:05.909 to your reset pin, to your power pin and for input to know whether your board is up 00:25:05.909 --> 00:25:11.230 or down you connect the power LED. And that way you can save a lot of space, a 00:25:11.230 --> 00:25:17.205 lot of money. 
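The three-pin wiring described above can be sketched in Python as follows. To stay runnable without hardware, the pin access is injected as two callables instead of using a specific GPIO library; the class names, pin numbers and press durations are assumptions (a long press of the power button is the usual way to force an ATX board off), and the optocoupler is taken to simply pass the GPIO level through to the header pins.

```python
import time
from dataclasses import dataclass


@dataclass
class Board:
    reset_pin: int  # output, wired via optocoupler to the reset header
    power_pin: int  # output, wired via optocoupler to the power header
    led_pin: int    # input, wired to the power LED


class BoardControl:
    def __init__(self, write_pin, read_pin, sleep=time.sleep):
        self.write_pin = write_pin  # callable(pin, level)
        self.read_pin = read_pin    # callable(pin) -> bool
        self.sleep = sleep

    def _press(self, pin, seconds):
        """Emulate pressing a front-panel button for `seconds`."""
        self.write_pin(pin, True)
        self.sleep(seconds)
        self.write_pin(pin, False)

    def is_up(self, board):
        return bool(self.read_pin(board.led_pin))

    def reset(self, board):
        self._press(board.reset_pin, 0.2)

    def power_on(self, board):
        if not self.is_up(board):
            self._press(board.power_pin, 0.2)

    def power_off(self, board):
        if self.is_up(board):
            self._press(board.power_pin, 5.0)  # assumed forced power-off


# Dry run with fake pins instead of real GPIO.
events, leds = [], {3: False}
control = BoardControl(lambda pin, level: events.append((pin, level)),
                       lambda pin: leds[pin], sleep=lambda s: None)
board = Board(reset_pin=1, power_pin=2, led_pin=3)
control.power_on(board)
```

On a real Raspberry Pi the two callables would wrap whatever GPIO library is in use, and one `Board` entry per mainboard accounts for the three pins each setup needs.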
And also if you're really constrained you can just remove the power 00:25:17.205 --> 00:25:23.530 LED sensing because usually you know it is in the state your setup is in. As I 00:25:23.530 --> 00:25:28.230 already said we wrote our custom operating system and it is intentionally really 00:25:28.230 --> 00:25:32.659 really minimal because the major feature we wanted is control over every 00:25:32.659 --> 00:25:36.740 instructions that's going to be executed from a certain point on, because we're 00:25:36.740 --> 00:25:40.780 playing around with instruction encoding and if we execute an instructions that we 00:25:40.780 --> 00:25:45.530 did not intend we might crash the CPU, we might go into an invalid state and we do 00:25:45.530 --> 00:25:50.850 not even know which instruction caused it. And Angry OS essentially only listens on 00:25:50.850 --> 00:26:00.150 the serial port for something to do. What it can do is apply an update. These 00:26:00.150 --> 00:26:04.820 updates are just microcode updates. They are streamed via serial. We can also 00:26:04.820 --> 00:26:10.039 stream x86 code which is then run by Angry OS and this is just so that we do not need 00:26:10.039 --> 00:26:14.409 to reflash the USB stick every time we want to update our testing code and the 00:26:14.409 --> 00:26:19.280 result, all the errors are reported back to the Raspberry Pi and thus they are 00:26:19.280 --> 00:26:26.852 forwarded to us. The framework we use most importantly has the microcode assembler 00:26:26.852 --> 00:26:30.713 and a pretty verbose disassembler. This disassembler generates the output I showed 00:26:30.713 --> 00:26:36.919 you earlier and using this you can just quickly write your own microcode. We also 00:26:36.919 --> 00:26:42.245 included an x86 assembler because we wanted to rapidly test different x86 00:26:42.245 --> 00:26:47.730 testing codes. 
Using this framework we were able to disassemble the existing 00:26:47.730 --> 00:26:53.500 updates and we also used it to disassemble our ROM after we reordered it and also 00:26:53.500 --> 00:27:01.169 during the process when we fed it to our emulator. And we can also create the 00:27:01.169 --> 00:27:07.909 proper binary files that can be loaded by the Linux kernel driver. We modified the 00:27:07.909 --> 00:27:12.777 stock one to just load any update you give it without checking if it's the correct 00:27:12.777 --> 00:27:20.060 CPU ID and all these things just for testing purposes. It's also available. And 00:27:20.060 --> 00:27:25.740 also of course the framework can control Angry OS to make your testing easier. And 00:27:25.740 --> 00:27:29.650 we implemented a pretty basic remote execution wrapper, so you can work on a 00:27:29.650 --> 00:27:33.389 remote Raspberry Pi as if you were using it locally. 00:27:34.809 --> 00:27:36.799 And this brings me to the end 00:27:36.799 --> 00:27:40.800 of the talk. And in conclusion we can say reversing the ROM opened up a lot of new 00:27:40.800 --> 00:27:44.809 possibilities. We learned a lot about how microcode works. We learned about how to 00:27:44.809 --> 00:27:49.720 actually use it properly instead of just inferring from a really small dataset, 00:27:49.720 --> 00:27:55.060 that we have from the updates, or from the random bits we send to the CPU and 00:27:55.060 --> 00:27:59.530 observe what happened. But there's a lot left to do. So if you really want to hack 00:27:59.530 --> 00:28:04.089 on it, just get in contact, we are happy to share our findings with you. And as I 00:28:04.089 --> 00:28:09.009 said the framework, Angry OS, example programs that we implemented, and some 00:28:09.009 --> 00:28:13.850 other stuff like the wiring are available on GitHub. So that's that. And we are 00:28:13.850 --> 00:28:16.809 happy to answer any questions you might have.
00:28:16.809 --> 00:28:22.234 applause 00:28:24.910 --> 00:28:28.438 Herald Angel: Thank you very much. So we 00:28:28.438 --> 00:28:34.260 have 10 minutes for questions, please line up at the microphones. We start with this 00:28:34.260 --> 00:28:39.220 one: microphone number 2. M2: Hi. Thanks for a nice talk. A few 00:28:39.220 --> 00:28:42.780 questions about your hardware address sanitizer. 00:28:42.780 --> 00:28:49.830 Benjamin: Mhm M2: As I understand you don't need the 00:28:49.830 --> 00:28:56.010 source code instrumentation because the microcode is responsible for checking the 00:28:56.010 --> 00:29:02.929 shadow memory, right? Benjamin: No... The original hardware 00:29:02.929 --> 00:29:07.950 sanitizer implementation is also based on a compiler extension, that inserts a new 00:29:07.950 --> 00:29:12.200 instruction because it doesn't exist usually. And it also inserts a bootstrap 00:29:12.200 --> 00:29:18.049 code that initializes your shadow map and also instruments your allocators to update 00:29:18.049 --> 00:29:23.020 the shadow map during runtime and we essentially need the same components, but 00:29:23.020 --> 00:29:26.850 we do not need the software address sanitizer component that essentially 00:29:26.850 --> 00:29:33.740 inserts 10 or 20 x86 instructions before every memory access. So yes we still need 00:29:33.740 --> 00:29:37.647 a compile time component and we are still source code based in a sense. 00:29:39.388 --> 00:29:45.600 Herald: And, so.. M2: And I didn't see, maybe I missed the 00:29:45.600 --> 00:29:51.299 numbers. How much faster is it than this initial version? 00:29:51.299 --> 00:29:56.419 Benjamin: You mean the initial hardware sanitizer version or the software address 00:29:56.419 --> 00:29:59.900 sanitizer? M2: I mean let's say custom kernel address 00:29:59.900 --> 00:30:05.180 sanitizer for Linux kernel which is the usual one and your approach.
00:30:05.180 --> 00:30:10.270 Benjamin: We only performed a micro benchmark on Angry OS and we essentially 00:30:10.270 --> 00:30:16.059 took the instrumentation as emitted by the compiler for some memory access which is 00:30:16.059 --> 00:30:20.590 your standard software address sanitizer and compared it to our version using only 00:30:20.590 --> 00:30:24.640 the modified bound instruction. So I really can't talk about how it compares to 00:30:24.640 --> 00:30:28.820 KASAN or some other real world implementation, because we only have the 00:30:28.820 --> 00:30:34.069 prototype and the basic instrumentation. M2: Thank you very much. 00:30:34.069 --> 00:30:36.490 Herald Angel: OK. Microphone number 4 please. 00:30:36.490 --> 00:30:51.145 M4: Hey thanks for the talk and did you find any weird microcode 00:30:51.145 --> 00:31:00.529 implementations? I don't mean security wise, just like you wouldn't have expected to 00:31:00.529 --> 00:31:07.330 see it be implemented that way. 00:31:09.040 --> 00:31:11.700 Benjamin: The problem is there's a lot of 00:31:11.700 --> 00:31:20.270 microcode to begin with. You have f000 triads. Each of which has 3 op-codes. So 00:31:20.270 --> 00:31:25.003 you have a lot of ground to cover and also we have read-out errors. Sometimes you are 00:31:25.003 --> 00:31:29.169 seeing bit flips, which kind of slows you down because you then need to always 00:31:29.169 --> 00:31:32.820 consider: OK, maybe this register is something else, maybe this address is 00:31:32.820 --> 00:31:37.420 wrong. And also sometimes you have a dust particle that kind of knocks out an 00:31:37.420 --> 00:31:42.550 entire region. So we only looked at the components we were pretty sure that we 00:31:42.550 --> 00:31:46.520 recovered correctly, and we only looked at a really tiny subset compared to all of 00:31:46.520 --> 00:31:52.940 the microcode ROM. It's just not feasible to go through it and look at 00:31:52.940 --> 00:31:57.330 everything.
So no we didn't find anything funny but we also wouldn't know what funny 00:31:57.330 --> 00:32:00.790 looks like because we don't know what the official spec for microcode is. 00:32:01.180 --> 00:32:03.990 M4: Thanks. Herald Angel: Interesting. We have one 00:32:04.034 --> 00:32:05.809 question from the Internet, from the 00:32:05.809 --> 00:32:09.792 Signal Angel please. Signal Angel: Yes. Which AMD CPU 00:32:09.792 --> 00:32:15.510 generations does this apply to? Benjamin: Yeah this is still based on the 00:32:15.510 --> 00:32:21.289 work of our first talk and this only works on pretty old ones: K8, K10. So 00:32:21.289 --> 00:32:26.940 CPUs produced until 2013. Yeah this was the last year AMD produced anything like 00:32:26.940 --> 00:32:32.520 that. Newer ones use some public key based cryptography from what we can tell and we 00:32:32.520 --> 00:32:36.559 haven't yet managed to break it. Same goes for Intel, they seem to be using public 00:32:36.559 --> 00:32:39.919 key cryptography and we haven't gotten a foot in the door yet. 00:32:40.989 --> 00:32:44.789 Herald Angel: Thank you. We go one around. On microphone number 3 please. 00:32:44.789 --> 00:32:51.290 M3: Yeah. Thank you. I would like to know how complex could the microcode programs 00:32:51.290 --> 00:32:59.159 be, that you could write. So what's the complexity of new operations you could 00:32:59.159 --> 00:33:03.300 implement? Benjamin: The only limiting factor is the 00:33:03.300 --> 00:33:07.923 size of your microcode update RAM. But this one is really really limited. 00:33:07.923 --> 00:33:12.679 For example on K8, where we performed the majority of our experiments, we are 00:33:12.679 --> 00:33:19.050 limited to 32 triads, which comes down to ninety-six instructions and you also 00:33:19.050 --> 00:33:22.440 have some constraints on these instructions, for example the next triad 00:33:22.440 --> 00:33:27.809 will always be executed no matter what.
Some operations can only go at the second 00:33:27.809 --> 00:33:33.859 slot. Some can only go on another slot, so it's really really hard. And you're also 00:33:33.859 --> 00:33:38.930 limited from our knowledge to loading 16 bit immediates instead of 32 bit or even 00:33:38.930 --> 00:33:44.470 64 bit immediates. So your whole program grows really fast if you're trying to do 00:33:44.470 --> 00:33:49.400 something complex. For example our authenticated microcode update mechanism 00:33:49.400 --> 00:33:54.440 is the most complex one we wrote; it nearly fills out the RAM and we used TEA – Tiny 00:33:54.440 --> 00:33:58.700 Encryption Algorithm – because that was the only one we managed to fit, mostly due 00:33:58.700 --> 00:34:04.510 to S-boxes and other constants we would need to load. So it's really small. 00:34:04.510 --> 00:34:08.539 Herald Angel: Thank you. Microphone number 1. 00:34:08.539 --> 00:34:14.709 M1: So you said the microcode is used for instruction decoding and it needs to emit 00:34:14.709 --> 00:34:19.429 the micro-ops to the scheduler and micro queue in some way. Did you find out how 00:34:19.429 --> 00:34:27.519 that works? Benjamin: In essence we are not actually 00:34:27.519 --> 00:34:33.539 executing code inside the microcode engine. From what we understand, the 00:34:33.539 --> 00:34:38.569 microcode engine is just some kind of a software based recipe, that describes how 00:34:38.569 --> 00:34:43.479 to decode an instruction, so you don't actually get execution, you just commit 00:34:43.479 --> 00:34:47.269 instructions into the pipelines, that do what you want. And because we have some 00:34:47.269 --> 00:34:51.269 control flow possibility, that is actually inside the microcode engine, because you 00:34:51.269 --> 00:34:55.268 can branch to different addresses, you can conditionally branch and loop.
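[Editor's sketch] To illustrate why TEA, mentioned in the answer to the previous question, suits a space-constrained target, here is the textbook Tiny Encryption Algorithm in Python. It is the publicly documented cipher, not the speakers' microcode implementation; the function names are chosen for this example. Note that it needs a single 32-bit constant and only additions, shifts and XORs, with no S-box tables to load.

```python
MASK = 0xFFFFFFFF    # keep every intermediate value within 32 bits
DELTA = 0x9E3779B9   # TEA's only constant (derived from the golden ratio)
ROUNDS = 32          # 32 cycles of two Feistel half-rounds each


def tea_encrypt(v0, v1, key):
    """Encrypt one 64-bit block (two 32-bit words) with a 128-bit key (four words)."""
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(ROUNDS):
        total = (total + DELTA) & MASK
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK
    return v0, v1


def tea_decrypt(v0, v1, key):
    """Invert tea_encrypt by running the half-rounds in reverse order."""
    k0, k1, k2, k3 = key
    total = (DELTA * ROUNDS) & MASK
    for _ in range(ROUNDS):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK
        total = (total - DELTA) & MASK
    return v0, v1


if __name__ == "__main__":
    key = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)  # arbitrary demo key
    block = (0xDEADBEEF, 0x0BADF00D)                        # arbitrary demo block
    cipher = tea_encrypt(*block, key)
    assert tea_decrypt(*cipher, key) == block
    print([hex(w) for w in cipher])
```

The 16-bit immediate limit mentioned above would still make even the one constant and the key words cost extra operations to load, which is consistent with the remark that the program nearly fills the update RAM.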
You kind of 00:34:55.268 --> 00:34:59.089 get an execution, but in essence you just commit stuff in the pipeline and the CPU 00:34:59.089 --> 00:35:01.440 does what you tell it to. 00:35:04.240 --> 00:35:07.161 Herald Angel: One more question. Microphone number 2, please. 00:35:07.161 --> 00:35:11.927 M2: How did you take the picture of the internal CPU? Did you open it? 00:35:11.927 --> 00:35:14.969 Benjamin: Yeah. We worked together with 00:35:14.969 --> 00:35:19.680 Chris. He's our hardware guy. He has access to his equipment to delayer it and 00:35:19.680 --> 00:35:24.289 to take high resolution optical shots and he also takes shots with a scanning 00:35:24.289 --> 00:35:29.279 electron microscope. So I think about five or six CPUs were harmed in the making of 00:35:29.279 --> 00:35:30.357 this paper. 00:35:33.810 --> 00:35:37.815 Herald Angel: So we have one more last question. Microphone number 2 please. 00:35:39.248 --> 00:35:41.390 M2: Are you aware of research done by 00:35:41.390 --> 00:35:49.400 Christopher Domas, where he mapped out the instruction set for x86 processors? 00:35:49.400 --> 00:35:57.119 Benjamin: You mean sandsifter? We actually talked with him and yeah we are 00:35:57.119 --> 00:36:02.910 aware that there's a map essentially of the instruction set and also maybe you can 00:36:02.910 --> 00:36:07.275 combine it, because in the beginning we reverse engineered where certain x86 00:36:07.275 --> 00:36:11.335 instructions are implemented in microcode. So if you plug these two together you kind 00:36:11.335 --> 00:36:15.170 of map out the whole microcode ROM at the same time that you map out the whole 00:36:15.170 --> 00:36:18.989 instruction set. However there are some components of the microcode ROM that are 00:36:18.989 --> 00:36:23.470 most likely not triggered by instructions. For example it seems like power management 00:36:23.470 --> 00:36:27.368 or everything that is behind a write MSR [wrmsr] or read MSR [rdmsr].
wrmsr is a 00:36:27.368 --> 00:36:31.249 single instruction, but depending on the arguments you give it, it just branches to 00:36:31.249 --> 00:36:36.442 totally different triads and the microcode update itself is implemented in microcode. And 00:36:36.442 --> 00:36:40.190 this one is a huge chunk you wouldn't even find without brute forcing all 00:36:40.190 --> 00:36:44.159 combinations for all instructions which is not really feasible. 00:36:46.483 --> 00:36:51.279 Herald Angel: Thank you. Thank you Benjamin. 00:36:51.279 --> 00:36:57.210 applause 00:36:57.210 --> 00:37:01.811 35c3 postroll music 00:37:01.811 --> 00:37:21.000 subtitles created by c3subtitles.de in the years 2019-2020. Join, and help us!