0:00:03.129,0:00:07.360 35C3 preroll music 0:00:18.780,0:00:23.869 Herald: So the next talk Benjamin Kollenda[br]and Philipp Koppe - they will refresh our 0:00:23.869,0:00:30.529 memories because they already had a talk[br]on 34C3 where they talked about the micro 0:00:30.529,0:00:37.580 code ROM and today they're gonna give us[br]more insights on how micro code works. And 0:00:37.580,0:00:44.320 more details on the ROM itself. Benjamin[br]is a PhD student and has a focus on 0:00:44.320,0:00:51.280 software attacks and defenses and together[br]with Phillip they will now abuse AMD 0:00:51.280,0:00:55.190 microcode for fun and security. Please[br]enjoy. 0:00:55.190,0:00:58.730 Applause 0:01:01.320,0:01:06.260 Benjamin: Thank you. So as mentioned we[br]were able to reverse engineer the AMD 0:01:06.260,0:01:11.599 microcode and the AMD microcode ROM and[br]I'm going to talk about our journey. What 0:01:11.599,0:01:16.369 we learned on the way and how we did it.[br]So this joint work with my colleagues at 0:01:16.369,0:01:20.799 Ruhr Universtat Bochum and a quick outline[br]how are we going to do it. We're going to 0:01:20.799,0:01:25.380 start with a quick crash course on micro[br]architectural basics and what microcode 0:01:25.380,0:01:28.350 actually is. Then I talk about how we[br]reconstructed the 0:01:28.350,0:01:30.330 microcode ROM and what we learned 0:01:30.330,0:01:35.389 along the way. Then I quickly give some[br]examples of the applications we 0:01:35.389,0:01:41.430 implemented with the knowledge we gained[br]from second step. And lastly I talk about 0:01:41.430,0:01:47.649 a framework we used. How it works and what[br]we can do with it. And also this framework 0:01:47.649,0:01:51.899 is available on GitHub along with some[br]other tools so you're free to continue our 0:01:51.899,0:01:57.189 work. OK. So when I'm talking about[br]microcode you can think of it essentially 0:01:57.189,0:02:02.331 as a firmware for your processor. 
It[br]handles multiple purposes for example 0:02:02.331,0:02:06.440 you can use it to fix CPU bugs that you[br]have in silicon and you want to fix later 0:02:06.440,0:02:11.971 in the design phase. It is used for[br]instruction decoding - I cover this one a 0:02:11.971,0:02:17.970 bit more. It is also used for exception[br]handling. For example, if an exception or 0:02:17.970,0:02:22.200 interrupt is raised, microcode has a first[br]chance of modifying this interrupt 0:02:22.200,0:02:27.110 ignoring it or just passing it along to[br]the operating system. It's also used for 0:02:27.110,0:02:31.790 power management and some other complex[br]features like Intel SGX. And most 0:02:31.790,0:02:37.318 importantly for us microcode is updatable.[br]This used to patch errors in the field. 0:02:37.318,0:02:40.975 Everyone remembers Spectre / Meltdown[br]patches and there's 0:02:40.975,0:02:44.210 a microcode update. So your 0:02:44.210,0:02:50.830 x86 CPU takes multiple steps to execute an[br]instruction. The first step is decoding 0:02:50.830,0:02:55.022 a x86 instruction into multiple smaller[br]micro ops. 0:02:55.022,0:02:57.150 These are then scheduled into the pipeline 0:02:57.150,0:03:01.632 From there, they are dispatched to[br]the different functional units 0:03:01.632,0:03:03.532 like your ALU / AGU 0:03:03.532,0:03:06.392 multiplication division units 0:03:06.392,0:03:08.355 For our purposes the decode step is the 0:03:08.355,0:03:12.190 most interesting one. In the decode step[br]you have a instruction buffer that feeds 0:03:12.190,0:03:17.030 instructions to some decoders. You have[br]short decoders that handle really simple 0:03:17.030,0:03:21.100 instructions. There are long decoders that[br]can handle some more advance instructions. 0:03:21.100,0:03:25.260 And finally, the vector decoder. 
The[br]vector decoder handles the most complex 0:03:25.260,0:03:29.690 instructions with the help of microcode.[br]So the microcode engine is essentially the 0:03:29.690,0:03:31.247 vector decoder. 0:03:32.458,0:03:36.570 The microcode engine in essence[br]is comprised of a microcode 0:03:36.570,0:03:40.770 ROM that stores the instructions for the[br]microcode engine. Think of it as your 0:03:40.770,0:03:48.190 standard instructions. Then there is also[br]a writeable memory, the microcode RAM. This 0:03:48.190,0:03:52.520 is where the microcode updates end up when[br]you apply microcode updates. And of course 0:03:52.520,0:03:57.310 around the storage there is a whole lot of[br]things that make it actually run. For this 0:03:57.310,0:04:00.860 talk, you only need to know what[br]Match Registers are. Match Registers are 0:04:00.860,0:04:05.650 essentially breakpoint registers. So if we[br]write an address from inside the microcode 0:04:05.650,0:04:10.670 ROM into a Match Register, whenever this[br]address is fetched, execution control is 0:04:10.670,0:04:17.570 transferred to the microcode RAM so our[br]patch gets executed. And the microcode 0:04:17.570,0:04:23.060 updates are usually loaded by the BIOS or[br]by the kernel. Linux has an update driver, 0:04:23.060,0:04:28.340 sometimes the BIOS updates it with a[br]pre-installed version and they have a 0:04:28.340,0:04:32.120 pretty simple structure: a partially[br]documented header, followed by the 0:04:32.120,0:04:37.730 actual microcode that is loaded inside the[br]CPU. And so microcode is organized in 0:04:37.730,0:04:42.650 something called triads. Each triad has[br]three operations, essentially x86 0:04:42.650,0:04:48.230 instructions, but with some differences.[br]And lastly, you have a sequence word. The 0:04:48.230,0:04:52.025 sequence word indicates which microcode[br]instructions should be executed next. 
We 0:04:52.025,0:04:57.950 have options of executing just the next[br]triad, executing another one by branching 0:04:57.950,0:05:01.936 to it, or just saying OK, I'm done with[br]decoding this instruction continue with 0:05:01.936,0:05:07.490 x86 code. These updates are protected by[br]some weak authentication which we were 0:05:07.490,0:05:13.260 able to break so we can create our own. We[br]can analyze existing ones and we can apply 0:05:13.260,0:05:20.620 these to your standard laptop and desktop.[br]However there can only ever be one update 0:05:20.620,0:05:26.534 loaded at the time and when you reboot[br]your machine this update will be gone. 0:05:28.490,0:05:32.990 Also for the talk we are going to look at[br]some microcode and we will present this 0:05:32.990,0:05:38.150 microcode using a register transfer[br]language. It is heavily based on x86. I'm 0:05:38.150,0:05:43.290 just going to cover the differences[br]between these two. Most importantly the 0:05:43.290,0:05:48.650 microcode can have three operands for an[br]instruction in comparison to x86 which 0:05:48.650,0:05:53.640 usually only has two. So you can specify a[br]destination and two source operands. 0:05:55.618,0:05:56.446 Also, 0:05:57.210,0:06:02.240 microcode has some certain bit flags that[br]need to be set and these we do we see with 0:06:02.240,0:06:07.449 these annotations for example ".C" means[br]says instruction also updates a carry flag 0:06:07.449,0:06:14.050 based on the result. Then you have the[br]instruction "jcc" which is a conditional 0:06:14.050,0:06:19.570 branch and the first operand denotes the[br]condition up on which this branch is 0:06:19.570,0:06:24.100 taken. In this case branch if the carry[br]flag is one and [the] second operand 0:06:24.100,0:06:30.300 indicates the offset to add to the[br]instruction pointer. Then we also have 0:06:30.300,0:06:35.760 some sequence word annotations: "next",[br]"complete", and "branch". 
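To make the pieces described so far concrete, here is a minimal sketch of a toy microcode engine: triads of up to three operations, a sequence word with "next", "branch" and "complete", a ".C" annotation that updates the carry flag, a "jcc" conditional branch, and match registers that redirect a ROM fetch into the RAM. Everything in it (the class names, the tuple-free `Op` layout, RAM addresses living in their own range, the jcc offset being relative to the current triad) is an assumption chosen for illustration and does not reflect the real encoding.

```python
from dataclasses import dataclass, field
from enum import Enum

MASK32 = 0xFFFFFFFF

class Seq(Enum):
    NEXT = "next"          # fall through to the following triad
    BRANCH = "branch"      # continue at an explicit triad address
    COMPLETE = "complete"  # done decoding, resume x86 execution

@dataclass
class Op:
    name: str              # "add", "add.C" or "jcc" in this toy model
    dst: str = ""
    src1: object = 0       # register name (str) or immediate (int)
    src2: object = 0

@dataclass
class Triad:
    ops: list              # up to three Op entries
    seq: Seq = Seq.NEXT
    target: int = 0        # only used when seq is BRANCH

@dataclass
class Engine:
    rom: dict                                   # address -> Triad, fixed
    ram: dict = field(default_factory=dict)     # address -> Triad, from an update
    match: dict = field(default_factory=dict)   # ROM address -> RAM address
    regs: dict = field(default_factory=dict)
    carry: int = 0

    def value(self, operand):
        return self.regs.get(operand, 0) if isinstance(operand, str) else operand

    def fetch(self, addr):
        # A match register acts like a breakpoint: redirect into the RAM.
        if addr in self.match:
            addr = self.match[addr]
        triad = self.ram[addr] if addr in self.ram else self.rom[addr]
        return addr, triad

    def run(self, addr):
        while True:
            addr, triad = self.fetch(addr)
            jump = None
            for op in triad.ops:
                if op.name in ("add", "add.C"):
                    total = self.value(op.src1) + self.value(op.src2)
                    self.regs[op.dst] = total & MASK32
                    if op.name.endswith(".C"):   # only ".C" updates the carry flag
                        self.carry = total >> 32
                elif op.name == "jcc" and op.src1 == "CF" and self.carry:
                    jump = addr + op.src2        # assumed: offset from this triad
            if jump is not None:
                addr = jump
            elif triad.seq is Seq.COMPLETE:
                return self.regs
            elif triad.seq is Seq.BRANCH:
                addr = triad.target
            else:
                addr += 1

# Hypothetical three-triad program: branch on carry after a 32-bit add.
rom = {
    0x10: Triad([Op("add.C", "t1", "eax", "ebx"), Op("jcc", "", "CF", 2)]),
    0x11: Triad([Op("add", "ecx", 0, 1)], Seq.COMPLETE),   # no-carry path
    0x12: Triad([Op("add", "ecx", 0, 2)], Seq.COMPLETE),   # carry path
}
```

Loading a patch then amounts to placing a triad in `ram` and pointing a match register at it, for example `Engine(rom, ram={0x1000: ...}, match={0x10: 0x1000})`, after which a fetch of ROM address 0x10 runs the patched triad instead.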
Also it should 0:06:35.760,0:06:39.958 be noted that the internal microcode[br]architecture is a load-store architecture. 0:06:39.958,0:06:45.350 You can't use memory operands in other[br]instructions like you can on x86; you 0:06:45.350,0:06:48.310 always need to load and store memory[br]explicitly. 0:06:49.190,0:06:51.710 Now we are going to talk about 0:06:51.710,0:06:58.710 how we managed to recover the microcode[br]ROM. The microcode ROM is baked into your 0:06:58.710,0:07:06.860 CPU, you can't change it anymore. It is[br]defined in the silicon during the 0:07:06.860,0:07:12.930 fabrication process and in this picture[br]you can see a die shot taken with an 0:07:12.930,0:07:16.840 electron microscope and this is one of[br]three regions that contain the bits for 0:07:16.840,0:07:23.240 the microcode operations. And if you zoom[br]in a bit more, each of these regions 0:07:23.240,0:07:30.050 consists of four arrays and these are[br]further subdivided into blocks. Really 0:07:30.050,0:07:34.660 interesting is "Array 2" which is a bit[br]smaller than the other ones but it has 0:07:34.660,0:07:42.160 some structures above it which are of a[br]different visual layout. This is SRAM 0:07:42.160,0:07:47.050 which stores the microcode update. So this[br]is one-time reprogrammable memory that is 0:07:47.050,0:07:53.860 still pretty fast. So the microcode RAM is[br]located right next to the microcode ROM 0:07:53.860,0:07:57.645 which also makes sense from a design[br]standpoint. 0:08:00.445,0:08:02.010 Just an overview of how we 0:08:02.010,0:08:06.930 went ahead and how we went about it. We[br]started with pictures and then we used 0:08:06.930,0:08:11.456 some OCR-like process to transform them[br]into bitstrings which we can then further 0:08:11.456,0:08:17.169 process. These bitstrings were then[br]arranged into triads. 
We could already 0:08:17.169,0:08:22.050 gather that we got individual triades[br]right because there were data dependencies 0:08:22.050,0:08:27.550 all over the place, but between triads,[br]there were no or very few data 0:08:27.550,0:08:33.699 dependencies so the ordering of the[br]triades was still wrong and this was a 0:08:33.699,0:08:38.860 major problem when we went ahead and what[br]we had to reverse engineer and this is 0:08:38.860,0:08:43.870 mapping a certain physical address of a[br]triad that we gathered from the ROM 0:08:43.870,0:08:48.050 readout to a virtual address that is used[br]inside the microcode update or the 0:08:48.050,0:08:53.690 microcode ROM. But after reverse engineer[br]this, you can just do a linear sweep 0:08:53.690,0:08:59.020 disassembly of the microcode ROM and[br]arrive at human readable output. But this 0:08:59.020,0:09:04.870 recovery was a bit tricky because we[br]required physical virtual address pairs. 0:09:04.870,0:09:09.520 But gathering these is a bit harder[br]because we worked there through the 0:09:09.520,0:09:14.040 available updates, but we could only find[br]two pairs of them. These pairs were 0:09:14.040,0:09:18.520 actually easy to find because every update[br]replaces a certain triad inside your 0:09:18.520,0:09:24.580 microcode ROM and this triad is usually[br]also placed in the microcode update. So by 0:09:24.580,0:09:31.260 matching the address this update replaces[br]with a microcode ROM readout. You can just 0:09:31.260,0:09:38.000 get your two data points. But we had to[br]get more data points so we generated these 0:09:38.000,0:09:42.630 mappings by matching semantics of triads[br]in the microcode ROM readout and the 0:09:42.630,0:09:47.779 semantics when we force execution of a[br]certain microcode address. And gathering 0:09:47.779,0:09:52.330 the semantics of the read-out microcode,[br]we implemented a simple microcode 0:09:52.330,0:09:58.820 simulator. 
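The semantic matching described here can be sketched roughly as follows: simulate every triad from the ROM readout on a few input states and keep the physical positions whose simulated outputs equal what the hardware produced when execution was forced to one virtual address. The tuple layout of a triad, the three supported operations and the function names are assumptions made for this illustration, not the actual simulator.

```python
def simulate(triad, state):
    """Toy triad simulator: apply (op, dst, src1, src2) tuples to a register dict."""
    out = dict(state)
    for op, dst, a, b in triad:
        x, y = out.get(a, 0), out.get(b, 0)
        if op == "add":
            out[dst] = (x + y) & 0xFFFFFFFF
        elif op == "xor":
            out[dst] = x ^ y
        elif op == "and":
            out[dst] = x & y
    return out

def match_candidates(readout, inputs, observed):
    """Return physical indices whose simulated outputs equal the observed ones.

    readout:  physical index -> triad decoded from the ROM image
    inputs:   input register states used for the hardware experiment
    observed: output states seen when forcing execution of one virtual address
    """
    hits = []
    for phys, triad in readout.items():
        if all(simulate(triad, s) == o for s, o in zip(inputs, observed)):
            hits.append(phys)
    return hits

# Hypothetical readout with three candidate triads.
readout = {
    0: [("add", "eax", "eax", "ebx")],
    1: [("xor", "eax", "eax", "ebx")],
    2: [("and", "eax", "eax", "ebx")],
}
inputs = [{"eax": 3, "ebx": 5}, {"eax": 1, "ebx": 1}]
observed = [{"eax": 6, "ebx": 5}, {"eax": 0, "ebx": 1}]
candidates = match_candidates(readout, inputs, observed)
```

A single remaining candidate yields one physical-virtual address pair; several remaining candidates mean more input states are needed to tell them apart.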
Essentially it works on triad[br]level, so you give it an input state and a 0:09:58.820,0:10:03.430 triad and it calculates the output state[br]of it. Input and output state are 0:10:03.430,0:10:08.460 comprised out of the x86-state which is[br]your standard registers and also the 0:10:08.460,0:10:12.320 internal microcode registers. There are[br]multiple temporary registers that get 0:10:12.320,0:10:18.350 reset for every new x86 instruction that[br]is executed, but they can also be modified 0:10:18.350,0:10:24.130 by microcode of course. Our emulator[br]supports all known arithmetic operations 0:10:24.130,0:10:29.230 and we have a white-list of operations[br]that do not form or produce any observable 0:10:29.230,0:10:32.950 change in state just so that we could[br]process more triades and give them more 0:10:32.950,0:10:41.310 data points. In total we gathered 54[br]additional data-address pairs which turned 0:10:41.310,0:10:46.649 out to be enough to recover the whole[br]mapping. This mapping, essentially you 0:10:46.649,0:10:50.820 have the four different arrays that map to[br]individual blocks and these blocks in 0:10:50.820,0:10:56.750 these arrays or then again permuted a bit[br]and then the triads inside these blocks 0:10:56.750,0:11:02.330 have some table-based permutations. So[br]this is not an obfuscation. This is just 0:11:02.330,0:11:07.680 from a hardware design standpoint it can[br]make sense to reroute it a bit differently 0:11:09.330,0:11:14.629 Also now that we can actually[br]map a certain address to the microcode ROM 0:11:14.629,0:11:19.093 readout and we know the addresses of[br]different x86 instructions from our 0:11:19.093,0:11:24.240 earlier experiments, we can look at the[br]implementation of instructions. So let's 0:11:24.240,0:11:29.130 start with a pretty simple one. Shift-[br]Right-Double which essentially takes a 0:11:29.130,0:11:33.250 register, shift it by a given amount and[br]shifts in bits from another register. 
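As a reference point for the microcode that follows, this is a small sketch of what Shift-Right-Double computes for 32-bit operands. It models only the result value, not the flag updates, and the function name is made up for the example.

```python
MASK32 = 0xFFFFFFFF

def shrd32(dst, src, count):
    """Shift dst right by count bits, filling the vacated high bits from src."""
    count &= 31                      # the shift count is taken modulo 32
    if count == 0:
        return dst & MASK32          # a zero count leaves the destination unchanged
    return ((dst >> count) | (src << (32 - count))) & MASK32
```

For example, `shrd32(0x12345678, 0xABCDEF01, 8)` moves the low byte of the source (0x01) into the top byte of the result.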
So 0:11:33.250,0:11:38.180 of course you would expect a lot of shifts[br]and rolls in its implementation and this 0:11:38.180,0:11:45.338 is exactly what we're seeing here. You[br]have two shift-right operands and you can 0:11:45.338,0:11:50.830 see regmd6 and regmd4. These are[br]place holders. The microcode engine can 0:11:50.830,0:11:55.630 replace certain bit combinations with the[br]registers that are used in the x86 0:11:55.630,0:12:01.560 operation. For example this one would be[br]replaced by ECX or EAX depending on what 0:12:01.560,0:12:08.339 you wrote in x86. And at this point we can[br]also already gather more information about 0:12:08.339,0:12:13.601 microcodes than we previously knew because[br]we know "OK, so this is source, this is 0:12:13.601,0:12:18.529 also a source and this is a destination".[br]But this source which indicates the shift 0:12:18.529,0:12:22.750 amount, this one was previously unknown,[br]because it is a high temporary microcode 0:12:22.750,0:12:28.279 register and we found out that these[br]usually implement specific different 0:12:28.279,0:12:31.800 purpose. They are not - if you write to[br]them, sometimes the CPU behaves 0:12:31.800,0:12:35.890 erratically, sometimes it crashes,[br]sometimes nothing happens. But in this 0:12:35.890,0:12:40.300 case, this seems to be the shift count,[br]and the shift count is given by a third 0:12:40.300,0:12:45.279 operand in the instruction. So in this[br]case, we already learned "OK, if you want 0:12:45.279,0:12:51.380 to read the third operand of an[br]instruction, we need to read t41". And 0:12:51.380,0:12:56.236 this is how we went about recovering more[br]and more information about microcode. The 0:12:56.236,0:13:00.160 rest of the implementation is essentially[br]concerned with implementing the rest of 0:13:00.160,0:13:05.721 the semantics of the x86 instruction and[br]updating the flags correctly. 
OK, so now 0:13:05.721,0:13:11.980 let's look at an instruction that is a[br]bit more complicated. Let's check out 0:13:11.980,0:13:19.620 rdtsc. rdtsc returns an internal cycle[br]counter in EDX and EAX, so the upper part 0:13:19.620,0:13:25.520 ends up in EDX, lower part in EAX. So in[br]the end we want to see writes to these 0:13:25.520,0:13:30.760 registers, potentially with a shift[br]somewhere in there. But somewhere the CPU 0:13:30.760,0:13:37.570 needs to gather the cycle counter. So in[br]the beginning we have two load-style 0:13:37.570,0:13:41.410 operations. This one is a proper load[br]which we identified and this one is 0:13:41.410,0:13:48.569 unknown. But even though we do not know[br]the instruction, we know the target 0:13:48.569,0:13:52.720 because the result of this instruction[br]will end up in t9 and the result of this 0:13:52.720,0:13:58.060 instruction will end up in t10, so we can[br]follow the uses of these two registers. So 0:13:58.060,0:14:04.450 for simplicity I'm going to start with t10[br]and t10, which we later found out, this is 0:14:04.450,0:14:09.730 another register which essentially denotes[br]a specific internal register. And if you 0:14:09.730,0:14:15.450 play around with these bits you notice[br]that this combination encodes cr4. The x86 0:14:15.450,0:14:22.987 will just see cr4. You can also address[br]cr1 and cr2. And if you look further, t10 0:14:22.987,0:14:29.160 is then ANDed with this bit mask and if[br]you look in the manual you find out that 0:14:29.160,0:14:34.930 this bit in cr4 is the bit that[br]determines whether rdtsc is 0:14:34.930,0:14:40.019 available from user space or not. So this[br]is the check if this instruction should be 0:14:40.019,0:14:48.170 executed. So now let's just keep in mind[br]that t9 holds some other loaded value from 0:14:48.170,0:14:53.930 some other internal register and we will[br]come back to this one a bit later. For 0:14:53.930,0:14:58.848 now, let's follow execution. 
This triad is[br]essentially a padding triad. It is a 0:14:58.848,0:15:04.885 common pattern we see. So let's look at[br]where this branch takes us. 0:15:05.895,0:15:07.180 And this branch 0:15:07.180,0:15:15.959 takes us to a conditional branch[br]triad. And if you look a bit up, this AND 0:15:15.959,0:15:21.740 instruction actually updated this flag. So[br]this is a conditional branch that 0:15:21.740,0:15:26.360 determines whether this check was[br]successful or not. So it branches toward 0:15:26.360,0:15:32.570 the error triad or the success triad. But[br]here we already see the exit. We see a 0:15:32.570,0:15:41.170 write to RDX or EDX in this case with a[br]shift from t9 by 32 bits, which is exactly 0:15:41.170,0:15:45.910 what you would expect in order to write[br]the upper 32 bits of the 0:15:45.910,0:15:50.829 time stamp counter to edx. And you have an[br]unknown instruction, but we know, okay, we 0:15:50.829,0:15:57.877 move something from t9 to eax, which is[br]the lower 32 bits. But we're not done 0:15:57.877,0:16:02.690 here, because we can still look at the[br]error path that is taken if the access is 0:16:02.690,0:16:09.210 denied. So if you scroll a bit down we can[br]see a move of an immediate into a certain 0:16:09.210,0:16:14.530 internal register. And this immediate[br]actually encodes a general protection 0:16:14.530,0:16:21.790 fault interrupt code. D denotes to the[br]exception handler that this was a general 0:16:21.790,0:16:28.680 protection fault. And later this triad[br]branches to this address, and if you look 0:16:28.680,0:16:34.013 at the uses of this address we can find[br]other immediates that also correspond 0:16:34.013,0:16:36.962 to x86 instructions. So now we learned 0:16:36.962,0:16:39.947 how we can actually raise our[br]own interrupts. We 0:16:39.947,0:16:46.100 just need to load the code we want into[br]the specific register and branch to this 0:16:46.100,0:16:52.820 address. 
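The rdtsc walkthrough above boils down to a permission check followed by splitting a 64-bit counter. A minimal sketch of those architectural semantics, assuming a made-up `Fault` exception type and passing the counter, CR4 and privilege level in as plain integers:

```python
CR4_TSD = 1 << 2      # CR4.TSD: when set, rdtsc is only allowed at privilege level 0
GP_VECTOR = 0xD       # general protection fault (#GP), vector 13

class Fault(Exception):
    """Hypothetical stand-in for raising an x86 exception."""
    def __init__(self, vector):
        super().__init__(f"fault vector {vector:#x}")
        self.vector = vector

def rdtsc(tsc, cr4, cpl):
    """Return (edx, eax) for a 64-bit time stamp counter value."""
    if (cr4 & CR4_TSD) and cpl != 0:
        raise Fault(GP_VECTOR)           # the error path
    edx = (tsc >> 32) & 0xFFFFFFFF       # upper half, shifted right by 32 bits
    eax = tsc & 0xFFFFFFFF               # lower half
    return edx, eax
```

The microcode performs the same steps with internal registers (the counter in t9, the CR4 value in t10); the Python version only mirrors the observable behaviour.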
And now we learned a lot about[br]how we can actually write microcode, but 0:16:52.820,0:16:57.000 it's also interesting to see how certain[br]instructions are implemented. So let's 0:16:57.000,0:17:03.671 look at a pretty complicated one: wrmsr[br](Write MSR). wrmsr essentially writes some 0:17:03.671,0:17:08.449 data it is given to a machine specific[br]register. This machine specific register 0:17:08.449,0:17:12.980 differs between CPUs, between vendors,[br]sometimes between revisions. And these 0:17:12.980,0:17:17.910 implement non-standard extensions or[br]pretty complex features. For example, you 0:17:17.910,0:17:23.949 trigger a microcode update by writing to a[br]machine specific register. The register 0:17:23.949,0:17:30.570 addresses you want to write to is given in[br]ecx. And now we can see ecx is read and 0:17:30.570,0:17:39.679 it is shifted by sixteen bits to t10. So[br]again, we follow uses of t10 and we see 0:17:39.679,0:17:46.070 it as XOR'd with a certain bitmask. And[br]this bitmask is C000, which actually 0:17:46.070,0:17:52.429 denotes a namespace of the model specific[br]registers. In this case this should be an 0:17:52.429,0:17:58.450 AMD-specific namespace. And, of course,[br]this one again sets some flags, and you 0:17:58.450,0:18:04.240 can see your conditional branch depending[br]on these flags to what should be the 0:18:04.240,0:18:06.235 handler for this namespace. 0:18:06.695,0:18:10.770 Next one: We have another XOR[br]that uses a different bit 0:18:10.770,0:18:16.890 mask — in this case C001. C001 is the[br]namespace where the microcode update 0:18:16.890,0:18:25.050 routine is actually located in. So again,[br]we branch to this handler. And if you just 0:18:25.050,0:18:31.010 continue on, there are more operations on[br]rcx, followed by more branches, and this 0:18:31.010,0:18:35.790 continues until everything is dispatched[br]to the correct handler. 
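The dispatch pattern described for wrmsr, shifting ecx right by sixteen bits and comparing the result against namespace constants with XOR, can be sketched as below. The returned strings are placeholder names invented for this example; the real microcode branches to handler triads instead.

```python
def msr_namespace(ecx):
    """Classify an MSR address by its upper 16 bits, as the wrmsr microcode does."""
    t10 = (ecx >> 16) & 0xFFFF
    if (t10 ^ 0xC000) == 0:      # XOR result is zero exactly when the values match
        return "C000"
    if (t10 ^ 0xC001) == 0:      # namespace that contains the microcode update MSR
        return "C001"
    return "other"
```

For instance, `msr_namespace(0xC0010020)` yields `"C001"`; 0xC0010020 is the MSR the Linux kernel writes to when loading an AMD microcode patch, while 0xC0000080 (EFER) falls into `"C000"`.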
And this is how, 0:18:35.790,0:18:40.340 internally, wrmsr is implemented, and also[br]Read MSR is going to be implemented pretty 0:18:40.340,0:18:43.640 similar, because it implements some kind[br]of similar thing. 0:18:47.750,0:18:49.190 OK, so now I showed you 0:18:49.190,0:18:52.470 how we actually went ahead of[br]reconstructing the knowledge we 0:18:52.470,0:18:57.939 currently have. And now I'm going to show[br]you what we can actually do with it. And 0:18:57.939,0:19:02.440 for this I am going to quickly cover what[br]applications we wrote in microcode. We 0:19:02.440,0:19:04.940 wrote a simple configurable[br]rdtsc precision. 0:19:04.940,0:19:07.710 This means a certain bit mask is AND'd to 0:19:07.710,0:19:11.890 the result of rdtsc, so you can[br]reduce the accuracy of it, which can 0:19:11.890,0:19:18.284 sometimes prevent timing attacks. We also[br]implemented microcode-assisted address 0:19:18.284,0:19:23.260 sanitizer, which I'll cover quickly in a[br]second. We also have some basic microcode 0:19:23.260,0:19:29.070 instruction set randomization. Some[br]microcode-assisted instrumentation. What 0:19:29.070,0:19:33.520 this means is, you can write a filter for[br]your instrumentation in microcode itself. 0:19:33.520,0:19:37.580 So instead of hooking an instruction,[br]instead of debugging your code or 0:19:37.580,0:19:42.160 emulating it, you can just say whenever[br]the instruction is executed filter if this 0:19:42.160,0:19:47.180 is relevant for me, and if it is, call my[br]x86 handler — entirely in microcode, 0:19:47.180,0:19:52.470 without changing the instruction in the[br]RAM. We also implemented some basic 0:19:52.470,0:20:00.000 authenticated microcode updates. The usual[br]update mechanism is weak — that's how we 0:20:00.000,0:20:05.430 got our foot in the door in the first[br]place. So we improved upon it a bit. 
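The configurable rdtsc precision mentioned above is just a bit mask applied to the counter. A minimal sketch, where the parameter `drop_bits` (how many low bits to clear) is an assumed way of expressing the configuration:

```python
def coarse_rdtsc(tsc, drop_bits):
    """Clear the lowest drop_bits bits of a 64-bit time stamp counter value."""
    mask = ~((1 << drop_bits) - 1) & 0xFFFFFFFFFFFFFFFF
    return tsc & mask
```

With `drop_bits=8`, two reads less than 256 cycles apart can return the same value, which is what makes fine-grained timing measurements harder.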
Also 0:20:05.430,0:20:09.799 we found out that microcode actually has[br]some enclave-like features because once 0:20:09.799,0:20:13.730 we're executing in microcode, your kernel[br]can't interrupt you, your hypervisor can't 0:20:13.730,0:20:18.610 interrupt you and any state you want[br]visible to the outside world you actually 0:20:18.610,0:20:22.840 need to write explicitly. So all these[br]microcode internal registers are not 0:20:22.840,0:20:26.600 accessible from the outside world. So any[br]computation you perform in microcode 0:20:26.600,0:20:30.360 cannot be interfered with. So you can[br]implement a simple enclave on top of this 0:20:30.360,0:20:37.039 one. So our hardware-assisted address[br]sanitizer variant is based on the work by 0:20:37.039,0:20:41.970 the original authors and address sanitizer[br]is a software instrumentation that detects 0:20:41.970,0:20:47.070 invalid memory access by using a shadow[br]map, a shadow memory, to just say which memory 0:20:47.070,0:20:50.746 is valid to be read and written to. 0:20:50.746,0:20:53.840 The authors proposed hardware[br]address sanitizer 0:20:53.840,0:20:59.011 which is essentially doing the same checks[br]but using a new instruction. And the 0:20:59.011,0:21:03.940 instruction should raise a fault if an[br]invalid access is detected. This algorithm 0:21:03.940,0:21:07.670 they proposed - The details are not[br]important. What is important is in 0:21:07.670,0:21:12.080 essence: It's pretty simple. You load from[br]a certain address, perform some operations 0:21:12.080,0:21:18.816 on it and if the shadow value flags the access after[br]these operations you just report a bug. 0:21:18.816,0:21:24.910 Advantages of hardware address sanitizer[br]are for example you get better performance 0:21:24.910,0:21:29.170 out of it. 
Because you only have a single[br]instruction maybe you can do some fancy 0:21:29.170,0:21:34.450 tricks inside your CPU that are faster[br]than using x86 instructions, you get more 0:21:34.450,0:21:38.880 compact code and you have the possibility[br]of one time configuration which is a bit 0:21:38.880,0:21:45.210 hard with software address sanitizer. We[br]implemented hardware address sanitizer our 0:21:45.210,0:21:49.270 variant by replacing the bound instruction[br]Bound is an old instruction that is no 0:21:49.270,0:21:54.870 longer used by compilers because in fact[br]it is slower to use bound instead of 0:21:54.870,0:21:58.901 performing the checks with multiple x86[br]instructions. We changed the interface. 0:21:58.901,0:22:04.090 The first argument is the register which[br]holds the address you want to access. And 0:22:04.090,0:22:07.835 the second argument holds the size you[br]want this access to be. 0:22:07.835,0:22:11.050 So, 1 byte, 2 byte and so on. 0:22:11.050,0:22:14.950 This instruction is a no-op if the[br]check succeeds. So if there is no bug it 0:22:14.950,0:22:19.980 just continues on like nothing happened.[br]However if we detect an invalid access we 0:22:19.980,0:22:25.359 can take a configurable action, we can for[br]example just raise your normal page fault 0:22:25.359,0:22:29.630 or we can raise a bound interrupt, which[br]is a custom interrupt, that only denotes 0:22:29.630,0:22:34.299 this one or we can branch to an x86[br]handler that either performs additional 0:22:34.299,0:22:39.760 checking, for example whitelisting, or it[br]generates a pretty error report for you. 0:22:41.340,0:22:47.480 Most importantly this is a single[br]instruction. We also do not dirty any x86 0:22:47.480,0:22:52.690 registers because they are some[br]intermediate results. You need to store 0:22:52.690,0:22:56.360 these somewhere and this you usually do in[br]the x86 registers. So you increase 0:22:56.360,0:23:00.010 register pressure. 
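The check that the modified bound instruction performs can be sketched along the lines of the published AddressSanitizer scheme: one shadow byte per 8 bytes of memory, where 0 means fully addressable, 1 to 7 means only the first k bytes are addressable, and a negative value means poisoned. The shadow offset, the dictionary used as shadow memory and the `on_fault` callback are assumptions for this sketch; it covers only accesses that stay inside one 8-byte granule.

```python
SHADOW_OFFSET = 0x1000   # hypothetical base of the shadow map

def is_invalid(shadow, addr, size):
    """Return True if an access of `size` bytes at `addr` touches poisoned memory."""
    k = shadow.get((addr >> 3) + SHADOW_OFFSET, 0)   # load the shadow byte
    if k == 0:
        return False                                 # whole granule is addressable
    return (addr & 7) + size - 1 >= k                # last accessed byte beyond limit

def checked_access(shadow, addr, size, on_fault):
    """No-op when the check succeeds, otherwise take the configured action."""
    if is_invalid(shadow, addr, size):
        return on_fault(addr, size)
    return None

# Granule at 0x100 has only its first 4 bytes valid; granule at 0x108 is poisoned.
shadow = {(0x100 >> 3) + SHADOW_OFFSET: 4, (0x108 >> 3) + SHADOW_OFFSET: -1}
```

The configurable action from the talk (page fault, bound interrupt, or a branch to an x86 handler) corresponds to whatever `on_fault` does here.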
Maybe you cause[br]spilling. So overall your performance gets 0:23:00.010,0:23:07.230 worse. We also found out that we are[br]actually faster than doing the checking 0:23:07.230,0:23:12.390 using x86 instructions. So just by moving[br]the implementation from x86 level to 0:23:12.390,0:23:16.805 microcode, which in some way is still kind[br]of like software, we already improved the 0:23:16.805,0:23:22.160 performance. Also on top of this you get[br]better cache utilization because you have 0:23:22.160,0:23:27.020 less instructions, there are less bytes in[br]the cache, so we get fuller cache lines. 0:23:27.020,0:23:31.630 And also it is really easy to tell which[br]is testing code and which is your actual 0:23:31.630,0:23:40.080 program code. Lastly I'm going to show you[br]just a rough overview of our framework 0:23:40.080,0:23:45.920 which we used during our development and[br]which you can also find on GitHub. Early 0:23:45.920,0:23:50.079 on we found out that we are probably going[br]to need to test a lot of microcode 0:23:50.079,0:23:55.640 updates, because in the beginning you just[br]throw everything at the CPU and see how it 0:23:55.640,0:24:01.400 behaves and we wanted to do this in[br]parallel. So we developed a small custom 0:24:01.400,0:24:07.180 OS called "Angry OS" and deployed it to[br]mainboards. These mainboards are just old 0:24:07.180,0:24:13.270 AMD mainboards. All these mainboards were[br]hooked up via serial for communication and 0:24:13.270,0:24:19.400 GPIO to a Raspberry Pi. With the GPIO you[br]can reset, support power on, power down 0:24:19.400,0:24:23.890 and just have remote control of this[br]mainboard and then you can connect to that 0:24:23.890,0:24:28.719 Raspberry Pi from anywhere on earth and[br]just deploy and play around with it. 0:24:28.719,0:24:30.640 This was the first version. 
0:24:30.640,0:24:34.490 In the beginning we[br]didn't really know much about electronics 0:24:34.490,0:24:38.520 so we used one Raspberry Pi per mainboard.[br]And it turns out Raspberry Pis are more 0:24:38.520,0:24:43.970 expensive than these old mainboards, but[br]we improved upon this and now we're down 0:24:43.970,0:24:48.007 to one Raspberry Pi for[br]four / five setups. 0:24:48.007,0:24:51.587 For example you only need 3 GPIO ports per 0:24:51.587,0:24:57.358 mainboard. You connect each of these to[br]optocouplers just to separate the voltage 0:24:57.358,0:25:01.860 levels and then you connect one side of[br]the optocoupler to the GPIO the other side 0:25:01.860,0:25:05.909 to your reset pin, to your power pin and[br]for input to know whether your board is up 0:25:05.909,0:25:11.230 or down you connect the power LED. And[br]that way you can save a lot of space, a 0:25:11.230,0:25:17.205 lot of money. And also if you're really[br]constrained you can just remove the power 0:25:17.205,0:25:23.530 LED sensing because usually you know it is[br]in the state your setup is in. As I 0:25:23.530,0:25:28.230 already said we wrote our custom operating[br]system and it is intentionally really 0:25:28.230,0:25:32.659 really minimal because the major feature[br]we wanted is control over every 0:25:32.659,0:25:36.740 instructions that's going to be executed[br]from a certain point on, because we're 0:25:36.740,0:25:40.780 playing around with instruction encoding[br]and if we execute an instructions that we 0:25:40.780,0:25:45.530 did not intend we might crash the CPU, we[br]might go into an invalid state and we do 0:25:45.530,0:25:50.850 not even know which instruction caused it.[br]And Angry OS essentially only listens on 0:25:50.850,0:26:00.150 the serial port for something to do. What[br]it can do is apply an update. These 0:26:00.150,0:26:04.820 updates are just microcode updates. They[br]are streamed via serial. 
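The three-wire control described above (reset, power, and the power LED as an input) can be modelled with a small controller class. This is a sketch under stated assumptions: `FakeBoard` is an in-memory stand-in invented for the example, and on a real Raspberry Pi the `pulse` and `read` calls would go through a GPIO library and the optocouplers instead.

```python
class FakeBoard:
    """Hypothetical stand-in for a mainboard's front-panel header."""
    def __init__(self):
        self.powered = False
        self.resets = 0

    def pulse(self, line):
        # Briefly closing the power switch toggles the board's power state.
        if line == "power":
            self.powered = not self.powered
        elif line == "reset" and self.powered:
            self.resets += 1

    def read(self, line):
        # The power LED is lit exactly while the board is running.
        return self.powered if line == "power_led" else False


class BoardControl:
    """Remote control built from two outputs and one input per mainboard."""
    def __init__(self, board):
        self.board = board

    def is_up(self):
        return self.board.read("power_led")

    def power_on(self):
        if not self.is_up():
            self.board.pulse("power")

    def power_off(self):
        if self.is_up():
            self.board.pulse("power")

    def reset(self):
        self.board.pulse("reset")
```

Dropping the LED input, as suggested for constrained setups, would mean replacing `is_up` with state tracked in software.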
We can also 0:26:04.820,0:26:10.039 stream x86 code which is then run by Angry[br]OS and this is just so that we do not need 0:26:10.039,0:26:14.409 to reflash the USB stick every time we[br]want to update our testing code and the 0:26:14.409,0:26:19.280 result, all the errors are reported back[br]to the Raspberry Pi and thus they are 0:26:19.280,0:26:26.852 forwarded to us. The framework we use most[br]importantly has the microcode assembler 0:26:26.852,0:26:30.713 and a pretty verbose disassembler. This[br]disassembler generates the output I showed 0:26:30.713,0:26:36.919 you earlier and using this you can just[br]quickly write your own microcode. We also 0:26:36.919,0:26:42.245 included an x86 assembler because we[br]wanted to rapidly test different x86 0:26:42.245,0:26:47.730 testing codes. Using this framework we[br]were able to disassemble the existing 0:26:47.730,0:26:53.500 updates and we also used it to disassemble[br]our ROM after we reordered it and also 0:26:53.500,0:27:01.169 during the process when we fed it to our[br]emulator. And we can also create the 0:27:01.169,0:27:07.909 proper binary files that can be loaded by[br]the Linux kernel driver. We modified the 0:27:07.909,0:27:12.777 stock one to just load any update you give[br]it without checking if it's the correct 0:27:12.777,0:27:20.060 CPU ID and all these things just for[br]testing purposes. It's also available. And 0:27:20.060,0:27:25.740 also of course the framework can control[br]Angry OS to make your testing easier. And 0:27:25.740,0:27:29.650 we implemented a pretty basic remote[br]execution wrapper, so you can work on a 0:27:29.650,0:27:33.389 remote Raspberry Pi as if you were using[br]it locally. 0:27:34.809,0:27:36.799 And this brings me to the end 0:27:36.799,0:27:40.800 of talk. And in conclusion we can say[br]reversing the ROM opened up a lot of new 0:27:40.800,0:27:44.809 possibilities. We learned a lot about how[br]microcode works. 
We learned about how to 0:27:44.809,0:27:49.720 actually use it properly instead of just[br]inferring from a really small dataset 0:27:49.720,0:27:55.060 that we have from the updates, or from the[br]random bits we send to the CPU and 0:27:55.060,0:27:59.530 observe what happens. But there's a lot[br]left to do. So if you really want to hack 0:27:59.530,0:28:04.089 on it, just get in contact, we are happy[br]to share our findings with you. And as I 0:28:04.089,0:28:09.009 said the framework, AngryOS, example[br]programs that we implemented, and some 0:28:09.009,0:28:13.850 other stuff like the wiring is available[br]on GitHub. So that's that. And we are 0:28:13.850,0:28:16.809 happy to answer any questions you might[br]have. 0:28:16.809,0:28:22.234 applause 0:28:24.910,0:28:28.438 Herald Angel: Thank you very much. So we 0:28:28.438,0:28:34.260 have 10 minutes for questions, please line[br]up at the microphones. We start with this 0:28:34.260,0:28:39.220 one: microphone number 2.[br]M2: Hi. Thanks for a nice talk. A few 0:28:39.220,0:28:42.780 questions about your hardware address[br]sanitizer. 0:28:42.780,0:28:49.830 Benjamin: Mhm[br]M2: As I understand you don't need the 0:28:49.830,0:28:56.010 source code instrumentation because the[br]microcode is responsible for checking the 0:28:56.010,0:29:02.929 shadow memory, right?[br]Benjamin: No... The original hardware 0:29:02.929,0:29:07.950 sanitizer implementation is also based on[br]a compiler extension, that inserts a new 0:29:07.950,0:29:12.200 instruction because it doesn't exist[br]usually.
And it also inserts bootstrap 0:29:12.200,0:29:18.049 code that inits your shadow map and[br]also instruments your allocators to update 0:29:18.049,0:29:23.020 the shadow map during runtime and we[br]essentially need the same component, but 0:29:23.020,0:29:26.850 we do not need the software address[br]sanitizer component that essentially 0:29:26.850,0:29:33.740 inserts 10 or 20 x86 instructions before[br]every memory access. So yes we still need 0:29:33.740,0:29:37.647 a compile time component and we are still[br]source code based in a sense. 0:29:39.388,0:29:45.600 Herald: And, so..[br]M2: And I didn't see, maybe I missed the 0:29:45.600,0:29:51.299 numbers. How much it is faster than this[br]initial version? 0:29:51.299,0:29:56.419 Benjamin: You mean the initial hardware[br]sanitizer version or the software address 0:29:56.419,0:29:59.900 sanitizer?[br]M2: I mean let's say custom kernel address 0:29:59.900,0:30:05.180 sanitizer for Linux kernel which is[br]the usual one and your approach. 0:30:05.180,0:30:10.270 Benjamin: We only performed a micro[br]benchmark on Angry OS and we essentially 0:30:10.270,0:30:16.059 took the instrumentation as emitted by the[br]compiler for some memory access which is 0:30:16.059,0:30:20.590 your standard software address sanitizer[br]and compared it to our version using only 0:30:20.590,0:30:24.640 the modified bound instruction. So I[br]really can't talk about how it compares to 0:30:24.640,0:30:28.820 KASAN or something or some like real world[br]implementation, because we only have the 0:30:28.820,0:30:34.069 prototype and the basic instrumentation.[br]M2: Thank you very much. 0:30:34.069,0:30:36.490 Herald Angel: OK. Microphone number 4[br]please. 0:30:36.490,0:30:51.145 M4: Hey thanks for the talk and did you[br]find any weird microcode 0:30:51.145,0:31:00.529 implementations? I don't mean security[br]wise, just like you rarely expected to 0:31:00.529,0:31:07.330 see it be implemented that way.
0:31:09.040,0:31:11.700 Benjamin: The problem is there's a lot of 0:31:11.700,0:31:20.270 microcode to begin with. You have f000[br]triads. Each of which has 3 op-codes. So 0:31:20.270,0:31:25.003 you have a lot of ground to cover and also[br]we have read-out errors. Sometimes you are 0:31:25.003,0:31:29.169 seeing bit flips, which kind of slows you[br]down because you then need to always 0:31:29.169,0:31:32.820 consider: OK, maybe this register is[br]something else, maybe this address is 0:31:32.820,0:31:37.420 wrong. And also sometimes you have a dust[br]particle that kind of knocks out an 0:31:37.420,0:31:42.550 entire region. So we only looked at the[br]components we were pretty sure that we 0:31:42.550,0:31:46.520 recovered correctly, and we only looked[br]at a really tiny subset compared to all of 0:31:46.520,0:31:52.940 the microcode ROM. It's just not feasible[br]to do and to go through it and look at 0:31:52.940,0:31:57.330 everything. So no we didn't find anything[br]funny but we also wouldn't know what funny 0:31:57.330,0:32:00.790 looks like because we don't know what the[br]official spec for microcode is. 0:32:01.180,0:32:03.990 M4: Thanks.[br]Herald Angel: Interesting. We have one 0:32:04.034,0:32:05.809 question from the Internet, from the 0:32:05.809,0:32:09.792 Signal Angel please.[br]Signal Angel: Yes. Which AMD CPU 0:32:09.792,0:32:15.510 generations does this apply to?[br]Benjamin: Yeah this is still based on the 0:32:15.510,0:32:21.289 work of our first talk and this only works[br]on pretty old ones: K8, K10. So until, 0:32:21.289,0:32:26.940 CPUs produced until 2013. Yeah this was[br]the last year AMD produced anything like 0:32:26.940,0:32:32.520 that. Newer ones use some public key based[br]cryptography from what we can tell and we 0:32:32.520,0:32:36.559 haven't yet managed to break it. Same goes[br]for Intel, they seem to be using public 0:32:36.559,0:32:39.919 key cryptography and we haven't gotten a[br]foot in the door yet.
0:32:40.989,0:32:44.789 Herald Angel: Thank you. We go one around.[br]On microphone number 3 please. 0:32:44.789,0:32:51.290 M3: Yeah. Thank you. I would like to know[br]how complex could the microcode programs 0:32:51.290,0:32:59.159 be, that you could write. So what's the[br]complexity of new operations you could 0:32:59.159,0:33:03.300 implement?[br]Benjamin: The only limiting factor is the 0:33:03.300,0:33:07.923 size of your microcode update RAM. But[br]this one is really really limited. 0:33:07.923,0:33:12.679 For example on K8, where we performed the[br]majority of our experiments, we are 0:33:12.679,0:33:19.050 limited to 32 triads, which comes down to[br]ninety six instructions and you also 0:33:19.050,0:33:22.440 have some constraints on these[br]instructions, for example the next triad 0:33:22.440,0:33:27.809 will always be executed no matter what.[br]Some operations can only go at the second 0:33:27.809,0:33:33.859 slot. Some can only go on another slot, so[br]it's really really hard. And you're also 0:33:33.859,0:33:38.930 limited from our knowledge to loading 16[br]bit immediates instead of 32 bit or even 0:33:38.930,0:33:44.470 64 bit immediates. So your whole program[br]grows really fast if you're trying to do 0:33:44.470,0:33:49.400 something complex. For example our[br]authenticated microcode update mechanism 0:33:49.400,0:33:54.440 is the most complex one we wrote; it nearly[br]fills out the RAM and we used TEA – Tiny 0:33:54.440,0:33:58.700 Encryption Algorithm – because that was[br]the only one we managed to fit mostly due 0:33:58.700,0:34:04.510 to S-box and other constants we would need[br]to load. So it's really small. 0:34:04.510,0:34:08.539 Herald Angel: Thank you. Microphone number[br]1. 0:34:08.539,0:34:14.709 M1: So you said the microcode is used for[br]instruction decoding and it needs to emit 0:34:14.709,0:34:19.429 the micro-ops to the scheduler and micro[br]queue in some way.
Did you find out how 0:34:19.429,0:34:27.519 that works?[br]Benjamin: In essence we are not actually 0:34:27.519,0:34:33.539 executing code inside the microcode engine.[br]From what we understand, the 0:34:33.539,0:34:38.569 microcode engine is just some kind of a[br]software based recipe, that describes how 0:34:38.569,0:34:43.479 to decode an instruction, so you don't[br]actually get execution, you just commit 0:34:43.479,0:34:47.269 instructions into the pipelines, that do[br]what you want. And because we have some 0:34:47.269,0:34:51.269 control flow possibility, that is actually[br]inside the microcode engine, because you 0:34:51.269,0:34:55.268 can branch to different addresses, you can[br]conditionally branch and loop. You kind of 0:34:55.268,0:34:59.089 get an execution, but in essence you just[br]commit stuff in the pipeline and the CPU 0:34:59.089,0:35:01.440 does what you tell it to. 0:35:04.240,0:35:07.161 Herald Angel: One more question.[br]Microphone number 2, please. 0:35:07.161,0:35:11.927 M2: How did you take the picture of the[br]internal CPU? Did you open it? 0:35:11.927,0:35:14.969 Benjamin: Yeah. We worked together with 0:35:14.969,0:35:19.680 Chris. He's our hardware guy. He has[br]access to his equipment to delayer it and 0:35:19.680,0:35:24.289 to take high resolution optical shots and[br]he also takes shots with a scanning 0:35:24.289,0:35:29.279 electron microscope. So I think about five[br]or six CPUs were harmed in the making of 0:35:29.279,0:35:30.357 this paper. 0:35:33.810,0:35:37.815 Herald Angel: So we have one more last[br]question. Microphone number 2 please. 0:35:39.248,0:35:41.390 M2: Are you aware of research done by 0:35:41.390,0:35:49.400 Christopher Domas, where he mapped out the[br]instruction set for x86 processors? 0:35:49.400,0:35:57.119 Benjamin: You mean sandsifter?
We[br]actually talked with him and yeah we are 0:35:57.119,0:36:02.910 aware, that there's a map essentially of[br]the instruction set and also maybe you can 0:36:02.910,0:36:07.275 combine it, because in the beginning we[br]reverse engineered where certain x86 0:36:07.275,0:36:11.335 instructions are implemented in microcode.[br]So if you plug these two together you kind 0:36:11.335,0:36:15.170 of map out the whole microcode ROM at the[br]same time that you map out a whole 0:36:15.170,0:36:18.989 instruction set. However there are some[br]components of the microcode ROM that are 0:36:18.989,0:36:23.470 most likely not triggered by instructions.[br]For example it seems like power management 0:36:23.470,0:36:27.368 or everything that is behind a write MSR[br][wrmsr] or read MSR [rdmsr]. wrmsr is a 0:36:27.368,0:36:31.249 single instruction, but depending on the[br]arguments you give it it just branches to 0:36:31.249,0:36:36.442 totally different triads and the microcode[br]itself is implemented in microcode. And 0:36:36.442,0:36:40.190 this one is a huge chunk you wouldn't even[br]find without brute forcing all 0:36:40.190,0:36:44.159 combinations for all instructions which is[br]not really feasible. 0:36:46.483,0:36:51.279 Herald Angel: Thank you. Thank you[br]Benjamin. 0:36:51.279,0:36:57.210 applause 0:36:57.210,0:37:01.811 35c3 postroll music 0:37:01.811,0:37:21.000 subtitles created by c3subtitles.de[br]in the years 2019-2020. Join, and help us!