0:00:03.129,0:00:07.360
35C3 preroll music
0:00:18.780,0:00:23.869
Herald: So for the next talk, Benjamin Kollenda[br]and Philipp Koppe will refresh our
0:00:23.869,0:00:30.529
memories because they already had a talk[br]at 34C3 where they talked about the micro
0:00:30.529,0:00:37.580
code ROM and today they're gonna give us[br]more insights on how micro code works. And
0:00:37.580,0:00:44.320
more details on the ROM itself. Benjamin[br]is a PhD student and has a focus on
0:00:44.320,0:00:51.280
software attacks and defenses and together[br]with Philipp they will now abuse AMD
0:00:51.280,0:00:55.190
microcode for fun and security. Please[br]enjoy.
0:00:55.190,0:00:58.730
Applause
0:01:01.320,0:01:06.260
Benjamin: Thank you. So as mentioned we[br]were able to reverse engineer the AMD
0:01:06.260,0:01:11.599
microcode and the AMD microcode ROM and[br]I'm going to talk about our journey. What
0:01:11.599,0:01:16.369
we learned on the way and how we did it.[br]So this is joint work with my colleagues at
0:01:16.369,0:01:20.799
Ruhr-Universität Bochum, and a quick outline[br]of how we are going to do it. We're going to
0:01:20.799,0:01:25.380
start with a quick crash course on micro[br]architectural basics and what microcode
0:01:25.380,0:01:28.350
actually is. Then I talk about how we[br]reconstructed the
0:01:28.350,0:01:30.330
microcode ROM and what we learned
0:01:30.330,0:01:35.389
along the way. Then I quickly give some[br]examples of the applications we
0:01:35.389,0:01:41.430
implemented with the knowledge we gained[br]from the second step. And lastly I talk about
0:01:41.430,0:01:47.649
a framework we used. How it works and what[br]we can do with it. And also this framework
0:01:47.649,0:01:51.899
is available on GitHub along with some[br]other tools so you're free to continue our
0:01:51.899,0:01:57.189
work. OK. So when I'm talking about[br]microcode you can think of it essentially
0:01:57.189,0:02:02.331
as a firmware for your processor. It[br]handles multiple purposes for example
0:02:02.331,0:02:06.440
you can use it to fix CPU bugs that you[br]have in silicon and you want to fix later
0:02:06.440,0:02:11.971
in the design phase. It is used for[br]instruction decoding - I cover this one a
0:02:11.971,0:02:17.970
bit more. It is also used for exception[br]handling. For example, if an exception or
0:02:17.970,0:02:22.200
interrupt is raised, microcode has a first[br]chance of modifying this interrupt
0:02:22.200,0:02:27.110
ignoring it or just passing it along to[br]the operating system. It's also used for
0:02:27.110,0:02:31.790
power management and some other complex[br]features like Intel SGX. And most
0:02:31.790,0:02:37.318
importantly for us, microcode is updatable.[br]This is used to patch errors in the field.
0:02:37.318,0:02:40.975
Everyone remembers Spectre / Meltdown[br]patches and there's
0:02:40.975,0:02:44.210
a microcode update. So your
0:02:44.210,0:02:50.830
x86 CPU takes multiple steps to execute an[br]instruction. The first step is decoding
0:02:50.830,0:02:55.022
an x86 instruction into multiple smaller[br]micro ops.
0:02:55.022,0:02:57.150
These are then scheduled into the pipeline
0:02:57.150,0:03:01.632
From there, they are dispatched to[br]the different functional units
0:03:01.632,0:03:03.532
like your ALU / AGU
0:03:03.532,0:03:06.392
multiplication division units
0:03:06.392,0:03:08.355
For our purposes the decode step is the
0:03:08.355,0:03:12.190
most interesting one. In the decode step[br]you have an instruction buffer that feeds
0:03:12.190,0:03:17.030
instructions to some decoders. You have[br]short decoders that handle really simple
0:03:17.030,0:03:21.100
instructions. There are long decoders that[br]can handle some more advanced instructions.
0:03:21.100,0:03:25.260
And finally, the vector decoder. The[br]vector decoder handles the most complex
0:03:25.260,0:03:29.690
instructions with the help of microcode.[br]So the microcode engine is essentially the
0:03:29.690,0:03:31.247
vector decoder.
0:03:32.458,0:03:36.570
The microcode engine in essence[br]is composed of a microcode
0:03:36.570,0:03:40.770
ROM that stores the instructions for the[br]microcode engine. Think of it as your
0:03:40.770,0:03:48.190
standard instructions. Then there is also[br]a writeable memory the microcode RAM. This
0:03:48.190,0:03:52.520
is where the microcode updates end up when[br]you apply microcode updates. And of course
0:03:52.520,0:03:57.310
around the storage there is a whole lot of[br]things that make it actually run. For this
0:03:57.310,0:04:00.860
talk, you only need to know what the[br]Match Registers are. Match Registers are
0:04:00.860,0:04:05.650
essentially breakpoint registers. So if we[br]write an address from inside the microcode
0:04:05.650,0:04:10.670
ROM into a Match Register, then whenever this[br]address is fetched, execution control is
0:04:10.670,0:04:17.570
transferred to the microcode RAM so our[br]patch gets executed. And the microcode
0:04:17.570,0:04:23.060
updates are usually loaded by the BIOS or[br]by the kernel. Linux has an update driver,
0:04:23.060,0:04:28.340
sometimes the BIOS updates it with a[br]pre-installed version and they have a
0:04:28.340,0:04:32.120
pretty simple structure, a partially[br]documented header, and followed by the
0:04:32.120,0:04:37.730
actual microcode that is loaded inside the[br]CPU. And so microcode is organized in
0:04:37.730,0:04:42.650
something called triads. Each triad has[br]three operations essentially x86
0:04:42.650,0:04:48.230
instructions, but with some differences.[br]And lastly, you have a sequence word. The
0:04:48.230,0:04:52.025
sequence word indicates which microcode[br]instructions should be executed next. We
0:04:52.025,0:04:57.950
have options of executing just the next[br]triad, executing another one by branching
0:04:57.950,0:05:01.936
to it, or just saying OK, I'm done with[br]decoding this instruction continue with
0:05:01.936,0:05:07.490
x86 code. These updates are protected by[br]some weak authentication which we were
0:05:07.490,0:05:13.260
able to break so we can create our own. We[br]can analyze existing ones and we can apply
0:05:13.260,0:05:20.620
these to your standard laptop and desktop.[br]However there can only ever be one update
0:05:20.620,0:05:26.534
loaded at a time and when you reboot[br]your machine this update will be gone.
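
A minimal sketch of the control flow described here, written as a toy Python model. All names (`Triad`, `Seq`, `run`) and the dictionary-based ROM and RAM are hypothetical illustrations, not the real microcode encoding. In particular, the sketch assumes that a patch keeps running from RAM until its sequence word says "complete".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Seq(Enum):
    """The three sequence word options mentioned in the talk."""
    NEXT = "next"          # continue with the following triad
    BRANCH = "branch"      # continue at an explicit triad address
    COMPLETE = "complete"  # done decoding, return to x86 code


@dataclass
class Triad:
    ops: Tuple[str, str, str]     # three micro-ops (placeholder strings)
    seq: Seq                      # sequence word
    target: Optional[int] = None  # only used when seq is BRANCH


def run(rom: Dict[int, Triad], ram: Dict[int, Triad],
        match_registers: Dict[int, int], entry: int) -> List[Tuple[str, int]]:
    """Follow sequence words from a ROM entry point.

    match_registers maps a ROM address to a RAM address (assumed layout).
    Returns the list of (store, address) pairs that were executed.
    """
    trace = []
    store, addr = rom, entry
    while True:
        # A Match Register acts like a breakpoint on a ROM address:
        # when it hits, control is transferred to the patch in RAM.
        if store is rom and addr in match_registers:
            store, addr = ram, match_registers[addr]
        triad = store[addr]
        trace.append(("ram" if store is ram else "rom", addr))
        if triad.seq is Seq.COMPLETE:
            return trace
        addr = triad.target if triad.seq is Seq.BRANCH else addr + 1


rom = {0: Triad(("op", "op", "op"), Seq.NEXT),
       1: Triad(("op", "op", "op"), Seq.COMPLETE)}
ram = {0: Triad(("patched", "op", "op"), Seq.COMPLETE)}

print(run(rom, ram, {}, 0))      # no patch installed
print(run(rom, ram, {1: 0}, 0))  # ROM address 1 redirected to RAM address 0
```

With an empty Match Register table the walk stays in ROM; with ROM address 1 registered, the second triad comes from RAM instead.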
0:05:28.490,0:05:32.990
Also for the talk we are going to look at[br]some microcode and we will present this
0:05:32.990,0:05:38.150
microcode using a register transfer[br]language. It is heavily based on x86. I'm
0:05:38.150,0:05:43.290
just going to cover the differences[br]between these two. Most importantly the
0:05:43.290,0:05:48.650
microcode can have three operands for an[br]instruction in comparison to x86 which
0:05:48.650,0:05:53.640
usually only has two. So you can specify a[br]destination and two source operands.
0:05:55.618,0:05:56.446
Also,
0:05:57.210,0:06:02.240
microcode has certain bit flags that[br]need to be set, and these we denote with
0:06:02.240,0:06:07.449
these annotations; for example, ".C" means[br]this instruction also updates the carry flag
0:06:07.449,0:06:14.050
based on the result. Then you have the[br]instruction "jcc" which is a conditional
0:06:14.050,0:06:19.570
branch and the first operand denotes the[br]condition upon which this branch is
0:06:19.570,0:06:24.100
taken. In this case branch if the carry[br]flag is one and [the] second operand
0:06:24.100,0:06:30.300
indicates the offset to add to the[br]instruction pointer. Then we also have
0:06:30.300,0:06:35.760
some sequence word annotations: "next",[br]"complete", and "branch". Also it should
0:06:35.760,0:06:39.958
be noted that the internal microcode[br]architecture is a load-store architecture.
0:06:39.958,0:06:45.350
You can't use memory operands in other[br]instructions like you can on x86 you
0:06:45.350,0:06:48.310
always need to load and store memory[br]explicitly.
0:06:49.190,0:06:51.710
Now we are going to talk about
0:06:51.710,0:06:58.710
how we manage to recover the microcode[br]ROM. The microcode ROM is baked into your
0:06:58.710,0:07:06.860
CPU, you can't change it anymore. It is[br]defined in the silicon during the
0:07:06.860,0:07:12.930
fabrication process and in this picture[br]you can see a die shot taken with an
0:07:12.930,0:07:16.840
electron microscope and this is one of[br]three regions that contains the bits for
0:07:16.840,0:07:23.240
the microcode operations. And if you zoom[br]in a bit more, each of these regions
0:07:23.240,0:07:30.050
consists of four arrays and these are[br]further subdivided into blocks. Really
0:07:30.050,0:07:34.660
interesting is "Array 2" which is a bit[br]smaller than the other ones but it has
0:07:34.660,0:07:42.160
some structures above it which are of a[br]different visual layout. This is SRAM
0:07:42.160,0:07:47.050
which stores the microcode update. So this[br]is one-time reprogrammable memory that is
0:07:47.050,0:07:53.860
still pretty fast. So the microcode RAM is[br]located right next to the microcode ROM
0:07:53.860,0:07:57.645
which also makes sense from a design[br]standpoint.
0:08:00.445,0:08:02.010
Just an overview of how we
0:08:02.010,0:08:06.930
went ahead and how we went about it. We[br]started with pictures and then we used
0:08:06.930,0:08:11.456
some OCR-like process to transform them[br]into bit strings which we can then further
0:08:11.456,0:08:17.169
process. These bitstrings were then[br]arranged into triads. We could already
0:08:17.169,0:08:22.050
gather that we got individual triads[br]right because there were data dependencies
0:08:22.050,0:08:27.550
all over the place, but between triads,[br]there were no or very few data
0:08:27.550,0:08:33.699
dependencies so the ordering of the[br]triads was still wrong and this was a
0:08:33.699,0:08:38.860
major problem when we went ahead and what[br]we had to reverse engineer and this is
0:08:38.860,0:08:43.870
mapping a certain physical address of a[br]triad that we gathered from the ROM
0:08:43.870,0:08:48.050
readout to a virtual address that is used[br]inside the microcode update or the
0:08:48.050,0:08:53.690
microcode ROM. But after reverse engineering[br]this, you can just do a linear sweep
0:08:53.690,0:08:59.020
disassembly of the microcode ROM and[br]arrive at human readable output. But this
0:08:59.020,0:09:04.870
recovery was a bit tricky because we[br]required physical virtual address pairs.
0:09:04.870,0:09:09.520
But gathering these is a bit harder[br]because we worked through the
0:09:09.520,0:09:14.040
available updates, but we could only find[br]two pairs of them. These pairs were
0:09:14.040,0:09:18.520
actually easy to find because every update[br]replaces a certain triad inside your
0:09:18.520,0:09:24.580
microcode ROM and this triad is usually[br]also placed in the microcode update. So by
0:09:24.580,0:09:31.260
matching the address this update replaces[br]with a microcode ROM readout, you can just
0:09:31.260,0:09:38.000
get your two data points. But we had to[br]get more data points so we generated these
0:09:38.000,0:09:42.630
mappings by matching semantics of triads[br]in the microcode ROM readout and the
0:09:42.630,0:09:47.779
semantics when we force execution of a[br]certain microcode address. And for gathering
0:09:47.779,0:09:52.330
the semantics of the read-out microcode,[br]we implemented a simple microcode
0:09:52.330,0:09:58.820
simulator. Essentially it works on triad[br]level, so you give it an input state and a
0:09:58.820,0:10:03.430
triad and it calculates the output state[br]of it. Input and output state are
0:10:03.430,0:10:08.460
composed of the x86 state, which is[br]your standard registers, and also the
0:10:08.460,0:10:12.320
internal microcode registers. There are[br]multiple temporary registers that get
0:10:12.320,0:10:18.350
reset for every new x86 instruction that[br]is executed, but they can also be modified
0:10:18.350,0:10:24.130
by microcode of course. Our emulator[br]supports all known arithmetic operations
0:10:24.130,0:10:29.230
and we have a white-list of operations[br]that do not form or produce any observable
0:10:29.230,0:10:32.950
change in state just so that we could[br]process more triads and gather more
0:10:32.950,0:10:41.310
data points. In total we gathered 54[br]additional data-address pairs which turned
0:10:41.310,0:10:46.649
out to be enough to recover the whole[br]mapping. This mapping, essentially you
0:10:46.649,0:10:50.820
have the four different arrays that map to[br]individual blocks and these blocks in
0:10:50.820,0:10:56.750
these arrays are then again permuted a bit[br]and then the triads inside these blocks
0:10:56.750,0:11:02.330
have some table-based permutations. So[br]this is not an obfuscation. This is just
0:11:02.330,0:11:07.680
from a hardware design standpoint it can[br]make sense to reroute it a bit differently.
0:11:09.330,0:11:14.629
Also now that we can actually[br]map a certain address to the microcode ROM
0:11:14.629,0:11:19.093
readout and we know the addresses of[br]different x86 instructions from our
0:11:19.093,0:11:24.240
earlier experiments, we can look at the[br]implementation of instructions. So let's
0:11:24.240,0:11:29.130
start with a pretty simple one. Shift-[br]Right-Double which essentially takes a
0:11:29.130,0:11:33.250
register, shifts it by a given amount and[br]shifts in bits from another register. So
0:11:33.250,0:11:38.180
of course you would expect a lot of shifts[br]and rolls in its implementation and this
0:11:38.180,0:11:45.338
is exactly what we're seeing here. You[br]have two shift-right operands and you can
0:11:45.338,0:11:50.830
see regmd6 and regmd4. These are[br]placeholders. The microcode engine can
0:11:50.830,0:11:55.630
replace certain bit combinations with the[br]registers that are used in the x86
0:11:55.630,0:12:01.560
operation. For example this one would be[br]replaced by ECX or EAX depending on what
0:12:01.560,0:12:08.339
you wrote in x86. And at this point we can[br]also already gather more information about
0:12:08.339,0:12:13.601
microcode than we previously knew because[br]we know "OK, so this is source, this is
0:12:13.601,0:12:18.529
also a source and this is a destination".[br]But this source which indicates the shift
0:12:18.529,0:12:22.750
amount, this one was previously unknown,[br]because it is a high temporary microcode
0:12:22.750,0:12:28.279
register and we found out that these[br]usually serve specific, different
0:12:28.279,0:12:31.800
purposes. If you write to[br]them, sometimes the CPU behaves
0:12:31.800,0:12:35.890
erratically, sometimes it crashes,[br]sometimes nothing happens. But in this
0:12:35.890,0:12:40.300
case, this seems to be the shift count,[br]and the shift count is given by a third
0:12:40.300,0:12:45.279
operand in the instruction. So in this[br]case, we already learned "OK, if you want
0:12:45.279,0:12:51.380
to read the third operand of an[br]instruction, we need to read t41". And
0:12:51.380,0:12:56.236
this is how we went about recovering more[br]and more information about microcode. The
0:12:56.236,0:13:00.160
rest of the implementation is essentially[br]concerned with implementing the rest of
0:13:00.160,0:13:05.721
the semantics of the x86 instruction and[br]updating the flags correctly. OK, so now
0:13:05.721,0:13:11.980
let's look at an instruction that is a[br]bit more complicated. If you check out
0:13:11.980,0:13:19.620
rdtsc. rdtsc returns an internal cycle[br]counter in EDX and EAX, so the upper part
0:13:19.620,0:13:25.520
ends up in EDX, lower part in EAX. So in[br]the end we want to see writes to these
0:13:25.520,0:13:30.760
registers, potentially with a shift[br]somewhere in there. But somewhere the CPU
0:13:30.760,0:13:37.570
needs to gather the cycle counter. So in[br]the beginning we have two load-style
0:13:37.570,0:13:41.410
operations. This one is a proper load[br]which we identified and this one is
0:13:41.410,0:13:48.569
unknown. But despite that we do not know[br]the instruction, we know the target
0:13:48.569,0:13:52.720
because the result of this instruction[br]will end up in t9 and the result of this
0:13:52.720,0:13:58.060
instruction will end up in t10, so we can[br]follow the uses of these two registers. So
0:13:58.060,0:14:04.450
for simplicity I'm going to start with t10[br]and t10, which we later found out, this is
0:14:04.450,0:14:09.730
another register which essentially denotes[br]a specific internal register. And if you
0:14:09.730,0:14:15.450
play around with these bits you notice[br]that this combination encodes cr4. The x86
0:14:15.450,0:14:22.987
will just see cr4. You can also address[br]cr1 and cr2. And if you look further, t10
0:14:22.987,0:14:29.160
is then ANDed with this bit mask and if[br]you look in the manual you find out that
0:14:29.160,0:14:34.930
this bit in cr4 denotes the bit that[br]determines whether rdtsc is
0:14:34.930,0:14:40.019
available from user space or not. So this[br]is the check if this instruction should be
0:14:40.019,0:14:48.170
executed. So now let's just keep in mind[br]that t9 holds some other loaded value from
0:14:48.170,0:14:53.930
some other internal register and we will[br]come back to this one a bit later. For
0:14:53.930,0:14:58.848
now, let's follow execution. This triad is[br]essentially a padding triad. It is a
0:14:58.848,0:15:04.885
common pattern we see. So let's look at[br]where this branch takes us.
0:15:05.895,0:15:07.180
And this branch
0:15:07.180,0:15:15.959
takes us to a conditional branch[br]triad. And if you look a bit up, this AND
0:15:15.959,0:15:21.740
instruction actually updated this flag. So[br]this is a conditional branch that
0:15:21.740,0:15:26.360
determines whether this check was[br]successful or not. So it branches toward
0:15:26.360,0:15:32.570
the error triad or the success triad. But[br]here we already see the exit. We see a
0:15:32.570,0:15:41.170
write to RDX or EDX in this case with a[br]shift from t9 by 32 bit, which is exactly
0:15:41.170,0:15:45.910
what you would expect to write the time[br]stamp counter, or the upper 32 bits of the
0:15:45.910,0:15:50.829
time stamp counter to edx. And you have an[br]unknown instruction, but we know, okay, we
0:15:50.829,0:15:57.877
move something from t9 to eax, which is[br]the lower 32 bits. But we're not done
0:15:57.877,0:16:02.690
here, because we can still look at the[br]error path that is taken if the access is
0:16:02.690,0:16:09.210
denied. So if you scroll a bit down we can[br]see a move of an immediate into a certain
0:16:09.210,0:16:14.530
internal register. And this immediate[br]actually encodes a general protection
0:16:14.530,0:16:21.790
fault interrupt code. 0xD denotes to the[br]exception handler that this was a general
0:16:21.790,0:16:28.680
protection fault. And later this triad[br]branches to this address, and if you look
0:16:28.680,0:16:34.013
at the uses of this address we can find[br]other immediates that also correspond
0:16:34.013,0:16:36.962
to x86 instructions. So now we learned
0:16:36.962,0:16:39.947
how we can actually raise our[br]own interrupts. We
0:16:39.947,0:16:46.100
just need to load the code we want into[br]the specific register and branch to this
0:16:46.100,0:16:52.820
address. And now we learned a lot about[br]how we can actually write microcode, but
0:16:52.820,0:16:57.000
it's also interesting to see how certain[br]instructions are implemented. So let's
0:16:57.000,0:17:03.671
look at a pretty complicated one: wrmsr[br](Write MSR). wrmsr essentially writes some
0:17:03.671,0:17:08.449
data it is given to a machine specific[br]register. This machine specific register
0:17:08.449,0:17:12.980
differs between CPUs, between vendors,[br]sometimes between revisions. And these
0:17:12.980,0:17:17.910
implement non-standard extensions or[br]pretty complex features. For example, you
0:17:17.910,0:17:23.949
trigger a microcode update by writing to a[br]machine specific register. The register
0:17:23.949,0:17:30.570
address you want to write to is given in[br]ecx. And now we can see ecx is read and
0:17:30.570,0:17:39.679
it is shifted by sixteen bits to t10. So[br]again, we follow uses of t10 and we see
0:17:39.679,0:17:46.070
it is XOR'd with a certain bitmask. And[br]this bitmask is C000, which actually
0:17:46.070,0:17:52.429
denotes a namespace of the model specific[br]registers. In this case this should be an
0:17:52.429,0:17:58.450
AMD-specific namespace. And, of course,[br]this one again sets some flags, and you
0:17:58.450,0:18:04.240
can see your conditional branch depending[br]on these flags to what should be the
0:18:04.240,0:18:06.235
handler for this namespace.
0:18:06.695,0:18:10.770
Next one: We have another XOR[br]that uses a different bit
0:18:10.770,0:18:16.890
mask — in this case C001. C001 is the[br]namespace where the microcode update
0:18:16.890,0:18:25.050
routine is actually located in. So again,[br]we branch to this handler. And if you just
0:18:25.050,0:18:31.010
continue on, there are more operations on[br]rcx, followed by more branches, and this
0:18:31.010,0:18:35.790
continues until everything is dispatched[br]to the correct handler. And this is how,
0:18:35.790,0:18:40.340
internally, wrmsr is implemented, and also[br]Read MSR is going to be implemented pretty
0:18:40.340,0:18:43.640
similarly, because it implements some kind[br]of similar thing.
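
The namespace dispatch described for wrmsr can be sketched in a few lines of Python. This is only an illustration of the shift, XOR and compare pattern from the disassembly; the function name `msr_namespace` and the returned labels are hypothetical, and the real microcode continues with further checks on rcx that are not modeled here.

```python
def msr_namespace(ecx: int) -> str:
    """Classify an MSR address the way the wrmsr walkthrough describes.

    The upper 16 bits of ecx are moved into a temporary (t10 in the talk)
    and XOR'd against a namespace constant; a zero result means "match",
    which in microcode sets a flag for the following conditional branch.
    """
    t10 = (ecx & 0xFFFFFFFF) >> 16
    if (t10 ^ 0xC000) == 0:
        return "C000"   # handler for the C000 namespace
    if (t10 ^ 0xC001) == 0:
        return "C001"   # namespace that contains the microcode update MSR
    return "other"      # dispatched by later checks (not modeled)


print(msr_namespace(0xC0010020))  # AMD microcode patch loader MSR
print(msr_namespace(0xC0000080))  # EFER
print(msr_namespace(0x00000010))  # time stamp counter MSR
```

XOR followed by a zero-flag branch is simply an equality test, which is why the disassembly shows an XOR with an immediate directly before each conditional branch.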
0:18:47.750,0:18:49.190
OK, so now I showed you
0:18:49.190,0:18:52.470
how we actually went about[br]reconstructing the knowledge we
0:18:52.470,0:18:57.939
currently have. And now I'm going to show[br]you what we can actually do with it. And
0:18:57.939,0:19:02.440
for this I am going to quickly cover what[br]applications we wrote in microcode. We
0:19:02.440,0:19:04.940
wrote a simple configurable[br]rdtsc precision.
0:19:04.940,0:19:07.710
This means a certain bit mask is AND'd to
0:19:07.710,0:19:11.890
the result of rdtsc, so you can[br]reduce the accuracy of it, which can
0:19:11.890,0:19:18.284
sometimes prevent timing attacks. We also[br]implemented microcode-assisted address
0:19:18.284,0:19:23.260
sanitizer, which I'll cover quickly in a[br]second. We also have some basic microcode
0:19:23.260,0:19:29.070
instruction set randomization. Some[br]microcode-assisted instrumentation. What
0:19:29.070,0:19:33.520
this means is, you can write a filter for[br]your instrumentation in microcode itself.
0:19:33.520,0:19:37.580
So instead of hooking an instruction,[br]instead of debugging your code or
0:19:37.580,0:19:42.160
emulating it, you can just say whenever[br]the instruction is executed filter if this
0:19:42.160,0:19:47.180
is relevant for me, and if it is, call my[br]x86 handler — entirely in microcode,
0:19:47.180,0:19:52.470
without changing the instruction in the[br]RAM. We also implemented some basic
0:19:52.470,0:20:00.000
authenticated microcode updates. The usual[br]update mechanism is weak — that's how we
0:20:00.000,0:20:05.430
got our foot in the door in the first[br]place. So we improved upon it a bit. Also
0:20:05.430,0:20:09.799
we found out that microcode actually has[br]some enclave-like features because once
0:20:09.799,0:20:13.730
we're executing in microcode, your kernel[br]can't interrupt you, your hypervisor can't
0:20:13.730,0:20:18.610
interrupt you, and any state you want[br]visible to the outside world you actually
0:20:18.610,0:20:22.840
need to write explicitly. So all these[br]microcode internal registers are not
0:20:22.840,0:20:26.600
accessible from the outside world. So any[br]computation you perform in microcode
0:20:26.600,0:20:30.360
cannot be interfered with. So you can[br]implement a simple enclave on top of this
0:20:30.360,0:20:37.039
one. So our hardware-assisted address[br]sanitizer variant is based on the work by
0:20:37.039,0:20:41.970
the original authors and address sanitizer[br]is a software instrumentation that detects
0:20:41.970,0:20:47.070
invalid memory accesses by using a shadow[br]map, or shadow memory, to say which memory
0:20:47.070,0:20:50.746
is valid to be read and written to.
0:20:50.746,0:20:53.840
The authors proposed hardware[br]address sanitizer
0:20:53.840,0:20:59.011
which is essentially doing the same checks[br]but using a new instruction. And the
0:20:59.011,0:21:03.940
instruction should raise a fault if an[br]invalid access is detected. This algorithm
0:21:03.940,0:21:07.670
they proposed - The details are not[br]important. What is important is in
0:21:07.670,0:21:12.080
essence: It's pretty simple. You load from[br]a certain address, perform some operations
0:21:12.080,0:21:18.816
on it and if there is the shadow after[br]these operations you just report a bug.
0:21:18.816,0:21:24.910
Advantages of hardware address sanitizer[br]are for example you get better performance
0:21:24.910,0:21:29.170
out of it. Because you only have a single[br]instruction maybe you can do some fancy
0:21:29.170,0:21:34.450
tricks inside your CPU that are faster[br]than using x86 instructions, you get more
0:21:34.450,0:21:38.880
compact code and you have the possibility[br]of one time configuration which is a bit
0:21:38.880,0:21:45.210
hard with software address sanitizer. We[br]implemented our hardware address sanitizer
0:21:45.210,0:21:49.270
variant by replacing the bound instruction.[br]Bound is an old instruction that is no
0:21:49.270,0:21:54.870
longer used by compilers because in fact[br]it is slower to use bound instead of
0:21:54.870,0:21:58.901
performing the checks with multiple x86[br]instructions. We changed the interface.
0:21:58.901,0:22:04.090
The first argument is the register which[br]holds the address you want to access. And
0:22:04.090,0:22:07.835
the second argument holds the size you[br]want this access to be.
0:22:07.835,0:22:11.050
So, 1 byte, 2 byte and so on.
0:22:11.050,0:22:14.950
This instruction is a no-op if the[br]check succeeds. So if there is no bug it
0:22:14.950,0:22:19.980
just continues on like nothing happened.[br]However if we detect an invalid access we
0:22:19.980,0:22:25.359
can take a configurable action, we can for[br]example just raise your normal page fault
0:22:25.359,0:22:29.630
or we can raise a bound interrupt, which[br]is a custom interrupt, that only denotes
0:22:29.630,0:22:34.299
this one or we can branch to an x86[br]handler that either performs additional
0:22:34.299,0:22:39.760
checking, for example whitelisting, or it[br]generates a pretty error report for you.
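
A minimal sketch of the kind of shadow memory check involved, following the well-known software AddressSanitizer scheme (one shadow byte per 8 bytes of memory). It is not taken from the microcode itself: the dictionary-based shadow map, the function name `access_is_valid` and the omission of a shadow offset are simplifying assumptions for illustration.

```python
SHADOW_SCALE = 3  # one shadow byte describes 2**3 = 8 bytes of memory


def access_is_valid(shadow: dict, addr: int, size: int) -> bool:
    """Return True if an access of `size` bytes at `addr` is allowed.

    Shadow byte semantics (as in software AddressSanitizer):
      0      -> all 8 bytes of the granule are addressable
      1..7   -> only the first k bytes are addressable
      < 0    -> the whole granule is poisoned
    Granules without an entry are treated as fully addressable here.
    """
    k = shadow.get(addr >> SHADOW_SCALE, 0)
    if k == 0:
        return True
    # Offset of the last accessed byte inside the 8-byte granule.
    last_byte = (addr & 7) + size - 1
    return last_byte < k


# Granule at 0x100 has 4 valid bytes, granule at 0x108 is poisoned.
shadow = {0x100 >> SHADOW_SCALE: 4, 0x108 >> SHADOW_SCALE: -1}

for addr, size in [(0x100, 4), (0x102, 4), (0x108, 1), (0x200, 8)]:
    verdict = "ok" if access_is_valid(shadow, addr, size) else "invalid"
    print(hex(addr), size, verdict)
```

In the replaced bound instruction, the two arguments would correspond to `addr` and `size`, and an "invalid" result would trigger the configured action (page fault, bound interrupt or x86 handler) instead of a printed verdict.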
0:22:41.340,0:22:47.480
Most importantly this is a single[br]instruction. We also do not dirty any x86
0:22:47.480,0:22:52.690
registers because there are some[br]intermediate results. You need to store
0:22:52.690,0:22:56.360
these somewhere and this you usually do in[br]the x86 registers. So you increase
0:22:56.360,0:23:00.010
register pressure. Maybe you cause[br]spilling. So overall your performance gets
0:23:00.010,0:23:07.230
worse. We also found out that we are[br]actually faster than doing the checking
0:23:07.230,0:23:12.390
using x86 instructions. So just by moving[br]the implementation from x86 level to
0:23:12.390,0:23:16.805
microcode, which in some way is still kind[br]of like software, we already improved the
0:23:16.805,0:23:22.160
performance. Also on top of this you get[br]better cache utilization because you have
0:23:22.160,0:23:27.020
less instructions, there are less bytes in[br]the cache, so we get fuller cache lines.
0:23:27.020,0:23:31.630
And also it is really easy to tell which[br]is testing code and which is your actual
0:23:31.630,0:23:40.080
program code. Lastly I'm going to show you[br]just a rough overview of our framework
0:23:40.080,0:23:45.920
which we used during our development and[br]which you can also find on GitHub. Early
0:23:45.920,0:23:50.079
on we found out that we are probably going[br]to need to test a lot of microcode
0:23:50.079,0:23:55.640
updates, because in the beginning you just[br]throw everything at the CPU and see how it
0:23:55.640,0:24:01.400
behaves and we wanted to do this in[br]parallel. So we developed a small custom
0:24:01.400,0:24:07.180
OS called "Angry OS" and deployed it to[br]mainboards. These mainboards are just old
0:24:07.180,0:24:13.270
AMD mainboards. All these mainboards were[br]hooked up via serial for communication and
0:24:13.270,0:24:19.400
GPIO to a Raspberry Pi. With the GPIO you[br]can reset, support power on, power down
0:24:19.400,0:24:23.890
and just have remote control of this[br]mainboard and then you can connect to that
0:24:23.890,0:24:28.719
Raspberry Pi from anywhere on earth and[br]just deploy and play around with it.
0:24:28.719,0:24:30.640
This was the first version.
0:24:30.640,0:24:34.490
In the beginning we[br]didn't really know much about electronics
0:24:34.490,0:24:38.520
so we used one Raspberry Pi per mainboard.[br]And it turns out Raspberry Pis are more
0:24:38.520,0:24:43.970
expensive than these old mainboards, but[br]we improved upon this and now we're down
0:24:43.970,0:24:48.007
to one Raspberry Pi for[br]four / five setups.
0:24:48.007,0:24:51.587
For example you only need 3 GPIO ports per
0:24:51.587,0:24:57.358
mainboard. You connect each of these to[br]optocouplers just to separate the voltage
0:24:57.358,0:25:01.860
levels and then you connect one side of[br]the optocoupler to the GPIO the other side
0:25:01.860,0:25:05.909
to your reset pin, to your power pin and[br]for input to know whether your board is up
0:25:05.909,0:25:11.230
or down you connect the power LED. And[br]that way you can save a lot of space, a
0:25:11.230,0:25:17.205
lot of money. And also if you're really[br]constrained you can just remove the power
0:25:17.205,0:25:23.530
LED sensing because usually you know[br]the state your setup is in. As I
0:25:23.530,0:25:28.230
already said we wrote our custom operating[br]system and it is intentionally really
0:25:28.230,0:25:32.659
really minimal because the major feature[br]we wanted is control over every
0:25:32.659,0:25:36.740
instruction that's going to be executed[br]from a certain point on, because we're
0:25:36.740,0:25:40.780
playing around with instruction encoding[br]and if we execute an instruction that we
0:25:40.780,0:25:45.530
did not intend we might crash the CPU, we[br]might go into an invalid state and we do
0:25:45.530,0:25:50.850
not even know which instruction caused it.[br]And Angry OS essentially only listens on
0:25:50.850,0:26:00.150
the serial port for something to do. What[br]it can do is apply an update. These
0:26:00.150,0:26:04.820
updates are just microcode updates. They[br]are streamed via serial. We can also
0:26:04.820,0:26:10.039
stream x86 code which is then run by Angry[br]OS and this is just so that we do not need
0:26:10.039,0:26:14.409
to reflash the USB stick every time we[br]want to update our testing code and the
0:26:14.409,0:26:19.280
result, all the errors are reported back[br]to the Raspberry Pi and thus they are
0:26:19.280,0:26:26.852
forwarded to us. The framework we use most[br]importantly has the microcode assembler
0:26:26.852,0:26:30.713
and a pretty verbose disassembler. This[br]disassembler generates the output I showed
0:26:30.713,0:26:36.919
you earlier and using this you can just[br]quickly write your own microcode. We also
0:26:36.919,0:26:42.245
included an x86 assembler because we[br]wanted to rapidly test different x86
0:26:42.245,0:26:47.730
testing codes. Using this framework we[br]were able to disassemble the existing
0:26:47.730,0:26:53.500
updates and we also used it to disassemble[br]our ROM after we reordered it and also
0:26:53.500,0:27:01.169
during the process when we fed it to our[br]emulator. And we can also create the
0:27:01.169,0:27:07.909
proper binary files that can be loaded by[br]the Linux kernel driver. We modified the
0:27:07.909,0:27:12.777
stock one to just load any update you give[br]it without checking if it's the correct
0:27:12.777,0:27:20.060
CPU ID and all these things just for[br]testing purposes. It's also available. And
0:27:20.060,0:27:25.740
also of course the framework can control[br]Angry OS to make your testing easier. And
0:27:25.740,0:27:29.650
we implemented a pretty basic remote[br]execution wrapper, so you can work on a
0:27:29.650,0:27:33.389
remote Raspberry Pi as if you were using[br]it locally.
0:27:34.809,0:27:36.799
And this brings me to the end
0:27:36.799,0:27:40.800
of the talk. And in conclusion we can say[br]reversing the ROM opened up a lot of new
0:27:40.800,0:27:44.809
possibilities. We learned a lot about how[br]microcode works. We learned about how to
0:27:44.809,0:27:49.720
actually use it properly instead of just[br]inferring from a really small dataset,
0:27:49.720,0:27:55.060
that we have from the updates, or from the[br]random bits we send to the CPU and
0:27:55.060,0:27:59.530
observe what happened. But there's a lot[br]left to do. So if you really want to hack
0:27:59.530,0:28:04.089
on it, just get in contact, we are happy[br]to share our findings with you. And as I
0:28:04.089,0:28:09.009
said the framework AngryOS, example[br]programs, that we implemented, and some
0:28:09.009,0:28:13.850
other stuff like the wiring is available[br]on GitHub. So that's that. And we are
0:28:13.850,0:28:16.809
happy to answer any questions you might[br]have.
0:28:16.809,0:28:22.234
applause
0:28:24.910,0:28:28.438
Herald Angel: Thank you very much. So we
0:28:28.438,0:28:34.260
have 10 minutes for questions, please line[br]up at the microphones. We start with this
0:28:34.260,0:28:39.220
one: microphone number 2.[br]M2: Hi. Thanks for a nice talk. A few
0:28:39.220,0:28:42.780
questions about your hardware address[br]sanitizer.
0:28:42.780,0:28:49.830
Benjamin: Mhm[br]M2: As I understand you don't need the
0:28:49.830,0:28:56.010
source code instrumentation because the[br]microcode is responsible for checking the
0:28:56.010,0:29:02.929
shadow memory, right?[br]Benjamin: No... The original hardware
0:29:02.929,0:29:07.950
sanitizer implementation is also based on[br]a compiler extension, that inserts a new
0:29:07.950,0:29:12.200
instruction because it doesn't exist[br]usually. And it also inserts a bootstrap
0:29:12.200,0:29:18.049
code that inits your shadow map and[br]also instruments your allocators to update
0:29:18.049,0:29:23.020
the shadow map during runtime and we[br]essentially need the same component, but
0:29:23.020,0:29:26.850
we do not need the software address[br]sanitizer component that essentially
0:29:26.850,0:29:33.740
inserts 10 or 20 x86 instructions before[br]every memory access. So yes we still need
0:29:33.740,0:29:37.647
a compile time component and we are still[br]source code based in a sense.
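[Editor's sketch: the shadow-map idea described in this answer can be modelled roughly as below. This is not the speakers' implementation. In the talk the check runs in microcode behind a modified bound instruction; here it is plain Python. The 8-byte granule, the "non-zero means poisoned" encoding, and the names ShadowMap and Allocator are all assumptions made for illustration.]

```python
GRANULE = 8  # assumed: one shadow byte covers 8 bytes of application memory


class ShadowMap:
    """Hypothetical shadow map: 0 = addressable, 1 = poisoned."""

    def __init__(self, size):
        # Bootstrap step: all memory starts out poisoned.
        self.shadow = bytearray([1]) * ((size + GRANULE - 1) // GRANULE)

    def mark(self, addr, length, poisoned):
        first = addr // GRANULE
        last = (addr + length + GRANULE - 1) // GRANULE
        for granule in range(first, last):
            self.shadow[granule] = 1 if poisoned else 0

    def check(self, addr):
        # The part that would sit behind the single checking instruction.
        return self.shadow[addr // GRANULE] == 0


class Allocator:
    """Hypothetical instrumented bump allocator with redzones."""

    def __init__(self, shadow, redzone=GRANULE):
        self.shadow = shadow
        self.redzone = redzone
        self.next = redzone  # leave a poisoned redzone before the first object

    def malloc(self, size):
        size = (size + GRANULE - 1) // GRANULE * GRANULE  # round up to a granule
        addr = self.next
        self.shadow.mark(addr, size, poisoned=False)
        self.next = addr + size + self.redzone  # keep a poisoned gap after it
        return addr

    def free(self, addr, size):
        self.shadow.mark(addr, size, poisoned=True)


shadow = ShadowMap(256)
heap = Allocator(shadow)
p = heap.malloc(16)
print(shadow.check(p), shadow.check(p + 16))  # inside the object, then in the redzone
```

[The instrumented allocator and the bootstrap code are the compile-time parts mentioned above. Only the per-access check is what a single instruction would replace.]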
0:29:39.388,0:29:45.600
Herald: And, so..[br]M2: And I didn't see, maybe I missed the
0:29:45.600,0:29:51.299
numbers. How much faster is it than this[br]initial version?
0:29:51.299,0:29:56.419
Benjamin: You mean the initial hardware[br]sanitizer version or the software address
0:29:56.419,0:29:59.900
sanitizer?[br]M2: I mean let's say custom kernel address
0:29:59.900,0:30:05.180
sanitizer for Linux kernel which is the[br]usual one and your approach.
0:30:05.180,0:30:10.270
Benjamin: We only performed a micro[br]benchmark on Angry OS and we essentially
0:30:10.270,0:30:16.059
took the instrumentation as emitted by the[br]compiler for some memory access which is
0:30:16.059,0:30:20.590
your standard software address sanitizer[br]and compared it to our version using only
0:30:20.590,0:30:24.640
the modified bound instruction. So I[br]really can't talk about how it compares to
0:30:24.640,0:30:28.820
KASAN or something or some like real world[br]implementation, because we only have the
0:30:28.820,0:30:34.069
prototype and the basic instrumentation.[br]M2: Thank you very much.
0:30:34.069,0:30:36.490
Herald Angel: OK. Microphone number 4[br]please.
0:30:36.490,0:30:51.145
M4: Hey thanks for the talk and did you[br]find any weird microcode
0:30:51.145,0:31:00.529
implementations. I don't mean security[br]wise, just like you rarely expected to
0:31:00.529,0:31:07.330
see it be implemented that way.
0:31:09.040,0:31:11.700
Benjamin: The problem is there's a lot of
0:31:11.700,0:31:20.270
microcode to begin with. You have f000[br]triads. Each of which has 3 op-codes. So
0:31:20.270,0:31:25.003
you have a lot of ground to cover and also[br]we have read-out errors. Sometimes you are
0:31:25.003,0:31:29.169
seeing bit flips, which kind of slows you[br]down because you then need to always
0:31:29.169,0:31:32.820
consider: OK, maybe this register is[br]something else, maybe this address is
0:31:32.820,0:31:37.420
wrong. And also sometimes you have a dust[br]particle that kind of knocks out an
0:31:37.420,0:31:42.550
entire region. So we only looked at the[br]components, we were pretty sure that we
0:31:42.550,0:31:46.520
recovered correctly, and we'd only looked[br]at a really tiny subset compared to all of
0:31:46.520,0:31:52.940
the microcode ROM. It's just not feasible[br]to do and to go through it and look at
0:31:52.940,0:31:57.330
everything. So no we didn't find anything[br]funny but we also wouldn't know what funny
0:31:57.330,0:32:00.790
looks like because we don't know what the[br]official spec for microcode is.
0:32:01.180,0:32:03.990
M4: Thanks.[br]Herald Angel: Interesting. We have one
0:32:04.034,0:32:05.809
question from the Internet, from the
0:32:05.809,0:32:09.792
Signal Angel please.[br]Signal Angel: Yes. Which AMD CPU
0:32:09.792,0:32:15.510
generations does this apply to?[br]Benjamin: Yeah this is still based on the
0:32:15.510,0:32:21.289
work of our first talk and this only works[br]on pretty old ones: K8, K10. So until,
0:32:21.289,0:32:26.940
CPUs produced until 2013. Yeah this was[br]the last year AMD produced anything like
0:32:26.940,0:32:32.520
that. Newer ones use some public key based[br]cryptography from what we can tell and we
0:32:32.520,0:32:36.559
haven't yet managed to break it. Same goes[br]for Intel, they seem to be using public
0:32:36.559,0:32:39.919
key cryptography and we haven't gotten a[br]foot in the door yet.
0:32:40.989,0:32:44.789
Herald Angel: Thank you. We go one around.[br]On microphone number 3 please.
0:32:44.789,0:32:51.290
M3: Yeah. Thank you. I would like to know[br]how complex could the microcode programs
0:32:51.290,0:32:59.159
be, that you could write. So what's the[br]complexity of new operations you could
0:32:59.159,0:33:03.300
implement?[br]Benjamin: The only limiting factor is the
0:33:03.300,0:33:07.923
size of your microcode update RAM. But[br]this one is really really limited.
0:33:07.923,0:33:12.679
For example on K8, where we performed the[br]majority of our experiments. We are
0:33:12.679,0:33:19.050
limited to 32 triads, which comes down to[br]96 instructions and you also
0:33:19.050,0:33:22.440
have some constraints on these[br]instructions for example the next triad
0:33:22.440,0:33:27.809
will always be executed no matter what.[br]Some operations can only go at the second
0:33:27.809,0:33:33.859
slot. Some can only go on another slot, so[br]it's really really hard. And you're also
0:33:33.859,0:33:38.930
limited from our knowledge to loading 16[br]bit immediates instead of 32 bit or even
0:33:38.930,0:33:44.470
64 bit immediates. So your whole program[br]grows really fast if you're trying to do
0:33:44.470,0:33:49.400
something complex. For example our[br]authenticated microcode update mechanism
0:33:49.400,0:33:54.440
is the most complex one we wrote. It nearly[br]fills out the RAM and we used TEA – Tiny
0:33:54.440,0:33:58.700
Encryption Algorithm – because that was[br]the only one we managed to fit mostly due
0:33:58.700,0:34:04.510
to S-box and other constants we would need[br]to load. So it's really small.
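[Editor's sketch: to illustrate why TEA suits such a small budget, here is the standard TEA block cipher in Python. This is not the speakers' microcode. TEA needs no S-boxes and only one constant, DELTA. The helper split_imm16 is hypothetical and only shows how a 32-bit constant breaks into the two 16-bit immediates mentioned in the answer. The budget figure assumes 3 operations per triad, as stated earlier in the Q&A.]

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9        # TEA's only magic constant
ROUNDS = 32               # standard number of TEA cycles
PATCH_RAM_OPS = 32 * 3    # 32 triads with 3 operations each (K8 figure from the talk)


def split_imm16(value):
    """Hypothetical helper: split a 32-bit constant into (high, low) 16-bit halves."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def tea_encrypt(block, key):
    """Encrypt one 64-bit block (two 32-bit words) with a 128-bit key (four words)."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(ROUNDS):
        total = (total + DELTA) & MASK32
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1


def tea_decrypt(block, key):
    """Invert tea_encrypt by running the rounds backwards."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = (DELTA * ROUNDS) & MASK32
    for _ in range(ROUNDS):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        total = (total - DELTA) & MASK32
    return v0, v1


key = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)  # arbitrary example key
plain = (0xDEADBEEF, 0x0BADF00D)                        # arbitrary example block
cipher = tea_encrypt(plain, key)
assert tea_decrypt(cipher, key) == plain
print([hex(h) for h in split_imm16(DELTA)], PATCH_RAM_OPS)
```

[Each round uses only shifts, additions, and XORs on registers. That is the kind of operation mix that can plausibly fit into a few dozen triads, whereas a cipher with lookup tables would spend most of the budget loading constants.]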
0:34:04.510,0:34:08.539
Herald Angel: Thank you. Microphone number[br]1.
0:34:08.539,0:34:14.709
M1: So you said the microcode is used for[br]instruction decoding and it needs to emit
0:34:14.709,0:34:19.429
the micro-ops to the scheduler and micro[br]queue in some way. Did you find out how
0:34:19.429,0:34:27.519
that works?[br]Benjamin: In essence we are not actually
0:34:27.519,0:34:33.539
executing code inside the microcode engine.[br]From what we understand, the
0:34:33.539,0:34:38.569
microcode engine is just some kind of a[br]software based recipe, that describes how
0:34:38.569,0:34:43.479
to decode an instruction, so you don't[br]actually get execution, you just commit
0:34:43.479,0:34:47.269
instructions into the pipelines, that do[br]what you want. And because we have some
0:34:47.269,0:34:51.269
control flow possibility, that is actually[br]inside the microcode engine, because you
0:34:51.269,0:34:55.268
can branch to different addresses, you can[br]conditionally branch and loop. You kind of
0:34:55.268,0:34:59.089
get an execution, but in essence you just[br]commit stuff in the pipeline and the CPU
0:34:59.089,0:35:01.440
does what you tell it to.
0:35:04.240,0:35:07.161
Herald Angel: One more question.[br]Microphone number 2, please.
0:35:07.161,0:35:11.927
M2: How did you take the picture of the[br]internal CPU? Did you open it?
0:35:11.927,0:35:14.969
Benjamin: Yeah. We worked together with
0:35:14.969,0:35:19.680
Chris. He's our hardware guy. He has[br]access to his equipment to delayer it and
0:35:19.680,0:35:24.289
to take high resolution optical shots and[br]he also takes shots with a scanning
0:35:24.289,0:35:29.279
electron microscope. So I think about five[br]or six CPUs were harmed in the making of
0:35:29.279,0:35:30.357
this paper.
0:35:33.810,0:35:37.815
Herald Angel: So we have one more last[br]question. Microphone number 2 please.
0:35:39.248,0:35:41.390
M2: Are you aware of research done by
0:35:41.390,0:35:49.400
Christopher Domas, where he mapped out the[br]instruction set for x86 processors?
0:35:49.400,0:35:57.119
Benjamin: You mean sandsifter? We[br]actually talked with him and yeah we are
0:35:57.119,0:36:02.910
aware, that there's a map essentially of[br]the instruction set and also maybe you can
0:36:02.910,0:36:07.275
combine it, because in the beginning we[br]reverse engineered where certain x86
0:36:07.275,0:36:11.335
instructions are implemented in microcode.[br]So if you plug these two together you kind
0:36:11.335,0:36:15.170
of map out the whole microcode ROM at the[br]same time that you map out a whole
0:36:15.170,0:36:18.989
instruction set. However there are some[br]components of the microcode ROM that are
0:36:18.989,0:36:23.470
most likely not triggered by instructions.[br]For example it seems like power management
0:36:23.470,0:36:27.368
or everything that is behind a write MSR[br][wrmsr] or read MSR [rdmsr]. wrmsr is a
0:36:27.368,0:36:31.249
single instruction, but depending on the[br]arguments you give it it just branches to
0:36:31.249,0:36:36.442
totally different triads and the microcode[br]update itself is implemented in microcode. And
0:36:36.442,0:36:40.190
this one is a huge chunk you wouldn't even[br]find without brute forcing all
0:36:40.190,0:36:44.159
combinations for all instructions which is[br]not really feasible.
0:36:46.483,0:36:51.279
Herald Angel: Thank you. Thank you[br]Benjamin.
0:36:51.279,0:36:57.210
applause
0:36:57.210,0:37:01.811
35c3 postroll music
0:37:01.811,0:37:21.000
subtitles created by c3subtitles.de[br]in the years 2019-2020. Join, and help us!