WEBVTT
00:00:03.129 --> 00:00:07.360
35C3 preroll music
00:00:18.780 --> 00:00:23.869
Herald: So the next talk Benjamin Kollenda
and Philipp Koppe - they will refresh our
00:00:23.869 --> 00:00:30.529
memories because they already had a talk
on 34C3 where they talked about the micro
00:00:30.529 --> 00:00:37.580
code ROM and today they're gonna give us
more insights on how microcode works. And
00:00:37.580 --> 00:00:44.320
more details on the ROM itself. Benjamin
is a PhD student and has a focus on
00:00:44.320 --> 00:00:51.280
software attacks and defenses and together
with Philipp they will now abuse AMD
00:00:51.280 --> 00:00:55.190
microcode for fun and security. Please
enjoy.
00:00:55.190 --> 00:00:58.730
Applause
00:01:01.320 --> 00:01:06.260
Benjamin: Thank you. So as mentioned we
were able to reverse engineer the AMD
00:01:06.260 --> 00:01:11.599
microcode and the AMD microcode ROM and
I'm going to talk about our journey. What
00:01:11.599 --> 00:01:16.369
we learned on the way and how we did it.
So this is joint work with my colleagues at
00:01:16.369 --> 00:01:20.799
Ruhr-Universität Bochum and a quick outline
of how we are going to do it. We're going to
00:01:20.799 --> 00:01:25.380
start with a quick crash course on
microarchitectural basics and what microcode
00:01:25.380 --> 00:01:28.350
actually is. Then I talk about how we
reconstructed the
00:01:28.350 --> 00:01:30.330
microcode ROM and what we learned
00:01:30.330 --> 00:01:35.389
along the way. Then I quickly give some
examples of the applications we
00:01:35.389 --> 00:01:41.430
implemented with the knowledge we gained
from the second step. And lastly I talk about
00:01:41.430 --> 00:01:47.649
a framework we used. How it works and what
we can do with it. And also this framework
00:01:47.649 --> 00:01:51.899
is available on GitHub along with some
other tools so you're free to continue our
00:01:51.899 --> 00:01:57.189
work. OK. So when I'm talking about
microcode you can think of it essentially
00:01:57.189 --> 00:02:02.331
as a firmware for your processor. It
handles multiple purposes for example
00:02:02.331 --> 00:02:06.440
you can use it to fix CPU bugs that you
have in silicon and you want to fix later
00:02:06.440 --> 00:02:11.971
in the design phase. It is used for
instruction decoding - I cover this one a
00:02:11.971 --> 00:02:17.970
bit more. It is also used for exception
handling. For example, if an exception or
00:02:17.970 --> 00:02:22.200
interrupt is raised, microcode has a first
chance of modifying this interrupt
00:02:22.200 --> 00:02:27.110
ignoring it or just passing it along to
the operating system. It's also used for
00:02:27.110 --> 00:02:31.790
power management and some other complex
features like Intel SGX. And most
00:02:31.790 --> 00:02:37.318
importantly for us microcode is updatable.
This is used to patch errors in the field.
00:02:37.318 --> 00:02:40.975
Everyone remembers Spectre / Meltdown
patches and there's
00:02:40.975 --> 00:02:44.210
a microcode update. So your
00:02:44.210 --> 00:02:50.830
x86 CPU takes multiple steps to execute an
instruction. The first step is decoding
00:02:50.830 --> 00:02:55.022
an x86 instruction into multiple smaller
micro ops.
00:02:55.022 --> 00:02:57.150
These are then scheduled into the pipeline.
00:02:57.150 --> 00:03:01.632
From there, they are dispatched to
the different functional units
00:03:01.632 --> 00:03:03.532
like your ALU, AGU,
00:03:03.532 --> 00:03:06.392
multiplication and division units.
00:03:06.392 --> 00:03:08.355
For our purposes the decode step is the
00:03:08.355 --> 00:03:12.190
most interesting one. In the decode step
you have an instruction buffer that feeds
00:03:12.190 --> 00:03:17.030
instructions to some decoders. You have
short decoders that handle really simple
00:03:17.030 --> 00:03:21.100
instructions. There are long decoders that
can handle some more advanced instructions.
00:03:21.100 --> 00:03:25.260
And finally, the vector decoder. The
vector decoder handles the most complex
00:03:25.260 --> 00:03:29.690
instructions with the help of microcode.
So the microcode engine is essentially the
00:03:29.690 --> 00:03:31.247
vector decoder.
00:03:32.458 --> 00:03:36.570
The Microcode engine in essence
is composed of a microcode
00:03:36.570 --> 00:03:40.770
ROM that stores the instructions for the
microcode engine. Think of it as your
00:03:40.770 --> 00:03:48.190
standard instructions. Then there is also
a writeable memory the microcode RAM. This
00:03:48.190 --> 00:03:52.520
is where the microcode updates end up when
you apply microcode updates. And of course
00:03:52.520 --> 00:03:57.310
around the storage there is a whole lot of
things that make it actually run. For this
00:03:57.310 --> 00:04:00.860
talk, you only need to know what Match
Registers are. Match Registers are
00:04:00.860 --> 00:04:05.650
essentially breakpoint registers. So if we
write an address from inside the microcode
00:04:05.650 --> 00:04:10.670
ROM inside a Match Register whenever this
address is fetched, execution control is
00:04:10.670 --> 00:04:17.570
transferred to the microcode RAM so our
patch gets executed. And the microcode
00:04:17.570 --> 00:04:23.060
updates are usually loaded by the BIOS or
by the kernel. Linux has an update driver,
00:04:23.060 --> 00:04:28.340
sometimes the BIOS updates it with a
pre-installed version and they have a
00:04:28.340 --> 00:04:32.120
pretty simple structure, a partially
documented header, and followed by the
00:04:32.120 --> 00:04:37.730
actual microcode that is loaded inside the
CPU. And so microcode is organized in
00:04:37.730 --> 00:04:42.650
something called triads. Each triad has
three operations essentially x86
00:04:42.650 --> 00:04:48.230
instructions, but with some differences.
And lastly, you have a sequence word. The
00:04:48.230 --> 00:04:52.025
sequence word indicates which microcode
instructions should be executed next. We
00:04:52.025 --> 00:04:57.950
have options of executing just the next
triad, executing another one by branching
00:04:57.950 --> 00:05:01.936
to it, or just saying OK, I'm done with
decoding this instruction continue with
00:05:01.936 --> 00:05:07.490
x86 code. These updates are protected by
some weak authentication which we were
00:05:07.490 --> 00:05:13.260
able to break so we can create our own. We
can analyze existing ones and we can apply
00:05:13.260 --> 00:05:20.620
these to your standard laptop and desktop.
However there can only ever be one update
00:05:20.620 --> 00:05:26.534
loaded at a time and when you reboot
your machine this update will be gone.
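A minimal Python sketch of the triad and sequence-word model described above. The class names, the toy addresses, and the way a Match Register redirects a fetch into the patch RAM are illustrative assumptions, not the real microcode encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Seq(Enum):
    """The three sequence-word outcomes named in the talk."""
    NEXT = "next"          # continue with the following triad
    BRANCH = "branch"      # continue at an explicit triad address
    COMPLETE = "complete"  # done decoding, resume x86 execution


@dataclass
class Triad:
    ops: tuple                    # three micro-op mnemonics (placeholders here)
    seq: Seq                      # sequence word
    target: Optional[int] = None  # only used when seq is BRANCH


def fetch(addr, rom, ram, match_registers):
    """Return (source, triad). A Match Register hit redirects to patch RAM."""
    if addr in match_registers:
        return "ram", ram[match_registers[addr]]
    return "rom", rom[addr]


def trace(start, rom, ram, match_registers, limit=32):
    """Follow sequence words from `start` and record where each triad came from."""
    addr, visited = start, []
    for _ in range(limit):  # guard against endless loops in the toy model
        source, triad = fetch(addr, rom, ram, match_registers)
        visited.append((source, addr))
        if triad.seq is Seq.COMPLETE:
            break
        addr = triad.target if triad.seq is Seq.BRANCH else addr + 1
    return visited


# Toy ROM with three triads and a one-triad patch in RAM (hypothetical values).
rom = {
    0: Triad(("mov", "add", "nop"), Seq.NEXT),
    1: Triad(("srl", "nop", "nop"), Seq.BRANCH, 5),
    5: Triad(("nop", "nop", "nop"), Seq.COMPLETE),
}
ram = {0: Triad(("patched", "nop", "nop"), Seq.BRANCH, 5)}
```

With an empty set of Match Registers the trace stays in the ROM; with `{1: 0}` the fetch of ROM address 1 is served from the patch RAM instead.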
00:05:28.490 --> 00:05:32.990
Also for the talk we are going to look at
some microcode and we will present this
00:05:32.990 --> 00:05:38.150
microcode using a register transfer
language. It is heavily based on x86. I'm
00:05:38.150 --> 00:05:43.290
just going to cover the differences
between these two. Most importantly the
00:05:43.290 --> 00:05:48.650
microcode can have three operands for an
instruction in comparison to x86 which
00:05:48.650 --> 00:05:53.640
usually only has two. So you can specify a
destination and two source operands.
00:05:55.618 --> 00:05:56.446
Also,
00:05:57.210 --> 00:06:02.240
microcode has some certain bit flags that
need to be set and these we denote with
00:06:02.240 --> 00:06:07.449
these annotations; for example ".C" means
this instruction also updates the carry flag
00:06:07.449 --> 00:06:14.050
based on the result. Then you have the
instruction "jcc" which is a conditional
00:06:14.050 --> 00:06:19.570
branch and the first operand denotes the
condition upon which this branch is
00:06:19.570 --> 00:06:24.100
taken. In this case branch if the carry
flag is one and [the] second operand
00:06:24.100 --> 00:06:30.300
indicates the offset to add to the
instruction pointer. Then we also have
00:06:30.300 --> 00:06:35.760
some sequence word annotations: "next",
"complete", and "branch". Also it should
00:06:35.760 --> 00:06:39.958
be noted that the internal microcode
architecture is a load-store architecture.
00:06:39.958 --> 00:06:45.350
You can't use memory operands in other
instructions like you can on x86 you
00:06:45.350 --> 00:06:48.310
always need to load and store memory
explicitly.
00:06:49.190 --> 00:06:51.710
Now we are going to talk about
00:06:51.710 --> 00:06:58.710
how we manage to recover the microcode
ROM. The microcode ROM is baked into your
00:06:58.710 --> 00:07:06.860
CPU, you can't change it anymore. It is
defined in the silicon during the
00:07:06.860 --> 00:07:12.930
fabrication process and in this picture
you can see a die shot taken with a
00:07:12.930 --> 00:07:16.840
electron microscope and this is one of
three regions that contain the bits for
00:07:16.840 --> 00:07:23.240
the microcode operations. And if you zoom
in a bit more, each of these regions
00:07:23.240 --> 00:07:30.050
consists of four arrays and these are
further subdivided into blocks. Really
00:07:30.050 --> 00:07:34.660
interesting is "Array 2" which is a bit
smaller than the other ones but it has
00:07:34.660 --> 00:07:42.160
some structures above it which are of a
different visual layout. This is SRAM
00:07:42.160 --> 00:07:47.050
which stores the microcode update. So this
is runtime-reprogrammable memory that is
00:07:47.050 --> 00:07:53.860
still pretty fast. So the microcode RAM is
located right next to the microcode ROM
00:07:53.860 --> 00:07:57.645
which also makes sense from a design
standpoint.
00:08:00.445 --> 00:08:02.010
Just an overview of how we
00:08:02.010 --> 00:08:06.930
went ahead and how we went about. We
started with pictures and then we used
00:08:06.930 --> 00:08:11.456
some OCR-like process to transform them
into bit strings which we can then further
00:08:11.456 --> 00:08:17.169
process. These bitstrings were then
arranged into triads. We could already
00:08:17.169 --> 00:08:22.050
gather that we got individual triads
right because there were data dependencies
00:08:22.050 --> 00:08:27.550
all over the place, but between triads,
there were no or very few data
00:08:27.550 --> 00:08:33.699
dependencies so the ordering of the
triads was still wrong and this was a
00:08:33.699 --> 00:08:38.860
major problem when we went ahead and what
we had to reverse engineer and this is
00:08:38.860 --> 00:08:43.870
mapping a certain physical address of a
triad that we gathered from the ROM
00:08:43.870 --> 00:08:48.050
readout to a virtual address that is used
inside the microcode update or the
00:08:48.050 --> 00:08:53.690
microcode ROM. But after reverse engineering
this, you can just do a linear sweep
00:08:53.690 --> 00:08:59.020
disassembly of the microcode ROM and
arrive at human readable output. But this
00:08:59.020 --> 00:09:04.870
recovery was a bit tricky because we
required physical-virtual address pairs.
00:09:04.870 --> 00:09:09.520
But gathering these is a bit harder
because we worked through the
00:09:09.520 --> 00:09:14.040
available updates, but we could only find
two pairs of them. These pairs were
00:09:14.040 --> 00:09:18.520
actually easy to find because every update
replaces a certain triad inside your
00:09:18.520 --> 00:09:24.580
microcode ROM and this triad is usually
also placed in the microcode update. So by
00:09:24.580 --> 00:09:31.260
matching the address this update replaces
with the microcode ROM readout, you can just
00:09:31.260 --> 00:09:38.000
get your two data points. But we had to
get more data points so we generated these
00:09:38.000 --> 00:09:42.630
mappings by matching semantics of triads
in the microcode ROM readout and the
00:09:42.630 --> 00:09:47.779
semantics when we force execution of a
certain microcode address. And for gathering
00:09:47.779 --> 00:09:52.330
the semantics of the read-out microcode,
we implemented a simple microcode
00:09:52.330 --> 00:09:58.820
simulator. Essentially it works on triad
level, so you give it an input state and a
00:09:58.820 --> 00:10:03.430
triad and it calculates the output state
of it. Input and output state are
00:10:03.430 --> 00:10:08.460
composed of the x86 state, which is
your standard registers and also the
00:10:08.460 --> 00:10:12.320
internal microcode registers. There are
multiple temporary registers that get
00:10:12.320 --> 00:10:18.350
reset for every new x86 instruction that
is executed, but they can also be modified
00:10:18.350 --> 00:10:24.130
by microcode of course. Our emulator
supports all known arithmetic operations
00:10:24.130 --> 00:10:29.230
and we have a white-list of operations
that do not form or produce any observable
00:10:29.230 --> 00:10:32.950
change in state just so that we could
process more triads and gather more
00:10:32.950 --> 00:10:41.310
data points. In total we gathered 54
additional data-address pairs which turned
00:10:41.310 --> 00:10:46.649
out to be enough to recover the whole
mapping. This mapping, essentially you
00:10:46.649 --> 00:10:50.820
have the four different arrays that map to
individual blocks and these blocks in
00:10:50.820 --> 00:10:56.750
these arrays are then again permuted a bit
and then the triads inside these blocks
00:10:56.750 --> 00:11:02.330
have some table-based permutations. So
this is not an obfuscation. This is just
00:11:02.330 --> 00:11:07.680
from a hardware design standpoint it can
make sense to reroute it a bit differently.
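The semantic matching used to collect the extra address pairs can be sketched roughly like this: simulate every triad from the ROM readout on a known input state and keep the ones whose output agrees with what the hardware produced when execution of a given microcode address was forced. The tuple format for operations, the tiny opcode set, and the register names are assumptions for illustration and far simpler than the real simulator.

```python
MASK = 0xFFFFFFFF  # the toy model works on 32-bit values


def simulate(triad, state):
    """Apply the operations of one triad to a copy of `state`.

    Each operation is (opcode, destination, source1, source2),
    mirroring the three-operand form of microcode.
    """
    out = dict(state)
    for op, dst, a, b in triad:
        if op == "add":
            out[dst] = (out[a] + out[b]) & MASK
        elif op == "xor":
            out[dst] = out[a] ^ out[b]
        elif op == "mov":
            out[dst] = out[a]
        elif op == "nop":
            pass  # whitelisted: no observable change in state
        else:
            raise ValueError(f"unsupported operation: {op}")
    return out


def match_candidates(readout, input_state, observed, visible=("eax", "ebx", "ecx")):
    """Return physical addresses whose simulated output matches the observation."""
    hits = []
    for phys, triad in readout.items():
        result = simulate(triad, input_state)
        if all(result[r] == observed[r] for r in visible):
            hits.append(phys)
    return hits


# Hypothetical readout with two triads at physical positions 0x10 and 0x11.
readout = {
    0x10: [("add", "t1", "eax", "ebx"), ("xor", "t2", "t1", "eax"), ("mov", "ecx", "t2", None)],
    0x11: [("add", "t1", "eax", "ebx"), ("mov", "ecx", "t1", None), ("nop", None, None, None)],
}
input_state = {"eax": 1, "ebx": 2, "ecx": 0, "t1": 0, "t2": 0}
observed = {"eax": 1, "ebx": 2, "ecx": 3}  # what the forced execution produced
```

A unique hit gives one physical-virtual address pair; several hits mean the input state has to be varied until only one candidate remains.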
00:11:09.330 --> 00:11:14.629
Also now that we can actually
map a certain address to the microcode ROM
00:11:14.629 --> 00:11:19.093
readout and we know the addresses of
different x86 instructions from our
00:11:19.093 --> 00:11:24.240
earlier experiments, we can look at the
implementation of instructions. So let's
00:11:24.240 --> 00:11:29.130
start with a pretty simple one. Shift-
Right-Double which essentially takes a
00:11:29.130 --> 00:11:33.250
register, shifts it by a given amount and
shifts in bits from another register. So
00:11:33.250 --> 00:11:38.180
of course you would expect a lot of shifts
and rolls in its implementation and this
00:11:38.180 --> 00:11:45.338
is exactly what we're seeing here. You
have two shift-right operations and you can
00:11:45.338 --> 00:11:50.830
see regmd6 and regmd4. These are
placeholders. The microcode engine can
00:11:50.830 --> 00:11:55.630
replace certain bit combinations with the
registers that are used in the x86
00:11:55.630 --> 00:12:01.560
operation. For example this one would be
replaced by ECX or EAX depending on what
00:12:01.560 --> 00:12:08.339
you wrote in x86. And at this point we can
also already gather more information about
00:12:08.339 --> 00:12:13.601
microcode than we previously knew because
we know "OK, so this is source, this is
00:12:13.601 --> 00:12:18.529
also a source and this is a destination".
But this source which indicates the shift
00:12:18.529 --> 00:12:22.750
amount, this one was previously unknown,
because it is a high temporary microcode
00:12:22.750 --> 00:12:28.279
register and we found out that these
usually serve specific, different
00:12:28.279 --> 00:12:31.800
purposes. If you write to
them, sometimes the CPU behaves
00:12:31.800 --> 00:12:35.890
erratically, sometimes it crashes,
sometimes nothing happens. But in this
00:12:35.890 --> 00:12:40.300
case, this seems to be the shift count,
and the shift count is given by a third
00:12:40.300 --> 00:12:45.279
operand in the instruction. So in this
case, we already learned "OK, if you want
00:12:45.279 --> 00:12:51.380
to read the third operand of an
instruction, we need to read t41". And
00:12:51.380 --> 00:12:56.236
this is how we went about recovering more
and more information about microcode. The
00:12:56.236 --> 00:13:00.160
rest of the implementation is essentially
concerned with implementing the rest of
00:13:00.160 --> 00:13:05.721
the semantics of the x86 instruction and
updating the flags correctly. OK, so now
00:13:05.721 --> 00:13:11.980
let's look at an instruction that is a
bit more complicated. If you check out
00:13:11.980 --> 00:13:19.620
rdtsc. rdtsc returns an internal cycle
counter in EDX and EAX, so the upper part
00:13:19.620 --> 00:13:25.520
ends up in EDX, lower part in EAX. So in
the end we want to see writes to these
00:13:25.520 --> 00:13:30.760
registers, potentially with a shift
somewhere in there. But somewhere the CPU
00:13:30.760 --> 00:13:37.570
needs to gather the cycle counter. So in
the beginning we have two load-style
00:13:37.570 --> 00:13:41.410
operations. This one is a proper load
which we identified and this one is
00:13:41.410 --> 00:13:48.569
unknown. But despite that we do not know
the instruction, we know the target
00:13:48.569 --> 00:13:52.720
because the result of this instruction
will end up in t9 and the result of this
00:13:52.720 --> 00:13:58.060
instruction will end up in t10, so we can
follow the uses of these two registers. So
00:13:58.060 --> 00:14:04.450
for simplicity I'm going to start with t10
and t10, which we later found out, this is
00:14:04.450 --> 00:14:09.730
another register which essentially denotes
a specific internal register. And if you
00:14:09.730 --> 00:14:15.450
play around with these bits you notice
that this combination encodes cr4. The x86
00:14:15.450 --> 00:14:22.987
will just see cr4. You can also address
cr1 and cr2. And if you look further, t10
00:14:22.987 --> 00:14:29.160
is then ANDed with this bitmask and if
you look in the manual you find out that
00:14:29.160 --> 00:14:34.930
this bit in cr4 denotes the bit that
determines whether rdtsc is
00:14:34.930 --> 00:14:40.019
available from user space or not. So this
is the check if this instruction should be
00:14:40.019 --> 00:14:48.170
executed. So now let's just keep in mind
that t9 holds some other loaded value from
00:14:48.170 --> 00:14:53.930
some other internal register and we will
come back to this one a bit later. For
00:14:53.930 --> 00:14:58.848
now, let's follow execution. This triad is
essentially a padding triad. It is a
00:14:58.848 --> 00:15:04.885
common pattern we see. So let's look at
where this branch takes us.
00:15:05.895 --> 00:15:07.180
And this branch
00:15:07.180 --> 00:15:15.959
takes us to a conditional branch
triad. And if you look a bit up, this AND
00:15:15.959 --> 00:15:21.740
instruction actually updated this flag. So
this is a conditional branch that
00:15:21.740 --> 00:15:26.360
determines whether this check was
successful or not. So it branches toward
00:15:26.360 --> 00:15:32.570
the error triad or the success triad. But
here we already see the exit. We see a
00:15:32.570 --> 00:15:41.170
write to RDX or EDX in this case with a
shift from t9 by 32 bits, which is exactly
00:15:41.170 --> 00:15:45.910
what you would expect to write the time
stamp counter on the upper 32 bits of the
00:15:45.910 --> 00:15:50.829
time stamp counter to edx. And you have an
unknown instruction, but we know, okay, we
00:15:50.829 --> 00:15:57.877
move something from t9 to eax, which is
the lower 32 bits. But we're not done
00:15:57.877 --> 00:16:02.690
here, because we can still look at the
error path that is taken if the access is
00:16:02.690 --> 00:16:09.210
denied. So if you scroll a bit down we can
see a move of an immediate into a certain
00:16:09.210 --> 00:16:14.530
internal register. And this immediate
actually encodes a general protection
00:16:14.530 --> 00:16:21.790
fault interrupt code. D denotes to the
exception handler that this was a general
00:16:21.790 --> 00:16:28.680
protection fault. And later this triad
branches to this address, and if you look
00:16:28.680 --> 00:16:34.013
at the uses of this address we can find
other immediates that also correspond
00:16:34.013 --> 00:16:36.962
to x86 instructions. So now we learned
00:16:36.962 --> 00:16:39.947
how we can actually raise our
own interrupts. We
00:16:39.947 --> 00:16:46.100
just need to load the code we want into
the specific register and branch to this
00:16:46.100 --> 00:16:52.820
address. And now we learned a lot about
how we can actually write microcode, but
00:16:52.820 --> 00:16:57.000
it's also interesting to see how certain
instructions are implemented. So let's
00:16:57.000 --> 00:17:03.671
look at a pretty complicated one: wrmsr
(Write MSR). wrmsr essentially writes some
00:17:03.671 --> 00:17:08.449
data it is given to a model-specific
register. This model-specific register
00:17:08.449 --> 00:17:12.980
differs between CPUs, between vendors,
sometimes between revisions. And these
00:17:12.980 --> 00:17:17.910
implement non-standard extensions or
pretty complex features. For example, you
00:17:17.910 --> 00:17:23.949
trigger a microcode update by writing to a
model-specific register. The register
00:17:23.949 --> 00:17:30.570
address you want to write to is given in
ecx. And now we can see ecx is read and
00:17:30.570 --> 00:17:39.679
it is shifted by sixteen bits to t10. So
again, we follow uses of t10 and we see
00:17:39.679 --> 00:17:46.070
it is XOR'd with a certain bitmask. And
this bitmask is C000, which actually
00:17:46.070 --> 00:17:52.429
denotes a namespace of the model specific
registers. In this case this should be an
00:17:52.429 --> 00:17:58.450
AMD-specific namespace. And, of course,
this one again sets some flags, and you
00:17:58.450 --> 00:18:04.240
can see your conditional branch depending
on these flags to what should be the
00:18:04.240 --> 00:18:06.235
handler for this namespace.
00:18:06.695 --> 00:18:10.770
Next one: We have another XOR
that uses a different bit
00:18:10.770 --> 00:18:16.890
mask — in this case C001. C001 is the
namespace where the microcode update
00:18:16.890 --> 00:18:25.050
routine is actually located in. So again,
we branch to this handler. And if you just
00:18:25.050 --> 00:18:31.010
continue on, there are more operations on
rcx, followed by more branches, and this
00:18:31.010 --> 00:18:35.790
continues until everything is dispatched
to the correct handler. And this is how,
00:18:35.790 --> 00:18:40.340
internally, wrmsr is implemented, and also
Read MSR is going to be implemented pretty
00:18:40.340 --> 00:18:43.640
similarly, because it implements some kind
of similar thing.
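The namespace dispatch described for wrmsr can be sketched in Python as follows. The shift by sixteen bits and the XOR against C000 and C001 come from the walkthrough above; the handler names and the fallback case are hypothetical, and the real microcode continues with further checks on the low bits.

```python
def msr_namespace_handler(ecx: int) -> str:
    """Pick a handler from the upper 16 bits of the MSR address in ecx."""
    namespace = (ecx >> 16) & 0xFFFF  # "ecx is shifted by sixteen bits to t10"

    # XOR yields zero exactly when the namespace equals the mask,
    # which is what the flag-setting XOR plus conditional branch tests.
    if (namespace ^ 0xC000) == 0:
        return "handler_c000"  # hypothetical name for the C000 namespace handler
    if (namespace ^ 0xC001) == 0:
        return "handler_c001"  # hypothetical name; the update routine lives here
    return "handler_other"     # everything else is dispatched further on
```

For example, an MSR address of `0xC0010020` selects the C001 handler, while `0x00000010` falls through to the remaining dispatch.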
00:18:47.750 --> 00:18:49.190
OK, so now I showed you
00:18:49.190 --> 00:18:52.470
how we actually went about
reconstructing the knowledge we
00:18:52.470 --> 00:18:57.939
currently have. And now I'm going to show
you what we can actually do with it. And
00:18:57.939 --> 00:19:02.440
for this I am going to quickly cover what
applications we wrote in microcode. We
00:19:02.440 --> 00:19:04.940
wrote a simple configurable
rdtsc precision.
00:19:04.940 --> 00:19:07.710
This means a certain bit mask is AND'd to
00:19:07.710 --> 00:19:11.890
the result of rdtsc, so you can
reduce the accuracy of it, which can
00:19:11.890 --> 00:19:18.284
sometimes prevent timing attacks. We also
implemented microcode-assisted address
00:19:18.284 --> 00:19:23.260
sanitizer, which I'll cover quickly in a
second. We also have some basic microcode
00:19:23.260 --> 00:19:29.070
instruction set randomization. Some
microcode-assisted instrumentation. What
00:19:29.070 --> 00:19:33.520
this means is, you can write a filter for
your instrumentation in microcode itself.
00:19:33.520 --> 00:19:37.580
So instead of hooking an instruction,
instead of debugging your code or
00:19:37.580 --> 00:19:42.160
emulating it, you can just say whenever
the instruction is executed filter if this
00:19:42.160 --> 00:19:47.180
is relevant for me, and if it is, call my
x86 handler — entirely in microcode,
00:19:47.180 --> 00:19:52.470
without changing the instruction in the
RAM. We also implemented some basic
00:19:52.470 --> 00:20:00.000
authenticated microcode updates. The usual
update mechanism is weak — that's how we
00:20:00.000 --> 00:20:05.430
got our foot in the door in the first
place. So we improved upon it a bit. Also
00:20:05.430 --> 00:20:09.799 line:1
we found out that microcode actually has
some enclave-like features because once
00:20:09.799 --> 00:20:13.730
we're executing in Microcode, your kernel
can't interrupt you, your hypervisor can't
00:20:13.730 --> 00:20:18.610
interrupt you and any state you want
visible to the outside world you actually
00:20:18.610 --> 00:20:22.840
need to write explicitly. So all these
microcode internal registers are not
00:20:22.840 --> 00:20:26.600
accessible from the outside world. So any
computation you perform in microcode
00:20:26.600 --> 00:20:30.360
cannot be interfered with. So you can
implement a simple enclave on top of this
00:20:30.360 --> 00:20:37.039
one. So our hardware-assisted address
sanitizer variant is based on the work by
00:20:37.039 --> 00:20:41.970
the original authors and address sanitizer
is a software instrumentation that detects
00:20:41.970 --> 00:20:47.070
invalid memory access by using a shadow
map, a shadow memory, to just say which memory
00:20:47.070 --> 00:20:50.746
is valid to be read and written to.
00:20:50.746 --> 00:20:53.840
The authors proposed hardware
address sanitizer
00:20:53.840 --> 00:20:59.011
which is essentially doing the same checks
but using a new instruction. And the
00:20:59.011 --> 00:21:03.940
instruction should raise a fault if an
invalid access is detected. This algorithm
00:21:03.940 --> 00:21:07.670
they proposed - The details are not
important. What is important is in
00:21:07.670 --> 00:21:12.080
essence: It's pretty simple. You load from
a certain address, perform some operations
00:21:12.080 --> 00:21:18.816
on it and if there is the shadow after
these operations you just report a bug.
00:21:18.816 --> 00:21:24.910
Advantages of hardware address sanitizer
are for example you get better performance
00:21:24.910 --> 00:21:29.170
out of it. Because you only have a single
instruction maybe you can do some fancy
00:21:29.170 --> 00:21:34.450
tricks inside your CPU that are faster
than using x86 instructions, you get more
00:21:34.450 --> 00:21:38.880
compact code and you have the possibility
of one time configuration which is a bit
00:21:38.880 --> 00:21:45.210
hard with software address sanitizer. We
implemented our hardware address sanitizer
00:21:45.210 --> 00:21:49.270
variant by replacing the bound instruction.
Bound is an old instruction that is no
00:21:49.270 --> 00:21:54.870
longer used by compilers because in fact
it is slower to use bound instead of
00:21:54.870 --> 00:21:58.901
performing the checks with multiple x86
instructions. We changed the interface.
00:21:58.901 --> 00:22:04.090
The first argument is the register which
holds the address you want to access. And
00:22:04.090 --> 00:22:07.835
the second argument holds the size you
want this access to be.
00:22:07.835 --> 00:22:11.050
So, 1 byte, 2 byte and so on.
00:22:11.050 --> 00:22:14.950
This instruction is a no-op if the
check succeeds. So if there is no bug it
00:22:14.950 --> 00:22:19.980
just continues on like nothing happened.
However if we detect an invalid access we
00:22:19.980 --> 00:22:25.359
can take a configurable action, we can for
example just raise your normal page fault
00:22:25.359 --> 00:22:29.630
or we can raise a bound interrupt, which
is a custom interrupt, that only denotes
00:22:29.630 --> 00:22:34.299
this one or we can branch to an x86
handler that either performs additional
00:22:34.299 --> 00:22:39.760
checking, for example whitelisting, or it
generates a pretty error report for you.
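The check itself can be sketched as follows, using the shadow-byte rule from the published software AddressSanitizer design (one shadow byte per 8 application bytes). The dictionary standing in for shadow memory, the function names, and the callback used as the configurable action are assumptions for illustration; the talk does not show the actual microcode.

```python
def access_is_valid(addr: int, size: int, shadow: dict) -> bool:
    """Shadow-memory check for an access of `size` bytes at `addr`.

    Shadow byte k for an 8-byte granule:
      k == 0  -> all 8 bytes are addressable
      1..7    -> only the first k bytes are addressable
      k < 0   -> the whole granule is poisoned
    """
    k = shadow.get(addr >> 3, 0)  # missing entries count as fully addressable
    return not (k != 0 and (addr & 7) + size > k)


def checked_access(addr: int, size: int, shadow: dict, on_invalid):
    """Behave like the replaced bound instruction in this toy model:
    do nothing if the check succeeds, otherwise take the configured action."""
    if access_is_valid(addr, size, shadow):
        return None
    return on_invalid(addr, size)
```

The `on_invalid` callback stands in for the configurable action: raising a fault, raising the bound interrupt, or branching to an x86 handler that whitelists or reports the access.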
00:22:41.340 --> 00:22:47.480
Most importantly this is a single
instruction. We also do not dirty any x86
00:22:47.480 --> 00:22:52.690
registers because there are some
intermediate results. You need to store
00:22:52.690 --> 00:22:56.360
these somewhere and this you usually do in
the x86 registers. So you increase
00:22:56.360 --> 00:23:00.010
register pressure. Maybe you cause
spilling. So overall your performance gets
00:23:00.010 --> 00:23:07.230
worse. We also found out that we are
actually faster than doing the checking
00:23:07.230 --> 00:23:12.390
using x86 instructions. So just by moving
the implementation from x86 level to
00:23:12.390 --> 00:23:16.805
microcode, which in some way is still kind
of like software, we already improved the
00:23:16.805 --> 00:23:22.160
performance. Also on top of this you get
better cache utilization because you have
00:23:22.160 --> 00:23:27.020
fewer instructions, there are fewer bytes in
the cache, so we get fuller cache lines.
00:23:27.020 --> 00:23:31.630
And also it is really easy to tell which
is testing code and which is your actual
00:23:31.630 --> 00:23:40.080
program code. Lastly I'm going to show you
just a rough overview of our framework
00:23:40.080 --> 00:23:45.920
which we used during our development and
which you can also find on GitHub. Early
00:23:45.920 --> 00:23:50.079 line:1
on we found out that we are probably going
to need to test a lot of microcode
00:23:50.079 --> 00:23:55.640 line:1
updates, because in the beginning you just
throw everything at the CPU and see how it
00:23:55.640 --> 00:24:01.400 line:1
behaves and we wanted to do this in
parallel. So we developed a small custom
00:24:01.400 --> 00:24:07.180
OS called "Angry OS" and deployed it to
mainboards. These mainboards are just old
00:24:07.180 --> 00:24:13.270
AMD mainboards. All these mainboards were
hooked up via serial for communication and
00:24:13.270 --> 00:24:19.400
GPIO to a Raspberry Pi. With the GPIO you
can reset, support power on, power down
00:24:19.400 --> 00:24:23.890
and just have remote control of this
mainboard and then you can connect to that
00:24:23.890 --> 00:24:28.719
Raspberry Pi from anywhere on earth and
just deploy and play around with it.
00:24:28.719 --> 00:24:30.640
This was the first version.
00:24:30.640 --> 00:24:34.490
In the beginning we
didn't really know much about electronics
00:24:34.490 --> 00:24:38.520
so we used one Raspberry Pi per mainboard.
And it turns out Raspberry Pis are more
00:24:38.520 --> 00:24:43.970
expensive than these old mainboards, but
we improved upon this and now we're down
00:24:43.970 --> 00:24:48.007
to one Raspberry Pi for
four or five setups.
00:24:48.007 --> 00:24:51.587
For example you only need 3 GPIO ports per
00:24:51.587 --> 00:24:57.358
mainboard. You connect each of these to
optocouplers just to separate the voltage
00:24:57.358 --> 00:25:01.860
levels and then you connect one side of
the optocoupler to the GPIO the other side
00:25:01.860 --> 00:25:05.909
to your reset pin, to your power pin and
for input to know whether your board is up
00:25:05.909 --> 00:25:11.230
or down you connect the power LED. And
that way you can save a lot of space, a
00:25:11.230 --> 00:25:17.205
lot of money. And also if you're really
constrained you can just remove the power
00:25:17.205 --> 00:25:23.530
LED sensing because usually you know
the state your setup is in. As I
00:25:23.530 --> 00:25:28.230
already said we wrote our custom operating
system and it is intentionally really
00:25:28.230 --> 00:25:32.659
really minimal because the major feature
we wanted is control over every
00:25:32.659 --> 00:25:36.740
instruction that's going to be executed
from a certain point on, because we're
00:25:36.740 --> 00:25:40.780
playing around with instruction encoding
and if we execute an instruction that we
00:25:40.780 --> 00:25:45.530
did not intend we might crash the CPU, we
might go into an invalid state and we do
00:25:45.530 --> 00:25:50.850
not even know which instruction caused it.
And Angry OS essentially only listens on
00:25:50.850 --> 00:26:00.150
the serial port for something to do. What
it can do is apply an update. These
00:26:00.150 --> 00:26:04.820
updates are just microcode updates. They
are streamed via serial. We can also
00:26:04.820 --> 00:26:10.039
stream x86 code which is then run by Angry
OS and this is just so that we do not need
00:26:10.039 --> 00:26:14.409
to reflash the USB stick every time we
want to update our testing code and the
00:26:14.409 --> 00:26:19.280
result, all the errors are reported back
to the Raspberry Pi and thus they are
00:26:19.280 --> 00:26:26.852
forwarded to us. The framework we use most
importantly has the microcode assembler
00:26:26.852 --> 00:26:30.713
and a pretty verbose disassembler. This
disassembler generates the output I showed
00:26:30.713 --> 00:26:36.919
you earlier and using this you can just
quickly write your own microcode. We also
00:26:36.919 --> 00:26:42.245
included an x86 assembler because we
wanted to rapidly test different x86
00:26:42.245 --> 00:26:47.730
testing codes. Using this framework we
were able to disassemble the existing
00:26:47.730 --> 00:26:53.500
updates and we also used it to disassemble
our ROM after we reordered it and also
00:26:53.500 --> 00:27:01.169
during the process when we fed it to our
emulator. And we can also create the
00:27:01.169 --> 00:27:07.909
proper binary files that can be loaded by
the Linux kernel driver. We modified the
00:27:07.909 --> 00:27:12.777
stock one to just load any update you give
it without checking if it's the correct
00:27:12.777 --> 00:27:20.060
CPU ID and all these things just for
testing purposes. It's also available. And
00:27:20.060 --> 00:27:25.740
also of course the framework can control
Angry OS to make your testing easier. And
00:27:25.740 --> 00:27:29.650
we implemented a pretty basic remote
execution wrapper, so you can work on a
00:27:29.650 --> 00:27:33.389
remote Raspberry Pi as if you were using
it locally.
00:27:34.809 --> 00:27:36.799
And this brings me to the end
00:27:36.799 --> 00:27:40.800
of the talk. And in conclusion we can say
reversing the ROM opened up a lot of new
00:27:40.800 --> 00:27:44.809
possibilities. We learned a lot about how
microcode works. We learned about how to
00:27:44.809 --> 00:27:49.720
actually use it properly instead of just
inferring from a really small dataset,
00:27:49.720 --> 00:27:55.060
that we have from the updates, or from the
random bit strings we send to the CPU and
00:27:55.060 --> 00:27:59.530
observe what happened. But there's a lot
left to do. So if you really want to hack
00:27:59.530 --> 00:28:04.089
on it, just get in contact, we are happy
to share our findings with you. And as I
00:28:04.089 --> 00:28:09.009
said, the framework, Angry OS, example
programs, that we implemented, and some
00:28:09.009 --> 00:28:13.850
other stuff like the wiring is available
on GitHub. So that's that. And we are
00:28:13.850 --> 00:28:16.809
happy to answer any questions you might
have.
00:28:16.809 --> 00:28:22.234
Applause
00:28:24.910 --> 00:28:28.438
Herald Angel: Thank you very much. So we
00:28:28.438 --> 00:28:34.260
have 10 minutes for questions please line
up at the microphones. We start with this
00:28:34.260 --> 00:28:39.220
one: microphone number 2.
M2: Hi. Thanks for a nice talk. A few
00:28:39.220 --> 00:28:42.780
questions about your hardware address
sanitizer.
00:28:42.780 --> 00:28:49.830
Benjamin: Mhm
M2: As I understand you don't need the
00:28:49.830 --> 00:28:56.010
source code instrumentation because the
microcode is responsible for checking the
00:28:56.010 --> 00:29:02.929
shadow memory, right?
Benjamin: No... The original hardware
00:29:02.929 --> 00:29:07.950
sanitizer implementation is also based on
a compiler extension, that inserts a new
00:29:07.950 --> 00:29:12.200
instruction because it doesn't exist
usually. And it also inserts a bootstrap
00:29:12.200 --> 00:29:18.049
code that inits your shadow map and
also instruments your allocators to update
00:29:18.049 --> 00:29:23.020
the shadow map during runtime and we
essentially need the same component, but
00:29:23.020 --> 00:29:26.850
we do not need the software address
sanitizer component that essentially
00:29:26.850 --> 00:29:33.740
inserts 10 or 20 x86 instructions before
every memory access. So yes we still need
00:29:33.740 --> 00:29:37.647
a compile time component and we are still
source code based in a sense.
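NOTE
The split Benjamin describes, where the allocator keeps a shadow map up to date and a single check runs per memory access, can be sketched in Python. This is only an illustration under assumptions: the 8-bytes-per-shadow-byte layout is the classic software address sanitizer encoding, not something stated in the talk, and the names `ShadowMemory`, `unpoison`, `poison` and `is_valid` are hypothetical. In the talk's prototype the per-access check is performed by the microcode behind the modified bound instruction rather than by inserted x86 code.

```python
SHADOW_SCALE = 3   # assumption: 8 application bytes per shadow byte
POISONED = 0xFF    # assumption: reads as -1 when taken as a signed byte


class ShadowMemory:
    """Toy shadow map for a flat address space of `size` bytes."""

    def __init__(self, size):
        # Bootstrap step: everything starts out poisoned.
        self.shadow = bytearray([POISONED]) * ((size + 7) >> SHADOW_SCALE)

    def unpoison(self, addr, length):
        """Allocator hook: mark [addr, addr + length) as addressable."""
        assert addr % 8 == 0, "allocations are assumed 8-byte aligned"
        full, rest = divmod(length, 8)
        base = addr >> SHADOW_SCALE
        for i in range(full):
            self.shadow[base + i] = 0      # all 8 bytes valid
        if rest:
            self.shadow[base + full] = rest  # only the first `rest` bytes valid

    def poison(self, addr, length):
        """Free hook: mark the 8-byte granules covering the range invalid."""
        assert addr % 8 == 0
        base = addr >> SHADOW_SCALE
        for i in range((length + 7) // 8):
            self.shadow[base + i] = POISONED

    def is_valid(self, addr, size):
        """Per-access check for an access of 1..8 bytes within one granule."""
        s = self.shadow[addr >> SHADOW_SCALE]
        if s == 0:
            return True
        if s >= 0x80:                       # negative signed byte: poisoned
            return False
        return (addr & 7) + size <= s       # partially valid granule


mem = ShadowMemory(64)
mem.unpoison(16, 13)           # a 13-byte allocation at address 16
print(mem.is_valid(16, 8))     # access inside the allocation
print(mem.is_valid(24, 8))     # access running past the 13th byte
```

The point of the design is that `is_valid` is the only part that runs on every access, which is why moving it out of compiler-inserted x86 code matters for the micro benchmark mentioned later.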
00:29:39.388 --> 00:29:45.600
Herald: And, so..
M2: And I didn't see, maybe I missed the
00:29:45.600 --> 00:29:51.299
numbers. How much faster is it than this
initial version?
00:29:51.299 --> 00:29:56.419
Benjamin: You mean the initial hardware
sanitizer version or the software address
00:29:56.419 --> 00:29:59.900
sanitizer.
M2: I mean let's say custom kernel address
00:29:59.900 --> 00:30:05.180
sanitizer for Linux kernel which is the
usual one and your approach.
00:30:05.180 --> 00:30:10.270
Benjamin: We only performed a micro
benchmark on Angry OS and we essentially
00:30:10.270 --> 00:30:16.059
took the instrumentation as emitted by the
compiler for some memory access which is
00:30:16.059 --> 00:30:20.590
your standard software address sanitizer
and compared it to our version using only
00:30:20.590 --> 00:30:24.640
the modified bound instruction. So I
really can't talk about how it compares to
00:30:24.640 --> 00:30:28.820
KASAN or something or some like real world
implementation, because we only have the
00:30:28.820 --> 00:30:34.069
prototype and the basic instrumentation.
M2: Thank you very much.
00:30:34.069 --> 00:30:36.490
Herald Angel: OK. Microphone number 4
please.
00:30:36.490 --> 00:30:51.145
M4: Hey thanks for the talk and did you
find any weird microcode
00:30:51.145 --> 00:31:00.529
implementations. I don't mean security
wise, just like you rarely expected to
00:31:00.529 --> 00:31:07.330
see it be implemented that way.
00:31:09.040 --> 00:31:11.700
Benjamin: The problem is there's a lot of
00:31:11.700 --> 00:31:20.270
microcode to begin with. You have f000
triads. Each of which has 3 op-codes. So
00:31:20.270 --> 00:31:25.003
you have a lot of ground to cover and also
we have read-out errors. Sometimes you are
00:31:25.003 --> 00:31:29.169
seeing bit flips, which kind of slows you
down because you then need to always
00:31:29.169 --> 00:31:32.820
consider: OK, maybe this register is
something else, maybe this address is
00:31:32.820 --> 00:31:37.420
wrong. And also sometimes you have a dust
particle that kind of knocks out an
00:31:37.420 --> 00:31:42.550
entire region. So we only looked at the
components, we were pretty sure that we
00:31:42.550 --> 00:31:46.520
recovered correctly, and we only looked
at a really tiny subset compared to all of
00:31:46.520 --> 00:31:52.940
the microcode ROM. It's just not feasible
to do and to go through it and look at
00:31:52.940 --> 00:31:57.330
everything. So no we didn't find anything
funny but we also wouldn't know what funny
00:31:57.330 --> 00:32:00.790
looks like because we don't know what the
official spec for microcode is.
00:32:01.180 --> 00:32:03.990
M4: Thanks.
Herald Angel: Interesting. We have one
00:32:04.034 --> 00:32:05.809
question from the Internet, from the
00:32:05.809 --> 00:32:09.792
Signal Angel please.
Signal Angel: Yes. Which AMD CPU
00:32:09.792 --> 00:32:15.510
generations does this apply to?
Benjamin: Yeah this is still based on the
00:32:15.510 --> 00:32:21.289
work of our first talk and this only works
on pretty old ones: K8, K10. So until,
00:32:21.289 --> 00:32:26.940
CPUs produced until 2013. Yeah this was
the last year AMD produced anything like
00:32:26.940 --> 00:32:32.520
that. Newer ones use some public key based
cryptography from what we can tell and we
00:32:32.520 --> 00:32:36.559
haven't yet managed to break it. Same goes
for Intel, they seem to be using public
00:32:36.559 --> 00:32:39.919
key cryptography and we haven't gotten a
foot in the door yet.
00:32:40.989 --> 00:32:44.789
Herald Angel: Thank you. We go one around.
On microphone number 3 please.
00:32:44.789 --> 00:32:51.290
M3: Yeah. Thank you. I would like to know
how complex could the microcode programs
00:32:51.290 --> 00:32:59.159
be, that you could write. So what's the
complexity of new operations you could
00:32:59.159 --> 00:33:03.300
implement.
Benjamin: The only limiting factor is the
00:33:03.300 --> 00:33:07.923
size of your microcode update RAM. But
this one is really really limited.
00:33:07.923 --> 00:33:12.679
For example on K8, where we performed the
majority of our experiments. We are
00:33:12.679 --> 00:33:19.050
limited to 32 triads, which comes down to
96 instructions and you also
00:33:19.050 --> 00:33:22.440
have some constraints on these
instructions for example the next triad
00:33:22.440 --> 00:33:27.809
will always be executed no matter what.
Some operations can only go at the second
00:33:27.809 --> 00:33:33.859
slot. Some can only go on another slot, so
it's really really hard. And you're also
00:33:33.859 --> 00:33:38.930
limited from our knowledge to loading 16
bit immediates instead of 32 bit or even
00:33:38.930 --> 00:33:44.470
64 bit immediates. So your whole program
grows really fast if you're trying to do
00:33:44.470 --> 00:33:49.400
something complex. For example our
authenticated microcode update mechanism
00:33:49.400 --> 00:33:54.440
is the most complex one we wrote; it nearly
fills out the RAM and we used TEA – Tiny
00:33:54.440 --> 00:33:58.700
Encryption Algorithm – because that was
the only one we managed to fit mostly due
00:33:58.700 --> 00:34:04.510
to S-box and other constants we would need
to load. So it's really small.
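NOTE
To make the size argument concrete, here is a minimal Python sketch of the standard published TEA block cipher together with a helper that splits a constant into 16-bit pieces. It is not the microcode from the talk: the talk does not say which TEA parameters were used, and how microcode combines 16-bit immediates into wider values is not described, so `split_imm16` is a hypothetical helper that only counts how many immediates a constant would need at minimum.

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9          # the single magic constant of standard TEA
ROUNDS = 32
TRIADS = 32                 # K8 update RAM size quoted in the talk
OPS_PER_TRIAD = 3
OP_BUDGET = TRIADS * OPS_PER_TRIAD


def split_imm16(value, width):
    """Split a `width`-bit constant into 16-bit chunks, low chunk first."""
    return [(value >> shift) & 0xFFFF for shift in range(0, width, 16)]


def tea_encrypt(v0, v1, key):
    """Encrypt one 64-bit block (two 32-bit words) with a 128-bit key."""
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(ROUNDS):
        total = (total + DELTA) & MASK32
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1


def tea_decrypt(v0, v1, key):
    """Inverse of tea_encrypt: run the rounds backwards."""
    k0, k1, k2, k3 = key
    total = (DELTA * ROUNDS) & MASK32
    for _ in range(ROUNDS):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        total = (total - DELTA) & MASK32
    return v0, v1


key = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)  # arbitrary example key
block = (0xDEADBEEF, 0x0BADF00D)
assert tea_decrypt(*tea_encrypt(*block, key), key) == block

print(OP_BUDGET)                             # operations available in 32 triads
print(len(split_imm16(DELTA, 32)))           # 16-bit immediates for DELTA
print(len(split_imm16(0x0123456789ABCDEF, 64)))  # immediates for a 64-bit value
```

TEA needs only shifts, additions, XORs and one 32-bit constant, whereas a cipher with an S-box would have to spend many of the available operations just loading table entries 16 bits at a time.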
00:34:04.510 --> 00:34:08.539
Herald Angel: Thank you. Microphone number
1.
00:34:08.539 --> 00:34:14.709
M1: So you said the microcode is used for
instruction decoding and it needs to emit
00:34:14.709 --> 00:34:19.429
the micro-ops to the scheduler and micro
queue in some way. Did you find out how
00:34:19.429 --> 00:34:27.519
that works?
Benjamin: In essence we are not actually
00:34:27.519 --> 00:34:33.539
executing code inside the microcode engine.
From what we understand, the
00:34:33.539 --> 00:34:38.569
microcode engine is just some kind of a
software based recipe, that describes how
00:34:38.569 --> 00:34:43.479
to decode an instruction, so you don't
actually get execution, you just commit
00:34:43.479 --> 00:34:47.269
instructions into the pipelines, that do
what you want. And because we have some
00:34:47.269 --> 00:34:51.269
control flow possibility, that is actually
inside the microcode engine, because you
00:34:51.269 --> 00:34:55.268
can branch to different addresses, you can
conditionally branch and loop. You kind of
00:34:55.268 --> 00:34:59.089
get an execution, but in essence you just
commit stuff in the pipeline and the CPU
00:34:59.089 --> 00:35:01.440
does what you tell it to.
00:35:04.240 --> 00:35:07.161
Herald Angel: One more question.
Microphone number 2, please.
00:35:07.161 --> 00:35:11.927
M2: How did you take the picture of the
internal CPU? Did you open it?
00:35:11.927 --> 00:35:14.969
Benjamin: Yeah. We worked together with
00:35:14.969 --> 00:35:19.680
Chris. He's our hardware guy. He has
access to his equipment to delayer it and
00:35:19.680 --> 00:35:24.289
to take high resolution optical shots and
he also takes shots with a scanning
00:35:24.289 --> 00:35:29.279
electron microscope. So I think about five
or six CPUs were harmed in the making of
00:35:29.279 --> 00:35:30.357
this paper.
00:35:33.810 --> 00:35:37.815
Herald Angel: So we have one more last
question. Microphone number 2 please.
00:35:39.248 --> 00:35:41.390
M2: Are you aware of research done by
00:35:41.390 --> 00:35:49.400
Christopher Domas, where he mapped out the
instruction set for x86 processors?
00:35:49.400 --> 00:35:57.119
Benjamin: You mean sandsifter? We
actually talked with him and yeah we are
00:35:57.119 --> 00:36:02.910
aware, that there's a map essentially of
the instruction set and also maybe you can
00:36:02.910 --> 00:36:07.275
combine it, because in the beginning we
reverse engineered where certain x86
00:36:07.275 --> 00:36:11.335
instructions are implemented in microcode.
So if you plug these two together you kind
00:36:11.335 --> 00:36:15.170
of map out the whole microcode ROM at the
same time that you map out a whole
00:36:15.170 --> 00:36:18.989
instruction set. However there are some
components of the microcode ROM that are
00:36:18.989 --> 00:36:23.470
most likely not triggered by instructions.
For example it seems like power management
00:36:23.470 --> 00:36:27.368
or everything that is behind a write MSR
[wrmsr] or read MSR [rdmsr]. wrmsr is a
00:36:27.368 --> 00:36:31.249
single instruction, but depending on the
arguments you give it it just branches to
00:36:31.249 --> 00:36:36.442
totally different triads and the microcode
update itself is implemented in microcode. And
00:36:36.442 --> 00:36:40.190
this one is a huge chunk you wouldn't even
find without brute forcing all
00:36:40.190 --> 00:36:44.159
combinations for all instructions which is
not really feasible.
00:36:46.483 --> 00:36:51.279
Herald Angel: Thank you. Thank you
Benjamin.
00:36:51.279 --> 00:36:57.210
Applause
00:36:57.210 --> 00:37:01.811
35C3 postroll music
00:37:01.811 --> 00:37:21.000
subtitles created by c3subtitles.de
in the years 2019-2020. Join, and help us!