1
00:00:03,129 --> 00:00:07,360
35C3 preroll music
2
00:00:18,780 --> 00:00:23,869
Herald: So the next talk Benjamin Kollenda
and Philipp Koppe - they will refresh our
3
00:00:23,869 --> 00:00:30,529
memories because they already had a talk
at 34C3 where they talked about the micro
4
00:00:30,529 --> 00:00:37,580
code ROM and today they're gonna give us
more insights on how microcode works. And
5
00:00:37,580 --> 00:00:44,320
more details on the ROM itself. Benjamin
is a PhD student and has a focus on
6
00:00:44,320 --> 00:00:51,280
software attacks and defenses and together
with Philipp they will now abuse AMD
7
00:00:51,280 --> 00:00:55,190
microcode for fun and security. Please
enjoy.
8
00:00:55,190 --> 00:00:58,730
Applause
9
00:01:01,320 --> 00:01:06,260
Benjamin: Thank you. So as mentioned we
were able to reverse engineer the AMD
10
00:01:06,260 --> 00:01:11,599
microcode and the AMD microcode ROM and
I'm going to talk about our journey. What
11
00:01:11,599 --> 00:01:16,369
we learned on the way and how we did it.
So this is joint work with my colleagues at
12
00:01:16,369 --> 00:01:20,799
Ruhr-Universität Bochum and a quick outline
of how we are going to do it. We're going to
13
00:01:20,799 --> 00:01:25,380
start with a quick crash course on
microarchitectural basics and what microcode
14
00:01:25,380 --> 00:01:28,350
actually is. Then I talk about how we
reconstructed the
15
00:01:28,350 --> 00:01:30,330
microcode ROM and what we learned
16
00:01:30,330 --> 00:01:35,389
along the way. Then I quickly give some
examples of the applications we
17
00:01:35,389 --> 00:01:41,430
implemented with the knowledge we gained
from the second step. And lastly I talk about
18
00:01:41,430 --> 00:01:47,649
a framework we used. How it works and what
we can do with it. And also this framework
19
00:01:47,649 --> 00:01:51,899
is available on GitHub along with some
other tools so you're free to continue our
20
00:01:51,899 --> 00:01:57,189
work. OK. So when I'm talking about
microcode you can think of it essentially
21
00:01:57,189 --> 00:02:02,331
as a firmware for your processor. It
handles multiple purposes for example
22
00:02:02,331 --> 00:02:06,440
you can use it to fix CPU bugs that you
have in silicon and you want to fix later
23
00:02:06,440 --> 00:02:11,971
in the design phase. It is used for
instruction decoding - I cover this one a
24
00:02:11,971 --> 00:02:17,970
bit more. It is also used for exception
handling. For example, if an exception or
25
00:02:17,970 --> 00:02:22,200
interrupt is raised, microcode has a first
chance of modifying this interrupt
26
00:02:22,200 --> 00:02:27,110
ignoring it or just passing it along to
the operating system. It's also used for
27
00:02:27,110 --> 00:02:31,790
power management and some other complex
features like Intel SGX. And most
28
00:02:31,790 --> 00:02:37,318
importantly for us microcode is updatable.
This is used to patch errors in the field.
29
00:02:37,318 --> 00:02:40,975
Everyone remembers Spectre / Meltdown
patches and there's
30
00:02:40,975 --> 00:02:44,210
a microcode update. So your
31
00:02:44,210 --> 00:02:50,830
x86 CPU takes multiple steps to execute an
instruction. The first step is decoding
32
00:02:50,830 --> 00:02:55,022
an x86 instruction into multiple smaller
micro ops.
33
00:02:55,022 --> 00:02:57,150
These are then scheduled into the pipeline.
34
00:02:57,150 --> 00:03:01,632
From there, they are dispatched to
the different functional units
35
00:03:01,632 --> 00:03:03,532
like your ALU / AGU,
36
00:03:03,532 --> 00:03:06,392
multiplication and division units.
37
00:03:06,392 --> 00:03:08,355
For our purposes the decode step is the
38
00:03:08,355 --> 00:03:12,190
most interesting one. In the decode step
you have an instruction buffer that feeds
39
00:03:12,190 --> 00:03:17,030
instructions to some decoders. You have
short decoders that handle really simple
40
00:03:17,030 --> 00:03:21,100
instructions. There are long decoders that
can handle some more advanced instructions.
41
00:03:21,100 --> 00:03:25,260
And finally, the vector decoder. The
vector decoder handles the most complex
42
00:03:25,260 --> 00:03:29,690
instructions with the help of microcode.
So the microcode engine is essentially the
43
00:03:29,690 --> 00:03:31,247
vector decoder.
44
00:03:32,458 --> 00:03:36,570
The microcode engine in essence
is composed of a microcode
45
00:03:36,570 --> 00:03:40,770
ROM that stores the instructions for the
microcode engine. Think of it as your
46
00:03:40,770 --> 00:03:48,190
standard instructions. Then there is also
a writeable memory the microcode RAM. This
47
00:03:48,190 --> 00:03:52,520
is where the microcode updates end up when
you apply microcode updates. And of course
48
00:03:52,520 --> 00:03:57,310
around the storage there is a whole lot of
things that make it actually run. For this
49
00:03:57,310 --> 00:04:00,860
talk, you only need to know what
Match Registers are. Match Registers are
50
00:04:00,860 --> 00:04:05,650
essentially breakpoint registers. So if we
write an address from inside the microcode
51
00:04:05,650 --> 00:04:10,670
ROM inside a Match Register whenever this
address is fetched, execution control is
52
00:04:10,670 --> 00:04:17,570
transferred to the microcode RAM so our
patch gets executed. And the microcode
53
00:04:17,570 --> 00:04:23,060
updates are usually loaded by the BIOS or
by the kernel. Linux has an update driver,
54
00:04:23,060 --> 00:04:28,340
sometimes the BIOS updates it with a
pre-installed version and they have a
55
00:04:28,340 --> 00:04:32,120
pretty simple structure: a partially
documented header, followed by the
56
00:04:32,120 --> 00:04:37,730
actual microcode that is loaded inside the
CPU. And so microcode is organized in
57
00:04:37,730 --> 00:04:42,650
something called triads. Each triad has
three operations, essentially x86
58
00:04:42,650 --> 00:04:48,230
instructions, but with some differences.
And lastly, you have a sequence word. The
59
00:04:48,230 --> 00:04:52,025
sequence word indicates which microcode
instructions should be executed next. We
60
00:04:52,025 --> 00:04:57,950
have options of executing just the next
triad, executing another one by branching
61
00:04:57,950 --> 00:05:01,936
to it, or just saying OK, I'm done with
decoding this instruction continue with
62
00:05:01,936 --> 00:05:07,490
x86 code. These updates are protected by
some weak authentication which we were
63
00:05:07,490 --> 00:05:13,260
able to break so we can create our own. We
can analyze existing ones and we can apply
64
00:05:13,260 --> 00:05:20,620
these to your standard laptop and desktop.
However there can only ever be one update
65
00:05:20,620 --> 00:05:26,534
loaded at a time and when you reboot
your machine this update will be gone.
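The triad, sequence word and Match Register mechanics described above can be sketched as a toy model. This is only an illustration under stated assumptions: the address layout (`RAM_BASE`), the micro-op names and the `run` helper are hypothetical and do not reflect the real encoding.

```python
from dataclasses import dataclass
from enum import Enum


class Seq(Enum):
    NEXT = "next"          # continue with the following triad
    BRANCH = "branch"      # continue at an explicit target address
    COMPLETE = "complete"  # decoding done, return to x86 code


@dataclass
class Triad:
    ops: tuple             # three micro-op mnemonics (placeholders)
    seq: Seq               # sequence word: what to execute next
    target: int = 0        # only meaningful for Seq.BRANCH


RAM_BASE = 0x1000          # hypothetical start of patch RAM addresses


def run(rom, ram, match_registers, entry, max_steps=64):
    """Follow sequence words from `entry`; return the executed addresses."""
    trace = []
    addr = entry
    for _ in range(max_steps):
        # A Match Register hit redirects the fetch into the patch RAM.
        addr = match_registers.get(addr, addr)
        triad = ram[addr] if addr >= RAM_BASE else rom[addr]
        trace.append(addr)
        if triad.seq is Seq.COMPLETE:
            break
        addr = triad.target if triad.seq is Seq.BRANCH else addr + 1
    return trace


rom = {
    0: Triad(("op_a", "op_b", "op_c"), Seq.NEXT),
    1: Triad(("op_d", "nop", "nop"), Seq.COMPLETE),
}
ram = {RAM_BASE: Triad(("patched", "nop", "nop"), Seq.BRANCH, target=1)}

print(run(rom, ram, {}, entry=0))             # unpatched flow
print(run(rom, ram, {0: RAM_BASE}, entry=0))  # ROM address 0 is hooked
```

With no Match Register set the flow stays in the ROM; with address 0 hooked, the patch triad runs first and then branches back into the ROM.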
66
00:05:28,490 --> 00:05:32,990
Also for the talk we are going to look at
some microcode and we will present this
67
00:05:32,990 --> 00:05:38,150
microcode using a register transfer
language. It is heavily based on x86. I'm
68
00:05:38,150 --> 00:05:43,290
just going to cover the differences
between these two. Most importantly the
69
00:05:43,290 --> 00:05:48,650
microcode can have three operands for an
instruction in comparison to x86 which
70
00:05:48,650 --> 00:05:53,640
usually only has two. So you can specify a
destination and two source operands.
71
00:05:55,618 --> 00:05:56,446
Also,
72
00:05:57,210 --> 00:06:02,240
microcode has some certain bit flags that
need to be set, and these we denote with
73
00:06:02,240 --> 00:06:07,449
these annotations, for example ".C" means
this instruction also updates the carry flag
74
00:06:07,449 --> 00:06:14,050
based on the result. Then you have the
instruction "jcc" which is a conditional
75
00:06:14,050 --> 00:06:19,570
branch and the first operand denotes the
condition upon which this branch is
76
00:06:19,570 --> 00:06:24,100
taken. In this case branch if the carry
flag is one and [the] second operand
77
00:06:24,100 --> 00:06:30,300
indicates the offset to add to the
instruction pointer. Then we also have
78
00:06:30,300 --> 00:06:35,760
some sequence word annotations: "next",
"complete", and "branch". Also it should
79
00:06:35,760 --> 00:06:39,958
be noted that the internal microcode
architecture is a load-store architecture.
80
00:06:39,958 --> 00:06:45,350
You can't use memory operands in other
instructions like you can on x86; you
81
00:06:45,350 --> 00:06:48,310
always need to load and store memory
explicitly.
82
00:06:49,190 --> 00:06:51,710
Now we are going to talk about
83
00:06:51,710 --> 00:06:58,710
how we managed to recover the microcode
ROM. The microcode ROM is baked into your
84
00:06:58,710 --> 00:07:06,860
CPU, you can't change it anymore. It is
defined in the silicon during the
85
00:07:06,860 --> 00:07:12,930
fabrication process and in this picture
you can see a die shot taken with an
86
00:07:12,930 --> 00:07:16,840
electron microscope and this is one of
three regions that contain the bits for
87
00:07:16,840 --> 00:07:23,240
the microcode operations. And if you zoom
in a bit more, each of these regions
88
00:07:23,240 --> 00:07:30,050
consists of four arrays and these are
further subdivided into blocks. Really
89
00:07:30,050 --> 00:07:34,660
interesting is "Array 2" which is a bit
smaller than the other ones but it has
90
00:07:34,660 --> 00:07:42,160
some structures above it which are of a
different visual layout. This is SRAM
91
00:07:42,160 --> 00:07:47,050
which stores the microcode update. So this
is one-time reprogrammable memory that is
92
00:07:47,050 --> 00:07:53,860
still pretty fast. So the microcode RAM is
located right next to the microcode ROM
93
00:07:53,860 --> 00:07:57,645
which also makes sense from a design
standpoint.
94
00:08:00,445 --> 00:08:02,010
Just an overview of how we
95
00:08:02,010 --> 00:08:06,930
went ahead and how we went about it. We
started with pictures and then we used
96
00:08:06,930 --> 00:08:11,456
some OCR-like process to transform them
into bit strings which we can then further
97
00:08:11,456 --> 00:08:17,169
process. These bitstrings were then
arranged into triads. We could already
98
00:08:17,169 --> 00:08:22,050
gather that we got individual triads
right because there were data dependencies
99
00:08:22,050 --> 00:08:27,550
all over the place, but between triads,
there were no or very few data
100
00:08:27,550 --> 00:08:33,699
dependencies so the ordering of the
triads was still wrong and this was a
101
00:08:33,699 --> 00:08:38,860
major problem when we went ahead. What
we had to reverse engineer was
102
00:08:38,860 --> 00:08:43,870
mapping a certain physical address of a
triad that we gathered from the ROM
103
00:08:43,870 --> 00:08:48,050
readout to a virtual address that is used
inside the microcode update or the
104
00:08:48,050 --> 00:08:53,690
microcode ROM. But after reverse engineering
this, you can just do a linear sweep
105
00:08:53,690 --> 00:08:59,020
disassembly of the microcode ROM and
arrive at human readable output. But this
106
00:08:59,020 --> 00:09:04,870
recovery was a bit tricky because we
required physical-virtual address pairs.
107
00:09:04,870 --> 00:09:09,520
But gathering these is a bit harder
because we worked through the
108
00:09:09,520 --> 00:09:14,040
available updates, but we could only find
two pairs of them. These pairs were
109
00:09:14,040 --> 00:09:18,520
actually easy to find because every update
replaces a certain triad inside your
110
00:09:18,520 --> 00:09:24,580
microcode ROM and this triad is usually
also placed in the microcode update. So by
111
00:09:24,580 --> 00:09:31,260
matching the address this update replaces
with the microcode ROM readout, you can just
112
00:09:31,260 --> 00:09:38,000
get your two data points. But we had to
get more data points so we generated these
113
00:09:38,000 --> 00:09:42,630
mappings by matching semantics of triads
in the microcode ROM readout and the
114
00:09:42,630 --> 00:09:47,779
semantics when we force execution of a
certain microcode address. And for gathering
115
00:09:47,779 --> 00:09:52,330
the semantics of the read-out microcode,
we implemented a simple microcode
116
00:09:52,330 --> 00:09:58,820
simulator. Essentially it works on triad
level, so you give it an input state and a
117
00:09:58,820 --> 00:10:03,430
triad and it calculates the output state
of it. Input and output state are
118
00:10:03,430 --> 00:10:08,460
composed of the x86 state, which is
your standard registers and also the
119
00:10:08,460 --> 00:10:12,320
internal microcode registers. There are
multiple temporary registers that get
120
00:10:12,320 --> 00:10:18,350
reset for every new x86 instruction that
is executed, but they can also be modified
121
00:10:18,350 --> 00:10:24,130
by microcode of course. Our emulator
supports all known arithmetic operations
122
00:10:24,130 --> 00:10:29,230
and we have a white-list of operations
that do not form or produce any observable
123
00:10:29,230 --> 00:10:32,950
change in state just so that we could
process more triads and give them more
124
00:10:32,950 --> 00:10:41,310
data points. In total we gathered 54
additional data-address pairs which turned
125
00:10:41,310 --> 00:10:46,649
out to be enough to recover the whole
mapping. This mapping, essentially you
126
00:10:46,649 --> 00:10:50,820
have the four different arrays that map to
individual blocks and these blocks in
127
00:10:50,820 --> 00:10:56,750
these arrays are then again permuted a bit
and then the triads inside these blocks
128
00:10:56,750 --> 00:11:02,330
have some table-based permutations. So
this is not an obfuscation. This is just
129
00:11:02,330 --> 00:11:07,680
from a hardware design standpoint it can
make sense to reroute it a bit differently.
130
00:11:09,330 --> 00:11:14,629
Also now that we can actually
map a certain address to the microcode ROM
131
00:11:14,629 --> 00:11:19,093
readout and we know the addresses of
different x86 instructions from our
132
00:11:19,093 --> 00:11:24,240
earlier experiments, we can look at the
implementation of instructions. So let's
133
00:11:24,240 --> 00:11:29,130
start with a pretty simple one. Shift-
Right-Double which essentially takes a
134
00:11:29,130 --> 00:11:33,250
register, shifts it by a given amount and
shifts in bits from another register. So
135
00:11:33,250 --> 00:11:38,180
of course you would expect a lot of shifts
and rolls in its implementation and this
136
00:11:38,180 --> 00:11:45,338
is exactly what we're seeing here. You
have two shift-right operands and you can
137
00:11:45,338 --> 00:11:50,830
see regmd6 and regmd4. These are
placeholders. The microcode engine can
138
00:11:50,830 --> 00:11:55,630
replace certain bit combinations with the
registers that are used in the x86
139
00:11:55,630 --> 00:12:01,560
operation. For example this one would be
replaced by ECX or EAX depending on what
140
00:12:01,560 --> 00:12:08,339
you wrote in x86. And at this point we can
also already gather more information about
141
00:12:08,339 --> 00:12:13,601
microcode than we previously knew because
we know "OK, so this is source, this is
142
00:12:13,601 --> 00:12:18,529
also a source and this is a destination".
But this source which indicates the shift
143
00:12:18,529 --> 00:12:22,750
amount, this one was previously unknown,
because it is a high temporary microcode
144
00:12:22,750 --> 00:12:28,279
register and we found out that these
usually serve specific, different
145
00:12:28,279 --> 00:12:31,800
purposes. They are not - if you write to
them, sometimes the CPU behaves
146
00:12:31,800 --> 00:12:35,890
erratically, sometimes it crashes,
sometimes nothing happens. But in this
147
00:12:35,890 --> 00:12:40,300
case, this seems to be the shift count,
and the shift count is given by a third
148
00:12:40,300 --> 00:12:45,279
operand in the instruction. So in this
case, we already learned "OK, if you want
149
00:12:45,279 --> 00:12:51,380
to read the third operand of an
instruction, we need to read t41". And
150
00:12:51,380 --> 00:12:56,236
this is how we went about recovering more
and more information about microcode. The
151
00:12:56,236 --> 00:13:00,160
rest of the implementation is essentially
concerned with implementing the rest of
152
00:13:00,160 --> 00:13:05,721
the semantics of the x86 instruction and
updating the flags correctly. OK, so now
153
00:13:05,721 --> 00:13:11,980
let's look at an instruction that is a
bit more complicated. If you check out
154
00:13:11,980 --> 00:13:19,620
rdtsc. rdtsc returns an internal cycle
counter in EDX and EAX, so the upper part
155
00:13:19,620 --> 00:13:25,520
ends up in EDX, lower part in EAX. So in
the end we want to see writes to these
156
00:13:25,520 --> 00:13:30,760
registers, potentially with a shift
somewhere in there. But somewhere the CPU
157
00:13:30,760 --> 00:13:37,570
needs to gather the cycle counter. So in
the beginning we have two load-style
158
00:13:37,570 --> 00:13:41,410
operations. This one is a proper load
which we identified and this one is
159
00:13:41,410 --> 00:13:48,569
unknown. But despite that we do not know
the instruction, we know the target
160
00:13:48,569 --> 00:13:52,720
because the result of this instruction
will end up in t9 and the result of this
161
00:13:52,720 --> 00:13:58,060
instruction will end up in t10, so we can
follow the uses of these two registers. So
162
00:13:58,060 --> 00:14:04,450
for simplicity I'm going to start with t10
and t10, which we later found out, this is
163
00:14:04,450 --> 00:14:09,730
another register which essentially denotes
a specific internal register. And if you
164
00:14:09,730 --> 00:14:15,450
play around with these bits you notice
that this combination encodes cr4. The x86
165
00:14:15,450 --> 00:14:22,987
will just see cr4. You can also address
cr1 and cr2. And if you look further, t10
166
00:14:22,987 --> 00:14:29,160
is then ANDed with this bit mask and if
you look in the manual you find out that
167
00:14:29,160 --> 00:14:34,930
this bit in cr4 denotes the bit that
determines whether rdtsc is
168
00:14:34,930 --> 00:14:40,019
available from user space or not. So this
is the check if this instruction should be
169
00:14:40,019 --> 00:14:48,170
executed. So now let's just keep in mind
that t9 holds some other loaded value from
170
00:14:48,170 --> 00:14:53,930
some other internal register and we will
come back to this one a bit later. For
171
00:14:53,930 --> 00:14:58,848
now, let's follow execution. This triad is
essentially a padding triad. It is a
172
00:14:58,848 --> 00:15:04,885
common pattern we see. So let's look at
where this branch takes us.
173
00:15:05,895 --> 00:15:07,180
And this branch
174
00:15:07,180 --> 00:15:15,959
takes us to a conditional branch
triad. And if you look a bit up, this AND
175
00:15:15,959 --> 00:15:21,740
instruction actually updated this flag. So
this is a conditional branch that
176
00:15:21,740 --> 00:15:26,360
determines whether this check was
successful or not. So it branches toward
177
00:15:26,360 --> 00:15:32,570
the error triad or the success triad. But
here we already see the exit. We see a
178
00:15:32,570 --> 00:15:41,170
write to RDX or EDX in this case with a
shift from t9 by 32 bits, which is exactly
179
00:15:41,170 --> 00:15:45,910
what you would expect to write the time
stamp counter on the upper 32 bits of the
180
00:15:45,910 --> 00:15:50,829
time stamp counter to edx. And you have an
unknown instruction, but we know, okay, we
181
00:15:50,829 --> 00:15:57,877
move something from t9 to eax, which is
the lower 32 bits. But we're not done
182
00:15:57,877 --> 00:16:02,690
here, because we can still look at the
error path that is taken if the access is
183
00:16:02,690 --> 00:16:09,210
denied. So if you scroll a bit down we can
see a move of an immediate into a certain
184
00:16:09,210 --> 00:16:14,530
internal register. And this immediate
actually encodes a general protection
185
00:16:14,530 --> 00:16:21,790
fault interrupt code. D denotes to the
exception handler that this was a general
186
00:16:21,790 --> 00:16:28,680
protection fault. And later this triad
branches to this address, and if you look
187
00:16:28,680 --> 00:16:34,013
at the uses of this address we can find
other immediates that also correspond
188
00:16:34,013 --> 00:16:36,962
to x86 instructions. So now we learned
189
00:16:36,962 --> 00:16:39,947
how we can actually raise our
own interrupts. We
190
00:16:39,947 --> 00:16:46,100
just need to load the code we want into
the specific register and branch to this
191
00:16:46,100 --> 00:16:52,820
address. And now we learned a lot about
how we can actually write microcode, but
192
00:16:52,820 --> 00:16:57,000
it's also interesting to see how certain
instructions are implemented. So let's
193
00:16:57,000 --> 00:17:03,671
look at a pretty complicated one: wrmsr
(Write MSR). wrmsr essentially writes some
194
00:17:03,671 --> 00:17:08,449
data it is given to a machine specific
register. This machine specific register
195
00:17:08,449 --> 00:17:12,980
differs between CPUs, between vendors,
sometimes between revisions. And these
196
00:17:12,980 --> 00:17:17,910
implement non-standard extensions or
pretty complex features. For example, you
197
00:17:17,910 --> 00:17:23,949
trigger a microcode update by writing to a
machine specific register. The register
198
00:17:23,949 --> 00:17:30,570
address you want to write to is given in
ecx. And now we can see ecx is read and
199
00:17:30,570 --> 00:17:39,679
it is shifted by sixteen bits to t10. So
again, we follow uses of t10 and we see
200
00:17:39,679 --> 00:17:46,070
it is XOR'd with a certain bitmask. And
this bitmask is C000, which actually
201
00:17:46,070 --> 00:17:52,429
denotes a namespace of the model specific
registers. In this case this should be an
202
00:17:52,429 --> 00:17:58,450
AMD-specific namespace. And, of course,
this one again sets some flags, and you
203
00:17:58,450 --> 00:18:04,240
can see your conditional branch depending
on these flags to what should be the
204
00:18:04,240 --> 00:18:06,235
handler for this namespace.
205
00:18:06,695 --> 00:18:10,770
Next one: We have another XOR
that uses a different bit
206
00:18:10,770 --> 00:18:16,890
mask — in this case C001. C001 is the
namespace where the microcode update
207
00:18:16,890 --> 00:18:25,050
routine is actually located in. So again,
we branch to this handler. And if you just
208
00:18:25,050 --> 00:18:31,010
continue on, there are more operations on
rcx, followed by more branches, and this
209
00:18:31,010 --> 00:18:35,790
continues until everything is dispatched
to the correct handler. And this is how,
210
00:18:35,790 --> 00:18:40,340
internally, wrmsr is implemented, and also
Read MSR is going to be implemented pretty
211
00:18:40,340 --> 00:18:43,640
similarly, because it does a similar kind
of thing.
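The wrmsr dispatch walked through above (shift ecx right by sixteen bits, XOR with a namespace constant, branch if the result is zero) can be sketched in a few lines. This is a rough illustration only: the handler names are hypothetical placeholders, and the real microcode continues with further checks that are not modeled here.

```python
NAMESPACE_C000 = 0xC000  # namespace constant from the first XOR in the talk
NAMESPACE_C001 = 0xC001  # namespace containing the microcode update routine


def dispatch_wrmsr(ecx):
    """Pick a handler from the upper 16 bits of the MSR address in ecx."""
    t10 = (ecx >> 16) & 0xFFFF           # mirrors the shift into t10
    if (t10 ^ NAMESPACE_C000) == 0:      # XOR sets the zero flag on a match
        return "handler_c000"            # hypothetical handler name
    if (t10 ^ NAMESPACE_C001) == 0:
        return "handler_c001"            # hypothetical handler name
    return "other_handlers"              # remaining checks are not modeled


print(dispatch_wrmsr(0xC0010020))
print(dispatch_wrmsr(0x00000010))
```

The XOR-then-branch pattern is simply an equality test: the result is zero exactly when the upper half of ecx equals the namespace constant.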
212
00:18:47,750 --> 00:18:49,190
OK, so now I showed you
213
00:18:49,190 --> 00:18:52,470
how we actually went about
reconstructing the knowledge we
214
00:18:52,470 --> 00:18:57,939
currently have. And now I'm going to show
you what we can actually do with it. And
215
00:18:57,939 --> 00:19:02,440
for this I am going to quickly cover what
applications we wrote in microcode. We
216
00:19:02,440 --> 00:19:04,940
wrote a simple configurable
rdtsc precision.
217
00:19:04,940 --> 00:19:07,710
This means a certain bit mask is AND'd to
218
00:19:07,710 --> 00:19:11,890
the result of rdtsc, so you can
reduce the accuracy of it, which can
219
00:19:11,890 --> 00:19:18,284
sometimes prevent timing attacks. We also
implemented microcode-assisted address
220
00:19:18,284 --> 00:19:23,260
sanitizer, which I'll cover quickly in a
second. We also have some basic microcode
221
00:19:23,260 --> 00:19:29,070
instruction set randomization. Some
microcode-assisted instrumentation. What
222
00:19:29,070 --> 00:19:33,520
this means is, you can write a filter for
your instrumentation in microcode itself.
223
00:19:33,520 --> 00:19:37,580
So instead of hooking an instruction,
instead of debugging your code or
224
00:19:37,580 --> 00:19:42,160
emulating it, you can just say whenever
the instruction is executed, filter whether this
225
00:19:42,160 --> 00:19:47,180
is relevant for me, and if it is, call my
x86 handler — entirely in microcode,
226
00:19:47,180 --> 00:19:52,470
without changing the instruction in the
RAM. We also implemented some basic
227
00:19:52,470 --> 00:20:00,000
authenticated microcode updates. The usual
update mechanism is weak — that's how we
228
00:20:00,000 --> 00:20:05,430
got our foot in the door in the first
place. So we improved upon it a bit. Also
229
00:20:05,430 --> 00:20:09,799
we found out that microcode actually has
some enclave-like features because once
230
00:20:09,799 --> 00:20:13,730
we're executing in microcode, your kernel
can't interrupt you, your hypervisor can't
231
00:20:13,730 --> 00:20:18,610
interrupt you and any state you want
visible to the outside world, you actually
232
00:20:18,610 --> 00:20:22,840
need to write explicitly. So all these
microcode internal registers are not
233
00:20:22,840 --> 00:20:26,600
accessible from the outside world. So any
computation you perform in microcode
234
00:20:26,600 --> 00:20:30,360
cannot be interfered with. So you can
implement a simple enclave on top of this
235
00:20:30,360 --> 00:20:37,039
one. So our hardware-assisted address
sanitizer variant is based on the work by
236
00:20:37,039 --> 00:20:41,970
the original authors and address sanitizer
is a software instrumentation that detects
237
00:20:41,970 --> 00:20:47,070
invalid memory accesses by using a shadow
map (shadow memory) that says which memory
238
00:20:47,070 --> 00:20:50,746
is valid to be read and written to.
239
00:20:50,746 --> 00:20:53,840
The authors proposed hardware
address sanitizer
240
00:20:53,840 --> 00:20:59,011
which is essentially doing the same checks
but using a new instruction. And the
241
00:20:59,011 --> 00:21:03,940
instruction should raise a fault if an
invalid access is detected. This algorithm
242
00:21:03,940 --> 00:21:07,670
they proposed - The details are not
important. What is important is in
243
00:21:07,670 --> 00:21:12,080
essence: It's pretty simple. You load from
a certain address, perform some operations
244
00:21:12,080 --> 00:21:18,816
on it and if the shadow value is set after
these operations you just report a bug.
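The check summarized above can be sketched with the generic shadow-memory test used by software address sanitizer. This is not the microcode from the talk: the dictionary-backed shadow map, the zero `SHADOW_OFFSET` and the function names are assumptions made for illustration.

```python
SHADOW_SCALE = 3    # one shadow byte describes 8 bytes of application memory
SHADOW_OFFSET = 0   # hypothetical; real setups use a fixed nonzero offset


def shadow_address(addr):
    """Map an application address to the address of its shadow byte."""
    return (addr >> SHADOW_SCALE) + SHADOW_OFFSET


def is_invalid_access(shadow, addr, size):
    """Return True if an access of `size` bytes (1..8) at `addr` is a bug."""
    k = shadow.get(shadow_address(addr), 0)  # load the shadow value
    if k == 0:
        return False                         # all 8 bytes are addressable
    # k in 1..7: only the first k bytes are valid; negative: poisoned.
    return (addr & 7) + size - 1 >= k


shadow = {
    shadow_address(0x1008): 4,    # only the first 4 bytes are addressable
    shadow_address(0x1010): -1,   # fully poisoned, e.g. a redzone
}

print(is_invalid_access(shadow, 0x1000, 8))
print(is_invalid_access(shadow, 0x100A, 4))
```

Moving this load, compare and report sequence into a single instruction is what keeps the intermediate values out of the x86 registers.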
245
00:21:18,816 --> 00:21:24,910
Advantages of hardware address sanitizer
are for example you get better performance
246
00:21:24,910 --> 00:21:29,170
out of it. Because you only have a single
instruction maybe you can do some fancy
247
00:21:29,170 --> 00:21:34,450
tricks inside your CPU that are faster
than using x86 instructions, you get more
248
00:21:34,450 --> 00:21:38,880
compact code and you have the possibility
of one time configuration which is a bit
249
00:21:38,880 --> 00:21:45,210
hard with software address sanitizer. We
implemented our hardware address sanitizer
250
00:21:45,210 --> 00:21:49,270
variant by replacing the bound instruction.
Bound is an old instruction that is no
251
00:21:49,270 --> 00:21:54,870
longer used by compilers because in fact
it is slower to use bound instead of
252
00:21:54,870 --> 00:21:58,901
performing the checks with multiple x86
instructions. We changed the interface.
253
00:21:58,901 --> 00:22:04,090
The first argument is the register which
holds the address you want to access. And
254
00:22:04,090 --> 00:22:07,835
the second argument holds the size you
want this access to be.
255
00:22:07,835 --> 00:22:11,050
So, 1 byte, 2 byte and so on.
256
00:22:11,050 --> 00:22:14,950
This instruction is a no-op if the
check succeeds. So if there is no bug it
257
00:22:14,950 --> 00:22:19,980
just continues on like nothing happened.
However if we detect an invalid access we
258
00:22:19,980 --> 00:22:25,359
can take a configurable action, we can for
example just raise your normal page fault
259
00:22:25,359 --> 00:22:29,630
or we can raise a bound interrupt, which
is a custom interrupt, that only denotes
260
00:22:29,630 --> 00:22:34,299
this one or we can branch to an x86
handler that either performs additional
261
00:22:34,299 --> 00:22:39,760
checking, for example whitelisting, or it
generates a pretty error report for you.
262
00:22:41,340 --> 00:22:47,480
Most importantly this is a single
instruction. We also do not dirty any x86
263
00:22:47,480 --> 00:22:52,690
registers because there are some
intermediate results. You need to store
264
00:22:52,690 --> 00:22:56,360
these somewhere and this you usually do in
the x86 registers. So you increase
265
00:22:56,360 --> 00:23:00,010
register pressure. Maybe you cause
spilling. So overall your performance gets
266
00:23:00,010 --> 00:23:07,230
worse. We also found out that we are
actually faster than doing the checking
267
00:23:07,230 --> 00:23:12,390
using x86 instructions. So just by moving
the implementation from x86 level to
268
00:23:12,390 --> 00:23:16,805
microcode, which in some way is still kind
of like software, we already improved the
269
00:23:16,805 --> 00:23:22,160
performance. Also on top of this you get
better cache utilization because you have
270
00:23:22,160 --> 00:23:27,020
fewer instructions, there are fewer bytes in
the cache, so we get fuller cache lines.
271
00:23:27,020 --> 00:23:31,630
And also it is really easy to tell which
is testing code and which is your actual
272
00:23:31,630 --> 00:23:40,080
program code. Lastly I'm going to show you
just a rough overview of our framework
273
00:23:40,080 --> 00:23:45,920
which we used during our development and
which you can also find on GitHub. Early
274
00:23:45,920 --> 00:23:50,079
on we found out that we are probably going
to need to test a lot of microcode
275
00:23:50,079 --> 00:23:55,640
updates, because in the beginning you just
throw everything at the CPU and see how it
276
00:23:55,640 --> 00:24:01,400
behaves and we wanted to do this in
parallel. So we developed a small custom
277
00:24:01,400 --> 00:24:07,180
OS called "Angry OS" and deployed it to
mainboards. These mainboards are just old
278
00:24:07,180 --> 00:24:13,270
AMD mainboards. All these mainboards were
hooked up via serial for communication and
279
00:24:13,270 --> 00:24:19,400
GPIO to a Raspberry Pi. With the GPIO you
can reset, power on, power down
280
00:24:19,400 --> 00:24:23,890
and just have remote control of this
mainboard and then you can connect to that
281
00:24:23,890 --> 00:24:28,719
Raspberry Pi from anywhere on earth and
just deploy and play around with it.
282
00:24:28,719 --> 00:24:30,640
This was the first version.
283
00:24:30,640 --> 00:24:34,490
In the beginning we
didn't really know much about electronics
284
00:24:34,490 --> 00:24:38,520
so we used one Raspberry Pi per mainboard.
And it turns out Raspberry Pis are more
285
00:24:38,520 --> 00:24:43,970
expensive than these old mainboards, but
we improved upon this and now we're down
286
00:24:43,970 --> 00:24:48,007
to one Raspberry Pi for
four / five setups.
287
00:24:48,007 --> 00:24:51,587
For example you only need 3 GPIO ports per
288
00:24:51,587 --> 00:24:57,358
mainboard. You connect each of these to
optocouplers just to separate the voltage
289
00:24:57,358 --> 00:25:01,860
levels and then you connect one side of
the optocoupler to the GPIO the other side
290
00:25:01,860 --> 00:25:05,909
to your reset pin, to your power pin and
for input to know whether your board is up
291
00:25:05,909 --> 00:25:11,230
or down you connect the power LED. And
that way you can save a lot of space, a
292
00:25:11,230 --> 00:25:17,205
lot of money. And also if you're really
constrained you can just remove the power
293
00:25:17,205 --> 00:25:23,530
LED sensing because usually you know which
state your setup is in. As I
294
00:25:23,530 --> 00:25:28,230
already said we wrote our custom operating
system and it is intentionally really
295
00:25:28,230 --> 00:25:32,659
really minimal because the major feature
we wanted is control over every
296
00:25:32,659 --> 00:25:36,740
instruction that's going to be executed
from a certain point on, because we're
297
00:25:36,740 --> 00:25:40,780
playing around with instruction encoding
and if we execute an instruction that we
298
00:25:40,780 --> 00:25:45,530
did not intend we might crash the CPU, we
might go into an invalid state and we do
299
00:25:45,530 --> 00:25:50,850
not even know which instruction caused it.
And Angry OS essentially only listens on
300
00:25:50,850 --> 00:26:00,150
the serial port for something to do. What
it can do is apply an update. These
301
00:26:00,150 --> 00:26:04,820
updates are just microcode updates. They
are streamed via serial. We can also
302
00:26:04,820 --> 00:26:10,039
stream x86 code which is then run by Angry
OS and this is just so that we do not need
303
00:26:10,039 --> 00:26:14,409
to reflash the USB stick every time we
want to update our testing code and the
304
00:26:14,409 --> 00:26:19,280
result, all the errors are reported back
to the Raspberry Pi and thus they are
305
00:26:19,280 --> 00:26:26,852
forwarded to us. The framework we use most
importantly has the microcode assembler
306
00:26:26,852 --> 00:26:30,713
and a pretty verbose disassembler. This
disassembler generates the output I showed
307
00:26:30,713 --> 00:26:36,919
you earlier and using this you can just
quickly write your own microcode. We also
308
00:26:36,919 --> 00:26:42,245
included an x86 assembler because we
wanted to rapidly test different x86
309
00:26:42,245 --> 00:26:47,730
testing codes. Using this framework we
were able to disassemble the existing
310
00:26:47,730 --> 00:26:53,500
updates and we also used it to disassemble
our ROM after we reordered it and also
311
00:26:53,500 --> 00:27:01,169
during the process when we fed it to our
emulator. And we can also create the
312
00:27:01,169 --> 00:27:07,909
proper binary files that can be loaded by
the Linux kernel driver. We modified the
313
00:27:07,909 --> 00:27:12,777
stock one to just load any update you give
it without checking if it's the correct
314
00:27:12,777 --> 00:27:20,060
CPU ID and all these things just for
testing purposes. It's also available. And
315
00:27:20,060 --> 00:27:25,740
also of course the framework can control
Angry OS to make your testing easier. And
316
00:27:25,740 --> 00:27:29,650
we implemented a pretty basic remote
execution wrapper, so you can work on a
317
00:27:29,650 --> 00:27:33,389
remote Raspberry Pi as if you were using
it locally.
318
00:27:34,809 --> 00:27:36,799
And this brings me to the end
319
00:27:36,799 --> 00:27:40,800
of the talk. And in conclusion we can say
reversing the ROM opened up a lot of new
320
00:27:40,800 --> 00:27:44,809
possibilities. We learned a lot about how
microcode works. We learned about how to
321
00:27:44,809 --> 00:27:49,720
actually use it properly instead of just
inferring from a really small dataset,
322
00:27:49,720 --> 00:27:55,060
that we have from the updates, or from the
random bits we send to the CPU and
323
00:27:55,060 --> 00:27:59,530
observe what happened. But there's a lot
left to do. So if you really want to hack
324
00:27:59,530 --> 00:28:04,089
on it, just get in contact, we are happy
to share our findings with you. And as I
325
00:28:04,089 --> 00:28:09,009
said the framework, Angry OS, example
programs that we implemented, and some
326
00:28:09,009 --> 00:28:13,850
other stuff like the wiring is available
on GitHub. So that's that. And we are
327
00:28:13,850 --> 00:28:16,809
happy to answer any questions you might
have.
328
00:28:16,809 --> 00:28:22,234
applause
329
00:28:24,910 --> 00:28:28,438
Herald Angel: Thank you very much. So we
330
00:28:28,438 --> 00:28:34,260
have 10 minutes for questions. Please line
up at the microphones. We start with this
331
00:28:34,260 --> 00:28:39,220
one: microphone number 2.
M2: Hi. Thanks for a nice talk. A few
332
00:28:39,220 --> 00:28:42,780
questions about your hardware address
sanitizer.
333
00:28:42,780 --> 00:28:49,830
Benjamin: Mhm
M2: As I understand you don't need the
334
00:28:49,830 --> 00:28:56,010
source code instrumentation because the
microcode is responsible for checking the
335
00:28:56,010 --> 00:29:02,929
shadow memory, right?
Benjamin: No... The original hardware
336
00:29:02,929 --> 00:29:07,950
sanitizer implementation is also based on
a compiler extension, that inserts a new
337
00:29:07,950 --> 00:29:12,200
instruction because it doesn't exist
usually. And it also inserts a bootstrap
338
00:29:12,200 --> 00:29:18,049
code that inits your shadow map and
also instruments your allocators to update
339
00:29:18,049 --> 00:29:23,020
the shadow map during runtime and we
essentially need the same component, but
340
00:29:23,020 --> 00:29:26,850
we do not need the software address
sanitizer component that essentially
341
00:29:26,850 --> 00:29:33,740
inserts 10 or 20 x86 instructions before
every memory access. So yes we still need
342
00:29:33,740 --> 00:29:37,647
a compile time component and we are still
source code based in a sense.
343
00:29:39,388 --> 00:29:45,600
Herald: And, so..
M2: And I didn't see, maybe I missed the
344
00:29:45,600 --> 00:29:51,299
numbers. How much faster is it than this
initial version?
345
00:29:51,299 --> 00:29:56,419
Benjamin: You mean the initial hardware
sanitizer version or the software address
346
00:29:56,419 --> 00:29:59,900
sanitizer?
M2: I mean let's say custom kernel address
347
00:29:59,900 --> 00:30:05,180
sanitizer for Linux kernel which is the
usual one and your approach.
348
00:30:05,180 --> 00:30:10,270
Benjamin: We only performed a micro
benchmark on Angry OS and we essentially
349
00:30:10,270 --> 00:30:16,059
took the instrumentation as emitted by the
compiler for some memory access which is
350
00:30:16,059 --> 00:30:20,590
your standard software address sanitizer
and compared it to our version using only
351
00:30:20,590 --> 00:30:24,640
the modified bound instruction. So I
really can't talk about how it compares to
352
00:30:24,640 --> 00:30:28,820
KASAN or something or some like real world
implementation, because we only have the
353
00:30:28,820 --> 00:30:34,069
prototype and the basic instrumentation.
M2: Thank you very much.
354
00:30:34,069 --> 00:30:36,490
Herald Angel: OK. Microphone number 4
please.
355
00:30:36,490 --> 00:30:51,145
M4: Hey thanks for the talk and did you
find any weird microcode
356
00:30:51,145 --> 00:31:00,529
implementations? I don't mean security
wise, just like you rarely expected to
357
00:31:00,529 --> 00:31:07,330
see it be implemented that way.
358
00:31:09,040 --> 00:31:11,700
Benjamin: The problem is there's a lot of
359
00:31:11,700 --> 00:31:20,270
microcode to begin with. You have f000
triads, each of which has 3 op-codes. So
360
00:31:20,270 --> 00:31:25,003
you have a lot of ground to cover and also
we have read-out errors. Sometimes you are
361
00:31:25,003 --> 00:31:29,169
seeing bit flips, which kind of slows you
down because you then need to always
362
00:31:29,169 --> 00:31:32,820
consider: OK, maybe this register is
something else, maybe this address is
363
00:31:32,820 --> 00:31:37,420
wrong. And also sometimes you have a dust
particle that kind of knocks out an
364
00:31:37,420 --> 00:31:42,550
entire region. So we only looked at the
components we were pretty sure that we
365
00:31:42,550 --> 00:31:46,520
recovered correctly, and we only looked
at a really tiny subset compared to all of
366
00:31:46,520 --> 00:31:52,940
the microcode ROM. It's just not feasible
to do and to go through it and look at
367
00:31:52,940 --> 00:31:57,330
everything. So no we didn't find anything
funny but we also wouldn't know what funny
368
00:31:57,330 --> 00:32:00,790
looks like because we don't know what the
official spec for microcode is.
369
00:32:01,180 --> 00:32:03,990
M4: Thanks.
Herald Angel: Interesting. We have one
370
00:32:04,034 --> 00:32:05,809
question from the Internet, from the
371
00:32:05,809 --> 00:32:09,792
Signal Angel please.
Signal Angel: Yes. Which AMD CPU
372
00:32:09,792 --> 00:32:15,510
generations does this apply to?
Benjamin: Yeah this is still based on the
373
00:32:15,510 --> 00:32:21,289
work of our first talk and this only works
on pretty old ones: K8, K10. So
374
00:32:21,289 --> 00:32:26,940
CPUs produced until 2013. Yeah this was
the last year AMD produced anything like
375
00:32:26,940 --> 00:32:32,520
that. Newer ones use some public key based
cryptography from what we can tell and we
376
00:32:32,520 --> 00:32:36,559
haven't yet managed to break it. Same goes
for Intel, they seem to be using public
377
00:32:36,559 --> 00:32:39,919
key cryptography and we haven't gotten a
foot in the door yet.
378
00:32:40,989 --> 00:32:44,789
Herald Angel: Thank you. We go one around.
On microphone number 3 please.
379
00:32:44,789 --> 00:32:51,290
M3: Yeah. Thank you. I would like to know
how complex could the microcode programs
380
00:32:51,290 --> 00:32:59,159
be, that you could write. So what's the
complexity of new operations you could
381
00:32:59,159 --> 00:33:03,300
implement.
Benjamin: The only limiting factor is the
382
00:33:03,300 --> 00:33:07,923
size of your microcode update RAM. But
this one is really really limited.
383
00:33:07,923 --> 00:33:12,679
For example on K8, where we performed the
majority of our experiments. We are
384
00:33:12,679 --> 00:33:19,050
limited to 32 triads, which comes down to
96 instructions and you also
385
00:33:19,050 --> 00:33:22,440
have some constraints on these
instructions for example the next triad
386
00:33:22,440 --> 00:33:27,809
will always be executed no matter what.
Some operations can only go at the second
387
00:33:27,809 --> 00:33:33,859
slot. Some can only go on another slot, so
it's really really hard. And you're also
388
00:33:33,859 --> 00:33:38,930
limited, to our knowledge, to loading 16
bit immediates instead of 32 bit or even
389
00:33:38,930 --> 00:33:44,470
64 bit immediates. So your whole program
grows really fast if you're trying to do
390
00:33:44,470 --> 00:33:49,400
something complex. For example our
authenticated microcode update mechanism
391
00:33:49,400 --> 00:33:54,440
is the most complex one we wrote. It nearly
fills out the RAM and we used TEA – Tiny
392
00:33:54,440 --> 00:33:58,700
Encryption Algorithm – because that was
the only one we managed to fit, mostly due
393
00:33:58,700 --> 00:34:04,510
to the S-boxes and other constants other ciphers would need
to load. So it's really small.
394
00:34:04,510 --> 00:34:08,539
Herald Angel: Thank you. Microphone number
1.
395
00:34:08,539 --> 00:34:14,709
M1: So you said the microcode is used for
instruction decoding and it needs to emit
396
00:34:14,709 --> 00:34:19,429
the micro-ops to the scheduler and micro
queue in some way. Did you find out how
397
00:34:19,429 --> 00:34:27,519
that works?
Benjamin: In essence we are not actually
398
00:34:27,519 --> 00:34:33,539
executing code inside the microcode engine.
From what we understand, the
399
00:34:33,539 --> 00:34:38,569
microcode engine is just some kind of a
software based recipe, that describes how
400
00:34:38,569 --> 00:34:43,479
to decode an instruction, so you don't
actually get execution, you just commit
401
00:34:43,479 --> 00:34:47,269
instructions into the pipelines, that do
what you want. And because we have some
402
00:34:47,269 --> 00:34:51,269
control flow possibility, that is actually
inside the microcode engine, because you
403
00:34:51,269 --> 00:34:55,268
can branch to different addresses, you can
conditionally branch and loop. You kind of
404
00:34:55,268 --> 00:34:59,089
get an execution, but in essence you just
commit stuff in the pipeline and the CPU
405
00:34:59,089 --> 00:35:01,440
does what you tell it to.
406
00:35:04,240 --> 00:35:07,161
Herald Angel: One more question.
Microphone number 2, please.
407
00:35:07,161 --> 00:35:11,927
M2: How did you take the picture of the
internal CPU? Did you open it?
408
00:35:11,927 --> 00:35:14,969
Benjamin: Yeah. We worked together with
409
00:35:14,969 --> 00:35:19,680
Chris. He's our hardware guy. He has
access to his equipment to delayer it and
410
00:35:19,680 --> 00:35:24,289
to take high resolution optical shots and
he also takes shots with a scanning
411
00:35:24,289 --> 00:35:29,279
electron microscope. So I think about five
or six CPUs were harmed in the making of
412
00:35:29,279 --> 00:35:30,357
this paper.
413
00:35:33,810 --> 00:35:37,815
Herald Angel: So we have one more last
question. Microphone number 2 please.
414
00:35:39,248 --> 00:35:41,390
M2: Are you aware of research done by
415
00:35:41,390 --> 00:35:49,400
Christopher Domas, where he mapped out the
instruction set for x86 processors?
416
00:35:49,400 --> 00:35:57,119
Benjamin: You mean sandsifter? We
actually talked with him and yeah we are
417
00:35:57,119 --> 00:36:02,910
aware, that there's a map essentially of
the instruction set and also maybe you can
418
00:36:02,910 --> 00:36:07,275
combine it, because in the beginning we
reverse engineered where certain x86
419
00:36:07,275 --> 00:36:11,335
instructions are implemented in microcode.
So if you plug these two together you kind
420
00:36:11,335 --> 00:36:15,170
of map out the whole microcode ROM at the
same time that you map out a whole
421
00:36:15,170 --> 00:36:18,989
instruction set. However there are some
components of the microcode ROM that are
422
00:36:18,989 --> 00:36:23,470
most likely not triggered by instructions.
For example it seems like power management
423
00:36:23,470 --> 00:36:27,368
or everything that is behind a write MSR
[wrmsr] or read MSR [rdmsr]. wrmsr is a
424
00:36:27,368 --> 00:36:31,249
single instruction, but depending on the
arguments you give it, it just branches to
425
00:36:31,249 --> 00:36:36,442
totally different triads and the microcode
update itself is implemented in microcode. And
426
00:36:36,442 --> 00:36:40,190
this one is a huge chunk you wouldn't even
find without brute forcing all
427
00:36:40,190 --> 00:36:44,159
combinations for all instructions which is
not really feasible.
428
00:36:46,483 --> 00:36:51,279
Herald Angel: Thank you. Thank you
Benjamin.
429
00:36:51,279 --> 00:36:57,210
applause
430
00:36:57,210 --> 00:37:01,811
35C3 postroll music
431
00:37:01,811 --> 00:37:21,000
subtitles created by c3subtitles.de
in the years 2019-2020. Join, and help us!