35C3 - Inside the AMD Microcode ROM

0:03 - 0:07

35C3 preroll music
0:19 - 0:24

Herald: So the next talk Benjamin Kollenda
and Philipp Koppe - they will refresh our
0:24 - 0:31

memories because they already had a talk
at 34C3 where they talked about the micro
0:31 - 0:38

code ROM and today they're gonna give us
more insights on how micro code works. And
0:38 - 0:44

more details on the ROM itself. Benjamin
is a PhD student and has a focus on
0:44 - 0:51

software attacks and defenses and together
with Philipp they will now abuse AMD
0:51 - 0:55

microcode for fun and security. Please
enjoy.
0:55 - 0:59

Applause
1:01 - 1:06

Benjamin: Thank you. So as mentioned we
were able to reverse engineer the AMD
1:06 - 1:12

microcode and the AMD microcode ROM and
I'm going to talk about our journey. What
1:12 - 1:16

we learned on the way and how we did it.
So this is joint work with my colleagues at
1:16 - 1:21

Ruhr-Universität Bochum and a quick outline
how are we going to do it. We're going to
1:21 - 1:25

start with a quick crash course on micro
architectural basics and what microcode
1:25 - 1:28

actually is. Then I talk about how we
reconstructed the
1:28 - 1:30

microcode ROM and what we learned
1:30 - 1:35

along the way. Then I quickly give some
examples of the applications we
1:35 - 1:41

implemented with the knowledge we gained
from second step. And lastly I talk about
1:41 - 1:48

a framework we used. How it works and what
we can do with it. And also this framework
1:48 - 1:52

is available on GitHub along with some
other tools so you're free to continue our
1:52 - 1:57

work. OK. So when I'm talking about
microcode you can think of it essentially
1:57 - 2:02

as a firmware for your processor. It
handles multiple purposes for example
2:02 - 2:06

you can use it to fix CPU bugs that you
have in silicon and you want to fix later
2:06 - 2:12

in the design phase. It is used for
instruction decoding - I cover this one a
2:12 - 2:18

bit more. It is also used for exception
handling. For example, if an exception or
2:18 - 2:22

interrupt is raised, microcode has a first
chance of modifying this interrupt
2:22 - 2:27

ignoring it or just passing it along to
the operating system. It's also used for
2:27 - 2:32

power management and some other complex
features like Intel SGX. And most
2:32 - 2:37

importantly for us microcode is updatable.
This is used to patch errors in the field.
2:37 - 2:41

Everyone remembers Spectre / Meltdown
patches and there's
2:41 - 2:44

a microcode update. So your
2:44 - 2:51

x86 CPU takes multiple steps to execute an
instruction. The first step is decoding
2:51 - 2:55

an x86 instruction into multiple smaller
micro ops.
2:55 - 2:57

These are then scheduled into the pipeline
2:57 - 3:02

From there, they are dispatched to
the different functional units
3:02 - 3:04

like your ALU / AGU
3:04 - 3:06

multiplication division units
3:06 - 3:08

For our purposes the decode step is the
3:08 - 3:12

most interesting one. In the decode step
you have an instruction buffer that feeds
3:12 - 3:17

instructions to some decoders. You have
short decoders that handle really simple
3:17 - 3:21

instructions. There are long decoders that
can handle some more advanced instructions.
3:21 - 3:25

And finally, the vector decoder. The
vector decoder handles the most complex
3:25 - 3:30

instructions with the help of microcode.
So the microcode engine is essentially the
3:30 - 3:31

vector decoder.
3:32 - 3:37

The Microcode engine in essence
is composed of a microcode
3:37 - 3:41

ROM that stores the instructions for the
microcode engine. Think of it as your
3:41 - 3:48

standard instructions. Then there is also
a writeable memory the microcode RAM. This
3:48 - 3:53

is where the microcode updates end up when
you apply microcode updates. And of course
3:53 - 3:57

around the storage there is a whole lot of
things that make it actually run. For this
3:57 - 4:01

talk, you only need to know what Match
Registers are. Match Registers are
4:01 - 4:06

essentially breakpoint registers. So if we
write an address from inside the microcode
4:06 - 4:11

ROM into a Match Register, whenever this
address is fetched, execution control is
4:11 - 4:18

transferred to the microcode RAM so our
patch gets executed. And the microcode
4:18 - 4:23

updates are usually loaded by the BIOS or
by the kernel. Linux has an update driver,
4:23 - 4:28

sometimes the BIOS updates it with a
pre-installed version and they have a
4:28 - 4:32

pretty simple structure, a partially
documented header, and followed by the
4:32 - 4:38

actual microcode that is loaded inside the
CPU. And so microcode is organized in
4:38 - 4:43

something called triads. Each triad has
three operations, essentially x86
4:43 - 4:48

instructions but with some differences.
And lastly, you have a sequence word. The
4:48 - 4:52

sequence word indicates which microcode
instructions should be executed next. We
4:52 - 4:58

have options of executing just the next
triad, executing another one by branching
4:58 - 5:02

to it, or just saying OK, I'm done with
decoding this instruction continue with
5:02 - 5:07

x86 code. These updates are protected by
some weak authentication which we were
5:07 - 5:13

able to break so we can create our own. We
can analyze existing ones and we can apply
5:13 - 5:21

these to your standard laptop and desktop.
However there can only ever be one update
5:21 - 5:27

loaded at a time and when you reboot
your machine this update will be gone.
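Editor's sketch: the fetch behaviour described above (triads, sequence words, match registers) can be modelled in a few lines of Python. This is a toy model, not the real hardware: the single shared address space for ROM and patch RAM, the addresses, and the micro-op names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Sequence-word actions named in the talk.
NEXT, BRANCH, COMPLETE = "next", "branch", "complete"


@dataclass
class Triad:
    ops: tuple          # three micro-op names (placeholders)
    seq: str = NEXT     # sequence-word action
    target: int = 0     # branch target, only used when seq == BRANCH


def run(store, match_registers, entry, limit=32):
    """Follow sequence words from `entry` and return the executed micro-ops."""
    trace, addr = [], entry
    for _ in range(limit):
        # A match register works like a breakpoint on a ROM address:
        # fetching that address diverts execution to the patch instead.
        addr = match_registers.get(addr, addr)
        triad = store[addr]
        trace.extend(triad.ops)
        if triad.seq == COMPLETE:
            break  # instruction fully decoded, back to x86 code
        addr = triad.target if triad.seq == BRANCH else addr + 1
    return trace


# Hypothetical layout: ROM triads at 0x10 and 0x11, one patch triad at 0x1000.
store = {
    0x10: Triad(("rom_a", "rom_b", "rom_c")),
    0x11: Triad(("rom_d", "rom_e", "rom_f"), COMPLETE),
    0x1000: Triad(("patch_a", "patch_b", "patch_c"), BRANCH, 0x11),
}

unpatched = run(store, {}, 0x10)
patched = run(store, {0x10: 0x1000}, 0x10)
```

With an empty match-register set the ROM triads run unchanged; once 0x10 is placed in a match register, the patch triad runs in its place and then branches back into the ROM.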
5:28 - 5:33

Also for the talk we are going to look at
some microcode and we will present this
5:33 - 5:38

microcode using a register transfer
language. It is heavily based on x86. I'm
5:38 - 5:43

just going to cover the differences
between these two. Most importantly the
5:43 - 5:49

microcode can have three operands for an
instruction in comparison to x86 which
5:49 - 5:54

usually only has two. So you can specify a
destination and two source operands.
5:56 - 5:56

Also,
5:57 - 6:02

microcode has certain bit flags that
need to be set, and these we denote with
6:02 - 6:07

annotations. For example, ".C" means this
instruction also updates the carry flag
6:07 - 6:14

based on the result. Then you have the
instruction "jcc" which is a conditional
6:14 - 6:20

branch and the first operand denotes the
condition upon which this branch is
6:20 - 6:24

taken. In this case, branch if the carry
flag is one, and the second operand
6:24 - 6:30

indicates the offset to add to the
instruction pointer. Then we also have
6:30 - 6:36

some sequence word annotations: "next",
"complete", and "branch". Also it should
6:36 - 6:40

be noted that the internal microcode
architecture is a load-store architecture.
6:40 - 6:45

You can't use memory operands in other
instructions like you can on x86 you
6:45 - 6:48

always need to load and store memory
explicitly.
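Editor's sketch: the differences from x86 listed above (three operands, the ".C" annotation, "jcc", explicit loads) can be illustrated with a small Python model. The 32-bit register width, the register names, and the exact way the offset is applied are assumptions for illustration, not details confirmed in the talk.

```python
MASK32 = 0xFFFFFFFF  # assumed register width for this toy model


def load(regs, mem, dst, addr_reg):
    """Load-store architecture: memory is only touched by explicit loads."""
    regs[dst] = mem[regs[addr_reg]]


def add_c(regs, flags, dst, src1, src2):
    """Three-operand add with '.C': dst = src1 + src2, carry flag updated."""
    total = regs[src1] + regs[src2]
    regs[dst] = total & MASK32
    flags["C"] = total > MASK32


def jcc(flags, cond, offset, ip):
    """Conditional branch: add `offset` to `ip` if the named flag is set."""
    return ip + offset if flags[cond] else ip


# x86 could write `add eax, [ebx]`; here it takes a load followed by an add.
regs = {"eax": 0xFFFFFFFF, "ebx": 0x100, "t1": 0}
mem = {0x100: 1}
flags = {"C": False}

load(regs, mem, "t1", "ebx")            # t1 := [ebx]
add_c(regs, flags, "eax", "eax", "t1")  # eax := eax + t1, sets carry
ip = jcc(flags, "C", 2, 5)              # branch taken because carry is set
```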
6:49 - 6:52

Now we are going to talk about
6:52 - 6:59

how we manage to recover the microcode
ROM. The microcode ROM is baked into your
6:59 - 7:07

CPU, you can't change it anymore. It is
defined in the silicon during the
7:07 - 7:13

fabrication process and in this picture
you can see a die shot taken with an
7:13 - 7:17

electron microscope and this is one of
three regions that contains the bits for
7:17 - 7:23

the microcode operations. And if you zoom
in a bit more, each of these regions
7:23 - 7:30

consists of four arrays and these are
further subdivided into blocks. Really
7:30 - 7:35

interesting is "Array 2" which is a bit
smaller than the other ones but it has
7:35 - 7:42

some structures above it which are of a
different visual layout. This is SRAM
7:42 - 7:47

which stores the microcode update. So this
is one-time reprogrammable memory that is
7:47 - 7:54

still pretty fast. So the microcode RAM is
located right next to the microcode ROM
7:54 - 7:58

which also makes sense from a design
standpoint.
8:00 - 8:02

Just an overview of how we
8:02 - 8:07

went about it. We
started with pictures and then we used
8:07 - 8:11

some OCR-like process to transform them
into bit strings which we can then further
8:11 - 8:17

process. These bitstrings were then
arranged into triads. We could already
8:17 - 8:22

gather that we got individual triads
right because there were data dependencies
8:22 - 8:28

all over the place, but between triads,
there were no or very few data
8:28 - 8:34

dependencies so the ordering of the
triads was still wrong and this was a
8:34 - 8:39

major problem. What we had to reverse
engineer is the
8:39 - 8:44

mapping of a certain physical address of a
triad that we gathered from the ROM
8:44 - 8:48

readout to a virtual address that is used
inside the microcode update or the
8:48 - 8:54

microcode ROM. But after reverse engineering
this, you can just do a linear sweep
8:54 - 8:59

disassembly of the microcode ROM and
arrive at human readable output. But this
8:59 - 9:05

recovery was a bit tricky because we
required physical-virtual address pairs.
9:05 - 9:10

But gathering these is a bit harder
because we went through the
9:10 - 9:14

available updates, but we could only find
two pairs of them. These pairs were
9:14 - 9:19

actually easy to find because every update
replaces a certain triad inside your
9:19 - 9:25

microcode ROM and this triad is usually
also placed in the microcode update. So by
9:25 - 9:31

matching the address this update replaces
with the microcode ROM readout, you can just
9:31 - 9:38

get your two data points. But we had to
get more data points so we generated these
9:38 - 9:43

mappings by matching semantics of triads
in the microcode ROM readout and the
9:43 - 9:48

semantics when we force execution of a
certain microcode address. And for gathering
9:48 - 9:52

the semantics of the read-out microcode,
we implemented a simple microcode
9:52 - 9:59

simulator. Essentially it works on triad
level, so you give it an input state and a
9:59 - 10:03

triad and it calculates the output state
of it. Input and output state are
10:03 - 10:08

composed of the x86 state, which is
your standard registers and also the
10:08 - 10:12

internal microcode registers. There are
multiple temporary registers that get
10:12 - 10:18

reset for every new x86 instruction that
is executed, but they can also be modified
10:18 - 10:24

by microcode of course. Our emulator
supports all known arithmetic operations
10:24 - 10:29

and we have a white-list of operations
that do not form or produce any observable
10:29 - 10:33

change in state just so that we could
process more triads and gather more
10:33 - 10:41

data points. In total we gathered 54
additional data-address pairs which turned
10:41 - 10:47

out to be enough to recover the whole
mapping. This mapping, essentially you
10:47 - 10:51

have the four different arrays that map to
individual blocks and these blocks in
10:51 - 10:57

these arrays are then again permuted a bit
and then the triads inside these blocks
10:57 - 11:02

have some table-based permutations. So
this is not an obfuscation. It is just that
11:02 - 11:08

from a hardware design standpoint it can
make sense to reroute it a bit differently.
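Editor's sketch: the semantic matching described above, simulating each read-out triad and comparing the result with the state observed when the CPU is forced to execute a given microcode address, could look roughly like this. The operation names, the state layout, and the example values are hypothetical; the real simulator handles far more operations.

```python
NOP = ("nop", None, None, None)  # stands in for a whitelisted no-effect op


def simulate(triad, state):
    """Apply three (op, dst, src1, src2) micro-ops to a copy of `state`."""
    out = dict(state)
    for op, dst, a, b in triad:
        if op == "add":
            out[dst] = (out[a] + out[b]) & 0xFFFFFFFF
        elif op == "xor":
            out[dst] = out[a] ^ out[b]
        elif op == "nop":
            pass
        else:
            raise ValueError(f"unsupported op: {op}")
    return out


def candidates(readout, input_state, observed_output):
    """Physical triad indices whose simulated output equals the observed one."""
    matches = []
    for phys, triad in readout.items():
        try:
            if simulate(triad, input_state) == observed_output:
                matches.append(phys)
        except ValueError:
            continue  # triads with unknown operations cannot be matched
    return matches


readout = {
    0: [("add", "t1", "eax", "ebx"), NOP, NOP],
    1: [("xor", "t1", "eax", "ebx"), ("add", "eax", "t1", "ebx"), NOP],
    2: [("mystery", "t1", "eax", "ebx"), NOP, NOP],
}
input_state = {"eax": 5, "ebx": 3, "t1": 0}
# Pretend this state was observed after forcing one virtual address on hardware.
observed = {"eax": 9, "ebx": 3, "t1": 6}

matches = candidates(readout, input_state, observed)
```

A unique match yields one physical-virtual address pair; ambiguous or empty results would simply be discarded.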
11:09 - 11:15

Also now that we can actually
map a certain address to the microcode ROM
11:15 - 11:19

readout and we know the addresses of
different x86 instructions from our
11:19 - 11:24

earlier experiments, we can look at the
implementation of instructions. So let's
11:24 - 11:29

start with a pretty simple one. Shift-
Right-Double which essentially takes a
11:29 - 11:33

register, shifts it by a given amount and
shifts in bits from another register. So
11:33 - 11:38

of course you would expect a lot of shifts
and rotates in its implementation and this
11:38 - 11:45

is exactly what we're seeing here. You
have two shift-right operations and you can
11:45 - 11:51

see regmd6 and regmd4. These are
placeholders. The microcode engine can
11:51 - 11:56

replace certain bit combinations with the
registers that are used in the x86
11:56 - 12:02

operation. For example this one would be
replaced by ECX or EAX depending on what
12:02 - 12:08

you wrote in x86. And at this point we can
also already gather more information about
12:08 - 12:14

microcode than we previously knew because
we know "OK, so this is source, this is
12:14 - 12:19

also a source and this is a destination".
But this source which indicates the shift
12:19 - 12:23

amount, this one was previously unknown,
because it is a high temporary microcode
12:23 - 12:28

register and we found out that these
usually serve a specific, different
12:28 - 12:32

purpose. If you write to
them, sometimes the CPU behaves
12:32 - 12:36

erratically, sometimes it crashes,
sometimes nothing happens. But in this
12:36 - 12:40

case, this seems to be the shift count,
and the shift count is given by a third
12:40 - 12:45

operand in the instruction. So in this
case, we already learned "OK, if you want
12:45 - 12:51

to read the third operand of an
instruction, we need to read t41". And
12:51 - 12:56

this is how we went about recovering more
and more information about microcode. The
12:56 - 13:00

rest of the implementation is essentially
concerned with implementing the rest of
13:00 - 13:06

the semantics of the x86 instruction and
updating the flags correctly. OK, so now
13:06 - 13:12

let's look at an instruction that is a
bit more complicated. Let's check out
13:12 - 13:20

rdtsc. rdtsc returns an internal cycle
counter in EDX and EAX, so the upper part
13:20 - 13:26

ends up in EDX, lower part in EAX. So in
the end we want to see writes to these
13:26 - 13:31

registers, potentially with a shift
somewhere in there. But somewhere the CPU
13:31 - 13:38

needs to gather the cycle counter. So in
the beginning we have two load-style
13:38 - 13:41

operations. This one is a proper load
which we identified and this one is
13:41 - 13:49

unknown. But despite that we do not know
the instruction, we know the target
13:49 - 13:53

because the result of this instruction
will end up in t9 and the result of this
13:53 - 13:58

instruction will end up in t10, so we can
follow the uses of these two registers. So
13:58 - 14:04

for simplicity I'm going to start with t10
and t10, which we later found out, this is
14:04 - 14:10

another register which essentially denotes
a specific internal register. And if you
14:10 - 14:15

play around with these bits you notice
that this combination encodes cr4. The x86
14:15 - 14:23

will just see cr4. You can also address
cr1 and cr2. And if you look further, t10
14:23 - 14:29

is then ANDed with this bit mask and if
you look in the manual you find out that
14:29 - 14:35

this bit in cr4 denotes the bit that
determines whether rdtsc is
14:35 - 14:40

available from user space or not. So this
is the check if this instruction should be
14:40 - 14:48

executed. So now let's just keep in mind
that t9 holds some other loaded value from
14:48 - 14:54

some other internal register and we will
come back to this one a bit later. For
14:54 - 14:59

now, let's follow execution. This triad is
essentially a padding triad. It is a
14:59 - 15:05

common pattern we see. So let's look at
where this branch takes us.
15:06 - 15:07

And this branch
15:07 - 15:16

takes us to a conditional branch
triad. And if you look a bit up, this AND
15:16 - 15:22

instruction actually updated this flag. So
this is a conditional branch that
15:22 - 15:26

determines whether this check was
successful or not. So it branches to either
15:26 - 15:33

the error triad or the success triad. But
here we already see the exit. We see a
15:33 - 15:41

write to RDX or EDX in this case with a
shift from t9 by 32 bits, which is exactly
15:41 - 15:46

what you would expect in order to write
the upper 32 bits of the
15:46 - 15:51

time stamp counter to edx. And you have an
unknown instruction, but we know, okay, we
15:51 - 15:58

move something from t9 to eax, which is
the lower 32 bits. But we're not done
15:58 - 16:03

here, because we can still look at the
error path that is taken if the access is
16:03 - 16:09

denied. So if you scroll a bit down we can
see a move of an immediate into a certain
16:09 - 16:15

internal register. And this immediate
actually encodes a general protection
16:15 - 16:22

fault interrupt code. It denotes to the
exception handler that this was a general
16:22 - 16:29

protection fault. And later this triad
branches to this address, and if you look
16:29 - 16:34

at the uses of this address we can find
other immediates that also correspond
16:34 - 16:37

to x86 interrupts. So now we learned
16:37 - 16:40

how we can actually raise our
own interrupts. We
16:40 - 16:46

just need to load the code we want into
the specific register and branch to this
16:46 - 16:53

address. And now we learned a lot about
how we can actually write microcode, but
16:53 - 16:57

it's also interesting to see how certain
instructions are implemented. So let's
16:57 - 17:04

look at a pretty complicated one: wrmsr
(Write MSR). wrmsr essentially writes some
17:04 - 17:08

data it is given to a machine specific
register. This machine specific register
17:08 - 17:13

differs between CPUs, between vendors,
sometimes between revisions. And these
17:13 - 17:18

implement non-standard extensions or
pretty complex features. For example, you
17:18 - 17:24

trigger a microcode update by writing to a
machine specific register. The register
17:24 - 17:31

address you want to write to is given in
ecx. And now we can see ecx is read and
17:31 - 17:40

it is shifted by sixteen bits to t10. So
again, we follow uses of t10 and we see
17:40 - 17:46

it is XOR'd with a certain bitmask. And
this bitmask is C000, which actually
17:46 - 17:52

denotes a namespace of the model specific
registers. In this case this should be an
17:52 - 17:58

AMD-specific namespace. And, of course,
this one again sets some flags, and you
17:58 - 18:04

can see your conditional branch depending
on these flags to what should be the
18:04 - 18:06

handler for this namespace.
18:07 - 18:11

Next one: We have another XOR
that uses a different bit
18:11 - 18:17

mask — in this case C001. C001 is the
namespace where the microcode update
18:17 - 18:25

routine is actually located in. So again,
we branch to this handler. And if you just
18:25 - 18:31

continue on, there are more operations on
rcx, followed by more branches, and this
18:31 - 18:36

continues until everything is dispatched
to the correct handler. And this is how,
18:36 - 18:40

internally, wrmsr is implemented, and also
Read MSR is going to be implemented pretty
18:40 - 18:44

similarly, because it does a similar kind
of thing.
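Editor's sketch: the wrmsr dispatch walked through above (shift ecx right by 16 bits, XOR against a namespace constant, branch when the result is zero) can be written out in Python as follows. The handler names are hypothetical; the two example MSR numbers are the publicly documented EFER (0xC0000080) and AMD patch loader (0xC0010020) registers.

```python
def dispatch_wrmsr(ecx, handlers):
    """Pick a handler from the upper 16 bits of the MSR address in ecx."""
    t10 = (ecx >> 16) & 0xFFFF  # the shift into t10 described in the talk
    for namespace, handler in handlers:
        # XOR yields zero exactly when both values are equal; the microcode
        # then takes a conditional branch on the resulting flags.
        if t10 ^ namespace == 0:
            return handler
    return "default_handler"


# Hypothetical handler names for the two namespaces mentioned in the talk.
handlers = [
    (0xC000, "handler_c000"),
    (0xC001, "handler_c001"),
]

efer = dispatch_wrmsr(0xC0000080, handlers)          # EFER
patch_loader = dispatch_wrmsr(0xC0010020, handlers)  # microcode patch loader
tsc = dispatch_wrmsr(0x00000010, handlers)           # TSC, neither namespace
```

In the real ROM the chain continues with further checks on the lower bits until a concrete per-register handler is reached; this sketch stops at the namespace level.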
18:48 - 18:49

OK, so now I showed you
18:49 - 18:52

how we actually went about
reconstructing the knowledge we
18:52 - 18:58

currently have. And now I'm going to show
you what we can actually do with it. And
18:58 - 19:02

for this I am going to quickly cover what
applications we wrote in microcode. We
19:02 - 19:05

wrote a simple configurable
rdtsc precision.
19:05 - 19:08

This means a certain bit mask is AND'd to
19:08 - 19:12

the result of rdtsc, so you can
reduce the accuracy of it, which can
19:12 - 19:18

sometimes prevent timing attacks. We also
implemented a microcode-assisted address
19:18 - 19:23

sanitizer, which I'll cover quickly in a
second. We also have some basic microcode
19:23 - 19:29

instruction set randomization. Some
microcode-assisted instrumentation. What
19:29 - 19:34

this means is, you can write a filter for
your instrumentation in microcode itself.
19:34 - 19:38

So instead of hooking an instruction,
instead of debugging your code or
19:38 - 19:42

emulating it, you can just say whenever
the instruction is executed filter if this
19:42 - 19:47

is relevant for me, and if it is, call my
x86 handler — entirely in microcode,
19:47 - 19:52

without changing the instruction in the
RAM. We also implemented some basic
19:52 - 20:00

authenticated microcode updates. The usual
update mechanism is weak — that's how we
20:00 - 20:05

got our foot in the door in the first
place. So we improved upon it a bit. Also
20:05 - 20:10

we found out that microcode actually has
some enclave-like features because once
20:10 - 20:14

we're executing in Microcode, your kernel
can't interrupt you, your hypervisor can't
20:14 - 20:19

interrupt you, and any state you want
visible to the outside world you actually
20:19 - 20:23

need to write explicitly. So all these
microcode internal registers are not
20:23 - 20:27

accessible from the outside world. So any
computation you perform in micro code
20:27 - 20:30

cannot be interfered with. So you can
implement a simple enclave on top of this
20:30 - 20:37

one. So our hardware-assisted address
sanitizer variant is based on the work by
20:37 - 20:42

the original authors. Address sanitizer
is a software instrumentation that detects
20:42 - 20:47

invalid memory accesses by using a shadow
map, or shadow memory, to say which memory
20:47 - 20:51

is valid to be read and written to.
20:51 - 20:54

The authors proposed hardware
address sanitizer
20:54 - 20:59

which is essentially doing the same checks
but using a new instruction. And the
20:59 - 21:04

instruction should raise a fault if an
invalid access is detected. This algorithm
21:04 - 21:08

they proposed - the details are not
important. What is important is that, in
21:08 - 21:12

essence, it's pretty simple. You load from
a certain address, perform some operations
21:12 - 21:19

on it, and if the shadow check fails after
these operations you just report a bug.
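Editor's sketch: the talk does not show the exact algorithm, so the following uses the shadow-byte check as publicly documented for AddressSanitizer (one shadow byte per 8 bytes of memory) and assumes the hardware variant follows the same idea. The dictionary-based shadow map and the example addresses are hypothetical.

```python
SHADOW_SCALE = 3  # one shadow byte describes 2**3 = 8 bytes of memory


def is_invalid_access(shadow, addr, size, shadow_offset=0):
    """Return True if accessing `size` bytes at `addr` should report a bug."""
    # Load the shadow byte for this 8-byte granule (0 = fully addressable).
    k = shadow.get((addr >> SHADOW_SCALE) + shadow_offset, 0)
    if k == 0:
        return False
    # k in 1..7: only the first k bytes of the granule are addressable.
    # k negative: the whole granule is poisoned (e.g. a redzone).
    return (addr & 7) + size - 1 >= k


# Shadow state for an 11-byte allocation at 0x1000 followed by a redzone.
shadow = {
    0x1000 >> SHADOW_SCALE: 0,   # bytes 0..7 valid
    0x1008 >> SHADOW_SCALE: 3,   # bytes 8..10 valid, rest invalid
    0x1010 >> SHADOW_SCALE: -1,  # redzone
}

ok_full = is_invalid_access(shadow, 0x1000, 8)     # inside the allocation
ok_tail = is_invalid_access(shadow, 0x1008, 2)     # still inside
bad_tail = is_invalid_access(shadow, 0x100A, 2)    # crosses the end
bad_redzone = is_invalid_access(shadow, 0x1010, 1) # hits the redzone
```

In the variant from the talk, a check like this would run inside the replaced bound instruction rather than as a sequence of x86 instructions in front of each memory access.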
21:19 - 21:25

Advantages of hardware address sanitizer
are for example you get better performance
21:25 - 21:29

out of it. Because you only have a single
instruction maybe you can do some fancy
21:29 - 21:34

tricks inside your CPU that are faster
than using x86 instructions, you get more
21:34 - 21:39

compact code and you have the possibility
of one time configuration which is a bit
21:39 - 21:45

hard with software address sanitizer. We
implemented our hardware address sanitizer
21:45 - 21:49

variant by replacing the bound instruction.
Bound is an old instruction that is no
21:49 - 21:55

longer used by compilers because in fact
it is slower to use bound instead of
21:55 - 21:59

performing the checks with multiple x86
instructions. We changed the interface.
21:59 - 22:04

The first argument is the register which
holds the address you want to access. And
22:04 - 22:08

the second argument holds the size you
want this access to be.
22:08 - 22:11

So, 1 byte, 2 bytes and so on.
22:11 - 22:15

This instruction is a no-op if the
check succeeds. So if there is no bug it
22:15 - 22:20

just continues on like nothing happened.
However if we detect an invalid access we
22:20 - 22:25

can take a configurable action, we can for
example just raise your normal page fault
22:25 - 22:30

or we can raise a bound interrupt, which
is a custom interrupt, that only denotes
22:30 - 22:34

this one or we can branch to an x86
handler that either performs additional
22:34 - 22:40

checking, for example whitelisting, or it
generates a pretty error report for you.
22:41 - 22:47

Most importantly this is a single
instruction. We also do not dirty any x86
22:47 - 22:53

registers. Normally there are some
intermediate results. You need to store
22:53 - 22:56

these somewhere and this you usually do in
the x86 registers. So you increase
22:56 - 23:00

register pressure. Maybe you cause
spilling. So overall your performance gets
23:00 - 23:07

worse. We also found out that we are
actually faster than doing the checking
23:07 - 23:12

using x86 instructions. So just by moving
the implementation from x86 level to
23:12 - 23:17

microcode, which in some way is still kind
of like software, we already improved the
23:17 - 23:22

performance. Also on top of this you get
better cache utilization because you have
23:22 - 23:27

less instructions, there are less bytes in
the cache, so we get fuller cache lines.
23:27 - 23:32

And also it is really easy to tell which
is testing code and which is your actual
23:32 - 23:40

program code. Lastly I'm going to show you
just a rough overview of our framework
23:40 - 23:46

which we used during our development and
which you can also find on GitHub. Early
23:46 - 23:50

on we found out that we are probably going
to need to test a lot of microcode
23:50 - 23:56

updates, because in the beginning you just
throw everything at the CPU and see how it
23:56 - 24:01

behaves and we wanted to do this in
parallel. So we developed a small custom
24:01 - 24:07

OS called "Angry OS" and deployed it to
mainboards. These mainboards are just old
24:07 - 24:13

AMD mainboards. All these mainboards were
hooked up via serial for communication and
24:13 - 24:19

GPIO to a Raspberry Pi. With the GPIO you
can reset, power on, power down
24:19 - 24:24

and just have remote control of this
mainboard and then you can connect to that
24:24 - 24:29

Raspberry Pi from anywhere on earth and
just deploy and play around with it.
24:29 - 24:31

This was the first version.
24:31 - 24:34

In the beginning we
didn't really know much about electronics
24:34 - 24:39

so we used one Raspberry Pi per mainboard.
And it turns out Raspberry Pis are more
24:39 - 24:44

expensive than these old mainboards, but
we improved upon this and now we're down
24:44 - 24:48

to one Raspberry Pi for
four / five setups.
24:48 - 24:52

For example you only need 3 GPIO ports per
24:52 - 24:57

mainboard. You connect each of these to
optocouplers just to separate the voltage
24:57 - 25:02

levels and then you connect one side of
the optocoupler to the GPIO the other side
25:02 - 25:06

to your reset pin, to your power pin and
for input to know whether your board is up
25:06 - 25:11

or down you connect the power LED. And
that way you can save a lot of space, a
25:11 - 25:17

lot of money. And also if you're really
constrained you can just remove the power
25:17 - 25:24

LED sensing because usually you know which
state your setup is in. As I
25:24 - 25:28

already said we wrote our custom operating
system and it is intentionally really
25:28 - 25:33

really minimal because the major feature
we wanted is control over every
25:33 - 25:37

instruction that's going to be executed
from a certain point on, because we're
25:37 - 25:41

playing around with instruction encoding
and if we execute an instruction that we
25:41 - 25:46

did not intend we might crash the CPU, we
might go into an invalid state and we do
25:46 - 25:51

not even know which instruction caused it.
And Angry OS essentially only listens on
25:51 - 26:00

the serial port for something to do. What
it can do is apply an update. These
26:00 - 26:05

updates are just microcode updates. They
are streamed via serial. We can also
26:05 - 26:10

stream x86 code which is then run by Angry
OS and this is just so that we do not need
26:10 - 26:14

to reflash the USB stick every time we
want to update our testing code and the
26:14 - 26:19

result, all the errors are reported back
to the Raspberry Pi and thus they are
26:19 - 26:27

forwarded to us. The framework we use most
importantly has the microcode assembler
26:27 - 26:31

and a pretty verbose disassembler. This
disassembler generates the output I showed
26:31 - 26:37

you earlier and using this you can just
quickly write your own microcode. We also
26:37 - 26:42

included an x86 assembler because we
wanted to rapidly test different x86
26:42 - 26:48

testing codes. Using this framework we
were able to disassemble the existing
26:48 - 26:54

updates and we also used it to disassemble
our ROM after we reordered it and also
26:54 - 27:01

during the process when we fed it to our
emulator. And we can also create the
27:01 - 27:08

proper binary files that can be loaded by
the Linux kernel driver. We modified the
27:08 - 27:13

stock one to just load any update you give
it without checking if it's the correct
27:13 - 27:20

CPU ID and all these things just for
testing purposes. It's also available. And
27:20 - 27:26

also of course the framework can control
Angry OS to make your testing easier. And
27:26 - 27:30

we implemented a pretty basic remote
execution wrapper, so you can work on a
27:30 - 27:33

remote Raspberry Pi as if you were using
it locally.
27:35 - 27:37

And this brings me to the end
27:37 - 27:41

of the talk. And in conclusion we can say
reversing the ROM opened up a lot of new
27:41 - 27:45

possibilities. We learned a lot about how
microcode works. We learned about how to
27:45 - 27:50

actually use it properly instead of just
inferring from a really small dataset,
27:50 - 27:55

that we have from the updates, or from the
random bits we sent to the CPU and
27:55 - 28:00

observed what happened. But there's a lot
left to do. So if you really want to hack
28:00 - 28:04

on it, just get in contact, we are happy
to share our findings with you. And as I
28:04 - 28:09

said the framework AngryOS, example
programs, that we implemented, and some
28:09 - 28:14

other stuff like the wiring are available
on GitHub. So that's that. And we are
28:14 - 28:17

happy to answer any questions you might
have.
28:17 - 28:22

applause
28:25 - 28:28

Herald Angel: Thank you very much. So we
28:28 - 28:34

have 10 minutes for questions please line
up at the microphones. We start with this
28:34 - 28:39

one: microphone number 2.
M2: Hi. Thanks for a nice talk. A few
28:39 - 28:43

questions about your hardware address
sanitizer.
28:43 - 28:50

Benjamin: Mhm
M2: As I understand you don't need the
28:50 - 28:56

source code instrumentation because the
microcode is responsible for checking the
28:56 - 29:03

shadow memory, right?
Benjamin: No... The original hardware
29:03 - 29:08

sanitizer implementation is also based on
a compiler extension, that inserts a new
29:08 - 29:12

instruction because it doesn't exist
usually. And it also inserts a bootstrap
29:12 - 29:18

code that inits your shadow map and
also instruments your allocators to update
29:18 - 29:23

the shadow map during runtime and we
essentially need the same component, but
29:23 - 29:27

we do not need the software address
sanitizer component that essentially
29:27 - 29:34

inserts 10 or 20 x86 instructions before
every memory access. So yes we still need
29:34 - 29:38

a compile time component and we are still
source code based in a sense.
29:39 - 29:46

Herald: And, so..
M2: And I didn't see, maybe I missed the
29:46 - 29:51

numbers. How much faster is it than this
initial version?
29:51 - 29:56

Benjamin: You mean the initial hardware
sanitizer version or the software address
29:56 - 30:00

sanitizer?
M2: I mean let's say custom kernel address
30:00 - 30:05

sanitizer for Linux kernel which is the
usual one and your approach.
30:05 - 30:10

Benjamin: We only performed a micro
benchmark on Angry OS and we essentially
30:10 - 30:16

took the instrumentation as emitted by the
compiler for some memory access which is
30:16 - 30:21

your standard software address sanitizer
and compared it to our version using only
30:21 - 30:25

the modified bound instruction. So I
really can't talk about how it compares to
30:25 - 30:29

KASAN or some other real-world
implementation, because we only have the
30:29 - 30:34

prototype and the basic instrumentation.
M2: Thank you very much.
30:34 - 30:36

Herald Angel: OK. Microphone number 4
please.
30:36 - 30:51

M4: Hey thanks for the talk and did you
find any weird microcode
30:51 - 31:01

implementations? I don't mean security
wise, just like you rarely expected to
31:01 - 31:07

see it be implemented that way.
31:09 - 31:12

Benjamin: The problem is there's a lot of
31:12 - 31:20

microcode to begin with. You have f000
triads. Each of which has 3 op-codes. So
31:20 - 31:25

you have a lot of ground to cover and also
we have read-out errors. Sometimes you are
31:25 - 31:29

seeing bit flips, which kind of slows you
down because you then need to always
31:29 - 31:33

consider: OK, maybe this register is
something else, maybe this address is
31:33 - 31:37

wrong. And also sometimes you have dust
particles that kind of knock out an
31:37 - 31:43

entire region. So we only looked at the
components, we were pretty sure that we
31:43 - 31:47

recovered correctly, and we only looked
at a really tiny subset compared to all of
31:47 - 31:53

the microcode ROM. It's just not feasible
to go through it all and look at
31:53 - 31:57

everything. So no we didn't find anything
funny but we also wouldn't know what funny
31:57 - 32:01

looks like because we don't know what the
official spec for microcode is.
32:01 - 32:04

M4: Thanks.
Herald Angel: Interesting. We have one
32:04 - 32:06

question from the Internet, from the
32:06 - 32:10

Signal Angel please.
Signal Angel: Yes. Which AMD CPU
32:10 - 32:16

generations does this apply to?
Benjamin: Yeah this is still based on the
32:16 - 32:21

work of our first talk and this only works
on pretty old ones: K8, K10. So,
32:21 - 32:27

CPUs produced until 2013. Yeah this was
the last year AMD produced anything like
32:27 - 32:33

that. Newer ones use some public key based
cryptography from what we can tell and we
32:33 - 32:37

haven't yet managed to break it. Same goes
for Intel, they seem to be using public
32:37 - 32:40

key cryptography and we haven't gotten a
foot in the door yet.
32:41 - 32:45

Herald Angel: Thank you. We go around once more.
On microphone number 3 please.
32:45 - 32:51

M3: Yeah. Thank you. I would like to know
how complex could the microcode programs
32:51 - 32:59

be, that you could write. So what's the
complexity of new operations you could
32:59 - 33:03

implement.
Benjamin: The only limiting factor is the
33:03 - 33:08

size of your microcode update RAM. But
this one is really really limited.
33:08 - 33:13

For example on K8, where we performed the
majority of our experiments. We are
33:13 - 33:19

limited to 32 triads, which comes down to
96 instructions, and you also
33:19 - 33:22

have some constraints on these
instructions for example the next triad
33:22 - 33:28

will always be executed no matter what.
Some operations can only go at the second
33:28 - 33:34

slot. Some can only go on another slot, so
it's really really hard. And you're also
33:34 - 33:39

limited, to our knowledge, to loading 16
bit immediates instead of 32 bit or even
33:39 - 33:44

64 bit immediates. So your whole program
grows really fast if you're trying to do
33:44 - 33:49

something complex. For example our
authenticated microcode update mechanism
33:49 - 33:54

is the most complex one we wrote; it nearly
fills out the RAM and we used TEA – Tiny
33:54 - 33:59

Encryption Algorithm – because that was
the only one we managed to fit, mostly because
33:59 - 34:05

of the S-boxes and other constants that other ciphers would need
to load. So it's really small.
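
For reference, this is a minimal Python sketch of the standard TEA block cipher, shown to illustrate why it suits such a tight budget: it needs no S-box tables and only one constant, `DELTA`. The talk does not say how TEA is wired into the authenticated update mechanism, so this is only the textbook algorithm, not the authors' microcode. The function names and the default of 32 rounds follow the usual reference description and are assumptions here.

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9  # TEA's only magic constant; no lookup tables are needed


def tea_encrypt(block, key, rounds=32):
    """Encrypt one 64-bit block (two 32-bit words) with a 128-bit key (four words)."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(rounds):
        total = (total + DELTA) & MASK32
        # Each half is updated using only shifts, additions and XORs.
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1


def tea_decrypt(block, key, rounds=32):
    """Invert tea_encrypt by undoing the rounds in reverse order."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = (DELTA * rounds) & MASK32
    for _ in range(rounds):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        total = (total - DELTA) & MASK32
    return v0, v1
```

With only 16-bit immediates available, even the 32-bit `DELTA` would presumably have to be assembled from two loads, which gives a sense of how quickly table-based ciphers would exhaust 32 triads.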
34:05 - 34:09

Herald Angel: Thank you. Microphone number
1.
34:09 - 34:15

M1: So you said the microcode is used for
instruction decoding and it needs to emit
34:15 - 34:19

the micro-ops to the scheduler and micro
queue in some way. Did you find out how
34:19 - 34:28

that works?
Benjamin: In essence we are not actually
34:28 - 34:34

executing code inside the microcode engine.
From what we understand, the
34:34 - 34:39

microcode engine is just some kind of a
software based recipe, that describes how
34:39 - 34:43

to decode an instruction, so you don't
actually get execution, you just commit
34:43 - 34:47

instructions into the pipelines, that do
what you want. And because we have some
34:47 - 34:51

control flow possibility, that is actually
inside the microcode engine, because you
34:51 - 34:55

can branch to different addresses, you can
conditionally branch and loop. You kind of
34:55 - 34:59

get an execution, but in essence you just
commit stuff in the pipeline and the CPU
34:59 - 35:01

does what you tell it to.
35:04 - 35:07

Herald Angel: One more question.
Microphone number 2, please.
35:07 - 35:12

M2: How did you take the picture of the
internal CPU? Did you open it?
35:12 - 35:15

Benjamin: Yeah. We worked together with
35:15 - 35:20

Chris. He's our hardware guy. He has
access to his equipment to delayer it and
35:20 - 35:24

to take high resolution optical shots and
he also takes shots with a scanning
35:24 - 35:29

electron microscope. So I think about five
or six CPUs were harmed in the making of
35:29 - 35:30

this paper.
35:34 - 35:38

Herald Angel: So we have one more last
question. Microphone number 2 please.
35:39 - 35:41

M2: Are you aware of research done by
35:41 - 35:49

Christopher Domas, where he mapped out the
instruction set for x86 processors?
35:49 - 35:57

Benjamin: You mean sandsifter? We
actually talked with him and yeah we are
35:57 - 36:03

aware, that there's a map essentially of
the instruction set and also maybe you can
36:03 - 36:07

combine it, because in the beginning we
reverse engineered where certain x86
36:07 - 36:11

instructions are implemented in microcode.
So if you plug these two together you kind
36:11 - 36:15

of map out the whole microcode ROM at the
same time that you map out a whole
36:15 - 36:19

instruction set. However there are some
components of the microcode ROM that are
36:19 - 36:23

most likely not triggered by instructions.
For example it seems like power management
36:23 - 36:27

or everything that is behind a write MSR
[wrmsr] or read MSR [rdmsr]. wrmsr is a
36:27 - 36:31

single instruction, but depending on the
arguments you give it, it just branches to
36:31 - 36:36

totally different triads, and the microcode update
itself is implemented in microcode. And
36:36 - 36:40

this one is a huge chunk you wouldn't even
find without brute forcing all
36:40 - 36:44

combinations for all instructions which is
not really feasible.
36:46 - 36:51

Herald Angel: Thank you. Thank you
Benjamin.
36:51 - 36:57

Applause
36:57 - 37:02

35C3 postroll music
37:02 - 37:21

subtitles created by c3subtitles.de
in the years 2019-2020. Join, and help us!

Title:: 35C3 - Inside the AMD Microcode ROM
Video Language:: English
Duration:: 37:21
