35C3 preroll music
Herald: So the next talk is by Benjamin Kollenda and Philipp Koppe. They will refresh our memories, because they already had a talk at 34C3 where they talked about the microcode ROM, and today they're gonna give us more insights into how microcode works and more details on the ROM itself. Benjamin is a PhD student with a focus on software attacks and defenses, and together with Philipp he will now abuse AMD microcode for fun and security. Please enjoy.
Applause
Benjamin: Thank you. So as mentioned we
were able to reverse engineer the AMD
microcode and the AMD microcode ROM and
I'm going to talk about our journey, what we learned on the way, and how we did it. This is joint work with my colleagues at Ruhr-Universität Bochum. A quick outline of how we are going to do this: we're going to start with a quick crash course on microarchitectural basics and what microcode actually is. Then I talk about how we reconstructed the microcode ROM and what we learned along the way. Then I quickly give some examples of the applications we implemented with the knowledge we gained from the second step. And lastly I talk about the framework we used, how it works and what we can do with it. This framework is also available on GitHub along with some other tools, so you're free to continue our work. OK. So when I'm talking about
microcode you can think of it essentially
as a firmware for your processor. It serves multiple purposes. For example, you can use it to fix CPU bugs that you have in silicon and want to fix later, after the design phase. It is used for instruction decoding, which I'll cover in a bit more detail. It is also used for exception handling: for example, if an exception or interrupt is raised, microcode has a first chance of modifying this interrupt, ignoring it, or just passing it along to the operating system. It's also used for power management and some other complex features like Intel SGX. And most importantly for us, microcode is updatable. This is used to patch errors in the field; everyone remembers the Spectre/Meltdown patches, and there is a microcode update for those. So your
x86 CPU takes multiple steps to execute an
instruction. The first step is decoding an x86 instruction into multiple smaller micro-ops. These are then scheduled into the pipeline, and from there they are dispatched to the different functional units like your ALU, AGU, or multiplication and division units. For our purposes the decode step is the most interesting one. In the decode step you have an instruction buffer that feeds instructions to several decoders. You have short decoders that handle really simple instructions. There are long decoders that can handle some more advanced instructions.
And finally, the vector decoder. The
vector decoder handles the most complex
instructions with the help of microcode.
So the microcode engine is essentially the
vector decoder.
The microcode engine, in essence, is comprised of a microcode ROM that stores the instructions for the microcode engine; think of it as your standard instructions. Then there is also a writable memory, the microcode RAM. This is where the microcode updates end up when you apply them. And of course, around this storage there is a whole lot of logic that makes it actually run. For this talk, you only need to know what Match Registers are. Match Registers are essentially breakpoint registers: if we write an address from inside the microcode ROM into a Match Register, then whenever this address is fetched, control is transferred to the microcode RAM so our
patch gets executed. The microcode updates are usually loaded by the BIOS or by the kernel. Linux has an update driver, and sometimes the BIOS applies a pre-installed version. The updates have a pretty simple structure: a partially documented header, followed by the actual microcode that is loaded into the CPU.
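To make the Match Register mechanism just described a bit more concrete, here is a tiny conceptual Python model; the addresses and contents are invented for illustration and this is not the real hardware interface:

# Conceptual model only: match registers act like breakpoints on
# microcode ROM addresses. All addresses and contents here are made up.
microcode_rom = {0x0940: "original ROM triad", 0x0941: "next ROM triad"}
microcode_ram = {0x0000: "patch triad loaded via a microcode update"}

# Hypothetical: one match register redirecting ROM address 0x0940 to RAM slot 0.
match_registers = {0x0940: 0x0000}

def fetch_triad(ucode_pc):
    # If a match register hits, control is transferred to the patch RAM,
    # so the update's triads run instead of the original ROM triads.
    if ucode_pc in match_registers:
        return microcode_ram[match_registers[ucode_pc]]
    return microcode_rom[ucode_pc]

print(fetch_triad(0x0940))  # -> the patched triad
print(fetch_triad(0x0941))  # -> the original ROM triad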
Microcode is organized in something called triads. Each triad has three operations, which are essentially like x86 instructions but with some differences.
And lastly, you have a sequence word. The
sequence word indicates which microcode
instructions should be executed next. We
have options of executing just the next
triad, executing another one by branching
to it, or just saying: OK, I'm done with decoding this instruction, continue with x86 code. These updates are protected by some weak authentication which we were able to break, so we can create our own, we can analyze existing ones, and we can apply them to your standard laptop and desktop. However, there can only ever be one update loaded at a time, and when you reboot your machine this update will be gone.
Also for the talk we are going to look at
some microcode and we will present this
microcode using a register transfer
language. It is heavily based on x86. I'm
just going to cover the differences
between these two. Most importantly the
microcode can have three operands for an
instruction in comparison to x86 which
usually only has two. So you can specify a
destination and two source operands.
Also, microcode has certain bit flags that need to be set, and we denote these with annotations: for example, ".C" means the instruction also updates the carry flag based on the result. Then you have the instruction "jcc", which is a conditional branch. The first operand denotes the condition upon which this branch is taken, in this case branch if the carry flag is one, and the second operand indicates the offset to add to the instruction pointer. Then we also have some sequence word annotations: "next", "complete", and "branch". It should also be noted that the internal microcode architecture is a load-store architecture. You can't use memory operands in other instructions like you can on x86; you always need to load and store memory explicitly.
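To illustrate the structure described above, here is a small Python sketch of how one might model a triad and its sequence word; the operation names and fields are simplified illustrations, not the actual encoding:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MicroOp:
    mnemonic: str               # e.g. "add", "srl", "jcc"
    operands: List[str]         # up to three: destination plus two sources
    update_carry: bool = False  # the ".C" annotation

@dataclass
class Triad:
    ops: List[MicroOp]          # always three operations
    sequence: str               # "next", "complete", or "branch"
    branch_target: Optional[int] = None

# A made-up triad: an add that updates the carry flag, a conditional
# branch on that flag, and a filler operation.
example = Triad(
    ops=[
        MicroOp("add", ["t1", "t2", "t3"], update_carry=True),  # t1 = t2 + t3
        MicroOp("jcc", ["carry_set", "+2"]),                    # branch if carry is one
        MicroOp("nop", []),
    ],
    sequence="complete",        # hand decoding back to x86 after this triad
)
print(example)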
Now we are going to talk about
how we manage to recover the microcode
ROM. The microcode ROM is baked into your
CPU, you can't change it anymore. It is
defined in the silicon during the
fabrication process and in this picture
you can see a die shot taken with an electron microscope, and this is one of three regions that contain the bits for the microcode operations. And if you zoom in a bit more, each of these regions consists of four arrays, and these are
further subdivided into blocks. Really
interesting is "Array 2" which is a bit
smaller than the other ones but it has
some structures above it which are of a
different visual layout. This is SRAM
which stores the microcode update. So this is reprogrammable memory that is still pretty fast. So the microcode RAM is
located right next to the microcode ROM
which also makes sense from a design
standpoint.
Here is just an overview of how we went about it. We started with pictures, and then we used an OCR-like process to transform them into bit strings, which we could then further process. These bitstrings were then arranged into triads. We could already tell that we got the individual triads right, because within them there were data dependencies all over the place, but between triads there were no or very few data dependencies. So the ordering of the triads was still wrong, and this was a major problem. What we had to reverse engineer was the mapping from a certain physical address of a triad, as gathered from the ROM readout, to the virtual address that is used inside the microcode update or the microcode ROM. After reverse engineering this, you can just do a linear sweep disassembly of the microcode ROM and arrive at human-readable output. But this
recovery was a bit tricky, because we required physical-virtual address pairs. Gathering these is a bit harder: we worked through the available updates, but we could only find two such pairs. These pairs were actually easy to find, because every update replaces a certain triad inside your microcode ROM, and this triad is usually also placed in the microcode update. So by matching the address this update replaces against the microcode ROM readout, you can just get these two data points. But we had to get more data points, so we generated more of these mappings by matching the semantics of triads in the microcode ROM readout against the semantics observed when we force execution of a
certain microcode address. To gather the semantics of the read-out microcode, we implemented a simple microcode simulator. Essentially it works on the triad level: you give it an input state and a triad, and it calculates the resulting output state. Input and output state are comprised of the x86 state, which is your standard registers, and also the internal microcode registers. There are multiple temporary registers that get reset for every new x86 instruction that is executed, but they can of course also be modified by microcode. Our emulator supports all known arithmetic operations, and we have a whitelist of operations that do not produce any observable change in state, just so that we could process more triads and gain more data points. In total we gathered 54 additional data-address pairs, which turned out to be enough to recover the whole mapping. This mapping: essentially, you have the four different arrays that map to individual blocks, these blocks in the arrays are then again permuted a bit, and then the triads inside these blocks have some table-based permutation. So this is not an obfuscation; it is just that, from a hardware design standpoint, it can make sense to route things a bit differently.
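As a rough illustration of how such a triad-level emulator works, here is a heavily reduced Python sketch; the operation set and state layout are simplified placeholders rather than our actual implementation, but matching emulated output states against the states observed on the real CPU is the idea behind the semantic matching described above:

# State is just a register-name -> value mapping covering x86 registers
# and the internal temporaries (t0, t1, ...).
def emulate_op(state, op):
    mnemonic, dst, src1, src2 = op
    a, b = state.get(src1, 0), state.get(src2, 0)
    if mnemonic == "add":
        state[dst] = (a + b) & 0xFFFFFFFF
    elif mnemonic == "sub":
        state[dst] = (a - b) & 0xFFFFFFFF
    elif mnemonic == "and":
        state[dst] = a & b
    elif mnemonic == "srl":
        state[dst] = a >> (b & 31)
    else:
        raise NotImplementedError(mnemonic)

def emulate_triad(state, triad):
    # Returns the output state for a given input state and triad.
    out = dict(state)
    for op in triad:
        emulate_op(out, op)
    return out

start = {"eax": 6, "ecx": 3, "t9": 0}
triad = [("add", "t9", "eax", "ecx"),
         ("srl", "t9", "t9", "ecx"),
         ("and", "eax", "t9", "eax")]
print(emulate_triad(start, triad))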
Also now that we can actually
map a certain address to the microcode ROM
readout and we know the addresses of
different x86 instructions from our
earlier experiments, we can look at the
implementation of instructions. So let's
start with a pretty simple one: Shift-Right-Double, which essentially takes a register, shifts it by a given amount, and shifts in bits from another register. So of course you would expect a lot of shifts and rolls in its implementation, and this is exactly what we're seeing here. You have two shift-right operations, and you can see regmd6 and regmd4. These are placeholders: the microcode engine can replace certain bit combinations with the registers that are used in the x86 operation. For example, this one would be replaced by ECX or EAX depending on what you wrote in x86. And at this point we can already gather more information about microcode than we previously knew, because we know: OK, this is a source, this is also a source, and this is a destination. But the source which indicates the shift amount was previously unknown, because it is a high temporary microcode register, and we found out that these usually serve specific purposes. If you write to them, sometimes the CPU behaves erratically, sometimes it crashes, sometimes nothing happens. But in this case, this seems to be the shift count, and the shift count is given by a third operand in the instruction. So in this case we already learned: OK, if you want to read the third operand of an instruction, you need to read t41. And this is how we went about recovering more and more information about microcode. The rest of the implementation is essentially concerned with implementing the rest of the semantics of the x86 instruction and updating the flags correctly.
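For reference, the x86 semantics that this microcode implements boil down to roughly the following Python sketch (32-bit case, flag updates omitted):

def shrd32(dst, src, count):
    # SHRD dst, src, count: shift dst right by count and fill the freed
    # upper bits with the low bits of src (count is taken modulo 32).
    count &= 31
    if count == 0:
        return dst
    return ((dst >> count) | (src << (32 - count))) & 0xFFFFFFFF

# Example: shift 0x00000001 right by 4 and shift in the low bits of 0xF000000F.
print(hex(shrd32(0x00000001, 0xF000000F, 4)))  # 0xf0000000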
OK, so now let's look at an instruction that is a bit more complicated: rdtsc. rdtsc returns an internal cycle counter in EDX and EAX, so the upper part ends up in EDX, the lower part in EAX. So in
the end we want to see writes to these
registers, potentially with a shift
somewhere in there. But somewhere the CPU
needs to gather the cycle counter. So in
the beginning we have two load-style
operations. This one is a proper load
which we identified and this one is
unknown. But even though we do not know the instruction, we know the target, because the result of this instruction will end up in t9 and the result of this instruction will end up in t10, so we can follow the uses of these two registers. For simplicity I'm going to start with t10. t10, as we later found out, is loaded with an operand that essentially denotes a specific internal register. And if you
play around with these bits, you notice that this combination encodes cr4; on the x86 level you will just see cr4. You can also address cr1 and cr2. And if you look further, t10 is then ANDed with this bit mask, and if you look in the manual, you find out that this bit in cr4 is the bit that determines whether rdtsc is available from user space or not. So this
is the check if this instruction should be
executed. So now let's just keep in mind
that t9 holds some other loaded value from
some other internal register and we will
come back to this one a bit later. For
now, let's follow execution. This triad is
essentially a padding triad. It is a
common pattern we see. So let's look at
where this branch takes us.
And this branch
takes us to a conditional branch
triad. And if you look a bit up, this AND instruction actually updated this flag. So
this is a conditional branch that
determines whether this check was
successful or not. So it branches toward
the error triad or the success triad. But
here we already see the exit. We see a write to RDX, or EDX in this case, with a shift of t9 by 32 bits, which is exactly what you would expect in order to write the upper 32 bits of the time stamp counter to EDX. And you have an unknown instruction, but we know: okay, we move something from t9 to EAX, which is the lower 32 bits. But we're not done
here, because we can still look at the
error path that is taken if the access is denied. If you scroll a bit down, we can see a move of an immediate into a certain internal register, and this immediate actually encodes the general protection fault interrupt code. The value 0xD denotes to the exception handler that this was a general protection fault. Later this triad branches to this address, and if you look at the uses of this address, we can find other immediates that also correspond
to x86 interrupts. So now we learned how we can actually raise our own interrupts: we just need to load the code we want into the specific register and branch to this address.
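Putting these pieces together, the behaviour this microcode implements corresponds roughly to the following Python model. The constants are architectural facts (CR4.TSD is bit 2, the general protection fault is vector 0xD); everything else is a simplified illustration, not the actual microcode:

CR4_TSD = 1 << 2          # Time Stamp Disable bit in CR4
GP_FAULT_VECTOR = 0xD     # general protection fault

def rdtsc_model(cr4, cpl, tsc, raise_interrupt):
    # If TSD is set, rdtsc is only allowed in ring 0; otherwise raise #GP.
    if (cr4 & CR4_TSD) and cpl != 0:
        raise_interrupt(GP_FAULT_VECTOR)
        return None
    # Otherwise split the 64-bit counter into EDX:EAX.
    edx = (tsc >> 32) & 0xFFFFFFFF
    eax = tsc & 0xFFFFFFFF
    return edx, eax

def fake_raise(vec):
    print("raised interrupt", hex(vec))

print(rdtsc_model(cr4=CR4_TSD, cpl=3, tsc=0x123456789A, raise_interrupt=fake_raise))
print(rdtsc_model(cr4=0, cpl=3, tsc=0x123456789A, raise_interrupt=fake_raise))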
And now we learned a lot about how we can actually write microcode, but it's also interesting to see how certain instructions are implemented. So let's look at a pretty complicated one: wrmsr
(Write MSR). wrmsr essentially writes some data it is given to a model-specific register. These model-specific registers differ between CPUs, between vendors, sometimes between revisions, and they implement non-standard extensions or pretty complex features. For example, you trigger a microcode update by writing to a model-specific register. The register address you want to write to is given in ECX. And now we can see that ECX is read and shifted by sixteen bits into t10. So again we follow the uses of t10, and we see it is XOR'd with a certain bitmask. This bitmask is C000, which actually denotes a namespace of the model-specific registers; in this case this should be an AMD-specific namespace. And, of course, this again sets some flags, and you can see a conditional branch depending on these flags to what should be the handler for this namespace. Next one: we have another XOR that uses a different bit mask, in this case C001. C001 is the namespace where the microcode update routine is actually located. So again, we branch to this handler. And if you just continue on, there are more operations on rcx, followed by more branches, and this continues until everything is dispatched to the correct handler. And this is how wrmsr is implemented internally. Read MSR is going to be implemented pretty similarly, because it does a similar kind of thing.
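As a rough Python model of this dispatch, assuming the namespace really is just the upper 16 bits of the MSR address in ECX (the XOR-and-test-for-zero in microcode is equivalent to the equality checks here); the handler names are made up:

def wrmsr_dispatch(ecx):
    # The upper 16 bits of the MSR address select a namespace; the
    # microcode compares them against known namespaces one after another.
    namespace = (ecx >> 16) & 0xFFFF
    if namespace == 0xC000:
        return "handler_amd_c000"      # e.g. SYSCALL-related MSRs
    if namespace == 0xC001:
        return "handler_amd_c001"      # the namespace with the microcode update routine
    if namespace == 0x0000:
        return "handler_architectural"
    return "handler_invalid_msr"       # ends in a fault in the real thing

print(wrmsr_dispatch(0xC0010020))      # an MSR in the C001 namespace
print(wrmsr_dispatch(0x00000010))      # an MSR in the architectural namespace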
OK, so now I showed you
how we actually went about reconstructing the knowledge we currently have. Now I'm going to show you what we can actually do with it, and for this I am going to quickly cover the applications we wrote in microcode. We wrote a simple configurable rdtsc precision reduction. This means a certain bit mask is ANDed to the result of rdtsc, so you can reduce its accuracy, which can sometimes prevent timing attacks (see the short sketch after this overview). We also implemented a microcode-assisted address sanitizer, which I'll cover quickly in a second. We also have some basic microcode instruction set randomization and some microcode-assisted instrumentation. What this means is that you can write a filter for your instrumentation in microcode itself. So instead of hooking an instruction, instead of debugging your code or emulating it, you can just say: whenever this instruction is executed, filter whether it is relevant for me, and if it is, call my x86 handler, entirely in microcode, without changing the instruction in RAM. We also implemented some basic authenticated microcode updates. The usual update mechanism is weak, that's how we got our foot in the door in the first place, so we improved upon it a bit. Also
we found out that microcode actually has some enclave-like features, because once you're executing in microcode, your kernel can't interrupt you, your hypervisor can't interrupt you, and any state you want visible to the outside world you actually need to write out explicitly. All these microcode-internal registers are not accessible from the outside world, so any computation you perform in microcode cannot be interfered with. So you can implement a simple enclave on top of this one.
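Coming back to the configurable rdtsc precision reduction mentioned earlier, the effect on the returned value is essentially the following; the mask width is the configurable part and the number here is just an example:

def degrade_tsc(tsc, low_bits_to_drop=12):
    # AND the counter with a mask so the low bits always read as zero,
    # which coarsens the timer resolution seen by the caller.
    mask = ~((1 << low_bits_to_drop) - 1) & 0xFFFFFFFFFFFFFFFF
    return tsc & mask

print(hex(degrade_tsc(0x123456789ABC)))  # -> 0x123456789000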
So our hardware-assisted address sanitizer variant is based on the work by the original authors. Address sanitizer is a software instrumentation that detects invalid memory accesses by using a shadow map, a shadow memory, to record which memory is valid to be read and written to.
The authors proposed hardware
address sanitizer
which is essentially doing the same checks
but using a new instruction, and the instruction should raise a fault if an invalid access is detected. They proposed an algorithm for this; the details are not important. In essence it's pretty simple: you load the shadow for a certain address, perform some operations on it, and if the shadow indicates an invalid access, you just report a bug.
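For context, here is a minimal Python sketch of the usual address sanitizer shadow check (8 application bytes per shadow byte); the shadow offset and memory mapping are omitted, and this is the generic scheme rather than our exact microcode implementation:

POISON = 0xFA  # one of the poison markers real ASan uses

def check_access(addr, size, shadow):
    # shadow maps a granule index (addr >> 3) to its shadow byte:
    # 0 = all 8 bytes addressable, 1..7 = only the first N bytes, else poisoned.
    s = shadow.get(addr >> 3, POISON)
    if s == 0:
        return True
    if 1 <= s <= 7:
        return (addr & 7) + size - 1 < s
    return False

shadow = {0x20: 0, 0x21: 4}            # toy shadow: one full granule, one partial
print(check_access(0x100, 8, shadow))  # True: 0x100..0x107 fully addressable
print(check_access(0x10C, 4, shadow))  # False: runs past the 4 valid bytes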
Advantages of a hardware address sanitizer are, for example, that you get better performance out of it: because you only have a single instruction, maybe you can do some fancy tricks inside your CPU that are faster than using x86 instructions. You get more compact code, and you have the possibility of one-time configuration, which is a bit hard with software address sanitizer. We
implemented our hardware address sanitizer variant by replacing the bound instruction. Bound is an old instruction that is no longer used by compilers, because it is in fact slower to use bound than to perform the checks with multiple x86 instructions. We changed the interface.
The first argument is the register which
holds the address you want to access. And
the second argument holds the size you
want this access to be.
So, 1 byte, 2 bytes, and so on. This instruction is a no-op if the check succeeds; if there is no bug, it just continues on like nothing happened. However, if we detect an invalid access, we can take a configurable action: we can, for example, just raise a normal page fault, or we can raise a bound interrupt, which is a separate interrupt that only denotes this case, or we can branch to an x86 handler that either performs additional checking, for example whitelisting, or generates a pretty error report for you.
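Purely as a behavioural sketch of the repurposed bound instruction, reusing the shadow-check idea from above; the mode names and the use of Python exceptions to stand in for hardware faults are made up for illustration:

PAGE_FAULT, BOUND_FAULT, CALL_HANDLER = range(3)   # configurable reactions

def checked_bound(addr, size, shadow_ok, mode, x86_handler):
    # shadow_ok is the shadow-memory check (as sketched above).
    if shadow_ok(addr, size):
        return            # no-op: valid access, execution just continues
    if mode == PAGE_FAULT:
        raise MemoryError("page fault at %#x" % addr)
    if mode == BOUND_FAULT:
        raise RuntimeError("#BR: invalid %d-byte access at %#x" % (size, addr))
    x86_handler(addr, size)   # e.g. whitelist check or pretty error report

# Toy usage: everything below 0x1000 is considered invalid in this model.
ok = lambda addr, size: addr >= 0x1000
checked_bound(0x2000, 4, ok, PAGE_FAULT, print)   # fine, returns silently
checked_bound(0x0800, 4, ok, CALL_HANDLER, lambda a, s: print("handler:", hex(a), s))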
Most importantly this is a single
instruction. We also do not dirty any x86 registers: otherwise there are some intermediate results, you need to store these somewhere, and you usually do this in x86 registers, so you increase register pressure, maybe cause spilling, and overall your performance gets worse. We also found out that we are actually faster than doing the checking with x86 instructions. So just by moving the implementation from the x86 level to microcode, which in some way is still kind of like software, we already improved the performance. On top of this you get better cache utilization, because you have fewer instructions and fewer bytes in the cache, so you get fuller cache lines.
And also it is really easy to tell which
is testing code and which is your actual
program code. Lastly I'm going to show you
just a rough overview of our framework
which we used during our development and
which you can also find on GitHub. Early
on we found out that we are probably going
to need to test a lot of microcode
updates, because in the beginning you just
throw everything at the CPU and see how it
behaves and we wanted to do this in
parallel. So we developed a small custom
OS called "Angry OS" and deployed it to
mainboards. These mainboards are just old
AMD mainboards. All these mainboards were
hooked up via serial for communication and via GPIO to a Raspberry Pi. With the GPIO you can reset the board, power it on, power it down, and just have remote control of the mainboard. You can then connect to that Raspberry Pi from anywhere on earth and just deploy and play around with it.
This was the first version.
In the beginning we
didn't really know much about electronics
so we used one Raspberry Pi per mainboard.
And it turns out Raspberry Pis are more
expensive than these old mainboards, but
we improved upon this and now we're down to one Raspberry Pi for four or five setups. You only need three GPIO ports per mainboard. You connect each of these to optocouplers, just to separate the voltage levels, and then you connect one side of the optocoupler to the GPIO and the other side to your reset pin or your power pin; and for input, to know whether your board is up or down, you connect the power LED. And
that way you can save a lot of space, a
lot of money. And also if you're really
constrained you can just remove the power
LED sensing, because usually you know which state your setup is in.
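A minimal sketch of the Raspberry Pi side using the common RPi.GPIO library; the pin numbers, pulse lengths and the power-LED polarity are assumptions that depend on your wiring through the optocouplers:

import time
import RPi.GPIO as GPIO   # standard RPi.GPIO package

RESET_PIN, POWER_PIN, LED_SENSE_PIN = 17, 27, 22   # hypothetical BCM pin numbers

GPIO.setmode(GPIO.BCM)
GPIO.setup(RESET_PIN, GPIO.OUT, initial=GPIO.LOW)
GPIO.setup(POWER_PIN, GPIO.OUT, initial=GPIO.LOW)
GPIO.setup(LED_SENSE_PIN, GPIO.IN)

def pulse(pin, seconds=0.3):
    # Driving the optocoupler briefly shorts the mainboard's front-panel pins.
    GPIO.output(pin, GPIO.HIGH)
    time.sleep(seconds)
    GPIO.output(pin, GPIO.LOW)

def board_is_up():
    # The board's power LED is wired back through an optocoupler as an input.
    return GPIO.input(LED_SENSE_PIN) == GPIO.HIGH

def power_cycle():
    if board_is_up():
        pulse(POWER_PIN, seconds=5)   # long press to force power off
    pulse(POWER_PIN)                  # short press to power on again

power_cycle()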
As I already said, we wrote our custom operating system, and it is intentionally really, really minimal, because the major feature we wanted is control over every instruction that's going to be executed
from a certain point on, because we're
playing around with instruction encoding
and if we execute an instruction that we did not intend, we might crash the CPU, we might go into an invalid state, and we would not even know which instruction caused it.
And Angry OS essentially only listens on
the serial port for something to do. What
it can do is apply an update. These
updates are just microcode updates. They
are streamed via serial. We can also
stream x86 code which is then run by Angry
OS; this is just so that we do not need to reflash the USB stick every time we want to update our testing code. The results and all the errors are reported back to the Raspberry Pi and from there they are forwarded to us. The framework we use most
importantly has the microcode assembler
and a pretty verbose disassembler. This
disassembler generates the output I showed
you earlier and using this you can just
quickly write your own microcode. We also
included an x86 assembler because we
wanted to rapidly test different x86
testing codes. Using this framework we
were able to disassemble the existing
updates and we also used it to disassemble
our ROM after we reordered it and also
during the process when we fed it to our
emulator. And we can also create the
proper binary files that can be loaded by
the Linux kernel driver. We modified the
stock one to just load any update you give
it without checking if it's the correct
CPU ID and all these things just for
testing purposes. It's also available. And
also of course the framework can control
Angry OS to make your testing easier. And
we implemented a pretty basic remote
execution wrapper, so you can work on a
remote Raspberry Pi as if you were using
it locally.
And this brings me to the end of the talk. In conclusion, we can say that reversing the ROM opened up a lot of new possibilities. We learned a lot about how microcode works and about how to actually use it properly, instead of just inferring things from the really small dataset we have from the updates, or from the random bit combinations we sent to the CPU while observing what happened. But there is a lot left to do. So if you really want to hack on it, just get in contact; we are happy to share our findings with you. And as I said, the framework, Angry OS, the example programs that we implemented, and some other stuff like the wiring are available on GitHub. So that's that. And we are
happy to answer any questions you might
have.
applause
Herald Angel: Thank you very much. So we have 10 minutes for questions; please line up at the microphones. We start with this one: microphone number 2.
M2: Hi. Thanks for a nice talk. A few
questions about your hardware address
sanitizer.
Benjamin: Mhm
M2: As I understand you don't need the
source code instrumentation because the
microcode is responsible for checking the
shadow memory, right?
Benjamin: No... The original hardware sanitizer implementation is also based on a compiler extension that inserts the new instruction, because it doesn't exist normally. It also inserts bootstrap code that inits your shadow map and instruments your allocators to update the shadow map during runtime. We essentially need the same components, but we do not need the software address sanitizer component that inserts 10 or 20 x86 instructions before every memory access. So yes, we still need a compile-time component, and we are still source-code based in a sense.
Herald: And, so..
M2: And I didn't see, maybe I missed the numbers: how much faster is it than the initial version?
Benjamin: You mean the initial hardware sanitizer version or the software address sanitizer?
M2: I mean, let's say, the custom kernel address sanitizer for the Linux kernel, which is the usual one, and your approach.
Benjamin: We only performed a micro benchmark on Angry OS: we essentially took the instrumentation as emitted by the compiler for some memory access, which is your standard software address sanitizer, and compared it to our version using only the modified bound instruction. So I really can't talk about how it compares to KASAN or some real-world implementation, because we only have the prototype and the basic instrumentation.
M2: Thank you very much.
Herald Angel: OK. Microphone number 4
please.
M4: Hey, thanks for the talk. Did you find any weird microcode implementations? I don't mean security-wise, just something you really didn't expect to see implemented that way.
Benjamin: The problem is there's a lot of microcode to begin with. You have a few thousand triads, each of which has three operations. So
you have a lot of ground to cover, and we also have read-out errors. Sometimes you are seeing bit flips, which kind of slows you down, because you then always need to consider: OK, maybe this register is something else, maybe this address is wrong. And sometimes you have a dust particle that knocks out an entire region. So we only looked at the components we were pretty sure we recovered correctly, and we only looked at a really tiny subset compared to the whole microcode ROM. It's just not feasible to go through it and look at everything. So no, we didn't find anything funny, but we also wouldn't know what funny looks like, because we don't know what the official spec for microcode is.
M4: Thanks.
Herald Angel: Interesting. We have one
question from the Internet, from the
Signal Angel please.
Signal Angel: Yes. Which AMD CPU
generations does this apply to?
Benjamin: Yeah, this is still based on the work from our first talk, and it only works on pretty old ones: K8, K10. So CPUs produced until 2013; that was the last year AMD produced anything like that. Newer ones use some public-key-based
cryptography from what we can tell and we
haven't yet managed to break it. Same goes
for Intel, they seem to be using public
key cryptography and we haven't gotten a
foot in the door yet.
Herald Angel: Thank you. We go one around.
On microphone number 3 please.
M3: Yeah, thank you. I would like to know how complex the microcode programs you could write can be. So what's the complexity of new operations you could implement?
Benjamin: The only limiting factor is the size of your microcode update RAM, but this is really, really limited. For example on K8, where we performed the majority of our experiments, we are limited to 32 triads, which comes down to 96 instructions, and you also have some constraints on these instructions: for example, the next triad will always be executed no matter what, some operations can only go in the second slot, some can only go in another slot, so it's really, really hard. And, from our knowledge, you're also limited to loading 16-bit immediates instead of 32-bit or even 64-bit immediates. So your whole program grows really fast if you're trying to do something complex. For example, our authenticated microcode update mechanism is the most complex one we wrote; it nearly fills the RAM, and we used TEA (Tiny Encryption Algorithm) because that was the only one we managed to fit, mostly due to the S-boxes and other constants we would otherwise need to load. So it's really small.
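For reference, this is roughly why TEA fits: one block encryption is just 32 rounds of shifts, additions and XORs with a single constant, with no S-box tables to load. A plain Python version of the standard TEA reference algorithm (not our microcode) looks like this:

def tea_encrypt(v0, v1, key, rounds=32):
    # Standard TEA block encryption on two 32-bit halves with a 128-bit key.
    delta, mask = 0x9E3779B9, 0xFFFFFFFF
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(rounds):
        total = (total + delta) & mask
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & mask
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & mask
    return v0, v1

print([hex(x) for x in tea_encrypt(0x01234567, 0x89ABCDEF,
                                   (0xA56BABCD, 0x00000000, 0xFFFFFFFF, 0xABCDEF01))])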
Herald Angel: Thank you. Microphone number 1.
M1: So you said the microcode is used for instruction decoding, and it needs to emit the micro-ops to the scheduler and micro-op queue in some way. Did you find out how that works?
Benjamin: In essence, we are not actually executing code inside the microcode engine. From what we understand, the microcode is just some kind of software-based recipe that describes how to decode an instruction. So you don't actually get execution, you just commit instructions into the pipeline that do what you want. But because you have some control-flow capability inside the microcode engine, because you can branch to different addresses, you can conditionally branch and loop, you kind of get execution; but in essence you just commit stuff into the pipeline and the CPU does what you tell it to.
Herald Angel: One more question.
Microphone number 2, please.
M2: How did you take the picture of the
internal CPU? Did you open it?
Benjamin: Yeah. We worked together with
Chris. He's our hardware guy. He has
access to his equipment to delayer it and
to take high resolution optical shots and
he also takes shots with a scanning
electron microscope. So I think about five
or six CPUs were harmed in the making of
this paper.
Herald Angel: So we have one more last
question. Microphone number 2 please.
M2: Are you aware of research done by
Christopher Domas, where he mapped out the
instruction set for x86 processors?
Benjamin: You mean sandsifter? We actually talked with him, and yeah, we are aware that there's essentially a map of the instruction set, and maybe you can combine it with our work, because in the beginning we reverse engineered where certain x86 instructions are implemented in microcode. So if you plug these two together, you kind of map out the whole microcode ROM at the same time that you map out the whole instruction set. However, there are some components of the microcode ROM that are most likely not triggered by instructions: for example, it seems like power management, or everything that is behind a write MSR [wrmsr] or read MSR [rdmsr]. wrmsr is a single instruction, but depending on the arguments you give it, it just branches to totally different triads, and the functionality itself is implemented in microcode. And this is a huge chunk you wouldn't even find without brute-forcing all combinations of arguments for all instructions, which is not really feasible.
Herald Angel: Thank you. Thank you
Benjamin.
applause
35c3 postroll music
subtitles created by c3subtitles.de
in the years 2019-2020. Join, and help us!