35C3 - Inside the AMD Microcode ROM

  • 0:03 - 0:07
    35C3 preroll music
  • 0:19 - 0:24
    Herald: So the next talk Benjamin Kollenda
    and Philipp Koppe - they will refresh our
  • 0:24 - 0:31
    memories because they already had a talk
    at 34C3 where they talked about the
  • 0:31 - 0:38
    microcode ROM and today they're gonna give us
    more insights on how microcode works. And
  • 0:38 - 0:44
    more details on the ROM itself. Benjamin
    is a PhD student and has a focus on
  • 0:44 - 0:51
    software attacks and defenses and together
    with Philipp they will now abuse AMD
  • 0:51 - 0:55
    microcode for fun and security. Please
    enjoy.
  • 0:55 - 0:59
    Applause
  • 1:01 - 1:06
    Benjamin: Thank you. So as mentioned we
    were able to reverse engineer the AMD
  • 1:06 - 1:12
    microcode and the AMD microcode ROM and
    I'm going to talk about our journey. What
  • 1:12 - 1:16
    we learned on the way and how we did it.
    So this is joint work with my colleagues at
  • 1:16 - 1:21
    Ruhr-Universität Bochum. And a quick outline
    of how we are going to do it: we're going to
  • 1:21 - 1:25
    start with a quick crash course on
    microarchitectural basics and what microcode
  • 1:25 - 1:28
    actually is. Then I talk about how we
    reconstructed the
  • 1:28 - 1:30
    microcode ROM and what we learned
  • 1:30 - 1:35
    along the way. Then I quickly give some
    examples of the applications we
  • 1:35 - 1:41
    implemented with the knowledge we gained
    from the second step. And lastly I talk about
  • 1:41 - 1:48
    a framework we used. How it works and what
    we can do with it. And also this framework
  • 1:48 - 1:52
    is available on GitHub along with some
    other tools so you're free to continue our
  • 1:52 - 1:57
    work. OK. So when I'm talking about
    microcode you can think of it essentially
  • 1:57 - 2:02
    as a firmware for your processor. It
    handles multiple purposes for example
  • 2:02 - 2:06
    you can use it to fix CPU bugs that you
    have in silicon and you want to fix later
  • 2:06 - 2:12
    in the design phase. It is used for
    instruction decoding - I cover this one a
  • 2:12 - 2:18
    bit more. It is also used for exception
    handling. For example, if an exception or
  • 2:18 - 2:22
    interrupt is raised, microcode has a first
    chance of modifying this interrupt
  • 2:22 - 2:27
    ignoring it or just passing it along to
    the operating system. It's also used for
  • 2:27 - 2:32
    power management and some other complex
    features like Intel SGX. And most
  • 2:32 - 2:37
    importantly for us microcode is updatable.
    This is used to patch errors in the field.
  • 2:37 - 2:41
    Everyone remembers Spectre / Meltdown
    patches and there's
  • 2:41 - 2:44
    a microcode update. So your
  • 2:44 - 2:51
    x86 CPU takes multiple steps to execute an
    instruction. The first step is decoding
  • 2:51 - 2:55
    an x86 instruction into multiple smaller
    micro ops.
  • 2:55 - 2:57
    These are then scheduled into the pipeline.
  • 2:57 - 3:02
    From there, they are dispatched to
    the different functional units
  • 3:02 - 3:04
    like your ALU, AGU,
  • 3:04 - 3:06
    multiplication and division units.
  • 3:06 - 3:08
    For our purposes the decode step is the
  • 3:08 - 3:12
    most interesting one. In the decode step
    you have an instruction buffer that feeds
  • 3:12 - 3:17
    instructions to some decoders. You have
    short decoders that handle really simple
  • 3:17 - 3:21
    instructions. There are long decoders that
    can handle some more advanced instructions.
  • 3:21 - 3:25
    And finally, the vector decoder. The
    vector decoder handles the most complex
  • 3:25 - 3:30
    instructions with the help of microcode.
    So the microcode engine is essentially the
  • 3:30 - 3:31
    vector decoder.
  • 3:32 - 3:37
    The microcode engine in essence
    is composed of a microcode
  • 3:37 - 3:41
    ROM that stores the instructions for the
    microcode engine. Think of it as your
  • 3:41 - 3:48
    standard instructions. Then there is also
    a writeable memory the microcode RAM. This
  • 3:48 - 3:53
    is where the microcode updates end up when
    you apply microcode updates. And of course
  • 3:53 - 3:57
    around the storage has a whole lot of
    things that make it actually run. For this
  • 3:57 - 4:01
    talk, you only need to know what the
    Match Registers are. Match Registers are
  • 4:01 - 4:06
    essentially breakpoint registers. So if we
    write an address from inside the microcode
  • 4:06 - 4:11
    ROM into a Match Register, then whenever this
    address is fetched, execution control is
  • 4:11 - 4:18
    transferred to the microcode RAM so our
    patch gets executed. And the microcode
  • 4:18 - 4:23
    updates are usually loaded by the BIOS or
    by the kernel. Linux has an update driver,
  • 4:23 - 4:28
    sometimes the BIOS updates it with a
    pre-installed version. The updates have a
  • 4:28 - 4:32
    pretty simple structure: a partially
    documented header, followed by the
  • 4:32 - 4:38
    actual microcode that is loaded inside the
    CPU. And so microcode is organized in
  • 4:38 - 4:43
    something called triads. Each triad has
    three operations, essentially x86
  • 4:43 - 4:48
    instructions, but with some differences.
    And lastly, you have a sequence word. The
  • 4:48 - 4:52
    sequence word indicates which microcode
    instructions should be executed next. We
  • 4:52 - 4:58
    have options of executing just the next
    triad, executing another one by branching
  • 4:58 - 5:02
    to it, or just saying OK, I'm done with
    decoding this instruction continue with
  • 5:02 - 5:07
    x86 code. These updates are protected by
    some weak authentication which we were
  • 5:07 - 5:13
    able to break so we can create our own. We
    can analyze existing ones and we can apply
  • 5:13 - 5:21
    these to your standard laptop and desktop.
    However there can only ever be one update
  • 5:21 - 5:27
    loaded at a time and when you reboot
    your machine this update will be gone.
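The match-register mechanism described above can be pictured with a small toy model. This is only a sketch under stated assumptions: the class name, the dictionary-based ROM and RAM, and the string triad labels are all hypothetical and say nothing about the real hardware layout.

```python
from dataclasses import dataclass, field


@dataclass
class MicrocodeEngine:
    """Toy model of a microcode ROM, a patch RAM and match registers."""
    rom: dict                                   # ROM address -> triad label
    ram: dict = field(default_factory=dict)     # patch RAM slot -> triad label
    match: dict = field(default_factory=dict)   # ROM address -> patch RAM slot

    def load_update(self, patches, match):
        # Only one update is loaded at a time, so a new one replaces the old.
        self.ram = dict(patches)
        self.match = dict(match)

    def reboot(self):
        # After a reboot the update is gone; the ROM itself never changes.
        self.ram.clear()
        self.match.clear()

    def fetch(self, addr):
        # A match register works like a breakpoint on a ROM address:
        # when that address is fetched, control goes to the patch RAM.
        if addr in self.match:
            return self.ram[self.match[addr]]
        return self.rom[addr]


engine = MicrocodeEngine(rom={0x10: "rom_triad", 0x11: "other_rom_triad"})
before = engine.fetch(0x10)
engine.load_update(patches={0: "patched_triad"}, match={0x10: 0})
during = engine.fetch(0x10)
engine.reboot()
after = engine.fetch(0x10)
```

Here `before` and `after` come from the ROM, while `during` comes from the patch RAM because address `0x10` was placed in a match register.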
  • 5:28 - 5:33
    Also for the talk we are going to look at
    some microcode and we will present this
  • 5:33 - 5:38
    microcode using a register transfer
    language. It is heavily based on x86. I'm
  • 5:38 - 5:43
    just going to cover the differences
    between these two. Most importantly the
  • 5:43 - 5:49
    microcode can have three operands for an
    instruction in comparison to x86 which
  • 5:49 - 5:54
    usually only has two. So you can specify a
    destination and two source operands.
  • 5:56 - 5:56
    Also,
  • 5:57 - 6:02
    microcode has certain bit flags that
    need to be set, and these we denote with
  • 6:02 - 6:07
    annotations. For example, ".C" means
    this instruction also updates the carry flag
  • 6:07 - 6:14
    based on the result. Then you have the
    instruction "jcc" which is a conditional
  • 6:14 - 6:20
    branch and the first operand denotes the
    condition upon which this branch is
  • 6:20 - 6:24
    taken. In this case branch if the carry
    flag is one and [the] second operand
  • 6:24 - 6:30
    indicates the offset to add to the
    instruction pointer. Then we also have
  • 6:30 - 6:36
    some sequence word annotations: "next",
    "complete", and "branch". Also it should
  • 6:36 - 6:40
    be noted that the internal microcode
    architecture is a load-store architecture.
  • 6:40 - 6:45
    You can't use memory operands in other
    instructions like you can on x86; you
  • 6:45 - 6:48
    always need to load and store memory
    explicitly.
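To make the notation concrete, here is a minimal sketch of a triad interpreter: three-operand operations, a `.C`-style carry update, a `jcc` on the carry flag, and the three sequence words. Everything about it is an assumption for illustration (the `Op` and `Triad` classes, the 32-bit width, `jcc` offsets being relative to the next triad, and a `jcc` seeing the carry set earlier in the same triad); it is not the encoding or the exact semantics of the real microcode.

```python
from dataclasses import dataclass

MASK = 0xFFFFFFFF  # 32-bit registers, chosen only for this example


@dataclass
class Op:
    name: str                # "add", "sub", "jcc" or "nop"
    dst: str = ""            # destination register
    a: object = 0            # first source (register or immediate); condition for jcc
    b: object = 0            # second source; branch offset for jcc
    set_carry: bool = False  # models the ".C" annotation


@dataclass
class Triad:
    ops: list                # three Op entries
    seq: str = "next"        # sequence word: "next", "complete" or "branch"
    target: int = 0          # branch target, used when seq == "branch"


def value(regs, x):
    # Register names are strings, immediates are plain integers.
    return regs[x] if isinstance(x, str) else x


def run(triads, regs):
    pc, carry = 0, 0
    while True:
        triad, taken = triads[pc], None
        for op in triad.ops:
            if op.name == "nop":
                continue
            if op.name == "jcc":
                # Only the "carry flag is one" condition is modeled here.
                if op.a == "CF" and carry == 1:
                    taken = pc + 1 + op.b
                continue
            if op.name not in ("add", "sub"):
                raise ValueError(f"unknown operation {op.name}")
            a, b = value(regs, op.a), value(regs, op.b)
            result = a + b if op.name == "add" else a - b
            if op.set_carry:
                carry = 1 if (result > MASK or result < 0) else 0
            regs[op.dst] = result & MASK
        if taken is not None:
            pc = taken
        elif triad.seq == "complete":
            return regs      # done decoding, continue with x86 code
        elif triad.seq == "branch":
            pc = triad.target
        else:
            pc += 1


program = [
    Triad([Op("add", "t1", "eax", "ebx", set_carry=True),
           Op("jcc", a="CF", b=1), Op("nop")]),
    Triad([Op("add", "ecx", 0, 1), Op("nop"), Op("nop")], seq="complete"),
    Triad([Op("add", "ecx", 0, 2), Op("nop"), Op("nop")], seq="complete"),
]
overflow = run(program, {"eax": 0xFFFFFFFF, "ebx": 1, "ecx": 0, "t1": 0})
no_overflow = run(program, {"eax": 1, "ebx": 1, "ecx": 0, "t1": 0})
```

In the example program, the addition that overflows sets the carry, so the `jcc` skips one triad and `ecx` ends up as 2; without overflow the next triad runs and `ecx` is 1.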
  • 6:49 - 6:52
    Now we are going to talk about
  • 6:52 - 6:59
    how we manage to recover the microcode
    ROM. The microcode ROM is baked into your
  • 6:59 - 7:07
    CPU, you can't change it anymore. It is
    defined in the silicon during the
  • 7:07 - 7:13
    fabrication process and in this picture
    you can see a die shot taken with an
  • 7:13 - 7:17
    electron microscope and this is one of
    three regions that contains the bits for
  • 7:17 - 7:23
    the microcode operations. And if you zoom
    in a bit more, each of these regions
  • 7:23 - 7:30
    consists of four arrays and these are
    further subdivided into blocks. Really
  • 7:30 - 7:35
    interesting is "Array 2" which is a bit
    smaller than the other ones but it has
  • 7:35 - 7:42
    some structures above it which are of a
    different visual layout. This is SRAM
  • 7:42 - 7:47
    which stores the microcode update. So this
    is runtime-reprogrammable memory that is
  • 7:47 - 7:54
    still pretty fast. So the microcode RAM is
    located right next to the microcode ROM
  • 7:54 - 7:58
    which also makes sense from a design
    standpoint.
  • 8:00 - 8:02
    Just an overview of how we
  • 8:02 - 8:07
    went about it. We
    started with pictures and then we used
  • 8:07 - 8:11
    some OCR-like process to transform them
    into bit strings which we can then further
  • 8:11 - 8:17
    process. These bitstrings were then
    arranged into triads. We could already
  • 8:17 - 8:22
    gather that we got individual triads
    right because there were data dependencies
  • 8:22 - 8:28
    all over the place, but between triads,
    there were no or very few data
  • 8:28 - 8:34
    dependencies, so the ordering of the
    triads was still wrong. This was a
  • 8:34 - 8:39
    major problem as we went ahead: what
    we had to reverse engineer was the
  • 8:39 - 8:44
    mapping from a certain physical address of a
    triad that we gathered from the ROM
  • 8:44 - 8:48
    readout to a virtual address that is used
    inside the microcode update or the
  • 8:48 - 8:54
    microcode ROM. But after reverse engineering
    this, you can just do a linear sweep
  • 8:54 - 8:59
    disassembly of the microcode ROM and
    arrive at human readable output. But this
  • 8:59 - 9:05
    recovery was a bit tricky because we
    required physical-virtual address pairs.
  • 9:05 - 9:10
    Gathering these is a bit harder:
    we worked through the
  • 9:10 - 9:14
    available updates, but we could only find
    two pairs of them. These pairs were
  • 9:14 - 9:19
    actually easy to find because every update
    replaces a certain triad inside your
  • 9:19 - 9:25
    microcode ROM and this triad is usually
    also placed in the microcode update. So by
  • 9:25 - 9:31
    matching the address this update replaces
    with the microcode ROM readout, you can just
  • 9:31 - 9:38
    get your two data points. But we had to
    get more data points so we generated these
  • 9:38 - 9:43
    mappings by matching semantics of triads
    in the microcode ROM readout and the
  • 9:43 - 9:48
    semantics when we force execution of a
    certain microcode address. To gather
  • 9:48 - 9:52
    the semantics of the read-out microcode,
    we implemented a simple microcode
  • 9:52 - 9:59
    simulator. Essentially it works on triad
    level, so you give it an input state and a
  • 9:59 - 10:03
    triad and it calculates the output state
    of it. Input and output state are
  • 10:03 - 10:08
    composed of the x86 state, which is
    your standard registers and also the
  • 10:08 - 10:12
    internal microcode registers. There are
    multiple temporary registers that get
  • 10:12 - 10:18
    reset for every new x86 instruction that
    is executed, but they can also be modified
  • 10:18 - 10:24
    by microcode of course. Our emulator
    supports all known arithmetic operations
  • 10:24 - 10:29
    and we have a white-list of operations
    that do not form or produce any observable
  • 10:29 - 10:33
    change in state just so that we could
    process more triads and give them more
  • 10:33 - 10:41
    data points. In total we gathered 54
    additional data-address pairs which turned
  • 10:41 - 10:47
    out to be enough to recover the whole
    mapping. This mapping, essentially you
  • 10:47 - 10:51
    have the four different arrays that map to
    individual blocks and these blocks in
  • 10:51 - 10:57
    these arrays are then again permuted a bit
    and then the triads inside these blocks
  • 10:57 - 11:02
    have some table-based permutations. So
    this is not an obfuscation. This is just
  • 11:02 - 11:08
    from a hardware design standpoint it can
    make sense to reroute it a bit differently
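The semantic matching used to collect extra address pairs can be sketched roughly as follows. This is a simplified illustration, not the actual tooling: the tuple-based triad format, the three supported operations, and the names `simulate` and `match_pairs` are all hypothetical, and the "observed" states stand in for what would be measured on a real CPU when forcing execution of a microcode address.

```python
MASK = 0xFFFFFFFF  # 32-bit registers, chosen only for this example


def simulate(triad, state):
    """Apply a triad, given as (op, dst, src1, src2) tuples, to a copy of state."""
    out = dict(state)
    for op, dst, a, b in triad:
        x, y = out[a], out[b]
        if op == "add":
            out[dst] = (x + y) & MASK
        elif op == "xor":
            out[dst] = x ^ y
        elif op == "and":
            out[dst] = x & y
        else:
            raise ValueError(f"unsupported operation {op}")
    return out


def match_pairs(rom_readout, observations, input_state):
    """Return {physical index: virtual address} for unambiguous matches.

    rom_readout:  physical index -> triad from the ROM readout
    observations: virtual address -> output state observed on the CPU
    """
    pairs = {}
    for vaddr, observed in observations.items():
        candidates = [phys for phys, triad in rom_readout.items()
                      if simulate(triad, input_state) == observed]
        # Keep only matches where exactly one triad explains the observation.
        if len(candidates) == 1:
            pairs[candidates[0]] = vaddr
    return pairs


input_state = {"eax": 5, "ebx": 3}
rom_readout = {
    0: [("add", "eax", "eax", "ebx")],
    1: [("xor", "eax", "eax", "ebx")],
    2: [("add", "eax", "eax", "ebx")],  # same semantics as index 0
}
observations = {
    0x100: {"eax": 6, "ebx": 3},  # matches only the xor triad
    0x200: {"eax": 8, "ebx": 3},  # matches two triads, so it is ambiguous
}
pairs = match_pairs(rom_readout, observations, input_state)
```

Only the unambiguous observation yields a data point here; ambiguous ones would need a different input state or more operations in the simulator to tell the candidates apart.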
  • 11:09 - 11:15
    Also now that we can actually
    map a certain address to the microcode ROM
  • 11:15 - 11:19
    readout and we know the addresses of
    different x86 instructions from our
  • 11:19 - 11:24
    earlier experiments, we can look at the
    implementation of instructions. So let's
  • 11:24 - 11:29
    start with a pretty simple one. Shift-
    Right-Double which essentially takes a
  • 11:29 - 11:33
    register, shifts it by a given amount and
    shifts in bits from another register. So
  • 11:33 - 11:38
    of course you would expect a lot of shifts
    and rolls in its implementation and this
  • 11:38 - 11:45
    is exactly what we're seeing here. You
    have two shift-right operations and you can
  • 11:45 - 11:51
    see regmd6 and regmd4. These are
    placeholders. The microcode engine can
  • 11:51 - 11:56
    replace certain bit combinations with the
    registers that are used in the x86
  • 11:56 - 12:02
    operation. For example this one would be
    replaced by ECX or EAX depending on what
  • 12:02 - 12:08
    you wrote in x86. And at this point we can
    also already gather more information about
  • 12:08 - 12:14
    microcode than we previously knew because
    we know "OK, so this is source, this is
  • 12:14 - 12:19
    also a source and this is a destination".
    But this source which indicates the shift
  • 12:19 - 12:23
    amount, this one was previously unknown,
    because it is a high temporary microcode
  • 12:23 - 12:28
    register and we found out that these
    usually serve specific, differing
  • 12:28 - 12:32
    purposes. If you write to
    them, sometimes the CPU behaves
  • 12:32 - 12:36
    erratically, sometimes it crashes,
    sometimes nothing happens. But in this
  • 12:36 - 12:40
    case, this seems to be the shift count,
    and the shift count is given by a third
  • 12:40 - 12:45
    operand in the instruction. So in this
    case, we already learned "OK, if you want
  • 12:45 - 12:51
    to read the third operand of an
    instruction, we need to read t41". And
  • 12:51 - 12:56
    this is how we went about recovering more
    and more information about microcode. The
  • 12:56 - 13:00
    rest of the implementation is essentially
    concerned with implementing the rest of
  • 13:00 - 13:06
    the semantics of the x86 instruction and
    updating the flags correctly. OK, so now
  • 13:06 - 13:12
    let's look at an instruction that is a
    bit more complicated. Let's check out
  • 13:12 - 13:20
    rdtsc. rdtsc returns an internal cycle
    counter in EDX and EAX, so the upper part
  • 13:20 - 13:26
    ends up in EDX, lower part in EAX. So in
    the end we want to see writes to these
  • 13:26 - 13:31
    registers, potentially with a shift
    somewhere in there. But somewhere the CPU
  • 13:31 - 13:38
    needs to gather the cycle counter. So in
    the beginning we have two load-style
  • 13:38 - 13:41
    operations. This one is a proper load
    which we identified and this one is
  • 13:41 - 13:49
    unknown. But despite that we do not know
    the instruction, we know the target
  • 13:49 - 13:53
    because the result of this instruction
    will end up in t9 and the result of this
  • 13:53 - 13:58
    instruction will end up in t10, so we can
    follow the uses of these two registers. So
  • 13:58 - 14:04
    for simplicity I'm going to start with t10
    and t10, which we later found out, this is
  • 14:04 - 14:10
    another register which essentially denotes
    a specific internal register. And if you
  • 14:10 - 14:15
    play around with these bits you notice
    that this combination encodes cr4. The x86
  • 14:15 - 14:23
    will just see cr4. You can also address
    cr1 and cr2. And if you look further, t10
  • 14:23 - 14:29
    is then ANDed with this bit mask and if
    you look in the manual you find out that
  • 14:29 - 14:35
    this bit in cr4 denotes the bit that
    determines whether rdtsc is
  • 14:35 - 14:40
    available from user space or not. So this
    is the check if this instruction should be
  • 14:40 - 14:48
    executed. So now let's just keep in mind
    that t9 holds some other loaded value from
  • 14:48 - 14:54
    some other internal register and we will
    come back to this one a bit later. For
  • 14:54 - 14:59
    now, let's follow execution. This triad is
    essentially a padding triad. It is a
  • 14:59 - 15:05
    common pattern we see. So let's look at
    where this branch takes us.
  • 15:06 - 15:07
    And this branch
  • 15:07 - 15:16
    takes us to a conditional branch
    triad. And if you look a bit up, this AND
  • 15:16 - 15:22
    instruction actually updated this flag. So
    this is a conditional branch that
  • 15:22 - 15:26
    determines whether this check was
    successful or not. So it branches toward
  • 15:26 - 15:33
    the error triad or the success triad. But
    here we already see the exit. We see a
  • 15:33 - 15:41
    write to RDX or EDX in this case with a
    shift from t9 by 32 bit, which is exactly
  • 15:41 - 15:46
    what you would expect when writing
    the upper 32 bits of the
  • 15:46 - 15:51
    time stamp counter to edx. And you have an
    unknown instruction, but we know, okay, we
  • 15:51 - 15:58
    move something from t9 to eax, which is
    the lower 32 bits. But we're not done
  • 15:58 - 16:03
    here, because we can still look at the
    error path that is taken if the access is
  • 16:03 - 16:09
    denied. So if you scroll a bit down we can
    see a move of an immediate into a certain
  • 16:09 - 16:15
    internal register. And this immediate
    actually encodes a general protection
  • 16:15 - 16:22
    fault interrupt code. D denotes to the
    exception handler that this was a general
  • 16:22 - 16:29
    protection fault. And later this triad
    branches to this address, and if you look
  • 16:29 - 16:34
    at the uses of this address we can find
    other immediates that also correspond
  • 16:34 - 16:37
    to x86 instructions. So now we learned
  • 16:37 - 16:40
    how we can actually raise our
    own interrupts. We
  • 16:40 - 16:46
    just need to load the code we want into
    the specific register and branch to this
  • 16:46 - 16:53
    address. And now we learned a lot about
    how we can actually write microcode, but
  • 16:53 - 16:57
    it's also interesting to see how certain
    instructions are implemented. So let's
  • 16:57 - 17:04
    look at a pretty complicated one: wrmsr
    (Write MSR). wrmsr essentially writes some
  • 17:04 - 17:08
    data it is given to a machine specific
    register. This machine specific register
  • 17:08 - 17:13
    differs between CPUs, between vendors,
    sometimes between revisions. And these
  • 17:13 - 17:18
    implement non-standard extensions or
    pretty complex features. For example, you
  • 17:18 - 17:24
    trigger a microcode update by writing to a
    machine specific register. The register
  • 17:24 - 17:31
    address you want to write to is given in
    ecx. And now we can see ecx is read and
  • 17:31 - 17:40
    it is shifted by sixteen bits to t10. So
    again, we follow uses of t10 and we see
  • 17:40 - 17:46
    it is XOR'd with a certain bitmask. And
    this bitmask is C000, which actually
  • 17:46 - 17:52
    denotes a namespace of the model specific
    registers. In this case this should be an
  • 17:52 - 17:58
    AMD-specific namespace. And, of course,
    this one again sets some flags, and you
  • 17:58 - 18:04
    can see your conditional branch depending
    on these flags to what should be the
  • 18:04 - 18:06
    handler for this namespace.
  • 18:07 - 18:11
    Next one: We have another XOR
    that uses a different bit
  • 18:11 - 18:17
    mask — in this case C001. C001 is the
    namespace where the microcode update
  • 18:17 - 18:25
    routine is actually located in. So again,
    we branch to this handler. And if you just
  • 18:25 - 18:31
    continue on, there are more operations on
    rcx, followed by more branches, and this
  • 18:31 - 18:36
    continues until everything is dispatched
    to the correct handler. And this is how,
  • 18:36 - 18:40
    internally, wrmsr is implemented, and also
    Read MSR is going to be implemented pretty
  • 18:40 - 18:44
    similarly, because it implements some kind
    of similar thing.
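The namespace dispatch just described for wrmsr can be written down as a short sketch. It only mirrors the steps named in the talk (shift ecx right by sixteen bits, XOR with 0xC000 and then 0xC001, branch when the result is zero); the function name and the handler labels are hypothetical, and the real microcode continues with further checks that are not modeled here.

```python
def dispatch_wrmsr(ecx):
    """Pick a handler from the upper 16 bits of the MSR address in ecx."""
    t10 = (ecx & 0xFFFFFFFF) >> 16
    # An XOR that yields zero means the namespace matched, which is what
    # the conditional branch on the flags tests.
    if (t10 ^ 0xC000) == 0:
        return "handler_c000"
    if (t10 ^ 0xC001) == 0:
        return "handler_c001"
    return "further_dispatch"


# 0xC0010020 is the MSR that Linux names MSR_AMD64_PATCH_LOADER,
# 0xC0000080 is EFER, and 0x10 is the time stamp counter MSR.
patch_loader = dispatch_wrmsr(0xC0010020)
efer = dispatch_wrmsr(0xC0000080)
tsc = dispatch_wrmsr(0x00000010)
```

A write to the patch loader MSR lands in the C001 handler, which fits the remark that the microcode update routine lives in that namespace.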
  • 18:48 - 18:49
    OK, so now I showed you
  • 18:49 - 18:52
    how we actually went about
    reconstructing the knowledge we
  • 18:52 - 18:58
    currently have. And now I'm going to show
    you what we can actually do with it. And
  • 18:58 - 19:02
    for this I am going to quickly cover what
    applications we wrote in microcode. We
  • 19:02 - 19:05
    wrote a simple configurable
    rdtsc precision.
  • 19:05 - 19:08
    This means a certain bit mask is AND'd to
  • 19:08 - 19:12
    the result of rdtsc, so you can
    reduce the accuracy of it, which can
  • 19:12 - 19:18
    sometimes prevent timing attacks. We also
    implemented microcode-assisted address
  • 19:18 - 19:23
    sanitizer, which I'll cover quickly in a
    second. We also have some basic microcode
  • 19:23 - 19:29
    instruction set randomization. Some
    microcode-assisted instrumentation. What
  • 19:29 - 19:34
    this means is, you can write a filter for
    your instrumentation in microcode itself.
  • 19:34 - 19:38
    So instead of hooking an instruction,
    instead of debugging your code or
  • 19:38 - 19:42
    emulating it, you can just say whenever
    the instruction is executed filter if this
  • 19:42 - 19:47
    is relevant for me, and if it is, call my
    x86 handler — entirely in microcode,
  • 19:47 - 19:52
    without changing the instruction in the
    RAM. We also implemented some basic
  • 19:52 - 20:00
    authenticated microcode updates. The usual
    update mechanism is weak — that's how we
  • 20:00 - 20:05
    got our foot in the door in the first
    place. So we improved upon it a bit. Also
  • 20:05 - 20:10
    we found out that microcode actually has
    some enclave-like features because once
  • 20:10 - 20:14
    we're executing in microcode, your kernel
    can't interrupt you, your hypervisor can't
  • 20:14 - 20:19
    interrupt you, and any state you want
    visible to the outside world you actually
  • 20:19 - 20:23
    need to write explicitly. So all these
    microcode internal registers are not
  • 20:23 - 20:27
    accessible from the outside world. So any
    computation you perform in microcode
  • 20:27 - 20:30
    cannot be interfered with. So you can
    implement a simple enclave on top of this
  • 20:30 - 20:37
    one. So our hardware-assisted address
    sanitizer variant is based on the work by
  • 20:37 - 20:42
    the original authors and address sanitizer
    is a software instrumentation that detects
  • 20:42 - 20:47
    invalid memory accesses by using a shadow
    map, or shadow memory, that says which memory
  • 20:47 - 20:51
    is valid to be read and written to.
  • 20:51 - 20:54
    The authors proposed hardware
    address sanitizer
  • 20:54 - 20:59
    which is essentially doing the same checks
    but using a new instruction. And the
  • 20:59 - 21:04
    instruction should raise a fault if an
    invalid access is detected. For the algorithm
  • 21:04 - 21:08
    they proposed, the details are not
    important. What is important is that, in
  • 21:08 - 21:12
    essence, it's pretty simple. You load from
    a certain address, perform some operations
  • 21:12 - 21:19
    on it, and if the shadow value flags the access after
    these operations, you just report a bug.
  • 21:19 - 21:25
    Advantages of hardware address sanitizer
    are for example you get better performance
  • 21:25 - 21:29
    out of it. Because you only have a single
    instruction maybe you can do some fancy
  • 21:29 - 21:34
    tricks inside your CPU that are faster
    than using x86 instructions, you get more
  • 21:34 - 21:39
    compact code and you have the possibility
    of one time configuration which is a bit
  • 21:39 - 21:45
    hard with software address sanitizer. We
    implemented our hardware address sanitizer
  • 21:45 - 21:49
    variant by replacing the bound instruction.
    Bound is an old instruction that is no
  • 21:49 - 21:55
    longer used by compilers because in fact
    it is slower to use bound instead of
  • 21:55 - 21:59
    performing the checks with multiple x86
    instructions. We changed the interface.
  • 21:59 - 22:04
    The first argument is the register which
    holds the address you want to access. And
  • 22:04 - 22:08
    the second argument holds the size you
    want this access to be.
  • 22:08 - 22:11
    So, 1 byte, 2 byte and so on.
  • 22:11 - 22:15
    This instruction is a no-op if the
    check succeeds. So if there is no bug it
  • 22:15 - 22:20
    just continues on like nothing happened.
    However if we detect an invalid access we
  • 22:20 - 22:25
    can take a configurable action, we can for
    example just raise your normal page fault
  • 22:25 - 22:30
    or we can raise a bound interrupt, which
    is a custom interrupt, that only denotes
  • 22:30 - 22:34
    this one or we can branch to an x86
    handler that either performs additional
  • 22:34 - 22:40
    checking, for example whitelisting, or it
    generates a pretty error report for you.
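As a rough picture of what such a check computes, here is a sketch in the style of the classic AddressSanitizer shadow-memory test, wrapped in the interface described above (address, access size, configurable action). It is an assumption-laden illustration and not the microcode from the talk: the function name, the dictionary used as shadow memory, the action labels, and treating unmapped granules as poisoned are all hypothetical choices.

```python
def checked_access(addr, size, shadow, action="bound_interrupt"):
    """Return "ok" if the access is valid, otherwise the configured action.

    shadow maps (addr >> 3) to a value for each 8-byte granule:
      0       all 8 bytes are addressable
      1..7    only the first k bytes are addressable
      < 0     the whole granule is poisoned
    """
    k = shadow.get(addr >> 3, -1)  # unmapped granules count as poisoned here
    last_byte = (addr & 7) + size - 1
    if k != 0 and last_byte >= k:
        return action
    return "ok"  # behaves like a no-op when the check succeeds


shadow = {
    0x1000 >> 3: 0,  # fully valid granule
    0x1008 >> 3: 4,  # only the first four bytes are valid
}
full = checked_access(0x1000, 8, shadow)
partial_ok = checked_access(0x1008, 4, shadow)
partial_bad = checked_access(0x100A, 4, shadow, action="page_fault")
unmapped = checked_access(0x2000, 1, shadow, action="x86_handler")
```

The `action` parameter stands in for the configurable reaction mentioned above: a page fault, a bound interrupt, or a branch to an x86 handler.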
  • 22:41 - 22:47
    Most importantly this is a single
    instruction. We also do not dirty any x86
  • 22:47 - 22:53
    registers. Normally there are some
    intermediate results. You need to store
  • 22:53 - 22:56
    these somewhere and this you usually do in
    the x86 registers. So you increase
  • 22:56 - 23:00
    register pressure. Maybe you cause
    spilling. So overall your performance gets
  • 23:00 - 23:07
    worse. We also found out that we are
    actually faster than doing the checking
  • 23:07 - 23:12
    using x86 instructions. So just by moving
    the implementation from x86 level to
  • 23:12 - 23:17
    microcode, which in some way is still kind
    of like software, we already improved the
  • 23:17 - 23:22
    performance. Also on top of this you get
    better cache utilization because you have
  • 23:22 - 23:27
    less instructions, there are less bytes in
    the cache, so we get fuller cache lines.
  • 23:27 - 23:32
    And also it is really easy to tell which
    is testing code and which is your actual
  • 23:32 - 23:40
    program code. Lastly I'm going to show you
    just a rough overview of our framework
  • 23:40 - 23:46
    which we used during our development and
    which you can also find on GitHub. Early
  • 23:46 - 23:50
    on we found out that we are probably going
    to need to test a lot of microcode
  • 23:50 - 23:56
    updates, because in the beginning you just
    throw everything at the CPU and see how it
  • 23:56 - 24:01
    behaves and we wanted to do this in
    parallel. So we developed a small custom
  • 24:01 - 24:07
    OS called "Angry OS" and deployed it to
    mainboards. These mainboards are just old
  • 24:07 - 24:13
    AMD mainboards. All these mainboards were
    hooked up via serial for communication and
  • 24:13 - 24:19
    GPIO to a Raspberry Pi. With the GPIO you
    can reset, support power on, power down
  • 24:19 - 24:24
    and just have remote control of this
    mainboard and then you can connect to that
  • 24:24 - 24:29
    Raspberry Pi from anywhere on earth and
    just deploy and play around with it.
  • 24:29 - 24:31
    This was the first version.
  • 24:31 - 24:34
    In the beginning we
    didn't really know much about electronics
  • 24:34 - 24:39
    so we used one Raspberry Pi per mainboard.
    And it turns out Raspberry Pis are more
  • 24:39 - 24:44
    expensive than these old mainboards, but
    we improved upon this and now we're down
  • 24:44 - 24:48
    to one Raspberry Pi for
    four / five setups.
  • 24:48 - 24:52
    For example you only need 3 GPIO ports per
  • 24:52 - 24:57
    mainboard. You connect each of these to
    optocouplers just to separate the voltage
  • 24:57 - 25:02
    levels and then you connect one side of
    the optocoupler to the GPIO the other side
  • 25:02 - 25:06
    to your reset pin, to your power pin and
    for input to know whether your board is up
  • 25:06 - 25:11
    or down you connect the power LED. And
    that way you can save a lot of space, a
  • 25:11 - 25:17
    lot of money. And also if you're really
    constrained you can just remove the power
  • 25:17 - 25:24
    LED sensing because usually you know
    which state your setup is in. As I
  • 25:24 - 25:28
    already said we wrote our custom operating
    system and it is intentionally really
  • 25:28 - 25:33
    really minimal because the major feature
    we wanted is control over every
  • 25:33 - 25:37
    instruction that's going to be executed
    from a certain point on, because we're
  • 25:37 - 25:41
    playing around with instruction encoding
    and if we execute an instruction that we
  • 25:41 - 25:46
    did not intend we might crash the CPU, we
    might go into an invalid state and we do
  • 25:46 - 25:51
    not even know which instruction caused it.
    And Angry OS essentially only listens on
  • 25:51 - 26:00
    the serial port for something to do. What
    it can do is apply an update. These
  • 26:00 - 26:05
    updates are just microcode updates. They
    are streamed via serial. We can also
  • 26:05 - 26:10
    stream x86 code which is then run by Angry
    OS and this is just so that we do not need
  • 26:10 - 26:14
    to reflash the USB stick every time we
    want to update our testing code. The
  • 26:14 - 26:19
    results and all the errors are reported back
    to the Raspberry Pi and thus they are
  • 26:19 - 26:27
    forwarded to us. The framework we use most
    importantly has the microcode assembler
  • 26:27 - 26:31
    and a pretty verbose disassembler. This
    disassembler generates the output I showed
  • 26:31 - 26:37
    you earlier and using this you can just
    quickly write your own microcode. We also
  • 26:37 - 26:42
    included an x86 assembler because we
    wanted to rapidly test different x86
  • 26:42 - 26:48
    testing codes. Using this framework we
    were able to disassemble the existing
  • 26:48 - 26:54
    updates and we also used it to disassemble
    our ROM after we reordered it and also
  • 26:54 - 27:01
    during the process when we fed it to our
    emulator. And we can also create the
  • 27:01 - 27:08
    proper binary files that can be loaded by
    the Linux kernel driver. We modified the
  • 27:08 - 27:13
    stock one to just load any update you give
    it without checking if it's the correct
  • 27:13 - 27:20
    CPU ID and all these things just for
    testing purposes. It's also available. And
  • 27:20 - 27:26
    also of course the framework can control
    Angry OS to make your testing easier. And
  • 27:26 - 27:30
    we implemented a pretty basic remote
    execution wrapper, so you can work on a
  • 27:30 - 27:33
    remote Raspberry Pi as if you were using
    it locally.
  • 27:35 - 27:37
    And this brings me to the end
  • 27:37 - 27:41
    of the talk. And in conclusion we can say
    reversing the ROM opened up a lot of new
  • 27:41 - 27:45
    possibilities. We learned a lot about how
    microcode works. We learned about how to
  • 27:45 - 27:50
    actually use it properly instead of just
    inferring from a really small dataset,
  • 27:50 - 27:55
    that we have from the updates, or from the
    random bit strings we send to the CPU and
  • 27:55 - 28:00
    observe what happened. But there's a lot
    left to do. So if you really want to hack
  • 28:00 - 28:04
    on it, just get in contact, we are happy
    to share our findings with you. And as I
  • 28:04 - 28:09
    said, the framework, Angry OS, example
    programs, that we implemented, and some
  • 28:09 - 28:14
    other stuff like the wiring is available
    on GitHub. So that's that. And we are
  • 28:14 - 28:17
    happy to answer any questions you might
    have.
  • 28:17 - 28:22
    applause
  • 28:25 - 28:28
    Herald Angel: Thank you very much. So we
  • 28:28 - 28:34
    have 10 minutes for questions please line
    up at the microphones. We start with this
  • 28:34 - 28:39
    one: microphone number 2.
    M2: Hi. Thanks for a nice talk. A few
  • 28:39 - 28:43
    questions about your hardware address
    sanitizer.
  • 28:43 - 28:50
    Benjamin: Mhm
    M2: As I understand you don't need the
  • 28:50 - 28:56
    source code instrumentation because the
    microcode is responsible for checking the
  • 28:56 - 29:03
    shadow memory, right?
    Benjamin: No... The original hardware
  • 29:03 - 29:08
    sanitizer implementation is also based on
    a compiler extension, that inserts a new
  • 29:08 - 29:12
    instruction because it doesn't exist
    usually. And it also inserts a bootstrap
  • 29:12 - 29:18
    code that inits your shadow map and
    also instruments your allocators to update
  • 29:18 - 29:23
    the shadow map during runtime and we
    essentially need the same component, but
  • 29:23 - 29:27
    we do not need the software address
    sanitizer component that essentially
  • 29:27 - 29:34
    inserts 10 or 20 x86 instructions before
    every memory access. So yes we still need
  • 29:34 - 29:38
    a compile time component and we are still
    source code based in a sense.
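    The check that the single inserted instruction performs can be sketched as an ASan-style shadow-memory lookup. This is a minimal Python model, not the microcode from the talk: the 8-to-1 granule mapping, the 0xFF poison value, and the names ShadowMemory, unpoison, poison and check are assumptions made for illustration.

```python
SHADOW_SCALE = 3   # assumption: one shadow byte describes 2**3 = 8 application bytes
POISONED = 0xFF    # assumption: marker for a fully inaccessible granule


class ShadowMemory:
    def __init__(self, size):
        # Bootstrap step: every granule starts poisoned until an allocator
        # marks it as valid.
        self.shadow = bytearray([POISONED]) * ((size + 7) >> SHADOW_SCALE)

    def unpoison(self, addr, length):
        """Allocator hook: mark [addr, addr + length) as addressable."""
        assert addr % 8 == 0, "allocations are assumed to be 8-byte aligned"
        full, rest = divmod(length, 8)
        start = addr >> SHADOW_SCALE
        for i in range(full):
            self.shadow[start + i] = 0          # 0 means all 8 bytes are valid
        if rest:
            self.shadow[start + full] = rest    # only the first `rest` bytes are valid

    def poison(self, addr, length):
        """Allocator hook: mark every granule touched by the range as invalid."""
        first = addr >> SHADOW_SCALE
        last = (addr + length + 7) >> SHADOW_SCALE
        for granule in range(first, last):
            self.shadow[granule] = POISONED

    def check(self, addr, size):
        """Stand-in for the single check instruction.

        Assumes the access does not cross an 8-byte granule boundary.
        """
        state = self.shadow[addr >> SHADOW_SCALE]
        if state == 0:
            return True
        if state == POISONED:
            return False
        return (addr & 7) + size <= state


if __name__ == "__main__":
    mem = ShadowMemory(64)
    mem.unpoison(16, 13)          # a 13-byte allocation at address 16
    print(mem.check(16, 8))       # access inside the allocation
    print(mem.check(29, 1))       # access one byte past its end
```

    In software-only instrumentation the body of check would be emitted as several x86 instructions before every memory access; the point made in the answer is that the same lookup can sit behind one instruction.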
  • 29:39 - 29:46
    Herald: And, so..
    M2: And I didn't see, maybe I missed the
  • 29:46 - 29:51
    numbers. How much it is faster than this
    initial version?
  • 29:51 - 29:56
    Benjamin: You mean the initial hardware
    sanitizer version or the software address
  • 29:56 - 30:00
    sanitizer?
    M2: I mean let's say custom kernel address
  • 30:00 - 30:05
    sanitizer for Linux kernel which is the
    usual one and your approach.
  • 30:05 - 30:10
    Benjamin: We only performed a micro
    benchmark on Angry OS and we essentially
  • 30:10 - 30:16
    took the instrumentation as emitted by the
    compiler for some memory access which is
  • 30:16 - 30:21
    your standard software address sanitizer
    and compared it to our version using only
  • 30:21 - 30:25
    the modified bound instruction. So I
    really can't talk about how it compares to
  • 30:25 - 30:29
    KASAN or something or some like real world
    implementation, because we only have the
  • 30:29 - 30:34
    prototype and the basic instrumentation.
    M2: Thank you very much.
  • 30:34 - 30:36
    Herald Angel: OK. Microphone number 4
    please.
  • 30:36 - 30:51
    M4: Hey thanks for the talk and did you
    find any weird microcode
  • 30:51 - 31:01
    implementations. I don't mean security
    wise, just like you rarely expected to
  • 31:01 - 31:07
    see it be implemented that way.
  • 31:09 - 31:12
    Benjamin: The problem is there's a lot of
  • 31:12 - 31:20
    microcode to begin with. You have f000
    triads. Each of which has 3 op-codes. So
  • 31:20 - 31:25
    you have a lot of ground to cover and also
    we have read-out errors. Sometimes you are
  • 31:25 - 31:29
    seeing bit flips, which kind of slows you
    down because you then need to always
  • 31:29 - 31:33
    consider: OK, maybe this register is
    something else, maybe this address is
  • 31:33 - 31:37
    wrong. And also sometimes you have a dust
    particle that kind of knocks out an
  • 31:37 - 31:43
    entire region. So we only looked at the
    components, we were pretty sure that we
  • 31:43 - 31:47
    recovered correctly, and we only looked
    at a really tiny subset compared to all of
  • 31:47 - 31:53
    the microcode ROM. It's just not feasible
    to do and to go through it and look at
  • 31:53 - 31:57
    everything. So no we didn't find anything
    funny but we also wouldn't know what funny
  • 31:57 - 32:01
    looks like because we don't know what the
    official spec for microcode is.
  • 32:01 - 32:04
    M4: Thanks.
    Herald Angel: Interesting. We have one
  • 32:04 - 32:06
    question from the Internet, from the
  • 32:06 - 32:10
    Signal Angel please.
    Signal Angel: Yes. Which AMD CPU
  • 32:10 - 32:16
    generations does this apply to?
    Benjamin: Yeah this is still based on the
  • 32:16 - 32:21
    work of our first talk and this only works
    on pretty old ones: K8, K10. So until,
  • 32:21 - 32:27
    CPUs produced until 2013. Yeah this was
    the last year AMD produced anything like
  • 32:27 - 32:33
    that. Newer ones use some public key based
    cryptography from what we can tell and we
  • 32:33 - 32:37
    haven't yet managed to break it. Same goes
    for Intel, they seem to be using public
  • 32:37 - 32:40
    key cryptography and we haven't gotten a
    foot in the door yet.
  • 32:41 - 32:45
    Herald Angel: Thank you. We go one around.
    On microphone number 3 please.
  • 32:45 - 32:51
    M3: Yeah. Thank you. I would like to know
    how complex could the microcode programs
  • 32:51 - 32:59
    be, that you could write. So what's the
    complexity of new operations you could
  • 32:59 - 33:03
    implement.
    Benjamin: The only limiting factor is the
  • 33:03 - 33:08
    size of your microcode update RAM. But
    this one is really really limited.
  • 33:08 - 33:13
    For example on K8, where we performed the
    majority of our experiments. We are
  • 33:13 - 33:19
    limited to 32 triads, which comes down to
    ninety six instructions and you also
  • 33:19 - 33:22
    have some constraints on these
    instructions for example the next triad
  • 33:22 - 33:28
    will always be executed no matter what.
    Some operations can only go at the second
  • 33:28 - 33:34
    slot. Some can only go on another slot, so
    it's really really hard. And you're also
  • 33:34 - 33:39
    limited from our knowledge to loading 16
    bit immediates instead of 32 bit or even
  • 33:39 - 33:44
    64 bit immediates. So your whole program
    grows really fast if you're trying to do
  • 33:44 - 33:49
    something complex. For example our
    authenticated microcode update mechanism
  • 33:49 - 33:54
    is the most complex one we wrote. It nearly
    fills out the RAM and we used TEA – Tiny
  • 33:54 - 33:59
    Encryption Algorithm – because that was
    the only one we managed to fit mostly due
  • 33:59 - 34:05
    to S-box and other constants we would need
    to load. So it's really small.
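    TEA suits a tight instruction and immediate budget because it uses no S-box and only one 32-bit constant. The following is a plain Python sketch of the standard 32-round algorithm; it is the textbook cipher, not the microcode implementation from the talk, and the helper split16 is a hypothetical illustration of building that constant from two 16-bit immediates.

```python
MASK = 0xFFFFFFFF      # keep all arithmetic in 32 bits
DELTA = 0x9E3779B9     # the single magic constant TEA needs
ROUNDS = 32


def split16(value):
    """Split a 32-bit constant into (high, low) 16-bit halves.

    Hypothetical helper: models loading a 32-bit value with two 16-bit
    immediates.
    """
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def tea_encrypt(block, key):
    """Encrypt a (v0, v1) pair of 32-bit words with a 4-word key."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(ROUNDS):
        total = (total + DELTA) & MASK
        v0 = (v0 + ((((v1 << 4) & MASK) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK
        v1 = (v1 + ((((v0 << 4) & MASK) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK
    return v0, v1


def tea_decrypt(block, key):
    """Invert tea_encrypt by running the rounds backwards."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = (DELTA * ROUNDS) & MASK
    for _ in range(ROUNDS):
        v1 = (v1 - ((((v0 << 4) & MASK) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK
        v0 = (v0 - ((((v1 << 4) & MASK) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK
        total = (total - DELTA) & MASK
    return v0, v1


if __name__ == "__main__":
    key = (1, 2, 3, 4)
    plain = (0x01234567, 0x89ABCDEF)
    cipher = tea_encrypt(plain, key)
    print([hex(word) for word in cipher])
    print(tea_decrypt(cipher, key) == plain)
    print([hex(half) for half in split16(DELTA)])
```

    Each round uses only shifts, adds and XORs on two words, which is why the round body stays small even when every 32-bit constant costs two immediate loads.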
  • 34:05 - 34:09
    Herald Angel: Thank you. Microphone number
    1.
  • 34:09 - 34:15
    M1: So you said the microcode is used for
    instruction decoding and it needs to emit
  • 34:15 - 34:19
    the micro-ops to the scheduler and micro
    queue in some way. Did you find out how
  • 34:19 - 34:28
    that works?
    Benjamin: In essence we are not actually
  • 34:28 - 34:34
    executing code inside the microcode engine.
    From what we understand, the
  • 34:34 - 34:39
    microcode engine is just some kind of a
    software based recipe, that describes how
  • 34:39 - 34:43
    to decode an instruction, so you don't
    actually get execution, you just commit
  • 34:43 - 34:47
    instructions into the pipelines, that do
    what you want. And because we have some
  • 34:47 - 34:51
    control flow possibility, that is actually
    inside the micro code engine, because you
  • 34:51 - 34:55
    can branch to different addresses, you can
    conditionally branch and loop. You kind of
  • 34:55 - 34:59
    get an execution, but in essence you just
    commit stuff in the pipeline and the CPU
  • 34:59 - 35:01
    does what you tell it to.
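    The idea of a recipe that only emits operations, with branching handled by the sequencer, can be illustrated with a toy model. Everything here is an assumption for illustration: the ROM layout, the micro-op names (mov, add, sub), the sequence-word kinds (next, branch, branch_nz, complete) and the functions execute and run are hypothetical and do not reflect the real AMD encoding.

```python
def execute(op, regs):
    """Stand-in for the pipeline: apply one emitted micro-op to the registers."""
    name, dst, src = op
    value = regs[src] if isinstance(src, str) else src
    if name == "mov":
        regs[dst] = value
    elif name == "add":
        regs[dst] += value
    elif name == "sub":
        regs[dst] -= value
    else:
        raise ValueError(f"unknown micro-op {name!r}")


def run(rom, entry, regs, limit=1000):
    """Walk triads from `entry`, emitting their ops until a 'complete' word."""
    trace = []
    addr = entry
    for _ in range(limit):
        ops, (kind, *args) = rom[addr]
        for op in ops:
            trace.append(op)      # the sequencer only emits the op ...
            execute(op, regs)     # ... the pipeline model does the actual work
        if kind == "complete":
            return trace
        if kind == "next":
            addr += 1
        elif kind == "branch":
            addr = args[0]
        elif kind == "branch_nz":
            reg, target = args
            addr = target if regs[reg] != 0 else addr + 1
        else:
            raise ValueError(f"unknown sequence word {kind!r}")
    raise RuntimeError("step limit reached")


# A toy recipe that sums n + (n - 1) + ... + 1 into acc.
ROM = {
    0: ([("mov", "acc", 0)], ("next",)),
    1: ([("add", "acc", "n"), ("sub", "n", 1)], ("branch_nz", "n", 1)),
    2: ([], ("complete",)),
}

if __name__ == "__main__":
    registers = {"n": 3, "acc": 0}
    emitted = run(ROM, 0, registers)
    print(registers["acc"], len(emitted))
```

    The loop lives entirely in the sequence words, while the arithmetic happens in whatever consumes the emitted operations, which mirrors the split described in the answer.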
  • 35:04 - 35:07
    Herald Angel: One more question.
    Microphone number 2, please.
  • 35:07 - 35:12
    M2: How did you take the picture of the
    internal CPU? Did you open it?
  • 35:12 - 35:15
    Benjamin: Yeah. We worked together with
  • 35:15 - 35:20
    Chris. He's our hardware guy. He has
    access to his equipment to delayer it and
  • 35:20 - 35:24
    to take high resolution optical shots and
    he also takes shots with a scanning
  • 35:24 - 35:29
    electron microscope. So I think about five
    or six CPUs were harmed in the making of
  • 35:29 - 35:30
    this paper.
  • 35:34 - 35:38
    Herald Angel: So we have one more last
    question. Microphone number 2 please.
  • 35:39 - 35:41
    M2: Are you aware of research done by
  • 35:41 - 35:49
    Christopher Domas, where he mapped out the
    instruction set for x86 processors?
  • 35:49 - 35:57
    Benjamin: You mean sandsifter? We
    actually talked with him and yeah we are
  • 35:57 - 36:03
    aware, that there's a map essentially of
    the instruction set and also maybe you can
  • 36:03 - 36:07
    combine it, because in the beginning we
    reverse engineered where certain x86
  • 36:07 - 36:11
    instructions are implemented in microcode.
    So if you plug these two together you kind
  • 36:11 - 36:15
    of map out the whole microcode ROM at the
    same time that you map out a whole
  • 36:15 - 36:19
    instruction set. However there are some
    components of the microcode ROM that are
  • 36:19 - 36:23
    most likely not triggered by instructions.
    For example it seems like power management
  • 36:23 - 36:27
    or everything that is behind a write MSR
    [wrmsr] or read MSR [rdmsr]. wrmsr is a
  • 36:27 - 36:31
    single instruction, but depending on the
    arguments you give it it just branches to
  • 36:31 - 36:36
    totally different triads and the microcode
    itself is implemented in microcode. And
  • 36:36 - 36:40
    this one is a huge chunk you wouldn't even
    find without brute forcing all
  • 36:40 - 36:44
    combinations for all instructions which is
    not really feasible.
  • 36:46 - 36:51
    Herald Angel: Thank you. Thank you
    Benjamin.
  • 36:51 - 36:57
    applause
  • 36:57 - 37:02
    35c3 postroll music
  • 37:02 - 37:21
    subtitles created by c3subtitles.de
    in the years 2019-2020. Join, and help us!
Title:
35C3 - Inside the AMD Microcode ROM
Video Language:
English
Duration:
37:21
