36C3 - How to Design Highly Reliable Digital Electronics

  • 0:00 - 0:18
    36C3 intro music
  • 0:18 - 0:23
    Herald: The next talk will be titled 'How
    to Design Highly Reliable Digital
  • 0:23 - 0:26
    Electronics', and it will be delivered to
    you by Szymon and Stefan. Warm applause
  • 0:26 - 0:30
    for them.
  • 0:30 - 0:36
    applause
  • 0:36 - 0:41
    Stefan: All right. Good morning, Congress.
    So perhaps every one of you in the room
  • 0:41 - 0:46
    here has at one point or another in their
    lives witnessed their computer behaving
  • 0:46 - 0:50
    weirdly and doing things that it was not
    supposed to do or what you didn't
  • 0:50 - 0:54
    anticipate it to do. And well, typically
    that would have probably been the result
  • 0:54 - 1:00
    of a software bug of some sort somewhere
    inside the huge software stack your PC is
  • 1:00 - 1:05
    running on. Have you ever considered what
    the probability of this weird behavior
  • 1:05 - 1:09
    being caused by a bit flipped somewhere in
    your memory of your computer might have
  • 1:09 - 1:16
    been? So what you can see in this video on
    the screen now is a physics experiment
  • 1:16 - 1:21
    called a cloud chamber. It's a very simple
    experiment that is actually able to
  • 1:21 - 1:27
    visualize and make apparent all the
    constant stream of background radiation we
  • 1:27 - 1:33
    all are constantly exposed to. So what's
    happening here is that highly energetic
  • 1:33 - 1:39
    particles, for example, from space they
    trace through gaseous alcohol and they
  • 1:39 - 1:42
    collide with alcohol molecules and they
    form in this process a trail of
  • 1:42 - 1:48
    condensation while they do that. And if
    you think about your computer, a typical
  • 1:48 - 1:53
    cell of RAM, of which you might have, I
    don't know, 4, 8, 10 gigabytes in your
  • 1:53 - 1:58
    machine is as big as only 80 nanometers
    wide. So it's very, very tiny. And you
  • 1:58 - 2:03
    probably are able to appreciate the small
    amount of energy that is needed or that is
  • 2:03 - 2:08
    used to store the information inside each
    of those bits. And the sheer amount of
  • 2:08 - 2:13
    those bits you have in your RAM and your
    computer. So a couple of years ago, there
  • 2:13 - 2:18
    was a study that concluded that in a
    computer with about four gigabytes of RAM,
  • 2:18 - 2:24
    a bit flip caused by such an event, by
    cosmic background radiation can occur
  • 2:24 - 2:29
    about once every 33 hours. So a
    bit less than one per day. In an
  • 2:29 - 2:35
    incident in 2008, a Qantas Airlines
    flight actually nearly crashed, and the
  • 2:35 - 2:40
    reason for this crash was traced back to
    be very likely caused by a bit flipped
  • 2:40 - 2:44
    somewhere in one of the CPUs of the
    avionics system and nearly caused the
  • 2:44 - 2:50
    death of a lot of passengers on this
    plane. In 2003, in Belgium, a small
  • 2:50 - 2:57
    municipal vote actually had a weird hiccup
    in which one of the candidates in this
  • 2:57 - 3:02
    election actually got 4096 more votes added in a single instance.
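A side note on the number itself: 4096 is exactly 2^12, which is what a single flipped bit at position 12 of a binary counter would add. A minimal Python sketch of that arithmetic follows; the starting count of 514 and the helper name `flip_bit` are made up for illustration and are not the real tally or any real voting software.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the bit at position `bit` inverted (XOR with a one-hot mask)."""
    return value ^ (1 << bit)


votes = 514                      # hypothetical vote count; bit 12 is 0 here
corrupted = flip_bit(votes, 12)  # a single upset in bit 12 of the counter

print(corrupted - votes)  # 4096
```

Because the flip is an XOR, applying it a second time restores the original value, and a counter whose bit 12 was already set would lose 4096 instead of gaining it.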
  • 3:02 - 3:06
    And that was traced back to be very likely
    caused by cosmic background radiation,
  • 3:06 - 3:10
    flipping a memory cell somewhere that
    stored the vote count. And it was only
  • 3:10 - 3:15
    discovered that this happened because this
    number of votes for this particular
  • 3:15 - 3:19
    candidate was considered unreasonable, but
    otherwise would have gotten away probably
  • 3:19 - 3:27
    without being detected. So a few words
    about us: Szymon and I, we both work at
  • 3:27 - 3:32
    CERN in the microelectronics section and
    we both develop electronics that need to
  • 3:32 - 3:37
    be tolerant to these sorts of effects. So
    we develop radiation tolerant electronics
  • 3:37 - 3:43
    for the experiments at CERN, at the LHC.
    Among a lot of other applications, you can
  • 3:43 - 3:48
    meet the two of us at the Lötlabor Jena
    assembly if you are interested in what we
  • 3:48 - 3:56
    are talking about today. And we will also
    give a small talk or a small workshop
  • 3:56 - 3:59
    about radiation detection tomorrow, in one
    of the seminar rooms. So feel free to pass
  • 3:59 - 4:03
    by there, it will be a quick introduction.
    To give you a small idea of what kind of
  • 4:03 - 4:09
    environment we are working for: So if you
    would use one of your default Intel i7
  • 4:09 - 4:14
    CPUs from your notebook and would put it
    anywhere where we operate our electronics,
  • 4:14 - 4:20
    it would very shortly die in a matter of
    probably one or two minutes and it would
  • 4:20 - 4:25
    die for more than just one reason, which
    is rather interesting and compelling. So
  • 4:25 - 4:31
    the idea for today's talk is to give you
    all an insight into all the things that
  • 4:31 - 4:35
    need to be taken into account when you
    design electronics for radiation
  • 4:35 - 4:39
    environments. What kinds of different
    challenges come when you try to do that.
  • 4:39 - 4:43
    We classify and explain the different
    types of radiation effects that exist. And
  • 4:43 - 4:48
    then we also present what you can do to
    mitigate these effects and also validate
  • 4:48 - 4:52
    that what you did to care for them or
    protect your circuits actually worked. And
  • 4:52 - 4:57
    of course, as we do that, we'll try to
    give our view on how we develop radiation
  • 4:57 - 5:03
    tolerant electronics at CERN and what our
    workflow looks like to make sure this
  • 5:03 - 5:08
    works. So let's first maybe take a step
    back and have a look at what we mean when
  • 5:08 - 5:13
    we say radiation environments. The first
    one that you probably have in mind right
  • 5:13 - 5:19
    now when you think about radiation is
    space. So, this interstellar space is
  • 5:19 - 5:24
    basically filled with, very high speed,
    highly energetic electrons and protons and
  • 5:24 - 5:29
    all sorts of high energy particles. And
    while they, for example, traverse close to
  • 5:29 - 5:35
    planets as our Earth - these planets
    sometimes do have a magnetic field and the
  • 5:35 - 5:39
    highly energetic particles are actually
    deflected by these magnetic fields and
  • 5:39 - 5:44
    they can protect the planets as our
    planet, for example, from this highly
  • 5:44 - 5:48
    energetic radiation. But in the process,
    there around these planets sometimes they
  • 5:48 - 5:52
    form these radiation belts - known as the
    Van Allen belts after the guy who
  • 5:52 - 5:56
    discovered this effect a long time ago.
    And a satellite in space as it orbits
  • 5:56 - 6:02
    around the Earth might, depending on what
    orbit is chosen, sometimes go through
  • 6:02 - 6:06
    these belts of highly intense radiation.
    That, of course, then needs to be taken
  • 6:06 - 6:12
    into account when designing electronics
    for such a satellite. And if Earth itself
  • 6:12 - 6:17
    is not able to give you enough radiation,
    you may think of the very famous Juno
  • 6:17 - 6:23
    Jupiter mission that has become famous
    about a year ago. They actually in the
  • 6:23 - 6:28
    environment of Jupiter they anticipated so
    much radiation that they actually decided
  • 6:28 - 6:33
    to put all the electronics of the
    satellite inside a one centimeter thick
  • 6:33 - 6:40
    cube of titanium, which is famously known
    as the Juno Radiation Vault. But not only
  • 6:40 - 6:44
    space offers radiation environments.
    Another form of radiation you probably all
  • 6:44 - 6:48
    recognize this when I show you this
    picture, which is an X-ray image of a
  • 6:48 - 6:55
    hand. And X-ray is also considered a form
    of radiation. And while, of course, the
  • 6:55 - 7:01
    doses or amounts of radiation any patient
    is exposed to while doing diagnosis or
  • 7:01 - 7:06
    treatment of some disease, that might not
    be the full story when it comes to medical
  • 7:06 - 7:10
    applications. So this is a medical
    particle accelerator which is used for
  • 7:10 - 7:15
    cancer treatment. And in these sorts of
    accelerators, typically carbon ions or
  • 7:15 - 7:20
    protons are accelerated and then focused
    and used to treat and selectively destroy
  • 7:20 - 7:25
    cancer cells in the body. And this comes
    already relatively close to the
  • 7:25 - 7:30
    environment we are working in and working
    for. So Szymon and I are working, for
  • 7:30 - 7:37
    example, on electronics, for the CMS
    detector inside the LHC, for which we build
  • 7:37 - 7:44
    dedicated, radiation tolerant, integrated
    circuits which have to withstand very,
  • 7:44 - 7:49
    very large amounts and doses of short
    lived radiation in order to function
  • 7:49 - 7:54
    correctly. And if we didn't specifically
    design electronics for that, basically the
  • 7:54 - 8:02
    whole system would never be able to work.
    To illustrate a bit how you can imagine
  • 8:02 - 8:06
    the scale of this environment: This is a
    single plot of a collision event that was
  • 8:06 - 8:11
    recorded in the ATLAS experiment. And each
    of those tiny little traces you can make
  • 8:11 - 8:16
    out in this diagram is actually either one
    or multiple secondary particles that were
  • 8:16 - 8:22
    created in the initial collision of two
    proton bunches inside the experiment. And
  • 8:22 - 8:28
    each of those, of course, races through
    the detector electronics, which make these
  • 8:28 - 8:33
    traces visible, itself then decaying into
    multiple other secondary particles which
  • 8:33 - 8:38
    all go through our electronics. And if
    that doesn't sound, let's say, bad enough
  • 8:38 - 8:43
    for digital electronics, these collisions
    happen about 40 million times a second. Of
  • 8:43 - 8:48
    course, multiplying the number of events
    or problems they can cause in our
  • 8:48 - 8:55
    circuits. So we now want to introduce all
    the things that can happen, the different
  • 8:55 - 9:00
    radiation effects. But first, probably we
    take a step back and look at what we mean
  • 9:00 - 9:06
    when we say digital electronics or digital
    logic, which we want to focus on today. So
  • 9:06 - 9:11
    from your university lectures or your
    reading, you probably know the first class
  • 9:11 - 9:15
    of digital logic, which is the
    combinatorial logic. So this is typically
  • 9:15 - 9:19
    logic that just does a simple linear
    relation of the inputs of a circuit and
  • 9:19 - 9:24
    produces an output as exemplified with
    these AND and OR, NAND, XOR gates that you
  • 9:24 - 9:29
    see here. But if you want to build - I
    mean even though we use those everywhere
  • 9:29 - 9:33
    in our circuits - you probably also want
    to store state in a more complex circuit,
  • 9:33 - 9:38
    for example, in the registers of your CPU
    they store some sort of internal
  • 9:38 - 9:42
    information. And for that we use the other
    class of logic, which is called the
  • 9:42 - 9:45
    sequential logic. So this is typically
    clocked with some system clock frequency
  • 9:45 - 9:51
    and it changes its output with relation to
    the inputs whenever this clock signal changes.
  • 9:51 - 9:54
    And now if we look at how all
    these different logic functionalities are
  • 9:54 - 9:58
    implemented. So typically nowadays for
    that you may know that we use CMOS
  • 9:58 - 10:02
    technologies and basically represent all
    this logic functionality as digital gates
  • 10:02 - 10:11
    using small P-MOS and N-MOS MOSFET
    transistors in CMOS technologies. And if
  • 10:11 - 10:16
    we kind of try to build a model for more
    complex digital circuits, we typically use
  • 10:16 - 10:22
    something we call the finite state machine
    model, in which we use a model that
  • 10:22 - 10:26
    consists of a combinatorial and a
    sequential part. And you can see that the
  • 10:26 - 10:31
    output of the circuit depends both on the
    internal state inside the register as well
  • 10:31 - 10:35
    as also the input to the combinatorial
    logic. And accordingly, also the state
  • 10:35 - 10:41
    that is internal is always changed by the
    inputs as well as the current state. So
  • 10:41 - 10:45
    this is kind of the simple model for more
    complex systems that can be used to model
  • 10:45 - 10:50
    different effects. Um, now let's try to
    actually look at what the radiation can do
  • 10:50 - 10:54
    to transistors. And for that we are going
    to have a quick recap at what the
  • 10:54 - 10:58
    transistor actually is and how it looks
    like. As you may perhaps know, in
  • 10:58 - 11:04
    CMOS technologies, transistors are built
    on wafers of high purity silicon. So this
  • 11:04 - 11:09
    is a crystalline, very regularly organized
    lattice of silicon atoms. And what we do
  • 11:09 - 11:14
    to form a transistor on such a wafer is
    that we add dopants. So in order to form
  • 11:14 - 11:20
    diffusion regions, which later will become
    the source and drain of our transistors.
  • 11:20 - 11:24
    And then on top of that we grow a layer of
    insulating oxide. And on top of that we
  • 11:24 - 11:29
    put polysilicon, which forms the gate
    terminal of the transistor. And in the end
  • 11:29 - 11:33
    we end up with an equivalent circuit a bit
    like that. And now to put things back into
  • 11:33 - 11:38
    perspective - you may also note that the
    dimensions of these structures are very
  • 11:38 - 11:43
    tiny. So we talk about tens of nanometers
    for some of the dimensions I've outlined
  • 11:43 - 11:48
    here. And as the technologies shrink,
    these become smaller and smaller and
  • 11:48 - 11:52
    therefore you'll probably also realize or
    are able to appreciate the small amount of
  • 11:52 - 11:57
    energy that is used to store information
    inside these digital circuits, which makes
  • 11:57 - 12:02
    them perhaps more sensitive to radiation.
    So let's take a look. What different types
  • 12:02 - 12:08
    of radiation effects exist? We typically
    in this case, differentiate them into two
  • 12:08 - 12:13
    main classes of events. The first one
    would be the cumulative effects, which are
  • 12:13 - 12:17
    effects that, as the name implies,
    accumulate over time. So as the circuit is
  • 12:17 - 12:22
    placed inside some radiation environment,
    over time it accumulates more and more
  • 12:22 - 12:27
    dose and therefore worsens its performance
    or changes how it operates. And on the
  • 12:27 - 12:31
    other side, we have the Single Event
    Effects, which are always events that
  • 12:31 - 12:35
    happen at some instantaneous point in
    time, and then suddenly, without being
  • 12:35 - 12:39
    predictable, change how the circuit
    operates or how it functions or if it
  • 12:39 - 12:44
    works in the first place or not. So I'm
    going to first go into the class of
  • 12:44 - 12:48
    cumulative effects and then later on,
    Szymon will go into the other class of the
  • 12:48 - 12:53
    Single Event Effects. So in terms of these
    accumulating effects, we basically have
  • 12:53 - 12:58
    two main subclasses: The first one being
    ionization or TID effects, for Total
  • 12:58 - 13:02
    Ionizing Dose - and the second one being
    displacement damages. So displacement
  • 13:02 - 13:07
    damages do exactly what they sound like.
    It is all the effects that happen when an
  • 13:07 - 13:11
    atom in the silicon lattice is actually
    displaced, so removed from its lattice
  • 13:11 - 13:15
    position and actually changes the
    structure of the semiconductor. But
  • 13:15 - 13:20
    luckily, these effects don't have a big
    impact in the CMOS digital circuits that
  • 13:20 - 13:23
    we are looking at today. So we will
    disregard them for the moment and we'll be
  • 13:23 - 13:28
    looking more at the ionization damage, or
    TID. So ionization - as a quick recap - is
  • 13:28 - 13:36
    whenever electrons are removed or added to
    an atom, effectively transforming it into
  • 13:36 - 13:43
    an ion. And these effects are especially
    critical for the circuits we are building
  • 13:43 - 13:46
    because of what they do is that they
    change the behavior of the transistors.
  • 13:46 - 13:50
    And without looking too much into the
    semiconductor details, I just want to show
  • 13:50 - 13:56
    their typical effect that we are concerned
    about in this very simple circuit here. So
  • 13:56 - 14:00
    this is just an inverter circuit
    consisting of two transistors here and
  • 14:00 - 14:06
    there. And what the circuit does in normal
    operation is it just takes an input signal
  • 14:06 - 14:10
    and inverts and basically gives the
    inverted signal at the output. And as the
  • 14:10 - 14:16
    transistors are irradiated and accumulate
    dose, you can see that the edges of the
  • 14:16 - 14:20
    output signal get slower. So the
    transistor takes longer to turn on and off.
  • 14:20 - 14:25
    And what that does in turn is that it
    limits the maximum operation frequency of
  • 14:25 - 14:29
    your circuit. And of course, that is not
    something you want to do. You want your
  • 14:29 - 14:32
    circuit to operate at some frequency in
    your final system. And if the maximum
  • 14:32 - 14:36
    frequency it can work at degrades over
    time, at some point it will fail as the
  • 14:36 - 14:39
    maximum frequency is just too low. So
    let's have a look at what we can do to
  • 14:39 - 14:44
    mitigate these effects. The first one and
    I already mentioned it when talking about
  • 14:44 - 14:48
    the Juno mission, is shielding. So if you
    can actually put a box around your
  • 14:48 - 14:53
    electronics and shield any radiation from
    actually hitting your transistors, it is
  • 14:53 - 14:57
    obvious that they will last longer and
    will suffer less from the radiation damage
  • 14:57 - 15:01
    that it would otherwise do. So this
    approach is very often used in space
  • 15:01 - 15:05
    applications like on satellites, but it's
    not very useful if you are actually trying
  • 15:05 - 15:08
    to measure the radiation with your
    circuits as we do, for example, in the
  • 15:08 - 15:12
    particle accelerators we build integrated
    circuits for. So there first of all, we
  • 15:12 - 15:16
    want to measure the radiation so we cannot
    shield our detectors from the radiation.
  • 15:16 - 15:21
    And also, we don't want to influence the
    tracks of these secondary collision
  • 15:21 - 15:24
    products with any shielding material that
    would be in the way. So this is not very
  • 15:24 - 15:28
    useful in a particle accelerator
    environment, let's say. So we have to
  • 15:28 - 15:34
    resort to different methods. So as I said,
    we do have to design our own integrated
  • 15:34 - 15:39
    circuits in the first place. So we have
    some freedom in what we call transistor
  • 15:39 - 15:45
    level design. So we can actually alter the
    dimensions of the transistors. We can make
  • 15:45 - 15:50
    them larger to withstand larger doses of
    radiation and we can use special
  • 15:50 - 15:54
    techniques in terms of layout that we can
    experimentally verify to be more
  • 15:54 - 15:59
    resistant to radiation effects. And as a
    third measure, which is probably the most
  • 15:59 - 16:03
    important one for us, is what we call
    modeling. So we actually are able to
  • 16:03 - 16:08
    characterize all the effects that
    radiation will have on a transistor. And
  • 16:08 - 16:12
    if we can do that, if we will know: 'If I
    put it into a radiation environment for a
  • 16:12 - 16:17
    year, how much slower will it become?'
    Then it is of course easy to say: 'OK, I
  • 16:17 - 16:21
    can just over-design my circuit and make
    it a bit more simple, maybe have less functionality,
  • 16:21 - 16:24
    but be able to operate at a
    higher frequency and therefore withstand
  • 16:24 - 16:30
    the radiation effects for a longer time
    while still working sufficiently well at
  • 16:30 - 16:35
    the end of its expected lifetime.' So
    that's more or less what we can do about
  • 16:35 - 16:38
    these effects. And I'll hand over to
    Szymon for the second class.
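The over-design idea described above can be put into numbers with a very small sketch. It assumes a deliberately simple model in which irradiation makes every gate delay grow by the same fraction, so the maximum clock frequency scales as 1 / (1 + slowdown). The function names, the 40 MHz target and the 25 % slowdown are hypothetical values chosen for illustration, not measured data.

```python
def end_of_life_fmax(fresh_fmax_mhz: float, slowdown_fraction: float) -> float:
    """Maximum clock frequency after irradiation.

    Assumes all gate delays grow by `slowdown_fraction`
    (0.25 means the logic is 25 % slower at end of life).
    """
    return fresh_fmax_mhz / (1.0 + slowdown_fraction)


def required_fresh_fmax(target_mhz: float, slowdown_fraction: float) -> float:
    """Frequency the fresh circuit must reach to still meet `target_mhz` at end of life."""
    return target_mhz * (1.0 + slowdown_fraction)


TARGET_MHZ = 40.0   # hypothetical system clock
SLOWDOWN = 0.25     # hypothetical end-of-life delay increase

fresh = required_fresh_fmax(TARGET_MHZ, SLOWDOWN)
print(f"design for {fresh:.1f} MHz when fresh")
print(f"end of life: {end_of_life_fmax(fresh, SLOWDOWN):.1f} MHz")
```

In a real flow the slowdown would come from characterized transistor models rather than a single scaling factor; the sketch only shows why the margin has to be built in up front.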
  • 16:38 - 16:43
    Szymon: Contrary to the cumulative effects
    presented by Stefan, the other group are
  • 16:43 - 16:46
    Single Event Effects which are caused by
    high energy deposits, which are caused by
  • 16:46 - 16:52
    a single particle or shower of particles.
    And they can happen at any time, even
  • 16:52 - 16:57
    seconds after irradiation is started. It
    means that if your circuit is vulnerable
  • 16:57 - 17:02
    to this class of effects, it can fail
    immediately after radiation is present.
  • 17:02 - 17:06
    And here we also classify these effects
    into several groups. The first are hard,
  • 17:06 - 17:11
    or permanent, errors, which as the name
    indicates can permanently destroy your
  • 17:11 - 17:20
    circuit. And these types of errors are
    typically critical for power devices where
  • 17:20 - 17:24
    you have large power densities and they
    are not so much of a problem for digital
  • 17:24 - 17:30
    circuits. In the other class of effects
    are soft errors. And here we distinguish
  • 17:30 - 17:34
    transient, or Single Event Transient
    errors, which are spurious signals
  • 17:34 - 17:41
    propagating in your circuit as a result of
    a gate being hit by a particle and they
  • 17:41 - 17:46
    are especially problematic for analog
    circuits or asynchronous digital circuits,
  • 17:46 - 17:51
    but under some circumstances they can be
    also problematic for synchronous systems.
  • 17:51 - 17:56
    And the other class of problems are
    static, or Single Event Upset problems,
  • 17:56 - 18:01
    which basically means that your memory
    element like a register gets flipped. And
  • 18:01 - 18:05
    then of course, if your system is not
    designed to handle this type of errors
  • 18:05 - 18:10
    properly, it can lead to a failure. So in
    the following part of the presentation
  • 18:10 - 18:15
    we'll focus mostly on soft errors. So
    let's try to understand what is the origin
  • 18:15 - 18:21
    of this type of problem. So as Stefan
    mentioned, the typical transistor is built
  • 18:21 - 18:25
    out of diffusions, gate and channel. So
    here you can see one diffusion. Let's
  • 18:25 - 18:29
    assume that it is a drain diffusion. And
    then when a particle goes through and
  • 18:29 - 18:37
    deposits charge, it creates free electron and
    hole pairs, which then in the presence of
  • 18:37 - 18:43
    electric fields, they get collected by
    means of drift, which results in a large
  • 18:43 - 18:47
    current spike, which is very short. And
    then the rest of the charge could be
  • 18:47 - 18:51
    collected by diffusion which is a much
    slower process and therefore also the
  • 18:51 - 18:56
    amplitude of the event is much, much
    smaller. So let's try to understand what
  • 18:56 - 19:01
    could happen in a typical memory cell. So
    on this schematic, you can see the
  • 19:01 - 19:06
    simplest memory cell, which is composed of
    two back-to-back inverters. And let's
  • 19:06 - 19:13
    assume that node A is at high and node B
    is at low potential initially. And then we
  • 19:13 - 19:17
    have a particle hitting the drain of
    transistor M1 which creates a short
  • 19:17 - 19:23
    circuit current between drain and ground,
    bringing the drain of transistor M1 to low
  • 19:23 - 19:30
    potential, which also acts on the gates of
    second inverter, temporarily changing its
  • 19:30 - 19:39
    state from low to high, which reinforces
    the wrong state in the first inverter. And
  • 19:39 - 19:45
    at this time the error is locked in your
    memory cell and you basically lost your
  • 19:45 - 19:50
    information. So you may be asking
    yourself: 'How much charge is needed
  • 19:50 - 19:54
    really to flip a state of a memory cell?'.
    And you can get this number from either
  • 19:54 - 20:00
    simulations or from measurements. So let's
    assume that what we could do, we could try
  • 20:00 - 20:05
    to inject some current into the sensitive
    node, for example, drain of transistor M1.
  • 20:05 - 20:09
    And here what I will show is that on the
    top plot you will have current as a function
  • 20:09 - 20:13
    of time. On the second plot you will have
    output voltage. So voltage at node B as a
  • 20:13 - 20:19
    function of time and at the lowest plot you
    will see a probability of having a bit
  • 20:19 - 20:23
    flip. So if you inject very little
    current, of course nothing changes at the
  • 20:23 - 20:28
    output, but once you start increasing the
    amount of current you are injecting, you
  • 20:28 - 20:33
    see that something appears at the output
    and at some point the output will toggle,
  • 20:33 - 20:40
    so it will switch to the other state. And
    at this point, if you really calculate
  • 20:40 - 20:46
    what is the area under the current curve
    you can find what is the critical charge
  • 20:46 - 20:53
    needed to flip the memory cell. And if you
    go further, if you start injecting even
  • 20:53 - 21:01
    more current, you will not see that much
    difference in the output voltage waveform.
  • 21:01 - 21:05
    It could become only slightly faster. And
    at this point, you also can notice that
  • 21:05 - 21:10
    the probability now jumped to one, which
    means that any time you inject so much
  • 21:10 - 21:17
    current there is a fault in your circuit.
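The "area under the current curve" mentioned above is simply the time integral of the injected current. A minimal Python sketch, assuming the pulse is available as sampled (time, current) points and using the trapezoidal rule; the triangular pulse shape and its values are hypothetical and only chosen so the result is easy to check by hand.

```python
def collected_charge(times_s, currents_a):
    """Integrate current over time with the trapezoidal rule; returns coulombs."""
    if len(times_s) != len(currents_a):
        raise ValueError("times and currents must have the same length")
    charge = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        charge += 0.5 * (currents_a[i] + currents_a[i - 1]) * dt
    return charge


# Hypothetical triangular pulse: 0 -> 100 uA -> 0 over 200 ps.
times = [0.0, 100e-12, 200e-12]
currents = [0.0, 100e-6, 0.0]

q = collected_charge(times, currents)
print(f"{q * 1e15:.1f} fC")  # 10.0 fC
```

Sweeping the pulse amplitude in a circuit simulation and recording the smallest charge at which the output toggles would then give the critical charge of the cell.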
    So for now, we just found what is the
  • 21:17 - 21:23
    probability of having a bit-flip from 0 to
    1 in node B. Of course we should also
  • 21:23 - 21:28
    calculate the same for the other
    direction, so from 1 to zero. And usually
  • 21:28 - 21:32
    it is slightly different. And then of
    course we should inject in all the other
  • 21:32 - 21:38
    nodes, for example node B and also should
    study all possible transitions. And then
  • 21:38 - 21:43
    at the end, if you calculate the
    superposition of these effects and you
  • 21:43 - 21:49
    multiply them by the active area of each
    node, you will end up with what we call
  • 21:49 - 21:52
    the cross section, which has a dimension
    of centimeters squared, which will tell
  • 21:52 - 21:57
    you how sensitive your circuit is to this
    type of effects. And then knowing the
  • 21:57 - 22:04
    radiation profile of your environment, you
    can calculate the expected upset rate in
  • 22:04 - 22:10
    the final application. So now, having
    covered the basics of the single event
  • 22:10 - 22:17
    effects, let's try to check how we can
    mitigate them. And here also technology
  • 22:17 - 22:21
    plays a significant role. So of course,
    newer technologies offer us much smaller
  • 22:21 - 22:27
    devices. And together with that, what
    follows is that usually supply voltages
  • 22:27 - 22:31
    are getting smaller and smaller as well as
    the node capacitance, which means that for
  • 22:31 - 22:36
    our Single Event Upsets it is very bad
    because the critical charge which is
  • 22:36 - 22:40
    required to flip our bit is getting less
    and less. But at the end, at the same
  • 22:40 - 22:44
    time, physical dimensions of our
    transistors are getting smaller, which
  • 22:44 - 22:48
    means that the cross section for them
    being hit is also getting smaller. So
  • 22:48 - 22:52
    overall, the effects really depend on the
    circuit topology and the radiation
  • 22:52 - 22:59
    environment. So another protection method
    could be introduced on the cell level. And
  • 22:59 - 23:05
    here we could imagine increasing the
    critical charge. And that could be done in
  • 23:05 - 23:11
    the easiest way by just increasing the
    node capacitance by, for example, putting
  • 23:11 - 23:16
    larger transistors. But of course, this
    also increases the collection electrode,
  • 23:16 - 23:23
    which is not nice. And another way could
    be just increase the capacitance by adding
  • 23:23 - 23:28
    some extra metal capacitance, but it, of
    course, slows down the circuit. Another
  • 23:28 - 23:34
    approach could be to try to store the
    information on more than two nodes. So I
  • 23:34 - 23:38
    showed you that on a simple SRAM cell we
    store information only on two nodes, so
  • 23:38 - 23:43
    you could try to come up with some other
    cells, for example, like that one in which
  • 23:43 - 23:47
    the information is stored on four nodes.
    So you can see that the architecture is
  • 23:47 - 23:54
    very similar to the basic SRAM cell. But
    you should be careful always to very
  • 23:54 - 23:59
    carefully simulate your design, because if
    we analyze this circuit, you will quickly
  • 23:59 - 24:03
    realize that this circuit, even though the
    information is stored in four different
  • 24:03 - 24:10
    nodes, the same type of loop exists as in
    the basic circuit. Meaning that at the end
  • 24:10 - 24:15
    the circuit offers basically no hardening
    with respect to the previous cell. So
  • 24:15 - 24:21
    actually we can do it better. So here you
    can see a typical dual interlocked cell.
  • 24:21 - 24:26
    So the amount of transistors is exactly
    the same as in the previous example, but
  • 24:26 - 24:31
    now they are interconnected slightly
    differently. And here you can see that
  • 24:31 - 24:36
    this cell has also two stable configurations.
    But this time data can propagate, the low
  • 24:36 - 24:41
    level from a given node can propagate
    only to the left hand side, while high
  • 24:41 - 24:48
    level can propagate to the right hand
    side. And each stage being inverting means
  • 24:48 - 24:55
    that the fault can not propagate for more
    than one node. Of course, this cell has
  • 24:55 - 25:00
    some drawbacks: It consumes more area than
    a simple SRAM cell and also write access
  • 25:00 - 25:04
    requires accessing at least two nodes at
    the same time to really change the state
  • 25:04 - 25:10
    of the cell. And so you may ask yourself,
    how effective is this cell? So here I will
  • 25:10 - 25:14
    show you a cross section plot. So it is
    the probability of having an error as a
  • 25:14 - 25:19
    function of injected energy. And as a
    reference, you can see a pink curve on the
  • 25:19 - 25:26
    top, which is for a normal, not protected
    cell. And on the green you can see the
  • 25:26 - 25:31
    cross section for the error in the DICE
    cell. So as you can see, it is one order
  • 25:31 - 25:37
    of magnitude better than the normal cell.
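To relate such a cross section to an expected upset rate, a first-order estimate multiplies the cross section by the particle flux and the number of cells. The sketch below assumes a cross section that is constant over the particle spectrum, which is a simplification because the plot shows it depends on the deposited energy. All numbers are hypothetical; only the factor of ten between the two cells is taken from the talk.

```python
def upset_rate_per_s(cross_section_cm2: float, flux_per_cm2_s: float, n_cells: int = 1) -> float:
    """Expected upsets per second: cross section [cm^2] x flux [particles/cm^2/s] x cells."""
    return cross_section_cm2 * flux_per_cm2_s * n_cells


FLUX = 1e3              # hypothetical particle flux, particles / cm^2 / s
N_CELLS = 10_000        # hypothetical number of storage cells
SIGMA_STANDARD = 1e-8   # hypothetical cross section of an unprotected cell, cm^2
SIGMA_DICE = 1e-9       # one order of magnitude lower, as for the DICE cell

for name, sigma in [("standard", SIGMA_STANDARD), ("DICE", SIGMA_DICE)]:
    rate = upset_rate_per_s(sigma, FLUX, N_CELLS)
    print(f"{name}: about {rate * 3600:.0f} expected upsets per hour")
```

A more careful estimate would integrate the energy-dependent cross section over the spectrum of the actual environment.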
    But still, the cross section is far from
  • 25:37 - 25:41
    being negligible. So the problem was
    identified: So it was identified that the
  • 25:41 - 25:46
    problem was caused by the fact that some
    sensitive nodes were very close together
  • 25:46 - 25:51
    on the layout and therefore they could be
    upset by the same particle. Because as we
  • 25:51 - 25:55
    mentioned, single devices are very
    small. We are talking about dimensions
  • 25:55 - 26:00
    below a micron. So after realizing that,
    we designed another cell in which we
  • 26:00 - 26:05
    separated more sensitive nodes and we
    ended up with the blue curve, and as you
  • 26:05 - 26:09
    can see the cross section was reduced by
    two more orders of magnitude and the
  • 26:09 - 26:14
    threshold was increased significantly. So
    if you don't want to redesign your
  • 26:14 - 26:19
    standard cells, you could also apply some
    mitigation techniques on block level. So
  • 26:19 - 26:25
    here we can use some encoding to encode
    our state better. And as an example, I
  • 26:25 - 26:32
    will show you a typical Hamming code. So
    to protect four bits, we have to add three
  • 26:32 - 26:38
    additional parity bits which are calculated
    according to this formula. And then once
  • 26:38 - 26:44
    you calculate the parity bits, you can use
    those to check the state integrity of your
  • 26:44 - 26:50
    internal state. And if any of the parity
    bits is not equal to zero, then the bits
  • 26:50 - 26:55
    instantaneously become syndromes,
    indicating where the error happened. And
  • 26:55 - 27:00
    you can use this information to correct
    the error. Of course, in this case, the
  • 27:00 - 27:07
    efficiency is not really nice because we
    need three additional bits to protect only
  • 27:07 - 27:12
    four bits of information. But as the state
    length increases, the protection also becomes
  • 27:12 - 27:19
    more efficient. Another approach would be
    to do even less. Meaning that instead of
  • 27:19 - 27:24
    changing anything in your design,
    you can just triplicate your design or
  • 27:24 - 27:30
    multiply it many times and just vote,
    which state is correct? So this concept is
  • 27:30 - 27:35
    called triple modular redundancy and it is
    based around a voter cell. So it is a
  • 27:35 - 27:40
    cell which has an odd number of
    inputs and whose output is always equal to
  • 27:40 - 27:45
    the majority of its inputs. And as I mentioned,
    the idea is that you have, for
  • 27:45 - 27:49
    example, three circuits: A, B and C, and
    during normal operation, when they are
  • 27:49 - 27:54
    identical, the output is also the same.
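
The voter cell described here can be sketched in a few lines of Python. This is only an illustration of the majority function, not the cell used in the speakers' chips; the function name `majority` and the example values are assumptions made for this sketch.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote of three copies.

    Each output bit is 1 when at least two of the three inputs
    have a 1 in that position.
    """
    return (a & b) | (a & c) | (b & c)


# Normal operation: all three copies agree, so the output is the same value.
good = 0b1011
print(majority(good, good, good) == good)

# One copy (here the second) has a flipped bit; the other two outvote it.
upset = good ^ 0b0100
print(majority(good, upset, good) == good)
```

Because the vote is taken bit by bit, a single corrupted copy is masked regardless of which bit was flipped.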
    However, when there is a problem, for
  • 27:54 - 28:01
    example, in logic part B, its output
    is affected. But this problem is
  • 28:01 - 28:06
    effectively masked by the voter cell
    and it is not visible from outside of the
  • 28:06 - 28:10
    circuit. But you have to be careful not to
    take this picture as a design
  • 28:10 - 28:16
    template. So let's try to analyze what
    would happen with a state machine
  • 28:16 - 28:20
    similar to what Stefan introduced if you
    were to just use this concept. So here you
  • 28:20 - 28:25
    can see three state machines and
    a voter on the output. And as we can see,
  • 28:25 - 28:29
    if you have an upset in, for example, the
    state register A, then the state is
  • 28:29 - 28:37
    broken. But still the output of the
    circuit, which is indicated by letter s is
  • 28:37 - 28:42
    correct because the B and C registers are
    still fine. But what happens if some time
  • 28:42 - 28:49
    later we have an upset in memory element B
    or C? Then of course the state
  • 28:49 - 28:56
    of our system is broken and we can not
    recover it. So you can ask yourself what
  • 28:56 - 29:02
    can we do better in order to avoid this
    situation? So, just to be sure: please
  • 29:02 - 29:07
    do not use this technique to protect your
    circuits. So the easiest mitigation could
  • 29:07 - 29:13
    be to use as an input to your logic
    the output of the voter cell itself.
  • 29:13 - 29:18
    What it offers us is that now whenever you
    have an upset in one of the memory
  • 29:18 - 29:23
    elements for the next computation, for the
    next stage, we always use the voter
  • 29:23 - 29:28
    output, which ensures that the error
    will be removed one clock cycle later. So
  • 29:28 - 29:33
    if you have another hit some time later,
    basically, it will not affect our state.
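
The difference between voting only on the output and feeding the voter output back into the logic can be illustrated with a small simulation. This is a hedged sketch, not the speakers' circuit: the state machine is assumed to be a 4-bit counter, and the names `run`, `next_state` and the upset schedule are hypothetical.

```python
def majority(a, b, c):
    """Bitwise majority vote of three register copies."""
    return (a & b) | (a & c) | (b & c)


def next_state(state):
    """Assumed example logic: a 4-bit counter."""
    return (state + 1) & 0xF


def run(cycles, upsets, feedback):
    """Simulate three register copies and return the voted output per cycle.

    upsets maps a cycle number to (register index, bit mask); the flip is
    applied right after that cycle's clock edge.
    With feedback=True every copy computes its next state from the voter
    output; with feedback=False every copy uses only its own state.
    """
    regs = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        voted = majority(*regs)
        outputs.append(voted)
        if feedback:
            regs = [next_state(voted)] * 3
        else:
            regs = [next_state(r) for r in regs]
        if cycle in upsets:
            index, mask = upsets[cycle]
            regs[index] ^= mask
    return outputs


# First upset hits copy A, a later one hits copy B in the same bit.
upsets = {2: (0, 0b1000), 6: (1, 0b1000)}
golden = list(range(10))

print(run(10, upsets, feedback=True) == golden)   # errors are flushed each cycle
print(run(10, upsets, feedback=False) == golden)  # errors accumulate and win the vote
```

Without feedback the first upset stays in copy A forever, so the second upset gives the wrong value a two-out-of-three majority. With feedback each upset survives for only one clock cycle.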
  • 29:33 - 29:40
    Until now we considered only upsets in our
    registers but what happens if we have
  • 29:40 - 29:46
    a transient in our voter? So you see that
    if there is no state change, basically the
  • 29:46 - 29:51
    transient in the voter doesn't impact
    our system. But if you are really unlucky
  • 29:51 - 29:56
    and the transient happens when the clock
    transition happens, so whenever we
  • 29:56 - 30:01
    latch the data, we can corrupt the state
    in three registers at the same time, which
  • 30:01 - 30:06
    is less than ideal. So to overcome this
    limitation, you can consider skewing our
  • 30:06 - 30:11
    clocks by some time, which is larger than
    the maximum transient time. And now,
  • 30:11 - 30:18
    because each register samples the
    output of the voter at a slightly different
  • 30:18 - 30:23
    time, we can corrupt only one flip-flop
    at a time. So of course, if you are
  • 30:23 - 30:29
    unlucky, we can have problematic
    situations in which one register is
  • 30:29 - 30:34
    already in the new state. The other register
    is still in the old state. And then it
  • 30:34 - 30:40
    can lead to a non-deterministic result. So it
    is better, but still not ideal. So as a
  • 30:40 - 30:47
    general theme, you have seen that we were
    adding and adding more resources so you
  • 30:47 - 30:50
    can ask yourself what would happen if we
    triplicate everything. So in this case,
  • 30:50 - 30:54
    we triplicated registers, we
    triplicated our logic and our voters. And
  • 30:54 - 30:59
    now you can see that whenever we have an
    upset in our register, it can only affect
  • 30:59 - 31:05
    one register at a time and the error
    will be removed from the system one clock
  • 31:05 - 31:09
    cycle later. Also, if we have an upset
    in the voter or in the logic, it can be
  • 31:09 - 31:13
    latched only into one register, which means
    that in principle we created a system
  • 31:13 - 31:18
    which is really robust. Unfortunately,
    nothing is for free. So here I compare
  • 31:18 - 31:23
    different triplication schemes and,
    as you can see, the more protection
  • 31:23 - 31:26
    you want to have, the more you have to pay
    in terms of resources, being power and
  • 31:26 - 31:31
    area. And also, usually, you pay a small
    penalty in terms of maximum operational
  • 31:31 - 31:38
    speed. So which flavor of protection you
    use depends really on the
  • 31:38 - 31:42
    application. So for the most sensitive
    circuits, you probably want to use
  • 31:42 - 31:48
    full TMR and you may leave some other
    bits of logic unprotected. And if
  • 31:48 - 31:55
    your system is not mission critical and
    you can tolerate some downtime, you can
  • 31:55 - 32:00
    consider scrubbing, which means periodically
    checking the state of your system and refreshing it
  • 32:00 - 32:05
    if an error is detected, using
    some parity bits or a copy of the data in
  • 32:05 - 32:10
    a safe space. Or you can have a
    watchdog which will find out that
  • 32:10 - 32:14
    something went wrong and it will just
    reinitialize the whole system. So now,
  • 32:14 - 32:20
    having covered the basics of all the effects
    we will have to face, we would like
  • 32:20 - 32:24
    to show you the basic flow which we follow
    during designing our radiation hardened
  • 32:24 - 32:30
    circuits. So of course we always start
    with specifications. So we try to
  • 32:30 - 32:34
    understand our radiation environment in
    which the circuit is meant to operate. So
  • 32:34 - 32:39
    we come up with some specifications for
    total dose which could be accumulated and
  • 32:39 - 32:45
    for the rate of single event upsets. And
    at this moment, it is also not very rare
  • 32:45 - 32:50
    that we have to decide to move some
    functionality out of our detector volume,
  • 32:50 - 32:56
    outside, where we can use off-the-shelf
    commercial equipment to do number
  • 32:56 - 33:05
    crunching. But let's assume that we would
    go with our ASIC. So having the
  • 33:05 - 33:09
    specifications, of course we proceed with
    functional implementation. This we
  • 33:09 - 33:14
    typically do with hardware description
    languages, so Verilog or VHDL, which you may
  • 33:14 - 33:19
    know from typical FPGA flow. And of course
    we write a lot of simulations to
  • 33:19 - 33:24
    understand whether we are meeting our
    functional goals or whether our circuit
  • 33:24 - 33:31
    behaves as expected. And then we
    selectively select some parts of the
  • 33:31 - 33:36
    circuits which we want to protect from
    radiation effects. So, for example, we can
  • 33:36 - 33:42
    decide to use triplication or some other
    methods. So these days we typically use
  • 33:42 - 33:47
    triplication as the most straightforward
    and very effective method. So you can ask
  • 33:47 - 33:51
    yourself how do we triplicate the logic?
    So the simplest could be: Just copy
  • 33:51 - 33:55
    and paste the code three times, add some
    postfixes like A, B and C and you are
  • 33:55 - 34:02
    done. But of course this solution has some
    drawbacks. So it is time consuming and it
  • 34:02 - 34:06
    is very error prone. So maybe you have
    noticed that I had a typo there. So of
  • 34:06 - 34:10
    course we don't want to do that. So we
    developed our own tool, which we called
  • 34:10 - 34:17
    TMRG, which automates the process of
    triplication and eliminates the two main
  • 34:17 - 34:22
    drawbacks, which I just described. So
    after we have our code triplicated and of
  • 34:22 - 34:27
    course, not before rerunning all the
    simulations to make sure that everything
  • 34:27 - 34:34
    went as expected. We then proceed to the
    synthesis process in which we convert our
  • 34:34 - 34:41
    high level hardware description languages
    to gate level netlists, in which all the functions
  • 34:41 - 34:46
    are mapped to gates, which were introduced
    by Stefan, so both combinatorial and
  • 34:46 - 34:54
    sequential. And here we also have to be
    careful because modern CAD tools have a
  • 34:54 - 34:59
    tendency, of course, to optimise the logic
    as much as possible. And our logic in most
  • 34:59 - 35:04
    of the cases is really redundant. So it is
    very easy for the tool to decide that it should be removed. So we
  • 35:04 - 35:09
    really have to make sure that it is not
    removed. That's why our tool also provides
  • 35:09 - 35:14
    some constraints for the synthesizer to
    make sure that our design intent is
  • 35:14 - 35:21
    clearly and well understood by the tool.
    And once we have the output netlist, we
  • 35:21 - 35:27
    proceed to place and route process where
    this kind of netlist representation is
  • 35:27 - 35:33
    mapped to a layout of what will soon become
    our digital chip, where we place all
  • 35:33 - 35:37
    the cells and we route connections between
    them and here there is
  • 35:37 - 35:41
    another danger which I mentioned already,
    it's that in modern technologies the cells
  • 35:41 - 35:46
    are so small that they could be easily
    affected by a single particle at the same
  • 35:46 - 35:52
    time. So we have to really space out
    the big cells which are responsible for
  • 35:52 - 35:57
    keeping the information about the state to
    make sure that a single particle cannot
  • 35:57 - 36:05
    upset, for example, copies A and B
    of the same register. And then in the
  • 36:05 - 36:10
    last step, of course, we have to verify
    that everything we have done is
  • 36:10 - 36:14
    correct. And at this level, we also try to
    introduce some single event effects in our
  • 36:14 - 36:20
    simulations. So we could randomly flip
    bits in our system. We can also inject
  • 36:20 - 36:26
    transients. And typically we used to do
    that on the netlist level, which works
  • 36:26 - 36:31
    very fine. And it is very nice. But the
    problem with this approach is that we can
  • 36:31 - 36:38
    perform these actions very late in the
    design cycle, which is less than ideal.
  • 36:38 - 36:43
    And also, if we find that there is a
    problem in our simulation, a typical netlist
  • 36:43 - 36:48
    at this level has probably a few orders of
    magnitude more lines than our initial RTL
  • 36:48 - 36:53
    code. So to trace back what is the
    problematic line of code is not so
  • 36:53 - 36:58
    straightforward at this stage. So you can
    ask yourself: why not try to inject
  • 36:58 - 37:05
    errors in the RTL design? And the answer
    is that it is not so
  • 37:05 - 37:11
    trivial to map the hardware description
    language's high level constructs to
  • 37:11 - 37:16
    what will become combinatorial or
    sequential logic. So in order to eliminate
  • 37:16 - 37:21
    this problem, we also developed another open
    source tool, which allows us to...
  • 37:21 - 37:28
    So we decided to use Yosys, the open
    source synthesis tool from Clifford, which
  • 37:28 - 37:32
    was presented at the Congress several
    years ago. So we use this tool to make a
  • 37:32 - 37:36
    first pass through our RTL code to
    understand which elements will be mapped
  • 37:36 - 37:41
    to sequential and combinatorial. And then
    having this information, we will use
  • 37:41 - 37:46
    cocotb, a Python verification
    framework, which allows us programmatic
  • 37:46 - 37:52
    access to these nodes and we can
    effectively inject the errors in our
  • 37:52 - 37:57
    simulations. And I forgot to mention that
    the TMRG tool is also open source. So if
  • 37:57 - 38:04
    you are interested in one of the tools,
    please feel free to contact us. And of
  • 38:04 - 38:11
    course, after our simulation is done, then in
    the next step we would really tape out. And
  • 38:11 - 38:15
    so we submit our chip to manufacturing and
    hopefully a few months later we receive
  • 38:15 - 38:18
    our chip back.
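
The idea of injecting bit flips in simulation can be tried out in plain Python on the Hamming code mentioned earlier. This is a minimal sketch under stated assumptions, not the speakers' flow, which operates on Verilog designs. The bit layout is the textbook Hamming(7,4) arrangement with parity bits at positions 1, 2 and 4, which is an assumption because the formula shown on the slide is not part of the transcript. The function names `encode`, `decode` and `campaign` are hypothetical.

```python
from itertools import product


def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(word):
    """Return (corrected data bits, syndrome).

    A syndrome of 0 means no error was seen; otherwise it is the
    1-based position of the flipped bit, which is corrected in place.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], syndrome


def campaign():
    """Flip every single bit of every codeword and count decoding failures."""
    injections = 0
    failures = 0
    for data in product([0, 1], repeat=4):
        word = encode(list(data))
        for position in range(7):
            faulty = list(word)
            faulty[position] ^= 1  # injected single event upset
            recovered, syndrome = decode(faulty)
            injections += 1
            if recovered != list(data) or syndrome != position + 1:
                failures += 1
    return injections, failures


print(campaign())
```

The sketch flips bits exhaustively rather than randomly so that the result is reproducible. Two flips within the same codeword are beyond what this code can correct, which is one reason such campaigns are worth running on the real design.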
    Stefan: All right. So after patiently
  • 38:18 - 38:24
    waiting then for a couple of months while
    your chip is in manufacturing and you're
  • 38:24 - 38:28
    spending time on preparing a test set up
    and preparing yourself to actually test if
  • 38:28 - 38:34
    your chip works as you expect it to. Now,
    it's probably also a good time to think
  • 38:34 - 38:38
    about how to actually validate or test if
    all the measures that you've taken to
  • 38:38 - 38:41
    protect your circuit from radiation
    effects actually are effective or if they
  • 38:41 - 38:46
    are not. And so again, we will split this
    in two parts. So you will probably want to
  • 38:46 - 38:50
    start with testing for the total ionizing
    dose effects. So for the cumulative effect
  • 38:50 - 38:55
    and for that, you typically use X-ray
    radiation relatively similar to the one
  • 38:55 - 38:59
    used in medical treatment. So this
    radiation is relatively low-energy,
  • 38:59 - 39:03
    which has the upside of not producing any
    single event effects, but you can really
  • 39:03 - 39:07
    only accumulate radiation dose and focus
    on the accumulating effects. And typically
  • 39:07 - 39:12
    you would use a machine that looks
    somewhat like this, a relatively compact
  • 39:12 - 39:17
    thing you can have in your laboratory, and
    you can use that to really accumulate
  • 39:17 - 39:22
    large amounts of radiation dose on your
    circuit. And then you need some sort of
  • 39:22 - 39:27
    mechanism to verify or to quantify how
    much your circuit slows down due to this
  • 39:27 - 39:31
    radiation dose. And if you do that, you
    typically end up with a graphic such as
  • 39:31 - 39:37
    this one, where in the x axis you have the
    radiation dose your circuit was exposed
  • 39:37 - 39:41
    to. And on the y axis, you see that the
    frequency has gone down over time and you
  • 39:41 - 39:45
    can use this information to say:
    "OK, in my final application, I expect this
  • 39:45 - 39:49
    level of radiation dose. I mean, I can
    still see that my circuit will work fine
  • 39:49 - 39:54
    under some given environmental condition
    or some operation condition." So this is
  • 39:54 - 39:58
    the test for the first class of effects.
    And the test for the second class of
  • 39:58 - 40:02
    effects, the single event effects, is a
    bit more involved. So there what you would
  • 40:02 - 40:07
    typically start to do is go for a heavy
    ion test campaign. So you would go to a
  • 40:07 - 40:13
    specialized, relatively rare facility. We
    have a couple of those in Europe, and they would
  • 40:13 - 40:17
    look perhaps somewhat like this. So it's a
    small particle accelerator somewhere.
  • 40:17 - 40:21
    They typically have
    different types of heavy ions at their
  • 40:21 - 40:26
    disposal that they can accelerate and then
    shoot at your chip that you can place in a
  • 40:26 - 40:32
    vacuum chamber and these ions can deposit
    very well known amounts of energy in your
  • 40:32 - 40:37
    circuit and you can use that information
    to characterize your circuit. The downside
  • 40:37 - 40:41
    is a bit that these facilities tend to be
    relatively expensive to access and also a
  • 40:41 - 40:45
    bit hard to access. So typically you need
    to book them a lot of time in advance and
  • 40:45 - 40:50
    that's sometimes not very easy. But what
    it offers you, you can use different types
  • 40:50 - 40:55
    of ions with different energies. You can
    really make a very well-defined
  • 40:55 - 41:00
    sensitivity curve similar to the one that
    Szymon has described. You can get from
  • 41:00 - 41:04
    simulations and really characterize your
    circuit for how often any single event
  • 41:04 - 41:09
    effects will appear in the final
    application, if there are any remaining
  • 41:09 - 41:13
    effects left. If you have left something
    unprotected. The problem here is that
  • 41:13 - 41:18
    these particle accelerators typically just
    bombard your circuit with like thousands
  • 41:18 - 41:23
    of particles per second and they hit
    basically the whole area in a random
  • 41:23 - 41:27
    fashion. So you don't really have a way of
    steering those or measuring the position
  • 41:27 - 41:31
    of these particles. So typically you are a
    bit in the dark and have to really
  • 41:31 - 41:35
    carefully know the behavior of your
    circuit and all the quirks it has even
  • 41:35 - 41:39
    without the radiation to instantly notice
    when something has gone wrong. And
  • 41:39 - 41:44
    this is typically not very easy
    and you can kind of compare it with having
  • 41:44 - 41:47
    some weird crash somewhere in your
    software stack and then having to
  • 41:47 - 41:52
    first take a look and see what actually
    has happened. Typically
  • 41:52 - 41:57
    you find something that has not been
    properly protected and you see some weird
  • 41:57 - 42:02
    effect on your circuit and then you try to
    get a better idea of where that problem
  • 42:02 - 42:06
    actually is located. And the answer for
    these types of problems involving position
  • 42:06 - 42:11
    is, of course, always lasers. So we have
    two types of laser experiments available
  • 42:11 - 42:16
    that can be used to more selectively probe
    your circuit for these problems. The first
  • 42:16 - 42:20
    one being the single photon absorption
    laser. And it is relatively
  • 42:20 - 42:25
    simple in terms of setup. You just use a
    single laser beam that shoots straight up
  • 42:25 - 42:30
    at your circuit from the back. And while
    it does that, it deposits energy all along
  • 42:30 - 42:34
    the silicon and also in the diffusions of
    your transistors and is therefore also
  • 42:34 - 42:38
    able to inject energy there, potentially
    upsetting a bit of memory or exposing
  • 42:38 - 42:43
    whatever other single event effects you
    have. And of course, you can steer this
  • 42:43 - 42:47
    beam across the surface of your chip or
    whatever circuit you are testing and then
  • 42:47 - 42:51
    find the sensitive location. The problem
    here is that the amount of energy that is
  • 42:51 - 42:55
    deposited is really large due to the fact
    that it has to go through the whole
  • 42:55 - 42:59
    silicon until it reaches the transistor.
    And therefore it's mostly used to find
  • 42:59 - 43:03
    these destructive effects that really
    break something in your circuit. The more
  • 43:03 - 43:08
    clever and somehow beautiful experiment is
    the two photon absorption laser experiment
  • 43:08 - 43:13
    in which you use two laser beams of a
    different wavelength. And these actually
  • 43:13 - 43:18
    do not have enough energy to cause any
    effect in your silicon. If only one of the
  • 43:18 - 43:22
    laser beams is present, but only in the
    small location where the two beams
  • 43:22 - 43:27
    intersect, the energy is actually large
    enough to produce the effect. And this
  • 43:27 - 43:31
    allows you to very selectively and only on
    a very small volume induce charge and
  • 43:31 - 43:38
    cause an effect in your circuit. And when
    you do that now, you can systematically
  • 43:38 - 43:42
    scan both the X and Y directions across
    your chip and also the Z direction and can
  • 43:42 - 43:46
    really measure the volume of sensitive
    area. And this is what you would typically
  • 43:46 - 43:51
    get out of such an experiment. So in black and
    white in the back, you'll see an infrared
  • 43:51 - 43:55
    image of your chip where you can really
    make out the individual, say structural
  • 43:55 - 43:59
    components. And then overlaid in blue, you
    can basically highlight all the sensitive
  • 43:59 - 44:04
    points that made you measure something you
    didn't expect, some weird bit flip in a
  • 44:04 - 44:08
    register or something. And you can really
    then go to your layout software and find
  • 44:08 - 44:14
    what is the register or the gate in
    your netlist that is responsible for
  • 44:14 - 44:17
    this. And then it's more like operating a
    debugger in a software environment.
  • 44:17 - 44:23
    Tracing back from there what the line of
    code responsible for this bug is. And
  • 44:23 - 44:31
    to close out, it is always best to learn
    from mistakes. And we offer our mistakes
  • 44:31 - 44:36
    as a guideline in case you ever feel
    the need to design radiation
  • 44:36 - 44:41
    tolerant circuits. So we want to present
    two or three small issues we had in
  • 44:41 - 44:45
    circuits where we were convinced it should
    have been working fine. So the first one
  • 44:45 - 44:50
    as you will probably recognize, is this
    full triple modular redundancy scheme that
  • 44:50 - 44:55
    Szymon has presented. So we made sure to
    triplicate everything and we were relatively
  • 44:55 - 44:59
    sure that everything should be fine. The
    only modification we did is that to all
  • 44:59 - 45:04
    those registers in our design, we've added
    a reset, because we wanted to initialize
  • 45:04 - 45:08
    the system to some known state when we
    started up, which is a very obvious thing
  • 45:08 - 45:12
    to do. Every CPU has a reset. But of
    course, what we didn't think about here
  • 45:12 - 45:17
    was that at some point there's a buffer
    driving this reset line somewhere. And if
  • 45:17 - 45:20
    there's only a single buffer. What happens
    if this buffer experiences a small
  • 45:20 - 45:25
    transient event? Of course, the obvious
    thing that happened is that as soon as
  • 45:25 - 45:28
    that happened, all the registers were
    upset at the same time and were basically
  • 45:28 - 45:32
    cleared and all our fancy protection was
    invalidated. So next time we decided,
  • 45:32 - 45:38
    let's be smarter this time. And of course,
    we triplicate all the logic and all the
  • 45:38 - 45:41
    voters and all the registers. So let's
    also triplicate the reset lines. And while
  • 45:41 - 45:45
    the designer of that block probably had
    very good intentions, it turned out
  • 45:45 - 45:49
    that later, when we manufactured the
    chip, it still sometimes showed a complete
  • 45:49 - 45:55
    reset without any good explanation for
    that. And what was left out of the
  • 45:55 - 46:00
    scope of thinking here was that this reset
    actually was connected to the system reset
  • 46:00 - 46:05
    of the chip that we had. And typically
    pins on the chip are something that is
  • 46:05 - 46:09
    not available in huge quantities. So you
    typically don't want to spend three pins
  • 46:09 - 46:13
    of your chip just for a stupid reset that
    you don't use ninety nine percent of the
  • 46:13 - 46:18
    time. So what we did at some point we just
    connected again the reset lines to a
  • 46:18 - 46:22
    single input buffer. That was then
    connected to a pin of the chip. And of
  • 46:22 - 46:26
    course, this also represented a small
    sensitive area in the chip. And again,
  • 46:26 - 46:30
    a single upset here was able to destroy
    all three of our flip flops. All right.
  • 46:30 - 46:35
    And the last lesson I'm bringing or the
    last thing that goes back to the
  • 46:35 - 46:39
    implementation details that Szymon has
    mentioned. So this time, really simple
  • 46:39 - 46:43
    circuit. We were absolutely convinced it
    must work because it was basically the
  • 46:43 - 46:46
    textbook example that Szymon was
    presenting. And the code was so
  • 46:46 - 46:50
    small we were able to inspect everything
    and were very much sure that nothing
  • 46:50 - 46:55
    should have happened. And what we saw when
    we went for this laser testing experiment,
  • 46:55 - 47:00
    in simplified form, is basically that
    only this first voter was sensitive: when this was
  • 47:00 - 47:04
    hit, all our registers were always
    upset, while the other voters
  • 47:04 - 47:09
    never showed anything strange.
    And it took us quite a while to actually
  • 47:09 - 47:14
    look at the layout later on and figure out
    that what was in the chip was rather this.
  • 47:14 - 47:17
    So two of the voters were actually not
    there. And Szymon mentioned the reason for
  • 47:17 - 47:21
    that. So synthesis tools these days are
    really clever at identifying redundant
  • 47:21 - 47:26
    logic and because we forgot to tell it to
    not optimize these redundant pieces of
  • 47:26 - 47:30
    logic, which the voters really are. It
    just merged them into one. And that
  • 47:30 - 47:34
    explains why we only saw this one voter
    being the sensitive one. And of course, if
  • 47:34 - 47:38
    you have a transient event there, then you
    suddenly upset all your registers and that
  • 47:38 - 47:42
    without even knowing it and with being
    sure, having looked at every single line
  • 47:42 - 47:46
    of Verilog code and being very sure,
    everything should have been fine. But that
  • 47:46 - 47:52
    seems to be how this business goes. So we
    hope we had the chance and you
  • 47:52 - 47:57
    were able to get some insight into what
    we do to make sure the experiments at the
  • 47:57 - 48:02
    LHC work fine. What you can do to
    make sure the satellite you are working on
  • 48:02 - 48:06
    might be working OK, even before launching
    it into space. If you're interested in
  • 48:06 - 48:11
    some more information on this topic, feel
    free to pass by at the assembly I
  • 48:11 - 48:15
    mentioned at the beginning or just meet us
    after the talk and otherwise thank you
  • 48:15 - 48:22
    very much.
    Applause
  • 48:22 - 48:27
    Herald: Thank you very much indeed.
    There's about 10 minutes left for Q and A,
  • 48:27 - 48:32
    so if you have any questions go to a
    microphone. And as a cautious reminder,
  • 48:32 - 48:38
    questions are short sentences that
    start with a question... well, end with a
  • 48:38 - 48:43
    question mark and the first question goes
    to the Internet.
  • 48:43 - 48:46
    Internet: Well, hello. Um, do you also
    incorporate radiation as the source for
  • 48:46 - 48:51
    randomness when that's needed?
    Stefan: So we personally don't. So in our
  • 48:51 - 48:57
    designs we don't. But it is done indeed
    for a random number generator. This is
  • 48:57 - 49:01
    sometimes done that they use radioactive
    decay as a source for randomness. So this
  • 49:01 - 49:04
    is done, but we don't do it in our
    experiments.
  • 49:04 - 49:07
    We rather want deterministic data out of
    the things we built.
  • 49:07 - 49:11
    Herald: Okay. Next question goes to
    microphone number four.
  • 49:11 - 49:17
    Mic 4: Do you do your triplication before
    or after elaboration?
  • 49:17 - 49:21
    Szymon: So currently we do it before
    elaboration. So we decided that our tool
  • 49:21 - 49:26
    works on Verilog input and it produces
    Verilog output because it offers much more
  • 49:26 - 49:30
    flexibility in the way how you can
    incorporate different triplication
  • 49:30 - 49:34
    schemes. If you were to apply it only
    after elaboration, then of course doing a
  • 49:34 - 49:38
    full triplication might be easy. But then
    having really precise control
  • 49:38 - 49:43
    over the types of triplication on different
    levels is much more difficult.
  • 49:43 - 49:47
    Herald: Next question from microphone
    number two.
  • 49:47 - 49:51
    Mic 2: Is it possible to use DC-DC
    converters or switch mode power supplies
  • 49:51 - 49:55
    within the radiation environment to power
    your logic? Or do you use only linear power?
  • 49:55 - 50:00
    Szymon: Yes, alternatively we also have a
    dedicated program which develops radiation
  • 50:00 - 50:05
    hardened DC-DC converters which operate
    in our environments. So they are available
  • 50:05 - 50:11
    also for space applications, as far as I'm
    aware. And they are hardened against total
  • 50:11 - 50:16
    ionizing dose as well as single event
    upsets.
  • 50:16 - 50:20
    Herald: Okay next question goes to
    microphone number one.
  • 50:20 - 50:23
    Mic 1: Thank you very much for the great
    talk. I'm just wondering, would it be
  • 50:23 - 50:27
    possible to hook up every logic gate in
    every voter in a kind of mesh network? And
  • 50:27 - 50:32
    what are the pitfalls and limitations for
    that?
  • 50:32 - 50:37
    Stefan: So that is not something I'm aware
    of being done. So typically: No. I
  • 50:37 - 50:41
    wouldn't say that that's something we
    would do.
  • 50:41 - 50:43
    Szymon: I'm not really sure if I
    understood the question.
  • 50:43 - 50:46
    Stefan: So maybe you can rephrase what
    your idea is?
  • 50:46 - 50:53
    Mic 1: On the last slide, there was a
    lesson learned.
  • 50:53 - 50:56
    Stefan: Yes. One of those?
    Mic 1: In here. Yeah. Would you be able to
  • 50:56 - 51:00
    connect everything interchangeably in a
    mesh network?
  • 51:00 - 51:04
    Szymon: So what you are probably asking
    about is whether we can build our own
  • 51:04 - 51:08
    FPGA, like a programmable logic device.
    Mic 1: Probably.
  • 51:08 - 51:11
    Szymon: Yeah. And so this we typically
    don't do, because in our experiments, our
  • 51:11 - 51:16
    power budget is also very limited, so we
    cannot really afford this level of
  • 51:16 - 51:21
    complexity. So of course you can make your
    FPGA design radiation hard, but this is
  • 51:21 - 51:25
    not what we will typically do in our
    experiments.
  • 51:25 - 51:29
    Herald: Next question goes to microphone
    number two.
  • 51:29 - 51:32
    Mic 2: Hi, I would like to ask if the
    orientation of your transistors and your
  • 51:32 - 51:38
    chip is part of your design. So mostly you
    have something like a bounding box around
  • 51:38 - 51:43
    your design and with an attack surface in
    different sizes. So do you use this
  • 51:43 - 51:48
    orientation to minimize the attack surface
    of the radiation on chips, if you know
  • 51:48 - 51:53
    the source of the radiation?
    Szymon: No. So I don't think we'd do that.
  • 51:53 - 51:59
    So, of course, we control the orientation
    of our transistors during the design phase.
  • 51:59 - 52:03
    But usually in our experiment, the
    radiation is really perpendicular to the
  • 52:03 - 52:08
    chip area, which means that if you rotate
    it by 90 degrees, you don't really gain
  • 52:08 - 52:12
    that much. And moreover, our chips,
    usually they are mounted in a bigger
  • 52:12 - 52:17
    system where we don't control how they are
    oriented.
  • 52:17 - 52:24
    Herald: Again, microphone number two.
    Mic 2: Do you take metastability into
  • 52:24 - 52:33
    account when designing voters?
    Szymon: The voter itself is combinatorial.
  • 52:33 - 52:39
    So ... -
    Mic 2: Yeah, but if the state of the rest
  • 52:39 - 52:45
    can change at any time, then the
    voters can have glitches, yeah?
  • 52:45 - 52:51
    Szymon: Correct. So that's why - so to
    avoid this, we don't take it into account
  • 52:51 - 52:55
    during the design phase. But if we use
    that scheme which is just displayed here,
  • 52:55 - 52:59
    we avoid this problem altogether, right?
    Because even if you have metastability in
  • 52:59 - 53:05
    one of the blocks like A, B or C, then it
    will be fixed in the next clock cycle.
  • 53:05 - 53:10
    Because usually our systems operate on
    clocks with low frequencies, hundreds of
  • 53:10 - 53:13
    megahertz, which means that any
    metastability should be resolved by the next
  • 53:13 - 53:15
    clock cycle.
    Mic 2: Thank you.
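The scheme Szymon refers to, triplicated registers whose voted value is fed back on every clock edge, can be illustrated with a small behavioural model. This is only a sketch in Python rather than the speakers' Verilog: the class `TmrRegister` and its methods are hypothetical names, and a single shared vote stands in for the three separate voters a fully triplicated design would use.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote, purely combinatorial."""
    return (a & b) | (b & c) | (a & c)


class TmrRegister:
    """Three copies of one register, refreshed from the voted value.

    Hypothetical model for illustration; not the speakers' tool output.
    """

    def __init__(self, value: int) -> None:
        self.copies = [value, value, value]

    def upset(self, copy: int, bit: int) -> None:
        """Flip one bit in one copy, modelling a single event upset."""
        self.copies[copy] ^= 1 << bit

    def voted(self) -> int:
        """Value seen by downstream logic."""
        return majority(*self.copies)

    def clock(self) -> None:
        """On the clock edge every copy reloads the voted value."""
        v = self.voted()
        self.copies = [v, v, v]


reg = TmrRegister(0b1010)
reg.upset(copy=1, bit=0)                   # one copy is now wrong
disagree_before = len(set(reg.copies)) > 1
out = reg.voted()                          # voter still returns the good value
reg.clock()                                # the corrupted copy is overwritten
disagree_after = len(set(reg.copies)) > 1
```

In this model the corrupted copy survives for at most one clock period, which mirrors the argument in the answer that a disturbance in one of A, B or C is repaired in the next cycle.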
  • 53:15 - 53:19
    Herald: Next question microphone number
    one.
  • 53:19 - 53:23
    Mic 1: How do you handle the register
    duplication that can be performed by a
  • 53:23 - 53:28
    synthesis and place and route? So the tools
    will try to optimize timing sometimes by
  • 53:28 - 53:32
    adding registers. And these registers are
    not triplicated.
  • 53:32 - 53:36
    Stefan: Yes. So what we do is that I mean,
    in a typical, let's say, standard ASIC
  • 53:36 - 53:40
    design flow, this is not what happens. So
    you have to actually instruct the tool to do
  • 53:40 - 53:45
    that, to do retiming and add additional
    registers. But for what we are doing, we
  • 53:45 - 53:48
    have to - let's say not do this
    optimization and instruct the tool to keep
  • 53:48 - 53:53
    all the registers we described in our RTL
    code to keep them until the very end. And
  • 53:53 - 53:57
    we really also constrain them to always
    keep their associated logic triplicated.
  • 53:57 - 54:02
    Herald: The next question is from the
    internet.
  • 54:02 - 54:08
    Internet: Do you have some simple tips for
    improving radiation tolerance?
  • 54:08 - 54:12
    Stefan: Simple tips? Ahhhm...
    Szymon: Put your electronics inside a
  • 54:12 - 54:13
    box.
    Stefan: Yes.
  • 54:13 - 54:17
    some laughter
    There's just no
  • 54:17 - 54:23
    single one size fits all textbook recipe
    for this as it really always comes down to
  • 54:23 - 54:28
    analyzing your environment, really getting
    an awareness first of what rate and what
  • 54:28 - 54:32
    number of events you are looking at, what
    type of particles cause them, and then
  • 54:32 - 54:36
    take the appropriate measures to mitigate
    them. So there is no one size fits all
  • 54:36 - 54:38
    thing, I'd say.
    Herald: Next question goes to microphone
  • 54:38 - 54:42
    number two.
    Mic 2: Hi. Thanks for the talk. How much
  • 54:42 - 54:48
    of your software used for design is
    actually open source? I only know of super
  • 54:48 - 54:54
    expensive chip design software.
    Stefan: You're right, the core of all the
  • 54:54 - 55:01
    implementation tools, like the synthesis
    and place and route stages for the ASICs
  • 55:01 - 55:05
    that we design, is actually commercial
    closed source tools. And if
  • 55:05 - 55:10
    you're asking for the fraction, that's a
    bit hard to answer. I cannot give a
  • 55:10 - 55:15
    statement about the size of the commercial
    closed tools. But with
  • 55:15 - 55:19
    everything we develop, we tried to make it
    available to the widest possible audience
  • 55:19 - 55:22
    and therefore decided to make the
    extensions to this design flow available
  • 55:22 - 55:26
    in public form. And that's why these
    tools that we develop and share among the
  • 55:26 - 55:31
    community of ASIC designers in this
    environment are open source.
  • 55:31 - 55:35
    Herald: Microphone number four.
    Mic 4: Have you ever tried using steered
  • 55:35 - 55:41
    ion beams for more localized radiation
    ingress testing?
  • 55:41 - 55:44
    Stefan: Yes, indeed! And the picture I
    showed actually, uh, I didn't disclaim
  • 55:44 - 55:49
    that, but the facility you saw here is
    actually a facility in Darmstadt in
  • 55:49 - 55:53
    Germany and is actually a micro beam
    facility. So it's a facility that allows
  • 55:53 - 55:58
    steering a heavy ion beam really on a
    single position with less than a
  • 55:58 - 56:02
    micrometer accuracy. So it provides
    probably exactly what you were asking for.
  • 56:02 - 56:06
    But that's not the typical case. That is
    really a special thing. And it's probably
  • 56:06 - 56:09
    also the only facility in Europe that can
    do that.
  • 56:09 - 56:13
    Herald: Microphone number one.
    Mic 1: It was a very good talk. Thank
  • 56:13 - 56:19
    you very much. My question is, did you
    compare what you did to what is done for
  • 56:19 - 56:25
    securing secret chips? You know, when you
    have credit card chips, you can make fault
  • 56:25 - 56:30
    attacks into them so you can make them
    malfunction and extract the cryptographic
  • 56:30 - 56:34
    key for example from the banking card.
    There are techniques here to harden these
  • 56:34 - 56:38
    chips against fault attacks. Those are
    like voluntary faults, while you have
  • 56:38 - 56:43
    random faults due to, like, involuntary
    attacks. You know what? Can you explain if
  • 56:43 - 56:47
    you compared in a way what you did to
    this?
  • 56:47 - 56:51
    Stefan: Um, so no, we didn't explicitly
    compare it, but it is right that the
  • 56:51 - 56:54
    techniques we present can also be used in
    a variety of different contexts. So one
  • 56:54 - 56:59
    thing that's not exactly what you are
    referring to, but relatively on a similar
  • 56:59 - 57:04
    scale is that currently in very small
    technologies you get two problems with the
  • 57:04 - 57:08
    reliability and yield of the manufacturing
    process itself, meaning that sometimes
  • 57:08 - 57:12
    just the metal interconnection between two
    gates in your circuit might be broken
  • 57:12 - 57:16
    after manufacturing, and then adding this
    sort of redundancy with the same kinds of
  • 57:16 - 57:21
    techniques can be used to
    produce more working chips out of a
  • 57:21 - 57:25
    manufacturing run. So in this sort of
    context, these sorts of techniques are
  • 57:25 - 57:31
    used very often these days. But, um,
    I'm pretty sure they can be applied to
  • 57:31 - 57:35
    these sorts of, uh, security fault attack
    scenarios as well.
  • 57:35 - 57:40
    Herald: Next question from microphone
    number two.
  • 57:40 - 57:44
    Mic 2: Hi, you briefly also mentioned the
    mitigation techniques on the cell level
  • 57:44 - 57:52
    and yesterday there was a very nice talk
    from the LibreSilicon people and they
  • 57:52 - 57:56
    are trying to build a standard cell
    library, uh, open source standard cell
  • 57:56 - 58:00
    library. So are you in contact with them
    or maybe you could help them to improve
  • 58:00 - 58:04
    their design and then the radiation
    hardness?
  • 58:04 - 58:07
    Stefan: No. We also saw the talk
    yesterday, but we are not yet in
  • 58:07 - 58:14
    contact with them. No.
    Herald: Does the Internet have questions?
  • 58:14 - 58:21
    Internet: Yes, I do. Um, two in fact.
    First one would be: would TTL or other BJT
  • 58:21 - 58:27
    based logic be more resistant?
    Szymon: Uh, yeah. So depending on which
  • 58:27 - 58:31
    type of errors we are considering. So BJT
    transistors, they have ...
  • 58:31 - 58:36
    Stefan in his part mentioned that
    displacement damage is not a problem for
  • 58:36 - 58:40
    CMOS devices, but that is not the case
    for BJT devices. So when they are exposed
  • 58:40 - 58:47
    to high energy hadrons or protons,
    they degrade a lot. So that's why we don't
  • 58:47 - 58:52
    really use them in our environment. They
    could be probably much more robust to
  • 58:52 - 58:57
    single event effects because their
    resistance everywhere is much lower. But
  • 58:57 - 59:02
    they would have other problems. And also
    another problem which is worth
  • 59:02 - 59:06
    mentioning is that for those devices, they
    consume much, much, much more power, which
  • 59:06 - 59:13
    we cannot afford in our applications.
    Internet: And the last one would be how do
  • 59:13 - 59:19
    I use the output of the full TMR setup? Is
    it still three signals? How do I know
  • 59:19 - 59:26
    which one to use and to trust?
    Stefan: Um, yes. So with this, um,
  • 59:26 - 59:30
    architecture, what you could either do is
    really do the full triplication scheme
  • 59:30 - 59:35
    to your whole logic tree basically and
    really triplicate everything or, and
  • 59:35 - 59:39
    that's going in the direction of one of
    the lessons learned I had, at some point
  • 59:39 - 59:43
    of course you have an interface to your
    chip, so you have pins left and right that
  • 59:43 - 59:47
    are inputs and outputs. And then you have
    to decide either you want to spend the
  • 59:47 - 59:51
    effort and also have three dedicated input
    pins for each of the signals, or you at
  • 59:51 - 59:54
    some point have the voter and say, okay.
    At this point, all these signals are
  • 59:54 - 59:58
    combined. But I was able to reduce the
    amount of sensitive area in my chip
  • 59:58 - 60:04
    significantly and can live with the very
    small remaining sensitive area that just
  • 60:04 - 60:07
    the input and output pins provide.
    Szymon: So maybe I will add one more thing
  • 60:07 - 60:12
    is that typically in our systems, of
    course we triplicate our logic internally,
  • 60:12 - 60:15
    but when we interface with the external
    world, we can apply another protection
  • 60:15 - 60:20
    mechanism. So for example, for our high
    speed serialisers, we will use different types
  • 60:20 - 60:24
    of encoding to add protect...,
    to add like forward error correction
  • 60:24 - 60:30
    codes which would allow us to recover these
    types of faults in the backend later on.
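As a rough illustration of how a forward error correction code lets the receiving side repair a corrupted bit, here is a minimal Hamming(7,4) sketch. The speakers do not say which codes their serialisers use, so the choice of Hamming(7,4) and the function names `encode` and `decode` are assumptions made purely for this example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Layout (positions 1..7): p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
sent = encode(word)
received = list(sent)
received[4] ^= 1                      # single bit flip on the link
recovered = decode(received)
```

The transmitter only adds parity bits; all correction happens at the receiver, which matches the idea of recovering faults in the backend rather than inside the irradiated chip.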
  • 60:30 - 60:37
    Herald: Okay. If ...if we keep this very,
    very short. Last question goes to
  • 60:37 - 60:41
    microphone number two.
    Mic 2: I don't know much about physics. So
  • 60:41 - 60:47
    just the question, how important is the
    physical testing after the chip is
  • 60:47 - 60:52
    manufactured? Isn't the simulation, the
    computer simulation enough if you just
  • 60:52 - 60:56
    shoot particles at it?
    Stefan: Yes and no. So in principle, of
  • 60:56 - 61:01
    course, you are right that you should be
    able to simulate all the effects we look
  • 61:01 - 61:07
    at. The problem is that as the designs
    grow big and they do grow bigger as the
  • 61:07 - 61:11
    technologies shrink, so
    this final netlist that you end up with
  • 61:11 - 61:15
    can have millions or billions of nodes and
    it just is not feasible anymore to
  • 61:15 - 61:20
    simulate it exhaustively because you
    have so many dimensions you have to
  • 61:20 - 61:26
    change when you inject, for example, bit
    flips or transients in your design in any
  • 61:26 - 61:31
    of those nodes for varying time offsets.
    And the state space the circuit
  • 61:31 - 61:35
    can be in is just too huge to capture in a
    full simulation. So it's not possible
  • 61:35 - 61:39
    to exhaustively test it in simulation. And
    so typically you end up with having missed
  • 61:39 - 61:43
    something that you discover only in the
    physical testing afterwards, which you
  • 61:43 - 61:47
    always want to do before you put your, uh,
    your chip into the final experiment or on your
  • 61:47 - 61:51
    satellite and then realise it's not
    working as intended. So it has a big
  • 61:51 - 61:56
    importance as well.
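A back-of-the-envelope calculation shows why exhaustive fault injection in simulation becomes impractical. All numbers below are assumptions chosen for illustration, not figures from the talk: one million netlist nodes, one thousand injection time offsets, two fault types (bit flip and transient), and one second of simulation per run. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def injection_runs(nodes: int, time_offsets: int, fault_types: int) -> int:
    """Runs needed to inject every fault type at every node and offset."""
    return nodes * time_offsets * fault_types


def campaign_years(runs: int, seconds_per_run: float) -> float:
    """Wall-clock years for a campaign run serially."""
    return runs * seconds_per_run / SECONDS_PER_YEAR


# Assumed example values, not taken from the talk.
runs = injection_runs(nodes=1_000_000, time_offsets=1_000, fault_types=2)
years = campaign_years(runs, seconds_per_run=1.0)
```

Under these assumptions the campaign needs two billion runs, roughly 63 years of serial simulation, and that still covers only a single starting state of the circuit. Varying the circuit state multiplies the count further, which is the dimension problem described in the answer.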
    Herald: Okay. Thank you. Time is up. All
  • 61:56 - 61:59
    right. Thank you all very much.
  • 61:59 - 62:05
    applause
  • 62:05 - 62:10
    36c3 postroll music
  • 62:10 - 62:32
    Subtitles created by c3subtitles.de
    in the year 2021. Join, and help us!
Title:
36C3 - How to Design Highly Reliable Digital Electronics
Video Language:
English
Duration:
01:02:30
