36C3 - How to Design Highly Reliable Digital Electronics

0:00 - 0:18

36C3 intro music
0:18 - 0:23

Herald: The next talk will be titled 'How
to Design Highly Reliable Digital
0:23 - 0:26

Electronics', and it will be delivered to
you by Szymon and Stefan. Warm Applause
0:26 - 0:30

for them.
0:30 - 0:36

applause
0:36 - 0:41

Stefan: All right. Good morning, Congress.
So perhaps every one of you in the room
0:41 - 0:46

here has at one point or another in their
lives witnessed their computer behaving
0:46 - 0:50

weirdly and doing things that it was not
supposed to do or what you didn't
0:50 - 0:54

anticipate it to do. And well, typically
that would have probably been the result
0:54 - 1:00

of a software bug of some sort somewhere
inside the huge software stack your PC is
1:00 - 1:05

running on. Have you ever considered what
the probability of this weird behavior
1:05 - 1:09

being caused by a bit flipped somewhere in
the memory of your computer might have
1:09 - 1:16

been? So what you can see in this video on
the screen now is a physics experiment
1:16 - 1:21

called a cloud chamber. It's a very simple
experiment that is actually able to
1:21 - 1:27

visualize and make apparent all the
constant stream of background radiation we
1:27 - 1:33

all are constantly exposed to. So what's
happening here is that highly energetic
1:33 - 1:39

particles, for example, from space they
trace through gaseous alcohol and they
1:39 - 1:42

collide with alcohol molecules and they
form in this process a trail of
1:42 - 1:48

condensation while they do that. And if
you think about your computer, a typical
1:48 - 1:53

cell of RAM, of which you might have, I
don't know, 4, 8, 10 gigabytes in your
1:53 - 1:58

machine is only 80 nanometers
wide. So it's very, very tiny. And you
1:58 - 2:03

probably are able to appreciate the small
amount of energy that is needed or that is
2:03 - 2:08

used to store the information inside each
of those bits. And the sheer amount of
2:08 - 2:13

those bits you have in your RAM and your
computer. So a couple of years ago, there
2:13 - 2:18

was a study that concluded that in a
computer with about four gigabytes of RAM,
2:18 - 2:24

a bit flip, um, caused by such an event by
cosmic background radiation can occur
2:24 - 2:29

about once every 33 hours. So a
bit less than one per day. In an
2:29 - 2:35

incident in 2008, a Qantas Airlines
flight actually nearly crashed, and the
2:35 - 2:40

reason for this crash was traced back to
be very likely caused by a bit flipped
2:40 - 2:44

somewhere in one of the CPUs of the
avionics system and nearly caused the
2:44 - 2:50

death of a lot of passengers on this
plane. In 2003, in Belgium, a small
2:50 - 2:57

municipal vote actually had a weird hiccup
in which one of the candidates in this
2:57 - 3:02

election actually got 4096 more votes added in a single instance.
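The number itself is a hint: 4096 is exactly 2^12, which is what a single bit (bit 12) of a binary counter adds when it flips from 0 to 1. A minimal sketch, where the vote count is a made-up value chosen only for illustration:

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (XOR with a one-hot mask)."""
    return value ^ (1 << bit)

# Hypothetical vote count; its bit 12 is 0 before the upset.
true_count = 514
corrupted = flip_bit(true_count, 12)
difference = corrupted - true_count

print(difference)  # 4096
```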
3:02 - 3:06

And that was traced back to be very likely
caused by cosmic background radiation,
3:06 - 3:10

flipping a memory cell somewhere that
stored the vote count. And it was only
3:10 - 3:15

discovered that this happened because this
number of votes for this particular
3:15 - 3:19

candidate was considered unreasonable, but
otherwise would have gotten away probably
3:19 - 3:27

without being detected. So a few words
about us: Szymon and I, we both work at
3:27 - 3:32

CERN in the microelectronics section and
we both develop electronics that need to
3:32 - 3:37

be tolerant to these sorts of effects. So
we develop radiation tolerant electronics
3:37 - 3:43

for the experiments at CERN, at the LHC.
Among a lot of other applications, you can
3:43 - 3:48

meet the two of us at the Lötlabor Jena
assembly if you are interested in what we
3:48 - 3:56

are talking about today. And we will also
give a small talk or a small workshop
3:56 - 3:59

about radiation detection tomorrow, in one
of the seminar rooms. So feel free to pass
3:59 - 4:03

by there, it will be a quick introduction.
To give you a small idea of what kind of
4:03 - 4:09

environment we are working for: So if you
would use one of your default Intel i7
4:09 - 4:14

CPUs from your notebook and would put it
anywhere where we operate our electronics,
4:14 - 4:20

it would very shortly die in a matter of
probably one or two minutes and it would
4:20 - 4:25

die for more than just one reason, which
is rather interesting and compelling. So
4:25 - 4:31

the idea for today's talk is to give you
all an insight into all the things that
4:31 - 4:35

need to be taken into account when you
design electronics for radiation
4:35 - 4:39

environments. What kinds of different
challenges come when you try to do that.
4:39 - 4:43

We classify and explain the different
types of radiation effects that exist. And
4:43 - 4:48

then we also present what you can do to
mitigate these effects and also validate
4:48 - 4:52

that what you did to care for them or
protect your circuits actually worked. And
4:52 - 4:57

of course, as we do that, we'll try to
give our view on how we develop radiation
4:57 - 5:03

tolerant electronics at CERN and what our
workflow looks like to make sure this
5:03 - 5:08

works. So let's first maybe take a step
back and have a look at what we mean when
5:08 - 5:13

we say radiation environments. The first
one that you probably have in mind right
5:13 - 5:19

now when you think about radiation is
space. So, this interstellar space is
5:19 - 5:24

basically filled with very high speed,
highly energetic electrons and protons and
5:24 - 5:29

all sorts of high energy particles. And
while they, for example, traverse close to
5:29 - 5:35

planets such as our Earth - these planets
sometimes do have a magnetic field and the
5:35 - 5:39

highly energetic particles are actually
deflected by these magnetic fields and
5:39 - 5:44

they can protect the planets, such as our
planet, for example, from this highly
5:44 - 5:48

energetic radiation. But in the process,
there around these planets sometimes they
5:48 - 5:52

form these radiation belts - known as the
Van Allen belts after the guy who
5:52 - 5:56

discovered this effect a long time ago.
And a satellite in space as it orbits
5:56 - 6:02

around the Earth might, depending on what
orbit is chosen, sometimes go through
6:02 - 6:06

these belts of highly intense radiation.
That, of course, then needs to be taken
6:06 - 6:12

into account when designing electronics
for such a satellite. And if Earth itself
6:12 - 6:17

is not able to give you enough radiation,
you may think of the very famous Juno
6:17 - 6:23

Jupiter mission that has become famous
about a year ago. They actually in the
6:23 - 6:28

environment of Jupiter they anticipated so
much radiation that they actually decided
6:28 - 6:33

to put all the electronics of the
satellite inside a one centimeter thick
6:33 - 6:40

cube of titanium, which is famously known
as the Juno Radiation Vault. But not only
6:40 - 6:44

space offers radiation environments.
Another form of radiation you probably all
6:44 - 6:48

recognize this when I show you this
picture, which is an X-ray image of a
6:48 - 6:55

hand. And X-ray is also considered a form
of radiation. And while, of course, the
6:55 - 7:01

doses or amounts of radiation any patient
is exposed to while doing diagnosis or
7:01 - 7:06

treatment of some disease, that might not
be the full story when it comes to medical
7:06 - 7:10

applications. So this is a medical
particle accelerator which is used for
7:10 - 7:15

cancer treatment. And in these sorts of
accelerators, typically carbon ions or
7:15 - 7:20

protons are accelerated and then focused
and used to treat and selectively destroy
7:20 - 7:25

cancer cells in the body. And this comes
already relatively close to the
7:25 - 7:30

environment we are working in and working
for. So Szymon and I are working, for
7:30 - 7:37

example, on electronics, for the CMS
detector inside the LHC, for which we build
7:37 - 7:44

dedicated, radiation tolerant, integrated
circuits which have to withstand very,
7:44 - 7:49

very large amounts and doses of short
lived radiation in order to function
7:49 - 7:54

correctly. And if we didn't specifically
design electronics for that, basically the
7:54 - 8:02

whole system would never be able to work.
To illustrate a bit how you can imagine
8:02 - 8:06

the scale of this environment: This is a
single plot of a collision event that was
8:06 - 8:11

recorded in the ATLAS experiment. And each
of those tiny little traces you can make
8:11 - 8:16

out in this diagram is actually either one
or multiple secondary particles that were
8:16 - 8:22

created in the initial collision of two
proton bunches inside the experiment. And
8:22 - 8:28

each of those, of course, races through
the detector electronics, which make these
8:28 - 8:33

traces visible, itself then decaying into
multiple other secondary particles which
8:33 - 8:38

all go through our electronics. And if
that doesn't sound, let's say, bad enough
8:38 - 8:43

for digital electronics, these collisions
happen about 40 million times a second. Of
8:43 - 8:48

course, multiplying the number of events
or problems they can cause in our
8:48 - 8:55

circuits. So we now want to introduce all
the things that can happen, the different
8:55 - 9:00

radiation effects. But first, probably we
take a step back and look at what we mean
9:00 - 9:06

when we say digital electronics or digital
logic, which we want to focus on today. So
9:06 - 9:11

from your university lectures or your
reading, you probably know the first class
9:11 - 9:15

of digital logic, which is the
combinatorial logic. So this is typically
9:15 - 9:19

logic that just does a simple linear
relation of the inputs of a circuit and
9:19 - 9:24

produces an output as exemplified with
these AND and OR, NAND, XOR gates that you
9:24 - 9:29

see here. But if you want to build - I
mean even though we use those everywhere
9:29 - 9:33

in our circuits - you probably also want
to store state in a more complex circuit,
9:33 - 9:38

for example, in the registers of your CPU
they store some sort of internal
9:38 - 9:42

information. And for that we use the other
class of logic, which is called the
9:42 - 9:45

sequential logic. So this is typically
clocked with some system clock frequency
9:45 - 9:51

and it changes its output with relation to
the inputs whenever this clock signal changes.
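The difference between the two classes can be sketched in a few lines. This is only an illustration: the gate choice, the `DFlipFlop` class and the assumption that the register reacts to the rising clock edge are not from the talk.

```python
def xor_gate(a: int, b: int) -> int:
    """Combinatorial logic: the output depends only on the present inputs."""
    return a ^ b

class DFlipFlop:
    """Sequential logic: the stored value only changes on a clock edge.
    Assumption for this sketch: it samples on the rising edge."""

    def __init__(self) -> None:
        self.q = 0
        self._last_clk = 0

    def tick(self, clk: int, d: int) -> int:
        if clk == 1 and self._last_clk == 0:  # rising edge detected
            self.q = d
        self._last_clk = clk
        return self.q

ff = DFlipFlop()
# (clk, d) pairs: the input d is only captured when clk goes 0 -> 1.
stimulus = [(0, 1), (1, 1), (0, 0), (0, 0), (1, 0)]
trace = [ff.tick(clk, d) for clk, d in stimulus]
print(trace)  # [0, 1, 1, 1, 0]
```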
9:51 - 9:54

And now if we look at how all
these different logic functionalities are
9:54 - 9:58

implemented. So typically nowadays for
that you may know that we use CMOS
9:58 - 10:02

technologies and basically represent all
this logic functionality as digital gates
10:02 - 10:11

using small P-MOS and N-MOS MOSFET
transistors in CMOS technologies. And if
10:11 - 10:16

we kind of try to build a model for more
complex digital circuits, we typically use
10:16 - 10:22

something we call the finite state machine
model, in which we use a model that
10:22 - 10:26

consists of a combinatorial and a
sequential part. And you can see that the
10:26 - 10:31

output of the circuit depends both on the
internal state inside the register as well
10:31 - 10:35

as also the input to the combinatorial
logic. And accordingly, also the state
10:35 - 10:41

that is internal is always changed by the
inputs as well as the current state. So
10:41 - 10:45

this is kind of the simple model for more
complex systems that can be used to model
10:45 - 10:50

different effects. Um, now let's try to
actually look at what the radiation can do
10:50 - 10:54

to transistors. And for that we are going
to have a quick recap at what the
10:54 - 10:58

transistor actually is and what it looks
like. As you may perhaps know, in
10:58 - 11:04

CMOS technologies, transistors are built
on wafers of high purity silicon. So this
11:04 - 11:09

is a crystalline, very regularly organized
lattice of silicon atoms. And what we do
11:09 - 11:14

to form a transistor on such a wafer is
that we add dopants. So in order to form
11:14 - 11:20

diffusion regions, which later will become
the source and drain of our transistors.
11:20 - 11:24

And then on top of that we grow a layer of
insulating oxide. And on top of that we
11:24 - 11:29

put polysilicon, which forms the gate
terminal of the transistor. And in the end
11:29 - 11:33

we end up with an equivalent circuit a bit
like that. And now to put things back into
11:33 - 11:38

perspective - you may also note that the
dimensions of these structures are very
11:38 - 11:43

tiny. So we talk about tens of nanometers
for some of the dimensions I've outlined
11:43 - 11:48

here. And as the technologies shrink,
these become smaller and smaller and
11:48 - 11:52

therefore you'll probably also realize or
are able to appreciate the small amount of
11:52 - 11:57

energy that is used to store information
inside these digital circuits, which makes
11:57 - 12:02

them perhaps more sensitive to radiation.
So let's take a look. What different types
12:02 - 12:08

of radiation effects exist? We typically
in this case, differentiate them into two
12:08 - 12:13

main classes of events. The first one
would be the cumulative effects, which are
12:13 - 12:17

effects that, as the name implies,
accumulate over time. So as the circuit is
12:17 - 12:22

placed inside some radiation environment,
over time it accumulates more and more
12:22 - 12:27

dose and therefore worsens its performance
or changes how it operates. And on the
12:27 - 12:31

other side, we have the Single Event
Effects, which are always events that
12:31 - 12:35

happen at some instantaneous point in
time, and then suddenly, without being
12:35 - 12:39

predictable, change how the circuit
operates or how it functions or if it
12:39 - 12:44

works in the first place or not. So I'm
going to first go into the class of
12:44 - 12:48

cumulative effects and then later on,
Szymon will go into the other class of the
12:48 - 12:53

Single Event Effects. So in terms of these
accumulating effects, we basically have
12:53 - 12:58

two main subclasses: The first one being
ionization or TID effects, for Total
12:58 - 13:02

Ionizing Dose - and the second one being
displacement damages. So displacement
13:02 - 13:07

damages do exactly what they sound like.
It is all the effects that happen when an
13:07 - 13:11

atom in the silicon lattice is actually
displaced, so removed from its lattice
13:11 - 13:15

position and actually changes the
structure of the semiconductor. But
13:15 - 13:20

luckily, these effects don't have a big
impact in the CMOS digital circuits that
13:20 - 13:23

we are looking at today. So we will
disregard them for the moment and we'll be
13:23 - 13:28

looking more at the ionization damage, or
TID. So ionization - as a quick recap - is
13:28 - 13:36

whenever electrons are removed or added to
an atom, effectively transforming it into
13:36 - 13:43

an ion. And these effects are especially
critical for the circuits we are building
13:43 - 13:46

because what they do is that they
change the behavior of the transistors.
13:46 - 13:50

And without looking too much into the
semiconductor details, I just want to show
13:50 - 13:56

their typical effect that we are concerned
about in this very simple circuit here. So
13:56 - 14:00

this is just an inverter circuit
consisting of two transistors here and
14:00 - 14:06

there. And what the circuit does in normal
operation is it just takes an input signal
14:06 - 14:10

and inverts and basically gives the
inverted signal at the output. And as the
14:10 - 14:16

transistors are irradiated and accumulate
dose, you can see that the edges of the
14:16 - 14:20

output signal get slower. So the
transistor takes longer to turn on and off.
14:20 - 14:25

And what that does in turn is that it
limits the maximum operation frequency of
14:25 - 14:29

your circuit. And of course, that is not
something you want to do. You want your
14:29 - 14:32

circuit to operate at some frequency in
your final system. And if the maximum
14:32 - 14:36

frequency it can work at degrades over
time, at some point it will fail as the
14:36 - 14:39

maximum frequency is just too low. So
let's have a look at what we can do to
14:39 - 14:44

mitigate these effects. The first one and
I already mentioned it when talking about
14:44 - 14:48

the Juno mission, is shielding. So if you
can actually put a box around your
14:48 - 14:53

electronics and shield any radiation from
actually hitting your transistors, it is
14:53 - 14:57

obvious that they will last longer and
will suffer less from the radiation damage
14:57 - 15:01

than they otherwise would. So this
approach is very often used in space
15:01 - 15:05

applications like on satellites, but it's
not very useful if you are actually trying
15:05 - 15:08

to measure the radiation with your
circuits as we do, for example, in the
15:08 - 15:12

particle accelerators we build integrated
circuits for. So there first of all, we
15:12 - 15:16

want to measure the radiation so we cannot
shield our detectors from the radiation.
15:16 - 15:21

And also, we don't want to influence the
tracks of these secondary collision
15:21 - 15:24

products with any shielding material that
would be in the way. So this is not very
15:24 - 15:28

useful in a particle accelerator
environment, let's say. So we have to
15:28 - 15:34

resort to different methods. So as I said,
we do have to design our own integrated
15:34 - 15:39

circuits in the first place. So we have
some freedom in what we call transistor
15:39 - 15:45

level design. So we can actually alter the
dimensions of the transistors. We can make
15:45 - 15:50

them larger to withstand larger doses of
radiation and we can use special
15:50 - 15:54

techniques in terms of layout that we can
experimentally verify to be more
15:54 - 15:59

resistant to radiation effects. And a
third measure, which is probably the most
15:59 - 16:03

important one for us, is what we call
modeling. So we actually are able to
16:03 - 16:08

characterize all the effects that
radiation will have on a transistor. And
16:08 - 16:12

if we can do that, if we know: 'If I
put it into a radiation environment for a
16:12 - 16:17

year, how much slower will it become?'
Then it is of course easy to say: 'OK, I
16:17 - 16:21

can just over-design my circuit and make
it a bit more simple, maybe have less functionality,
16:21 - 16:24

but be able to operate at a
higher frequency and therefore withstand
16:24 - 16:30

the radiation effects for a longer time
while still working sufficiently well at
16:30 - 16:35

the end of its expected lifetime.' So
that's more or less what we can do about
16:35 - 16:38

these effects. And I'll hand over to
Szymon for the second class.
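The over-design idea from the modeling step can be put into numbers. This is a rough sketch under a simplifying assumption that is not from the talk: the maximum frequency is taken to be inversely proportional to gate delay. The target frequency and the 25 % slowdown are made-up example values.

```python
def required_initial_fmax(f_target_hz: float, delay_increase: float) -> float:
    """Initial maximum frequency needed so that the circuit still reaches
    f_target_hz after its gate delays have grown by delay_increase
    (0.25 means 25 % slower). Assumes fmax is proportional to 1 / delay."""
    return f_target_hz * (1.0 + delay_increase)

def fmax_after_dose(fmax_initial_hz: float, delay_increase: float) -> float:
    """Maximum frequency left once the delays have degraded."""
    return fmax_initial_hz / (1.0 + delay_increase)

# Hypothetical numbers: 40 MHz system clock, 25 % slowdown at end of life.
f_target = 40e6
slowdown = 0.25
f_design = required_initial_fmax(f_target, slowdown)
f_end_of_life = fmax_after_dose(f_design, slowdown)

print(f"design for {f_design / 1e6:.1f} MHz")
print(f"end of life: {f_end_of_life / 1e6:.1f} MHz")
```

In practice the slowdown figure would come from the characterization measurements described above, not from a guess.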
16:38 - 16:43

Szymon: Contrary to the cumulative effects
presented by Stefan, the other group are
16:43 - 16:46

Single Event Effects which are caused by
high energy deposits, which are caused by
16:46 - 16:52

a single particle or shower of particles.
And they can happen at any time, even
16:52 - 16:57

seconds after irradiation is started. It
means that if your circuit is vulnerable
16:57 - 17:02

to this class of effects, it can fail
immediately after radiation is present.
17:02 - 17:06

And here we also classify these effects
into several groups. The first are hard,
17:06 - 17:11

or permanent, errors, which as the name
indicates can permanently destroy your
17:11 - 17:20

circuit. And these types of errors are
typically critical for power devices where
17:20 - 17:24

you have large power densities and they
are not so much of a problem for digital
17:24 - 17:30

circuits. The other class of effects
are soft errors. And here we distinguish
17:30 - 17:34

transient, or Single Event Transient
errors, which are spurious signals
17:34 - 17:41

propagating in your circuit as a result of
a gate being hit by a particle and they
17:41 - 17:46

are especially problematic for analog
circuits or asynchronous digital circuits,
17:46 - 17:51

but under some circumstances they can be
also problematic for synchronous systems.
17:51 - 17:56

And the other class of problems are
static, or Single Event Upset problems,
17:56 - 18:01

which basically means that your memory
element like a register gets flipped. And
18:01 - 18:05

then of course, if your system is not
designed to handle this type of errors
18:05 - 18:10

properly, it can lead to a failure. So in
the following part of the presentation
18:10 - 18:15

we'll focus mostly on soft errors. So
let's try to understand what is the origin
18:15 - 18:21

of this type of problem. So as Stefan
mentioned, the typical transistor is built
18:21 - 18:25

out of diffusions, gate and channel. So
here you can see one diffusion. Let's
18:25 - 18:29

assume that it is a drain diffusion. And
then when a particle goes through and
18:29 - 18:37

deposits charge, it creates free electron and
hole pairs, which then in the presence of
18:37 - 18:43

electric fields, they get collected by
means of drift, which results in a large
18:43 - 18:47

current spike, which is very short. And
then the rest of the charge could be
18:47 - 18:51

collected by diffusion which is a much
slower process and therefore also the
18:51 - 18:56

amplitude of the event is much, much
smaller. So let's try to understand what
18:56 - 19:01

could happen in a typical memory cell. So
on this schematic, you can see the
19:01 - 19:06

simplest memory cell, which is composed of
two back-to-back inverters. And let's
19:06 - 19:13

assume that node A is at high and node B
is at low potential initially. And then we
19:13 - 19:17

have a particle hitting the drain of
transistor M1 which creates a short
19:17 - 19:23

circuit current between drain and ground,
bringing the drain of transistor M1 to low
19:23 - 19:30

potential, which also acts on the gates of
the second inverter, temporarily changing its
19:30 - 19:39

state from low to high, which reinforces
the wrong state in the first inverter. And
19:39 - 19:45

at this time the error is locked in your
memory cell and you basically lost your
19:45 - 19:50

information. So you may be asking
yourself: 'How much charge is needed
19:50 - 19:54

really to flip a state of a memory cell?'.
And you can get this number from either
19:54 - 20:00

simulations or from measurements. So let's
assume that what we could do, we could try
20:00 - 20:05

to inject some current into the sensitive
node, for example, drain of transistor M1.
20:05 - 20:09

And here what I will show is that on the
top plot you will have current as a function
20:09 - 20:13

of time. On the second plot you will have
output voltage. So voltage at node B as a
20:13 - 20:19

function of time and at the lowest plot you
will see a probability of having a bit
20:19 - 20:23

flip. So if you inject very little
current, of course nothing changes at the
20:23 - 20:28

output, but once you start increasing the
amount of current you are injecting, you
20:28 - 20:33

see that something appears at the output
and at some point the output will toggle,
20:33 - 20:40

so it will switch to the other state. And
at this point, if you really calculate
20:40 - 20:46

what is the area under the current curve
you can find what is the critical charge
20:46 - 20:53

needed to flip the memory cell. And if you
go further, if you start injecting even
20:53 - 21:01

more current, you will not see that much
difference in the output voltage waveform.
21:01 - 21:05

It could become only slightly faster. And
at this point, you also can notice that
21:05 - 21:10

the probability now jumped to one, which
means that any time you inject so much
21:10 - 21:17

current there is a fault in your circuit.
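The "area under the current curve" is simply the time integral of the injected current. A minimal sketch, assuming a double-exponential pulse shape (a common way to model such current spikes, but not something stated in the talk) and made-up amplitude and time constants:

```python
import math

def double_exp_current(t: float, i0: float, tau_rise: float, tau_fall: float) -> float:
    """Assumed pulse shape: fast rise, slower decay. i0 is a scale factor in amperes."""
    return i0 * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))

def injected_charge(i0: float, tau_rise: float, tau_fall: float,
                    t_end: float, steps: int = 20_000) -> float:
    """Integrate the current over [0, t_end] with the trapezoidal rule."""
    dt = t_end / steps
    total = 0.0
    prev = double_exp_current(0.0, i0, tau_rise, tau_fall)
    for k in range(1, steps + 1):
        cur = double_exp_current(k * dt, i0, tau_rise, tau_fall)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total

# Hypothetical pulse: 1 mA scale, 5 ps rise, 50 ps fall, integrated over 2 ns.
i0, tau_rise, tau_fall = 1e-3, 5e-12, 50e-12
q_numeric = injected_charge(i0, tau_rise, tau_fall, t_end=2e-9)
q_analytic = i0 * (tau_fall - tau_rise)  # closed form of the same integral

print(f"{q_numeric * 1e15:.2f} fC (numeric), {q_analytic * 1e15:.2f} fC (analytic)")
```

The critical charge is then the charge of the smallest pulse for which the simulated or measured cell flips.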
So for now, we just found what is the
21:17 - 21:23

probability of having a bit-flip from 0 to
1 in node B. Of course we should also
21:23 - 21:28

calculate the same for the other
direction, so from 1 to zero. And usually
21:28 - 21:32

it is slightly different. And then of
course we should inject in all the other
21:32 - 21:38

nodes, for example node B and also should
study all possible transitions. And then
21:38 - 21:43

at the end, if you calculate the
superposition of these effects and you
21:43 - 21:49

multiply them by the active area of each
node, you will end up with what we call
21:49 - 21:52

the cross section, which has a dimension
of centimeters squared, which will tell
21:52 - 21:57

you how sensitive your circuit is to this
type of effects. And then knowing the
21:57 - 22:04

radiation profile of your environment, you
can calculate the expected upset rate in
22:04 - 22:10

the final application. So now, having
covered the basics of the single event
22:10 - 22:17

effects, let's try to check how we can
mitigate them. And here also technology
22:17 - 22:21

plays a significant role. So of course,
newer technologies offer us much smaller
22:21 - 22:27

devices. And together with that, what
follows is that usually supply voltages
22:27 - 22:31

are getting smaller and smaller as well as
the node capacitance, which means that for
22:31 - 22:36

our Single Event Upsets it is very bad
because the critical charge which is
22:36 - 22:40

required to flip our bit is getting less
and less. But at the end, at the same
22:40 - 22:44

time, physical dimensions of our
transistors are getting smaller, which
22:44 - 22:48

means that the cross section for them
being hit is also getting smaller. So
22:48 - 22:52

overall, the effects really depend on the
circuit topology and the radiation
22:52 - 22:59

environment. So another protection method
could be introduced on the cell level. And
22:59 - 23:05

here we could imagine increasing the
critical charge. And that could be done in
23:05 - 23:11

the easiest way by just increasing the
node capacitance by, for example, putting
23:11 - 23:16

larger transistors. But of course, this
also increases the collection electrode,
23:16 - 23:23

which is not nice. And another way could
be just increase the capacitance by adding
23:23 - 23:28

some extra metal capacitance, but it, of
course, slows down the circuit. Another
23:28 - 23:34

approach could be to try to store the
information on more than two nodes. So I
23:34 - 23:38

showed you that on a simple SRAM cell we
store information only on two nodes, so
23:38 - 23:43

you could try to come up with some other
cells, for example, like that one in which
23:43 - 23:47

the information is stored on four nodes.
So you can see that the architecture is
23:47 - 23:54

very similar to the basic SRAM cell. But
you should be careful always to very
23:54 - 23:59

carefully simulate your design, because if
we analyze this circuit, you will quickly
23:59 - 24:03

realize that this circuit, even though the
information is stored in four different
24:03 - 24:10

nodes, the same type of loop exists as in
the basic circuit. Meaning that at the end
24:10 - 24:15

the circuit offers basically no hardening
with respect to the previous cell. So
24:15 - 24:21

actually we can do it better. So here you
can see a typical dual interlocked cell.
24:21 - 24:26

So the amount of transistors is exactly
the same as in the previous example, but
24:26 - 24:31

now they are interconnected slightly
differently. And here you can see that
24:31 - 24:36

this cell has also two stable configurations.
But this time data can propagate, the low
24:36 - 24:41

level from a given node can propagate
only to the left hand side, while high
24:41 - 24:48

level can propagate to the right hand
side. And each stage being inverting means
24:48 - 24:55

that the fault can not propagate for more
than one node. Of course, this cell has
24:55 - 25:00

some drawbacks: It consumes more area than
a simple SRAM cell and also write access
25:00 - 25:04

requires accessing at least two nodes at
the same time to really change the state
25:04 - 25:10

of the cell. And so you may ask yourself,
how effective is this cell? So here I will
25:10 - 25:14

show you a cross section plot. So it is
the probability of having an error as a
25:14 - 25:19

function of injected energy. And as a
reference, you can see a pink curve on the
25:19 - 25:26

top, which is for a normal, not protected
cell. And on the green you can see the
25:26 - 25:31

cross section for the error in the DICE
cell. So as you can see, it is one order
25:31 - 25:37

of magnitude better than the normal cell.
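To relate cross sections like these to the expected upset rate mentioned earlier, a common first-order estimate is rate = cross section x particle flux x number of cells. The sketch below uses this estimate; every number in it is invented for illustration and none comes from the plot.

```python
def upset_rate_per_s(cross_section_cm2: float, flux_per_cm2_s: float, n_cells: int) -> float:
    """First-order estimate: upsets per second for n_cells identical cells."""
    return cross_section_cm2 * flux_per_cm2_s * n_cells

# Hypothetical values: per-cell cross sections, particle flux, number of cells.
flux = 1e5                  # particles / (cm^2 * s)
n_cells = 1_000_000
sigma_plain = 1e-14         # cm^2, unprotected cell
sigma_hardened = 1e-15      # cm^2, one order of magnitude smaller

rate_plain = upset_rate_per_s(sigma_plain, flux, n_cells)
rate_hardened = upset_rate_per_s(sigma_hardened, flux, n_cells)

print(f"plain:    one upset every {1 / rate_plain:.0f} s on average")
print(f"hardened: one upset every {1 / rate_hardened:.0f} s on average")
```

With these made-up inputs an order of magnitude in cross section is an order of magnitude in upset rate, which is why a cross section that is merely "one order better" can still be far from negligible.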
But still, the cross section is far from
25:37 - 25:41

being negligible. So, the problem was
identified: So it was identified that the
25:41 - 25:46

problem was caused by the fact that some
sensitive nodes were very close together
25:46 - 25:51

on the layout and therefore they could be
upset by the same particle. Because as we
25:51 - 25:55

mentioned, single devices are very
small. We are talking about dimensions
25:55 - 26:00

below a micron. So after realizing that,
we designed another cell in which we
26:00 - 26:05

separated more sensitive nodes and we
ended up with the blue curve, and as you
26:05 - 26:09

can see the cross section was reduced by
two more orders of magnitude and the
26:09 - 26:14

threshold was increased significantly. So
if you don't want to redesign your
26:14 - 26:19

standard cells, you could also apply some
mitigation techniques on block level. So
26:19 - 26:25

here we can use some encoding to encode
our state better. And as an example, I
26:25 - 26:32

will show you a typical Hamming code. So
to protect four bits, we have to add three
26:32 - 26:38

additional parity bits which are calculated
according to this formula. And then once
26:38 - 26:44

you calculate the parity bits, you can use
those to check the state integrity of your
26:44 - 26:50

internal state. And if any of the parity
bits is not equal to zero, then the bits
26:50 - 26:55

instantaneously become syndromes,
indicating where the error happened. And
26:55 - 27:00

you can use this information to correct
the error. Of course, in this case, the
27:00 - 27:07

efficiency is not really nice because we
need three additional bits to protect only
27:07 - 27:12

four bits of information. But as the state
length increases the protection also is
27:12 - 27:19

more efficient. Another approach would be
to do even less. Meaning that instead of
27:19 - 27:24

changing anything you need in your design,
you can just triplicate your design or
27:24 - 27:30

multiply it many times and just vote,
which state is correct? So this concept is
27:30 - 27:35

called triple modular redundancy and it is
based around a voter cell. So it is a
27:35 - 27:40

cell which has an odd number of
inputs and its output is always equal to
27:40 - 27:45

the majority of its inputs. And as I mentioned
that the idea is that you have, for
27:45 - 27:49

example, three circuits: A, B and C, and
during normal operation, when they are
27:49 - 27:54

identical, the output is also the same.
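A three-input voter of this kind can be sketched as a bitwise majority function. The example is illustrative only: the word width and the injected fault are arbitrary choices, not values from the talk.

```python
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit equals the value held by at least two inputs."""
    return (a & b) | (b & c) | (a & c)

good = 0b1011
faulty_b = good ^ 0b0100          # copy B with one bit upset

voted_normal = majority3(good, good, good)
voted_with_fault = majority3(good, faulty_b, good)

print(bin(voted_normal), bin(voted_with_fault))  # 0b1011 0b1011
```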
However, when there is a problem, for
27:54 - 28:01

example, in logic part B, its output
is affected. So this problem is
28:01 - 28:06

effectively masked by the voter cell
and it is not visible from outside of the
28:06 - 28:10

circuit. But you have to be careful not to
take this picture as a design
28:10 - 28:16

template. So let's try to analyze what
would happen with a state machine
28:16 - 28:20

similar to what Stefan introduced, if you
were to just use this concept. So here you
28:20 - 28:25

can see three state machines and
a voter on the output. And as we can see,
28:25 - 28:29

if you have an upset in, for example, the
state register A, then the state is
28:29 - 28:37

broken. But still the output of the
circuit, which is indicated by letter s is
28:37 - 28:42

correct because the B and C registers are
still fine. But what happens if some time
28:42 - 28:49

later we have an upset in memory element B
or C? Then of course the state
28:49 - 28:56

of our system is broken and we can not
recover it. So you can ask yourself what
28:56 - 29:02

can we do better in order to avoid this
situation? So, just to be sure: please
29:02 - 29:07

do not use this technique to protect your
circuits. So the easiest mitigation could
29:07 - 29:13

be to use, as an input to your logic,
the output of the voter cell itself.
29:13 - 29:18

What it offers us is that now whenever you
have an upset in one of the memory
29:18 - 29:23

elements for the next computation, for the
next stage, we always use the voter
29:23 - 29:28

output, which ensures that the error
will be removed one clock cycle later. So
29:28 - 29:33

if you have another hit sometime later,
basically, it will not affect our state.
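
The difference between voting only at the output and feeding the voter output back into the logic can be sketched with a small behavioural model. This is a minimal sketch, not the speakers' circuit: the 4-bit counter used as next-state logic, the function names and the chosen upset times are all assumptions made for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote of three equal-width words."""
    return (a & b) | (a & c) | (b & c)


def next_state(state: int) -> int:
    """Toy next-state logic: a 4-bit counter (assumed for illustration)."""
    return (state + 1) & 0xF


def run(feedback_from_voter, upsets, cycles):
    """Simulate three register copies and return the voted output per cycle.

    upsets maps a cycle number to (register index, bit mask to flip).
    """
    regs = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        if cycle in upsets:
            idx, mask = upsets[cycle]
            regs[idx] ^= mask  # single event upset in one register copy
        voted = majority(*regs)
        outputs.append(voted)
        if feedback_from_voter:
            # every copy restarts from the voted, corrected state
            regs = [next_state(voted)] * 3
        else:
            # every copy only sees its own, possibly corrupted, state
            regs = [next_state(r) for r in regs]
    return outputs


# Two upsets in different copies, some cycles apart, hitting the same bit.
upsets = {2: (0, 0b0100), 5: (1, 0b0100)}
print(run(False, upsets, 8))  # output-only voting: wrong after the second upset
print(run(True, upsets, 8))   # voter feedback: the first upset is gone one cycle later
```

In this model the first upset is masked in both variants, but without feedback the corrupted copy never recovers, so the second upset leaves two copies wrong on the same bit and the vote fails.
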
29:33 - 29:40

Until now we considered only upsets in our
registers, but what happens if we have
29:40 - 29:46

a transient in our voter? So you see that
if there is no state change, basically the
29:46 - 29:51

transient in the voter doesn't impact
our system. But if you are really unlucky
29:51 - 29:56

and the transient happens when the clock
transition happens, so whenever we
29:56 - 30:01

latch the data, we can corrupt the state
in three registers at the same time, which
30:01 - 30:06

is less than ideal. So to overcome this
limitation, you can consider skewing our
30:06 - 30:11

clocks by some time, which is larger than
the maximum transient duration. And now,
30:11 - 30:18

because each register samples the
output of the voter at a slightly different
30:18 - 30:23

time, we can corrupt only one flip-flop
at a time. So of course, if you are
30:23 - 30:29

unlucky, we can have problematic
situations in which one register is
30:29 - 30:34

already in the new state, while the other register
is still in the old state. And then it
30:34 - 30:40

can lead to a non-deterministic result. So it
is better, but still not ideal. So as a
30:40 - 30:47

general theme, you have seen that we were
adding and adding more resources so you
30:47 - 30:50

can ask yourself what would happen if we
triplicate everything. So in this case,
30:50 - 30:54

we triplicated the registers, we
triplicated our logic and our voters. And
30:54 - 30:59

now you can see that whenever we have an
upset in our register, it can only affect
30:59 - 31:05

one register at a time, and the error
will be removed from the system one clock
31:05 - 31:09

cycle later. Also, if we have an upset
in the voter or in the logic, it can be
31:09 - 31:13

latched into only one register, which means
that in principle we created a system
31:13 - 31:18

which is really robust. Unfortunately,
nothing is for free. So here I compare
31:18 - 31:23

different triplication variants, and
as you can see, the more protection
31:23 - 31:26

you want to have, the more you have to pay
in terms of resources, meaning power and
31:26 - 31:31

area. And also, usually, you pay a small
penalty in terms of maximum operational
31:31 - 31:38

speed. So which flavor of protection you
use depends really on the
31:38 - 31:42

application. So for the most sensitive
circuits, you probably want to use
31:42 - 31:48

full TMR and you may leave some other
bits of logic unprotected. Alternatively, if
31:48 - 31:55

your system is not mission critical and
you can tolerate some downtime, you can
31:55 - 32:00

consider scrubbing, which means periodically
checking the state of your system and refreshing it
32:00 - 32:05

if an error is detected, using
some parity bits or a copy of the data in
32:05 - 32:10

a safe space. Or you can have a
watchdog which will find out that
32:10 - 32:14

something went wrong and it will just
reinitialize the whole system. So now,
32:14 - 32:20

having covered the basics of all the effects
we will have to face, we would like
32:20 - 32:24

to show you the basic flow which we follow
when designing our radiation-hardened
32:24 - 32:30

circuits. So of course we always start
with specifications. So we try to
32:30 - 32:34

understand our radiation environment in
which the circuit is meant to operate. So
32:34 - 32:39

we come up with some specifications for
total dose which could be accumulated and
32:39 - 32:45

for the rate of single event upsets. And
at this moment, it is also not very rare
32:45 - 32:50

that we have to decide to move some
functionality out of our detector volume,
32:50 - 32:56

outside, where we can use off-the-shelf
commercial equipment to do number
32:56 - 33:05

crunching. But let's assume that we would
go with our ASIC. So having the
33:05 - 33:09

specifications, of course we proceed with
functional implementation. This we
33:09 - 33:14

typically do with hardware description
languages, so Verilog or VHDL, which you may
33:14 - 33:19

know from a typical FPGA flow. And of course
we write a lot of simulations to
33:19 - 33:24

understand whether we are meeting our
functional goals or whether our circuit
33:24 - 33:31

behaves as expected. And then we
select some parts of the
33:31 - 33:36

circuits which we want to protect from
radiation effects. So, for example, we can
33:36 - 33:42

decide to use triplication or some other
methods. So these days we typically use
33:42 - 33:47

triplication as the most straightforward
and very effective method. So you can ask
33:47 - 33:51

yourself how do we triplicate the logic?
So the simplest could be: Just copy
33:51 - 33:55

and paste the code three times, add some
postfixes like A, B and C and you are
33:55 - 34:02

done. But of course this solution has some
drawbacks. So it is time consuming and it
34:02 - 34:06

is very error prone. So maybe you have
noticed that I had a typo there. So of
34:06 - 34:10

course we don't want to do that. So we
developed our own tool, which we called
34:10 - 34:17

TMRG, which automates the process of
triplication and eliminates the two main
34:17 - 34:22

drawbacks, which I just described. So
after we have our code triplicated and of
34:22 - 34:27

course, not before rerunning all the
simulations to make sure that everything
34:27 - 34:34

went as expected, we then proceed to the
synthesis process in which we convert our
34:34 - 34:41

high-level hardware description language code
to gate-level netlists, in which all the functions
34:41 - 34:46

are mapped to gates, which were introduced
by Stefan, so both combinatorial and
34:46 - 34:54

sequential. And here we also have to be
careful because modern CAD tools have a
34:54 - 34:59

tendency, of course, to optimise the logic
as much as possible. And our logic in most
34:59 - 35:04

of the cases is really redundant. So it is very easy
for the tool to conclude that it should be removed. So we
35:04 - 35:09

really have to make sure that it is not
removed. That's why our tool also provides
35:09 - 35:14

some constraints for the synthesizer to
make sure that our design intent is
35:14 - 35:21

clearly understood by the tool.
And once we have the output netlist, we
35:21 - 35:27

proceed to the place and route process, where
this kind of netlist representation is
35:27 - 35:33

mapped to a layout of what will soon
become our digital chip, where we place all
35:33 - 35:37

the cells and we route connections between
them and here there is
35:37 - 35:41

another danger which I mentioned already,
it's that in modern technologies the cells
35:41 - 35:46

are so small that several of them could easily be
affected by a single particle at the same
35:46 - 35:52

time. So we have to really space out
the bit cells which are responsible for
35:52 - 35:57

keeping the information about the state to
make sure that a single particle cannot
35:57 - 36:05

upset, for example, the A and B copies
of the same register. And then in the
36:05 - 36:10

last step, of course, we have to verify
that everything we have done is
36:10 - 36:14

correct. And at this level, we also try to
introduce some single event effects in our
36:14 - 36:20

simulations. So we could randomly flip
bits in our system. We can also inject
36:20 - 36:26

transients. And typically we used to do
that on the netlist level, which works
36:26 - 36:31

very well. And it is very nice. But the
problem with this approach is that we can
36:31 - 36:38

perform these actions very late in the
design cycle, which is less than ideal.
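
The idea of randomly flipping bits during simulation and comparing against a fault-free reference can be sketched in a few lines. This is only a toy, pure-Python model under stated assumptions: it is not the netlist-level flow described in the talk, and the 4-bit counter, the function names and the one-upset-per-run policy are chosen purely for illustration.

```python
import random


def majority(a, b, c):
    """Bitwise majority vote of three equal-width words."""
    return (a & b) | (a & c) | (b & c)


def simulate(protected, cycles, flip_cycle=-1, flip_copy=0, flip_bit=0):
    """Run a 4-bit counter; optionally flip one bit of one register copy once.

    flip_cycle = -1 means no fault is injected (the golden run).
    """
    copies = 3 if protected else 1
    regs = [0] * copies
    trace = []
    for cycle in range(cycles):
        if cycle == flip_cycle:
            regs[flip_copy % copies] ^= 1 << flip_bit  # injected upset
        out = majority(*regs) if protected else regs[0]
        trace.append(out)
        # the (voted) state feeds the next-state logic of every copy
        regs = [(out + 1) & 0xF] * copies
    return trace


def campaign(protected, runs, cycles=32, seed=1):
    """Inject one random upset per run; count runs that differ from the golden trace."""
    rng = random.Random(seed)
    golden = simulate(protected, cycles)
    failures = 0
    for _ in range(runs):
        trace = simulate(
            protected,
            cycles,
            flip_cycle=rng.randrange(cycles),
            flip_copy=rng.randrange(3),
            flip_bit=rng.randrange(4),
        )
        if trace != golden:
            failures += 1
    return failures


print("unprotected failures:", campaign(False, 200))
print("TMR failures:        ", campaign(True, 200))
```

In this model every injected upset corrupts the unprotected counter, while a single upset per run is always masked in the triplicated one. A real campaign would also inject transients and multiple upsets, which this sketch leaves out.
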
36:38 - 36:43

And also, if we find that there is a
problem in our simulation, a typical netlist
36:43 - 36:48

at this level has probably a few orders of
magnitude more lines than our initial RTL
36:48 - 36:53

code. So to trace back what is the
problematic line of code is not so
36:53 - 36:58

straightforward at this point. So you can
ask yourself: why not try to inject
36:58 - 37:05

errors in the RTL design? And the answer
is that it is not so
37:05 - 37:11

trivial to map the hardware description
language's high level constructs to
37:11 - 37:16

what will become combinatorial or
sequential logic. So in order to eliminate
37:16 - 37:21

this problem, we also developed another open
source tool, which allows us to...
37:21 - 37:28

So we decided to use Yosys, the open
source synthesis tool from Clifford, which
37:28 - 37:32

was presented at the Congress several
years ago. So we use this tool to make a
37:32 - 37:36

first pass through our RTL code to
understand which elements will be mapped
37:36 - 37:41

to sequential and combinatorial logic. And then,
having this information, we use
37:41 - 37:46

cocotb, a Python verification
framework, which gives us programmatic
37:46 - 37:52

access to these nodes and we can
effectively inject the errors in our
37:52 - 37:57

simulations. And I forgot to mention that
the TMRG tool is also open source. So if
37:57 - 38:04

you are interested in one of the tools,
please feel free to contact us. And of
38:04 - 38:11

course, after our simulation is done, then in
the next step we would really tape out. And
38:11 - 38:15

so we submit our chip to manufacturing and
hopefully a few months later we receive
38:15 - 38:18

our chip back.
Stefan: All right. So after patiently
38:18 - 38:24

waiting then for a couple of months while
your chip is in manufacturing and you're
38:24 - 38:28

spending time on preparing a test setup
and preparing yourself to actually test if
38:28 - 38:34

your chip works as you expect it to, now
it's probably also a good time to think
38:34 - 38:38

about how to actually validate or test if
all the measures that you've taken to
38:38 - 38:41

protect your circuit from radiation
effects actually are effective or if they
38:41 - 38:46

are not. And so again, we will split this
in two parts. So you will probably want to
38:46 - 38:50

start with testing for the total ionizing
dose effects. So for the cumulative effect
38:50 - 38:55

and for that, you typically use X-ray
radiation relatively similar to the one
38:55 - 38:59

used in medical treatment. So this
radiation has relatively low energy,
38:59 - 39:03

which has the upside of not producing any
single event effects, so you can really
39:03 - 39:07

only accumulate radiation dose and focus
on the cumulative effects. And typically
39:07 - 39:12

you would use a machine that looks
somewhat like this, a relatively compact
39:12 - 39:17

thing you can have in your laboratory, and
you can use that to really accumulate
39:17 - 39:22

large amounts of radiation dose on your
circuit. And then you need some sort of
39:22 - 39:27

mechanism to verify or to quantify how
much your circuit slows down due to this
39:27 - 39:31

radiation dose. And if you do that, you
typically end up with a graph such as
39:31 - 39:37

this one, where on the x axis you have the
radiation dose your circuit was exposed
39:37 - 39:41

to. And on the y axis, you see that the
frequency has gone down over time and you
39:41 - 39:45

can use this information to say:
"OK, in my final application, I expect this
39:45 - 39:49

level of radiation dose, and I can
still see that my circuit will work fine
39:49 - 39:54

under some given environmental condition
or some operating condition." So this is
39:54 - 39:58

the test for the first class of effects.
And the test for the second class of
39:58 - 40:02

effects, the single event effects, is a
bit more involved. So there what you would
40:02 - 40:07

typically start to do is go for a heavy
ion test campaign. So you would go to a
40:07 - 40:13

specialized, relatively rare facility. We
have a couple of those in Europe, and it would
40:13 - 40:17

look perhaps somewhat like this. So it's a
small particle accelerator somewhere.
40:17 - 40:21

They typically have
different types of heavy ions at their
40:21 - 40:26

disposal that they can accelerate and then
shoot at your chip that you can place in a
40:26 - 40:32

vacuum chamber and these ions can deposit
very well known amounts of energy in your
40:32 - 40:37

circuit and you can use that information
to characterize your circuit. The downside
40:37 - 40:41

is a bit that these facilities tend to be
relatively expensive to access and also a
40:41 - 40:45

bit hard to access. So typically you need
to book them a lot of time in advance and
40:45 - 40:50

that's sometimes not very easy. But what
it offers you is that you can use different types
40:50 - 40:55

of ions with different energies. You can
really make a very well-defined
40:55 - 41:00

sensitivity curve similar to the one that
Szymon has described, which you can get from
41:00 - 41:04

simulations and really characterize your
circuit for how often any single event
41:04 - 41:09

effects will appear in the final
application, if there are any remaining
41:09 - 41:13

effects left, if you have left something
unprotected. The problem here is that
41:13 - 41:18

these particle accelerators typically just
bombard your circuit with like thousands
41:18 - 41:23

of particles per second and they hit
basically the whole area in a random
41:23 - 41:27

fashion. So you don't really have a way of
steering those or measuring the position
41:27 - 41:31

of these particles. So typically you are a
bit in the dark and really have to
41:31 - 41:35

carefully know the behavior of your
circuit and all the quirks it has even
41:35 - 41:39

without the radiation to instantly notice
when something has gone wrong. And
41:39 - 41:44

this is typically not very easy
and you can kind of compare it with having
41:44 - 41:47

some weird crash somewhere in your
software stack and then having to
41:47 - 41:52

first take a look and see what actually
has happened. Typically
41:52 - 41:57

you find something that has not been
properly protected and you see some weird
41:57 - 42:02

effect on your circuit and then you try to
get a better idea of where that problem
42:02 - 42:06

actually is located. And the answer for
these types of problems involving position
42:06 - 42:11

is, of course, always lasers. So we have
two types of laser experiments available
42:11 - 42:16

that can be used to more selectively probe
your circuit for these problems. The first
42:16 - 42:20

one being the single photon absorption
laser. And this one is relatively
42:20 - 42:25

simple in terms of setup. You just use a
single laser beam that shoots straight up
42:25 - 42:30

at your circuit from the back. And while
it does that, it deposits energy all along
42:30 - 42:34

the silicon and also in the diffusions of
your transistors and is therefore also
42:34 - 42:38

able to inject energy there, potentially
upsetting a bit of memory or exposing
42:38 - 42:43

whatever other single event effects you
have. And of course, you can steer this
42:43 - 42:47

beam across the surface of your chip or
whatever circuit you are testing and then
42:47 - 42:51

find the sensitive location. The problem
here is that the amount of energy that is
42:51 - 42:55

deposited is really large due to the fact
that it has to go through the whole
42:55 - 42:59

silicon until it reaches the transistor.
And therefore it's mostly used to find
42:59 - 43:03

these destructive effects that really
break something in your circuit. The more
43:03 - 43:08

clever and somehow beautiful experiment is
the two photon absorption laser experiment
43:08 - 43:13

in which you use two laser beams of a
different wavelength. And these actually
43:13 - 43:18

do not have enough energy to cause any
effect in your silicon. If only one of the
43:18 - 43:22

laser beams is present, but only in the
small location where the two beams
43:22 - 43:27

intersect, the energy is actually large
enough to produce the effect. And this
43:27 - 43:31

allows you to very selectively and only on
a very small volume induce charge and
43:31 - 43:38

cause an effect in your circuit. And when
you do that now, you can systematically
43:38 - 43:42

scan both the X and Y directions across
your chip and also the Z direction and can
43:42 - 43:46

really measure the volume of sensitive
area. And this is what you would typically
43:46 - 43:51

get out of such an experiment. So in black and
white in the back, you'll see an infrared
43:51 - 43:55

image of your chip where you can really
make out the individual, say structural
43:55 - 43:59

components. And then overlaid in blue, you
can basically highlight all the sensitive
43:59 - 44:04

points that made you measure something you
didn't expect, some weird bit flip in a
44:04 - 44:08

register or something. And you can really
then go to your layout software and find
44:08 - 44:14

what is the register or the gate in
your netlist that is responsible for
44:14 - 44:17

this. And then it's more like operating a
debugger in a software environment.
44:17 - 44:23

Tracing back from there what the line of
code responsible for this bug is. And
44:23 - 44:31

to close out, it is always best to learn
from mistakes. And we offer our mistakes
44:31 - 44:36

as a guideline in case you ever feel
the need to design radiation
44:36 - 44:41

tolerant circuits. So we want to present
two or three small issues we had in
44:41 - 44:45

circuits where we were convinced it should
have been working fine. So the first one
44:45 - 44:50

you will probably recognize: it is this
full triple modular redundancy scheme that
44:50 - 44:55

Szymon has presented. So we made sure to
triplicate everything and we were relatively
44:55 - 44:59

sure that everything should be fine. The
only modification we made is that to all
44:59 - 45:04

those registers in our design, we've added
a reset, because we wanted to initialize
45:04 - 45:08

the system to some known state when we
started up, which is a very obvious thing
45:08 - 45:12

to do. Every CPU has a reset. But of
course, what we didn't think about here
45:12 - 45:17

was that at some point there's a buffer
driving this reset line somewhere. And if
45:17 - 45:20

there's only a single buffer. What happens
if this buffer experiences a small
45:20 - 45:25

transient event? Of course, the obvious
thing that happened is that as soon as
45:25 - 45:28

that happened, all the registers were
upset at the same time and were basically
45:28 - 45:32

cleared and all our fancy protection was
invalidated. So next time we decided,
45:32 - 45:38

let's be smarter this time. And of course,
we triplicate all the logic and all the
45:38 - 45:41

voters and all the registers. So let's
also triplicate the reset lines. And while
45:41 - 45:45

the designer of that block probably had
very good intentions, it turned out
45:45 - 45:49

that later, when we manufactured the
chip, it still sometimes showed a complete
45:49 - 45:55

reset without any good explanation for
that. And what was left out of the
45:55 - 46:00

scope of thinking here was that this reset
actually was connected to the system reset
46:00 - 46:05

of the chip that we had. And typically
pins on the chip are something that is
46:05 - 46:09

not available in huge quantities. So you
typically don't want to spend three pins
46:09 - 46:13

of your chip just for a stupid reset that
you don't use ninety nine percent of the
46:13 - 46:18

time. So what we did at some point we just
connected again the reset lines to a
46:18 - 46:22

single input buffer. That was then
connected to a pin of the chip. And of
46:22 - 46:26

course, this also represented a small
sensitive area in the chip. And again,
46:26 - 46:30

a single upset here was able to destroy
all three of our flip flops. All right.
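
Why a single reset buffer defeats the triplication can be shown with a small behavioural model. This is a minimal sketch under assumptions: the 4-bit counter, the function names and the single one-cycle glitch are illustrative and not taken from the actual chip.

```python
def majority(a, b, c):
    """Bitwise majority vote of three equal-width words."""
    return (a & b) | (a & c) | (b & c)


def clock_cycle(regs, reset_lines):
    """Advance three register copies of a 4-bit counter by one clock.

    reset_lines holds the reset value seen by each copy (True = reset asserted).
    """
    voted = majority(*regs)
    return [0 if rst else (voted + 1) & 0xF for rst in reset_lines]


def run(cycles, glitch_cycle, shared_buffer):
    """Return the voted output per cycle with one transient on a reset buffer."""
    regs = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        outputs.append(majority(*regs))
        glitch = cycle == glitch_cycle
        if shared_buffer:
            # one buffer drives all three copies: the glitch reaches every register
            reset_lines = [glitch, glitch, glitch]
        else:
            # three independent buffers: the glitch reaches only copy A
            reset_lines = [glitch, False, False]
        regs = clock_cycle(regs, reset_lines)
    return outputs


print(run(8, 3, shared_buffer=True))   # state is lost: the counter restarts from 0
print(run(8, 3, shared_buffer=False))  # glitch is voted out: the counter keeps counting
```

The same reasoning applies to any signal that fans out to all three copies from a single cell, such as the shared input buffer behind the reset pin.
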
46:30 - 46:35

And the last lesson I'm bringing, the
last thing, goes back to the
46:35 - 46:39

implementation details that Szymon has
mentioned. So this time, a really simple
46:39 - 46:43

circuit. We were absolutely convinced it
must work because it was basically the
46:43 - 46:46

textbook example that Szymon was
presenting. And the code was so
46:46 - 46:50

small we were able to inspect everything
and were very much sure that nothing
46:50 - 46:55

should have happened. And what we saw when
we went for this laser testing experiment,
46:55 - 47:00

in simplified form, is basically that
only this first voter was sensitive. And when this was
47:00 - 47:04

hit, always all our registers were
upset, while the other ones
47:04 - 47:09

never showed anything strange.
And it took us quite a while to actually
47:09 - 47:14

look at the layout later on and figure out
that what was in the chip was rather this.
47:14 - 47:17

So two of the voters were actually not
there. And Szymon mentioned the reason for
47:17 - 47:21

that. So synthesis tools these days are
really clever at identifying redundant
47:21 - 47:26

logic, and because we forgot to tell our tool
not to optimize these redundant pieces of
47:26 - 47:30

logic, which the voters really are, it
just merged them into one. And that
47:30 - 47:34

explains why we only saw this one voter
being the sensitive one. And of course, if
47:34 - 47:38

you have a transient event there, then you
suddenly upset all your registers and that
47:38 - 47:42

without even knowing it, and while being
sure, having looked at every single line
47:42 - 47:46

of Verilog code, and being very sure that
everything should have been fine. But that
47:46 - 47:52

seems to be how this business goes. So we
hope that you
47:52 - 47:57

were able to get some insight into what
we do to make sure the experiments at the
47:57 - 48:02

LHC work fine, and what you can do to
make sure the satellite you are working on
48:02 - 48:06

might be working OK even before launching
it into space. If you're interested in
48:06 - 48:11

some more information on this topic, feel
free to pass by at the assembly I
48:11 - 48:15

mentioned at the beginning or just meet us
after the talk and otherwise thank you
48:15 - 48:22

very much.
Applause
48:22 - 48:27

Herald: Thank you very much indeed.
There's about 10 minutes left for Q and A,
48:27 - 48:32

so if you have any questions go to a
microphone. And as a cautious reminder,
48:32 - 48:38

questions are short sentences that
start with a question, well, end with a
48:38 - 48:43

question mark and the first question goes
to the Internet.
48:43 - 48:46

Internet: Well, hello. Um, do you also
incorporate radiation as the source for
48:46 - 48:51

randomness when that's needed?
Stefan: So we personally don't. So in our
48:51 - 48:57

designs we don't. But it is done indeed
for random number generators. It is
48:57 - 49:01

sometimes done that they use radioactive
decay as a source for randomness. So this
49:01 - 49:04

is done, but we don't do it in our
experiments.
49:04 - 49:07

We rather want deterministic data out of
the things we built.
49:07 - 49:11

Herald: Okay. Next question goes to
microphone number four.
49:11 - 49:17

Mic 4: Do you do your triplication before
or after elaboration?
49:17 - 49:21

Szymon: So currently we do it before
elaboration. So we decided that our tool
49:21 - 49:26

works on Verilog input and it produces
Verilog output because it offers much more
49:26 - 49:30

flexibility in the way you can
incorporate different triplication
49:30 - 49:34

schemes. If you were to apply it only
after elaboration, then of course doing a
49:34 - 49:38

full triplication might be easy. But then
having really precise control
49:38 - 49:43

over the types of triplication on different
levels is much more difficult.
49:43 - 49:47

Herald: Next question from microphone
number two.
49:47 - 49:51

Mic 2: Is it possible to use DCDC
converters or switch mode power supplies
49:51 - 49:55

within the radiation environment to power
your logic? Or do you use only linear power?
49:55 - 50:00

Szymon: Yes, alternatively we also have a
dedicated program which develops radiation
50:00 - 50:05

hardened DCDC converters which operate
in our environments. So they are available
50:05 - 50:11

also for space applications, as far as I'm
aware. And they are hardened against total
50:11 - 50:16

ionizing dose as well as single event
upsets.
50:16 - 50:20

Herald: Okay next question goes to
microphone number one.
50:20 - 50:23

Mic 1: Thank you very much for the great
talk. I'm just wondering, would it be
50:23 - 50:27

possible to hook up every logic gate in
every voter in a kind of mesh network? And
50:27 - 50:32

what are the pitfalls and limitations for
that?
50:32 - 50:37

Stefan: So that is not something I'm aware
of being done. So typically: No. I
50:37 - 50:41

wouldn't say that that's something we
would do.
50:41 - 50:43

Szymon: I'm not really sure if I
understood the question.
50:43 - 50:46

Stefan: So maybe you can rephrase what
your idea is?
50:46 - 50:53

Mic 1: On the last slide, there was a
lesson learned.
50:53 - 50:56

Stefan: Yes. One of those?
Mic 1: In here. Yeah. Would you be able to
50:56 - 51:00

connect everything interchangeably in a
mesh network?
51:00 - 51:04

Szymon: So what you are probably asking
about is whether we can build our own
51:04 - 51:08

FPGA-like programmable logic device.
Mic 1: Probably.
51:08 - 51:11

Szymon: Yeah. And so this we typically
don't do, because in our experiments, our
51:11 - 51:16

power budget is also very limited, so we
cannot really afford this level of
51:16 - 51:21

complexity. So of course you can make your
FPGA design radiation hard, but this is
51:21 - 51:25

not what we will typically do in our
experiments.
51:25 - 51:29

Herald: Next question goes to microphone
number two.
51:29 - 51:32

Mic 2: Hi, I would like to ask if the
orientation of your transistors and your
51:32 - 51:38

chip is part of your design. So mostly you
have something like a bounding box around
51:38 - 51:43

your design and with an attack surface in
different sizes. So do you use this
51:43 - 51:48

orientation to minimize the attack surface
of the radiation on chips, if you know
51:48 - 51:53

the source of the radiation?
Szymon: No. So I don't think we'd do that.
51:53 - 51:59

So, of course, we control our orientation
of transistors during the design phase.
51:59 - 52:03

But usually in our experiment, the
radiation is really perpendicular to the
52:03 - 52:08

chip area, which means that if you rotate
it by 90 degrees, you don't really gain
52:08 - 52:12

that much. And moreover, our chips,
usually they are mounted in a bigger
52:12 - 52:17

system where we don't control how they are
oriented.
52:17 - 52:24

Herald: Again, microphone number two.
Mic 2: Do you take metastability into
52:24 - 52:33

account when designing voters?
Szymon: The voter itself is combinatorial.
52:33 - 52:39

So ... -
Mic 2: Yeah, but if the state of the rest
52:39 - 52:45

can change at any time, then the
voters can have like glitches, yeah?
52:45 - 52:51

Szymon: Correct. So that's why - so to
avoid this, we don't take it into account
52:51 - 52:55

during the design phase. But if we use
that scheme which is just displayed here,
52:55 - 52:59

we avoid this problem altogether, right?
Because even if you have metastability in
52:59 - 53:05

one of the blocks like A, B or C, then it
will be fixed in the next clock cycle.
53:05 - 53:10

Because usually our systems operate on
clocks with low frequencies, hundreds of
53:10 - 53:13

megahertz, which means that any
metastability should be resolved by the next
53:13 - 53:15

clock cycle.
Mic 2: Thank you.
53:15 - 53:19

Herald: Next question microphone number
one.
53:19 - 53:23

Mic 1: How do you handle the register
duplication that can be performed by a
53:23 - 53:28

synthesis and place and route? So the tools
will try to optimize timing sometimes by
53:28 - 53:32

adding registers. And these registers are
not triplicated.
53:32 - 53:36

Stefan: Yes. So what we do is that I mean,
in a typical, let's say, standard ASIC
53:36 - 53:40

design flow, this is not what happens. So
you have to actually instruct the tool to do
53:40 - 53:45

that, to do retiming and add additional
registers. But for what we are doing, we
53:45 - 53:48

have to - let's say not do this
optimization and instruct the tool to keep
53:48 - 53:53

all the registers we described in our RTL
code to keep them until the very end. And
53:53 - 53:57

we really also constrain them to always
keep their associated logic triplicated.
53:57 - 54:02

Herald: The next question is from the
internet.
54:02 - 54:08

Internet: Do you have some simple tips for
improving radiation tolerance?
54:08 - 54:12

Stefan: Simple tips? Ahhhm...
Szymon: Put your electronics inside a
54:12 - 54:13

box.
Stefan: Yes.
54:13 - 54:17

some laughter
There's just no
54:17 - 54:23

single one size fits all textbook recipe
for this as it really always comes down to
54:23 - 54:28

analyzing your environment, really getting
an awareness first of what rate and what
54:28 - 54:32

number of events you are looking at, what
type of particles cause them, and then
54:32 - 54:36

take the appropriate measures to mitigate
them. So there is no one size fits all
54:36 - 54:38

thing, I'd say.
Herald: Next question goes to microphone
54:38 - 54:42

number two.
Mic 2: Hi. Thanks for the talk. How much
54:42 - 54:48

of the software you use for design is
actually open source? I only know a super
54:48 - 54:54

expensive chip design software.
Stefan: Right, the core of all the
54:54 - 55:01

implementation tools like the synthesis
and place and route stages for the ASICs
55:01 - 55:05

that we design is actually commercial
closed source tools. And if
55:05 - 55:10

you're asking for the fraction, that's a
bit hard to answer. I cannot give a
55:10 - 55:15

statement about the size of the commercial
closed tools. But for
55:15 - 55:19

everything we develop, we try to make it
available to the widest possible audience
55:19 - 55:22

and therefore decided to make the
extensions to this design flow available
55:22 - 55:26

in public form. And that's why these
tools that we develop and share among the
55:26 - 55:31

community of ASIC designers and this
environment are open source.
55:31 - 55:35

Herald: Microphone number four.
Mic 4: Have you ever tried using steered
55:35 - 55:41

ion beams for more localized radiation
ingress testing?
55:41 - 55:44

Stefan: Yes, indeed! And the picture I
showed actually, uh, I didn't mention
55:44 - 55:49

that, but the facility you saw here is
actually a facility in Darmstadt in
55:49 - 55:53

Germany and is actually a micro beam
facility. So it's a facility that allows
55:53 - 55:58

steering a heavy ion beam really on a
single position with less than a
55:58 - 56:02

micrometer accuracy. So it provides
probably exactly what you were asking for.
56:02 - 56:06

But that's not the typical case. That is
really a special thing. And it's probably
56:06 - 56:09

also the only facility in Europe that can
do that.
56:09 - 56:13

Herald: Microphone number one.
Mic 1: It was a very good talk. Thank
56:13 - 56:19

you very much. My question is, did you
compare what you did to what is done for
56:19 - 56:25

securing secret chips? You know, when you
have credit card chips, you can make fault
56:25 - 56:30

attacks into them so you can make them
malfunction and extract the cryptographic
56:30 - 56:34

key for example from the banking card.
There are techniques here to harden these
56:34 - 56:38

chips against fault attacks. So which are
like voluntary faults while you have like
56:38 - 56:43

random faults due to like involuntary
attacks. You know what? Can you explain if
56:43 - 56:47

you compared in a way what you did to
this?
56:47 - 56:51

Stefan: Um, so no, we didn't explicitly
compare it, but it is right that the
56:51 - 56:54

techniques we present can also be used in
a variety of different contexts. So one
56:54 - 56:59

thing that's not exactly what you are
referring to, but relatively on a similar
56:59 - 57:04

scale is that currently in very small
technologies you also get problems with the
57:04 - 57:08

reliability and yield of the manufacturing
process itself, meaning that sometimes
57:08 - 57:12

just the metal interconnection between two
gates in your circuit might be broken
57:12 - 57:16

after manufacturing and then adding the
sort of redundancy with the same kinds of
57:16 - 57:21

techniques can be used to make, to
produce more working chips out of a
57:21 - 57:25

manufacturing run. So in this sort of
context, these sorts of techniques are
57:25 - 57:31

used very often these days. But, um,
I'm pretty sure they can be applied to
57:31 - 57:35

these sorts of, uh, security fault attack
scenarios as well.
57:35 - 57:40

Herald: Next question from microphone
number two.
57:40 - 57:44

Mic 2: Hi, you briefly also mentioned the
mitigation techniques on the cell level
57:44 - 57:52

and yesterday there was a very nice talk
from the LibreSilicon people and they
57:52 - 57:56

are trying to build a standard cell
library, uh, open source standard cell
57:56 - 58:00

library. So are you in contact with them
or maybe you could help them to improve
58:00 - 58:04

their design and then the radiation
hardness?
58:04 - 58:07

Stefan: No. We also saw the talk
yesterday, but we are not yet in
58:07 - 58:14

contact with them. No.
Herald: Does the Internet have questions?
58:14 - 58:21

Internet: Yes, I do. Um, two in fact.
First one would be would TTL or other BJT
58:21 - 58:27

based logic be more resistant?
Szymon: Uh, yeah. So depending on which
58:27 - 58:31

type of errors we are considering. So BJT
transistors, they have ...
58:31 - 58:36

Stefan in his part mentioned that
displacement damage is not a problem for
58:36 - 58:40

CMOS devices, but that is not the case
for BJT devices. So when they are exposed
58:40 - 58:47

to high energy hadrons or protons,
they degrade a lot. So that's why we don't
58:47 - 58:52

really use them in our environment. They
could be probably much more robust to
58:52 - 58:57

single event effects because their
resistance everywhere is much lower. But
58:57 - 59:02

they would have other problems. And also
another problem which is worth
59:02 - 59:06

mentioning is that for those devices, they
consume much, much, much more power, which
59:06 - 59:13

we cannot afford in our applications.
Internet: And the last one would be how do
59:13 - 59:19

I use the output of the full TMR setup? Is
it still three signals? How do I know
59:19 - 59:26

which one to use and to trust?
Stefan: Um, yes. So with this, um,
59:26 - 59:30

architecture, what you could either do is
really do the full triplication scheme
59:30 - 59:35

to your whole logic tree basically and
really triplicate everything or, and
59:35 - 59:39

that's going in the direction of one of
the lessons learned I had, at some point
59:39 - 59:43

of course you have an interface to your
chip, so you have pins left and right that
59:43 - 59:47

are inputs and outputs. And then you have
to decide either you want to spend the
59:47 - 59:51

effort and also have three dedicated input
pins for each of the signals, or you at
59:51 - 59:54

some point have the voter and say, okay.
At this point, all these signals are
59:54 - 59:58

combined. But I was able to reduce the
amount of sensitive area in my chip
59:58 - 60:04

significantly and can live with the very
small remaining sensitive area that just
60:04 - 60:07

the input and output pins provide.
Szymon: So maybe I will add one more thing
60:07 - 60:12

is that typically in our systems, of
course we triplicate our logic internally,
60:12 - 60:15

but when we interface with external
world, we can apply another protection
60:15 - 60:20

mechanism. So for example, for our high
speed serialisers, we will use different types
60:20 - 60:24

of encoding to add protect...,
to add like forward error correction
60:24 - 60:30

codes, which would allow us to recover from these
types of faults in the backend later on.
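
[Editor's sketch] The voter that Stefan mentions, the point where the three redundant signals are combined into one, can be illustrated as a bitwise two-out-of-three majority function. This is a minimal Python model for illustration only, not the speakers' hardware implementation; the names `majority_vote` and `flip_bit` are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit agrees with at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, position: int) -> int:
    """Model a single-event upset by inverting one bit of a word."""
    return word ^ (1 << position)


if __name__ == "__main__":
    value = 0b1011_0010

    # Three redundant copies of the same register; one copy takes an upset.
    copies = [value, flip_bit(value, 3), value]
    assert majority_vote(*copies) == value

    # The same bit upset in two copies outvotes the correct copy.
    broken = [flip_bit(value, 3), flip_bit(value, 3), value]
    assert majority_vote(*broken) != value
```

The second case shows why the voter itself, and any untriplicated pins behind it, remain a small sensitive area: the scheme only masks a fault in one of the three copies at a time.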
60:30 - 60:37

Herald: Okay. If ...if we keep this very,
very short. Last question goes to
60:37 - 60:41

microphone number two.
Mic 2: I don't know much about physics. So
60:41 - 60:47

just the question, how important is the
physical testing after the chip is
60:47 - 60:52

manufactured? Isn't the simulation, the
computer simulation enough if you just
60:52 - 60:56

shoot particles at it?
Stefan: Yes and no. So in principle, of
60:56 - 61:01

course, you are right that you should be
able to simulate all the effects we look
61:01 - 61:07

at. The problem is that as the designs
grow big and they do grow bigger as the
61:07 - 61:11

technologies shrink, so
this final netlist that you end up with
61:11 - 61:15

can have millions or billions of nodes and
it just is not feasible anymore to
61:15 - 61:20

simulate it exhaustively because you have
so many dimensions that you have to
61:20 - 61:26

change when you inject, for example, bit
flips or transients in your design in any
61:26 - 61:31

of those nodes for varying time offsets.
And the state space the circuit
61:31 - 61:35

can be in is just too huge to capture in a
full simulation. So it's not possible
61:35 - 61:39

to exhaustively test it in simulation. And
so typically you end up with having missed
61:39 - 61:43

something that you discover only in the
physical testing afterwards, which you
61:43 - 61:47

always want to do before you put your, uh,
chip into the final experiment or on your
61:47 - 61:51

satellite and then realise it's not
working as intended. So it has a big
61:51 - 61:56

importance as well.
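
[Editor's sketch] Stefan's argument about exhaustive simulation can be made concrete with a rough count of single-fault injection runs: one run per combination of netlist node and injection time. The numbers below are illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
def injection_runs(nodes: int, time_offsets: int) -> int:
    """One simulation run per (node, injection time) pair, single faults only."""
    return nodes * time_offsets


def campaign_years(runs: int, seconds_per_run: float) -> float:
    """Wall-clock years if the runs are executed one after another."""
    seconds_per_year = 365 * 24 * 3600
    return runs * seconds_per_run / seconds_per_year


if __name__ == "__main__":
    # Illustrative assumptions, not figures from the talk:
    nodes = 1_000_000        # nodes in the final netlist
    time_offsets = 1_000     # distinct injection times tried per node
    seconds_per_run = 1.0    # cost of a single simulation run

    runs = injection_runs(nodes, time_offsets)
    years = campaign_years(runs, seconds_per_run)
    print(f"{runs:,} runs, roughly {years:.1f} years when run sequentially")
```

This count covers only single faults injected from one starting state. Multiplying by the number of states the circuit can be in, the dimension Stefan singles out, makes the total grow far beyond what a simulation campaign can cover.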
Herald: Okay. Thank you. Time is up. All
61:56 - 61:59

right. Thank you all very much.
61:59 - 62:05

applause
62:05 - 62:10

36c3 postroll music
62:10 - 62:32

Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!

Title:: 36C3 - How to Design Highly Reliable Digital Electronics
Video Language:: English
Duration:: 01:02:30

