-
36C3 intro music
-
Herald: The next talk will be titled 'How
to Design Highly Reliable Digital
-
Electronics', and it will be delivered to
you by Szymon and Stefan. Warm Applause
-
for them.
-
applause
-
Stefan: All right. Good morning, Congress.
So perhaps every one of you in the room
-
here has at one point or another in their
lives witnessed their computer behaving
-
weirdly and doing things that it was not
supposed to do or what you didn't
-
anticipate it to do. And well, typically
that would have probably been the result
-
of a software bug of some sort somewhere
inside the huge software stack your PC is
-
running on. Have you ever considered what
the probability of this weird behavior
-
being caused by a bit flipped somewhere in
the memory of your computer might have
-
been? So what you can see in this video on
the screen now is a physics experiment
-
called a cloud chamber. It's a very simple
experiment that is actually able to
-
visualize and make apparent all the
constant stream of background radiation we
-
all are constantly exposed to. So what's
happening here is that highly energetic
-
particles, for example, from space they
trace through gaseous alcohol and they
-
collide with alcohol molecules and they
form in this process a trail of
-
condensation while they do that. And if
you think about your computer, a typical
-
cell of RAM, of which you might have, I
don't know, 4, 8, 10 gigabytes in your
-
machine, is only 80 nanometers
wide. So it's very, very tiny. And you
-
probably are able to appreciate the small
amount of energy that is needed or that is
-
used to store the information inside each
of those bits. And the sheer amount of
-
those bits you have in your RAM and your
computer. So a couple of years ago, there
-
was a study that concluded that in a
computer with about four gigabytes of RAM,
-
a bit flip caused by such an event by
cosmic background radiation can occur
-
about once every 33 hours. So a
bit less than one per day. In an
-
incident in 2008, a Qantas Airlines
flight actually nearly crashed, and the
-
reason for this crash was traced back to
be very likely caused by a bit flipped
-
somewhere in one of the CPUs of the
avionics system and nearly caused the
-
death of a lot of passengers on this
plane. In 2003, in Belgium, a small
-
municipal vote actually had a weird hiccup
in which one of the candidates in this
-
election actually got 4096 more votes added in a single instance.
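The size of the jump is itself a hint: 4096 is exactly 2^12, which is what a binary counter gains when a single bit (bit 12) changes from 0 to 1. A minimal Python sketch of that arithmetic follows; the vote count used is made up for illustration, since the real tally is not given in the talk.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (XOR with a one-hot mask)."""
    return value ^ (1 << bit)

votes = 514                      # hypothetical count; bit 12 is still 0 here
corrupted = flip_bit(votes, 12)  # a single flipped bit in the stored count
print(corrupted - votes)         # 4096
```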
-
And that was traced back to be very likely
caused by cosmic background radiation,
-
flipping a memory cell somewhere that
stored the vote count. And it was only
-
discovered that this happened because this
number of votes for this particular
-
candidate was considered unreasonable, but
otherwise would have gotten away probably
-
without being detected. So a few words
about us: Szymon and I, we both work at
-
CERN in the microelectronics section and
we both develop electronics that need to
-
be tolerant to these sorts of effects. So
we develop radiation tolerant electronics
-
for the experiments at CERN, at the LHC.
Among a lot of other applications, you can
-
meet the two of us at the Lötlabor Jena
assembly if you are interested in what we
-
are talking about today. And we will also
give a small talk or a small workshop
-
about radiation detection tomorrow, in one
of the seminar rooms. So feel free to pass
-
by there, it will be a quick introduction.
To give you a small idea of what kind of
-
environment we are working for: So if you
would use one of your default Intel i7
-
CPUs from your notebook and would put it
anywhere where we operate our electronics,
-
it would very shortly die in a matter of
probably one or two minutes and it would
-
die for more than just one reason, which
is rather interesting and compelling. So
-
the idea for today's talk is to give you
all an insight into all the things that
-
need to be taken into account when you
design electronics for radiation
-
environments. What kinds of different
challenges come when you try to do that.
-
We classify and explain the different
types of radiation effects that exist. And
-
then we also present what you can do to
mitigate these effects and also validate
-
that what you did to care for them or
protect your circuits actually worked. And
-
of course, as we do that, we'll try to
give our view on how we develop radiation
-
tolerant electronics at CERN and what our
workflow looks like to make sure this
-
works. So let's first maybe take a step
back and have a look at what we mean when
-
we say radiation environments. The first
one that you probably have in mind right
-
now when you think about radiation is
space. So, this interstellar space is
-
basically filled with, very high speed,
highly energetic electrons and protons and
-
all sorts of high energy particles. And
while they, for example, traverse close to
-
planets such as our Earth - these planets
sometimes do have a magnetic field and the
-
highly energetic particles are actually
deflected by these magnetic fields and
-
they can protect planets such as
ours, for example, from this highly
-
energetic radiation. But in the process,
there around these planets sometimes they
-
form these radiation belts - known as the
Van Allen belts after the guy who
-
discovered this effect a long time ago.
And a satellite in space as it orbits
-
around the Earth might, depending on what
orbit is chosen, sometimes go through
-
these belts of highly intense radiation.
That, of course, then needs to be taken
-
into account when designing electronics
for such a satellite. And if Earth itself
-
is not able to give you enough radiation,
you may think of the very famous Juno
-
Jupiter mission that has become famous
about a year ago. They actually in the
-
environment of Jupiter they anticipated so
much radiation that they actually decided
-
to put all the electronics of the
satellite inside a one centimeter thick
-
cube of titanium, which is famously known
as the Juno Radiation Vault. But not only
-
space offers radiation environments.
Another form of radiation you probably all
-
recognize this when I show you this
picture, which is an X-ray image of a
-
hand. And X-ray is also considered a form
of radiation. And while, of course, the
-
doses or amounts of radiation any patient
is exposed to while doing diagnosis or
-
treatment of some disease, that might not
be the full story when it comes to medical
-
applications. So this is a medical
particle accelerator which is used for
-
cancer treatment. And in these sorts of
accelerators, typically carbon ions or
-
protons are accelerated and then focused
and used to treat and selectively destroy
-
cancer cells in the body. And this comes
already relatively close to the
-
environment we are working in and working
for. So Szymon and I are working, for
-
example, on electronics, for the CMS
detector inside the LHC, for which we build
-
dedicated, radiation tolerant, integrated
circuits which have to withstand very,
-
very large amounts and doses of short
lived radiation in order to function
-
correctly. And if we didn't specifically
design electronics for that, basically the
-
whole system would never be able to work.
To illustrate a bit how you can imagine
-
the scale of this environment: This is a
single plot of a collision event that was
-
recorded in the ATLAS experiment. And each
of those tiny little traces you can make
-
out in this diagram is actually either one
or multiple secondary particles that were
-
created in the initial collision of two
proton bunches inside the experiment. And
-
each of those, of course, races through
the detector electronics, which make these
-
traces visible, itself then decaying into
multiple other secondary particles which
-
all go through our electronics. And if
that doesn't sound, let's say, bad enough
-
for digital electronics, these collisions
happen about 40 million times a second. Of
-
course, multiplying the number of events
or problems they can cause in our
-
circuits. So we now want to introduce all
the things that can happen, the different
-
radiation effects. But first, probably we
take a step back and look at what we mean
-
when we say digital electronics or digital
logic, which we want to focus on today. So
-
from your university lectures or your
reading, you probably know the first class
-
of digital logic, which is the
combinatorial logic. So this is typically
-
logic that just does a simple linear
relation of the inputs of a circuit and
-
produces an output as exemplified with
these AND and OR, NAND, XOR gates that you
-
see here. But if you want to build - I
mean even though we use those everywhere
-
in our circuits - you probably also want
to store state in a more complex circuit,
-
for example, in the registers of your CPU
they store some sort of internal
-
information. And for that we use the other
class of logic, which is called the
-
sequential logic. So this is typically
clocked with some system clock frequency
-
and it changes its output with relation to
the inputs whenever this clock signal changes.
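The difference between the two classes can be sketched in a few lines of Python. This is only an illustrative model: the gates are pure functions of their inputs, while the register keeps a stored value and updates it only on a clock edge. The class name `Register` and the choice of the rising edge are assumptions made for this sketch.

```python
def and_gate(a: int, b: int) -> int:
    """Combinatorial: the output depends only on the current inputs."""
    return a & b

def xor_gate(a: int, b: int) -> int:
    """Combinatorial: the output depends only on the current inputs."""
    return a ^ b

class Register:
    """Sequential: stores one bit and updates it only on a rising clock edge."""

    def __init__(self) -> None:
        self.q = 0            # stored state, visible at the output
        self._prev_clk = 0    # last clock level seen, used to detect edges

    def tick(self, clk: int, d: int) -> int:
        if clk == 1 and self._prev_clk == 0:   # rising edge (assumed convention)
            self.q = d
        self._prev_clk = clk
        return self.q

reg = Register()
outputs = [reg.tick(clk, d) for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]]
print(outputs)   # [0, 1, 1, 1, 0]
```

The input `d` changes to 0 in the third step, but the output follows only at the next rising edge in the fifth step.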
-
And now if we look at how all
these different logic functionalities are
-
implemented. So typically nowadays for
that you may know that we use CMOS
-
technologies and basically represent all
this logic functionality as digital gates
-
using small P-MOS and N-MOS MOSFET
transistors in CMOS technologies. And if
-
we kind of try to build a model for more
complex digital circuits, we typically use
-
something we call the finite state machine
model, in which we use a model that
-
consists of a combinatorial and a
sequential part. And you can see that the
-
output of the circuit depends both on the
internal state inside the register as well
-
as also the input to the combinatorial
logic. And accordingly, also the state
-
that is internal is always changed by the
inputs as well as the current state. So
-
this is kind of the simple model for more
complex systems that can be used to model
-
different effects. Now let's try to
actually look at what the radiation can do
-
to transistors. And for that we are going
to have a quick recap of what a
-
transistor actually is and what it looks
like. As you may perhaps know, in
-
CMOS technologies, transistors are built
on wafers of high purity silicon. So this
-
is a crystalline, very regularly organized
lattice of silicon atoms. And what we do
-
to form a transistor on such a wafer is
that we add dopants in order to form
-
diffusion regions, which later will become
the source and drain of our transistors.
-
And then on top of that we grow a layer of
insulating oxide. And on top of that we
-
put polysilicon, which forms the gate
terminal of the transistor. And in the end
-
we end up with an equivalent circuit a bit
like that. And now to put things back into
-
perspective - you may also note that the
dimensions of these structures are very
-
tiny. So we talk about tens of nanometers
for some of the dimensions I've outlined
-
here. And as the technologies shrink,
these become smaller and smaller and
-
therefore you'll probably also realize or
are able to appreciate the small amount of
-
energy that is used to store information
inside these digital circuits, which makes
-
them perhaps more sensitive to radiation.
So let's take a look. What different types
-
of radiation effects exist? We typically
in this case, differentiate them into two
-
main classes of events. The first one
would be the cumulative effects, which are
-
effects that, as the name implies,
accumulate over time. So as the circuit is
-
placed inside some radiation environment,
over time it accumulates more and more
-
dose and therefore worsens its performance
or changes how it operates. And on the
-
other side, we have the Single Event
Effects, which are always events that
-
happen at some instantaneous point in
time, and then suddenly, without being
-
predictable, change how the circuit
operates or how it functions or if it
-
works in the first place or not. So I'm
going to first go into the class of
-
cumulative effects and then later on,
Szymon will go into the other class of the
-
Single Event Effects. So in terms of these
accumulating effects, we basically have
-
two main subclasses: The first one being
ionization or TID effects, for Total
-
Ionizing Dose - and the second one being
displacement damages. So displacement
-
damages do exactly what they sound like.
It is all the effects that happen when an
-
atom in the silicon lattice is actually
displaced, so removed from its lattice
-
position and actually changes the
structure of the semiconductor. But
-
luckily, these effects don't have a big
impact in the CMOS digital circuits that
-
we are looking at today. So we will
disregard them for the moment and we'll be
-
looking more at the ionization damage, or
TID. So ionization - as a quick recap - is
-
whenever electrons are removed or added to
an atom, effectively transforming it into
-
an ion. And these effects are especially
critical for the circuits we are building
-
because what they do is that they
change the behavior of the transistors.
-
And without looking too much into the
semiconductor details, I just want to show
-
their typical effect that we are concerned
about in this very simple circuit here. So
-
this is just an inverter circuit
consisting of two transistors here and
-
there. And what the circuit does in normal
operation is it just takes an input signal
-
and inverts it and basically gives the
inverted signal at the output. And as the
-
transistors are irradiated and accumulate
dose, you can see that the edges of the
-
output signal get slower. So the
transistor takes longer to turn on and off.
-
And what that does in turn is that it
limits the maximum operation frequency of
-
your circuit. And of course, that is not
something you want to do. You want your
-
circuit to operate at some frequency in
your final system. And if the maximum
-
frequency it can work at degrades over
time, at some point it will fail as the
-
maximum frequency is just too low. So
let's have a look at what we can do to
-
mitigate these effects. The first one and
I already mentioned it when talking about
-
the Juno mission, is shielding. So if you
can actually put a box around your
-
electronics and shield any radiation from
actually hitting your transistors, it is
-
obvious that they will last longer and
will suffer less from the radiation damage
-
than they otherwise would. So this
approach is very often used in space
-
applications like on satellites, but it's
not very useful if you are actually trying
-
to measure the radiation with your
circuits as we do, for example, in the
-
particle accelerators we build integrated
circuits for. So there first of all, we
-
want to measure the radiation so we cannot
shield our detectors from the radiation.
-
And also, we don't want to influence the
tracks of these secondary collision
-
products with any shielding material that
would be in the way. So this is not very
-
useful in a particle accelerator
environment, let's say. So we have to
-
resort to different methods. So as I said,
we do have to design our own integrated
-
circuits in the first place. So we have
some freedom in what we call transistor
-
level design. So we can actually alter the
dimensions of the transistors. We can make
-
them larger to withstand larger doses of
radiation and we can use special
-
techniques in terms of layout that we can
experimentally verify to be more
-
resistant to radiation effects. And as a
third measure, which is probably the most
-
important one for us, is what we call
modeling. So we actually are able to
-
characterize all the effects that
radiation will have on a transistor. And
-
if we can do that, if we know: 'If I
put it into a radiation environment for a
-
year, how much slower will it become?'
Then it is of course easy to say: 'OK, I
-
can just over-design my circuit and make
it a bit more simple, maybe have less functionality,
-
but be able to operate at a
higher frequency and therefore withstand
-
the radiation effects for a longer time
while still working sufficiently well at
-
the end of its expected lifetime.' So
that's more or less what we can do about
-
these effects. And I'll hand over to
Szymon for the second class.
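The over-design argument above can be put into numbers with a small Python sketch. Everything here is an assumption made for illustration: the linear degradation model, the loss per unit of dose, the lifetime dose and the target frequency are all hypothetical values, not measured characterization data.

```python
def fmax_after_dose(fmax_fresh_mhz: float, dose_mrad: float, loss_per_mrad: float) -> float:
    """Hypothetical linear model: a fixed fraction of speed is lost per Mrad of dose."""
    return fmax_fresh_mhz * (1.0 - loss_per_mrad * dose_mrad)

def required_fresh_fmax(f_target_mhz: float, lifetime_dose_mrad: float, loss_per_mrad: float) -> float:
    """Speed the fresh circuit needs so it still meets the target at end of life."""
    remaining = 1.0 - loss_per_mrad * lifetime_dose_mrad
    if remaining <= 0.0:
        raise ValueError("model predicts the circuit stops working before end of life")
    return f_target_mhz / remaining

# Hypothetical numbers: 40 MHz target, 200 Mrad lifetime dose, 0.1 % speed loss per Mrad.
needed = required_fresh_fmax(40.0, 200.0, 0.001)
print(f"design for about {needed:.1f} MHz when fresh")
print(f"end-of-life fmax: {fmax_after_dose(needed, 200.0, 0.001):.1f} MHz")
```

With these made-up values the circuit would have to be designed for roughly 25 % more speed than the system needs on day one.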
-
Szymon: Contrary to the cumulative effects
presented by Stefan, the other group are
-
Single Event Effects which are caused by
high energy deposits, which are caused by
-
a single particle or shower of particles.
And they can happen at any time, even
-
seconds after irradiation is started. It
means that if your circuit is vulnerable
-
to this class of effects, it can fail
immediately after radiation is present.
-
And here we also classify these effects
into several groups. The first are hard,
-
or permanent, errors, which as the name
indicates can permanently destroy your
-
circuit. And these types of errors are
typically critical for power devices where
-
you have large power densities and they
are not so much of a problem for digital
-
circuits. The other class of effects
are soft errors. And here we distinguish
-
transient, or Single Event Transient
errors, which are spurious signals
-
propagating in your circuit as a result of
a gate being hit by a particle and they
-
are especially problematic for analog
circuits or asynchronous digital circuits,
-
but under some circumstances they can be
also problematic for synchronous systems.
-
And the other class of problems are
static, or Single Event Upset problems,
-
which basically means that your memory
element like a register gets flipped. And
-
then of course, if your system is not
designed to handle this type of errors
-
properly, it can lead to a failure. So in
the following part of the presentation
-
we'll focus mostly on soft errors. So
let's try to understand what is the origin
-
of this type of problem. So as Stefan
mentioned, the typical transistor is built
-
out of diffusions, gate and channel. So
here you can see one diffusion. Let's
-
assume that it is a drain diffusion. And
then when a particle goes through and
-
deposits charge, it creates free electron and
hole pairs, which then in the presence of
-
electric fields, they get collected by
means of drift, which results in a large
-
current spike, which is very short. And
then the rest of the charge could be
-
collected by diffusion which is a much
slower process and therefore also the
-
amplitude of the event is much, much
smaller. So let's try to understand what
-
could happen in a typical memory cell. So
on this schematic, you can see the
-
simplest memory cell, which is composed of
two back-to-back inverters. And let's
-
assume that node A is at high and node B
is at low potential initially. And then we
-
have a particle hitting the drain of
transistor M1 which creates a short
-
circuit current between drain and ground,
bringing the drain of transistor M1 to low
-
potential, which also acts on the gates of
second inverter, temporarily changing its
-
state from low to high, which reinforces
the wrong state in the first inverter. And
-
at this time the error is locked in your
memory cell and you basically lost your
-
information. So you may be asking
yourself: 'How much charge is needed
-
really to flip the state of a memory cell?'
And you can get this number from either
-
simulations or from measurements. So let's
assume that what we could do, we could try
-
to inject some current into the sensitive
node, for example, drain of transistor M1.
-
And here what I will show is that on the
top plot you will have current as a function
-
of time. On the second plot you will have
output voltage. So voltage at node B as a
-
function of time and at the lowest plot you
will see a probability of having a bit
-
flip. So if you inject very little
current, of course nothing changes at the
-
output, but once you start increasing the
amount of current you are injecting, you
-
see that something appears at the output
and at some point the output will toggle,
-
so it will switch to the other state. And
at this point, if you really calculate
-
what is the area under the current curve
you can find what is the critical charge
-
needed to flip the memory cell. And if you
go further, if you start injecting even
-
more current, you will not see that much
difference in the output voltage waveform.
-
It could become only slightly faster. And
at this point, you also can notice that
-
the probability now jumped to one, which
means that any time you inject so much
-
current there is a fault in your circuit.
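The "area under the current curve" step can be sketched numerically in Python. The pulse shape (a double exponential), its time constants, the amplitudes and the 3 fC critical charge are all assumptions chosen for illustration. Comparing the charge against a fixed threshold is also a simplification of the simulated cell behaviour described above.

```python
import math

def injected_current(t_s, scale_a, tau_rise_s, tau_fall_s):
    """Double-exponential pulse shape (an assumed approximation of the injected current)."""
    return scale_a * (math.exp(-t_s / tau_fall_s) - math.exp(-t_s / tau_rise_s))

def injected_charge(scale_a, tau_rise_s, tau_fall_s, t_end_s, steps=10_000):
    """Integrate current over time (trapezoidal rule) to get the charge in coulombs."""
    dt = t_end_s / steps
    total = 0.0
    previous = injected_current(0.0, scale_a, tau_rise_s, tau_fall_s)
    for i in range(1, steps + 1):
        current = injected_current(i * dt, scale_a, tau_rise_s, tau_fall_s)
        total += 0.5 * (previous + current) * dt
        previous = current
    return total

Q_CRIT_C = 3e-15  # hypothetical critical charge of 3 fC

# Sweep the pulse amplitude, as in the experiment described above.
for scale_ua in (20, 50, 100):
    q = injected_charge(scale_ua * 1e-6, 5e-12, 50e-12, 1e-9)
    print(f"{scale_ua} uA pulse: {q * 1e15:.2f} fC, flips: {q >= Q_CRIT_C}")
```

In this toy model only the largest of the three pulses deposits more than the assumed critical charge.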
So for now, we just found what is the
-
probability of having a bit-flip from 0 to
1 in node B. Of course we should also
-
calculate the same for the other
direction, so from 1 to zero. And usually
-
it is slightly different. And then of
course we should inject in all the other
-
nodes, for example node B and also should
study all possible transitions. And then
-
at the end, if you calculate the
superposition of these effects and you
-
multiply them by the active area of each
node, you will end up with what we call
-
the cross section, which has a dimension
of centimeters squared, which will tell
-
you how sensitive your circuit is to this
type of effects. And then knowing the
-
radiation profile of your environment, you
can calculate the expected upset rate in
-
the final application. So now, having
covered the basics of the single event
-
effects, let's try to check how we can
mitigate them. And here also technology
-
plays a significant role. So of course,
newer technologies offer us much smaller
-
devices. And together with that, what
follows is that usually supply voltages
-
are getting smaller and smaller as well as
the node capacitance, which means that for
-
our Single Event Upsets it is very bad
because the critical charge which is
-
required to flip our bit is getting less
and less. But at the end, at the same
-
time, physical dimensions of our
transistors are getting smaller, which
-
means that the cross section for them
being hit is also getting smaller. So
-
overall, the effects really depend on the
circuit topology and the radiation
-
environment. So another protection method
could be introduced on the cell level. And
-
here we could imagine increasing the
critical charge. And that could be done in
-
the easiest way by just increasing the
node capacitance by, for example, putting
-
larger transistors. But of course, this
also increases the collection electrode,
-
which is not nice. And another way could
be just increase the capacitance by adding
-
some extra metal capacitance, but it, of
course, slows down the circuit. Another
-
approach could be to try to store the
information on more than two nodes. So I
-
showed you that on a simple SRAM cell we
store information only on two nodes, so
-
you could try to come up with some other
cells, for example, like that one in which
-
the information is stored on four nodes.
So you can see that the architecture is
-
very similar to the basic SRAM cell. But
you should be careful always to very
-
carefully simulate your design, because if
we analyze this circuit, you will quickly
-
realize that this circuit, even though the
information is stored in four different
-
nodes, the same type of loop exists as in
the basic circuit. Meaning that at the end
-
the circuit offers basically no hardening
with respect to the previous cell. So
-
actually we can do it better. So here you
can see a typical dual interlocked cell.
-
So the amount of transistors is exactly
the same as in the previous example, but
-
now they are interconnected slightly
differently. And here you can see that
-
this cell has also two stable configurations.
But this time data can propagate, the low
-
level from a given node can propagate
only to the left hand side, while high
-
level can propagate to the right hand
side. And each stage being inverting means
-
that the fault can not propagate for more
than one node. Of course, this cell has
-
some drawbacks: It consumes more area than
a simple SRAM cell and also write access
-
requires accessing at least two nodes at
the same time to really change the state
-
of the cell. And so you may ask yourself,
how effective is this cell? So here I will
-
show you a cross section plot. So it is
the probability of having an error as a
-
function of injected energy. And as a
reference, you can see a pink curve on the
-
top, which is for a normal, not protected
cell. And on the green you can see the
-
cross section for the error in the DICE
cell. So as you can see, it is one order
-
of magnitude better than the normal cell.
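To see what a cross section means in practice, it can be multiplied by the particle flux of the environment to get an expected upset rate, as described earlier. The Python sketch below does that for an unprotected cell and for a cell that is one order of magnitude better. The absolute cross section, the flux and the number of cells are hypothetical values, and the sketch ignores the energy dependence shown in the plot.

```python
def upset_rate_per_second(cross_section_cm2: float, flux_per_cm2_s: float, n_cells: int = 1) -> float:
    """Expected upsets per second = cross section x particle flux x number of cells."""
    return cross_section_cm2 * flux_per_cm2_s * n_cells

FLUX = 1e6                      # hypothetical flux in particles / (cm^2 * s)
N_CELLS = 1_000_000             # hypothetical number of memory cells on the chip
SIGMA_PLAIN = 1e-14             # hypothetical cross section per unprotected cell, in cm^2
SIGMA_DICE = SIGMA_PLAIN / 10   # "one order of magnitude better", as stated in the talk

for name, sigma in (("unprotected", SIGMA_PLAIN), ("DICE", SIGMA_DICE)):
    rate = upset_rate_per_second(sigma, FLUX, N_CELLS)
    print(f"{name}: {rate:.1e} upsets/s, one every {1 / rate:.0f} s on average")
```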
But still, the cross section is far from
-
being negligible. So the problem was
identified: So it was identified that the
-
problem was caused by the fact that some
sensitive nodes were very close together
-
on the layout and therefore they could be
upset by the same particle. Because as we
-
mentioned, single devices are very
small. We are talking about dimensions
-
below a micron. So after realizing that,
we designed another cell in which we
-
separated more sensitive nodes and we
ended up with the blue curve, and as you
-
can see the cross section was reduced by
two more orders of magnitude and the
-
threshold was increased significantly. So
if you don't want to redesign your
-
standard cells, you could also apply some
mitigation techniques on block level. So
-
here we can use some encoding to encode
our state better. And as an example, I
-
will show you a typical Hamming code. So
to protect four bits, we have to add three
-
additional parity bits which are calculated
according to this formula. And then once
-
you calculate the parity bits, you can use
those to check the state integrity of your
-
internal state. And if any of the parity
bits is not equal to zero, then the bits
-
instantaneously become syndromes,
indicating where the error happened. And
-
you can use this information to correct
the error. Of course, in this case, the
-
efficiency is not really nice because we
need three additional bits to protect only
-
four bits of information. But as the state
length increases the protection also is
-
more efficient. Another approach would be
to do even less. Meaning that instead of
-
changing anything you need in your design,
you can just triplicate your design or
-
multiply it many times and just vote,
which state is correct? So this concept is
-
called triple modular redundancy and it is
based around a voter cell. So it is a
-
cell which has an odd number of
inputs, and its output is always equal to
-
the majority of its inputs. And as I mentioned
that the idea is that you have, for
-
example, three circuits: A, B and C, and
during normal operation, when they are
-
identical, the output is also the same.
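Both block-level ideas described so far, the Hamming code and the majority voter, fit in a few lines of Python. This is a minimal sketch: the formula on the slide is not part of the transcript, so the standard Hamming(7,4) layout (parity bits at positions 1, 2 and 4) is assumed here, and the function names are made up for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits into 7 bits; parity bits sit at positions 1, 2 and 4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (data bits, error position); position 0 means no error was detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # the syndrome read as a binary number
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position

def majority(a, b, c):
    """Voter: the output equals the value held by at least two of the three inputs."""
    return (a & b) | (b & c) | (a & c)

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # single bit flip at position 5
print(hamming74_correct(word))        # ([1, 0, 1, 1], 5)
print(majority(1, 0, 1))              # 1: the single wrong copy is outvoted
```

The syndrome directly gives the position of the flipped bit, which is what allows the error to be corrected rather than just detected.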
However, when there is a problem, for
-
example, in logic part B, only its output
is affected. So this problem is
-
effectively masked by the voter cell
and it is not visible from outside of the
-
circuit. But you have to be careful not to
take this picture as a design
-
template. So let's try to analyze what
would happen with a state machine
-
similar to what Stefan introduced. If you
were to just use this concept. So here you
-
can see three state machines and
a voter on the output. And as we can see,
-
if you have an upset in, for example, the
state register A, then the state is
-
broken. But still the output of the
circuit, which is indicated by letter s is
-
correct because the B and C registers are
still fine. But what happens if some time
-
later we have an upset in memory element B
or C? Then of course the state
-
of our system is broken and we can not
recover it. So you can ask yourself what
-
can we do better in order to avoid this
situation? So, just to be sure: please
-
do not use this technique to protect your
circuits. So the easiest mitigation could
-
be to use as the input to your logic
the output of the voter cell itself.
-
What it offers us is that now whenever you
have an upset in one of the memory
-
elements for the next computation, for the
next stage, we always use the voter
-
output, which ensures that the error
will be removed one clock cycle later. So
-
if you have another hit some time later,
basically, it will not affect our state.
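The effect of feeding the voter output back into the logic can be illustrated with a toy Python simulation. This is a sketch only: the 4-bit counter stands in for an arbitrary state machine, and the function names are made up for this example.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (b & c) | (a & c)

def next_state(state: int) -> int:
    """Example logic: a 4-bit counter (hypothetical stand-in for any state machine)."""
    return (state + 1) & 0xF

def clock_cycle(regs, feedback_from_voter: bool):
    """Advance the three state registers by one clock cycle."""
    if feedback_from_voter:
        voted = majority(*regs)            # every copy restarts from the voted state
        return [next_state(voted)] * 3
    return [next_state(r) for r in regs]   # every copy continues from its own state

def run(feedback_from_voter: bool):
    regs = [3, 3, 3]
    regs[0] ^= 0b0100                      # single event upset in register A
    voted_now = majority(*regs)            # the voter still masks the error
    regs = clock_cycle(regs, feedback_from_voter)
    return voted_now, regs

print(run(feedback_from_voter=False))      # (3, [8, 4, 4]): register A stays wrong
print(run(feedback_from_voter=True))       # (3, [4, 4, 4]): error gone after one clock
```

Without feedback the voted output is still correct, but the corrupted copy never recovers, so a later upset in B or C breaks the state. With feedback all three copies agree again after one clock cycle.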
-
Until now we considered only upsets in our
registers but what happens if we have
-
a transient in our voter? So you see that
if there is no state change, basically the
-
transient in the voter doesn't impact
our system. But if you are really unlucky
-
and the transient happens when the clock
transition happens, so whenever we
-
latch the data, we can corrupt the state
in three registers at the same time, which
-
is less than ideal. So to overcome this
limitation, you can consider skewing our
-
clocks by some time, which is larger than
the maximum transient time. And now,
-
because each register samples the
output of the voter at a slightly different
-
time, we can corrupt only one flip-flop
at a time. So of course, if you are
-
unlucky, we can have problematic
situations in which one register is
-
already in the new state. The other register
is still in the old state. And then it
-
can lead to a non-deterministic result. So it
is better, but still not ideal. So as a
-
general theme, you have seen that we were
adding and adding more resources so you
-
can ask yourself what would happen if we
triplicate everything. So in this case,
-
we triplicated registers, we
triplicated our logic and our voters. And
-
now you can see that whenever we have an
upset in our register, it can only affect
-
one register at a time and the error
will be removed from the system one clock
-
cycle later. Also, if we have an upset
in the voter or in the logic it can be
-
latched only into one register, which means
that in principle we created a system
-
which is really robust. Unfortunately,
nothing is for free. So here I compare a
-
different triplication environments and
as you can see, the more protection
-
you want to have, the more you have to pay
in terms of resources, being power and
-
area. And also, usually, you pay a small
penalty in terms of maximum operational
-
speed. So which flavor of protection you
use depends really on the
-
application. So for the most sensitive
circuits, you probably want to use
-
full TMR and you may leave some other
bits of logic unprotected. Alternatively, if
-
your system is not mission critical and
you can tolerate some downtime, you can
-
consider scrubbing, which means periodically
checking the state of your system and refreshing it
-
if necessary if an error is detected using
some parity bits or copy of the data in
-
a safe space. Or you can have a
watchdog which will find out that
-
something went wrong and it will just
reinitialize the whole system. So now,
-
having covered the basics of all the effects
we will have to face, we would like
-
to show you the basic flow which we follow
during designing our radiation hardened
-
circuits. So of course we always start
with specifications. So we try to
-
understand our radiation environment in
which the circuit is meant to operate. So
-
we come up with some specifications for
total dose which could be accumulated and
-
for the rate of single event upsets. And
at this moment, it is also not very rare
-
that we have to decide to move some
functionality out of our detector volume,
-
outside, where we can use off-the-shelf
commercial equipment to do number
-
crunching. But let's assume that we would
go with our ASIC. So having the
-
specifications, of course we proceed with
functional implementation. This we
-
typically do with hardware description
languages, so Verilog or VHDL, which you may
-
know from typical FPGA flow. And of course
we write a lot of simulations to
-
understand whether we are meeting our
functional goals or whether our circuit
-
behaves as expected. And then we
selectively choose some parts of the
-
circuits which we want to protect from
radiation effects. So, for example, we can
-
decide to use triplication or some other
methods. So these days we typically use
-
triplication as the most straightforward
and very effective method. So you can ask
-
yourself how do we triplicate the logic?
So the simplest could be: Just copy
-
and paste the code three times, add some
postfixes like A, B and C, and you are
-
done. But of course this solution has some
drawbacks. So it is time consuming and it
-
is very error prone. So maybe you have
noticed that I had a typo there. So of
-
course we don't want to do that. So we
developed our own tool, which we called
-
TMRG, which automates the process of
triplication and eliminates the two main
-
drawbacks, which I just described. So
after we have our code triplicated and, of
-
course, after rerunning all the
simulations to make sure that everything
-
went as expected, we then proceed to the
synthesis process in which we convert our
-
high level hardware description languages
to gate level netlists, in which all the functions
-
are mapped to gates, which were introduced
by Stefan, so both combinatorial and
-
sequential. And here we also have to be
careful because modern CAD tools have a
-
tendency, of course, to optimise the logic
as much as possible. And our logic in most
-
of the cases is really redundant. So for the tool it is
very easy: it should be removed. So we
-
really have to make sure that it is not
removed. That's why our tool also provides
-
some constraints for the synthesizer to
make sure that our design intent is
-
clearly and well understood by the tool.
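The triplication scheme, and the reason the synthesizer must be told to keep the redundant voters, can be illustrated with a small behavioral model. This is only a sketch in plain Python, not TMRG output and not the speakers' code; the names `majority` and `tmr_step` and the `merged_voters` switch are hypothetical and exist only for this illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def tmr_step(regs, glitch_voter=None, merged_voters=False, glitch_mask=0x1):
    """Advance a triplicated register by one clock cycle.

    regs          -- current values of the three copies (A, B, C)
    glitch_voter  -- index of a voter whose output is hit by a transient
    merged_voters -- model a synthesizer that merged the three voters into one
    glitch_mask   -- bits flipped by the transient (assumed value)
    """
    if merged_voters:
        # One physical voter feeds all three register copies.
        voted = majority(*regs)
        if glitch_voter is not None:
            voted ^= glitch_mask
        return [voted, voted, voted]
    # Full TMR: every register copy has its own voter.
    outputs = [majority(*regs) for _ in range(3)]
    if glitch_voter is not None:
        outputs[glitch_voter] ^= glitch_mask
    return outputs


state = [0b1010, 0b1010, 0b1010]

# A single upset in one register copy is outvoted on the next clock edge.
upset = [state[0] ^ 0b0100, state[1], state[2]]
healed = tmr_step(upset)

# A transient on one of three voters corrupts one copy; the next cycle repairs it.
after_glitch = tmr_step(state, glitch_voter=0)
repaired = tmr_step(after_glitch)

# The same transient on a single merged voter corrupts all three copies for good.
merged_glitch = tmr_step(state, glitch_voter=0, merged_voters=True)
stuck = tmr_step(merged_glitch, merged_voters=True)

print(healed, after_glitch, repaired, merged_glitch, stuck)
```

In this model the redundancy only helps as long as the three voters stay physically separate, which is the design intent the synthesis constraints have to preserve.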
And once we have the output netlist, we
-
proceed to place and route process where
this kind of netlist representation is
-
mapped to a layout of what will become
soon our digital chip, where we place all
-
the cells and we route connections between
them and here there is
-
another danger which I mentioned already,
it's that in modern technologies the cells
-
are so small that several could easily be
affected by a single particle at the same
-
time. So we have to really space out
the big cells which are responsible for
-
keeping the information about the state to
make sure that a single particle cannot
-
upset, for example, the A and B copies
of the same register. And then in the
-
last step, of course, we have to verify
that everything we have done is
-
correct. And at this level, we also try to
introduce some single event effects in our
-
simulations. So we could randomly flip
bits in our system. We can also inject
-
transients. And typically we used to do
that on the netlist level, which works
-
very well. And it is very nice. But the
problem with this approach is that we can
-
perform these actions very late in the
design cycle, which is less than ideal.
-
And also, if we find that there is a
problem in our simulation, a typical netlist
-
at this level has probably a few orders of
magnitude more lines than our initial RTL
-
code. So tracing back which is the
problematic line of code is not so
-
straightforward at this point. So you can
ask yourself: why not try to inject
-
errors in the RTL design? And the answer
is that it is not so
-
trivial to map the hardware description
language's high level constructs to
-
what will become combinatorial or
sequential logic. So in order to eliminate
-
this problem, we also developed another open
source tool, which allows us to...
-
So we decided to use Yosys, the open
source synthesis tool from Clifford, which
-
was presented at the Congress several
years ago. So we use this tool to make a
-
first pass through our RTL code to
understand which elements will be mapped
-
to sequential and combinatorial. And then
having this information, we will use
-
cocotb, a Python verification
framework, which allows us programmatic
-
access to these nodes and we can
effectively inject the errors in our
-
simulations. And I forgot to mention that
the TMRG tool is also open source. So if
-
you are interested in one of the tools,
please feel free to contact us. And of
-
course, after our simulation is done, then in
the next step we would really tape out. And
-
so we submit our chip to manufacturing and
hopefully a few months later we receive
-
our chip back.
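The idea of randomly flipping bits in simulation can be sketched as a tiny fault-injection campaign. This is plain Python rather than cocotb, and it does not reproduce the speakers' tool; the register width, the function names and the "one flip per run" fault model are assumptions made for illustration. The last function models the layout danger mentioned above, where one particle upsets two copies of the same register.

```python
import random

WIDTH = 8  # hypothetical register width


def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def run_campaign(n_injections, protected, seed=0):
    """Inject one random bit flip per run and count wrong read-outs."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_injections):
        golden = rng.getrandbits(WIDTH)
        copies = [golden] * (3 if protected else 1)
        # Pick the copy and the bit to upset, as a simulator would pick a node.
        target = rng.randrange(len(copies))
        copies[target] ^= 1 << rng.randrange(WIDTH)
        observed = majority(*copies) if protected else copies[0]
        errors += observed != golden
    return errors


def run_double_upset(n_injections, seed=0):
    """Flip the same bit in two of the three copies (cells placed too close)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_injections):
        golden = rng.getrandbits(WIDTH)
        flip = 1 << rng.randrange(WIDTH)
        copies = [golden ^ flip, golden ^ flip, golden]
        errors += majority(*copies) != golden
    return errors


unprotected_errors = run_campaign(1000, protected=False)
tmr_errors = run_campaign(1000, protected=True)
double_errors = run_double_upset(1000)
print(unprotected_errors, tmr_errors, double_errors)
```

Under this simplified fault model every flip corrupts the unprotected register, no single flip gets past the voter, and every double upset of the same bit does, which is why the copies have to be spaced apart in the layout.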
Stefan: All right. So after patiently
-
waiting then for a couple of months while
your chip is in manufacturing and you're
-
spending time on preparing a test setup
and preparing yourself to actually test if
-
your chip works as you expect it to, now
it's probably also a good time to think
-
about how to actually validate or test if
all the measures that you've taken to
-
protect your circuit from radiation
effects actually are effective or if they
-
are not. And so again, we will split this
in two parts. So you will probably want to
-
start with testing for the total ionizing
dose effects. So for the cumulative effect
-
and for that, you typically use X-ray
radiation relatively similar to the one
-
used in medical treatment. So this
radiation is relatively low-energy,
-
which has the upside of not producing any
single event effects, but you can really
-
only accumulate radiation dose and focus
on the accumulating effects. And typically
-
you would use a machine that looks
somewhat like this, a relatively compact
-
thing you can have in your laboratory, and
you can use that to really accumulate
-
large amounts of radiation dose on your
circuit. And then you need some sort of
-
mechanism to verify or to quantify how
much your circuit slows down due to this
-
radiation dose. And if you do that, you
typically end up with a graph such as
-
this one, where on the x axis you have the
radiation dose your circuit was exposed
-
to. And on the y axis, you see that the
frequency has gone down over time and you
-
can use this information to say:
"OK, in my final application, I expect this
-
level of radiation dose, and I can
see that my circuit will still work fine
-
under some given environmental condition
or some operation condition." So this is
-
the test for the first class of effects.
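Reading such a frequency-versus-dose curve against a requirement can be sketched as a simple interpolation. The numbers below are placeholders chosen for illustration only, not measurements from the talk, and the function name, the units (Mrad, MHz) and the required frequency are assumptions.

```python
def dose_at_frequency(doses, freqs, f_required):
    """Linearly interpolate the dose at which the frequency drops to f_required.

    Returns None if the circuit stays at or above f_required over the
    whole measured range.
    """
    points = list(zip(doses, freqs))
    for (d0, f0), (d1, f1) in zip(points, points[1:]):
        if f0 >= f_required > f1:
            return d0 + (f0 - f_required) * (d1 - d0) / (f0 - f1)
    return None


# Placeholder data: accumulated dose in Mrad, maximum frequency in MHz.
doses_mrad = [0, 100, 200, 300, 400]
freqs_mhz = [100.0, 98.0, 94.0, 88.0, 80.0]

# Hypothetical requirement: the circuit must still run at 90 MHz
# after the dose expected in the final application.
expected_dose_mrad = 200
limit = dose_at_frequency(doses_mrad, freqs_mhz, f_required=90.0)
has_margin = limit is None or limit > expected_dose_mrad
print(limit, has_margin)
```

A real qualification would also account for temperature, supply voltage and sample-to-sample variation, which is what "under some given environmental condition" refers to.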
And the test for the second class of
-
effects, the single event effects, is a
bit more involved. So there what you would
-
typically start to do is go for a heavy
ion test campaign. So you would go to a
-
specialized, relatively rare facility. We
have a couple of those in Europe, and it would
-
look perhaps somewhat like this. So it's a
small particle accelerator somewhere.
-
They typically have
different types of heavy ions at their
-
disposal that they can accelerate and then
shoot at your chip that you can place in a
-
vacuum chamber and these ions can deposit
very well known amounts of energy in your
-
circuit and you can use that information
to characterize your circuit. The downside
-
is a bit that these facilities tend to be
relatively expensive to access and also a
-
bit hard to access. So typically you need
to book them a lot of time in advance and
-
that's sometimes not very easy. But what
it offers you is that you can use different types
-
of ions with different energies. You can
really make a very well-defined
-
sensitivity curve similar to the one that
Szymon has described, which you can get from
-
simulations, and really characterize your
circuit for how often any single event
-
effects will appear in the final
application, if there are any remaining
-
effects, if you have left something
unprotected. The problem here is that
-
these particle accelerators typically just
bombard your circuit with like thousands
-
of particles per second and they hit
basically the whole area in a random
-
fashion. So you don't really have a way of
steering those or measuring the position
-
of these particles. So typically you are a
bit in the dark and have to really
-
carefully know the behavior of your
circuit and all the quirks it has even
-
without the radiation to instantly notice
when something has gone wrong. And
-
this is typically not very easy
and you can kind of compare it with having
-
some weird crash somewhere in your
software stack and then having to
-
first take a look and see what actually
has happened. Typically
-
you find something that has not been
properly protected and you see some weird
-
effect on your circuit and then you try to
get a better idea of where that problem
-
actually is located. And the answer for
these types of problems involving position
-
is, of course, always lasers. So we have
two types of laser experiments available
-
that can be used to more selectively probe
your circuit for these problems. The first
-
one being the single photon absorption
laser. And this is relatively
-
simple in terms of setup. You just use a
single laser beam that shoots straight up
-
at your circuit from the back. And while
it does that, it deposits energy all along
-
the silicon and also in the diffusions of
your transistors and is therefore also
-
able to inject energy there, potentially
upsetting a bit of memory or exposing
-
whatever other single event effects you
have. And of course, you can steer this
-
beam across the surface of your chip or
whatever circuit you are testing and then
-
find the sensitive location. The problem
here is that the amount of energy that is
-
deposited is really large due to the fact
that it has to go through the whole
-
silicon until it reaches the transistor.
And therefore it's mostly used to find
-
these destructive effects that really
break something in your circuit. The more
-
clever and somehow beautiful experiment is
the two photon absorption laser experiment
-
in which you use two laser beams of a
different wavelength. And these actually
-
do not have enough energy to cause any
effect in your silicon if only one of the
-
laser beams is present. But in the
small location where the two beams
-
intersect, the energy is actually large
enough to produce the effect. And this
-
allows you to very selectively and only on
a very small volume induce charge and
-
cause an effect in your circuit. And when
you do that now, you can systematically
-
scan both the X and Y directions across
your chip and also the Z direction and can
-
really measure the volume of sensitive
area. And this is what you would typically
-
get out of such an experiment. So in black and
white in the back, you'll see an infrared
-
image of your chip where you can really
make out the individual, say structural
-
components. And then overlaid in blue, you
can basically highlight all the sensitive
-
points that made you measure something you
didn't expect, some weird bit flip in a
-
register or something. And you can really
then go to your layout software and find
-
what is the register or the gate in
your netlist that is responsible for
-
this. And then it's more like operating a
debugger in a software environment,
-
tracing back from there what the line of
code responsible for this bug is. And
-
to close out, it is always best to learn
from mistakes. And we offer our mistakes
-
as a guideline in case you ever feel
the need yourself to design radiation
-
tolerant circuits. So we want to present
two or three small issues we had in
-
circuits where we were convinced they should
have been working fine. So the first one:
-
this you will probably recognize, it is this
full triple modular redundancy scheme that
-
Szymon has presented. So we made sure to
triplicate everything and we were relatively
-
sure that everything should be fine. The
only modification we did is that to all
-
those registers in our design, we've added
a reset, because we wanted to initialize
-
the system to some known state when we
started up, which is a very obvious thing
-
to do. Every CPU has a reset. But of
course, what we didn't think about here
-
was that at some point there's a buffer
driving this reset line somewhere. And if
-
there's only a single buffer, what happens
if this buffer experiences a small
-
transient event? Of course, the obvious
thing that happened is that as soon as
-
that happened, all the registers were
upset at the same time and were basically
-
cleared and all our fancy protection was
invalidated. So next time we decided,
-
let's be smarter this time. And of course,
we triplicate all the logic and all the
-
voters and all the registers. So let's
also triplicate the reset lines. And while
-
the designer of that block probably had
very good intentions, it turned out
-
later, when we manufactured the
chip, that it still sometimes showed a complete
-
reset without any good explanation for
that. And what was left out of the
-
scope of thinking here was that this reset
actually was connected to the system reset
-
of the chip that we had. And typically
pins on the chip are something that is
-
not available in huge quantities. So you
typically don't want to spend three pins
-
of your chip just for a stupid reset that
you don't use ninety nine percent of the
-
time. So what we did at some point: we just
connected the reset lines again to a
-
single input buffer that was then
connected to a pin of the chip. And of
-
course, this also represented a small
sensitive area in the chip. And again,
-
a single upset here was able to destroy
all three of our flip flops. All right.
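Both reset lessons come down to the same single point of failure, which a short behavioral model can make visible. This is a sketch under assumptions, not the actual circuit: the names `clock` and `reset_lines`, the synchronous reset to zero and the register value are all hypothetical.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def clock(regs, resets):
    """One clock cycle of a triplicated register with one reset input per copy."""
    voted = majority(*regs)
    return [0 if rst else voted for rst in resets]


def reset_lines(glitch_on, shared_buffer):
    """Reset seen by copies A, B, C when the buffer with index glitch_on glitches.

    With a shared buffer (or a single input pad), a glitch reaches all copies.
    glitch_on=None means no glitch in this cycle.
    """
    if shared_buffer:
        return [glitch_on is not None] * 3
    return [i == glitch_on for i in range(3)]


state = [0xA5, 0xA5, 0xA5]

# Triplicated reset path: the glitch clears copy A only, the voters restore it.
hit_tmr = clock(state, reset_lines(0, shared_buffer=False))
next_tmr = clock(hit_tmr, reset_lines(None, shared_buffer=False))

# Single reset buffer or pad: the glitch clears all three copies at once.
hit_shared = clock(state, reset_lines(0, shared_buffer=True))
next_shared = clock(hit_shared, reset_lines(None, shared_buffer=True))

print(hit_tmr, next_tmr, hit_shared, next_shared)
```

The voters can only repair a copy when the other two still hold the correct value, so any signal that fans out to all three copies from one node bypasses the protection.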
-
And the last lesson I'm bringing, the
last thing, goes back to the
-
implementation details that Szymon has
mentioned. So this time, really simple
-
circuit. We were absolutely convinced it
must work because it was basically the
-
textbook example that Szymon was
presenting. And the code was so
-
small we were able to inspect everything
and were very much sure that nothing
-
should have happened. And what we saw when
we went for this laser testing experiment,
-
in simplified form, is basically that
only this first voter was sensitive. And when this was
-
hit, all our registers were always
upset, while the other ones
-
never showed anything strange.
And it took us quite a while to actually
-
look at the layout later on and figure out
that what was in the chip was rather this.
-
So two of the voters were actually not
there. And Szymon mentioned the reason for
-
that. So synthesis tools these days are
really clever at identifying redundant
-
logic and because we forgot to tell it to
not optimize these redundant pieces of
-
logic, which the voters really are, it
just merged them into one. And that
-
explains why we only saw this one voter
being the sensitive one. And of course, if
-
you have a transient event there, then you
suddenly upset all your registers and that
-
without even knowing it, and while being
sure, having looked at every single line
-
of Verilog code, that
everything should have been fine. But that
-
seems to be how this business goes. So we
hope you
-
were able to get some insight into what
we do to make sure the experiments at the
-
LHC work fine, and what you can do to
make sure the satellite you are working on
-
might be working OK, even before launching
it into space. If you're interested in
-
some more information on this topic, feel
free to pass by at the assembly I
-
mentioned at the beginning or just meet us
after the talk and otherwise thank you
-
very much.
Applause
-
Herald: Thank you very much indeed.
There's about 10 minutes left for Q and A,
-
so if you have any questions go to a
microphone. And as a cautious reminder,
-
questions are short sentences that
start with a question... well, end with a
-
question mark, and the first question goes
to the Internet.
-
Internet: Well, hello. Um, do you also
incorporate radiation as the source for
-
randomness when that's needed?
Stefan: So we personally don't. So in our
-
designs we don't. But it is done indeed
for random number generators. It is
-
sometimes done that they use radioactive
decay as a source of randomness. So this
-
is done, but we don't do it in our
experiments.
-
We rather want deterministic data out of
the things we built.
-
Herald: Okay. Next question goes to
microphone number four.
-
Mic 4: Do you do your triplication before
or after elaboration?
-
Szymon: So currently we do it before
elaboration. So we decided that our tool
-
works on Verilog input and it produces
Verilog output, because it offers much more
-
flexibility in how you can
incorporate different triplication
-
schemes. If you were to apply it only
after elaboration, then of course doing a
-
full triplication might be easy. But then
having really precise control
-
over the types of triplication on different
levels is much more difficult.
-
Herald: Next question from microphone
number two.
-
Mic 2: Is it possible to use DC-DC
converters or switch mode power supplies
-
within the radiation environment to power
your logic? Or do you use only linear power?
-
Szymon: Yes, actually we also have a
dedicated program which develops radiation
-
hardened DC-DC converters which operate
in our environments. So they are available
-
also for space applications, as far as I'm
aware. And they are hardened against total
-
ionizing dose as well as single event
upsets.
-
Herald: Okay next question goes to
microphone number one.
-
Mic 1: Thank you very much for the great
talk. I'm just wondering, would it be
-
possible to hook up every logic gate and
every voter in a kind of mesh network? And
-
what are the pitfalls and limitations for
that?
-
Stefan: So that is not something I'm aware
of being done. So typically: No. I
-
wouldn't say that that's something we
would do.
-
Szymon: I'm not really sure if I
understood the question.
-
Stefan: So maybe you can rephrase what
your idea is?
-
Mic 1: On the last slide, there was a
lesson learned.
-
Stefan: Yes. One of those?
Mic 1: In here. Yeah. Would you be able to
-
connect everything interchangeably in a
mesh network?
-
Szymon: So what you are probably asking
about is whether we can build our own
-
FPGA-like programmable logic device.
Mic 1: Probably.
-
Szymon: Yeah. And so this we typically
don't do, because in our experiments, our
-
power budget is also very limited, so we
cannot really afford this level of
-
complexity. So of course you can make your
FPGA design radiation hard, but this is
-
not what we will typically do in our
experiments.
-
Herald: Next question goes to microphone
number two.
-
Mic 2: Hi, I would like to ask if the
orientation of your transistors and your
-
chip is part of your design. So mostly you
have something like a bounding box around
-
your design and with an attack surface in
different sizes. So do you use this
-
orientation to minimize the attack surface
of the radiation on chips, if you know
-
the source of the radiation?
Szymon: No. So I don't think we'd do that.
-
So, of course, we control our orientation
of transistors during the design phase.
-
But usually in our experiment, the
radiation is really perpendicular to the
-
chip area, which means that if you rotate
it by 90 degrees, you don't really gain
-
that much. And moreover, our chips,
usually they are mounted in a bigger
-
system where we don't control how they are
oriented.
-
Herald: Again, microphone number two.
Mic 2: Do you take metastability into
-
account when designing voters?
Szymon: The voter itself is combinatorial.
-
So ... -
Mic 2: Yeah, but if the state of the rest
-
can change at any time, then the
voters can have like glitches, yeah?
-
Szymon: Correct. So that's why - so to
avoid this, we don't take it into account
-
during the design phase. But if we use
that scheme which is just displayed here,
-
we avoid this problem altogether, right?
Because even if you have metastability in
-
one of the blocks like A, B or C, then it
will be fixed in the next clock cycle.
-
Because usually our systems operate on
clocks with low frequencies, hundreds of
-
megahertz, which means that any
metastability should be resolved by the next
-
clock cycle.
Mic 2: Thank you.
-
Herald: Next question microphone number
one.
-
Mic 1: How do you handle the register
duplication that can be performed by a
-
synthesis and place and route? So the tools
will try to optimize timing sometimes by
-
adding registers. And these registers are
not triplicated.
-
Stefan: Yes. So what we do is that I mean,
in a typical, let's say, standard ASIC
-
design flow, this is not what happens. So
you have to actually instruct the tool to do
-
that, to do retiming and add additional
registers. But for what we are doing, we
-
have to, let's say, not do this
optimization and instruct the tool to keep
-
all the registers we described in our RTL
code until the very end. And
-
we really also constrain them to always
keep their associated logic triplicated.
-
Herald: The next question is from the
internet.
-
Internet: Do you have some simple tips for
improving radiation tolerance?
-
Stefan: Simple tips? Ahhhm...
Szymon: Put your electronics inside a
-
box.
Stefan: Yes.
-
some laughter
There's just no
-
single one size fits all textbook recipe
for this as it really always comes down to
-
analyzing your environment, really getting
an awareness first of what rate and what
-
number of events you are looking at, what
type of particles cause them, and then
-
take the appropriate measures to mitigate
them. So there is no one size fits all
-
thing, I'd say.
Herald: Next question goes to microphone
-
number two.
Mic 2: Hi. Thanks for the talk. How much
-
of the software you use for design is
actually open source? I only know of super
-
expensive chip design software.
Stefan: You're right, the core of all the
-
implementation tools, like the synthesis
and place and route stages for the ASICs
-
that we design, is actually commercial
closed source tools. And if
-
you're asking for the fraction, that's a
bit hard to answer. I cannot give a
-
statement about the size of the commercial
closed tools. But for
-
everything we develop, we try to make it
available to the widest possible audience
-
and therefore decided to make the
extensions to this design flow available
-
in public form. And that's why these
tools that we develop and share among the
-
community of ASIC designers in this
environment are open source.
-
Herald: Microphone number four.
Mic 4: Have you ever tried using steered
-
ion beams for more localized radiation
ingress testing?
-
Stefan: Yes, indeed! And the picture I
showed actually, uh, I didn't disclose
-
that, but the facility you saw here is
actually a facility in Darmstadt in
-
Germany and is actually a micro beam
facility. So it's a facility that allows
-
steering a heavy ion beam really on a
single position with less than a
-
micrometer accuracy. So it provides
probably exactly what you were asking for.
-
But that's not the typical case. That is
really a special thing. And it's probably
-
also the only facility in Europe that can
do that.
-
Herald: Microphone number one.
Mic 1: That was a very, very good talk. Thank
-
you very much. My question is, did you
compare what you did to what is done for
-
securing secret chips? You know, when you
have credit card chips, you can make fault
-
attacks into them so you can make them
malfunction and extract the cryptographic
-
key for example from the banking card.
There are techniques here to harden these
-
chips against fault attacks. So which are
like voluntary faults while you have like
-
random faults due to, like, involuntary
attacks, you know? Can you explain if
-
you compared in a way what you did to
this?
-
Stefan: Um, so no, we didn't explicitly
compare it, but it is right that the
-
techniques we present can also be used in
a variety of different contexts. So one
-
thing that's not exactly what you are
referring to, but relatively on a similar
-
scale is that currently in very small
technologies you get problems with the
-
reliability and yield of the manufacturing
process itself, meaning that sometimes
-
just the metal interconnection between two
gates in your circuit might be broken
-
after manufacturing, and then adding this
sort of redundancy with the same kinds of
-
techniques can be used to make, to
produce more working chips out of a
-
manufacturing run. So in this sort of
context, these sorts of techniques are
-
used very often these days. But, um,
I'm pretty sure they can be applied to
-
these sorts of, uh, security fault attack
scenarios as well.
-
Herald: Next question from microphone
number two.
-
Mic 2: Hi, you briefly also mentioned the
mitigation techniques on the cell level
-
and yesterday there was a very nice talk
from the Libre Silicon people and they
-
are trying to build a standard cell
library, uh, open source standard cell
-
library. So are you in contact with them
or maybe you could help them to improve
-
their design and then the radiation
hardness?
-
Stefan: No. We also saw the talk
yesterday, but we are not yet in
-
contact with them. No.
Herald: Does the Internet have questions?
-
Internet: Yes, I do. Um, two in fact.
First one would be would TTL or other BJT
-
based logic be more resistant?
Szymon: Uh, yeah. So depending on which
-
type of errors we are considering. So BJT
transistors, they have ...
-
Stefan in his part mentioned that
displacement damage is not a problem for
-
CMOS devices, but this is not the case
for BJT devices. So when they are exposed
-
to high energy hadrons or protons,
they degrade a lot. So that's why we don't
-
really use them in our environment. They
could be probably much more robust to
-
single event effects because their
resistance everywhere is much lower. But
-
they would have other problems. And also
another problem which is worth
-
mentioning is that for those devices, they
consume much, much, much more power, which
-
we cannot afford in our applications.
Internet: And the last one would be how do
-
I use the output of the full TMR setup? Is
it still three signals? How do I know
-
which one to use and to trust?
Stefan: Um, yes. So with this, um,
-
architecture, what you could either do is
really do the full triplication scheme
-
to your whole logic tree basically and
really triplicate everything or, and
-
that's going in the direction of one of
the lessons learned I had, at some point
-
of course you have an interface to your
chip, so you have pins left and right that
-
are inputs and outputs. And then you have
to decide either you want to spend the
-
effort and also have three dedicated input
pins for each of the signals, or you at
-
some point have the voter and say, okay.
At this point, all these signals are
-
combined. But I was able to reduce the
amount of sensitive area in my chip
-
significantly and can live with the very
small remaining sensitive area that just
-
the input and output pins provide.
Szymon: So maybe I will add one more thing
-
is that typically in our systems, of
course we triplicate our logic internally,
-
but when we interface with external
world, we can apply another protection
-
mechanism. So for example, for our high
speed serialisers, we will use different types
-
of encoding to add protect...,
to add like forward error correction
-
codes which would allow us to recover from these
types of faults in the backend later on.
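As an illustration of what a forward error correction code does, here is a textbook Hamming(7,4) encoder and decoder in Python. The talk does not say which codes are used on the serial links, so this specific code is an assumption chosen only because it is small; it corrects any single flipped bit in a 7-bit codeword.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are numbered 1..7; parity bits sit at positions 1, 2, 4.
    Bit i of the returned integer holds position i + 1.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_decode(word):
    """Correct up to one flipped bit and return the 4 data bits."""
    bits = [(word >> i) & 1 for i in range(7)]
    # The XOR of the positions of all set bits is 0 for a valid codeword
    # and equals the position of the error after a single bit flip.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome - 1] ^= 1
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(bit << i for i, bit in enumerate(data))


# Every single-bit error in every codeword is corrected by the receiver.
all_recovered = all(
    hamming74_decode(hamming74_encode(n) ^ (1 << bit)) == n
    for n in range(16)
    for bit in range(7)
)
print(all_recovered)
```

The correction happens at the receiving end, which matches the idea of recovering from faults "in the backend" instead of protecting every output pin on the chip.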
-
Herald: Okay. If ...if we keep this very,
very short. Last question goes to
-
microphone number two.
Mic 2: I don't know much about physics. So
-
just the question, how important is the
physical testing after the chip is
-
manufactured? Isn't the simulation, the
computer simulation enough if you just
-
shoot particles at it?
Stefan: Yes and no. So in principle, of
-
course, you are right that you should be
able to simulate all the effects we look
-
at. The problem is that as the designs
grow big and they do grow bigger as the
-
technologies shrink,
this final netlist that you end up with
-
can have millions or billions of nodes and
it just is not feasible anymore to
-
simulate it exhaustively, because you
have so many dimensions you have to
-
vary when you inject, for example, bit
flips or transients in your design in any
-
of those nodes for varying time offsets.
And the state space the circuit
-
can be in is just too huge to capture
in a full simulation. So it's not possible
-
to exhaustively test it in simulation. And
so typically you end up with having missed
-
something that you discover only in the
physical testing afterwards, which you
-
always want to do before you put your, uh,
your chip into the final experiment or on your
-
satellite and then realise it's not
working as intended. So it has a big
-
importance as well.
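A rough count shows why exhaustive injection does not scale. All numbers here are assumptions picked for illustration (one million nodes, one thousand injection time offsets, two fault types, one second per simulation run); the function names are hypothetical.

```python
def injection_runs(nodes, time_offsets, fault_types=2):
    """Number of single-fault simulations needed to cover every combination."""
    return nodes * time_offsets * fault_types


def years_to_simulate(runs, seconds_per_run):
    """Wall-clock time for the campaign on a single simulator instance."""
    return runs * seconds_per_run / (365 * 24 * 3600)


# Assumed design size and campaign parameters, not figures from the talk.
runs = injection_runs(nodes=1_000_000, time_offsets=1_000)
years = years_to_simulate(runs, seconds_per_run=1.0)
print(runs, round(years, 1))
```

Even this count ignores the internal state the circuit is in when the fault arrives, which multiplies the space further and is why random sampling plus physical testing are used instead.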
Herald: Okay. Thank you. Time is up. All
-
right. Thank you all very much.
-
applause
-
36c3 postroll music
-
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!