WEBVTT

00:00:00.000 --> 00:00:18.406
<i>36C3 intro music</i>

00:00:18.406 --> 00:00:22.640
Herald: The next talk will be titled 'How
to Design Highly Reliable Digital

00:00:22.640 --> 00:00:26.472
Electronics', and it will be delivered to
you by Szymon and Stefan. Warm applause

00:00:26.472 --> 00:00:30.199
for them.

00:00:30.199 --> 00:00:36.360
<i>applause</i>

00:00:36.360 --> 00:00:41.360
Stefan: All right. Good morning, Congress.
So perhaps every one of you in the room

00:00:41.360 --> 00:00:45.600
here has at one point or another in their
lives witnessed their computer behaving

00:00:45.600 --> 00:00:50.320
weirdly and doing things that it was not
supposed to do or what you didn't

00:00:50.320 --> 00:00:54.400
anticipate it to do. And well, typically
that would have probably been the result

00:00:54.400 --> 00:01:00.000
of a software bug of some sort somewhere
inside the huge software stack your PC is

00:01:00.000 --> 00:01:04.720
running on. Have you ever considered what
the probability of this weird behavior

00:01:04.720 --> 00:01:09.120
being caused by a bit flipped somewhere in
the memory of your computer might have

00:01:09.120 --> 00:01:16.240
been? So what you can see in this video on
the screen now is a physics experiment

00:01:16.240 --> 00:01:20.720
called a cloud chamber. It's a very simple
experiment that is actually able to

00:01:20.720 --> 00:01:26.560
visualize and make apparent all the
constant stream of background radiation we

00:01:26.560 --> 00:01:32.640
all are constantly exposed to. So what's
happening here is that highly energetic

00:01:32.640 --> 00:01:39.040
particles, for example from space,
trace through gaseous alcohol and they

00:01:39.040 --> 00:01:42.160
collide with alcohol molecules and they
form in this process a trail of

00:01:42.160 --> 00:01:48.240
condensation while they do that. And if
you think about your computer, a typical

00:01:48.240 --> 00:01:53.200
cell of RAM, of which you might have, I
don't know, 4, 8, 10 gigabytes in your

00:01:53.200 --> 00:01:58.400
machine, is only 80 nanometers
wide. So it's very, very tiny. And you

00:01:58.400 --> 00:02:02.560
probably are able to appreciate the small
amount of energy that is needed or that is

00:02:02.560 --> 00:02:08.480
used to store the information inside each
of those bits. And the sheer number of

00:02:08.480 --> 00:02:12.560
those bits you have in your RAM and your
computer. So a couple of years ago, there

00:02:12.560 --> 00:02:17.600
was a study that concluded that in a
computer with about four gigabytes of RAM,

00:02:17.600 --> 00:02:23.600
a bit flip caused by such an event, by
cosmic background radiation can occur

00:02:23.600 --> 00:02:29.200
about once every 33 hours. So a
bit less than one per day. In an

00:02:29.200 --> 00:02:34.960
incident in 2008, a Qantas Airlines
flight actually nearly crashed, and the

00:02:34.960 --> 00:02:40.080
reason for this was traced back, very
likely, to a bit flipped

00:02:40.080 --> 00:02:44.400
somewhere in one of the CPUs of the
avionics system and nearly caused the

00:02:44.400 --> 00:02:50.480
death of a lot of passengers on this
plane. In 2003, in Belgium, a small

00:02:50.480 --> 00:02:56.880
municipal vote actually had a weird hiccup
in which one of the candidates in this

00:02:56.880 --> 00:03:02.153
election actually got 4096 more votes added in a single instance.
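
NOTE
A minimal sketch of why the 4096 figure points at a single bit: 4096 equals 2 to the power 12, so inverting bit 12 of a counter whose bit 12 was 0 adds exactly 4096. The count 514 and the helper name flip_bit are hypothetical and only illustrate the arithmetic.
```python
def flip_bit(value, bit):
    """Return value with the given bit position inverted (XOR with a one-hot mask)."""
    return value ^ (1 << bit)
true_count = 514  # hypothetical vote count; bit 12 is initially 0
corrupted = flip_bit(true_count, 12)
print(corrupted - true_count)  # prints 4096
```
Flipping the same bit a second time restores the original value, which is why such an error leaves no other trace in the stored number.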

00:03:02.153 --> 00:03:06.480
And that was traced back, very likely, to
cosmic background radiation

00:03:06.480 --> 00:03:10.000
flipping a memory cell somewhere that
stored the vote count. And it was only

00:03:10.000 --> 00:03:14.560
discovered that this happened because this
number of votes for this particular

00:03:14.560 --> 00:03:18.880
candidate was considered unreasonable, but
otherwise it would probably have slipped through

00:03:18.880 --> 00:03:27.360
without being detected. So a few words
about us: Szymon and I, we both work at

00:03:27.360 --> 00:03:32.480
CERN in the microelectronics section and
we both develop electronics that need to

00:03:32.480 --> 00:03:37.360
be tolerant to these sorts of effects. So
we develop radiation tolerant electronics

00:03:37.360 --> 00:03:42.846
for the experiments at CERN, at the LHC.
Among a lot of other applications, you can

00:03:42.846 --> 00:03:48.330
meet the two of us at the Lötlabor Jena
assembly if you are interested in what we

00:03:48.330 --> 00:03:55.847
are talking about today. And we will also
give a small talk or a small workshop

00:03:55.847 --> 00:03:59.190
about radiation detection tomorrow, in one
of the seminar rooms. So feel free to pass

00:03:59.190 --> 00:04:02.544
by there, it will be a quick introduction.
To give you a small idea of what kind of

00:04:02.544 --> 00:04:08.541
environment we are working for: So if you
would use one of your default Intel i7

00:04:08.541 --> 00:04:14.294
CPUs from your notebook and would put it
anywhere where we operate our electronics,

00:04:14.294 --> 00:04:19.632
it would very shortly die in a matter of
probably one or two minutes and it would

00:04:19.632 --> 00:04:24.626
die for more than just one reason, which
is rather interesting and compelling. So

00:04:24.626 --> 00:04:30.985
the idea for today's talk is to give you
all an insight into all the things that

00:04:30.985 --> 00:04:34.575
need to be taken into account when you
design electronics for radiation

00:04:34.575 --> 00:04:39.152
environments. What kinds of different
challenges come when you try to do that.

00:04:39.152 --> 00:04:43.116
We classify and explain the different
types of radiation effects that exist. And

00:04:43.116 --> 00:04:47.617
then we also present what you can do to
mitigate these effects and also validate

00:04:47.617 --> 00:04:52.116
that what you did to care for them or
protect your circuits actually worked. And

00:04:52.116 --> 00:04:57.477
of course, as we do that, we'll try to
give our view on how we develop radiation

00:04:57.477 --> 00:05:03.257
tolerant electronics at CERN and what our
workflow looks like to make sure this

00:05:03.257 --> 00:05:08.272
works. So let's first maybe take a step
back and have a look at what we mean when

00:05:08.272 --> 00:05:12.997
we say radiation environments. The first
one that you probably have in mind right

00:05:12.997 --> 00:05:19.044
now when you think about radiation is
space. So, this interstellar space is

00:05:19.044 --> 00:05:24.292
basically filled with very high-speed,
highly energetic electrons and protons and

00:05:24.292 --> 00:05:28.716
all sorts of high energy particles. And
while they, for example, traverse close to

00:05:28.716 --> 00:05:34.513
planets such as our Earth - these planets
sometimes do have a magnetic field and the

00:05:34.513 --> 00:05:39.317
highly energetic particles are actually
deflected by these magnetic fields and

00:05:39.317 --> 00:05:43.824
they can protect planets such as our
own, for example, from this highly

00:05:43.824 --> 00:05:47.986
energetic radiation. But in the process,
around these planets they sometimes

00:05:47.986 --> 00:05:52.107
form these radiation belts - known as the
Van Allen belts after the guy who

00:05:52.107 --> 00:05:56.043
discovered this effect a long time ago.
And a satellite in space as it orbits

00:05:56.043 --> 00:06:01.620
around the Earth might, depending on what
orbit is chosen, sometimes go through

00:06:01.620 --> 00:06:05.647
these belts of highly intense radiation.
That, of course, then needs to be taken

00:06:05.647 --> 00:06:11.552
into account when designing electronics
for such a satellite. And if Earth itself

00:06:11.552 --> 00:06:17.191
is not able to give you enough radiation,
you may think of the very famous Juno

00:06:17.191 --> 00:06:22.874
Jupiter mission that became famous
about a year ago. In the

00:06:22.874 --> 00:06:28.288
environment of Jupiter they anticipated so
much radiation that they actually decided

00:06:28.288 --> 00:06:33.408
to put all the electronics of the
satellite inside a one centimeter thick

00:06:33.408 --> 00:06:39.831
cube of titanium, which is famously known
as the Juno Radiation Vault. But not only

00:06:39.831 --> 00:06:43.870
space offers radiation environments.
Another form of radiation you will probably all

00:06:43.870 --> 00:06:48.292
recognize when I show you this
picture, which is an X-ray image of a

00:06:48.292 --> 00:06:54.936
hand. And X-ray is also considered a form
of radiation. And while, of course, the

00:06:54.936 --> 00:07:01.320
doses or amounts of radiation any patient
is exposed to while doing diagnosis or

00:07:01.320 --> 00:07:05.801
treatment of some disease, that might not
be the full story when it comes to medical

00:07:05.801 --> 00:07:10.220
applications. So this is a medical
particle accelerator which is used for

00:07:10.220 --> 00:07:15.288
cancer treatment. And in these sorts of
accelerators, typically carbon ions or

00:07:15.288 --> 00:07:20.389
protons are accelerated and then focused
and used to treat and selectively destroy

00:07:20.389 --> 00:07:25.302
cancer cells in the body. And this comes
already relatively close to the

00:07:25.302 --> 00:07:29.695
environment we are working in and working
for. So Szymon and I are working, for

00:07:29.695 --> 00:07:36.616
example, on electronics for the CMS
detector inside the LHC, for which we build

00:07:36.616 --> 00:07:43.906
dedicated, radiation tolerant, integrated
circuits which have to withstand very,

00:07:43.906 --> 00:07:49.373
very large amounts and doses of short-lived
radiation in order to function

00:07:49.373 --> 00:07:54.414
correctly. And if we didn't specifically
design electronics for that, basically the

00:07:54.414 --> 00:08:01.893
whole system would never be able to work.
To illustrate a bit how you can imagine

00:08:01.893 --> 00:08:06.062
the scale of this environment: This is a
single plot of a collision event that was

00:08:06.062 --> 00:08:11.161
recorded in the ATLAS experiment. And each
of those tiny little traces you can make

00:08:11.161 --> 00:08:15.997
out in this diagram is actually either one
or multiple secondary particles that were

00:08:15.997 --> 00:08:22.166
created in the initial collision of two
proton bunches inside the experiment. And

00:08:22.166 --> 00:08:27.501
each of those, of course, races through
the detector electronics, which make these

00:08:27.501 --> 00:08:32.817
traces visible, itself then decaying into
multiple other secondary particles which

00:08:32.817 --> 00:08:37.856
all go through our electronics. And if
that doesn't sound, let's say, bad enough

00:08:37.856 --> 00:08:42.576
for digital electronics, these collisions
happen about 40 million times a second. Of

00:08:42.576 --> 00:08:47.608
course, multiplying the number of events
or problems they can cause in our

00:08:47.608 --> 00:08:54.608
circuits. So we now want to introduce all
the things that can happen, the different

00:08:54.608 --> 00:08:59.570
radiation effects. But first, probably we
take a step back and look at what we mean

00:08:59.570 --> 00:09:05.805
when we say digital electronics or digital
logic, which we want to focus on today. So

00:09:05.805 --> 00:09:11.058
from your university lectures or your
reading, you probably know the first class

00:09:11.058 --> 00:09:14.577
of digital logic, which is the
combinatorial logic. So this is typically

00:09:14.577 --> 00:09:19.222
logic that just does a simple linear
relation of the inputs of a circuit and

00:09:19.222 --> 00:09:23.956
produces an output as exemplified with
these AND, OR, NAND and XOR gates that you

00:09:23.956 --> 00:09:28.829
see here. But if you want to build - I
mean even though we use those everywhere

00:09:28.829 --> 00:09:32.775
in our circuits - you probably also want
to store state in a more complex circuit,

00:09:32.775 --> 00:09:37.857
for example, in the registers of your CPU
they store some sort of internal

00:09:37.857 --> 00:09:41.736
information. And for that we use the other
class of logic, which is called the

00:09:41.736 --> 00:09:44.726
sequential logic. So this is typically
clocked with some system clock frequency

00:09:44.726 --> 00:09:50.883
and it changes its output with relation to
the inputs whenever this clock signal changes.
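
NOTE
A minimal sketch of the two logic classes just described, in Python rather than hardware: nand is combinational (its output depends only on the current inputs), while the hypothetical Register class is sequential (its stored value changes only when clock_edge is called). All names here are illustrative assumptions.
```python
def nand(a, b):
    """Combinational logic: the output depends only on the current inputs."""
    return not (a and b)
class Register:
    """Sequential logic: the stored value q changes only on a clock edge."""
    def __init__(self):
        self.q = False  # internal state, initially low
    def clock_edge(self, d):
        """Capture input d into the register, as a flip-flop does on a clock edge."""
        self.q = d
        return self.q
reg = Register()
reg.clock_edge(nand(True, False))  # state is updated only by this call
print(reg.q)  # prints True
```
Between calls to clock_edge the register keeps its value regardless of what the combinational function computes, which is the property that makes stored state a target for bit flips.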

00:09:50.883 --> 00:09:54.263
And now if we look at how all
these different logic functionalities are

00:09:54.263 --> 00:09:58.292
implemented. So typically nowadays for
that you may know that we use CMOS

00:09:58.292 --> 00:10:02.340
technologies and basically represent all
this logic functionality as digital gates

00:10:02.340 --> 00:10:10.558
using small P-MOS and N-MOS MOSFET
transistors in CMOS technologies. And if

00:10:10.558 --> 00:10:16.408
we kind of try to build a model for more
complex digital circuits, we typically use

00:10:16.408 --> 00:10:21.814
something we call the finite state machine
model, in which we use a model that

00:10:21.814 --> 00:10:25.822
consists of a combinatorial and a
sequential part. And you can see that the

00:10:25.822 --> 00:10:31.031
output of the circuit depends both on the
internal state inside the register as well

00:10:31.031 --> 00:10:35.331
as also the input to the combinatorial
logic. And accordingly, also the state

00:10:35.331 --> 00:10:40.924
that is internal is always changed by the
inputs as well as the current state. So

00:10:40.924 --> 00:10:44.604
this is kind of the simple model for more
complex systems that can be used to model

00:10:44.604 --> 00:10:50.214
different effects. Now let's try to
actually look at what the radiation can do

00:10:50.214 --> 00:10:53.948
to transistors. And for that we are going
to have a quick recap at what the

00:10:53.948 --> 00:10:57.895
transistor actually is and how it looks
like. As you may perhaps know is that in

00:10:57.895 --> 00:11:03.736
CMOS technologies, transistors are built
on wafers of high purity silicon. So this

00:11:03.736 --> 00:11:09.074
is a crystalline, very regularly organized
lattice of silicon atoms. And what we do

00:11:09.074 --> 00:11:14.092
to form a transistor on such a wafer is
that we add dopants. So in order to form

00:11:14.092 --> 00:11:19.629
diffusion regions, which later will become
the source and drain of our transistors.

00:11:19.629 --> 00:11:24.474
And then on top of that we grow a layer of
insulating oxide. And on top of that we

00:11:24.474 --> 00:11:28.713
put polysilicon, which forms the gate
terminal of the transistor. And in the end

00:11:28.713 --> 00:11:32.813
we end up with an equivalent circuit a bit
like that. And now to put things back into

00:11:32.813 --> 00:11:37.670
perspective - you may also note that the
dimensions of these structures are very

00:11:37.670 --> 00:11:42.543
tiny. So we talk about tens of nanometers
for some of the dimensions I've outlined

00:11:42.543 --> 00:11:47.958
here. And as the technologies shrink,
these become smaller and smaller and

00:11:47.958 --> 00:11:52.284
therefore you'll probably also realize or
are able to appreciate the small amount of

00:11:52.284 --> 00:11:56.560
energy that is used to store information
inside these digital circuits, which makes

00:11:56.560 --> 00:12:02.390
them perhaps more sensitive to radiation.
So let's take a look. What different types

00:12:02.390 --> 00:12:08.385
of radiation effects exist? We typically
in this case, differentiate them into two

00:12:08.385 --> 00:12:13.268
main classes of events. The first one
would be the cumulative effects, which are

00:12:13.268 --> 00:12:17.362
effects that, as the name implies,
accumulate over time. So as the circuit is

00:12:17.362 --> 00:12:22.127
placed inside some radiation environment,
over time it accumulates more and more

00:12:22.127 --> 00:12:26.969
dose and therefore worsens its performance
or changes how it operates. And on the

00:12:26.969 --> 00:12:30.549
other side, we have the Single Event
Effects, which are always events that

00:12:30.549 --> 00:12:35.075
happen at some instantaneous point in
time, and then suddenly, without being

00:12:35.075 --> 00:12:39.316
predictable, change how the circuit
operates or how it functions or if it

00:12:39.316 --> 00:12:43.931
works in the first place or not. So I'm
going to first go into the class of

00:12:43.931 --> 00:12:47.685
cumulative effects and then later on,
Szymon will go into the other class of the

00:12:47.685 --> 00:12:53.173
Single Event Effects. So in terms of these
accumulating effects, we basically have

00:12:53.173 --> 00:12:57.580
two main subclasses: The first one being
ionization or TID effects, for Total

00:12:57.580 --> 00:13:02.033
Ionizing Dose - and the second one being
displacement damage. So displacement

00:13:02.033 --> 00:13:07.137
damage does exactly what it sounds like.
It is all the effects that happen when an

00:13:07.137 --> 00:13:11.249
atom in the silicon lattice is actually
displaced, so removed from its lattice

00:13:11.249 --> 00:13:15.266
position and actually changes the
structure of the semiconductor. But

00:13:15.266 --> 00:13:19.548
luckily, these effects don't have a big
impact in the CMOS digital circuits that

00:13:19.548 --> 00:13:23.164
we are looking at today. So we will
disregard them for the moment and we'll be

00:13:23.164 --> 00:13:28.120
looking more at the ionization damage, or
TID. So ionization - as a quick recap - is

00:13:28.120 --> 00:13:35.901
whenever electrons are removed from or added to
an atom, effectively transforming it into

00:13:35.901 --> 00:13:42.747
an ion. And these effects are especially
critical for the circuits we are building

00:13:42.747 --> 00:13:46.316
because what they do is that they
change the behavior of the transistors.

00:13:46.316 --> 00:13:50.233
And without looking too much into the
semiconductor details, I just want to show

00:13:50.233 --> 00:13:55.730
their typical effect that we are concerned
about in this very simple circuit here. So

00:13:55.730 --> 00:14:00.348
this is just an inverter circuit
consisting of two transistors here and

00:14:00.348 --> 00:14:05.812
there. And what the circuit does in normal
operation is it just takes an input signal

00:14:05.812 --> 00:14:10.062
and inverts it, basically giving the
inverted signal at the output. And as the

00:14:10.062 --> 00:14:15.549
transistors are irradiated and accumulate
dose, you can see that the edges of the

00:14:15.549 --> 00:14:20.391
output signal get slower. So the
transistor takes longer to turn on and off.

00:14:20.391 --> 00:14:24.574
And what that does in turn is that it
limits the maximum operation frequency of

00:14:24.574 --> 00:14:28.795
your circuit. And of course, that is not
something you want to do. You want your

00:14:28.795 --> 00:14:31.723
circuit to operate at some frequency in
your final system. And if the maximum

00:14:31.723 --> 00:14:35.600
frequency it can work at degrades over
time, at some point it will fail as the

00:14:35.600 --> 00:14:39.276
maximum frequency is just too low. So
let's have a look at what we can do to

00:14:39.276 --> 00:14:44.395
mitigate these effects. The first one and
I already mentioned it when talking about

00:14:44.395 --> 00:14:48.488
the Juno mission, is shielding. So if you
can actually put a box around your

00:14:48.488 --> 00:14:52.586
electronics and shield any radiation from
actually hitting your transistors, it is

00:14:52.586 --> 00:14:56.900
obvious that they will last longer and
will suffer less from the radiation damage

00:14:56.900 --> 00:15:01.241
than they otherwise would. So this
approach is very often used in space

00:15:01.241 --> 00:15:04.988
applications like on satellites, but it's
not very useful if you are actually trying

00:15:04.988 --> 00:15:08.209
to measure the radiation with your
circuits as we do, for example, in the

00:15:08.209 --> 00:15:12.415
particle accelerators we build integrated
circuits for. So there first of all, we

00:15:12.415 --> 00:15:16.344
want to measure the radiation so we cannot
shield our detectors from the radiation.

00:15:16.344 --> 00:15:20.592
And also, we don't want to influence the
tracks of these secondary collision

00:15:20.592 --> 00:15:24.162
products with any shielding material that
would be in the way. So this is not very

00:15:24.162 --> 00:15:28.315
useful in a particle accelerator
environment, let's say. So we have to

00:15:28.315 --> 00:15:33.880
resort to different methods. So as I said,
we do have to design our own integrated

00:15:33.880 --> 00:15:38.826
circuits in the first place. So we have
some freedom in what we call transistor

00:15:38.826 --> 00:15:45.236
level design. So we can actually alter the
dimensions of the transistors. We can make

00:15:45.236 --> 00:15:50.055
them larger to withstand larger doses of
radiation and we can use special

00:15:50.055 --> 00:15:54.354
techniques in terms of layout that we can
experimentally verify to be more

00:15:54.354 --> 00:15:59.266
resistant to radiation effects. And as a
third measure, which is probably the most

00:15:59.266 --> 00:16:03.491
important one for us, is what we call
modeling. So we actually are able to

00:16:03.491 --> 00:16:08.358
characterize all the effects that
radiation will have on a transistor. And

00:16:08.358 --> 00:16:12.442
if we can do that, if we know: 'If I
put it into a radiation environment for a

00:16:12.442 --> 00:16:17.000
year, how much slower will it become?'
Then it is of course easy to say: 'OK, I

00:16:17.000 --> 00:16:20.648
can just over-design my circuit and make
it a bit simpler, maybe with less functionality,

00:16:20.648 --> 00:16:24.464
but be able to operate at a
higher frequency and therefore withstand

00:16:24.464 --> 00:16:30.240
the radiation effects for a longer time
while still working sufficiently well at

00:16:30.240 --> 00:16:35.118
the end of its expected lifetime.' So
that's more or less what we can do about

00:16:35.118 --> 00:16:38.254
these effects. And I'll hand over to
Szymon for the second class.
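
NOTE
A minimal sketch of the over-design argument above, assuming the only radiation effect is a fractional slowdown of the critical path. The function names and all numbers (15 ns delay, 50 percent slowdown, 40 MHz clock) are hypothetical; in practice the slowdown would come from a characterized transistor model.
```python
def max_frequency_hz(fresh_delay_s, slowdown_fraction):
    """Maximum clock frequency when the critical path is slowed by the given fraction."""
    return 1.0 / (fresh_delay_s * (1.0 + slowdown_fraction))
def meets_timing(fresh_delay_s, slowdown_fraction, clock_hz):
    """True if the degraded circuit can still run at the required clock frequency."""
    return max_frequency_hz(fresh_delay_s, slowdown_fraction) >= clock_hz
FRESH_DELAY_S = 15e-9       # hypothetical critical-path delay before irradiation
END_OF_LIFE_SLOWDOWN = 0.5  # hypothetical slowdown predicted for the end of life
CLOCK_HZ = 40e6             # hypothetical system clock frequency
print(meets_timing(FRESH_DELAY_S, END_OF_LIFE_SLOWDOWN, CLOCK_HZ))  # prints True
```
With a 20 ns fresh delay instead, the same slowdown would leave the circuit too slow for this clock, which is the case that over-design is meant to rule out.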

00:16:38.254 --> 00:16:42.655
Szymon: Contrary to the cumulative effects
presented by Stefan, the other group are

00:16:42.655 --> 00:16:46.424
Single Event Effects which are caused by
high energy deposits, which are caused by

00:16:46.424 --> 00:16:52.143
a single particle or shower of particles.
And they can happen at any time, even

00:16:52.143 --> 00:16:57.089
seconds after irradiation is started. It
means that if your circuit is vulnerable

00:16:57.089 --> 00:17:01.667
to this class of effects, it can fail
immediately after radiation is present.

00:17:01.667 --> 00:17:06.313
And here we also classify these effects
into several groups. The first are hard,

00:17:06.313 --> 00:17:11.450
or permanent, errors, which as the name
indicates can permanently destroy your

00:17:11.450 --> 00:17:20.260
circuit. And this type of error is
typically critical for power devices where

00:17:20.260 --> 00:17:24.340
you have large power densities and they
are not so much of a problem for digital

00:17:24.340 --> 00:17:30.100
circuits. The other class of effects
are soft errors. And here we distinguish

00:17:30.100 --> 00:17:34.100
transient, or Single Event Transient
errors, which are spurious signals

00:17:34.100 --> 00:17:41.220
propagating in your circuit as a result of
a gate being hit by a particle and they

00:17:41.220 --> 00:17:45.700
are especially problematic for analog
circuits or asynchronous digital circuits,

00:17:45.700 --> 00:17:51.460
but under some circumstances they can be
also problematic for synchronous systems.

00:17:51.460 --> 00:17:56.420
And the other class of problems are
static, or Single Event Upset problems,

00:17:56.420 --> 00:18:01.220
which basically means that your memory
element like a register gets flipped. And

00:18:01.220 --> 00:18:05.060
then of course, if your system is not
designed to handle this type of error

00:18:05.060 --> 00:18:09.620
properly, it can lead to a failure. So in
the following part of the presentation

00:18:09.620 --> 00:18:15.300
we'll focus mostly on soft errors. So
let's try to understand what is the origin

00:18:15.300 --> 00:18:20.820
of this type of problem. So as Stefan
mentioned, the typical transistor is built

00:18:20.820 --> 00:18:25.230
out of diffusions, gate and channel. So
here you can see one diffusion. Let's

00:18:25.230 --> 00:18:29.230
assume that it is a drain diffusion. And
then when a particle goes through and

00:18:29.230 --> 00:18:36.700
deposits charge, it creates free electron-hole
pairs, which then, in the presence of

00:18:36.700 --> 00:18:43.320
electric fields, get collected by
means of drift, which results in a large

00:18:43.320 --> 00:18:46.930
current spike, which is very short. And
then the rest of the charge could be

00:18:46.930 --> 00:18:50.940
collected by diffusion which is a much
slower process and therefore also the

00:18:50.940 --> 00:18:56.390
amplitude of the event is much, much
smaller. So let's try to understand what

00:18:56.390 --> 00:19:01.230
could happen in a typical memory cell. So
on this schematic, you can see the

00:19:01.230 --> 00:19:05.740
simplest memory cell, which is composed of
two back-to-back inverters. And let's

00:19:05.740 --> 00:19:12.810
assume that node A is at high and node B
is at low potential initially. And then we

00:19:12.810 --> 00:19:17.210
have a particle hitting the drain of
transistor M1 which creates a short

00:19:17.210 --> 00:19:22.590
circuit current between drain and ground,
bringing the drain of transistor M1 to low

00:19:22.590 --> 00:19:29.871
potential, which also acts on the gates of the
second inverter, temporarily changing its

00:19:29.871 --> 00:19:38.734
state from low to high, which reinforces
the wrong state in the first inverter. And

00:19:38.734 --> 00:19:45.340
at this time the error is locked in your
memory cell and you have basically lost your

00:19:45.340 --> 00:19:49.652
information. So you may be asking
yourself: 'How much charge is needed

00:19:49.652 --> 00:19:54.281
really to flip the state of a memory cell?'
And you can get this number from either

00:19:54.281 --> 00:19:59.952
simulations or from measurements. So let's
assume that what we could do is try

00:19:59.952 --> 00:20:04.605
to inject some current into the sensitive
node, for example, the drain of transistor M1.

00:20:04.605 --> 00:20:08.790
And here what I will show is that on the
top plot you will have current as a function

00:20:08.790 --> 00:20:13.484
of time. On the second plot you will have
output voltage. So voltage at node B as a

00:20:13.484 --> 00:20:19.121
function of time, and on the bottom plot you
will see a probability of having a bit

00:20:19.121 --> 00:20:23.097
flip. So if you inject very little
current, of course nothing changes at the

00:20:23.097 --> 00:20:27.670
output, but once you start increasing the
amount of current you are injecting, you

00:20:27.670 --> 00:20:33.306
see that something appears at the output
and at some point the output will toggle,

00:20:33.306 --> 00:20:39.747
so it will switch to the other state. And
at this point, if you really calculate

00:20:39.747 --> 00:20:46.369
what is the area under the current curve
you can find what is the critical charge

00:20:46.369 --> 00:20:53.499
needed to flip the memory cell. And if you
go further, if you start injecting even

00:20:53.499 --> 00:21:00.701
more current, you will not see that much
difference in the output voltage waveform.
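
NOTE
A minimal sketch of the critical-charge idea above: the injected charge is the area under the current curve, here computed with the trapezoidal rule. The pulse shape, the 5 fC critical charge, and the hard threshold in flips are hypothetical simplifications, not measured values.
```python
def collected_charge(times_s, currents_a):
    """Trapezoidal integral of current over time, i.e. the injected charge in coulombs."""
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (currents_a[i] + currents_a[i - 1]) * dt
    return total
def flips(times_s, currents_a, critical_charge_c):
    """Idealized model: the cell flips once the injected charge reaches the critical charge."""
    return collected_charge(times_s, currents_a) >= critical_charge_c
# Hypothetical triangular pulse: rises to 100 uA and falls back to 0 within 200 ps (10 fC).
times = [0.0, 100e-12, 200e-12]
currents = [0.0, 100e-6, 0.0]
print(flips(times, currents, 5e-15))  # prints True for a hypothetical 5 fC critical charge
```
Sweeping the pulse amplitude in such a model and recording where flips first returns True is one way to estimate the critical charge from simulated waveforms.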

00:21:00.701 --> 00:21:05.112
It could become only slightly faster. And
at this point, you also can notice that

00:21:05.112 --> 00:21:09.528
the probability now jumped to one, which
means that any time you inject so much

00:21:09.528 --> 00:21:17.431
current there is a fault in your circuit.
So for now, we just found what is the

00:21:17.431 --> 00:21:23.414
probability of having a bit-flip from 0 to
1 in node B. Of course we should also

00:21:23.414 --> 00:21:27.904
calculate the same for the other
direction, so from 1 to zero. And usually

00:21:27.904 --> 00:21:32.377
it is slightly different. And then of
course we should inject in all the other

00:21:32.377 --> 00:21:37.817
nodes, for example node B and also should
study all possible transitions. And then

00:21:37.817 --> 00:21:43.492
at the end, if you calculate the
superposition of these effects and you

00:21:43.492 --> 00:21:48.655
multiply them by the active area of each
node, you will end up with what we call

00:21:48.655 --> 00:21:52.420
the cross section, which has a dimension
of centimeters squared, which will tell

00:21:52.420 --> 00:21:57.357
you how sensitive your circuit is to this
type of effects. And then knowing the

00:21:57.357 --> 00:22:03.761
radiation profile of your environment, you
can calculate the expected upset rate in

00:22:03.761 --> 00:22:10.105
the final application. So now, having
covered the basics of the Single Event

00:22:10.105 --> 00:22:16.517
effects, let's try to check how we can
mitigate them. And here also technology

00:22:16.517 --> 00:22:20.875
plays a significant role. So of course,
newer technologies offer us much smaller

00:22:20.875 --> 00:22:26.692
devices. And together with that, what
follows is that usually supply voltages

00:22:26.692 --> 00:22:31.047
are getting smaller and smaller as well as
the node capacitance, which means that for

00:22:31.047 --> 00:22:35.565
our Single Event Upsets it is very bad
because the critical charge which is

00:22:35.565 --> 00:22:40.207
required to flip our bit is getting less
and less. But at the end, at the same

00:22:40.207 --> 00:22:44.135
time, physical dimensions of our
transistors are getting smaller, which

00:22:44.135 --> 00:22:48.097
means that the cross section for them
being hit is also getting smaller. So

00:22:48.097 --> 00:22:52.495
overall, the effects really depend on the
circuit topology and the radiation

00:22:52.495 --> 00:22:59.181
environment. So another protection method
could be introduced on the cell level. And

00:22:59.181 --> 00:23:04.914
here we could imagine increasing the
critical charge. And that could be done in

00:23:04.914 --> 00:23:10.819
the easiest way by just increasing the
node capacitance by, for example, using

00:23:10.819 --> 00:23:16.096
larger transistors. But of course, this
also increases the collection electrode,

00:23:16.096 --> 00:23:22.657
which is not nice. And another way could
be to just increase the capacitance by adding

00:23:22.657 --> 00:23:28.336
some extra metal capacitance, but it, of
course, slows down the circuit. Another

00:23:28.336 --> 00:23:33.615
approach could be to try to store the
information on more than two nodes. So I

00:23:33.615 --> 00:23:38.377
showed you that on a simple SRAM cell we
store information only on two nodes, so

00:23:38.377 --> 00:23:43.102
you could try to come up with some other
cells, for example, like that one in which

00:23:43.102 --> 00:23:47.406
the information is stored on four nodes.
So you can see that the architecture is

00:23:47.406 --> 00:23:53.800
very similar to the basic SRAM cell. But
you should always make sure to very

00:23:53.800 --> 00:23:59.000
carefully simulate your design, because if
you analyze this circuit, you will quickly

00:23:59.000 --> 00:24:02.936
realize that in this circuit, even though the
information is stored in four different

00:24:02.936 --> 00:24:09.867
nodes, the same type of loop exists as in
the basic circuit. Meaning that at the end

00:24:09.867 --> 00:24:15.227
the circuit offers basically no hardening
with respect to the previous cell. So

00:24:15.227 --> 00:24:21.074
actually we can do it better. So here you
can see a typical dual interlocked cell.

00:24:21.074 --> 00:24:26.445
So the number of transistors is exactly
the same as in the previous example, but

00:24:26.445 --> 00:24:30.819
now they are interconnected slightly
differently. And here you can see that

00:24:30.819 --> 00:24:36.262
this cell also has two stable configurations.
But this time data can propagate, the low

00:24:36.262 --> 00:24:40.587
level from a given node can propagate
only to the left hand side, while high

00:24:40.587 --> 00:24:47.872
level can propagate to the right hand
side. And each stage being inverting means

00:24:47.872 --> 00:24:54.918
that the fault cannot propagate for more
than one node. Of course, this cell has

00:24:54.918 --> 00:25:00.379
some drawbacks: It consumes more area than
a simple SRAM cell and also write access

00:25:00.379 --> 00:25:04.240
requires accessing at least two nodes at
the same time to really change the state

00:25:04.240 --> 00:25:09.801
of the cell. And so you may ask yourself,
how effective is this cell? So here I will

00:25:09.801 --> 00:25:13.709
show you a cross section plot. So it is
the probability of having an error as a

00:25:13.709 --> 00:25:18.883
function of injected energy. And as a
reference, you can see a pink curve on the

00:25:18.883 --> 00:25:25.650
top, which is for a normal, unprotected
cell. And in green you can see the

00:25:25.650 --> 00:25:31.399
cross section for the error in the DICE
cell. So as you can see, it is one order

00:25:31.399 --> 00:25:36.934
of magnitude better than the normal cell.
But still, the cross section is far from

00:25:36.934 --> 00:25:41.426
being negligible. So the problem was
identified: So it was identified that the

00:25:41.426 --> 00:25:45.679
problem was caused by the fact that some
sensitive nodes were very close together

00:25:45.679 --> 00:25:50.807
on the layout and therefore they could be
upset by the same particle. Because as we

00:25:50.807 --> 00:25:54.721
mentioned, single devices are very
small. We are talking about dimensions

00:25:54.721 --> 00:25:59.675
below a micron. So after realizing that,
we designed another cell in which we

00:25:59.675 --> 00:26:04.799
separated more sensitive nodes and we
ended up with the blue curve, and as you

00:26:04.799 --> 00:26:08.907
can see the cross section was reduced by
two more orders of magnitude and the

00:26:08.907 --> 00:26:14.205
threshold was increased significantly. So
if you don't want to redesign your

00:26:14.205 --> 00:26:18.771
standard cells, you could also apply some
mitigation techniques on block level. So

00:26:18.771 --> 00:26:24.717
here we can use some encoding to encode
our state better. And as an example, I

00:26:24.717 --> 00:26:31.540
will show you a typical Hamming code. So
to protect four bits, we have to add three

00:26:31.540 --> 00:26:38.052
additional parity bits which are calculated
according to this formula. And then once

00:26:38.052 --> 00:26:44.133
you calculate the parity bits, you can use
those to check the state integrity of your

00:26:44.133 --> 00:26:50.360
internal state. And if any of the parity
bits is not equal to zero, then the bits

00:26:50.360 --> 00:26:55.375
instantaneously become syndromes,
indicating where the error happened. And

00:26:55.375 --> 00:26:59.916
you can use this information to correct
the error. Of course, in this case, the

00:26:59.916 --> 00:27:06.533
efficiency is not really nice because we
need three additional bits to protect only

00:27:06.533 --> 00:27:11.828
four bits of information. But as the state
length increases, the protection also becomes

00:27:11.828 --> 00:27:18.855
more efficient. Another approach would be
to do even less. Meaning that instead of

00:27:18.855 --> 00:27:23.970
changing anything you need in your design,
you can just triplicate your design or

00:27:23.970 --> 00:27:30.190
multiply it many times and just vote,
which state is correct? So this concept is

00:27:30.190 --> 00:27:35.046
called triple modular redundancy and it is
based around a voter cell. So it is a

00:27:35.046 --> 00:27:40.210
cell which has an odd number of
inputs and whose output is always equal to

00:27:40.210 --> 00:27:45.040
the majority of its inputs. And as I mentioned,
the idea is that you have, for

00:27:45.040 --> 00:27:49.292
example, three circuits: A, B and C, and
during normal operation, when they are

00:27:49.292 --> 00:27:54.471
identical, the output is also the same.
However, when there is a problem, for

00:27:54.471 --> 00:28:00.957
example, in logic part B, its output
is affected. So this problem is

00:28:00.957 --> 00:28:05.509
effectively masked by the voter cell
and it is not visible from outside of the

00:28:05.509 --> 00:28:10.383
circuit. But you have to be careful not to
take this picture as a design

00:28:10.383 --> 00:28:15.501
template. So let's try to analyze what
would happen with a state machine

00:28:15.501 --> 00:28:20.329
similar to what Stefan introduced, if you
were to just use this concept. So here you

00:28:20.329 --> 00:28:24.859
can see three state machines and
a voter on the output. And as we can see,

00:28:24.859 --> 00:28:29.484
if you have an upset in, for example, the
state register A, then the state is

00:28:29.484 --> 00:28:36.676
broken. But still the output of the
circuit, which is indicated by the letter s, is

00:28:36.676 --> 00:28:42.355
correct because the B and C registers are
still fine. But what happens if some time

00:28:42.355 --> 00:28:49.283
later we have an upset in memory element B
or C? Then of course the state

00:28:49.283 --> 00:28:56.028
of our system is broken and we cannot
recover it. So you can ask yourself what

00:28:56.028 --> 00:29:02.204
can we do better in order to avoid this
situation? So, just to be sure: please

00:29:02.204 --> 00:29:06.654
do not use this technique to protect your
circuits. So the easiest mitigation could

00:29:06.654 --> 00:29:13.201
be to use, as an input to your logic,
the output of the voter cell itself.

00:29:13.201 --> 00:29:18.491
What it offers us is that now whenever you
have an upset in one of the memory

00:29:18.491 --> 00:29:22.933
elements for the next computation, for the
next stage, we always use the voter

00:29:22.933 --> 00:29:27.631
output, which ensures that the error
will be removed one clock cycle later. So

00:29:27.631 --> 00:29:32.726
if you have another hit some time later,
basically, it will not affect our state.

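NOTE Editor's sketch, not part of the talk: a minimal Python model of the voter feedback just described.
It assumes a hypothetical 4-bit counter as the triplicated state machine; the names majority and run are
illustrative only. It compares plain triplication (each copy trusts only itself) with feeding the voted
value back into all three copies, for two upsets that hit different registers at different times.
```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote of three equally wide words."""
    return (a & b) | (a & c) | (b & c)
def run(feedback: bool, upsets: dict, cycles: int = 8) -> list:
    """Simulate a triplicated 4-bit counter; upsets maps cycle number to register index."""
    regs = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        if cycle in upsets:
            regs[upsets[cycle]] ^= 0b1000  # single event upset: flip one stored bit
        voted = majority(*regs)
        outputs.append(voted)
        if feedback:
            regs = [(voted + 1) % 16] * 3  # every copy restarts from the voted state
        else:
            regs = [(r + 1) % 16 for r in regs]  # every copy trusts only itself
    return outputs
golden = run(feedback=True, upsets={})
without_feedback = run(feedback=False, upsets={2: 0, 5: 1})
with_feedback = run(feedback=True, upsets={2: 0, 5: 1})
print(golden, without_feedback, with_feedback)
```
In this model the variant without feedback matches the golden run until the second upset arrives at
cycle 5 and is wrong from then on, while the variant with feedback never deviates from the golden run.
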
00:29:32.726 --> 00:29:39.765
Until now we considered only upsets in our
registers but what happens if we have

00:29:39.765 --> 00:29:45.885
charge in our voter? So you see that
if there is no state change, basically the

00:29:45.885 --> 00:29:50.981
transient in the voter doesn't impact
our system. But if you are really unlucky

00:29:50.981 --> 00:29:55.777
and the transient happens when the clock
transition happens, so whenever we

00:29:55.777 --> 00:30:01.182
latch the data, we can corrupt the state
in three registers at the same time, which

00:30:01.182 --> 00:30:05.605
is less than ideal. So to overcome this
limitation, you can consider skewing our

00:30:05.605 --> 00:30:11.101
clocks by some time, which is larger than
the maximum charge in time. And now,

00:30:11.101 --> 00:30:18.050
because each register samples the
output of the voter at a slightly different

00:30:18.050 --> 00:30:23.449
time, we can corrupt only one flip-flop
at a time. So of course, if you are

00:30:23.449 --> 00:30:28.780
unlucky, we can have problematic
situations in which one register is

00:30:28.780 --> 00:30:33.646
already in the new state. The other register
is still in the old state. And then it

00:30:33.646 --> 00:30:39.728
can lead to non-deterministic results. So it
is better, but still not ideal. So as a

00:30:39.728 --> 00:30:46.578
general theme, you have seen that we were
adding and adding more resources so you

00:30:46.578 --> 00:30:50.418
can ask yourself what would happen if we
triplicate everything. So in this case,

00:30:50.418 --> 00:30:54.262
we triplicated registers, we
triplicated our logic and our voters. And

00:30:54.262 --> 00:30:59.138
now you can see that whenever we have an
upset in our register, it can only affect

00:30:59.138 --> 00:31:04.513
one register at a time and the error
will be removed from the system one clock

00:31:04.513 --> 00:31:08.912
cycle later. Also, if we have an upset
in the voter or in the logic it can be

00:31:08.912 --> 00:31:13.372
latched into only one register, which means
that in principle we created a system

00:31:13.372 --> 00:31:17.885
which is really robust. Unfortunately,
nothing is for free. So here I compare a

00:31:17.885 --> 00:31:22.823
different triplication schemes and,
as you can see, the more protection

00:31:22.823 --> 00:31:26.326
you want to have, the more you have to pay
in terms of resources, being power and

00:31:26.326 --> 00:31:31.373
area. And also, usually, you pay a small
penalty in terms of maximum operational

00:31:31.373 --> 00:31:37.597
speed. So which flavor of protection you
use really depends on the

00:31:37.597 --> 00:31:42.420
application. So for the most sensitive
circuits, you probably want to use

00:31:42.420 --> 00:31:48.493
full TMR and you may leave some other
bits of logic unprotected. Alternatively, if

00:31:48.493 --> 00:31:54.749
your system is not mission critical and
you can tolerate some downtime, you can

00:31:54.749 --> 00:32:00.294
consider scrubbing, which means periodically 
checking the state of your system and refreshing it

00:32:00.294 --> 00:32:05.120
if necessary, when an error is detected using
some parity bits or a copy of the data in

00:32:05.120 --> 00:32:10.394
a safe space. Or you can have a
watchdog which will find out that

00:32:10.394 --> 00:32:13.951
something went wrong and it will just
reinitialize the whole system. So now,

00:32:13.951 --> 00:32:20.011
having covered the basics of all the effects
we will have to face, we would like

00:32:20.011 --> 00:32:24.293
to show you the basic flow which we follow
during designing our radiation hardened

00:32:24.293 --> 00:32:29.746
circuits. So of course we always start
with specifications. So we try to

00:32:29.746 --> 00:32:34.228
understand our radiation environment in
which the circuit is meant to operate. So

00:32:34.228 --> 00:32:38.750
we come up with some specifications for
total dose which could be accumulated and

00:32:38.750 --> 00:32:45.348
for the rate of single event upsets. And
at this moment, it is also not very rare

00:32:45.348 --> 00:32:49.705
that we have to decide to move some
functionality out of our detector volume,

00:32:49.705 --> 00:32:56.133
outside, where we can use off-the-shelf
commercial equipment to do number

00:32:56.133 --> 00:33:04.820
crunching. But let's assume that we would
go with our ASIC. So having the

00:33:04.820 --> 00:33:09.220
specifications, of course we proceed with
functional implementation. This we

00:33:09.220 --> 00:33:14.260
typically do with hardware description
languages, so Verilog or VHDL, which you may

00:33:14.260 --> 00:33:18.900
know from typical FPGA flow. And of course
we write a lot of simulations to

00:33:18.900 --> 00:33:24.205
understand whether we are meeting our
functional goals or whether our circuit

00:33:24.205 --> 00:33:30.665
behaves as expected. And then we
select some parts of the

00:33:30.665 --> 00:33:36.318
circuits which we want to protect from
radiation effects. So, for example, we can

00:33:36.318 --> 00:33:42.290
decide to use triplication or some other
methods. So these days we typically use

00:33:42.290 --> 00:33:46.645
triplication as the most straightforward
and very effective method. So you can ask

00:33:46.645 --> 00:33:50.750
yourself how do we triplicate the logic?
So the simplest could be: Just copy

00:33:50.750 --> 00:33:55.099
and paste the code three times, add some
postfixes like A, B and C and you are

00:33:55.099 --> 00:34:01.653
done. But of course this solution has some
drawbacks. So it is time consuming and it

00:34:01.653 --> 00:34:05.964
is very error prone. So maybe you have
noticed that I had a typo there. So of

00:34:05.964 --> 00:34:10.220
course we don't want to do that. So we
developed our own tool, which we called

00:34:10.220 --> 00:34:16.924
TMRG, which automates the process of
triplication and eliminates the two main

00:34:16.924 --> 00:34:22.494
drawbacks, which I just described. So
after we have our code triplicated and of

00:34:22.494 --> 00:34:27.075
course, not before rerunning all the
simulations to make sure that everything

00:34:27.075 --> 00:34:34.230
went as expected, we then proceed to the
synthesis process in which we convert our

00:34:34.230 --> 00:34:41.091
high level hardware description languages
to gate level netlists, in which all the functions

00:34:41.091 --> 00:34:46.189
are mapped to gates, which were introduced
by Stefan, so both combinatorial and

00:34:46.189 --> 00:34:53.631
sequential. And here we also have to be
careful because modern CAD tools have a

00:34:53.631 --> 00:34:59.020
tendency, of course, to optimise the logic
as much as possible. And our logic in most

00:34:59.020 --> 00:35:03.810
of the cases is really redundant. So it is
very easy to see that it should be removed. So we

00:35:03.810 --> 00:35:08.632
really have to make sure that it is not
removed. That's why our tool also provides

00:35:08.632 --> 00:35:13.633
some constraints for the synthesizer to
make sure that our design intent is

00:35:13.633 --> 00:35:20.900
clearly and well understood by the tool.
And once we have the output netlist, we

00:35:20.900 --> 00:35:26.980
proceed to place and route process where
this kind of netlist representation is

00:35:26.980 --> 00:35:32.580
mapped to a layout of what will become
soon our digital chip, where we place all

00:35:32.580 --> 00:35:36.624
the cells and we route connections between
them and here there is

00:35:36.624 --> 00:35:40.907
another danger which I mentioned already,
it's that in modern technologies the cells

00:35:40.907 --> 00:35:45.597
are so small that they could be easily
affected by a single particle at the same

00:35:45.597 --> 00:35:51.892
time. So we have to really space out
the bit cells which are responsible for

00:35:51.892 --> 00:35:56.982
keeping the information about the state to
make sure that a single particle cannot

00:35:56.982 --> 00:36:04.980
upset, for example, the A and B copies
of the same register. And then in the

00:36:04.980 --> 00:36:09.540
last step, of course, we'll have to verify
that everything we have done is

00:36:09.540 --> 00:36:13.926
correct. And at this level, we also try to
introduce some single event effects in our

00:36:13.926 --> 00:36:19.971
simulations. So we could randomly flip
bits in our system. We can also inject

00:36:19.971 --> 00:36:26.094
transients. And typically we used to do
that on the netlist level, which works

00:36:26.094 --> 00:36:31.424
very fine. And it is very nice. But the
problem with this approach is that we can

00:36:31.424 --> 00:36:37.640
perform these actions very late in the
design cycle, which is less than ideal.

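NOTE Editor's sketch, not part of the talk: a minimal Python model of a fault-injection campaign of the
kind described here, in which random bit flips are injected and each run is compared against a fault-free
golden run. The design under test (an unprotected 4-bit counter) and the names counter_step, run_design
and campaign are hypothetical; a real flow would drive an HDL simulator instead of a Python function.
```python
import random
from typing import Optional
def counter_step(state: int) -> int:
    """Hypothetical design under test: an unprotected 4-bit counter."""
    return (state + 1) % 16
def run_design(cycles: int, flip_cycle: Optional[int] = None, flip_bit: int = 0) -> int:
    """Run the design and optionally flip one state bit just before the given cycle."""
    state = 0
    for cycle in range(cycles):
        if cycle == flip_cycle:
            state ^= 1 << flip_bit  # injected single event upset
        state = counter_step(state)
    return state
def campaign(runs: int, cycles: int = 20, seed: int = 1) -> float:
    """Return the fraction of injected faults that corrupt the final state."""
    rng = random.Random(seed)
    golden = run_design(cycles)
    failures = 0
    for _ in range(runs):
        result = run_design(cycles, rng.randrange(cycles), rng.randrange(4))
        failures += result != golden
    return failures / runs
error_rate = campaign(runs=200)
print(error_rate)
```
Because nothing in this unprotected counter ever corrects a flipped bit, every injected fault survives to
the end of the run; a hardened design would be judged by how far it pushes this fraction towards zero.
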
00:36:37.640 --> 00:36:43.084
And also, if we find that there is a
problem in our simulation, a typical netlist

00:36:43.084 --> 00:36:48.437
at this level has probably a few orders of
magnitude more lines than our initial RTL

00:36:48.437 --> 00:36:52.990
code. So to trace back what is the
problematic line of code is not so

00:36:52.990 --> 00:36:57.533
straightforward at this stage. So you can
ask yourself why not try to inject

00:36:57.533 --> 00:37:05.458
errors in the RTL design? And the answer
was, the answer is that it is not so

00:37:05.458 --> 00:37:10.670
trivial to map the hardware description
language's high level constructs to

00:37:10.670 --> 00:37:15.585
what will become combinatorial or
sequential logic. So in order to eliminate

00:37:15.585 --> 00:37:20.980
this problem, we also developed another open
source tool, which allows us to...

00:37:20.980 --> 00:37:27.860
So we decided to use the Yosys open
source synthesis tool from Clifford, which

00:37:27.860 --> 00:37:31.530
was presented at the Congress several
years ago. So we use this tool to make a

00:37:31.530 --> 00:37:35.680
first pass through our RTL code to
understand which elements will be mapped

00:37:35.680 --> 00:37:40.678
to sequential and combinatorial. And then
having this information, we will use

00:37:40.678 --> 00:37:45.951
cocotb, another Python verification
framework, which allows us programmatic

00:37:45.951 --> 00:37:51.838
access to these nodes and we can
effectively inject the errors in our

00:37:51.838 --> 00:37:56.660
simulations. And I forgot to mention that
the TMRG tool is also open source. So if

00:37:56.660 --> 00:38:03.841
you are interested in one of the tools,
please feel free to contact us. And of

00:38:03.841 --> 00:38:10.505
course, after our simulation is done, then in
the next step we would really tape out. And

00:38:10.505 --> 00:38:14.637
so we submit our chip to manufacturing and
hopefully a few months later we receive

00:38:14.637 --> 00:38:18.105
our chip back.
Stefan: All right. So after patiently

00:38:18.105 --> 00:38:23.546
waiting then for a couple of months while
your chip is in manufacturing and you're

00:38:23.546 --> 00:38:28.245
spending time on preparing a test setup
and preparing yourself to actually test if

00:38:28.245 --> 00:38:33.772
your chip works as you expect it to. Now,
it's probably also a good time to think

00:38:33.772 --> 00:38:38.307
about how to actually validate or test if
all the measures that you've taken to

00:38:38.307 --> 00:38:41.389
protect your circuit from radiation
effects actually are effective or if they

00:38:41.389 --> 00:38:46.196
are not. And so again, we will split this
in two parts. So you will probably want to

00:38:46.196 --> 00:38:50.024
start with testing for the total ionizing
dose effects. So for the cumulative effect

00:38:50.024 --> 00:38:54.554
and for that, you typically use X-ray
radiation relatively similar to the one

00:38:54.554 --> 00:38:59.005
used in medical treatment. So this
radiation is relatively low-energy,

00:38:59.005 --> 00:39:03.344
which has the upside of not producing any
single event effects, so you can really

00:39:03.344 --> 00:39:07.462
only accumulate radiation dose and focus
on the accumulating effects. And typically

00:39:07.462 --> 00:39:11.600
you would use a machine that looks
somewhat like this, a relatively compact

00:39:11.600 --> 00:39:16.840
thing you can have in your laboratory and
you can use that to really accumulate

00:39:16.840 --> 00:39:21.520
large amounts of radiation dose on your
circuit. And then you need some sort of

00:39:21.520 --> 00:39:26.641
mechanism to verify or to quantify how
much your circuit slows down due to this

00:39:26.641 --> 00:39:31.285
radiation dose. And if you do that, you
typically end up with a graph such as

00:39:31.285 --> 00:39:36.567
this one, where on the x axis you have the
radiation dose your circuit was exposed

00:39:36.567 --> 00:39:40.639
to. And on the y axis, you see that the
frequency has gone down over time and you

00:39:40.639 --> 00:39:44.536
can use this information to say:
"OK, in my final application, I expect this

00:39:44.536 --> 00:39:49.324
level of radiation dose, and I can
still see that my circuit will work fine

00:39:49.324 --> 00:39:53.565
under some given environmental condition
or some operation condition." So this is

00:39:53.565 --> 00:39:58.285
the test for the first class of effects.
And the test for the second class of

00:39:58.285 --> 00:40:02.318
effects, the single event effects, is a
bit more involved. So there what you would

00:40:02.318 --> 00:40:07.157
typically start to do is go for a heavy
ion test campaign. So you would go to a

00:40:07.157 --> 00:40:12.760
specialized, relatively rare facility. We
have a couple of those in Europe and they would

00:40:12.760 --> 00:40:16.532
look perhaps somewhat like this. So it's a
small particle accelerator somewhere.

00:40:16.532 --> 00:40:20.794
They typically have
different types of heavy ions at their

00:40:20.794 --> 00:40:26.311
disposal that they can accelerate and then
shoot at your chip that you can place in a

00:40:26.311 --> 00:40:32.390
vacuum chamber and these ions can deposit
very well known amounts of energy in your

00:40:32.390 --> 00:40:36.818
circuit and you can use that information
to characterize your circuit. The downside

00:40:36.818 --> 00:40:41.207
is a bit that these facilities tend to be
relatively expensive to access and also a

00:40:41.207 --> 00:40:45.161
bit hard to access. So typically you need
to book them a lot of time in advance and

00:40:45.161 --> 00:40:50.351
that's sometimes not very easy. But what
it offers you is that you can use different types

00:40:50.351 --> 00:40:55.244
of ions with different energies. You can
really make a very well-defined

00:40:55.244 --> 00:41:00.190
sensitivity curve similar to the one that
Szymon has described, which you can get from

00:41:00.190 --> 00:41:04.052
simulations and really characterize your
circuit for how often any single event

00:41:04.052 --> 00:41:09.026
effects will appear in the final
application, if there are any remaining

00:41:09.026 --> 00:41:12.827
effects left, if you have left something
unprotected. The problem here is that

00:41:12.827 --> 00:41:18.190
these particle accelerators typically just
bombard your circuit with like thousands

00:41:18.190 --> 00:41:23.310
of particles per second and they hit
basically the whole area in a random

00:41:23.310 --> 00:41:26.940
fashion. So you don't really have a way of
steering those or measuring the position

00:41:26.940 --> 00:41:30.964
of these particles. So typically you are a
bit in the dark and have to really

00:41:30.964 --> 00:41:34.884
carefully know the behavior of your
circuit and all the quirks it has even

00:41:34.884 --> 00:41:39.481
without the radiation to instantly notice
when something has gone wrong. And

00:41:39.481 --> 00:41:44.088
this is typically not very easy
and you can kind of compare it with having

00:41:44.088 --> 00:41:47.372
some weird crash somewhere in your
software stack and then having to

00:41:47.372 --> 00:41:51.800
first take a look and see what actually
has happened. Typically

00:41:51.800 --> 00:41:57.058
you find something that has not been
properly protected and you see some weird

00:41:57.058 --> 00:42:01.847
effect on your circuit and then you try to
get a better idea of where that problem

00:42:01.847 --> 00:42:06.256
actually is located. And the answer for
these types of problems involving position

00:42:06.256 --> 00:42:11.381
is, of course, always lasers. So we have
two types of laser experiments available

00:42:11.381 --> 00:42:15.796
that can be used to more selectively probe
your circuit for these problems. The first

00:42:15.796 --> 00:42:19.691
one being the single photon absorption
laser. And it is relatively

00:42:19.691 --> 00:42:24.709
simple in terms of setup. You just use a
single laser beam that shoots straight up

00:42:24.709 --> 00:42:29.884
at your circuit from the back. And while
it does that, it deposits energy all along

00:42:29.884 --> 00:42:34.180
the silicon and also in the diffusions of
your transistors and is therefore also

00:42:34.180 --> 00:42:38.388
able to inject energy there, potentially
upsetting a bit of memory or exposing

00:42:38.388 --> 00:42:43.053
whatever other single event effects you
have. And of course, you can steer this

00:42:43.053 --> 00:42:46.880
beam across the surface of your chip or
whatever circuit you are testing and then

00:42:46.880 --> 00:42:51.330
find the sensitive location. The problem
here is that the amount of energy that is

00:42:51.330 --> 00:42:55.238
deposited is really large due to the fact
that it has to go through the whole

00:42:55.238 --> 00:42:59.053
silicon until it reaches the transistor.
And therefore it's mostly used to find

00:42:59.053 --> 00:43:02.582
these destructive effects that really
break something in your circuit. The more

00:43:02.582 --> 00:43:07.972
clever and somehow beautiful experiment is
the two photon absorption laser experiment

00:43:07.972 --> 00:43:12.624
in which you use two laser beams of a
different wavelength. And these actually

00:43:12.624 --> 00:43:18.366
do not have enough energy to cause any
effect in your silicon. If only one of the

00:43:18.366 --> 00:43:22.174
laser beams is present, but only in the
small location where the two beams

00:43:22.174 --> 00:43:26.874
intersect, the energy is actually large
enough to produce the effect. And this

00:43:26.874 --> 00:43:30.664
allows you to very selectively and only on
a very small volume induce charge and

00:43:30.664 --> 00:43:37.818
cause an effect in your circuit. And when
you do that now, you can systematically

00:43:37.818 --> 00:43:41.964
scan both the X and Y directions across
your chip and also the Z direction and can

00:43:41.964 --> 00:43:46.366
really measure the volume of sensitive
area. And this is what you would typically

00:43:46.366 --> 00:43:50.804
get out of such an experiment. So in black and
white in the back, you'll see an infrared

00:43:50.804 --> 00:43:54.621
image of your chip where you can really
make out the individual, say structural

00:43:54.621 --> 00:43:59.406
components. And then overlaid in blue, you
can basically highlight all the sensitive

00:43:59.406 --> 00:44:03.897
points that made you measure something you
didn't expect, some weird bit flip in a

00:44:03.897 --> 00:44:08.338
register or something. And you can really
then go to your layout software and find

00:44:08.338 --> 00:44:13.644
what is the register or the gate in
your netlist that is responsible for

00:44:13.644 --> 00:44:17.465
this. And then it's more like operating a
debugger in a software environment,

00:44:17.465 --> 00:44:22.889
tracing back from there what the line of
code responsible for this bug is. And

00:44:22.889 --> 00:44:31.260
to close out, it is always best to learn
from mistakes. And we offer our mistakes

00:44:31.260 --> 00:44:35.901
as a guideline if you ever feel
the need yourself to design radiation

00:44:35.901 --> 00:44:40.695
tolerant circuits. So we want to present
two or three small issues we had in

00:44:40.695 --> 00:44:45.300
circuits where we were convinced it should
have been working fine. So the first one

00:44:45.300 --> 00:44:50.018
you will probably recognize: it is this
full triple modular redundancy scheme that

00:44:50.018 --> 00:44:55.279
Szymon has presented. So we made sure to
triplicate everything and we were relatively

00:44:55.279 --> 00:44:59.102
sure that everything should be fine. The
only modification we did is that to all

00:44:59.102 --> 00:45:03.506
those registers in our design, we've added
a reset, because we wanted to initialize

00:45:03.506 --> 00:45:07.710
the system to some known state when we
started up, which is a very obvious thing

00:45:07.710 --> 00:45:12.327
to do. Every CPU has a reset. But of
course, what we didn't think about here

00:45:12.327 --> 00:45:16.577
was that at some point there's a buffer
driving this reset line somewhere. And if

00:45:16.577 --> 00:45:20.355
there's only a single buffer. What happens
if this buffer experiences a small

00:45:20.355 --> 00:45:24.501
transient event? Of course, the obvious
thing that happened is that as soon as

00:45:24.501 --> 00:45:28.247
that happened, all the registers were
upset at the same time and were basically

00:45:28.247 --> 00:45:32.205
cleared and all our fancy protection was
invalidated. So next time we decided,

00:45:32.205 --> 00:45:37.679
let's be smarter this time. And of course,
we triplicate all the logic and all the

00:45:37.679 --> 00:45:40.633
voters and all the registers. So let's
also triplicate the reset lines. And while

00:45:40.633 --> 00:45:44.955
the designer of that block probably had
very good intentions, it turned out

00:45:44.955 --> 00:45:49.268
that later, when we manufactured the
chip, it still sometimes showed a complete

00:45:49.268 --> 00:45:54.570
reset without any good explanation for
that. And what was left out of the

00:45:54.570 --> 00:45:59.981
scope of thinking here was that this reset
actually was connected to the system reset

00:45:59.981 --> 00:46:05.033
of the chip that we had. And typically
pins on the chip are something that is

00:46:05.033 --> 00:46:09.005
not available in huge quantities. So you
typically don't want to spend three pins

00:46:09.005 --> 00:46:13.128
of your chip just for a stupid reset that
you don't use ninety nine percent of the

00:46:13.128 --> 00:46:17.895
time. So what we did at some point was that we just
connected again the reset lines to a

00:46:17.895 --> 00:46:21.972
single input buffer. That was then
connected to a pin of the chip. And of

00:46:21.972 --> 00:46:25.590
course, this also represented a small
sensitive area in the chip. And again,

00:46:25.590 --> 00:46:30.216
a single upset here was able to destroy
all three of our flip flops. All right.

00:46:30.216 --> 00:46:35.132
And the last lesson I'm bringing or the
last thing that goes back to the

00:46:35.132 --> 00:46:38.930
implementation details that Szymon has
mentioned. So this time, really simple

00:46:38.930 --> 00:46:42.532
circuit. We were absolutely convinced it
must work because it was basically the

00:46:42.532 --> 00:46:46.072
textbook example that Szymon was
presenting. And the code was so

00:46:46.072 --> 00:46:49.817
small we were able to inspect everything
and were very much sure that nothing

00:46:49.817 --> 00:46:54.690
should have happened. And what we saw when
we went for this laser testing experiment,

00:46:54.690 --> 00:46:59.769
in simplified form, is basically that
only this first voter was sensitive. And when this was

00:46:59.769 --> 00:47:04.414
hit, always all our registers were
upset while the other ones were

00:47:04.414 --> 00:47:09.161
never seen to show anything strange.
And it took us quite a while to actually

00:47:09.161 --> 00:47:13.563
look at the layout later on and figure out
that what was in the chip was rather this.

00:47:13.563 --> 00:47:17.250
So two of the voters were actually not
there. And Szymon mentioned the reason for

00:47:17.250 --> 00:47:21.208
that. So synthesis tools these days are
really clever at identifying redundant

00:47:21.208 --> 00:47:26.102
logic and because we forgot to tell it to
not optimize these redundant pieces of

00:47:26.102 --> 00:47:30.248
logic, which the voters really are, it
just merged them into one. And that

00:47:30.248 --> 00:47:34.393
explains why we only saw this one voter
being the sensitive one. And of course, if

00:47:34.393 --> 00:47:38.255
you have a transient event there, then you
suddenly upset all your registers and that

00:47:38.255 --> 00:47:41.871
without even knowing it and with being
sure, having looked at every single line

00:47:41.871 --> 00:47:45.652
of Verilog code and being very sure,
everything should have been fine. But that

00:47:45.652 --> 00:47:51.805
seems to be how this business goes. So we
hope we had the chance and you

00:47:51.805 --> 00:47:56.648
were able to get some insight into what
we do to make sure the experiments at the

00:47:56.648 --> 00:48:01.966
LHC work fine, and what you can do to
make sure the satellite you are working on

00:48:01.966 --> 00:48:06.393
might be working OK even before launching
it into space. If you're interested in

00:48:06.393 --> 00:48:10.715
some more information on this topic, feel
free to pass by at the assembly I

00:48:10.715 --> 00:48:15.014
mentioned at the beginning or just meet us
after the talk and otherwise thank you

00:48:15.014 --> 00:48:22.286
very much.
<i>Applause</i>

00:48:22.286 --> 00:48:27.041
Herald: Thank you very much indeed.
There's about 10 minutes left for Q and A,

00:48:27.041 --> 00:48:31.872
so if you have any questions go to a
microphone. And as a cautious reminder,

00:48:31.872 --> 00:48:38.297
questions are short sentences that
end with a

00:48:38.297 --> 00:48:42.548
question mark and the first question goes
to the Internet.

00:48:42.548 --> 00:48:46.433
Internet: Well, hello. Um, do you also
incorporate radiation as the source for

00:48:46.433 --> 00:48:50.596
randomness when that's needed?
Stefan: So we personally don't. So in our

00:48:50.596 --> 00:48:56.880
designs we don't. But it is done indeed
for random number generators. There,

00:48:56.880 --> 00:49:01.081
they sometimes use radioactive
decay as a source for randomness. So this

00:49:01.081 --> 00:49:03.989
is done, but we don't do it in our
experiments.

00:49:03.989 --> 00:49:06.802
We rather want deterministic data out of
the things we build.

00:49:06.802 --> 00:49:10.929
Herald: Okay. Next question goes to
microphone number four.

00:49:10.929 --> 00:49:16.714
Mic 4: Do you do your triplication before
or after elaboration?

00:49:16.714 --> 00:49:21.003
Szymon: So currently we do it before
elaboration. So we decided that our tool

00:49:21.003 --> 00:49:25.764
works on Verilog input and it produces
Verilog output because it offers much more

00:49:25.764 --> 00:49:30.496
flexibility in the way you can
incorporate different triplication

00:49:30.496 --> 00:49:34.423
schemes. If you were to apply it only
after elaboration, then of course doing a

00:49:34.423 --> 00:49:38.453
full triplication might be easy. But then
having really precise control

00:49:38.453 --> 00:49:43.438
over the types of triplication on different
levels is much more difficult.

00:49:43.438 --> 00:49:47.296
Herald: Next question from microphone
number two.

00:49:47.296 --> 00:49:50.840
Mic 2: Is it possible to use DC-DC
converters or switch mode power supplies

00:49:50.840 --> 00:49:54.630
within the radiation environment to power
your logic? Or do you use only linear power?

00:49:54.630 --> 00:49:59.866
Szymon: Yes, alternatively we also have a
dedicated program which develops radiation

00:49:59.866 --> 00:50:05.366
hardened DC-DC converters which operate
in our environments. So they are available

00:50:05.366 --> 00:50:10.988
also for space applications, as far as I'm
aware. And they are hardened against total

00:50:10.988 --> 00:50:16.027
ionizing dose as well as single event
upsets.

00:50:16.027 --> 00:50:19.667
Herald: Okay next question goes to
microphone number one.

00:50:19.667 --> 00:50:22.614
Mic 1: Thank you very much for the great
talk. I'm just wondering, would it be

00:50:22.614 --> 00:50:27.435
possible to hook up every logic gate in
every voter in a kind of mesh network? And

00:50:27.435 --> 00:50:31.873
what are the pitfalls and limitations for
that?

00:50:31.873 --> 00:50:36.734
Stefan: So that is not something I'm aware
of being done. So typically: No. I

00:50:36.734 --> 00:50:41.473
wouldn't say that that's something we
would do.

00:50:41.473 --> 00:50:43.431
Szymon: I'm not really sure if I
understood the question.

00:50:43.431 --> 00:50:46.401
Stefan: So maybe you can rephrase what
your idea is?

00:50:46.401 --> 00:50:52.613
Mic 1: On the last slide, there was a
lesson learned.

00:50:52.613 --> 00:50:56.253
Stefan: Yes. One of those?
Mic 1: In here. Yeah. Would you be able to

00:50:56.253 --> 00:51:00.309
connect everything interchangeably in a
mesh network?

00:51:00.309 --> 00:51:04.030
Szymon: So what you are probably asking
about is whether we can build our own

00:51:04.030 --> 00:51:08.166
FPGA, like a programmable logic device.
Mic 1: Probably.

00:51:08.166 --> 00:51:11.074
Szymon: Yeah. And so this we typically
don't do, because in our experiments, our

00:51:11.074 --> 00:51:15.857
power budget is also very limited, so we
cannot really afford this level of

00:51:15.857 --> 00:51:20.903
complexity. So of course you can make your
FPGA design radiation hard, but this is

00:51:20.903 --> 00:51:24.890
not what we will typically do in our
experiments.

00:51:24.890 --> 00:51:28.630
Herald: Next question goes to microphone
number two.

00:51:28.630 --> 00:51:32.059
Mic 2: Hi, I would like to ask if the
orientation of your transistors and your

00:51:32.059 --> 00:51:38.029
chip is part of your design. So mostly you
have something like a bounding box around

00:51:38.029 --> 00:51:42.921
your design and with an attack surface in
different sizes. So do you use this

00:51:42.921 --> 00:51:48.350
orientation to minimize the attack surface
of the radiation on chips, if you know

00:51:48.350 --> 00:51:52.616
the source of the radiation?
Szymon: No. So I don't think we do that.

00:51:52.616 --> 00:51:58.515
So, of course, we control the orientation
of transistors during the design phase.

00:51:58.515 --> 00:52:02.651
But usually in our experiment, the
radiation is really perpendicular to the

00:52:02.651 --> 00:52:07.981
chip area, which means that if you rotate
it by 90 degrees, you don't really gain

00:52:07.981 --> 00:52:12.082
that much. And moreover, our chips,
usually they are mounted in a bigger

00:52:12.082 --> 00:52:16.625
system where we don't control how they are
oriented.

00:52:16.625 --> 00:52:24.420
Herald: Again, microphone number two.
Mic 2: Do you take metastability into

00:52:24.420 --> 00:52:33.140
account when designing voters?
Szymon: The voter itself is combinatorial.

00:52:33.140 --> 00:52:38.820
So ... -
Mic 2: Yeah, but if the state of the rest

00:52:38.820 --> 00:52:45.300
can change at any time, then the
voters can have like glitches, yeah?

00:52:45.300 --> 00:52:51.140
Szymon: Correct. So that's why - so to
avoid this, we don't take it into account

00:52:51.140 --> 00:52:55.060
during the design phase. But if we use
that scheme which is just displayed here,

00:52:55.060 --> 00:52:58.980
we avoid this problem altogether, right?
Because even if you have metastability in

00:52:58.980 --> 00:53:05.300
one of the blocks like A, B or C, then it
will be fixed in the next clock cycle.

00:53:05.300 --> 00:53:09.940
Because usually our systems operate on
clocks with low frequencies, hundreds of

00:53:09.940 --> 00:53:13.236
megahertz, which means that any
metastability should be resolved by the next

00:53:13.236 --> 00:53:15.065
clock cycle.
Mic 2: Thank you.

00:53:15.065 --> 00:53:19.145
Herald: Next question microphone number
one.

00:53:19.145 --> 00:53:23.014
Mic 1: How do you handle the register
duplication that can be performed by

00:53:23.014 --> 00:53:27.947
synthesis and place and route? So the tools
will try to optimize timing sometimes by

00:53:27.947 --> 00:53:32.375
adding registers. And these registers are
not triplicated.

00:53:32.375 --> 00:53:35.784
Stefan: Yes. So what we do is that I mean,
in a typical, let's say, standard ASIC

00:53:35.784 --> 00:53:40.405
design flow, this is not what happens. So
you have to actually instruct the tool to do

00:53:40.405 --> 00:53:44.585
that, to do retiming and add additional
registers. But for what we are doing, we

00:53:44.585 --> 00:53:48.174
have to, let's say, not do this
optimization and instruct the tool to keep

00:53:48.174 --> 00:53:52.823
all the registers we described in our RTL
code, to keep them until the very end. And

00:53:52.823 --> 00:53:56.908
we really also constrain them to always
keep their associated logic triplicated.

00:53:56.908 --> 00:54:01.759
Herald: The next question is from the
internet.

00:54:01.759 --> 00:54:07.887
Internet: Do you have some simple tips for
improving radiation tolerance?

00:54:07.887 --> 00:54:12.020
Stefan: Simple tips? Ahhhm...
Szymon: Put your electronics inside a

00:54:12.020 --> 00:54:12.820
box.
Stefan: Yes.

00:54:12.820 --> 00:54:17.380
<i>some laughter</i>
There's just no

00:54:17.380 --> 00:54:22.980
single one size fits all textbook recipe
for this as it really always comes down to

00:54:22.980 --> 00:54:28.020
analyzing your environment, really getting
an awareness first of what rate and what

00:54:28.020 --> 00:54:31.940
number of events you are looking at, what
type of particles cause them, and then

00:54:31.940 --> 00:54:36.420
take the appropriate measures to mitigate
them. So there is no one size fits all

00:54:36.420 --> 00:54:38.095
thing, I'd say.
Herald: Next question goes to microphone

00:54:38.095 --> 00:54:41.620
number two.
Mic 2: Hi. Thanks for the talk. How much

00:54:41.620 --> 00:54:47.611
of the software you use for design is
actually open source? I only know a super

00:54:47.611 --> 00:54:54.495
expensive chip design software.
Stefan: You're right, the core of all the

00:54:54.495 --> 00:55:00.604
implementation tools, like the synthesis
and place and route stages for the ASICs

00:55:00.604 --> 00:55:04.987
that we design, is actually commercial
closed source tools. And if

00:55:04.987 --> 00:55:10.443
you're asking for the fraction, that's a
bit hard to answer. I cannot give a

00:55:10.443 --> 00:55:14.518
statement about the size of the commercial
closed tools. But we try, with

00:55:14.518 --> 00:55:18.638
everything we develop, to make it
available to the widest possible audience

00:55:18.638 --> 00:55:22.353
and therefore decided to make the
extensions to this design flow available

00:55:22.353 --> 00:55:26.237
in public form. And that's why these
tools that we develop and share among the

00:55:26.237 --> 00:55:30.541
community of ASIC designers in this
environment are open source.

00:55:30.541 --> 00:55:35.196
Herald: Microphone number four.
Mic 4: Have you ever tried using steered

00:55:35.196 --> 00:55:41.098
ion beams for more localized radiation
ingress testing?

00:55:41.098 --> 00:55:44.495
Stefan: Yes, indeed! And the picture I
showed actually, uh, I didn't mention

00:55:44.495 --> 00:55:49.311
that, but the facility you saw here is
actually a facility in Darmstadt in

00:55:49.311 --> 00:55:53.366
Germany and is actually a microbeam
facility. So it's a facility that allows

00:55:53.366 --> 00:55:58.400
steering a heavy ion beam really on a
single position with less than a

00:55:58.400 --> 00:56:01.808
micrometer accuracy. So it provides
probably exactly what you were asking for.

00:56:01.808 --> 00:56:05.854
But that's not the typical case. That is
really a special thing. And it's probably

00:56:05.854 --> 00:56:09.405
also the only facility in Europe that can
do that.

00:56:09.405 --> 00:56:13.316
Herald: Microphone number one.
Mic 1: Very good talk. Thank

00:56:13.316 --> 00:56:19.282
you very much. My question is, did you
compare what you did to what is done for

00:56:19.282 --> 00:56:25.380
securing secret chips? You know, when you
have credit card chips, you can make fault

00:56:25.380 --> 00:56:29.949
attacks into them so you can make them
malfunction and extract the cryptographic

00:56:29.949 --> 00:56:33.830
key for example from the banking card.
There are techniques here to harden these

00:56:33.830 --> 00:56:38.207
chips against fault attacks. So those are
like voluntary faults, while you have like

00:56:38.207 --> 00:56:43.121
random faults due to, like, involuntary
attacks. You know? Can you explain if

00:56:43.121 --> 00:56:47.294
you compared in a way what you did to
this?

00:56:47.294 --> 00:56:50.861
Stefan: Um, so no, we didn't explicitly
compare it, but it is right that the

00:56:50.861 --> 00:56:54.427
techniques we present can also be used in
a variety of different contexts. So one

00:56:54.427 --> 00:56:59.134
thing that's not exactly what you are
referring to, but relatively on a similar

00:56:59.134 --> 00:57:03.513
scale is that currently in very small
technologies you get problems with the

00:57:03.513 --> 00:57:07.855
reliability and yield of the manufacturing
process itself, meaning that sometimes

00:57:07.855 --> 00:57:11.721
just the metal interconnection between two
gates and your circuit might be broken

00:57:11.721 --> 00:57:16.297
after manufacturing, and then adding this
sort of redundancy with the same kinds of

00:57:16.297 --> 00:57:20.576
techniques can be used to
produce more working chips out of a

00:57:20.576 --> 00:57:24.715
manufacturing run. So in this sort of
context, these sorts of techniques are

00:57:24.715 --> 00:57:30.674
used very often these days. But, um,
I'm pretty sure they can be applied to

00:57:30.674 --> 00:57:34.953
these sorts of, uh, security fault attack
scenarios as well.

00:57:34.953 --> 00:57:39.703
Herald: Next question from microphone
number two.

00:57:39.703 --> 00:57:44.126
Mic 2: Hi, you briefly also mentioned the
mitigation techniques on the cell level

00:57:44.126 --> 00:57:52.426
and yesterday there was a very nice talk
from the LibreSilicon people and they

00:57:52.426 --> 00:57:55.914
are trying to build a standard cell
library, uh, open source standard cell

00:57:55.914 --> 00:58:00.015
library. So are you in contact with them
or maybe you could help them to improve

00:58:00.015 --> 00:58:03.980
their design and then the radiation
hardness?

00:58:03.980 --> 00:58:07.430
Stefan: No. We also saw the talk
yesterday, but we are not yet in

00:58:07.430 --> 00:58:14.180
contact with them. No.
Herald: Does the Internet have questions?

00:58:14.180 --> 00:58:21.380
Internet: Yes, I do. Um, two in fact.
First one would be: would TTL or other BJT

00:58:21.380 --> 00:58:26.740
based logic be more resistant?
Szymon: Uh, yeah. So depending on which

00:58:26.740 --> 00:58:31.126
type of errors we are considering. So BJT
transistors, they have ...

00:58:31.126 --> 00:58:35.917
Stefan in his part mentioned that
displacement damage is not a problem for

00:58:35.917 --> 00:58:40.305
CMOS devices, but it is not the case
for BJT devices. So when they are exposed

00:58:40.305 --> 00:58:47.074
to high energy hadrons or protons,
they degrade a lot. So that's why we don't

00:58:47.074 --> 00:58:52.393
use them, really, in our environment. They
could be probably much more robust to

00:58:52.393 --> 00:58:57.369
single event effects because their
resistance everywhere is much lower. But

00:58:57.369 --> 00:59:01.633
they would have other problems. And also
another problem which is worth

00:59:01.633 --> 00:59:06.204
mentioning is that for those devices, they
consume much, much, much more power, which

00:59:06.204 --> 00:59:13.041
we cannot afford in our applications.
Internet: And the last one would be how do

00:59:13.041 --> 00:59:19.396
I use the output of the full TMR setup? Is
it still three signals? How do I know

00:59:19.396 --> 00:59:26.260
which one to use and to trust?
Stefan: Um, yes. So with this, um,

00:59:26.260 --> 00:59:30.047
architecture, what you could either do is
really apply the full triplication scheme

00:59:30.047 --> 00:59:34.804
to your whole logic tree basically and
really triplicate everything or, and

00:59:34.804 --> 00:59:38.903
that's going in the direction of one of
the lessons learned I had, at some point

00:59:38.903 --> 00:59:43.261
of course you have an interface to your
chip, so you have pins left and right that

00:59:43.261 --> 00:59:46.630
are inputs and outputs. And then you have
to decide either you want to spend the

00:59:46.630 --> 00:59:51.025
effort and also have three dedicated input
pins for each of the signals, or you at

00:59:51.025 --> 00:59:54.260
some point have the voter and say, okay.
At this point, all these signals are

00:59:54.260 --> 00:59:58.202
combined. But I was able to reduce the
amount of sensitive area in my chip

00:59:58.202 --> 01:00:03.780
significantly and can live with the very
small remaining sensitive area that just

01:00:03.780 --> 01:00:07.460
the input and output pins provide.
Szymon: So maybe I will add one more thing

01:00:07.460 --> 01:00:11.780
is that typically in our systems, of
course we triplicate our logic internally,

01:00:11.780 --> 01:00:15.300
but when we interface with the external
world, we can apply another protection

01:00:15.300 --> 01:00:20.340
mechanism. So for example, for our high
speed serialisers, we will use different types

01:00:20.340 --> 01:00:23.733
of encoding to add protect..., 
to add like forward error correction

01:00:23.733 --> 01:00:30.340
codes which would allow us to recover from these
types of faults in the backend later on.
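
NOTE
The idea of forward error correction on a link can be sketched in Python.
The talk does not say which code the serialisers use, so Hamming(7,4) is
swapped in here purely as a well-known textbook example. The function names
hamming74_encode and hamming74_decode are hypothetical.

NOTE
```python
# Illustrative only: the talk does not name the code; Hamming(7,4) is an assumption.
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits with 3 parity bits."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # bit positions 1..7
def hamming74_decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # parity over positions 4, 5, 6, 7
    error_pos = s1 + 2 * s2 + 4 * s3      # 0 means no error, else 1-based position
    if error_pos:
        c[error_pos - 1] ^= 1             # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]
word = [1, 0, 1, 1]
received = hamming74_encode(word)
received[5] ^= 1                          # a single bit flip on the link
assert hamming74_decode(received) == word
```

NOTE
The receiver recovers the original four bits without asking for a resend,
which is the property that lets faults be repaired in the backend later on.
This code only corrects one flipped bit per 7-bit word.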

01:00:30.340 --> 01:00:36.522
Herald: Okay. If ...if we keep this very,
very short. Last question goes to

01:00:36.522 --> 01:00:41.401
microphone number two.
Mic 2: I don't know much about physics. So

01:00:41.401 --> 01:00:47.370
just the question, how important is the
physical testing after the chip is

01:00:47.370 --> 01:00:51.895
manufactured? Isn't the simulation, the
computer simulation enough if you just

01:00:51.895 --> 01:00:56.332
shoot particles at it?
Stefan: Yes and no. So in principle, of

01:00:56.332 --> 01:01:01.267
course, you are right that you should be
able to simulate all the effects we look

01:01:01.267 --> 01:01:06.531
at. The problem is that as the designs
grow big, and they do grow bigger as the

01:01:06.531 --> 01:01:10.892
technologies shrink, the
final netlist that you end up with

01:01:10.892 --> 01:01:15.175
can have millions or billions of nodes, and
it just is not feasible anymore to

01:01:15.175 --> 01:01:19.558
simulate it exhaustively, because you
have so many dimensions you have to

01:01:19.558 --> 01:01:25.852
change when you inject, for example, bit
flips or transients in your design in any

01:01:25.852 --> 01:01:30.745
of those nodes for varying time offsets.
And the state space the circuit

01:01:30.745 --> 01:01:34.553
can be in is just too huge to capture in a
full simulation. So it's not possible

01:01:34.553 --> 01:01:38.803
to exhaustively test it in simulation. And
so typically you end up with having missed

01:01:38.803 --> 01:01:43.048
something that you discover only in the
physical testing afterwards, which you

01:01:43.048 --> 01:01:47.311
always want to do before you put your, uh,
your chip into the final experiment or on your

01:01:47.311 --> 01:01:50.934
satellite and then realise it's not
working as intended. So it has a big

01:01:50.934 --> 01:01:55.540
importance as well.
Herald: Okay. Thank you. Time is up. All

01:01:55.540 --> 01:01:58.584
right. Thank you all very much.
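
NOTE
The scaling argument in the last answer can be made concrete with a rough
Python calculation. Every number here is hypothetical (one million nodes,
one thousand injection time offsets, one second per simulation run); only
the two fault types, bit flips and transients, come from the talk. The
function names are made up for this sketch.

NOTE
```python
# All numbers below are hypothetical, chosen only to show how the campaign size scales.
def injection_runs(nodes: int, time_offsets: int, fault_types: int = 2) -> int:
    """Runs needed to inject every fault type at every node and time offset."""
    return nodes * time_offsets * fault_types
def campaign_years(runs: int, seconds_per_run: float) -> float:
    """Wall-clock years if the runs are simulated one after another."""
    return runs * seconds_per_run / (365 * 24 * 3600)
runs = injection_runs(nodes=1_000_000, time_offsets=1_000)
print(runs)
print(round(campaign_years(runs, seconds_per_run=1.0), 1))
```

NOTE
Even under these modest assumptions the campaign needs two billion runs,
which is decades of sequential simulation time, and that still covers only
one starting state of the circuit. Each additional state multiplies the
count again, which is why physical testing stays important.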

NOTE Paragraph

01:01:58.584 --> 01:02:04.602
<i>applause</i>

01:02:04.602 --> 01:02:09.599
<i>36c3 postroll music</i>

01:02:09.599 --> 01:02:32.100
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!