1
00:00:00,000 --> 00:00:18,406
36C3 intro music
2
00:00:18,406 --> 00:00:22,640
Herald: The next talk will be titled 'How
to Design Highly Reliable Digital
3
00:00:22,640 --> 00:00:26,472
Electronics', and it will be delivered to
you by Szymon and Stefan. Warm applause
4
00:00:26,472 --> 00:00:30,199
for them.
5
00:00:30,199 --> 00:00:36,360
applause
6
00:00:36,360 --> 00:00:41,360
Stefan: All right. Good morning, Congress.
So perhaps every one of you in the room
7
00:00:41,360 --> 00:00:45,600
here has at one point or another in their
lives witnessed their computer behaving
8
00:00:45,600 --> 00:00:50,320
weirdly and doing things that it was not
supposed to do or what you didn't
9
00:00:50,320 --> 00:00:54,400
anticipate it to do. And well, typically
that would have probably been the result
10
00:00:54,400 --> 00:01:00,000
of a software bug of some sort somewhere
inside the huge software stack your PC is
11
00:01:00,000 --> 00:01:04,720
running on. Have you ever considered what
the probability of this weird behavior
12
00:01:04,720 --> 00:01:09,120
being caused by a bit flip somewhere in
the memory of your computer might have
13
00:01:09,120 --> 00:01:16,240
been? So what you can see in this video on
the screen now is a physics experiment
14
00:01:16,240 --> 00:01:20,720
called a cloud chamber. It's a very simple
experiment that is actually able to
15
00:01:20,720 --> 00:01:26,560
visualize and make apparent the
constant stream of background radiation we
16
00:01:26,560 --> 00:01:32,640
all are constantly exposed to. So what's
happening here is that highly energetic
17
00:01:32,640 --> 00:01:39,040
particles, for example from space,
trace through gaseous alcohol and they
18
00:01:39,040 --> 00:01:42,160
collide with alcohol molecules and they
form in this process a trail of
19
00:01:42,160 --> 00:01:48,240
condensation while they do that. And if
you think about your computer, a typical
20
00:01:48,240 --> 00:01:53,200
cell of RAM, of which you might have, I
don't know, 4, 8, 10 gigabytes in your
21
00:01:53,200 --> 00:01:58,400
machine, is only 80 nanometers
wide. So it's very, very tiny. And you
22
00:01:58,400 --> 00:02:02,560
probably are able to appreciate the small
amount of energy that is needed or that is
23
00:02:02,560 --> 00:02:08,480
used to store the information inside each
of those bits. And the sheer amount of
24
00:02:08,480 --> 00:02:12,560
those bits you have in your RAM and your
computer. So a couple of years ago, there
25
00:02:12,560 --> 00:02:17,600
was a study that concluded that in a
computer with about four gigabytes of RAM,
26
00:02:17,600 --> 00:02:23,600
a bit flip caused by such an event by
cosmic background radiation can occur
27
00:02:23,600 --> 00:02:29,200
about once every 33 hours. So a
bit less than one per day. In an
28
00:02:29,200 --> 00:02:34,960
incident in 2008, a Qantas Airlines
flight actually nearly crashed, and the
29
00:02:34,960 --> 00:02:40,080
reason for this incident was traced back,
very likely, to a bit flip
30
00:02:40,080 --> 00:02:44,400
somewhere in one of the CPUs of the
avionics system, which nearly caused the
31
00:02:44,400 --> 00:02:50,480
death of a lot of passengers on this
plane. In 2003, in Belgium, a small
32
00:02:50,480 --> 00:02:56,880
municipal vote actually had a weird hiccup
in which one of the candidates in this
33
00:02:56,880 --> 00:03:02,153
election actually got 4096 more votes added in a single instance.
34
00:03:02,153 --> 00:03:06,480
And that was found to be very likely
caused by cosmic background radiation
35
00:03:06,480 --> 00:03:10,000
flipping a memory cell somewhere that
stored the vote count. And it was only
36
00:03:10,000 --> 00:03:14,560
discovered that this happened because this
number of votes for this particular
37
00:03:14,560 --> 00:03:18,880
candidate was considered unreasonable, but
otherwise it would probably have gotten away
38
00:03:18,880 --> 00:03:27,360
without being detected. So a few words
about us: Szymon and I, we both work at
39
00:03:27,360 --> 00:03:32,480
CERN in the microelectronics section and
we both develop electronics that need to
40
00:03:32,480 --> 00:03:37,360
be tolerant to these sorts of effects. So
we develop radiation tolerant electronics
41
00:03:37,360 --> 00:03:42,846
for the experiments at CERN, at the LHC,
among a lot of other applications. You can
42
00:03:42,846 --> 00:03:48,330
meet the two of us at the Lötlabor Jena
assembly if you are interested in what we
43
00:03:48,330 --> 00:03:55,847
are talking about today. And we will also
give a small talk or a small workshop
44
00:03:55,847 --> 00:03:59,190
about radiation detection tomorrow, in one
of the seminar rooms. So feel free to pass
45
00:03:59,190 --> 00:04:02,544
by there, it will be a quick introduction.
To give you a small idea of what kind of
46
00:04:02,544 --> 00:04:08,541
environment we are working for: So if you
would use one of your default Intel i7
47
00:04:08,541 --> 00:04:14,294
CPUs from your notebook and would put it
anywhere where we operate our electronics,
48
00:04:14,294 --> 00:04:19,632
it would very shortly die in a matter of
probably one or two minutes and it would
49
00:04:19,632 --> 00:04:24,626
die for more than just one reason, which
is rather interesting and compelling. So
50
00:04:24,626 --> 00:04:30,985
the idea for today's talk is to give you
all an insight into all the things that
51
00:04:30,985 --> 00:04:34,575
need to be taken into account when you
design electronics for radiation
52
00:04:34,575 --> 00:04:39,152
environments. What kinds of different
challenges arise when you try to do that.
53
00:04:39,152 --> 00:04:43,116
We classify and explain the different
types of radiation effects that exist. And
54
00:04:43,116 --> 00:04:47,617
then we also present what you can do to
mitigate these effects and also validate
55
00:04:47,617 --> 00:04:52,116
that what you did to care for them or
protect your circuits actually worked. And
56
00:04:52,116 --> 00:04:57,477
of course, as we do that, we'll try to
give our view on how we develop radiation
57
00:04:57,477 --> 00:05:03,257
tolerant electronics at CERN and what our
workflow looks like to make sure this
58
00:05:03,257 --> 00:05:08,272
works. So let's first maybe take a step
back and have a look at what we mean when
59
00:05:08,272 --> 00:05:12,997
we say radiation environments. The first
one that you probably have in mind right
60
00:05:12,997 --> 00:05:19,044
now when you think about radiation is
space. So, this interstellar space is
61
00:05:19,044 --> 00:05:24,292
basically filled with very high-speed,
highly energetic electrons and protons and
62
00:05:24,292 --> 00:05:28,716
all sorts of high energy particles. And
while they, for example, traverse close to
63
00:05:28,716 --> 00:05:34,513
planets such as our Earth - these planets
sometimes do have a magnetic field and the
64
00:05:34,513 --> 00:05:39,317
highly energetic particles are actually
deflected by these magnetic fields and
65
00:05:39,317 --> 00:05:43,824
they can protect the planets, such as our
planet, for example, from this highly
66
00:05:43,824 --> 00:05:47,986
energetic radiation. But in the process,
around these planets, they sometimes
67
00:05:47,986 --> 00:05:52,107
form these radiation belts - known as the
Van Allen belts after the guy who
68
00:05:52,107 --> 00:05:56,043
discovered this effect a long time ago.
And a satellite in space as it orbits
69
00:05:56,043 --> 00:06:01,620
around the Earth might, depending on what
orbit is chosen, sometimes go through
70
00:06:01,620 --> 00:06:05,647
these belts of highly intense radiation.
That, of course, then needs to be taken
71
00:06:05,647 --> 00:06:11,552
into account when designing electronics
for such a satellite. And if Earth itself
72
00:06:11,552 --> 00:06:17,191
is not able to give you enough radiation,
you may think of the very famous Juno
73
00:06:17,191 --> 00:06:22,874
Jupiter mission that became famous
about a year ago. In the
74
00:06:22,874 --> 00:06:28,288
environment of Jupiter they anticipated so
much radiation that they actually decided
75
00:06:28,288 --> 00:06:33,408
to put all the electronics of the
satellite inside a one centimeter thick
76
00:06:33,408 --> 00:06:39,831
cube of titanium, which is famously known
as the Juno Radiation Vault. But not only
77
00:06:39,831 --> 00:06:43,870
space offers radiation environments.
Another form of radiation you probably all
78
00:06:43,870 --> 00:06:48,292
recognize when I show you this
picture, which is an X-ray image of a
79
00:06:48,292 --> 00:06:54,936
hand. And X-rays are also considered a form
of radiation. And while, of course, the
80
00:06:54,936 --> 00:07:01,320
doses or amounts of radiation any patient
is exposed to while doing diagnosis or
81
00:07:01,320 --> 00:07:05,801
treatment of some disease are small, that might not
be the full story when it comes to medical
82
00:07:05,801 --> 00:07:10,220
applications. So this is a medical
particle accelerator which is used for
83
00:07:10,220 --> 00:07:15,288
cancer treatment. And in these sorts of
accelerators, typically carbon ions or
84
00:07:15,288 --> 00:07:20,389
protons are accelerated and then focused
and used to treat and selectively destroy
85
00:07:20,389 --> 00:07:25,302
cancer cells in the body. And this comes
already relatively close to the
86
00:07:25,302 --> 00:07:29,695
environment we are working in and working
for. So Szymon and I are working, for
87
00:07:29,695 --> 00:07:36,616
example, on electronics for the CMS
detector inside the LHC, for which we build
88
00:07:36,616 --> 00:07:43,906
dedicated radiation-tolerant integrated
circuits which have to withstand very,
89
00:07:43,906 --> 00:07:49,373
very large amounts and doses of short
lived radiation in order to function
90
00:07:49,373 --> 00:07:54,414
correctly. And if we didn't specifically
design electronics for that, basically the
91
00:07:54,414 --> 00:08:01,893
whole system would never be able to work.
To illustrate a bit how you can imagine
92
00:08:01,893 --> 00:08:06,062
the scale of this environment: This is a
single plot of a collision event that was
93
00:08:06,062 --> 00:08:11,161
recorded in the ATLAS experiment. And each
of those tiny little traces you can make
94
00:08:11,161 --> 00:08:15,997
out in this diagram is actually either one
or multiple secondary particles that were
95
00:08:15,997 --> 00:08:22,166
created in the initial collision of two
proton bunches inside the experiment. And
96
00:08:22,166 --> 00:08:27,501
each of those, of course, races through
the detector electronics, which make these
97
00:08:27,501 --> 00:08:32,817
traces visible, itself then decaying into
multiple other secondary particles which
98
00:08:32,817 --> 00:08:37,856
all go through our electronics. And if
that doesn't sound, let's say, bad enough
99
00:08:37,856 --> 00:08:42,576
for digital electronics, these collisions
happen about 40 million times a second, of
100
00:08:42,576 --> 00:08:47,608
course, multiplying the number of events
or problems they can cause in our
101
00:08:47,608 --> 00:08:54,608
circuits. So we now want to introduce all
the things that can happen, the different
102
00:08:54,608 --> 00:08:59,570
radiation effects. But first, probably we
take a step back and look at what we mean
103
00:08:59,570 --> 00:09:05,805
when we say digital electronics or digital
logic, which we want to focus on today. So
104
00:09:05,805 --> 00:09:11,058
from your university lectures or your
reading, you probably know the first class
105
00:09:11,058 --> 00:09:14,577
of digital logic, which is the
combinatorial logic. So this is typically
106
00:09:14,577 --> 00:09:19,222
logic that just computes a simple Boolean
function of the inputs of a circuit and
107
00:09:19,222 --> 00:09:23,956
produces an output as exemplified with
these AND, OR, NAND and XOR gates that you
108
00:09:23,956 --> 00:09:28,829
see here. But if you want to build - I
mean even though we use those everywhere
109
00:09:28,829 --> 00:09:32,775
in our circuits - you probably also want
to store state in a more complex circuit,
110
00:09:32,775 --> 00:09:37,857
for example in the registers of your CPU,
which store some sort of internal
111
00:09:37,857 --> 00:09:41,736
information. And for that we use the other
class of logic, which is called the
112
00:09:41,736 --> 00:09:44,726
sequential logic. So this is typically
clocked with some system clock frequency
113
00:09:44,726 --> 00:09:50,883
and it changes its output with relation to
the inputs whenever this clock signal changes.
114
00:09:50,883 --> 00:09:54,263
And now let's look at how all
these different logic functionalities are
115
00:09:54,263 --> 00:09:58,292
implemented. So typically nowadays for
that you may know that we use CMOS
116
00:09:58,292 --> 00:10:02,340
technologies and basically represent all
this logic functionality as digital gates
117
00:10:02,340 --> 00:10:10,558
using small P-MOS and N-MOS MOSFET
transistors in CMOS technologies. And if
118
00:10:10,558 --> 00:10:16,408
we kind of try to build a model for more
complex digital circuits, we typically use
119
00:10:16,408 --> 00:10:21,814
something we call the finite state machine
model, in which we use a model that
120
00:10:21,814 --> 00:10:25,822
consists of a combinatorial and a
sequential part. And you can see that the
121
00:10:25,822 --> 00:10:31,031
output of the circuit depends both on the
internal state inside the register as well
122
00:10:31,031 --> 00:10:35,331
as also the input to the combinatorial
logic. And accordingly, also the state
123
00:10:35,331 --> 00:10:40,924
that is internal is always changed by the
inputs as well as the current state. So
124
00:10:40,924 --> 00:10:44,604
this is kind of the simple model for more
complex systems that can be used to model
125
00:10:44,604 --> 00:10:50,214
different effects. Um, now let's try to
actually look at what the radiation can do
126
00:10:50,214 --> 00:10:53,948
to transistors. And for that we are going
to have a quick recap of what the
127
00:10:53,948 --> 00:10:57,895
transistor actually is and what it looks
like. As you may perhaps know, in
128
00:10:57,895 --> 00:11:03,736
CMOS technologies, transistors are built
on wafers of high purity silicon. So this
129
00:11:03,736 --> 00:11:09,074
is a crystalline, very regularly organized
lattice of silicon atoms. And what we do
130
00:11:09,074 --> 00:11:14,092
to form a transistor on such a wafer is
that we add dopants in order to form
131
00:11:14,092 --> 00:11:19,629
diffusion regions, which later will become
the source and drain of our transistors.
132
00:11:19,629 --> 00:11:24,474
And then on top of that we grow a layer of
insulating oxide. And on top of that we
133
00:11:24,474 --> 00:11:28,713
put polysilicon, which forms the gate
terminal of the transistor. And in the end
134
00:11:28,713 --> 00:11:32,813
we end up with an equivalent circuit a bit
like that. And now to put things back into
135
00:11:32,813 --> 00:11:37,670
perspective - you may also note that the
dimensions of these structures are very
136
00:11:37,670 --> 00:11:42,543
tiny. So we talk about tens of nanometers
for some of the dimensions I've outlined
137
00:11:42,543 --> 00:11:47,958
here. And as the technologies shrink,
these become smaller and smaller and
138
00:11:47,958 --> 00:11:52,284
therefore you'll probably also realize or
are able to appreciate the small amount of
139
00:11:52,284 --> 00:11:56,560
energy that is used to store information
inside these digital circuits, which makes
140
00:11:56,560 --> 00:12:02,390
them perhaps more sensitive to radiation.
So let's take a look. What different types
141
00:12:02,390 --> 00:12:08,385
of radiation effects exist? We typically
in this case, differentiate them into two
142
00:12:08,385 --> 00:12:13,268
main classes of events. The first one
would be the cumulative effects, which are
143
00:12:13,268 --> 00:12:17,362
effects that, as the name implies,
accumulate over time. So as the circuit is
144
00:12:17,362 --> 00:12:22,127
placed inside some radiation environment,
over time it accumulates more and more
145
00:12:22,127 --> 00:12:26,969
dose and therefore worsens its performance
or changes how it operates. And on the
146
00:12:26,969 --> 00:12:30,549
other side, we have the Single Event
Effects, which are always events that
147
00:12:30,549 --> 00:12:35,075
happen at some instantaneous point in
time, and then suddenly, without being
148
00:12:35,075 --> 00:12:39,316
predictable, change how the circuit
operates or how it functions or if it
149
00:12:39,316 --> 00:12:43,931
works in the first place or not. So I'm
going to first go into the class of
150
00:12:43,931 --> 00:12:47,685
cumulative effects and then later on,
Szymon will go into the other class of the
151
00:12:47,685 --> 00:12:53,173
Single Event Effects. So in terms of these
accumulating effects, we basically have
152
00:12:53,173 --> 00:12:57,580
two main subclasses: The first one being
ionization or TID effects, for Total
153
00:12:57,580 --> 00:13:02,033
Ionizing Dose - and the second one being
displacement damages. So displacement
154
00:13:02,033 --> 00:13:07,137
damages do exactly what they sound like.
It is all the effects that happen when an
155
00:13:07,137 --> 00:13:11,249
atom in the silicon lattice is actually
displaced, so removed from its lattice
156
00:13:11,249 --> 00:13:15,266
position and actually changes the
structure of the semiconductor. But
157
00:13:15,266 --> 00:13:19,548
luckily, these effects don't have a big
impact in the CMOS digital circuits that
158
00:13:19,548 --> 00:13:23,164
we are looking at today. So we will
disregard them for the moment and we'll be
159
00:13:23,164 --> 00:13:28,120
looking more at the ionization damage, or
TID. So ionization - as a quick recap - is
160
00:13:28,120 --> 00:13:35,901
whenever electrons are removed from or added to
an atom, effectively transforming it into
161
00:13:35,901 --> 00:13:42,747
an ion. And these effects are especially
critical for the circuits we are building
162
00:13:42,747 --> 00:13:46,316
because what they do is that they
change the behavior of the transistors.
163
00:13:46,316 --> 00:13:50,233
And without looking too much into the
semiconductor details, I just want to show
164
00:13:50,233 --> 00:13:55,730
their typical effect that we are concerned
about in this very simple circuit here. So
165
00:13:55,730 --> 00:14:00,348
this is just an inverter circuit
consisting of two transistors here and
166
00:14:00,348 --> 00:14:05,812
there. And what the circuit does in normal
operation is it just takes an input signal
167
00:14:05,812 --> 00:14:10,062
and inverts it and basically gives the
inverted signal at the output. And as the
168
00:14:10,062 --> 00:14:15,549
transistors are irradiated and accumulate
dose, you can see that the edges of the
169
00:14:15,549 --> 00:14:20,391
output signal get slower. So the
transistor takes longer to turn on and off.
170
00:14:20,391 --> 00:14:24,574
And what that does in turn is that it
limits the maximum operation frequency of
171
00:14:24,574 --> 00:14:28,795
your circuit. And of course, that is not
something you want to do. You want your
172
00:14:28,795 --> 00:14:31,723
circuit to operate at some frequency in
your final system. And if the maximum
173
00:14:31,723 --> 00:14:35,600
frequency it can work at degrades over
time, at some point it will fail as the
174
00:14:35,600 --> 00:14:39,276
maximum frequency is just too low. So
let's have a look at what we can do to
175
00:14:39,276 --> 00:14:44,395
mitigate these effects. The first one and
I already mentioned it when talking about
176
00:14:44,395 --> 00:14:48,488
the Juno mission, is shielding. So if you
can actually put a box around your
177
00:14:48,488 --> 00:14:52,586
electronics and shield any radiation from
actually hitting your transistors, it is
178
00:14:52,586 --> 00:14:56,900
obvious that they will last longer and
will suffer less from the radiation damage
179
00:14:56,900 --> 00:15:01,241
than they otherwise would. So this
approach is very often used in space
180
00:15:01,241 --> 00:15:04,988
applications like on satellites, but it's
not very useful if you are actually trying
181
00:15:04,988 --> 00:15:08,209
to measure the radiation with your
circuits as we do, for example, in the
182
00:15:08,209 --> 00:15:12,415
particle accelerators we build integrated
circuits for. So there first of all, we
183
00:15:12,415 --> 00:15:16,344
want to measure the radiation so we cannot
shield our detectors from the radiation.
184
00:15:16,344 --> 00:15:20,592
And also, we don't want to influence the
tracks of these secondary collision
185
00:15:20,592 --> 00:15:24,162
products with any shielding material that
would be in the way. So this is not very
186
00:15:24,162 --> 00:15:28,315
useful in a particle accelerator
environment, let's say. So we have to
187
00:15:28,315 --> 00:15:33,880
resort to different methods. So as I said,
we do have to design our own integrated
188
00:15:33,880 --> 00:15:38,826
circuits in the first place. So we have
some freedom in what we call transistor
189
00:15:38,826 --> 00:15:45,236
level design. So we can actually alter the
dimensions of the transistors. We can make
190
00:15:45,236 --> 00:15:50,055
them larger to withstand larger doses of
radiation and we can use special
191
00:15:50,055 --> 00:15:54,354
techniques in terms of layout that we can
experimentally verify to be more
192
00:15:54,354 --> 00:15:59,266
resistant to radiation effects. And a
third measure, which is probably the most
193
00:15:59,266 --> 00:16:03,491
important one for us, is what we call
modeling. So we actually are able to
194
00:16:03,491 --> 00:16:08,358
characterize all the effects that
radiation will have on a transistor. And
195
00:16:08,358 --> 00:16:12,442
if we can do that, if we know: 'If I
put it into a radiation environment for a
196
00:16:12,442 --> 00:16:17,000
year, how much slower will it become?'
Then it is of course easy to say: 'OK, I
197
00:16:17,000 --> 00:16:20,648
can just over-design my circuit and make
it a bit simpler, maybe have less functionality,
198
00:16:20,648 --> 00:16:24,464
but be able to operate at a
higher frequency and therefore withstand
199
00:16:24,464 --> 00:16:30,240
the radiation effects for a longer time
while still working sufficiently well at
200
00:16:30,240 --> 00:16:35,118
the end of its expected lifetime.' So
that's more or less what we can do about
201
00:16:35,118 --> 00:16:38,254
these effects. And I'll hand over to
Szymon for the second class.
202
00:16:38,254 --> 00:16:42,655
Szymon: Contrary to the cumulative effects
presented by Stefan, the other group are
203
00:16:42,655 --> 00:16:46,424
Single Event Effects, which are caused by
high energy deposits due to
204
00:16:46,424 --> 00:16:52,143
a single particle or shower of particles.
And they can happen at any time, even
205
00:16:52,143 --> 00:16:57,089
seconds after irradiation is started. It
means that if your circuit is vulnerable
206
00:16:57,089 --> 00:17:01,667
to this class of effects, it can fail
immediately after radiation is present.
207
00:17:01,667 --> 00:17:06,313
And here we also classify these effects
into several groups. The first are hard,
208
00:17:06,313 --> 00:17:11,450
or permanent, errors, which as the name
indicates can permanently destroy your
209
00:17:11,450 --> 00:17:20,260
circuit. And this type of error is
typically critical for power devices where
210
00:17:20,260 --> 00:17:24,340
you have large power densities and they
are not so much of a problem for digital
211
00:17:24,340 --> 00:17:30,100
circuits. The other class of effects
are soft errors. And here we distinguish
212
00:17:30,100 --> 00:17:34,100
transient, or Single Event Transient
errors, which are spurious signals
213
00:17:34,100 --> 00:17:41,220
propagating in your circuit as a result of
a gate being hit by a particle and they
214
00:17:41,220 --> 00:17:45,700
are especially problematic for analog
circuits or asynchronous digital circuits,
215
00:17:45,700 --> 00:17:51,460
but under some circumstances they can be
also problematic for synchronous systems.
216
00:17:51,460 --> 00:17:56,420
And the other class of problems are
static, or Single Event Upset problems,
217
00:17:56,420 --> 00:18:01,220
which basically means that your memory
element like a register gets flipped. And
218
00:18:01,220 --> 00:18:05,060
then of course, if your system is not
designed to handle this type of error
219
00:18:05,060 --> 00:18:09,620
properly, it can lead to a failure. So in
the following part of the presentation
220
00:18:09,620 --> 00:18:15,300
we'll focus mostly on soft errors. So
let's try to understand what is the origin
221
00:18:15,300 --> 00:18:20,820
of this type of problem. So as Stefan
mentioned, the typical transistor is built
222
00:18:20,820 --> 00:18:25,230
out of diffusions, gate and channel. So
here you can see one diffusion. Let's
223
00:18:25,230 --> 00:18:29,230
assume that it is a drain diffusion. And
then when a particle goes through and
224
00:18:29,230 --> 00:18:36,700
deposits charge, it creates free electron and
hole pairs, which then in the presence of
225
00:18:36,700 --> 00:18:43,320
electric fields, get collected by
means of drift, which results in a large
226
00:18:43,320 --> 00:18:46,930
current spike, which is very short. And
then the rest of the charge could be
227
00:18:46,930 --> 00:18:50,940
collected by diffusion which is a much
slower process and therefore also the
228
00:18:50,940 --> 00:18:56,390
amplitude of the event is much, much
smaller. So let's try to understand what
229
00:18:56,390 --> 00:19:01,230
could happen in a typical memory cell. So
on this schematic, you can see the
230
00:19:01,230 --> 00:19:05,740
simplest memory cell, which is composed of
two back-to-back inverters. And let's
231
00:19:05,740 --> 00:19:12,810
assume that node A is at high and node B
is at low potential initially. And then we
232
00:19:12,810 --> 00:19:17,210
have a particle hitting the drain of
transistor M1 which creates a short
233
00:19:17,210 --> 00:19:22,590
circuit current between drain and ground,
bringing the drain of transistor M1 to low
234
00:19:22,590 --> 00:19:29,871
potential, which also acts on the gates of
the second inverter, temporarily changing its
235
00:19:29,871 --> 00:19:38,734
state from low to high, which reinforces
the wrong state in the first inverter. And
236
00:19:38,734 --> 00:19:45,340
at this time the error is locked in your
memory cell and you basically lost your
237
00:19:45,340 --> 00:19:49,652
information. So you may be asking
yourself: 'How much charge is needed
238
00:19:49,652 --> 00:19:54,281
really to flip the state of a memory cell?'
And you can get this number from either
239
00:19:54,281 --> 00:19:59,952
simulations or from measurements. So let's
assume that what we could do, we could try
240
00:19:59,952 --> 00:20:04,605
to inject some current into the sensitive
node, for example, the drain of transistor M1.
241
00:20:04,605 --> 00:20:08,790
And here what I will show is that on the
top plot you will have current as a function
242
00:20:08,790 --> 00:20:13,484
of time. On the second plot you will have
output voltage. So voltage at node B as a
243
00:20:13,484 --> 00:20:19,121
function of time and at the lowest plot you
will see a probability of having a bit
244
00:20:19,121 --> 00:20:23,097
flip. So if you inject very little
current, of course nothing changes at the
245
00:20:23,097 --> 00:20:27,670
output, but once you start increasing the
amount of current you are injecting, you
246
00:20:27,670 --> 00:20:33,306
see that something appears at the output
and at some point the output will toggle,
247
00:20:33,306 --> 00:20:39,747
so it will switch to the other state. And
at this point, if you really calculate
248
00:20:39,747 --> 00:20:46,369
what is the area under the current curve
you can find what is the critical charge
249
00:20:46,369 --> 00:20:53,499
needed to flip the memory cell. And if you
go further, if you start injecting even
250
00:20:53,499 --> 00:21:00,701
more current, you will not see that much
difference in the output voltage waveform.
251
00:21:00,701 --> 00:21:05,112
It could become only slightly faster. And
at this point, you also can notice that
252
00:21:05,112 --> 00:21:09,528
the probability now jumped to one, which
means that any time you inject so much
253
00:21:09,528 --> 00:21:17,431
current there is a fault in your circuit.
So for now, we just found what is the
254
00:21:17,431 --> 00:21:23,414
probability of having a bit-flip from 0 to
1 in node B. Of course we should also
255
00:21:23,414 --> 00:21:27,904
calculate the same for the other
direction, so from 1 to 0. And usually
256
00:21:27,904 --> 00:21:32,377
it is slightly different. And then of
course we should inject in all the other
257
00:21:32,377 --> 00:21:37,817
nodes, for example node B and also should
study all possible transitions. And then
258
00:21:37,817 --> 00:21:43,492
at the end, if you calculate the
superposition of these effects and you
259
00:21:43,492 --> 00:21:48,655
multiply them by the active area of each
node, you will end up with what we call
260
00:21:48,655 --> 00:21:52,420
the cross section, which has a dimension
of centimeters squared, which will tell
261
00:21:52,420 --> 00:21:57,357
you how sensitive your circuit is to this
type of effects. And then knowing the
262
00:21:57,357 --> 00:22:03,761
radiation profile of your environment, you
can calculate the expected upset rate in
263
00:22:03,761 --> 00:22:10,105
the final application. So now, having
covered the basics of the Single Event
264
00:22:10,105 --> 00:22:16,517
effects, let's try to check how we can
mitigate them. And here also technology
265
00:22:16,517 --> 00:22:20,875
plays a significant role. So of course,
newer technologies offer us much smaller
266
00:22:20,875 --> 00:22:26,692
devices. And together with that, what
follows is that usually supply voltages
267
00:22:26,692 --> 00:22:31,047
are getting smaller and smaller as well as
the node capacitance, which means that for
268
00:22:31,047 --> 00:22:35,565
our Single Event Upsets it is very bad
because the critical charge which is
269
00:22:35,565 --> 00:22:40,207
required to flip our bit is getting less
and less. But at the same
270
00:22:40,207 --> 00:22:44,135
time, physical dimensions of our
transistors are getting smaller, which
271
00:22:44,135 --> 00:22:48,097
means that the cross section for them
being hit is also getting smaller. So
272
00:22:48,097 --> 00:22:52,495
overall, the effects really depend on the
circuit topology and the radiation
273
00:22:52,495 --> 00:22:59,181
environment. So another protection method
could be introduced on the cell level. And
274
00:22:59,181 --> 00:23:04,914
here we could imagine increasing the
critical charge. And that could be done in
275
00:23:04,914 --> 00:23:10,819
the easiest way by just increasing the
node capacitance by, for example, using
276
00:23:10,819 --> 00:23:16,096
larger transistors. But of course, this
also enlarges the collection electrode,
277
00:23:16,096 --> 00:23:22,657
which is not nice. And another way could
be to just increase the capacitance by adding
278
00:23:22,657 --> 00:23:28,336
some extra metal capacitance, but it, of
course, slows down the circuit. Another
279
00:23:28,336 --> 00:23:33,615
approach could be to try to store the
information on more than two nodes. So I
280
00:23:33,615 --> 00:23:38,377
showed you that on a simple SRAM cell we
store information only on two nodes, so
281
00:23:38,377 --> 00:23:43,102
you could try to come up with some other
cells, for example, like that one in which
282
00:23:43,102 --> 00:23:47,406
the information is stored on four nodes.
So you can see that the architecture is
283
00:23:47,406 --> 00:23:53,800
very similar to the basic SRAM cell. But
you should always take care to very
284
00:23:53,800 --> 00:23:59,000
carefully simulate your design, because if
you analyze this circuit, you will quickly
285
00:23:59,000 --> 00:24:02,936
realize that in this circuit, even though the
information is stored in four different
286
00:24:02,936 --> 00:24:09,867
nodes, the same type of loop exists as in
the basic circuit. Meaning that at the end
287
00:24:09,867 --> 00:24:15,227
the circuit offers basically no hardening
with respect to the previous cell. So
288
00:24:15,227 --> 00:24:21,074
actually we can do it better. So here you
can see a typical dual interlocked cell.
289
00:24:21,074 --> 00:24:26,445
So the number of transistors is exactly
the same as in the previous example, but
290
00:24:26,445 --> 00:24:30,819
now they are interconnected slightly
differently. And here you can see that
291
00:24:30,819 --> 00:24:36,262
this cell also has two stable configurations.
But this time the low
292
00:24:36,262 --> 00:24:40,587
level from a given node can propagate
only to the left hand side, while high
293
00:24:40,587 --> 00:24:47,872
level can propagate to the right hand
side. And each stage being inverting means
294
00:24:47,872 --> 00:24:54,918
that the fault cannot propagate for more
than one node. Of course, this cell has
295
00:24:54,918 --> 00:25:00,379
some drawbacks: It consumes more area than
a simple SRAM cell and also write access
296
00:25:00,379 --> 00:25:04,240
requires accessing at least two nodes at
the same time to really change the state
297
00:25:04,240 --> 00:25:09,801
of the cell. And so you may ask yourself,
how effective is this cell? So here I will
298
00:25:09,801 --> 00:25:13,709
show you a cross section plot. So it is
the probability of having an error as a
299
00:25:13,709 --> 00:25:18,883
function of injected energy. And as a
reference, you can see a pink curve on the
300
00:25:18,883 --> 00:25:25,650
top, which is for a normal, unprotected
cell. And in green you can see the
301
00:25:25,650 --> 00:25:31,399
cross section for the error in the DICE
cell. So as you can see, it is one order
302
00:25:31,399 --> 00:25:36,934
of magnitude better than the normal cell.
But still, the cross section is far from
303
00:25:36,934 --> 00:25:41,426
being negligible. So the problem was
investigated, and it was identified that the
304
00:25:41,426 --> 00:25:45,679
problem was caused by the fact that some
sensitive nodes were very close together
305
00:25:45,679 --> 00:25:50,807
on the layout and therefore they could be
upset by the same particle. Because as we
306
00:25:50,807 --> 00:25:54,721
mentioned, single devices are very
small. We are talking about dimensions
307
00:25:54,721 --> 00:25:59,675
below a micron. So after realizing that,
we designed another cell in which we
308
00:25:59,675 --> 00:26:04,799
separated the sensitive nodes more and we
ended up with the blue curve, and as you
309
00:26:04,799 --> 00:26:08,907
can see the cross section was reduced by
two more orders of magnitude and the
310
00:26:08,907 --> 00:26:14,205
threshold was increased significantly. So
if you don't want to redesign your
311
00:26:14,205 --> 00:26:18,771
standard cells, you could also apply some
mitigation techniques on block level. So
312
00:26:18,771 --> 00:26:24,717
here we can use some encoding to encode
our state better. And as an example, I
313
00:26:24,717 --> 00:26:31,540
will show you a typical Hamming code. So
to protect four bits, we have to add three
314
00:26:31,540 --> 00:26:38,052
additional parity bits which are calculated
according to this formula. And then once
315
00:26:38,052 --> 00:26:44,133
you calculate the parity bits, you can use
those to check the state integrity of your
316
00:26:44,133 --> 00:26:50,360
internal state. And if any of the parity
checks is not equal to zero, then the check bits
317
00:26:50,360 --> 00:26:55,375
together form the syndrome,
indicating where the error happened. And
318
00:26:55,375 --> 00:26:59,916
you can use this information to correct
the error. Of course, in this case, the
319
00:26:59,916 --> 00:27:06,533
efficiency is not really nice because we
need three additional bits to protect only
320
00:27:06,533 --> 00:27:11,828
four bits of information. But as the state
length increases, the protection also becomes
321
00:27:11,828 --> 00:27:18,855
more efficient. Another approach would be
to do even less. Meaning that instead of
322
00:27:18,855 --> 00:27:23,970
changing anything in your design,
you can just triplicate your design or
323
00:27:23,970 --> 00:27:30,190
replicate it several times and just vote on
which state is correct. So this concept is
324
00:27:30,190 --> 00:27:35,046
called triple modular redundancy and it is
based around a voter cell. So it is a
325
00:27:35,046 --> 00:27:40,210
cell which has an odd number of
inputs, and its output is always equal to
326
00:27:40,210 --> 00:27:45,040
the majority of its inputs. And as I mentioned,
the idea is that you have, for
327
00:27:45,040 --> 00:27:49,292
example, three circuits: A, B and C, and
during normal operation, when they are
328
00:27:49,292 --> 00:27:54,471
identical, the output is also the same.
However, when there is a problem, for
329
00:27:54,471 --> 00:28:00,957
example, in logic part B, its output
is affected. But this problem is
330
00:28:00,957 --> 00:28:05,509
effectively masked by the voter cell
and it is not visible from outside of the
331
00:28:05,509 --> 00:28:10,383
circuit. But you have to be careful not to
take this picture as a design
332
00:28:10,383 --> 00:28:15,501
template. So let's try to analyze what
would happen with a state machine
333
00:28:15,501 --> 00:28:20,329
similar to what Stefan introduced if you
were to just use this concept. So here you
334
00:28:20,329 --> 00:28:24,859
can see three state machines and
a voter on the output. And as we can see,
335
00:28:24,859 --> 00:28:29,484
if you have an upset in, for example, the
state register A, then the state is
336
00:28:29,484 --> 00:28:36,676
broken. But still the output of the
circuit, which is indicated by the letter S, is
337
00:28:36,676 --> 00:28:42,355
correct because the B and C registers are
still fine. But what happens if some time
338
00:28:42,355 --> 00:28:49,283
later we have an upset in memory element B
or C? Then of course the state
339
00:28:49,283 --> 00:28:56,028
of our system is broken and we can not
recover it. So you can ask yourself what
340
00:28:56,028 --> 00:29:02,204
can we do better in order to avoid this
situation? So, just to be sure: please
341
00:29:02,204 --> 00:29:06,654
do not use this technique to protect your
circuits. So the easiest mitigation could
342
00:29:06,654 --> 00:29:13,201
be to use, as the input to your logic,
the output of the voter cell itself.
343
00:29:13,201 --> 00:29:18,491
What it offers us is that now whenever you
have an upset in one of the memory
344
00:29:18,491 --> 00:29:22,933
elements for the next computation, for the
next stage, we always use the voter
345
00:29:22,933 --> 00:29:27,631
output, which ensures that the error
will be removed one clock cycle later. So
346
00:29:27,631 --> 00:29:32,726
if you have another hit some time later,
basically, it will not affect our state.
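The difference between the plain voter on the output and the voter feeding back into the logic can be sketched in a few lines of Python. This is only an illustrative behavioral model, not the speakers' circuit: the 4-bit counter used as the state machine, the function names `majority`, `next_state` and `run`, and the chosen upset times are all assumptions made for this example.

```python
def majority(a, b, c):
    """Bitwise majority vote of three register words (the voter cell)."""
    return (a & b) | (a & c) | (b & c)


def next_state(state):
    """Hypothetical state machine logic: a 4-bit counter."""
    return (state + 1) & 0xF


def run(feedback_through_voter, upsets, cycles=8):
    """Simulate three register copies A, B, C behind one voter.

    upsets maps a clock cycle to (copy_index, bit_index); that bit of
    that register copy is flipped at the start of the cycle.
    Returns the voter output seen in every cycle.
    """
    regs = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        if cycle in upsets:
            copy_index, bit_index = upsets[cycle]
            regs[copy_index] ^= 1 << bit_index  # single event upset
        voted = majority(*regs)
        outputs.append(voted)
        if feedback_through_voter:
            # Every copy computes its next state from the voted value,
            # so a corrupted copy is overwritten one clock cycle later.
            regs = [next_state(voted)] * 3
        else:
            # Every copy only sees its own register, so an upset stays
            # in that copy forever.
            regs = [next_state(r) for r in regs]
    return outputs


# Upset in copy A at cycle 2, then in copy B (same bit) at cycle 5.
upsets = {2: (0, 3), 5: (1, 3)}
golden = run(True, {})                 # no upsets at all
without_feedback = run(False, upsets)
with_feedback = run(True, upsets)

print("golden:          ", golden)
print("without feedback:", without_feedback)
print("with feedback:   ", with_feedback)
```

In this model the first upset is masked in both variants, but without feedback the second upset leaves two corrupted copies and the voter output goes wrong from cycle 5 on, while the variant with feedback keeps matching the upset-free run.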
347
00:29:32,726 --> 00:29:39,765
Until now we have considered only upsets in our
registers but what happens if we have
348
00:29:39,765 --> 00:29:45,885
a transient in our voter? So you see that
if there is no state change, basically the
349
00:29:45,885 --> 00:29:50,981
transient in the voter doesn't impact
our system. But if you are really unlucky
350
00:29:50,981 --> 00:29:55,777
and the transient happens when the clock
transition happens, so whenever we
351
00:29:55,777 --> 00:30:01,182
latch the data, we can corrupt the state
in three registers at the same time, which
352
00:30:01,182 --> 00:30:05,605
is less than ideal. So to overcome this
limitation, you can consider skewing our
353
00:30:05,605 --> 00:30:11,101
clocks by some time, which is larger than
the maximum transient duration. And now,
354
00:30:11,101 --> 00:30:18,050
because each register samples the
output of the voter at a slightly different
355
00:30:18,050 --> 00:30:23,449
time, we can corrupt only one flip-flop
at a time. So of course, if you are
356
00:30:23,449 --> 00:30:28,780
unlucky, we can have problematic
situations in which one register is
357
00:30:28,780 --> 00:30:33,646
already in the new state while the other register
is still in the old state. And then it
358
00:30:33,646 --> 00:30:39,728
can lead to a non-deterministic result. So it
is better, but still not ideal. So as a
359
00:30:39,728 --> 00:30:46,578
general theme, you have seen that we were
adding and adding more resources so you
360
00:30:46,578 --> 00:30:50,418
can ask yourself what would happen if we
triplicate everything. So in this case,
361
00:30:50,418 --> 00:30:54,262
we triplicated the registers, we
triplicated our logic and our voters. And
362
00:30:54,262 --> 00:30:59,138
now you can see that whenever we have an
upset in our register, it can only affect
363
00:30:59,138 --> 00:31:04,513
one register at a time and the error
will be removed from the system one clock
364
00:31:04,513 --> 00:31:08,912
cycle later. Also, if we have an upset
in the voter or in the logic, it can be
365
00:31:08,912 --> 00:31:13,372
latched only into one register, which means
that in principle we create a system
366
00:31:13,372 --> 00:31:17,885
which is really robust. Unfortunately,
nothing is for free. So here I compare a
367
00:31:17,885 --> 00:31:22,823
few triplication schemes and,
as you can see, the more protection
368
00:31:22,823 --> 00:31:26,326
you want to have, the more you have to pay
in terms of resources, being power and
369
00:31:26,326 --> 00:31:31,373
area. And also, usually, you pay a small
penalty in terms of maximum operational
370
00:31:31,373 --> 00:31:37,597
speed. So which flavor of protection you
use really depends on the
371
00:31:37,597 --> 00:31:42,420
application. So for most sensitive
circuits, you probably want to use
372
00:31:42,420 --> 00:31:48,493
full TMR and you may leave some other
bits of logic unprotected. Alternatively, if
373
00:31:48,493 --> 00:31:54,749
your system is not mission critical and
you can tolerate some downtime, you can
374
00:31:54,749 --> 00:32:00,294
consider scrubbing, which means periodically
checking the state of your system and refreshing it
375
00:32:00,294 --> 00:32:05,120
if necessary when an error is detected, using
some parity bits or a copy of the data in
376
00:32:05,120 --> 00:32:10,394
a safe space. Or you can have a
watchdog which will find out that
377
00:32:10,394 --> 00:32:13,951
something went wrong and it will just
reinitialize the whole system. So now,
378
00:32:13,951 --> 00:32:20,011
having covered the basics of all the effects
we will have to face, we would like
379
00:32:20,011 --> 00:32:24,293
to show you the basic flow which we follow
during designing our radiation hardened
380
00:32:24,293 --> 00:32:29,746
circuits. So of course we always start
with specifications. So we try to
381
00:32:29,746 --> 00:32:34,228
understand our radiation environment in
which the circuit is meant to operate. So
382
00:32:34,228 --> 00:32:38,750
we come up with some specifications for
total dose which could be accumulated and
383
00:32:38,750 --> 00:32:45,348
for the rate of single event upsets. And
at this moment, it is also not very rare
384
00:32:45,348 --> 00:32:49,705
that we have to decide to move some
functionality out of our detector volume,
385
00:32:49,705 --> 00:32:56,133
outside, where we can use off-the-shelf
commercial equipment to do number
386
00:32:56,133 --> 00:33:04,820
crunching. But let's assume that we would
go with our ASIC. So having the
387
00:33:04,820 --> 00:33:09,220
specifications, of course we proceed with
functional implementation. This we
388
00:33:09,220 --> 00:33:14,260
typically do with hardware description
languages, so Verilog or VHDL, which you may
389
00:33:14,260 --> 00:33:18,900
know from a typical FPGA flow. And of course
we write a lot of simulations to
390
00:33:18,900 --> 00:33:24,205
understand whether we are meeting our
functional goals or whether our circuit
391
00:33:24,205 --> 00:33:30,665
behaves as expected. And then we
select some parts of the
392
00:33:30,665 --> 00:33:36,318
circuits which we want to protect from
radiation effects. So, for example, we can
393
00:33:36,318 --> 00:33:42,290
decide to use triplication or some other
methods. So these days we typically use
394
00:33:42,290 --> 00:33:46,645
triplication as the most straightforward
and very effective method. So you can ask
395
00:33:46,645 --> 00:33:50,750
yourself how do we triplicate the logic?
So the simplest could be: Just copy
396
00:33:50,750 --> 00:33:55,099
and paste the code three times, add some
postfixes like A, B and C and you are
397
00:33:55,099 --> 00:34:01,653
done. But of course this solution has some
drawbacks. So it is time consuming and it
398
00:34:01,653 --> 00:34:05,964
is very error prone. So maybe you have
noticed that I had a typo there. So of
399
00:34:05,964 --> 00:34:10,220
course we don't want to do that. So we
developed our own tool, which we called
400
00:34:10,220 --> 00:34:16,924
TMRG, which automates the process of
triplication and eliminates the two main
401
00:34:16,924 --> 00:34:22,494
drawbacks, which I just described. So
after we have our code triplicated and of
402
00:34:22,494 --> 00:34:27,075
course, not before rerunning all the
simulations to make sure that everything
403
00:34:27,075 --> 00:34:34,230
went as expected, we then proceed to the
synthesis process in which we convert our
404
00:34:34,230 --> 00:34:41,091
high-level hardware description code
to gate level netlists, in which all the functions
405
00:34:41,091 --> 00:34:46,189
are mapped to gates, which were introduced
by Stefan, so both combinatorial and
406
00:34:46,189 --> 00:34:53,631
sequential. And here we also have to be
careful because modern CAD tools have a
407
00:34:53,631 --> 00:34:59,020
tendency, of course, to optimise the logic
as much as possible. And our logic in most
408
00:34:59,020 --> 00:35:03,810
of the cases is really redundant. So it is
very easy for it to be removed. So we
409
00:35:03,810 --> 00:35:08,632
really have to make sure that it is not
removed. That's why our tool also provides
410
00:35:08,632 --> 00:35:13,633
some constraints for the synthesizer to
make sure that our design intent is
411
00:35:13,633 --> 00:35:20,900
clearly and well understood by the tool.
And once we have the output netlist, we
412
00:35:20,900 --> 00:35:26,980
proceed to the place and route process, where
this kind of netlist representation is
413
00:35:26,980 --> 00:35:32,580
mapped to a layout of what will soon
become our digital chip, where we place all
414
00:35:32,580 --> 00:35:36,624
the cells and we route connections between
them and here there is
415
00:35:36,624 --> 00:35:40,907
another danger which I mentioned already,
it's that in modern technologies the cells
416
00:35:40,907 --> 00:35:45,597
are so small that several could easily be
affected by a single particle at the same
417
00:35:45,597 --> 00:35:51,892
time. So we have to really space out
the bit cells which are responsible for
418
00:35:51,892 --> 00:35:56,982
keeping the information about the state to
make sure that a single particle cannot
419
00:35:56,982 --> 00:36:04,980
upset, for example, copies A and B
of the same register. And then in the
420
00:36:04,980 --> 00:36:09,540
last step, of course, we have to verify
that everything we have done is
421
00:36:09,540 --> 00:36:13,926
correct. And at this level, we also try to
introduce some single event effects in our
422
00:36:13,926 --> 00:36:19,971
simulations. So we could randomly flip
bits in our system. We can also inject
423
00:36:19,971 --> 00:36:26,094
transients. And typically we used to do
that on the netlist level, which works
424
00:36:26,094 --> 00:36:31,424
very fine. And it is very nice. But the
problem with this approach is that we can
425
00:36:31,424 --> 00:36:37,640
perform these actions very late in the
design cycle, which is less than ideal.
426
00:36:37,640 --> 00:36:43,084
And also, if we find that there is a
problem in our simulation, a typical netlist
427
00:36:43,084 --> 00:36:48,437
at this level probably has a few orders of
magnitude more lines than our initial RTL
428
00:36:48,437 --> 00:36:52,990
code. So to trace back what is the
problematic line of code is not so
429
00:36:52,990 --> 00:36:57,533
straightforward at this point. So you can
ask yourself: why not try to inject
430
00:36:57,533 --> 00:37:05,458
errors in the RTL design? And the answer
is that it is not so
431
00:37:05,458 --> 00:37:10,670
trivial to map the hardware description
language's high level constructs to
432
00:37:10,670 --> 00:37:15,585
what will become combinatorial or
sequential logic. So in order to eliminate
433
00:37:15,585 --> 00:37:20,980
this problem, we also developed another open
source tool, which allows us to...
434
00:37:20,980 --> 00:37:27,860
So we decided to use the Yosys open
source synthesis tool from Clifford, which
435
00:37:27,860 --> 00:37:31,530
was presented at the Congress several
years ago. So we use this tool to make a
436
00:37:31,530 --> 00:37:35,680
first pass through our RTL code to
understand which elements will be mapped
437
00:37:35,680 --> 00:37:40,678
to sequential and combinatorial logic. And then,
having this information, we use
438
00:37:40,678 --> 00:37:45,951
cocotb, a Python verification
framework, which allows us programmatic
439
00:37:45,951 --> 00:37:51,838
access to these nodes and we can
effectively inject the errors in our
440
00:37:51,838 --> 00:37:56,660
simulations. And I forgot to mention that
the TMRG tool is also open source. So if
441
00:37:56,660 --> 00:38:03,841
you are interested in one of the tools,
please feel free to contact us. And of
442
00:38:03,841 --> 00:38:10,505
course, after our simulation is done, then in
the next step we would really tape out. And
443
00:38:10,505 --> 00:38:14,637
so we submit our chip to manufacturing and
hopefully a few months later we receive
444
00:38:14,637 --> 00:38:18,105
our chip back.
Stefan: All right. So after patiently
445
00:38:18,105 --> 00:38:23,546
waiting then for a couple of months while
your chip is in manufacturing and you're
446
00:38:23,546 --> 00:38:28,245
spending time on preparing a test set up
and preparing yourself to actually test if
447
00:38:28,245 --> 00:38:33,772
your chip works as you expect it to. Now,
it's probably also a good time to think
448
00:38:33,772 --> 00:38:38,307
about how to actually validate or test if
all the measures that you've taken to
449
00:38:38,307 --> 00:38:41,389
protect your circuit from radiation
effects actually are effective or if they
450
00:38:41,389 --> 00:38:46,196
are not. And so again, we will split this
in two parts. So you will probably want to
451
00:38:46,196 --> 00:38:50,024
start with testing for the total ionizing
dose effects. So for the cumulative effect
452
00:38:50,024 --> 00:38:54,554
and for that, you typically use X-ray
radiation relatively similar to the one
453
00:38:54,554 --> 00:38:59,005
used in medical treatment. So this
radiation is relatively low-energy,
454
00:38:59,005 --> 00:39:03,344
which has the upside of not producing any
single event effects, so you can really
455
00:39:03,344 --> 00:39:07,462
only accumulate radiation dose and focus
on the accumulating effects. And typically
456
00:39:07,462 --> 00:39:11,600
you would use a machine that looks
somewhat like this, a relatively compact
457
00:39:11,600 --> 00:39:16,840
thing you can have in your laboratory, and
you can use that to really accumulate
458
00:39:16,840 --> 00:39:21,520
large amounts of radiation dose on your
circuit. And then you need some sort of
459
00:39:21,520 --> 00:39:26,641
mechanism to verify or to quantify how
much your circuit slows down due to this
460
00:39:26,641 --> 00:39:31,285
radiation dose. And if you do that, you
typically end up with a graph such as
461
00:39:31,285 --> 00:39:36,567
this one, where on the x axis you have the
radiation dose your circuit was exposed
462
00:39:36,567 --> 00:39:40,639
to. And on the y axis, you see that the
frequency has gone down over time and you
463
00:39:40,639 --> 00:39:44,536
can use this information to say:
"OK, in my final application, I expect this
464
00:39:44,536 --> 00:39:49,324
level of radiation dose, and I can
still see that my circuit will work fine
465
00:39:49,324 --> 00:39:53,565
under some given environmental condition
or some operation condition." So this is
466
00:39:53,565 --> 00:39:58,285
the test for the first class of effects.
And the test for the second class of
467
00:39:58,285 --> 00:40:02,318
effects, the single event effects, is a
bit more involved. So there what you would
468
00:40:02,318 --> 00:40:07,157
typically start to do is go for a heavy
ion test campaign. So you would go to a
469
00:40:07,157 --> 00:40:12,760
specialized, relatively rare facility. We
have a couple of those in Europe, and they would
470
00:40:12,760 --> 00:40:16,532
look perhaps somewhat like this. So it's a
small particle accelerator somewhere.
471
00:40:16,532 --> 00:40:20,794
They typically have
different types of heavy ions at their
472
00:40:20,794 --> 00:40:26,311
disposal that they can accelerate and then
shoot at your chip that you can place in a
473
00:40:26,311 --> 00:40:32,390
vacuum chamber and these ions can deposit
very well known amounts of energy in your
474
00:40:32,390 --> 00:40:36,818
circuit and you can use that information
to characterize your circuit. The downside
475
00:40:36,818 --> 00:40:41,207
is a bit that these facilities tend to be
relatively expensive to access and also a
476
00:40:41,207 --> 00:40:45,161
bit hard to access. So typically you need
to book them a lot of time in advance and
477
00:40:45,161 --> 00:40:50,351
that's sometimes not very easy. But what
it offers you is that you can use different types
478
00:40:50,351 --> 00:40:55,244
of ions with different energies. You can
really make a very well-defined
479
00:40:55,244 --> 00:41:00,190
sensitivity curve similar to the one that
Szymon has described, which you can get from
480
00:41:00,190 --> 00:41:04,052
simulations and really characterize your
circuit for how often any single event
481
00:41:04,052 --> 00:41:09,026
effects will appear in the final
application if there are any remaining
482
00:41:09,026 --> 00:41:12,827
effects left, if you have left something
unprotected. The problem here is that
483
00:41:12,827 --> 00:41:18,190
these particle accelerators typically just
bombard your circuit with like thousands
484
00:41:18,190 --> 00:41:23,310
of particles per second and they hit
basically the whole area in a random
485
00:41:23,310 --> 00:41:26,940
fashion. So you don't really have a way of
steering those or measuring the position
486
00:41:26,940 --> 00:41:30,964
of these particles. So typically you are a
bit in the dark and have to really
487
00:41:30,964 --> 00:41:34,884
carefully know the behavior of your
circuit and all the quirks it has even
488
00:41:34,884 --> 00:41:39,481
without the radiation to instantly notice
when something has gone wrong. And
489
00:41:39,481 --> 00:41:44,088
this is typically not very easy
and you can kind of compare it with having
490
00:41:44,088 --> 00:41:47,372
some weird crash somewhere in your
software stack and then having to
491
00:41:47,372 --> 00:41:51,800
first take a look and see what actually
has happened. Typically
492
00:41:51,800 --> 00:41:57,058
you find something that has not been
properly protected and you see some weird
493
00:41:57,058 --> 00:42:01,847
effect on your circuit and then you try to
get a better idea of where that problem
494
00:42:01,847 --> 00:42:06,256
actually is located. And the answer for
these types of problems involving position
495
00:42:06,256 --> 00:42:11,381
is, of course, always lasers. So we have
two types of laser experiments available
496
00:42:11,381 --> 00:42:15,796
that can be used to more selectively probe
your circuit for these problems. The first
497
00:42:15,796 --> 00:42:19,691
one being the single photon absorption
laser. And this one is relatively
498
00:42:19,691 --> 00:42:24,709
simple in terms of setup. You just use a
single laser beam that shoots straight up
499
00:42:24,709 --> 00:42:29,884
at your circuit from the back. And while
it does that, it deposits energy all along
500
00:42:29,884 --> 00:42:34,180
the silicon and also in the diffusions of
your transistors and is therefore also
501
00:42:34,180 --> 00:42:38,388
able to inject energy there, potentially
upsetting a bit of memory or exposing
502
00:42:38,388 --> 00:42:43,053
whatever other single event effects you
have. And of course, you can steer this
503
00:42:43,053 --> 00:42:46,880
beam across the surface of your chip or
whatever circuit you are testing and then
504
00:42:46,880 --> 00:42:51,330
find the sensitive location. The problem
here is that the amount of energy that is
505
00:42:51,330 --> 00:42:55,238
deposited is really large due to the fact
that it has to go through the whole
506
00:42:55,238 --> 00:42:59,053
silicon until it reaches the transistor.
And therefore it's mostly used to find
507
00:42:59,053 --> 00:43:02,582
these destructive effects that really
break something in your circuit. The more
508
00:43:02,582 --> 00:43:07,972
clever and somehow beautiful experiment is
the two photon absorption laser experiment
509
00:43:07,972 --> 00:43:12,624
in which you use two laser beams of a
different wavelength. And these actually
510
00:43:12,624 --> 00:43:18,366
do not have enough energy to cause any
effect in your silicon if only one of the
511
00:43:18,366 --> 00:43:22,174
laser beams is present; only in the
small location where the two beams
512
00:43:22,174 --> 00:43:26,874
intersect, the energy is actually large
enough to produce the effect. And this
513
00:43:26,874 --> 00:43:30,664
allows you to very selectively and only on
a very small volume induce charge and
514
00:43:30,664 --> 00:43:37,818
cause an effect in your circuit. And when
you do that now, you can systematically
515
00:43:37,818 --> 00:43:41,964
scan both the X and Y directions across
your chip and also the Z direction and can
516
00:43:41,964 --> 00:43:46,366
really measure the volume of sensitive
area. And this is what you would typically
517
00:43:46,366 --> 00:43:50,804
get out of such an experiment. So in black and
white in the back, you'll see an infrared
518
00:43:50,804 --> 00:43:54,621
image of your chip where you can really
make out the individual, say structural
519
00:43:54,621 --> 00:43:59,406
components. And then overlaid in blue, you
can basically highlight all the sensitive
520
00:43:59,406 --> 00:44:03,897
points that made you measure something you
didn't expect, some weird bit flip in a
521
00:44:03,897 --> 00:44:08,338
register or something. And you can really
then go to your layout software and find
522
00:44:08,338 --> 00:44:13,644
what is the register or the gate in
your netlist that is responsible for
523
00:44:13,644 --> 00:44:17,465
this. And then it's more like operating a
debugger in a software environment.
524
00:44:17,465 --> 00:44:22,889
Tracing back from there what the line of
code responsible for this bug is. And
525
00:44:22,889 --> 00:44:31,260
to close out, it is always best to learn
from mistakes. And we offer our mistakes
526
00:44:31,260 --> 00:44:35,901
as a guideline in case you ever feel
the need yourself to design radiation
527
00:44:35,901 --> 00:44:40,695
tolerant circuits. So we want to present
two or three small issues we had in
528
00:44:40,695 --> 00:44:45,300
circuits where we were convinced it should
have been working fine. So the first one
529
00:44:45,300 --> 00:44:50,018
you will probably recognize: it is the
full triple modular redundancy scheme that
530
00:44:50,018 --> 00:44:55,279
Szymon has presented. So we made sure to
triplicate everything and we were relatively
531
00:44:55,279 --> 00:44:59,102
sure that everything should be fine. The
only modification we did is that to all
532
00:44:59,102 --> 00:45:03,506
those registers in our design, we've added
a reset, because we wanted to initialize
533
00:45:03,506 --> 00:45:07,710
the system to some known state when we
started up, which is a very obvious thing
534
00:45:07,710 --> 00:45:12,327
to do. Every CPU has a reset. But of
course, what we didn't think about here
535
00:45:12,327 --> 00:45:16,577
was that at some point there's a buffer
driving this reset line somewhere. And if
536
00:45:16,577 --> 00:45:20,355
there's only a single buffer, what happens
if this buffer experiences a small
537
00:45:20,355 --> 00:45:24,501
transient event? Of course, the obvious
thing that happened is that as soon as
538
00:45:24,501 --> 00:45:28,247
that happened, all the registers were
upset at the same time and were basically
539
00:45:28,247 --> 00:45:32,205
cleared and all our fancy protection was
invalidated. So next time we decided,
540
00:45:32,205 --> 00:45:37,679
let's be smarter this time. And of course,
we triplicate all the logic and all the
541
00:45:37,679 --> 00:45:40,633
voters and all the registers. So let's
also triplicate the reset lines. And while
542
00:45:40,633 --> 00:45:44,955
the designer of that block probably had
very good intentions, it turned out
543
00:45:44,955 --> 00:45:49,268
that later, when we manufactured the
chip, it still sometimes showed a complete
544
00:45:49,268 --> 00:45:54,570
reset without any good explanation for
that. And what was left out of the
545
00:45:54,570 --> 00:45:59,981
scope of thinking here was that this reset
actually was connected to the system reset
546
00:45:59,981 --> 00:46:05,033
of the chip that we had. And typically
pins on the chip are something that is
547
00:46:05,033 --> 00:46:09,005
not available in huge quantities. So you
typically don't want to spend three pins
548
00:46:09,005 --> 00:46:13,128
of your chip just for a stupid reset that
you don't use ninety nine percent of the
549
00:46:13,128 --> 00:46:17,895
time. So what we did at some point we just
connected again the reset lines to a
550
00:46:17,895 --> 00:46:21,972
single input buffer. That was then
connected to a pin of the chip. And of
551
00:46:21,972 --> 00:46:25,590
course, this also represented a small
sensitive area in the chip. And again,
552
00:46:25,590 --> 00:46:30,216
a single upset here was able to destroy
all three of our flip flops. All right.
553
00:46:30,216 --> 00:46:35,132
And the last lesson I'm bringing
goes back to the
554
00:46:35,132 --> 00:46:38,930
implementation details that Szymon has
mentioned. So this time, a really simple
555
00:46:38,930 --> 00:46:42,532
circuit. We were absolutely convinced it
must work because it was basically the
556
00:46:42,532 --> 00:46:46,072
textbook example that Szymon was
presenting. And the code was so
557
00:46:46,072 --> 00:46:49,817
small we were able to inspect everything
and were very much sure that nothing
558
00:46:49,817 --> 00:46:54,690
should have happened. And what we saw when
we went for this laser testing experiment,
559
00:46:54,690 --> 00:46:59,769
in simplified form, is basically that
only this first voter was sensitive. And when this was
560
00:46:59,769 --> 00:47:04,414
hit, all our registers were always
upset while the other ones were
561
00:47:04,414 --> 00:47:09,161
never seen to show anything strange.
And it took us quite a while to actually
562
00:47:09,161 --> 00:47:13,563
look at the layout later on and figure out
that what was in the chip was rather this.
563
00:47:13,563 --> 00:47:17,250
So two of the voters were actually not
there. And Szymon mentioned the reason for
564
00:47:17,250 --> 00:47:21,208
that. So synthesis tools these days are
really clever at identifying redundant
565
00:47:21,208 --> 00:47:26,102
logic and because we forgot to tell it to
not optimize these redundant pieces of
566
00:47:26,102 --> 00:47:30,248
logic, which the voters really are. It
just merged them into one. And that
567
00:47:30,248 --> 00:47:34,393
explains why we only saw this one voter
being the sensitive one. And of course, if
568
00:47:34,393 --> 00:47:38,255
you have a transient event there, then you
suddenly upset all your registers and that
569
00:47:38,255 --> 00:47:41,871
without even knowing it and with being
sure, having looked at every single line
570
00:47:41,871 --> 00:47:45,652
of Verilog code and being very sure,
everything should have been fine. But that
571
00:47:45,652 --> 00:47:51,805
seems to be how this business goes. So we
hope you
572
00:47:51,805 --> 00:47:56,648
were able to get some insight into what
we do to make sure the experiments at the
573
00:47:56,648 --> 00:48:01,966
LHC work fine, and what you can do to
make sure the satellite you are working on
574
00:48:01,966 --> 00:48:06,393
might be working OK even before launching
it into space. If you're interested in
575
00:48:06,393 --> 00:48:10,715
some more information on this topic, feel
free to pass by at the assembly I
576
00:48:10,715 --> 00:48:15,014
mentioned at the beginning or just meet us
after the talk and otherwise thank you
577
00:48:15,014 --> 00:48:22,286
very much.
Applause
578
00:48:22,286 --> 00:48:27,041
Herald: Thank you very much indeed.
There's about 10 minutes left for Q and A,
579
00:48:27,041 --> 00:48:31,872
so if you have any questions go to a
microphone. And as a cautious reminder,
580
00:48:31,872 --> 00:48:38,297
questions are short sentences that
start with a question. Well, end with a
581
00:48:38,297 --> 00:48:42,548
question mark and the first question goes
to the Internet.
582
00:48:42,548 --> 00:48:46,433
Internet: Well, hello. Um, do you also
incorporate radiation as the source for
583
00:48:46,433 --> 00:48:50,596
randomness when that's needed?
Stefan: So we personally don't. So in our
584
00:48:50,596 --> 00:48:56,880
designs we don't. But it is done indeed
for a random number generator. This is
585
00:48:56,880 --> 00:49:01,081
sometimes done that they use radioactive
decay as a source for randomness. So this
586
00:49:01,081 --> 00:49:03,989
is done, but we don't do it in our
experiments.
587
00:49:03,989 --> 00:49:06,802
We rather want deterministic data out of
the things we build.
588
00:49:06,802 --> 00:49:10,929
Herald: Okay. Next question goes to
microphone number four.
589
00:49:10,929 --> 00:49:16,714
Mic 4: Do you do your triplication before
or after elaboration?
590
00:49:16,714 --> 00:49:21,003
Szymon: So currently we do it before
elaboration. So we decided that our tool
591
00:49:21,003 --> 00:49:25,764
works on Verilog input and it produces
Verilog output because it offers much more
592
00:49:25,764 --> 00:49:30,496
flexibility in how you can
incorporate different triplication
593
00:49:30,496 --> 00:49:34,423
schemes. If you were to apply it only
after elaboration, then of course doing a
594
00:49:34,423 --> 00:49:38,453
full triplication might be easy. But then
having really precise control
595
00:49:38,453 --> 00:49:43,438
over the types of triplication on different
levels is much more difficult.
596
00:49:43,438 --> 00:49:47,296
Herald: Next question from microphone
number two.
597
00:49:47,296 --> 00:49:50,840
Mic 2: Is it possible to use DC-DC
converters or switch mode power supplies
598
00:49:50,840 --> 00:49:54,630
within the radiation environment to power
your logic? Or do you use only linear power?
599
00:49:54,630 --> 00:49:59,866
Szymon: Yes, alternatively we also have a
dedicated program which develops radiation
600
00:49:59,866 --> 00:50:05,366
hardened DC-DC converters which operate
in our environments. So they are available
601
00:50:05,366 --> 00:50:10,988
also for space applications, as far as I'm
aware. And they are hardened against total
602
00:50:10,988 --> 00:50:16,027
ionizing dose as well as single event
upsets.
603
00:50:16,027 --> 00:50:19,667
Herald: Okay next question goes to
microphone number one.
604
00:50:19,667 --> 00:50:22,614
Mic 1: Thank you very much for the great
talk. I'm just wondering, would it be
605
00:50:22,614 --> 00:50:27,435
possible to hook up every logic gate and
every voter in a kind of mesh network? And
606
00:50:27,435 --> 00:50:31,873
what are the pitfalls and limitations for
that?
607
00:50:31,873 --> 00:50:36,734
Stefan: So that is not something I'm aware
of being done. So typically: no. I
608
00:50:36,734 --> 00:50:41,473
wouldn't say that that's something we
would do.
609
00:50:41,473 --> 00:50:43,431
Szymon: I'm not really sure if I
understood the question.
610
00:50:43,431 --> 00:50:46,401
Stefan: So maybe you can rephrase what
your idea is?
611
00:50:46,401 --> 00:50:52,613
Mic 1: On the last slide, there was a
lesson learned.
612
00:50:52,613 --> 00:50:56,253
Stefan: Yes. One of those?
Mic 1: In here. Yeah. Would you be able to
613
00:50:56,253 --> 00:51:00,309
connect everything interchangeably in a
mesh network?
614
00:51:00,309 --> 00:51:04,030
Szymon: So what you are probably asking
about is whether we can build our own
615
00:51:04,030 --> 00:51:08,166
FPGA, like a programmable logic device.
Mic 1: Probably.
616
00:51:08,166 --> 00:51:11,074
Szymon: Yeah. And so this we typically
don't do, because in our experiments, our
617
00:51:11,074 --> 00:51:15,857
power budget is also very limited, so we
cannot really afford this level of
618
00:51:15,857 --> 00:51:20,903
complexity. So of course you can make your
FPGA design radiation hard, but this is
619
00:51:20,903 --> 00:51:24,890
not what we will typically do in our
experiments.
620
00:51:24,890 --> 00:51:28,630
Herald: Next question goes to microphone
number two.
621
00:51:28,630 --> 00:51:32,059
Mic 2: Hi, I would like to ask if the
orientation of your transistors and your
622
00:51:32,059 --> 00:51:38,029
chip is part of your design. So mostly you
have something like a bounding box around
623
00:51:38,029 --> 00:51:42,921
your design and with an attack surface in
different sizes. So do you use this
624
00:51:42,921 --> 00:51:48,350
orientation to minimize the attack surface
of the radiation on chips, if you know
625
00:51:48,350 --> 00:51:52,616
the source of the radiation?
Szymon: No. So I don't think we do that.
626
00:51:52,616 --> 00:51:58,515
So, of course, we control our orientation
of transistors during the design phase.
627
00:51:58,515 --> 00:52:02,651
But usually in our experiment, the
radiation is really perpendicular to the
628
00:52:02,651 --> 00:52:07,981
chip area, which means that if you rotate
it by 90 degrees, you don't really gain
629
00:52:07,981 --> 00:52:12,082
that much. And moreover, our chips,
usually they are mounted in a bigger
630
00:52:12,082 --> 00:52:16,625
system where we don't control how they are
oriented.
631
00:52:16,625 --> 00:52:24,420
Herald: Again, microphone number two.
Mic 2: Do you take metastability into
632
00:52:24,420 --> 00:52:33,140
account when designing voters?
Szymon: The voter itself is combinatorial.
633
00:52:33,140 --> 00:52:38,820
So ... -
Mic 2: Yeah, but if the state of the rest
634
00:52:38,820 --> 00:52:45,300
can change at any time, then the
voters can have like glitches, yeah?
635
00:52:45,300 --> 00:52:51,140
Szymon: Correct. So that's why - so to
avoid this, we don't take it into account
636
00:52:51,140 --> 00:52:55,060
during the design phase. But if we use
that scheme which is just displayed here,
637
00:52:55,060 --> 00:52:58,980
we avoid this problem altogether, right?
Because even if you have metastability in
638
00:52:58,980 --> 00:53:05,300
one of the blocks like A, B or C, then it
will be fixed in the next clock cycle.
639
00:53:05,300 --> 00:53:09,940
Because usually our systems operate on
clocks with low frequencies, hundreds of
640
00:53:09,940 --> 00:53:13,236
megahertz, which means that any
metastability should be resolved by the next
641
00:53:13,236 --> 00:53:15,065
clock cycle.
Mic 2: Thank you.
642
00:53:15,065 --> 00:53:19,145
Herald: Next question microphone number
one.
643
00:53:19,145 --> 00:53:23,014
Mic 1: How do you handle the register
duplication that can be performed by a
644
00:53:23,014 --> 00:53:27,947
synthesis and place and route? So the tools
will try to optimize timing sometimes by
645
00:53:27,947 --> 00:53:32,375
adding registers. And these registers are
not triplicated.
646
00:53:32,375 --> 00:53:35,784
Stefan: Yes. So what we do is that I mean,
in a typical, let's say, standard ASIC
647
00:53:35,784 --> 00:53:40,405
design flow, this is not what happens. So
you have to actually instruct the tool to do
648
00:53:40,405 --> 00:53:44,585
that, to do retiming and add additional
registers. But for what we are doing, we
649
00:53:44,585 --> 00:53:48,174
have to - let's say not do this
optimization and instruct the tool to keep
650
00:53:48,174 --> 00:53:52,823
all the registers we described in our RTL
code to keep them until the very end. And
651
00:53:52,823 --> 00:53:56,908
we really also constrain them to always
keep their associated logic triplicated.
652
00:53:56,908 --> 00:54:01,759
Herald: The next question is from the
internet.
653
00:54:01,759 --> 00:54:07,887
Internet: Do you have some simple tips for
improving radiation tolerance?
654
00:54:07,887 --> 00:54:12,020
Stefan: Simple tips? Ahhhm...
Szymon: Put your electronics inside a
655
00:54:12,020 --> 00:54:12,820
box.
Stefan: Yes.
656
00:54:12,820 --> 00:54:17,380
some laughter
There's just no
657
00:54:17,380 --> 00:54:22,980
single one size fits all textbook recipe
for this as it really always comes down to
658
00:54:22,980 --> 00:54:28,020
analyzing your environment, really getting
an awareness first of what rate and what
659
00:54:28,020 --> 00:54:31,940
number of events you are looking at, what
type of particles cause them, and then
660
00:54:31,940 --> 00:54:36,420
take the appropriate measures to mitigate
them. So there is no one size fits all
661
00:54:36,420 --> 00:54:38,095
thing, I'd say.
Herald: Next question goes to microphone
662
00:54:38,095 --> 00:54:41,620
number two.
Mic 2: Hi. Thanks for the talk. How much
663
00:54:41,620 --> 00:54:47,611
of the software you use for design is
actually open source? I only know of super
664
00:54:47,611 --> 00:54:54,495
expensive chip design software.
Stefan: You're right, the core of all the
665
00:54:54,495 --> 00:55:00,604
implementation tools, like the synthesis
and place and route stage for the ASICs
666
00:55:00,604 --> 00:55:04,987
that we design, is actually commercial
closed source tools. And if
667
00:55:04,987 --> 00:55:10,443
you're asking for the fraction, that's a
bit hard to answer. I cannot give a
668
00:55:10,443 --> 00:55:14,518
statement about the size of the commercial
closed tools. But with
669
00:55:14,518 --> 00:55:18,638
everything we develop, we try to make it
available to the widest possible audience
670
00:55:18,638 --> 00:55:22,353
and therefore decided to make the
extensions to this design flow available
671
00:55:22,353 --> 00:55:26,237
in public form. And that's why these
tools that we develop and share among the
672
00:55:26,237 --> 00:55:30,541
community of ASIC designers in this
environment are open source.
673
00:55:30,541 --> 00:55:35,196
Herald: Microphone number four.
Mic 4: Have you ever tried using steered
674
00:55:35,196 --> 00:55:41,098
ion beams for more localized radiation
ingress testing?
675
00:55:41,098 --> 00:55:44,495
Stefan: Yes, indeed! And the picture I
showed actually, uh, I didn't mention
676
00:55:44,495 --> 00:55:49,311
that, but the facility you saw here is
actually a facility in Darmstadt in
677
00:55:49,311 --> 00:55:53,366
Germany and is actually a micro beam
facility. So it's a facility that allows
678
00:55:53,366 --> 00:55:58,400
steering a heavy ion beam really on a
single position with less than a
679
00:55:58,400 --> 00:56:01,808
micrometer accuracy. So it provides
probably exactly what you were asking for.
680
00:56:01,808 --> 00:56:05,854
But that's not the typical case. That is
really a special thing. And it's probably
681
00:56:05,854 --> 00:56:09,405
also the only facility in Europe that can
do that.
682
00:56:09,405 --> 00:56:13,316
Herald: Microphone number one.
Mic 1: It was a very good talk. Thank
683
00:56:13,316 --> 00:56:19,282
you very much. My question is, did you
compare what you did to what is done for
684
00:56:19,282 --> 00:56:25,380
securing secret chips? You know, when you
have credit card chips, you can make fault
685
00:56:25,380 --> 00:56:29,949
attacks into them so you can make them
malfunction and extract the cryptographic
686
00:56:29,949 --> 00:56:33,830
key for example from the banking card.
There are techniques here to harden these
687
00:56:33,830 --> 00:56:38,207
chips against fault attacks, which are
like voluntary faults, while you have like
688
00:56:38,207 --> 00:56:43,121
random faults due to, like, involuntary
events. You know? Can you explain if
689
00:56:43,121 --> 00:56:47,294
you compared in a way what you did to
this?
690
00:56:47,294 --> 00:56:50,861
Stefan: Um, so no, we didn't explicitly
compare it, but it is right that the
691
00:56:50,861 --> 00:56:54,427
techniques we present can also be used in
a variety of different contexts. So one
692
00:56:54,427 --> 00:56:59,134
thing that's not exactly what you are
referring to, but relatively on a similar
693
00:56:59,134 --> 00:57:03,513
scale is that currently in very small
technologies you get problems with the
694
00:57:03,513 --> 00:57:07,855
reliability and yield of the manufacturing
process itself, meaning that sometimes
695
00:57:07,855 --> 00:57:11,721
just the metal interconnection between two
gates in your circuit might be broken
696
00:57:11,721 --> 00:57:16,297
after manufacturing, and then adding this
sort of redundancy with the same kinds of
697
00:57:16,297 --> 00:57:20,576
techniques can be used to
produce more working chips out of a
698
00:57:20,576 --> 00:57:24,715
manufacturing run. So in this sort of
context, these sorts of techniques are
699
00:57:24,715 --> 00:57:30,674
used very often these days. But, um,
I'm pretty sure they can be applied to
700
00:57:30,674 --> 00:57:34,953
these sorts of, uh, security fault attack
scenarios as well.
701
00:57:34,953 --> 00:57:39,703
Herald: Next question from microphone
number two.
702
00:57:39,703 --> 00:57:44,126
Mic 2: Hi, you briefly also mentioned the
mitigation techniques on the cell level
703
00:57:44,126 --> 00:57:52,426
and yesterday there was a very nice talk
from the LibreSilicon people and they
704
00:57:52,426 --> 00:57:55,914
are trying to build a standard cell
library, uh, open source standard cell
705
00:57:55,914 --> 00:58:00,015
library. So are you in contact with them
or maybe you could help them to improve
706
00:58:00,015 --> 00:58:03,980
their design and the radiation
hardness?
707
00:58:03,980 --> 00:58:07,430
Stefan: No. We also saw the talk
yesterday, but we are not yet in
708
00:58:07,430 --> 00:58:14,180
contact with them. No.
Herald: Does the Internet have questions?
709
00:58:14,180 --> 00:58:21,380
Internet: Yes, I do. Um, two in fact.
First one would be: would TTL or other BJT
710
00:58:21,380 --> 00:58:26,740
based logic be more resistant?
Szymon: Uh, yeah. So depending on which
711
00:58:26,740 --> 00:58:31,126
type of errors we are considering. So BJT
transistors, they have ...
712
00:58:31,126 --> 00:58:35,917
Stefan in his part mentioned that
displacement damage is not a problem for
713
00:58:35,917 --> 00:58:40,305
CMOS devices, but it is not the case
for BJT devices. So when they are exposed
714
00:58:40,305 --> 00:58:47,074
to high energy hadrons or protons,
they degrade a lot. So that's why we don't
715
00:58:47,074 --> 00:58:52,393
really use them in our environment. They
could be probably much more robust to
716
00:58:52,393 --> 00:58:57,369
single event effects because their
resistance everywhere is much lower. But
717
00:58:57,369 --> 00:59:01,633
they would have other problems. And also
another problem which is worth
718
00:59:01,633 --> 00:59:06,204
mentioning is that for those devices, they
consume much, much, much more power, which
719
00:59:06,204 --> 00:59:13,041
we cannot afford in our applications.
Internet: And the last one would be how do
720
00:59:13,041 --> 00:59:19,396
I use the output of the full TMR setup? Is
it still three signals? How do I know
721
00:59:19,396 --> 00:59:26,260
which one to use and to trust?
Stefan: Um, yes. So with this, um,
722
00:59:26,260 --> 00:59:30,047
architecture, what you could either do is
really do the full triplication scheme
723
00:59:30,047 --> 00:59:34,804
to your whole logic tree basically and
really triplicate everything or, and
724
00:59:34,804 --> 00:59:38,903
that's going in the direction of one of
the lessons learned I had, at some point
725
00:59:38,903 --> 00:59:43,261
of course you have an interface to your
chip, so you have pins left and right that
726
00:59:43,261 --> 00:59:46,630
are inputs and outputs. And then you have
to decide either you want to spend the
727
00:59:46,630 --> 00:59:51,025
effort and also have three dedicated input
pins for each of the signals, or you at
728
00:59:51,025 --> 00:59:54,260
some point have the voter and say, okay.
At this point, all these signals are
729
00:59:54,260 --> 00:59:58,202
combined. But I was able to reduce the
amount of sensitive area in my chip
730
00:59:58,202 --> 01:00:03,780
significantly and can live with the very
small remaining sensitive area that just
731
01:00:03,780 --> 01:00:07,460
the input and output pins provide.
Szymon: So maybe I will add one more thing
732
01:00:07,460 --> 01:00:11,780
is that typically in our systems, of
course we triplicate our logic internally,
733
01:00:11,780 --> 01:00:15,300
but when we interface with the external
world, we can apply another protection
734
01:00:15,300 --> 01:00:20,340
mechanism. So for example, for our high
speed serialisers, we will use different types
735
01:00:20,340 --> 01:00:23,733
of encoding to add protect...,
to add like forward error correction
736
01:00:23,733 --> 01:00:30,340
codes which would allow us to recover from these
types of faults in the backend later on.
737
01:00:30,340 --> 01:00:36,522
Herald: Okay. If ...if we keep this very,
very short. Last question goes to
738
01:00:36,522 --> 01:00:41,401
microphone number two.
Mic 2: I don't know much about physics. So
739
01:00:41,401 --> 01:00:47,370
just the question, how important is the
physical testing after the chip is
740
01:00:47,370 --> 01:00:51,895
manufactured? Isn't the simulation, the
computer simulation enough if you just
741
01:00:51,895 --> 01:00:56,332
shoot particles at it?
Stefan: Yes and no. So in principle, of
742
01:00:56,332 --> 01:01:01,267
course, you are right that you should be
able to simulate all the effects we look
743
01:01:01,267 --> 01:01:06,531
at. The problem is that as the designs
grow big and they do grow bigger as the
744
01:01:06,531 --> 01:01:10,892
technologies shrink, so
this final netlist that you end up with
745
01:01:10,892 --> 01:01:15,175
can have millions or billions of nodes and
it just is not feasible anymore to
746
01:01:15,175 --> 01:01:19,558
simulate it exhaustively because you have
so many dimensions you have to
747
01:01:19,558 --> 01:01:25,852
change when you inject, for example, bit
flips or transients in your design in any
748
01:01:25,852 --> 01:01:30,745
of those nodes for varying time offsets.
And the state space the circuit
749
01:01:30,745 --> 01:01:34,553
can be in is just too huge to capture in a
full simulation. So it's not possible
750
01:01:34,553 --> 01:01:38,803
to exhaustively test it in simulation. And
so typically you end up with having missed
751
01:01:38,803 --> 01:01:43,048
something that you discover only in the
physical testing afterwards, which you
752
01:01:43,048 --> 01:01:47,311
always want to do before you put your, uh,
your chip into the final experiment or on your
753
01:01:47,311 --> 01:01:50,934
satellite and then realise it's not
working as intended. So it has a big
754
01:01:50,934 --> 01:01:55,540
importance as well.
Herald: Okay. Thank you. Time is up. All
755
01:01:55,540 --> 01:01:58,584
right. Thank you all very much.
756
01:01:58,584 --> 01:02:04,602
applause
757
01:02:04,602 --> 01:02:09,599
36c3 postroll music
758
01:02:09,599 --> 01:02:32,100
Subtitles created by c3subtitles.de
in the year 2021. Join, and help us!