1 00:00:00,000 --> 00:00:18,406 36C3 intro music 2 00:00:18,406 --> 00:00:22,640 Herald: The next talk will be titled 'How to Design Highly Reliable Digital 3 00:00:22,640 --> 00:00:26,472 Electronics', and it will be delivered to you by Szymon and Stefan. Warm applause 4 00:00:26,472 --> 00:00:30,199 for them. 5 00:00:30,199 --> 00:00:36,360 applause 6 00:00:36,360 --> 00:00:41,360 Stefan: All right. Good morning, Congress. So perhaps every one of you in the room 7 00:00:41,360 --> 00:00:45,600 here has at one point or another in their lives witnessed their computer behaving 8 00:00:45,600 --> 00:00:50,320 weirdly and doing things that it was not supposed to do or what you didn't 9 00:00:50,320 --> 00:00:54,400 anticipate it to do. And well, typically that would have probably been the result 10 00:00:54,400 --> 00:01:00,000 of a software bug of some sort somewhere inside the huge software stack your PC is 11 00:01:00,000 --> 00:01:04,720 running on. Have you ever considered what the probability of this weird behavior 12 00:01:04,720 --> 00:01:09,120 being caused by a bit flipped somewhere in the memory of your computer might have 13 00:01:09,120 --> 00:01:16,240 been? So what you can see in this video on the screen now is a physics experiment 14 00:01:16,240 --> 00:01:20,720 called a cloud chamber. It's a very simple experiment that is actually able to 15 00:01:20,720 --> 00:01:26,560 visualize and make apparent all the constant stream of background radiation we 16 00:01:26,560 --> 00:01:32,640 all are constantly exposed to. So what's happening here is that highly energetic 17 00:01:32,640 --> 00:01:39,040 particles, for example, from space they trace through gaseous alcohol and they 18 00:01:39,040 --> 00:01:42,160 collide with alcohol molecules and they form in this process a trail of 19 00:01:42,160 --> 00:01:48,240 condensation while they do that. 
And if you think about your computer, a typical 20 00:01:48,240 --> 00:01:53,200 cell of RAM, of which you might have, I don't know, 4, 8, 10 gigabytes in your 21 00:01:53,200 --> 00:01:58,400 machine is only 80 nanometers wide. So it's very, very tiny. And you 22 00:01:58,400 --> 00:02:02,560 probably are able to appreciate the small amount of energy that is needed or that is 23 00:02:02,560 --> 00:02:08,480 used to store the information inside each of those bits. And the sheer amount of 24 00:02:08,480 --> 00:02:12,560 those bits you have in your RAM and your computer. So a couple of years ago, there 25 00:02:12,560 --> 00:02:17,600 was a study that concluded that in a computer with about four gigabytes of RAM, 26 00:02:17,600 --> 00:02:23,600 a bit flip caused by such an event by cosmic background radiation can occur 27 00:02:23,600 --> 00:02:29,200 about once every 33 hours. So a bit less than one per day. In an 28 00:02:29,200 --> 00:02:34,960 incident in 2008, a Qantas Airlines flight actually nearly crashed, and the 29 00:02:34,960 --> 00:02:40,080 reason for this crash was traced back to be very likely caused by a bit flipped 30 00:02:40,080 --> 00:02:44,400 somewhere in one of the CPUs of the avionics system and nearly caused the 31 00:02:44,400 --> 00:02:50,480 death of a lot of passengers on this plane. In 2003, in Belgium, a small 32 00:02:50,480 --> 00:02:56,880 municipal vote actually had a weird hiccup in which one of the candidates in this 33 00:02:56,880 --> 00:03:02,153 election actually got 4096 more votes added in a single instance. 34 00:03:02,153 --> 00:03:06,480 And that was traced back to be very likely caused by cosmic background radiation, 35 00:03:06,480 --> 00:03:10,000 flipping a memory cell somewhere that stored the vote count. 
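The 4096 figure is suggestive because 4096 is 2^12: inverting the single bit at position 12 of a stored counter changes it by exactly that amount. A minimal Python sketch of that arithmetic, together with the 33-hour figure quoted above, could look as follows. The vote count of 514 and the helper names are hypothetical and do not come from the talk.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with one bit inverted, as a single-event upset would."""
    return value ^ (1 << bit)


def flips_per_day(hours_per_flip: float = 33.0) -> float:
    """Average number of bit flips per day for the quoted 33-hour interval."""
    return 24.0 / hours_per_flip


true_votes = 514                          # hypothetical count; its bit 12 is clear
corrupted_votes = flip_bit(true_votes, 12)
print(corrupted_votes - true_votes)       # 4096
print(round(flips_per_day(), 2))          # 0.73
```

Flipping the same bit again restores the original value, which is why such an error is invisible unless the result looks implausible or the data is protected by some redundancy.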
And it was only 36 00:03:10,000 --> 00:03:14,560 discovered that this happened because this number of votes for this particular 37 00:03:14,560 --> 00:03:18,880 candidate was considered unreasonable, but otherwise would have gotten away probably 38 00:03:18,880 --> 00:03:27,360 without being detected. So a few words about us: Szymon and I, we both work at 39 00:03:27,360 --> 00:03:32,480 CERN in the microelectronics section and we both develop electronics that need to 40 00:03:32,480 --> 00:03:37,360 be tolerant to these sorts of effects. So we develop radiation tolerant electronics 41 00:03:37,360 --> 00:03:42,846 for the experiments at CERN, at the LHC. Among a lot of other applications, you can 42 00:03:42,846 --> 00:03:48,330 meet the two of us at the Lötlabor Jena assembly if you are interested in what we 43 00:03:48,330 --> 00:03:55,847 are talking about today. And we will also give a small talk or a small workshop 44 00:03:55,847 --> 00:03:59,190 about radiation detection tomorrow, in one of the seminar rooms. So feel free to pass 45 00:03:59,190 --> 00:04:02,544 by there, it will be a quick introduction. To give you a small idea of what kind of 46 00:04:02,544 --> 00:04:08,541 environment we are working for: So if you would use one of your default intel i7 47 00:04:08,541 --> 00:04:14,294 CPUs from your notebook and would put it anywhere where we operate our electronics, 48 00:04:14,294 --> 00:04:19,632 it would very shortly die in a matter of probably one or two minutes and it would 49 00:04:19,632 --> 00:04:24,626 die for more than just one reason, which is rather interesting and compelling. So 50 00:04:24,626 --> 00:04:30,985 the idea for today's talk is to give you all an insight into all the things that 51 00:04:30,985 --> 00:04:34,575 need to be taken into account when you design electronics for radiation 52 00:04:34,575 --> 00:04:39,152 environments. What kinds of different challenges come when you try to do that. 
53 00:04:39,152 --> 00:04:43,116 We classify and explain the different types of radiation effects that exist. And 54 00:04:43,116 --> 00:04:47,617 then we also present what you can do to mitigate these effects and also validate 55 00:04:47,617 --> 00:04:52,116 that what you did to care for them or protect your circuits actually worked. And 56 00:04:52,116 --> 00:04:57,477 of course, as we do that, we'll try to give our view on how we develop radiation 57 00:04:57,477 --> 00:05:03,257 tolerant electronics at CERN and how our workflow looks like to make sure this 58 00:05:03,257 --> 00:05:08,272 works. So let's first maybe take a step back and have a look at what we mean when 59 00:05:08,272 --> 00:05:12,997 we say radiation environments. The first one that you probably have in mind right 60 00:05:12,997 --> 00:05:19,044 now when you think about radiation is space. So, this interstellar space is 61 00:05:19,044 --> 00:05:24,292 basically filled with, very high speed, highly energetic electrons and protons and 62 00:05:24,292 --> 00:05:28,716 all sorts of high energy particles. And while they, for example, traverse close to 63 00:05:28,716 --> 00:05:34,513 planets as our Earth - these planets sometimes do have a magnetic field and the 64 00:05:34,513 --> 00:05:39,317 highly energetic particles are actually deflected by these magnetic fields and 65 00:05:39,317 --> 00:05:43,824 they can protect the planets as our planet, for example, from this highly 66 00:05:43,824 --> 00:05:47,986 energetic radiation. But in the process, there around these planets sometimes they 67 00:05:47,986 --> 00:05:52,107 form these radiation belts - known as the Van Allen belts after the guy who 68 00:05:52,107 --> 00:05:56,043 discovered this effect a long time ago. And a satellite in space as it orbits 69 00:05:56,043 --> 00:06:01,620 around the Earth might, depending on what orbit is chosen, sometimes go through 70 00:06:01,620 --> 00:06:05,647 these belts of highly intense radiation. 
That, of course, then needs to be taken 71 00:06:05,647 --> 00:06:11,552 into account when designing electronics for such a satellite. And if Earth itself 72 00:06:11,552 --> 00:06:17,191 is not able to give you enough radiation, you may think of the very famous Juno 73 00:06:17,191 --> 00:06:22,874 Jupiter mission that has become famous about a year ago. They actually in the 74 00:06:22,874 --> 00:06:28,288 environment of Jupiter they anticipated so much radiation that they actually decided 75 00:06:28,288 --> 00:06:33,408 to put all the electronics of the satellite inside a one centimeter thick 76 00:06:33,408 --> 00:06:39,831 cube of titanium, which is famously known as the Juno Radiation Vault. But not only 77 00:06:39,831 --> 00:06:43,870 space offers radiation environments. Another form of radiation you probably all 78 00:06:43,870 --> 00:06:48,292 recognize this when I show you this picture, which is an X-ray image of a 79 00:06:48,292 --> 00:06:54,936 hand. And X-ray is also considered a form of radiation. And while, of course, the 80 00:06:54,936 --> 00:07:01,320 doses or amounts of radiation any patient is exposed to while doing diagnosis or 81 00:07:01,320 --> 00:07:05,801 treatment of some disease, that might not be the full story when it comes to medical 82 00:07:05,801 --> 00:07:10,220 applications. So this is a medical particle accelerator which is used for 83 00:07:10,220 --> 00:07:15,288 cancer treatment. And in these sorts of accelerators, typically carbon ions or 84 00:07:15,288 --> 00:07:20,389 protons are accelerated and then focused and used to treat and selectively destroy 85 00:07:20,389 --> 00:07:25,302 cancer cells in the body. And this comes already relatively close to the 86 00:07:25,302 --> 00:07:29,695 environment we are working in and working for. 
So Szymon and I are working, for 87 00:07:29,695 --> 00:07:36,616 example, on electronics for the CMS detector inside the LHC, for which we build 88 00:07:36,616 --> 00:07:43,906 dedicated, radiation tolerant, integrated circuits which have to withstand very, 89 00:07:43,906 --> 00:07:49,373 very large amounts and doses of short lived radiation in order to function 90 00:07:49,373 --> 00:07:54,414 correctly. And if we didn't specifically design electronics for that, basically the 91 00:07:54,414 --> 00:08:01,893 whole system would never be able to work. To illustrate a bit how you can imagine 92 00:08:01,893 --> 00:08:06,062 the scale of this environment: This is a single plot of a collision event that was 93 00:08:06,062 --> 00:08:11,161 recorded in the ATLAS experiment. And each of those tiny little traces you can make 94 00:08:11,161 --> 00:08:15,997 out in this diagram is actually either one or multiple secondary particles that were 95 00:08:15,997 --> 00:08:22,166 created in the initial collision of two proton bunches inside the experiment. And 96 00:08:22,166 --> 00:08:27,501 each of those, of course, races through the detector electronics, which make these 97 00:08:27,501 --> 00:08:32,817 traces visible, itself then decaying into multiple other secondary particles which 98 00:08:32,817 --> 00:08:37,856 all go through our electronics. And if that doesn't sound, let's say, bad enough 99 00:08:37,856 --> 00:08:42,576 for digital electronics, these collisions happen about 40 million times a second, of 100 00:08:42,576 --> 00:08:47,608 course multiplying the number of events or problems they can cause in our 101 00:08:47,608 --> 00:08:54,608 circuits. So we now want to introduce all the things that can happen, the different 102 00:08:54,608 --> 00:08:59,570 radiation effects. But first, probably we take a step back and look at what we mean 103 00:08:59,570 --> 00:09:05,805 when we say digital electronics or digital logic, which we want to focus on today. 
So 104 00:09:05,805 --> 00:09:11,058 from your university lectures or your reading, you probably know the first class 105 00:09:11,058 --> 00:09:14,577 of digital logic, which is the combinatorial logic. So this is typically 106 00:09:14,577 --> 00:09:19,222 logic that just does a simple linear relation of the inputs of a circuit and 107 00:09:19,222 --> 00:09:23,956 produces an output as exemplified with these AND and OR, NAND, XOR gates that you 108 00:09:23,956 --> 00:09:28,829 see here. But if you want to build - I mean even though we use those everywhere 109 00:09:28,829 --> 00:09:32,775 in our circuits - you probably also want to store state in a more complex circuit, 110 00:09:32,775 --> 00:09:37,857 for example, in the registers of your CPU they store some sort of internal 111 00:09:37,857 --> 00:09:41,736 information. And for that we use the other class of logic, which is called the 112 00:09:41,736 --> 00:09:44,726 sequential logic. So this is typically clocked with some system clock frequency 113 00:09:44,726 --> 00:09:50,883 and it changes its output with relation to the inputs whenever this clock signal changes. 114 00:09:50,883 --> 00:09:54,263 And now if we look at how all these different logic functionalities are 115 00:09:54,263 --> 00:09:58,292 implemented. So typically nowadays for that you may know that we use CMOS 116 00:09:58,292 --> 00:10:02,340 technologies and basically represent all this logic functionality as digital gates 117 00:10:02,340 --> 00:10:10,558 using small P-MOS and N-MOS MOSFET transistors in CMOS technologies. And if 118 00:10:10,558 --> 00:10:16,408 we kind of try to build a model for more complex digital circuits, we typically use 119 00:10:16,408 --> 00:10:21,814 something we call the finite state machine model, in which we use a model that 120 00:10:21,814 --> 00:10:25,822 consists of a combinatorial and a sequential part. 
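The finite state machine model just introduced can be sketched in a few lines of Python: a pure function plays the role of the combinatorial logic, and a stored value that only changes on a clock tick plays the role of the register. This is a minimal illustration, not the speakers' code. The class name, the `clock` method, and the 2-bit counter used as an example are all assumptions made for this sketch.

```python
from typing import Callable, Tuple

# Combinatorial part: (current_state, input) -> (next_state, output)
Logic = Callable[[int, int], Tuple[int, int]]


class FiniteStateMachine:
    """Register (sequential part) wrapped around a pure logic function."""

    def __init__(self, logic: Logic, initial_state: int = 0) -> None:
        self.logic = logic
        self.state = initial_state      # contents of the state register

    def clock(self, inp: int) -> int:
        """One clock edge: compute output and next state, then latch the state."""
        next_state, output = self.logic(self.state, inp)
        self.state = next_state
        return output


def counter_logic(state: int, enable: int) -> Tuple[int, int]:
    """Hypothetical example: 2-bit counter with a carry output on wrap-around."""
    next_state = (state + enable) & 0b11
    carry = 1 if (state == 0b11 and enable) else 0
    return next_state, carry


fsm = FiniteStateMachine(counter_logic)
print([fsm.clock(1) for _ in range(4)])  # [0, 0, 0, 1]
print(fsm.state)                          # 0
```

In this picture, a radiation-induced error can either disturb the result of `logic` for a short moment or directly corrupt `state`, which is the distinction the rest of the talk builds on.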
And you can see that the 121 00:10:25,822 --> 00:10:31,031 output of the circuit depends both on the internal state inside the register as well 122 00:10:31,031 --> 00:10:35,331 as also the input to the combinatorial logic. And accordingly, also the state 123 00:10:35,331 --> 00:10:40,924 that is internal is always changed by the inputs as well as the current state. So 124 00:10:40,924 --> 00:10:44,604 this is kind of the simple model for more complex systems that can be used to model 125 00:10:44,604 --> 00:10:50,214 different effects. Um, now let's try to actually look at what the radiation can do 126 00:10:50,214 --> 00:10:53,948 to transistors. And for that we are going to have a quick recap at what the 127 00:10:53,948 --> 00:10:57,895 transistor actually is and how it looks like. As you may perhaps know is that in 128 00:10:57,895 --> 00:11:03,736 CMOS technologies, transistors are built on wafers of high purity silicon. So this 129 00:11:03,736 --> 00:11:09,074 is a crystalline, very regularly organized lattice of silicon atoms. And what we do 130 00:11:09,074 --> 00:11:14,092 to form a transistor on such a wafer is that we add dopants. So in order to form 131 00:11:14,092 --> 00:11:19,629 diffusion regions, which later will become the source and drain of our transistors. 132 00:11:19,629 --> 00:11:24,474 And then on top of that we grow a layer of insulating oxide. And on top of that we 133 00:11:24,474 --> 00:11:28,713 put polysilicon, which forms the gate terminal of the transistor. And in the end 134 00:11:28,713 --> 00:11:32,813 we end up with an equivalent circuit a bit like that. And now to put things back into 135 00:11:32,813 --> 00:11:37,670 perspective - you may also note that the dimension of these structures are very 136 00:11:37,670 --> 00:11:42,543 tiny. So we talk about tens of nanometers for some of the dimensions I've outlined 137 00:11:42,543 --> 00:11:47,958 here. 
And as the technologies shrink, these become smaller and smaller and 138 00:11:47,958 --> 00:11:52,284 therefore you'll probably also realize or are able to appreciate the small amount of 139 00:11:52,284 --> 00:11:56,560 energy that are used to store information inside these digital circuits, which makes 140 00:11:56,560 --> 00:12:02,390 them perhaps more sensitive to radiation. So let's take a look. What different types 141 00:12:02,390 --> 00:12:08,385 of radiation effects exist? We typically in this case, differentiate them into two 142 00:12:08,385 --> 00:12:13,268 main classes of events. The first one would be the cumulative effects, which are 143 00:12:13,268 --> 00:12:17,362 effects that, as the name implies, accumulate over time. So as the circuit is 144 00:12:17,362 --> 00:12:22,127 placed inside some radiation environment, over time it accumulates more and more 145 00:12:22,127 --> 00:12:26,969 dose and therefore worsens its performance or changes how it operates. And on the 146 00:12:26,969 --> 00:12:30,549 other side, we have the Single Event Effects, which are always events that 147 00:12:30,549 --> 00:12:35,075 happen at some instantaneous point in time, and then suddenly, without being 148 00:12:35,075 --> 00:12:39,316 predictable, change how the circuit operates or how it functions or if it 149 00:12:39,316 --> 00:12:43,931 works in the first place or not. So I'm going to first go into the class of 150 00:12:43,931 --> 00:12:47,685 cumulative effects and then later on, Szymon will go into the other class of the 151 00:12:47,685 --> 00:12:53,173 Single Event Effects. So in terms of these accumulating effects, we basically have 152 00:12:53,173 --> 00:12:57,580 two main subclasses: The first one being ionization or TID effects, for Total 153 00:12:57,580 --> 00:13:02,033 Ionizing Dose - and the second one being displacement damages. So displacement 154 00:13:02,033 --> 00:13:07,137 damages do exactly what they sound like. 
It is all the effects that happen when an 155 00:13:07,137 --> 00:13:11,249 atom in the silicon lattice is actually displaced, so removed from its lattice 156 00:13:11,249 --> 00:13:15,266 position and actually changes the structure of the semiconductor. But 157 00:13:15,266 --> 00:13:19,548 luckily, these effects don't have a big impact in the CMOS digital circuits that 158 00:13:19,548 --> 00:13:23,164 we are looking at today. So we will disregard them for the moment and we'll be 159 00:13:23,164 --> 00:13:28,120 looking more at the ionization damage, or TID. So ionization - as a quick recap - is 160 00:13:28,120 --> 00:13:35,901 whenever electrons are removed or added to an atom, effectively transforming it into 161 00:13:35,901 --> 00:13:42,747 an ion. And these effects are especially critical for the circuits we are building 162 00:13:42,747 --> 00:13:46,316 because of what they do is that they change the behavior of the transistors. 163 00:13:46,316 --> 00:13:50,233 And without looking too much into the semiconductor details, I just want to show 164 00:13:50,233 --> 00:13:55,730 their typical effect that we are concerned about in this very simple circuit here. So 165 00:13:55,730 --> 00:14:00,348 this is just an inverter circuit consisting of two transistors here and 166 00:14:00,348 --> 00:14:05,812 there. And what the circuit does in normal operation is it just takes an input signal 167 00:14:05,812 --> 00:14:10,062 and inverts and basically gives the inverted signal at the output. And as the 168 00:14:10,062 --> 00:14:15,549 transistors are irradiated and accumulate dose, you can see that the edges of the 169 00:14:15,549 --> 00:14:20,391 output signal get slower. So the transistor takes longer to turn on and off. 170 00:14:20,391 --> 00:14:24,574 And what that does in turn is that it limits the maximum operation frequency of 171 00:14:24,574 --> 00:14:28,795 your circuit. And of course, that is not something you want to do. 
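To make the link between slower edges and maximum operating frequency concrete, here is a rough Python sketch. It assumes a deliberately simplified model that is not from the talk: a single critical path whose delay grows by a fixed fraction as dose accumulates, and a maximum frequency equal to the inverse of that delay. All numbers are hypothetical.

```python
def max_frequency_hz(critical_path_delay_s: float) -> float:
    """Simplified model: the slowest path alone sets the maximum clock rate."""
    return 1.0 / critical_path_delay_s


def meets_timing(fresh_delay_s: float, slowdown: float, clock_hz: float) -> bool:
    """True if the circuit still runs at clock_hz after the given slowdown.

    slowdown is the fractional delay increase caused by accumulated dose,
    e.g. 0.2 means every path has become 20 % slower.
    """
    degraded_delay_s = fresh_delay_s * (1.0 + slowdown)
    return max_frequency_hz(degraded_delay_s) >= clock_hz


FRESH_DELAY_S = 20e-9   # hypothetical: 50 MHz maximum when new
CLOCK_HZ = 40e6         # hypothetical system clock

for slowdown in (0.0, 0.2, 0.3):
    print(slowdown, meets_timing(FRESH_DELAY_S, slowdown, CLOCK_HZ))
```

With these made-up values the circuit still meets timing at a 20 % slowdown but no longer at 30 %, so the margin between the fresh maximum frequency and the system clock determines how much dose the design can absorb.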
You want your 172 00:14:28,795 --> 00:14:31,723 circuit to operate at some frequency in your final system. And if the maximum 173 00:14:31,723 --> 00:14:35,600 frequency it can work at degrades over time, at some point it will fail as the 174 00:14:35,600 --> 00:14:39,276 maximum frequency is just too low. So let's have a look at what we can do to 175 00:14:39,276 --> 00:14:44,395 mitigate these effects. The first one and I already mentioned it when talking about 176 00:14:44,395 --> 00:14:48,488 the Juno mission, is shielding. So if you can actually put a box around your 177 00:14:48,488 --> 00:14:52,586 electronics and shield any radiation from actually hitting your transistors, it is 178 00:14:52,586 --> 00:14:56,900 obvious that they will last longer and will suffer less from the radiation damage 179 00:14:56,900 --> 00:15:01,241 that it would otherwise do. So this approach is very often used in space 180 00:15:01,241 --> 00:15:04,988 applications like on satellites, but it's not very useful if you are actually trying 181 00:15:04,988 --> 00:15:08,209 to measure the radiation with your circuits as we do, for example, in the 182 00:15:08,209 --> 00:15:12,415 particle accelerators we build integrated circuits for. So there first of all, we 183 00:15:12,415 --> 00:15:16,344 want to measure the radiation so we cannot shield our detectors from the radiation. 184 00:15:16,344 --> 00:15:20,592 And also, we don't want to influence the tracks of these secondary collision 185 00:15:20,592 --> 00:15:24,162 products with any shielding material that would be in the way. So this is not very 186 00:15:24,162 --> 00:15:28,315 useful in a particle accelerator environment, let's say. So we have to 187 00:15:28,315 --> 00:15:33,880 resort to different methods. So as I said, we do have to design our own integrated 188 00:15:33,880 --> 00:15:38,826 circuits in the first place. So we have some freedom in what we call transistor 189 00:15:38,826 --> 00:15:45,236 level design. 
So we can actually alter the dimensions of the transistors. We can make 190 00:15:45,236 --> 00:15:50,055 them larger to withstand larger doses of radiation and we can use special 191 00:15:50,055 --> 00:15:54,354 techniques in terms of layout that we can experimentally verify to be more 192 00:15:54,354 --> 00:15:59,266 resistant to radiation effects. And the third measure, which is probably the most 193 00:15:59,266 --> 00:16:03,491 important one for us, is what we call modeling. So we actually are able to 194 00:16:03,491 --> 00:16:08,358 characterize all the effects that radiation will have on a transistor. And 195 00:16:08,358 --> 00:16:12,442 if we can do that, if we know: 'If I put it into a radiation environment for a 196 00:16:12,442 --> 00:16:17,000 year, how much slower will it become?' Then it is of course easy to say: 'OK, I 197 00:16:17,000 --> 00:16:20,648 can just over-design my circuit and make it a bit simpler, maybe have less functionality, 198 00:16:20,648 --> 00:16:24,464 but be able to operate at a higher frequency and therefore withstand 199 00:16:24,464 --> 00:16:30,240 the radiation effects for a longer time while still working sufficiently well at 200 00:16:30,240 --> 00:16:35,118 the end of its expected lifetime.' So that's more or less what we can do about 201 00:16:35,118 --> 00:16:38,254 these effects. And I'll hand over to Szymon for the second class. 202 00:16:38,254 --> 00:16:42,655 Szymon: Contrary to the cumulative effects presented by Stefan, the other group are 203 00:16:42,655 --> 00:16:46,424 Single Event Effects which are caused by high energy deposits, which are caused by 204 00:16:46,424 --> 00:16:52,143 a single particle or shower of particles. And they can happen at any time, even 205 00:16:52,143 --> 00:16:57,089 seconds after irradiation is started. It means that if your circuit is vulnerable 206 00:16:57,089 --> 00:17:01,667 to this class of effects, it can fail immediately after radiation is present. 
207 00:17:01,667 --> 00:17:06,313 And here we also classify these effects into several groups. The first are hard, 208 00:17:06,313 --> 00:17:11,450 or permanent, errors, which as the name indicates can permanently destroy your 209 00:17:11,450 --> 00:17:20,260 circuit. And this type of errors are typically critical for power devices where 210 00:17:20,260 --> 00:17:24,340 you have large power densities and they are not so much of a problem for digital 211 00:17:24,340 --> 00:17:30,100 circuits. In the other class of effects are soft errors. And here we distinguish 212 00:17:30,100 --> 00:17:34,100 transient, or Single Event Transient errors, which are spurious signals 213 00:17:34,100 --> 00:17:41,220 propagating in your circuit as a result of a gate being hit by a particle and they 214 00:17:41,220 --> 00:17:45,700 are especially problematic for analog circuits or asynchronous digital circuits, 215 00:17:45,700 --> 00:17:51,460 but under some circumstances they can be also problematic for synchronous systems. 216 00:17:51,460 --> 00:17:56,420 And the other class of problems are static, or Single Event Upset problems, 217 00:17:56,420 --> 00:18:01,220 which basically means that your memory element like a register gets flipped. And 218 00:18:01,220 --> 00:18:05,060 then of course, if your system is not designed to handle this type of errors 219 00:18:05,060 --> 00:18:09,620 properly, it can lead to a failure. So in the following part of the presentation 220 00:18:09,620 --> 00:18:15,300 we'll focus mostly on soft errors. So let's try to understand what is the origin 221 00:18:15,300 --> 00:18:20,820 of this type of problem. So as Stefan mentioned, the typical transistor is built 222 00:18:20,820 --> 00:18:25,230 out of diffusions, gate and channel. So here you can see one diffusion. Let's 223 00:18:25,230 --> 00:18:29,230 assume that it is a drain diffusion. 
And then when a particle goes through and 224 00:18:29,230 --> 00:18:36,700 deposits charge, it creates free electron and hole pairs, which then in the presence of 225 00:18:36,700 --> 00:18:43,320 electric fields, they get collected by means of drift, which results in a large 226 00:18:43,320 --> 00:18:46,930 current spike, which is very short. And then the rest of the charge could be 227 00:18:46,930 --> 00:18:50,940 collected by diffusion which is a much slower process and therefore also the 228 00:18:50,940 --> 00:18:56,390 amplitude of the event is much, much smaller. So let's try to understand what 229 00:18:56,390 --> 00:19:01,230 could happen in a typical memory cell. So on this schematic, you can see the 230 00:19:01,230 --> 00:19:05,740 simplest memory cell, which is composed of two back-to-back inverters. And let's 231 00:19:05,740 --> 00:19:12,810 assume that node A is at high and node B is at low potential initially. And then we 232 00:19:12,810 --> 00:19:17,210 have a particle hitting the drain of transistor M1 which creates a short 233 00:19:17,210 --> 00:19:22,590 circuit current between drain and ground, bringing the drain of transistor M1 to low 234 00:19:22,590 --> 00:19:29,871 potential, which also acts on the gates of second inverter, temporarily changing its 235 00:19:29,871 --> 00:19:38,734 state from low to high, which reinforces the wrong state in the first inverter. And 236 00:19:38,734 --> 00:19:45,340 at this time the error is locked in your memory cell and you basically lost your 237 00:19:45,340 --> 00:19:49,652 information. So you may be asking yourself: 'How much charge is needed 238 00:19:49,652 --> 00:19:54,281 really to flip a state of a memory cell?'. And you can get this number from either 239 00:19:54,281 --> 00:19:59,952 simulations or from measurements. So let's assume that what we could do, we could try 240 00:19:59,952 --> 00:20:04,605 to inject some current into the sensitive node, for example, drain of transistor M1. 
241 00:20:04,605 --> 00:20:08,790 And here what I will show is that on the top plot you will have current as a function 242 00:20:08,790 --> 00:20:13,484 of time. On the second plot you will have output voltage. So voltage at node B as a 243 00:20:13,484 --> 00:20:19,121 function of time and at the lowest plot you will see a probability of having a bit 244 00:20:19,121 --> 00:20:23,097 flip. So if you inject very little current, of course nothing changes at the 245 00:20:23,097 --> 00:20:27,670 output, but once you start increasing the amount of current you are injecting, you 246 00:20:27,670 --> 00:20:33,306 see that something appears at the output and at some point the output will toggle, 247 00:20:33,306 --> 00:20:39,747 so it will switch to the other state. And at this point, if you really calculate 248 00:20:39,747 --> 00:20:46,369 what is the area under the current curve you can find what is the critical charge 249 00:20:46,369 --> 00:20:53,499 needed to flip the memory cell. And if you go further, if you start injecting even 250 00:20:53,499 --> 00:21:00,701 more current, you will not see that much difference in the output voltage waveform. 251 00:21:00,701 --> 00:21:05,112 It could become only slightly faster. And at this point, you also can notice that 252 00:21:05,112 --> 00:21:09,528 the probability now jumped to one, which means that any time you inject so much 253 00:21:09,528 --> 00:21:17,431 current there is a fault in your circuit. So for now, we just found what is the 254 00:21:17,431 --> 00:21:23,414 probability of having a bit-flip from 0 to 1 in node B. Of course we should also 255 00:21:23,414 --> 00:21:27,904 calculate the same for the other direction, so from 1 to zero. And usually 256 00:21:27,904 --> 00:21:32,377 it is slightly different. And then of course we should inject in all the other 257 00:21:32,377 --> 00:21:37,817 nodes, for example node B and also should study all possible transitions. 
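The "area under the current curve" step can be written down directly. The sketch below integrates a sampled current pulse with the trapezoidal rule and then picks the smallest charge in a sweep that flipped the cell. It is only an illustration: the function names, the triangular pulse, and the sweep format (pairs of injected charge and a flipped flag, as they might come out of a circuit simulation) are assumptions.

```python
from typing import Optional, Sequence, Tuple


def injected_charge(times_s: Sequence[float], currents_a: Sequence[float]) -> float:
    """Area under the current curve (trapezoidal rule), in coulombs."""
    charge = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        charge += 0.5 * (currents_a[i] + currents_a[i - 1]) * dt
    return charge


def critical_charge(sweep: Sequence[Tuple[float, bool]]) -> Optional[float]:
    """Smallest injected charge in the sweep that flipped the cell, or None."""
    flipping = [charge for charge, flipped in sweep if flipped]
    return min(flipping) if flipping else None


# Hypothetical triangular pulse: 0 -> 100 uA -> 0 within 50 ps
times = [0.0, 10e-12, 50e-12]
currents = [0.0, 100e-6, 0.0]
print(injected_charge(times, currents))  # about 2.5e-15 C, i.e. 2.5 fC

# Hypothetical sweep results: (injected charge in C, did the cell flip?)
sweep = [(1e-15, False), (2e-15, True), (3e-15, True)]
print(critical_charge(sweep))
```

The resolution of such an estimate is limited by the step size of the sweep, and it has to be repeated per node and per transition direction, as described above.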
And then 258 00:21:37,817 --> 00:21:43,492 at the end, if you calculate the superposition of these effects and you 259 00:21:43,492 --> 00:21:48,655 multiply them by the active area of each node, you will end up with what we call 260 00:21:48,655 --> 00:21:52,420 the cross section, which has a dimension of centimeters squared, which will tell 261 00:21:52,420 --> 00:21:57,357 you how sensitive your circuit is to this type of effect. And then knowing the 262 00:21:57,357 --> 00:22:03,761 radiation profile of your environment, you can calculate the expected upset rate in 263 00:22:03,761 --> 00:22:10,105 the final application. So now, having covered the basics of the single event 264 00:22:10,105 --> 00:22:16,517 effects, let's try to check how we can mitigate them. And here also technology 265 00:22:16,517 --> 00:22:20,875 plays a significant role. So of course, newer technologies offer us much smaller 266 00:22:20,875 --> 00:22:26,692 devices. And together with that, what follows is that usually supply voltages 267 00:22:26,692 --> 00:22:31,047 are getting smaller and smaller as well as the node capacitance, which means that for 268 00:22:31,047 --> 00:22:35,565 our Single Event Upsets it is very bad because the critical charge which is 269 00:22:35,565 --> 00:22:40,207 required to flip our bit is getting less and less. But at the same 270 00:22:40,207 --> 00:22:44,135 time, the physical dimensions of our transistors are getting smaller, which 271 00:22:44,135 --> 00:22:48,097 means that the cross section for them being hit is also getting smaller. So 272 00:22:48,097 --> 00:22:52,495 overall, the effects really depend on the circuit topology and the radiation 273 00:22:52,495 --> 00:22:59,181 environment. So another protection method could be introduced on the cell level. And 274 00:22:59,181 --> 00:23:04,914 here we could imagine increasing the critical charge. 
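As a short aside on the cross section mentioned above: since it has units of area, multiplying it by a particle flux (particles per square centimeter per second) gives an upset rate. The Python sketch below assumes a single, energy-independent cross section per cell, which is a simplification, and every number in it is hypothetical.

```python
def upset_rate_per_s(cross_section_cm2: float,
                     flux_per_cm2_s: float,
                     n_cells: int = 1) -> float:
    """Expected upsets per second: sigma [cm^2] * flux [1/(cm^2 s)] * cells."""
    return cross_section_cm2 * flux_per_cm2_s * n_cells


def mean_time_between_upsets_s(rate_per_s: float) -> float:
    """Average time between upsets for a constant rate."""
    return 1.0 / rate_per_s


SIGMA_CM2 = 1e-14      # hypothetical cross section per memory cell
FLUX = 1e6             # hypothetical particle flux, particles / (cm^2 * s)
CELLS = 1_000_000      # hypothetical number of memory cells

rate = upset_rate_per_s(SIGMA_CM2, FLUX, CELLS)
print(rate)                              # about 0.01 upsets per second
print(mean_time_between_upsets_s(rate))  # about 100 seconds
```

In practice the cross section depends on the particle type and energy, so a real estimate would combine a measured cross-section curve with the energy spectrum of the environment rather than use one number.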
And that could be done in 275 00:23:04,914 --> 00:23:10,819 the easiest way by just increasing the node capacitance by, for example, using 276 00:23:10,819 --> 00:23:16,096 larger transistors. But of course, this also enlarges the collection electrode, 277 00:23:16,096 --> 00:23:22,657 which is not nice. And another way could be to just increase the capacitance by adding 278 00:23:22,657 --> 00:23:28,336 some extra metal capacitance, but it, of course, slows down the circuit. Another 279 00:23:28,336 --> 00:23:33,615 approach could be to try to store the information on more than two nodes. So I 280 00:23:33,615 --> 00:23:38,377 showed you that on a simple SRAM cell we store information only on two nodes, so 281 00:23:38,377 --> 00:23:43,102 you could try to come up with some other cells, for example, like that one in which 282 00:23:43,102 --> 00:23:47,406 the information is stored on four nodes. So you can see that the architecture is 283 00:23:47,406 --> 00:23:53,800 very similar to the basic SRAM cell. But you should always take care to very 284 00:23:53,800 --> 00:23:59,000 carefully simulate your design, because if we analyze this circuit, you will quickly 285 00:23:59,000 --> 00:24:02,936 realize that this circuit, even though the information is stored in four different 286 00:24:02,936 --> 00:24:09,867 nodes, the same type of loop exists as in the basic circuit. Meaning that at the end 287 00:24:09,867 --> 00:24:15,227 the circuit offers basically no hardening with respect to the previous cell. So 288 00:24:15,227 --> 00:24:21,074 actually we can do it better. So here you can see a typical dual interlocked cell (DICE). 289 00:24:21,074 --> 00:24:26,445 So the number of transistors is exactly the same as in the previous example, but 290 00:24:26,445 --> 00:24:30,819 now they are interconnected slightly differently. And here you can see that 291 00:24:30,819 --> 00:24:36,262 this cell has also two stable configurations. 
But this time the low 292 00:24:36,262 --> 00:24:40,587 level from a given node can propagate only to the left hand side, while the high 293 00:24:40,587 --> 00:24:47,872 level can propagate to the right hand side. And each stage being inverting means 294 00:24:47,872 --> 00:24:54,918 that a fault cannot propagate for more than one node. Of course, this cell has 295 00:24:54,918 --> 00:25:00,379 some drawbacks: It consumes more area than a simple SRAM cell, and a write access 296 00:25:00,379 --> 00:25:04,240 requires accessing at least two nodes at the same time to really change the state 297 00:25:04,240 --> 00:25:09,801 of the cell. And so you may ask yourself, how effective is this cell? So here I will 298 00:25:09,801 --> 00:25:13,709 show you a cross section plot. So it is the probability of having an error as a 299 00:25:13,709 --> 00:25:18,883 function of injected energy. And as a reference, you can see a pink curve at the 300 00:25:18,883 --> 00:25:25,650 top, which is for a normal, unprotected cell. And in green you can see the 301 00:25:25,650 --> 00:25:31,399 cross section for errors in the DICE cell. So as you can see, it is one order 302 00:25:31,399 --> 00:25:36,934 of magnitude better than the normal cell. But still, the cross section is far from 303 00:25:36,934 --> 00:25:41,426 being negligible. It was identified that the 304 00:25:41,426 --> 00:25:45,679 problem was caused by the fact that some sensitive nodes were very close together 305 00:25:45,679 --> 00:25:50,807 in the layout and therefore they could be upset by the same particle. Because, as we 306 00:25:50,807 --> 00:25:54,721 mentioned, single devices are very small. We are talking about dimensions 307 00:25:54,721 --> 00:25:59,675 below a micron.
So after realizing that, we designed another cell in which we 308 00:25:59,675 --> 00:26:04,799 separated the sensitive nodes further and we ended up with the blue curve, and as you 309 00:26:04,799 --> 00:26:08,907 can see, the cross section was reduced by two more orders of magnitude and the 310 00:26:08,907 --> 00:26:14,205 threshold was increased significantly. So if you don't want to redesign your 311 00:26:14,205 --> 00:26:18,771 standard cells, you could also apply some mitigation techniques at block level. So 312 00:26:18,771 --> 00:26:24,717 here we can use some encoding to encode our state better. And as an example, I 313 00:26:24,717 --> 00:26:31,540 will show you a typical Hamming code. So to protect four bits, we have to add three 314 00:26:31,540 --> 00:26:38,052 additional parity bits, which are calculated according to this formula. And then once 315 00:26:38,052 --> 00:26:44,133 you calculate the parity bits, you can use those to check the integrity of your 316 00:26:44,133 --> 00:26:50,360 internal state. And if any of the parity checks is not equal to zero, then the check bits 317 00:26:50,360 --> 00:26:55,375 form the syndrome, indicating where the error happened. And 318 00:26:55,375 --> 00:26:59,916 you can use this information to correct the error. Of course, in this case, the 319 00:26:59,916 --> 00:27:06,533 efficiency is not really nice because we need three additional bits to protect only 320 00:27:06,533 --> 00:27:11,828 four bits of information. But as the state length increases, the protection also becomes 321 00:27:11,828 --> 00:27:18,855 more efficient. Another approach would be to do even less. Meaning that instead of 322 00:27:18,855 --> 00:27:23,970 changing anything in your design, you can just triplicate your design or 323 00:27:23,970 --> 00:27:30,190 multiply it many times and just vote on which state is correct.
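The Hamming scheme described above, four data bits protected by three parity bits, can be sketched in a few lines of Python. The formula shown on the slide is not part of this transcript, so the sketch assumes the textbook Hamming(7,4) layout with parity bits at positions 1, 2 and 4; the function names are hypothetical.

```python
# Minimal sketch of textbook Hamming(7,4); the bit layout is an assumption,
# not necessarily the one shown on the speaker's slide.

def hamming74_encode(data):
    """Encode 4 data bits into the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, syndrome).

    A syndrome of 0 means no error was detected; otherwise it is the
    1-based position of the single flipped bit, which is flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return c, syndrome


def hamming74_data(codeword):
    """Extract the 4 data bits from a codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


state = [1, 0, 1, 1]
stored = hamming74_encode(state)

upset = list(stored)
upset[4] ^= 1  # emulate a single event upset in one stored bit

repaired, syndrome = hamming74_correct(upset)
print("syndrome:", syndrome, "recovered state:", hamming74_data(repaired))
```

Any single flipped bit, including a flipped parity bit, is located by the syndrome and corrected. Two simultaneous flips are not handled correctly by this code, which is one reason the physical separation of bits still matters.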
So this concept is 324 00:27:30,190 --> 00:27:35,046 called triple modular redundancy and it is based around a voter cell. So it is a 325 00:27:35,046 --> 00:27:40,210 cell which has an odd number of inputs and whose output is always equal to the 326 00:27:40,210 --> 00:27:45,040 majority of its inputs. And as I mentioned, the idea is that you have, for 327 00:27:45,040 --> 00:27:49,292 example, three circuits: A, B and C, and during normal operation, when they are 328 00:27:49,292 --> 00:27:54,471 identical, the output is also the same. However, when there is a problem, for 329 00:27:54,471 --> 00:28:00,957 example, in logic part B, the output of B is affected. But this problem is 330 00:28:00,957 --> 00:28:05,509 effectively masked by the voter cell and it is not visible from outside of the 331 00:28:05,509 --> 00:28:10,383 circuit. But you have to be careful not to take this picture as a design 332 00:28:10,383 --> 00:28:15,501 template. So let's try to analyze what would happen with a state machine 333 00:28:15,501 --> 00:28:20,329 similar to what Stefan introduced, if you were to just use this concept. So here you 334 00:28:20,329 --> 00:28:24,859 can see three state machines and a voter on the output. And as we can see, 335 00:28:24,859 --> 00:28:29,484 if you have an upset in, for example, the state register A, then the state is 336 00:28:29,484 --> 00:28:36,676 broken. But still the output of the circuit, which is indicated by letter s, is 337 00:28:36,676 --> 00:28:42,355 correct because the B and C registers are still fine. But what happens if some time 338 00:28:42,355 --> 00:28:49,283 later we have an upset in memory element B or C? Then of course the state 339 00:28:49,283 --> 00:28:56,028 of our system is broken and we cannot recover it. So you can ask yourself what 340 00:28:56,028 --> 00:29:02,204 we can do better in order to avoid this situation. So, just to be sure:
Please 341 00:29:02,204 --> 00:29:06,654 do not use this technique to protect your circuits. So the easiest mitigation could 342 00:29:06,654 --> 00:29:13,201 be to use the output of the voter cell itself as the input to your logic. 343 00:29:13,201 --> 00:29:18,491 What it offers us is that now, whenever you have an upset in one of the memory 344 00:29:18,491 --> 00:29:22,933 elements, for the next computation, for the next stage, we always use the voter 345 00:29:22,933 --> 00:29:27,631 output, which ensures that the error will be removed one clock cycle later. So 346 00:29:27,631 --> 00:29:32,726 if you have another hit some time later, basically, it will not affect our state. 347 00:29:32,726 --> 00:29:39,765 Until now we have considered only upsets in our registers, but what happens if we have 348 00:29:39,765 --> 00:29:45,885 a transient in our voter? So you see that if there is no state change, basically the 349 00:29:45,885 --> 00:29:50,981 transient in the voter doesn't impact our system. But if you are really unlucky 350 00:29:50,981 --> 00:29:55,777 and the transient happens when the clock transition happens, that is, whenever we 351 00:29:55,777 --> 00:30:01,182 latch the data, we can corrupt the state in three registers at the same time, which 352 00:30:01,182 --> 00:30:05,605 is less than ideal. So to overcome this limitation, you can consider skewing the 353 00:30:05,605 --> 00:30:11,101 clocks by some time which is larger than the maximum transient duration. And now, 354 00:30:11,101 --> 00:30:18,050 because each register samples the output of the voter at a slightly different 355 00:30:18,050 --> 00:30:23,449 time, we can corrupt only one flip-flop at a time. So of course, if you are 356 00:30:23,449 --> 00:30:28,780 unlucky, we can have problematic situations in which one register is 357 00:30:28,780 --> 00:30:33,646 already in the new state. The other register is still in the old state.
And then it 358 00:30:33,646 --> 00:30:39,728 can lead to a nondeterministic result. So it is better, but still not ideal. So as a 359 00:30:39,728 --> 00:30:46,578 general theme, you have seen that we were adding more and more resources, so you 360 00:30:46,578 --> 00:30:50,418 can ask yourself what would happen if we triplicate everything. So in this case, 361 00:30:50,418 --> 00:30:54,262 we triplicate our registers, we triplicate our logic and our voters. And 362 00:30:54,262 --> 00:30:59,138 now you can see that whenever we have an upset in our register, it can only affect 363 00:30:59,138 --> 00:31:04,513 one register at a time and the error will be removed from the system one clock 364 00:31:04,513 --> 00:31:08,912 cycle later. Also, if we have an upset in the voter or in the logic, it can be 365 00:31:08,912 --> 00:31:13,372 latched only into one register, which means that in principle we have created a system 366 00:31:13,372 --> 00:31:17,885 which is really robust. Unfortunately, nothing is for free. So here I compare the 367 00:31:17,885 --> 00:31:22,823 different triplication approaches, and as you can see, the more protection 368 00:31:22,823 --> 00:31:26,326 you want to have, the more you have to pay in terms of resources, be it power or 369 00:31:26,326 --> 00:31:31,373 area. And also, usually, you pay a small penalty in terms of maximum operational 370 00:31:31,373 --> 00:31:37,597 speed. So which flavor of protection you use depends really on the 371 00:31:37,597 --> 00:31:42,420 application. So for the most sensitive circuits, you probably want to use 372 00:31:42,420 --> 00:31:48,493 full TMR and you may leave some other bits of logic unprotected.
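The difference between three independent state machines with a voter only on the output and the variant where the voter output is fed back into the logic can be illustrated with a small behavioral model in Python. This is a simplified sketch, not the speakers' circuit: the state machine is assumed to be a 4-bit counter, upsets are modeled as bit flips in one register right after a clock edge, and transients in the voter itself are not modeled. All names are hypothetical.

```python
def majority(a, b, c):
    """Bitwise majority vote of three integers."""
    return (a & b) | (a & c) | (b & c)


def next_state(state):
    """Example state machine: a 4-bit counter (assumed for illustration)."""
    return (state + 1) & 0xF


def run(cycles, upsets, feedback):
    """Simulate three register copies and return the voted output per cycle.

    upsets maps a cycle number to (register index, bit mask); the mask is
    XORed into that register right after the clock edge of that cycle.
    With feedback=True every copy computes its next state from the voter
    output; with feedback=False every copy runs on its own state.
    """
    regs = [0, 0, 0]
    outputs = []
    for cycle in range(cycles):
        if feedback:
            voted = majority(*regs)
            regs = [next_state(voted)] * 3
        else:
            regs = [next_state(r) for r in regs]
        if cycle in upsets:
            index, mask = upsets[cycle]
            regs[index] ^= mask
        outputs.append(majority(*regs))
    return outputs


golden = run(12, {}, feedback=False)
upsets = {3: (0, 0b0100), 8: (1, 0b0100)}  # hit register A, later register B

plain = run(12, upsets, feedback=False)
fixed = run(12, upsets, feedback=True)

print("golden:            ", golden)
print("voter on output:   ", plain)
print("voter fed back:    ", fixed)
```

Without feedback the first upset is masked but never repaired, so the second upset leaves two broken copies and the voted output goes wrong. With feedback each upset is overwritten on the next clock edge, so two upsets that are separated in time are both tolerated.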
So another option, if 373 00:31:48,493 --> 00:31:54,749 your system is not mission critical and you can tolerate some downtime: you can 374 00:31:54,749 --> 00:32:00,294 consider scrubbing, which means periodically checking the state of your system and refreshing it 375 00:32:00,294 --> 00:32:05,120 if an error is detected, using some parity bits or a copy of the data in 376 00:32:05,120 --> 00:32:10,394 a safe place. Or you can have a watchdog which will find out that 377 00:32:10,394 --> 00:32:13,951 something went wrong and will just reinitialize the whole system. So now, 378 00:32:13,951 --> 00:32:20,011 having covered the basics of all the effects we will have to face, we would like 379 00:32:20,011 --> 00:32:24,293 to show you the basic flow which we follow when designing our radiation hardened 380 00:32:24,293 --> 00:32:29,746 circuits. So of course we always start with specifications. So we try to 381 00:32:29,746 --> 00:32:34,228 understand the radiation environment in which the circuit is meant to operate. So 382 00:32:34,228 --> 00:32:38,750 we come up with some specifications for the total dose which could be accumulated and 383 00:32:38,750 --> 00:32:45,348 for the rate of single event upsets. And at this moment, it is also not very rare 384 00:32:45,348 --> 00:32:49,705 that we have to decide to move some functionality out of our detector volume, 385 00:32:49,705 --> 00:32:56,133 outside, where we can use off-the-shelf commercial equipment to do number 386 00:32:56,133 --> 00:33:04,820 crunching. But let's assume that we go with our ASIC. So having the 387 00:33:04,820 --> 00:33:09,220 specifications, of course we proceed with functional implementation. This we 388 00:33:09,220 --> 00:33:14,260 typically do with hardware description languages, so Verilog or VHDL, which you may 389 00:33:14,260 --> 00:33:18,900 know from a typical FPGA flow.
And of course we write a lot of simulations to 390 00:33:18,900 --> 00:33:24,205 understand whether we are meeting our functional goals and whether our circuit 391 00:33:24,205 --> 00:33:30,665 behaves as expected. And then we select some parts of the 392 00:33:30,665 --> 00:33:36,318 circuits which we want to protect from radiation effects. So, for example, we can 393 00:33:36,318 --> 00:33:42,290 decide to use triplication or some other methods. So these days we typically use 394 00:33:42,290 --> 00:33:46,645 triplication as the most straightforward and very effective method. So you can ask 395 00:33:46,645 --> 00:33:50,750 yourself how do we triplicate the logic? So the simplest could be: Just copy 396 00:33:50,750 --> 00:33:55,099 and paste the code three times, add some postfixes like A, B and C and you are 397 00:33:55,099 --> 00:34:01,653 done. But of course this solution has some drawbacks. So it is time consuming and it 398 00:34:01,653 --> 00:34:05,964 is very error prone. So maybe you have noticed that I had a typo there. So of 399 00:34:05,964 --> 00:34:10,220 course we don't want to do that. So we developed our own tool, which we called 400 00:34:10,220 --> 00:34:16,924 TMRG, which automates the process of triplication and eliminates the two main 401 00:34:16,924 --> 00:34:22,494 drawbacks which I just described. So after we have our code triplicated, and of 402 00:34:22,494 --> 00:34:27,075 course after rerunning all the simulations to make sure that everything 403 00:34:27,075 --> 00:34:34,230 went as expected, we proceed to the synthesis process, in which we convert our 404 00:34:34,230 --> 00:34:41,091 high level hardware description language code to gate level netlists, in which all the functions 405 00:34:41,091 --> 00:34:46,189 are mapped to gates, which were introduced by Stefan, so both combinatorial and 406 00:34:46,189 --> 00:34:53,631 sequential.
And here we also have to be careful because modern CAD tools have a 407 00:34:53,631 --> 00:34:59,020 tendency, of course, to optimise the logic as much as possible. And our logic in most 408 00:34:59,020 --> 00:35:03,810 of the cases is really redundant. So it is very easy for the tool to decide that it should be removed. So we 409 00:35:03,810 --> 00:35:08,632 really have to make sure that it is not removed. That's why our tool also provides 410 00:35:08,632 --> 00:35:13,633 some constraints for the synthesizer to make sure that our design intent is 411 00:35:13,633 --> 00:35:20,900 clearly understood by the tool. And once we have the output netlist, we 412 00:35:20,900 --> 00:35:26,980 proceed to the place and route process, where this kind of netlist representation is 413 00:35:26,980 --> 00:35:32,580 mapped to a layout of what will soon become our digital chip, where we place all 414 00:35:32,580 --> 00:35:36,624 the cells and we route connections between them. And here there is 415 00:35:36,624 --> 00:35:40,907 another danger which I mentioned already: in modern technologies the cells 416 00:35:40,907 --> 00:35:45,597 are so small that several of them could easily be affected by a single particle at the same 417 00:35:45,597 --> 00:35:51,892 time. So we really have to space out the cells which are responsible for 418 00:35:51,892 --> 00:35:56,982 keeping the information about the state to make sure that a single particle cannot 419 00:35:56,982 --> 00:36:04,980 upset, for example, the A and B copies of the same register. And then in the 420 00:36:04,980 --> 00:36:09,540 last step, of course, we have to verify that everything we have done is 421 00:36:09,540 --> 00:36:13,926 correct. And at this level, we also try to introduce some single event effects in our 422 00:36:13,926 --> 00:36:19,971 simulations. So we can randomly flip bits in our system. We can also inject 423 00:36:19,971 --> 00:36:26,094 transients.
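The idea of randomly flipping bits in simulation and comparing against a fault-free run can be sketched in plain Python. This is only a toy model under stated assumptions and not the speakers' tool flow: the "design" is a triplicated 4-bit counter with voter feedback plus one deliberately unprotected register called step, and the names simulate and campaign are hypothetical.

```python
import random


def majority(a, b, c):
    """Bitwise majority vote of three integers."""
    return (a & b) | (a & c) | (b & c)


def simulate(cycles, fault=None):
    """Toy design: triplicated 4-bit counter plus an unprotected 'step' register.

    fault is (cycle, register name, bit index); that single bit is flipped
    right after the clock edge of the given cycle. Returns the voted counter
    value for every cycle.
    """
    regs = {"cnt_a": 0, "cnt_b": 0, "cnt_c": 0, "step": 1}
    trace = []
    for cycle in range(cycles):
        voted = majority(regs["cnt_a"], regs["cnt_b"], regs["cnt_c"])
        nxt = (voted + regs["step"]) & 0xF
        regs["cnt_a"] = regs["cnt_b"] = regs["cnt_c"] = nxt
        if fault is not None and fault[0] == cycle:
            _, name, bit = fault
            regs[name] ^= 1 << bit
        trace.append(majority(regs["cnt_a"], regs["cnt_b"], regs["cnt_c"]))
    return trace


def campaign(runs, cycles=20, seed=1):
    """Inject one random bit flip per run and count failures per register."""
    rng = random.Random(seed)
    golden = simulate(cycles)
    names = ["cnt_a", "cnt_b", "cnt_c", "step"]
    failures = {}
    for _ in range(runs):
        fault = (rng.randrange(cycles - 1), rng.choice(names), rng.randrange(4))
        if simulate(cycles, fault) != golden:
            failures[fault[1]] = failures.get(fault[1], 0) + 1
    return failures


failures = campaign(200)
print("registers whose upsets changed the output:", failures)
```

Because every run records which register was hit, the failing runs point directly at the unprotected register, which is the kind of traceability that is much harder to get from a random bombardment of the real chip.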
And typically we used to do this fault injection on the netlist level, which works 424 00:36:26,094 --> 00:36:31,424 very well. And it is very nice. But the problem with this approach is that we can 425 00:36:31,424 --> 00:36:37,640 perform these actions only very late in the design cycle, which is less than ideal. 426 00:36:37,640 --> 00:36:43,084 And also, if we find that there is a problem in our simulation, a typical netlist 427 00:36:43,084 --> 00:36:48,437 at this level has probably a few orders of magnitude more lines than our initial RTL 428 00:36:48,437 --> 00:36:52,990 code. So tracing back which line of code is problematic is not so 429 00:36:52,990 --> 00:36:57,533 straightforward at this point. So you can ask yourself, why not try to inject 430 00:36:57,533 --> 00:37:05,458 errors in the RTL design? And the answer is that it is not so 431 00:37:05,458 --> 00:37:10,670 trivial to map the hardware description language's high level constructs to 432 00:37:10,670 --> 00:37:15,585 what will become combinatorial or sequential logic. So in order to eliminate 433 00:37:15,585 --> 00:37:20,980 this problem, we also developed another open source tool, which allows us to... 434 00:37:20,980 --> 00:37:27,860 So we decided to use Yosys, the open source synthesis tool from Clifford, which 435 00:37:27,860 --> 00:37:31,530 was presented at the Congress several years ago. So we use this tool to make a 436 00:37:31,530 --> 00:37:35,680 first pass through our RTL code to understand which elements will be mapped 437 00:37:35,680 --> 00:37:40,678 to sequential and which to combinatorial logic. And then, having this information, we use 438 00:37:40,678 --> 00:37:45,951 cocotb, a Python verification framework, which allows us programmatic 439 00:37:45,951 --> 00:37:51,838 access to these nodes, so we can effectively inject the errors in our 440 00:37:51,838 --> 00:37:56,660 simulations. And I forgot to mention that the TMRG tool is also open source.
So if 441 00:37:56,660 --> 00:38:03,841 you are interested in one of the tools, please feel free to contact us. And of 442 00:38:03,841 --> 00:38:10,505 course, after our simulation is done, then in the next step we would really tape out. And 443 00:38:10,505 --> 00:38:14,637 so we submit our chip to manufacturing and hopefully a few months later we receive 444 00:38:14,637 --> 00:38:18,105 our chip back. Stefan: All right. So after patiently 445 00:38:18,105 --> 00:38:23,546 waiting then for a couple of months while your chip is in manufacturing and you're 446 00:38:23,546 --> 00:38:28,245 spending time on preparing a test setup and preparing yourself to actually test if 447 00:38:28,245 --> 00:38:33,772 your chip works as you expect it to, now it's probably also a good time to think 448 00:38:33,772 --> 00:38:38,307 about how to actually validate or test if all the measures that you've taken to 449 00:38:38,307 --> 00:38:41,389 protect your circuit from radiation effects actually are effective or if they 450 00:38:41,389 --> 00:38:46,196 are not. And so again, we will split this in two parts. So you will probably want to 451 00:38:46,196 --> 00:38:50,024 start with testing for the total ionizing dose effects. So for the cumulative effect, 452 00:38:50,024 --> 00:38:54,554 and for that, you typically use X-ray radiation relatively similar to the one 453 00:38:54,554 --> 00:38:59,005 used in medical treatment. So this radiation has relatively low energy, 454 00:38:59,005 --> 00:39:03,344 which has the upside of not producing any single event effects, so you can really 455 00:39:03,344 --> 00:39:07,462 only accumulate radiation dose and focus on the accumulating effects. And typically 456 00:39:07,462 --> 00:39:11,600 you would use a machine that looks somewhat like this, a relatively compact 457 00:39:11,600 --> 00:39:16,840 thing.
You can have in your laboratory and you can use that to really accumulate 458 00:39:16,840 --> 00:39:21,520 large amounts of radiation dose on your circuit. And then you need some sort of 459 00:39:21,520 --> 00:39:26,641 mechanism to verify or to quantify how much your circuit slows down due to this 460 00:39:26,641 --> 00:39:31,285 radiation dose. And if you do that, you typically end up with a graphic such as 461 00:39:31,285 --> 00:39:36,567 this one, where in the x axis you have the radiation dose your circuit was exposed 462 00:39:36,567 --> 00:39:40,639 to. And on the y axis, you see that the frequency has gone down over time and you 463 00:39:40,639 --> 00:39:44,536 can use this information to say: "OK, my final application, I expect this 464 00:39:44,536 --> 00:39:49,324 level of radiation dose. I mean, I can still see that my circuit will work fine 465 00:39:49,324 --> 00:39:53,565 under some given environmental condition or some operation condition." So this is 466 00:39:53,565 --> 00:39:58,285 the test for the first class of effects. And the test for the second class of 467 00:39:58,285 --> 00:40:02,318 effects for the single event effect is a bit more involved. So there what you would 468 00:40:02,318 --> 00:40:07,157 typically start to do is go for a heavy ion test campaign. So you would go to a 469 00:40:07,157 --> 00:40:12,760 specialized, relatively rare facility. We have a couple of those in Europe and would 470 00:40:12,760 --> 00:40:16,532 look perhaps somewhat like this. So it's a small particle accelerator somewhere. 
471 00:40:16,532 --> 00:40:20,794 They typically have different types of heavy ions at their 472 00:40:20,794 --> 00:40:26,311 disposal that they can accelerate and then shoot at your chip that you can place in a 473 00:40:26,311 --> 00:40:32,390 vacuum chamber and these ions can deposit very well known amounts of energy in your 474 00:40:32,390 --> 00:40:36,818 circuit and you can use that information to characterize your circuit. The downside 475 00:40:36,818 --> 00:40:41,207 is a bit that these facilities tend to be relatively expensive to access and also a 476 00:40:41,207 --> 00:40:45,161 bit hard to access. So typically you need to book them a lot of time in advance and 477 00:40:45,161 --> 00:40:50,351 that's sometimes not very easy. But what it offers you, you can use different types 478 00:40:50,351 --> 00:40:55,244 of ions with different energies. You can really make a very well-defined 479 00:40:55,244 --> 00:41:00,190 sensitivity curve similar to the one that Szymon has described. You can get from 480 00:41:00,190 --> 00:41:04,052 simulations and really characterize your circuit for how often, any single event 481 00:41:04,052 --> 00:41:09,026 effects will appear in the final application if there is any remaining 482 00:41:09,026 --> 00:41:12,827 effects left. If you have left something unprotected. The problem here is that 483 00:41:12,827 --> 00:41:18,190 these particle accelerators typically just bombard your circuit with like thousands 484 00:41:18,190 --> 00:41:23,310 of particles per second and they hit basically the whole area in a random 485 00:41:23,310 --> 00:41:26,940 fashion. So you don't really have a way of steering those or measuring the position 486 00:41:26,940 --> 00:41:30,964 of these particles. 
So typically you are a bit in the dark and really have to 487 00:41:30,964 --> 00:41:34,884 know very well the behavior of your circuit and all the quirks it has even 488 00:41:34,884 --> 00:41:39,481 without the radiation to instantly notice when something has gone wrong. And 489 00:41:39,481 --> 00:41:44,088 this is typically not very easy and you can kind of compare it with having 490 00:41:44,088 --> 00:41:47,372 some weird crash somewhere in your software stack and then having to 491 00:41:47,372 --> 00:41:51,800 first take a look and see what actually has happened. Typically 492 00:41:51,800 --> 00:41:57,058 you find something that has not been properly protected and you see some weird 493 00:41:57,058 --> 00:42:01,847 effect on your circuit and then you try to get a better idea of where that problem 494 00:42:01,847 --> 00:42:06,256 actually is located. And the answer for these types of problems involving position 495 00:42:06,256 --> 00:42:11,381 is, of course, always lasers. So we have two types of laser experiments available 496 00:42:11,381 --> 00:42:15,796 that can be used to more selectively probe your circuit for these problems. The first 497 00:42:15,796 --> 00:42:19,691 one being the single photon absorption laser. And this one is relatively 498 00:42:19,691 --> 00:42:24,709 simple in terms of setup. You just use a single laser beam that shoots straight up 499 00:42:24,709 --> 00:42:29,884 at your circuit from the back. And while it does that, it deposits energy all along 500 00:42:29,884 --> 00:42:34,180 the silicon and also in the diffusions of your transistors and is therefore also 501 00:42:34,180 --> 00:42:38,388 able to inject energy there, potentially upsetting a bit of memory or exposing 502 00:42:38,388 --> 00:42:43,053 whatever other single event effects you have.
And of course, you can steer this 503 00:42:43,053 --> 00:42:46,880 beam across the surface of your chip or whatever circuit you are testing and then 504 00:42:46,880 --> 00:42:51,330 find the sensitive location. The problem here is that the amount of energy that is 505 00:42:51,330 --> 00:42:55,238 deposited is really large due to the fact that it has to go through the whole 506 00:42:55,238 --> 00:42:59,053 silicon until it reaches the transistor. And therefore it's mostly used to find 507 00:42:59,053 --> 00:43:02,582 these destructive effects that really break something in your circuit. The more 508 00:43:02,582 --> 00:43:07,972 clever and somehow beautiful experiment is the two photon absorption laser experiment 509 00:43:07,972 --> 00:43:12,624 in which you use two laser beams of a different wavelength. And these actually 510 00:43:12,624 --> 00:43:18,366 do not have enough energy to cause any effect in your silicon. If only one of the 511 00:43:18,366 --> 00:43:22,174 laser beams is present, but only in the small location where the two beams 512 00:43:22,174 --> 00:43:26,874 intersect, the energy is actually large enough to produce the effect. And this 513 00:43:26,874 --> 00:43:30,664 allows you to very selectively and only on a very small volume induce charge and 514 00:43:30,664 --> 00:43:37,818 cause an effect in your circuit. And when you do that now, you can systematically 515 00:43:37,818 --> 00:43:41,964 scan both the X and Y directions across your chip and also the Z direction and can 516 00:43:41,964 --> 00:43:46,366 really measure the volume of sensitive area. And this is what you would typically 517 00:43:46,366 --> 00:43:50,804 get of such an experiment. So in black and white in the back, you'll see an infrared 518 00:43:50,804 --> 00:43:54,621 image of your chip where you can really make out the individual, say structural 519 00:43:54,621 --> 00:43:59,406 components. 
And then overlaid in blue, you can basically highlight all the sensitive 520 00:43:59,406 --> 00:44:03,897 points that made you measure something you didn't expect, some weird bit flip in a 521 00:44:03,897 --> 00:44:08,338 register or something. And you can really then go to your layout software and find 522 00:44:08,338 --> 00:44:13,644 out which register or gate in your netlist is responsible for 523 00:44:13,644 --> 00:44:17,465 this. And then it's more like operating a debugger in a software environment, 524 00:44:17,465 --> 00:44:22,889 tracing back from there which line of code is responsible for this bug. And 525 00:44:22,889 --> 00:44:31,260 to close out, it is always best to learn from mistakes. And we offer our mistakes 526 00:44:31,260 --> 00:44:35,901 as a guideline in case you ever feel the need to design radiation 527 00:44:35,901 --> 00:44:40,695 tolerant circuits. So we want to present two or three small issues we had in 528 00:44:40,695 --> 00:44:45,300 circuits where we were convinced they should have been working fine. So the first one, 529 00:44:45,300 --> 00:44:50,018 which you will probably recognize, is the full triple modular redundancy scheme that 530 00:44:50,018 --> 00:44:55,279 Szymon has presented. So we made sure to triplicate everything and we were relatively 531 00:44:55,279 --> 00:44:59,102 sure that everything should be fine. The only modification we made is that to all 532 00:44:59,102 --> 00:45:03,506 those registers in our design, we added a reset, because we wanted to initialize 533 00:45:03,506 --> 00:45:07,710 the system to some known state when we start up, which is a very obvious thing 534 00:45:07,710 --> 00:45:12,327 to do. Every CPU has a reset. But of course, what we didn't think about here 535 00:45:12,327 --> 00:45:16,577 was that at some point there's a buffer driving this reset line somewhere. And 536 00:45:16,577 --> 00:45:20,355 there's only a single buffer.
What happens if this buffer experiences a small 537 00:45:20,355 --> 00:45:24,501 transient event? Of course, the obvious thing that happened is that as soon as 538 00:45:24,501 --> 00:45:28,247 that occurred, all the registers were upset at the same time and were basically 539 00:45:28,247 --> 00:45:32,205 cleared and all our fancy protection was invalidated. So next time we decided, 540 00:45:32,205 --> 00:45:37,679 let's be smarter this time. And of course, we triplicate all the logic and all the 541 00:45:37,679 --> 00:45:40,633 voters and all the registers. So let's also triplicate the reset lines. And while 542 00:45:40,633 --> 00:45:44,955 the designer of that block probably had very good intentions, it turned out 543 00:45:44,955 --> 00:45:49,268 later, when we manufactured the chip, that it still sometimes showed a complete 544 00:45:49,268 --> 00:45:54,570 reset without any good explanation for that. And what was left out of the 545 00:45:54,570 --> 00:45:59,981 scope of thinking here was that this reset actually was connected to the system reset 546 00:45:59,981 --> 00:46:05,033 of the chip that we had. And typically pins on the chip are something that is 547 00:46:05,033 --> 00:46:09,005 not available in huge quantities. So you typically don't want to spend three pins 548 00:46:09,005 --> 00:46:13,128 of your chip just for a stupid reset that you don't use ninety nine percent of the 549 00:46:13,128 --> 00:46:17,895 time. So what we did at some point is that we just connected the reset lines again to a 550 00:46:17,895 --> 00:46:21,972 single input buffer, which was then connected to a pin of the chip. And of 551 00:46:21,972 --> 00:46:25,590 course, this also represented a small sensitive area in the chip. And again, 552 00:46:25,590 --> 00:46:30,216 a single upset here was able to destroy all three of our flip flops. All right.
553 00:46:30,216 --> 00:46:35,132 And the last lesson I'm bringing, or the last thing, goes back to the 554 00:46:35,132 --> 00:46:38,930 implementation details that Szymon has mentioned. So this time, a really simple 555 00:46:38,930 --> 00:46:42,532 circuit. We were absolutely convinced it must work because it was basically the 556 00:46:42,532 --> 00:46:46,072 textbook example that Szymon was presenting. And the code was so 557 00:46:46,072 --> 00:46:49,817 small we were able to inspect everything and were very sure that nothing 558 00:46:49,817 --> 00:46:54,690 should have happened. And what we saw when we went for this laser testing experiment, 559 00:46:54,690 --> 00:46:59,769 in simplified form, is basically that only this first voter was sensitive. When this was 560 00:46:59,769 --> 00:47:04,414 hit, all our registers were always upset, while the other ones 561 00:47:04,414 --> 00:47:09,161 never showed anything strange. And it took us quite a while to actually 562 00:47:09,161 --> 00:47:13,563 look at the layout later on and figure out that what was in the chip was rather this. 563 00:47:13,563 --> 00:47:17,250 So two of the voters were actually not there. And Szymon mentioned the reason for 564 00:47:17,250 --> 00:47:21,208 that. So synthesis tools these days are really clever at identifying redundant 565 00:47:21,208 --> 00:47:26,102 logic, and because we forgot to tell the tool not to optimize these redundant pieces of 566 00:47:26,102 --> 00:47:30,248 logic, which the voters really are, it just merged them into one. And that 567 00:47:30,248 --> 00:47:34,393 explains why we only saw this one voter being the sensitive one.
And of course, if 568 00:47:34,393 --> 00:47:38,255 you have a transient event there, then you suddenly upset all your registers, and that 569 00:47:38,255 --> 00:47:41,871 without even knowing it, having looked at every single line 570 00:47:41,871 --> 00:47:45,652 of Verilog code and being very sure everything should have been fine. But that 571 00:47:45,652 --> 00:47:51,805 seems to be how this business goes. So we hope you 572 00:47:51,805 --> 00:47:56,648 were able to get some insight into what we do to make sure the experiments at the 573 00:47:56,648 --> 00:48:01,966 LHC work fine, and what you can do to make sure the satellite you are working on 574 00:48:01,966 --> 00:48:06,393 will be working OK, even before launching it into space. If you're interested in 575 00:48:06,393 --> 00:48:10,715 some more information on this topic, feel free to pass by at the assembly I 576 00:48:10,715 --> 00:48:15,014 mentioned at the beginning or just meet us after the talk, and otherwise thank you 577 00:48:15,014 --> 00:48:22,286 very much. Applause 578 00:48:22,286 --> 00:48:27,041 Herald: Thank you very much indeed. There's about 10 minutes left for Q and A, 579 00:48:27,041 --> 00:48:31,872 so if you have any questions go to a microphone. And as a cautious reminder, 580 00:48:31,872 --> 00:48:38,297 questions are short sentences that end with a 581 00:48:38,297 --> 00:48:42,548 question mark, and the first question goes to the Internet. 582 00:48:42,548 --> 00:48:46,433 Internet: Well, hello. Um, do you also incorporate radiation as the source for 583 00:48:46,433 --> 00:48:50,596 randomness when that's needed? Stefan: So we personally don't. So in our 584 00:48:50,596 --> 00:48:56,880 designs we don't. But it is done indeed for random number generators. It is 585 00:48:56,880 --> 00:49:01,081 sometimes done that they use radioactive decay as a source for randomness.
So this 586 00:49:01,081 --> 00:49:03,989 is done, but we don't do it in our experiments. 587 00:49:03,989 --> 00:49:06,802 We rather want deterministic data out of the things we build. 588 00:49:06,802 --> 00:49:10,929 Herald: Okay. Next question goes to microphone number four. 589 00:49:10,929 --> 00:49:16,714 Mic 4: Do you do your triplication before or after elaboration? 590 00:49:16,714 --> 00:49:21,003 Szymon: So currently we do it before elaboration. So we decided that our tool 591 00:49:21,003 --> 00:49:25,764 works on Verilog input and it produces Verilog output because it offers much more 592 00:49:25,764 --> 00:49:30,496 flexibility in the way you can incorporate different triplication 593 00:49:30,496 --> 00:49:34,423 schemes. If you were to apply it only after elaboration, then of course doing a 594 00:49:34,423 --> 00:49:38,453 full triplication might be easy. But then having really precise control 595 00:49:38,453 --> 00:49:43,438 over the types of triplication on different levels is much more difficult. 596 00:49:43,438 --> 00:49:47,296 Herald: Next question from microphone number two. 597 00:49:47,296 --> 00:49:50,840 Mic 2: Is it possible to use DC-DC converters or switch mode power supplies 598 00:49:50,840 --> 00:49:54,630 within the radiation environment to power your logic? Or do you use only linear power? 599 00:49:54,630 --> 00:49:59,866 Szymon: Yes, alternatively we also have a dedicated program which develops radiation 600 00:49:59,866 --> 00:50:05,366 hardened DC-DC converters which operate in our environments. So they are available 601 00:50:05,366 --> 00:50:10,988 also for space applications, as far as I'm aware. And they are hardened against total 602 00:50:10,988 --> 00:50:16,027 ionizing dose as well as single event upsets. 603 00:50:16,027 --> 00:50:19,667 Herald: Okay, next question goes to microphone number one. 604 00:50:19,667 --> 00:50:22,614 Mic 1: Thank you very much for the great talk.
I'm just wondering, would it be 605 00:50:22,614 --> 00:50:27,435 possible to hook up every logic gate and every voter in a kind of mesh network? And 606 00:50:27,435 --> 00:50:31,873 what are the pitfalls and limitations for that? 607 00:50:31,873 --> 00:50:36,734 Stefan: So that is not something I'm aware of being done. So typically: No. I 608 00:50:36,734 --> 00:50:41,473 wouldn't say that that's something we would do. 609 00:50:41,473 --> 00:50:43,431 Szymon: I'm not really sure if I understood the question. 610 00:50:43,431 --> 00:50:46,401 Stefan: So maybe you can rephrase what your idea is? 611 00:50:46,401 --> 00:50:52,613 Mic 1: On the last slide, there was a lesson learned. 612 00:50:52,613 --> 00:50:56,253 Stefan: Yes. One of those? Mic 1: In here. Yeah. Would you be able to 613 00:50:56,253 --> 00:51:00,309 connect everything interchangeably in a mesh network? 614 00:51:00,309 --> 00:51:04,030 Szymon: So what you are probably asking about is whether we can build our own 615 00:51:04,030 --> 00:51:08,166 FPGA, like a programmable logic device. Mic 1: Probably. 616 00:51:08,166 --> 00:51:11,074 Szymon: Yeah. And so this we typically don't do, because in our experiments, our 617 00:51:11,074 --> 00:51:15,857 power budget is also very limited, so we cannot really afford this level of 618 00:51:15,857 --> 00:51:20,903 complexity. So of course you can make your FPGA design radiation hard, but this is 619 00:51:20,903 --> 00:51:24,890 not what we typically do in our experiments. 620 00:51:24,890 --> 00:51:28,630 Herald: Next question goes to microphone number two. 621 00:51:28,630 --> 00:51:32,059 Mic 2: Hi, I would like to ask if the orientation of your transistors and your 622 00:51:32,059 --> 00:51:38,029 chip is part of your design. So mostly you have something like a bounding box around 623 00:51:38,029 --> 00:51:42,921 your design, with an attack surface of different sizes.
So do you use this 624 00:51:42,921 --> 00:51:48,350 orientation to minimize the attack surface of the radiation on chips, if you know 625 00:51:48,350 --> 00:51:52,616 the source of the radiation? Szymon: No. So I don't think we do that. 626 00:51:52,616 --> 00:51:58,515 So, of course, we control the orientation of our transistors during the design phase. 627 00:51:58,515 --> 00:52:02,651 But usually in our experiment, the radiation is really perpendicular to the 628 00:52:02,651 --> 00:52:07,981 chip area, which means that if you rotate it by 90 degrees, you don't really gain 629 00:52:07,981 --> 00:52:12,082 that much. And moreover, our chips are usually mounted in a bigger 630 00:52:12,082 --> 00:52:16,625 system where we don't control how they are oriented. 631 00:52:16,625 --> 00:52:24,420 Herald: Again, microphone number two. Mic 2: Do you take metastability into 632 00:52:24,420 --> 00:52:33,140 account when designing voters? Szymon: The voter itself is combinatorial. 633 00:52:33,140 --> 00:52:38,820 So ... - Mic 2: Yeah, but if the state of the rest 634 00:52:38,820 --> 00:52:45,300 can change at any time, then the voters can have like glitches, yeah? 635 00:52:45,300 --> 00:52:51,140 Szymon: Correct. So that's why - so, we don't take it into account 636 00:52:51,140 --> 00:52:55,060 during the design phase. But if we use that scheme which is just displayed here, 637 00:52:55,060 --> 00:52:58,980 we avoid this problem altogether, right? Because even if you have metastability in 638 00:52:58,980 --> 00:53:05,300 one of the blocks like A, B or C, then it will be fixed in the next clock cycle. 639 00:53:05,300 --> 00:53:09,940 Because usually our systems operate on clocks with low frequencies, hundreds of 640 00:53:09,940 --> 00:53:13,236 megahertz, which means that any metastability should be resolved by the next 641 00:53:13,236 --> 00:53:15,065 clock cycle. Mic 2: Thank you.
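The scheme Szymon refers to (three register copies A, B and C, each reloaded through its own voter) can be sketched in a few lines of Python. This is a minimal behavioural model under stated assumptions, not the speakers' Verilog: the names majority, tick_three_voters and tick_merged_voter are made up for illustration, an upset or a wrongly resolved metastable value is modelled simply as a flipped bit, and a transient on a voter is modelled as an XOR on that voter's output. It illustrates two points from the talk: a single bad copy is outvoted and repaired on the next clock edge, while a single merged voter, as in the lesson learned, corrupts all three copies at once.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three register copies."""
    return (a & b) | (a & c) | (b & c)


def tick_three_voters(regs, glitch=(0, 0, 0)):
    """One clock edge: each copy is reloaded through its OWN voter.

    glitch[i] is a hypothetical transient, XORed onto the output of voter i.
    """
    return [majority(*regs) ^ g for g in glitch]


def tick_merged_voter(regs, glitch=0):
    """One clock edge when synthesis has merged the voters into one.

    All three copies are reloaded from the same (possibly glitched) output.
    """
    voted = majority(*regs) ^ glitch
    return [voted, voted, voted]


GOOD = 0b1010  # arbitrary example register content

# Case 1: copy B is upset (or resolves to the wrong value).
regs = [GOOD, GOOD ^ 0b0100, GOOD]
voted_output = majority(*regs)       # downstream logic still sees GOOD
repaired = tick_three_voters(regs)   # copy B is overwritten on the next edge

# Case 2: a transient hits one of three separate voters.
one_voter_hit = tick_three_voters([GOOD] * 3, glitch=(0b0001, 0, 0))
recovered = tick_three_voters(one_voter_hit)  # outvoted one cycle later

# Case 3: the same transient hits the single merged voter.
merged_hit = tick_merged_voter([GOOD] * 3, glitch=0b0001)
```

In case 3 all three copies agree on the wrong value, so no later vote can detect or repair the error. Real metastability is an analog effect that this digital model does not capture; the sketch only covers what happens once the value has settled to 0 or 1.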
642 00:53:15,065 --> 00:53:19,145 Herald: Next question, microphone number one. 643 00:53:19,145 --> 00:53:23,014 Mic 1: How do you handle the register duplication that can be performed by 644 00:53:23,014 --> 00:53:27,947 synthesis and place and route? So the tools will try to optimize timing sometimes by 645 00:53:27,947 --> 00:53:32,375 adding registers. And these registers are not triplicated. 646 00:53:32,375 --> 00:53:35,784 Stefan: Yes. So what we do is that, I mean, in a typical, let's say, standard ASIC 647 00:53:35,784 --> 00:53:40,405 design flow, this is not what happens. So you have to actually instruct the tool to do 648 00:53:40,405 --> 00:53:44,585 that, to do retiming and add additional registers. But for what we are doing, we 649 00:53:44,585 --> 00:53:48,174 have to - let's say not do this optimization and instruct the tool to keep 650 00:53:48,174 --> 00:53:52,823 all the registers we described in our RTL code, to keep them until the very end. And 651 00:53:52,823 --> 00:53:56,908 we really also constrain them to always keep their associated logic triplicated. 652 00:53:56,908 --> 00:54:01,759 Herald: The next question is from the internet. 653 00:54:01,759 --> 00:54:07,887 Internet: Do you have some simple tips for improving radiation tolerance? 654 00:54:07,887 --> 00:54:12,020 Stefan: Simple tips? Ahhhm... Szymon: Put your electronics inside a 655 00:54:12,020 --> 00:54:12,820 box. Stefan: Yes. 656 00:54:12,820 --> 00:54:17,380 some laughter There's just no 657 00:54:17,380 --> 00:54:22,980 single one-size-fits-all textbook recipe for this, as it really always comes down to 658 00:54:22,980 --> 00:54:28,020 analyzing your environment, really getting an awareness first of what rate and what 659 00:54:28,020 --> 00:54:31,940 number of events you are looking at, what type of particles cause them, and then 660 00:54:31,940 --> 00:54:36,420 taking the appropriate measures to mitigate them.
So there is no one-size-fits-all 661 00:54:36,420 --> 00:54:38,095 thing, I'd say. Herald: Next question goes to microphone 662 00:54:38,095 --> 00:54:41,620 number two. Mic 2: Hi. Thanks for the talk. How much 663 00:54:41,620 --> 00:54:47,611 of the software you use for design is actually open source? I only know of super 664 00:54:47,611 --> 00:54:54,495 expensive chip design software. Stefan: You're right, the core of all the 665 00:54:54,495 --> 00:55:00,604 implementation tools, like the synthesis and place and route stages for the ASICs 666 00:55:00,604 --> 00:55:04,987 that we design, is actually commercial closed source tools. And if 667 00:55:04,987 --> 00:55:10,443 you're asking for the fraction, that's a bit hard to answer. I cannot give a 668 00:55:10,443 --> 00:55:14,518 statement about the size of the commercial closed tools. But we try, with 669 00:55:14,518 --> 00:55:18,638 everything we develop, to make it available to the widest possible audience 670 00:55:18,638 --> 00:55:22,353 and therefore decided to make the extensions to this design flow available 671 00:55:22,353 --> 00:55:26,237 in public form. And that's why these tools that we develop and share among the 672 00:55:26,237 --> 00:55:30,541 community of ASIC designers in this environment are open source. 673 00:55:30,541 --> 00:55:35,196 Herald: Microphone number four. Mic 4: Have you ever tried using steered 674 00:55:35,196 --> 00:55:41,098 ion beams for more localized radiation ingress testing? 675 00:55:41,098 --> 00:55:44,495 Stefan: Yes, indeed! And the picture I showed actually, uh, I didn't disclaim 676 00:55:44,495 --> 00:55:49,311 that, but the facility you saw here is actually a facility in Darmstadt in 677 00:55:49,311 --> 00:55:53,366 Germany and is actually a micro beam facility. So it's a facility that allows 678 00:55:53,366 --> 00:55:58,400 steering a heavy ion beam really onto a single position with less than a 679 00:55:58,400 --> 00:56:01,808 micrometer accuracy.
So it provides probably exactly what you were asking for. 680 00:56:01,808 --> 00:56:05,854 But that's not the typical case. That is really a special thing. And it's probably 681 00:56:05,854 --> 00:56:09,405 also the only facility in Europe that can do that. 682 00:56:09,405 --> 00:56:13,316 Herald: Microphone number one. Mic 1: It was a very good talk. Thank 683 00:56:13,316 --> 00:56:19,282 you very much. My question is, did you compare what you did to what is done for 684 00:56:19,282 --> 00:56:25,380 securing secret chips? You know, when you have credit card chips, you can make fault 685 00:56:25,380 --> 00:56:29,949 attacks on them so you can make them malfunction and extract the cryptographic 686 00:56:29,949 --> 00:56:33,830 key, for example, from the banking card. There are techniques to harden these 687 00:56:33,830 --> 00:56:38,207 chips against fault attacks. So those are like voluntary faults, while you have like 688 00:56:38,207 --> 00:56:43,121 random faults due to, like, involuntary attacks. You know what? Can you explain if 689 00:56:43,121 --> 00:56:47,294 you compared in a way what you did to this? 690 00:56:47,294 --> 00:56:50,861 Stefan: Um, so no, we didn't explicitly compare it, but it is right that the 691 00:56:50,861 --> 00:56:54,427 techniques we present can also be used in a variety of different contexts.
So one 692 00:56:54,427 --> 00:56:59,134 thing that's not exactly what you are referring to, but relatively on a similar 693 00:56:59,134 --> 00:57:03,513 scale, is that currently in very small technologies you get into problems with the 694 00:57:03,513 --> 00:57:07,855 reliability and yield of the manufacturing process itself, meaning that sometimes 695 00:57:07,855 --> 00:57:11,721 just the metal interconnection between two gates in your circuit might be broken 696 00:57:11,721 --> 00:57:16,297 after manufacturing, and then adding this sort of redundancy with the same kinds of 697 00:57:16,297 --> 00:57:20,576 techniques can be used to produce more working chips out of a 698 00:57:20,576 --> 00:57:24,715 manufacturing run. So in this sort of context, these sorts of techniques are 699 00:57:24,715 --> 00:57:30,674 used very often these days. But, um, I'm pretty sure they can be applied to 700 00:57:30,674 --> 00:57:34,953 these sorts of, uh, security fault attack scenarios as well. 701 00:57:34,953 --> 00:57:39,703 Herald: Next question from microphone number two. 702 00:57:39,703 --> 00:57:44,126 Mic 2: Hi, you briefly also mentioned the mitigation techniques on the cell level 703 00:57:44,126 --> 00:57:52,426 and yesterday there was a very nice talk from the Libre Silicon people and they 704 00:57:52,426 --> 00:57:55,914 are trying to build a standard cell library, uh, open source standard cell 705 00:57:55,914 --> 00:58:00,015 library. So are you in contact with them or maybe you could help them to improve 706 00:58:00,015 --> 00:58:03,980 their design and then the radiation hardness? 707 00:58:03,980 --> 00:58:07,430 Stefan: No. We also saw the talk yesterday, but we are not yet in 708 00:58:07,430 --> 00:58:14,180 contact with them. No. Herald: Does the Internet have questions? 709 00:58:14,180 --> 00:58:21,380 Internet: Yes, I do. Um, two in fact.
First one would be: would TTL or other BJT 710 00:58:21,380 --> 00:58:26,740 based logic be more resistant? Szymon: Uh, yeah. So, depending on which 711 00:58:26,740 --> 00:58:31,126 type of errors we are considering. So BJT transistors, they have ... 712 00:58:31,126 --> 00:58:35,917 Stefan in his part mentioned that displacement damage is not a problem for 713 00:58:35,917 --> 00:58:40,305 CMOS devices, but that is not the case for BJT devices. So when they are exposed 714 00:58:40,305 --> 00:58:47,074 to high energy hadrons or protons, they degrade a lot. So that's why we don't 715 00:58:47,074 --> 00:58:52,393 really use them in our environment. They could probably be much more robust to 716 00:58:52,393 --> 00:58:57,369 single event effects because their resistance everywhere is much lower. But 717 00:58:57,369 --> 00:59:01,633 they would have other problems. And also another problem which is worth 718 00:59:01,633 --> 00:59:06,204 mentioning is that those devices consume much, much, much more power, which 719 00:59:06,204 --> 00:59:13,041 we cannot afford in our applications. Internet: And the last one would be: how do 720 00:59:13,041 --> 00:59:19,396 I use the output of the full TMR setup? Is it still three signals? How do I know 721 00:59:19,396 --> 00:59:26,260 which one to use and to trust? Stefan: Um, yes. So with this, um, 722 00:59:26,260 --> 00:59:30,047 architecture, what you could either do is really apply the full triplication scheme 723 00:59:30,047 --> 00:59:34,804 to your whole logic tree basically and really triplicate everything or, and 724 00:59:34,804 --> 00:59:38,903 that's going in the direction of one of the lessons learned I had, at some point 725 00:59:38,903 --> 00:59:43,261 of course you have an interface to your chip, so you have pins left and right that 726 00:59:43,261 --> 00:59:46,630 are inputs and outputs.
And then you have to decide: either you want to spend the 727 00:59:46,630 --> 00:59:51,025 effort and also have three dedicated input pins for each of the signals, or you at 728 00:59:51,025 --> 00:59:54,260 some point have the voter and say, okay, at this point all these signals are 729 00:59:54,260 --> 00:59:58,202 combined, but I was able to reduce the amount of sensitive area in my chip 730 00:59:58,202 --> 01:00:03,780 significantly and can live with the very small remaining sensitive area that just 731 01:00:03,780 --> 01:00:07,460 the input and output pins provide. Szymon: So maybe I will add one more thing, 732 01:00:07,460 --> 01:00:11,780 which is that typically in our systems, of course we triplicate our logic internally, 733 01:00:11,780 --> 01:00:15,300 but when we interface with the external world, we can apply another protection 734 01:00:15,300 --> 01:00:20,340 mechanism. So for example, for our high speed serialisers, we will use different types 735 01:00:20,340 --> 01:00:23,733 of encoding to add protect..., to add like forward error correction 736 01:00:23,733 --> 01:00:30,340 codes which would allow us to recover these types of faults in the backend later on. 737 01:00:30,340 --> 01:00:36,522 Herald: Okay. If ...if we keep this very, very short. Last question goes to 738 01:00:36,522 --> 01:00:41,401 microphone number two. Mic 2: I don't know much about physics. So 739 01:00:41,401 --> 01:00:47,370 just the question, how important is the physical testing after the chip is 740 01:00:47,370 --> 01:00:51,895 manufactured? Isn't the simulation, the computer simulation, enough if you just 741 01:00:51,895 --> 01:00:56,332 shoot particles at it? Stefan: Yes and no. So in principle, of 742 01:00:56,332 --> 01:01:01,267 course, you are right that you should be able to simulate all the effects we look 743 01:01:01,267 --> 01:01:06,531 at.
The problem is that as the designs grow big, and they do grow bigger as the 744 01:01:06,531 --> 01:01:10,892 technologies shrink, this final netlist that you end up with 745 01:01:10,892 --> 01:01:15,175 can have millions or billions of nodes and it just is not feasible anymore to 746 01:01:15,175 --> 01:01:19,558 simulate it exhaustively because you have so many dimensions you have to 747 01:01:19,558 --> 01:01:25,852 change when you inject, for example, bit flips or transients in your design in any 748 01:01:25,852 --> 01:01:30,745 of those nodes for varying time offsets. And the state space the circuit 749 01:01:30,745 --> 01:01:34,553 can be in is just too huge to capture in a full simulation. So it's not possible 750 01:01:34,553 --> 01:01:38,803 to exhaustively test it in simulation. And so typically you end up having missed 751 01:01:38,803 --> 01:01:43,048 something that you discover only in the physical testing afterwards, which you 752 01:01:43,048 --> 01:01:47,311 always want to do before you put your, uh, your chip into the final experiment or on your 753 01:01:47,311 --> 01:01:50,934 satellite and then realise it's not working as intended. So it has a big 754 01:01:50,934 --> 01:01:55,540 importance as well. Herald: Okay. Thank you. Time is up. All 755 01:01:55,540 --> 01:01:58,584 right. Thank you all very much. 756 01:01:58,584 --> 01:02:04,602 applause 757 01:02:04,602 --> 01:02:09,599 36c3 postroll music 758 01:02:09,599 --> 01:02:32,100 Subtitles created by c3subtitles.de in the year 2021. Join, and help us!