36C3 intro music Herald: The next talk will be titled 'How to Design Highly Reliable Digital Electronics', and it will be delivered to you by Szymon and Stefan. Warm applause for them. applause Stefan: All right. Good morning, Congress. So perhaps every one of you in the room here has at one point or another in their lives witnessed their computer behaving weirdly and doing things that it was not supposed to do or that you didn't anticipate. And typically that would probably have been the result of a software bug of some sort somewhere inside the huge software stack your PC is running on. But have you ever considered what the probability might be of this weird behavior being caused by a bit flip somewhere in the memory of your computer? What you can see in this video on the screen now is a physics experiment called a cloud chamber. It's a very simple experiment that is able to visualize and make apparent the constant stream of background radiation we are all constantly exposed to. What's happening here is that highly energetic particles, for example from space, travel through gaseous alcohol, collide with alcohol molecules, and in the process form a trail of condensation. And if you think about your computer: a typical cell of RAM, of which you might have, I don't know, 4, 8, 10 gigabytes in your machine, is only about 80 nanometers wide. So it's very, very tiny. And you can probably appreciate the small amount of energy that is used to store the information inside each of those bits, and the sheer number of those bits you have in the RAM of your computer. So a couple of years ago, there was a study that concluded that in a computer with about four gigabytes of RAM, a bit flip caused by such an event, by cosmic background radiation, can occur about once every 33 hours. So a bit less than once per day. 
In an incident in 2008, a Qantas Airways flight actually nearly crashed, and the cause was traced back to be very likely a bit flip somewhere in one of the CPUs of the avionics system, which nearly caused the deaths of a lot of passengers on this plane. In 2003, a small municipal election in Belgium had a weird hiccup in which one of the candidates suddenly got 4096 extra votes added in a single instant. And that was traced back to be very likely caused by cosmic background radiation flipping a memory cell somewhere that stored the vote count. It was only discovered because this number of votes for this particular candidate was considered unreasonable; otherwise it would probably have gone undetected. So a few words about us: Szymon and I both work at CERN in the microelectronics section, and we both develop electronics that need to be tolerant to these sorts of effects. We develop radiation tolerant electronics for the experiments at CERN, at the LHC, among a lot of other applications. You can meet the two of us at the Lötlabor Jena assembly if you are interested in what we are talking about today. We will also give a small workshop about radiation detection tomorrow, in one of the seminar rooms. So feel free to pass by; it will be a quick introduction. To give you a small idea of the kind of environment we are working for: if you took a default Intel i7 CPU from your notebook and put it anywhere we operate our electronics, it would die very quickly, in a matter of probably one or two minutes, and it would die for more than just one reason, which is rather interesting. So the idea for today's talk is to give you all an insight into all the things that need to be taken into account when you design electronics for radiation environments. 
What kinds of different challenges arise when you try to do that. We classify and explain the different types of radiation effects that exist. And then we also present what you can do to mitigate these effects, and how to validate that what you did to protect your circuits actually worked. And of course, as we do that, we'll give our view on how we develop radiation tolerant electronics at CERN and what our workflow looks like to make sure this works. So let's first take a step back and have a look at what we mean when we say radiation environments. The first one you probably have in mind when you think about radiation is space. Interstellar space is basically filled with very fast, highly energetic electrons and protons and all sorts of high energy particles. And as they, for example, pass close to planets such as our Earth, these planets sometimes have a magnetic field that deflects the highly energetic particles and can protect the planet from this radiation. But in the process, around these planets, the particles sometimes form radiation belts, known as the Van Allen belts after James Van Allen, who discovered this effect a long time ago. And a satellite orbiting the Earth might, depending on what orbit is chosen, sometimes pass through these belts of highly intense radiation. That, of course, then needs to be taken into account when designing electronics for such a satellite. And if Earth itself is not able to give you enough radiation, you may think of the famous Juno mission to Jupiter. In the environment of Jupiter they anticipated so much radiation that they decided to put all the electronics of the spacecraft inside a cube of one centimeter thick titanium, which is famously known as the Juno Radiation Vault. 
But space is not the only radiation environment. Another form of radiation you will probably all recognize when I show you this picture, which is an X-ray image of a hand: X-rays are also a form of radiation. And while the doses or amounts of radiation any patient is exposed to during diagnosis or treatment of a disease are comparatively small, that is not the full story when it comes to medical applications. This is a medical particle accelerator which is used for cancer treatment. In these sorts of accelerators, typically carbon ions or protons are accelerated, focused, and used to selectively destroy cancer cells in the body. And this comes already relatively close to the environment we are working in and working for. Szymon and I are working, for example, on electronics for the CMS detector inside the LHC, for which we build dedicated, radiation tolerant, integrated circuits which have to withstand very, very large amounts and doses of radiation in order to function correctly. If we didn't specifically design electronics for that, basically the whole system would never be able to work. To illustrate the scale of this environment: this is an event display of a collision that was recorded in the ATLAS experiment. Each of those tiny little traces you can make out in this diagram is actually one or multiple secondary particles that were created in the initial collision of two proton bunches inside the experiment. And each of those, of course, races through the detector electronics, which make these traces visible, possibly itself decaying into multiple other secondary particles, which all go through our electronics. And if that doesn't sound, let's say, bad enough for digital electronics: these collisions happen about 40 million times a second, of course multiplying the number of events, and problems, they can cause in our circuits. 
So now we want to introduce all the things that can happen, the different radiation effects. But first, let's take a step back and look at what we mean when we say digital electronics or digital logic, which we want to focus on today. From your university lectures or your reading, you probably know the first class of digital logic, which is combinatorial logic. This is typically logic that just computes a direct function of the inputs of a circuit and produces an output, as exemplified with the AND, OR, NAND and XOR gates that you see here. But even though we use those everywhere in our circuits, you probably also want to store state in a more complex circuit; for example, the registers of your CPU store some sort of internal information. For that we use the other class of logic, which is called sequential logic. This is typically clocked with some system clock frequency, and it changes its output in relation to its inputs whenever this clock signal changes. Now let's look at how all these different logic functions are implemented. Typically nowadays, as you may know, we use CMOS technologies and basically represent all this logic functionality as digital gates built from small PMOS and NMOS transistors. And when we try to build a model for more complex digital circuits, we typically use something called the finite state machine model, which consists of a combinatorial and a sequential part. You can see that the output of the circuit depends both on the internal state inside the register as well as on the input to the combinatorial logic. And accordingly, the internal state is always updated from the inputs as well as the current state. So this is the simple model for more complex systems that can be used to model different effects. 
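The finite state machine model described above can be sketched in a few lines of Python. The saturating counter used here, and all the names, are our own illustration, not anything from the talk; the point is only that both the next state and the output are functions of the current state and the input.

```python
# Toy finite state machine: the next state and the output both depend on
# the current state (sequential part, the "register") and on the input
# (combinatorial part). A 2-bit saturating up/down counter is used purely
# as an example.

def next_state(state, up):
    """Combinatorial next-state logic."""
    if up:
        return min(state + 1, 3)   # saturate at 3
    return max(state - 1, 0)       # saturate at 0

def output(state, up):
    """Combinatorial output logic: an 'overflow' flag."""
    return state == 3 and up == 1

def clock_tick(state, up):
    """One clock edge: the register captures the new state."""
    return next_state(state, up), output(state, up)

state = 0
trace = []
for up in (1, 1, 1, 1, 0):
    state, out = clock_tick(state, up)
    trace.append((state, out))
```

A single event upset, in this model, would amount to `state` suddenly changing between two clock ticks without `next_state` being involved.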
Now let's look at what radiation can actually do to transistors. And for that we are going to have a quick recap of what a transistor actually is and what it looks like. As you may know, in CMOS technologies transistors are built on wafers of high purity silicon, a crystalline, very regularly organized lattice of silicon atoms. To form a transistor on such a wafer, we add dopants in order to form diffusion regions, which will later become the source and drain of our transistors. Then on top of that we grow a layer of insulating oxide, and on top of that we put polysilicon, which forms the gate terminal of the transistor. In the end we end up with an equivalent circuit a bit like this one. Now, to put things back into perspective, note that the dimensions of these structures are very tiny: we are talking about tens of nanometers for some of the dimensions I've outlined here. And as the technologies shrink, these become smaller and smaller, and you can therefore appreciate the small amounts of energy that are used to store information inside these digital circuits, which makes them more sensitive to radiation. So let's take a look at the different types of radiation effects that exist. We typically differentiate them into two main classes of events. The first one is the cumulative effects, which are effects that, as the name implies, accumulate over time. As the circuit sits inside some radiation environment, over time it accumulates more and more dose, which worsens its performance or changes how it operates. And on the other side, we have the Single Event Effects, which are events that happen at some instantaneous point in time and then suddenly, unpredictably, change how the circuit operates, or whether it works at all. 
I'm going to first go into the class of cumulative effects, and later on Szymon will cover the other class, the Single Event Effects. In terms of these accumulating effects, we basically have two main subclasses: the first being ionization or TID effects, for Total Ionizing Dose, and the second being displacement damage. Displacement damage does exactly what it sounds like: it covers all the effects that happen when an atom in the silicon lattice is actually displaced, i.e. removed from its lattice position, which changes the structure of the semiconductor. Luckily, these effects don't have a big impact on the CMOS digital circuits we are looking at today, so we will disregard them for the moment and look more at the ionization damage, or TID. Ionization, as a quick recap, is whenever electrons are removed from or added to an atom, effectively transforming it into an ion. These effects are especially critical for the circuits we are building because they change the behavior of the transistors. Without going too deep into the semiconductor details, I just want to show the typical effect we are concerned about using this very simple circuit. This is just an inverter circuit consisting of two transistors. What the circuit does in normal operation is take an input signal, invert it, and give the inverted signal at the output. And as the transistors are irradiated and accumulate dose, you can see that the edges of the output signal get slower: the transistors take longer to turn on and off. What that does in turn is limit the maximum operating frequency of your circuit. And of course, that is not something you want; you need your circuit to operate at some given frequency in your final system. 
And if the maximum frequency it can work at degrades over time, at some point it will fail because the maximum frequency is just too low. So let's have a look at what we can do to mitigate these effects. The first measure, and I already mentioned it when talking about the Juno mission, is shielding. If you can actually put a box around your electronics and shield any radiation from hitting your transistors, it is obvious that they will last longer and suffer less from radiation damage than they otherwise would. This approach is very often used in space applications like satellites, but it's not very useful if you are actually trying to measure the radiation with your circuits, as we do, for example, in the particle accelerators we build integrated circuits for. There, first of all, we want to measure the radiation, so we cannot shield our detectors from it. And we also don't want to influence the tracks of these secondary collision products with any shielding material that would be in the way. So shielding is not very useful in a particle accelerator environment, and we have to resort to different methods. As I said, we design our own integrated circuits in the first place, so we have some freedom in what we call transistor level design. We can actually alter the dimensions of the transistors, making them larger to withstand larger doses of radiation, and we can use special layout techniques that we can experimentally verify to be more resistant to radiation effects. The third measure, which is probably the most important one for us, is what we call modeling. We are actually able to characterize all the effects that radiation will have on a transistor. And if we can do that, we know: 'If I put it into a radiation environment for a year, how much slower will it become?' 
Then it is of course easy to say: 'OK, I can just over-design my circuit and make it a bit simpler, maybe with less functionality, but able to operate at a higher frequency, and therefore withstand the radiation effects for a longer time while still working sufficiently well at the end of its expected lifetime.' So that's more or less what we can do about these effects. And I'll hand over to Szymon for the second class. Szymon: Contrary to the cumulative effects presented by Stefan, the other group are Single Event Effects, which are caused by high energy deposits from a single particle or a shower of particles. They can happen at any time, even seconds after irradiation has started. It means that if your circuit is vulnerable to this class of effects, it can fail immediately once radiation is present. Here we also classify these effects into several groups. The first are hard, or permanent, errors, which as the name indicates can permanently destroy your circuit. This type of error is typically critical for power devices, where you have large power densities, and is not so much of a problem for digital circuits. The other class of effects are soft errors. Here we distinguish transient, or Single Event Transient, errors, which are spurious signals propagating in your circuit as a result of a gate being hit by a particle. They are especially problematic for analog circuits or asynchronous digital circuits, but under some circumstances they can also be problematic for synchronous systems. The other class of problems are static, or Single Event Upset, problems, which basically means that a memory element like a register gets flipped. And then, of course, if your system is not designed to handle this type of error properly, it can lead to a failure. In the following part of the presentation we'll focus mostly on soft errors. So let's try to understand the origin of this type of problem. 
As Stefan mentioned, the typical transistor is built out of diffusions, gate and channel. Here you can see one diffusion; let's assume it is a drain diffusion. When a particle goes through and deposits charge, it creates free electron-hole pairs, which, in the presence of an electric field, get collected by means of drift. This results in a large but very short current spike. The rest of the charge can then be collected by diffusion, which is a much slower process, so the amplitude of that part of the event is much, much smaller. So let's try to understand what could happen in a typical memory cell. On this schematic you can see the simplest memory cell, which is composed of two back-to-back inverters. Let's assume that node A is initially at high and node B at low potential. Then a particle hits the drain of transistor M1, which creates a short circuit current between drain and ground, bringing the drain of transistor M1 to low potential. This also acts on the gate of the second inverter, temporarily changing its state from low to high, which reinforces the wrong state in the first inverter. At this point the error is locked in your memory cell and you have basically lost your information. So you may be asking yourself: 'How much charge is really needed to flip the state of a memory cell?' You can get this number either from simulations or from measurements. Let's assume we inject some current into the sensitive node, for example the drain of transistor M1. On the top plot you will see current as a function of time; on the second plot, the output voltage, so the voltage at node B, as a function of time; and on the lowest plot, the probability of having a bit flip. 
If you inject very little current, of course nothing changes at the output. But once you start increasing the amount of current you are injecting, you see that something appears at the output, and at some point the output will toggle, so it will switch to the other state. At this point, if you calculate the area under the current curve, you find the critical charge needed to flip the memory cell. If you go further and start injecting even more current, you will not see much difference in the output voltage waveform; it could become only slightly faster. At this point you can also notice that the probability has jumped to one, which means that any time you inject that much charge there is a fault in your circuit. For now, we have only characterized the bit flip from 0 to 1 at node B. Of course we should also do the same for the other direction, from 1 to 0, and usually it is slightly different. Then we should inject into all the other nodes, for example node B, and study all possible transitions. At the end, if you take the superposition of these effects and multiply them by the active area of each node, you end up with what we call the cross section, which has the dimension of centimeters squared and tells you how sensitive your circuit is to this type of effect. Then, knowing the radiation profile of your environment, you can calculate the expected upset rate in the final application. So now, having covered the basics of the single event effects, let's check how we can mitigate them. Here, technology also plays a significant role: newer technologies of course offer us much smaller devices. 
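The two calculations just described, integrating the injected current pulse to get the critical charge, and scaling a per-bit cross section by the particle flux to get an upset rate, can be sketched as follows. All numbers here are invented for illustration; they are not values from the talk.

```python
# 1) Critical charge: area under the current pulse that just flips the
#    cell, via trapezoidal integration (uA * ns = fC).
# 2) Expected upset rate: per-bit cross section x particle flux x bits.

def charge_fc(times_ns, currents_ua):
    """Trapezoidal integral of a current pulse in uA over time in ns."""
    q = 0.0
    for i in range(1, len(times_ns)):
        dt = times_ns[i] - times_ns[i - 1]
        q += 0.5 * (currents_ua[i] + currents_ua[i - 1]) * dt
    return q

# Triangular pulse peaking at 100 uA and lasting 0.1 ns -> 5 fC deposited.
t = [0.0, 0.05, 0.1]
i_ua = [0.0, 100.0, 0.0]
q_crit = charge_fc(t, i_ua)

# Upset rate for a memory, with an assumed per-bit cross section and flux.
sigma_cm2 = 1e-14          # per-bit cross section (assumed)
flux = 1e7                 # particles / cm^2 / s (assumed, accelerator-like)
bits = 8 * 2**30           # one gigabyte of storage
upsets_per_s = sigma_cm2 * flux * bits
```

The same integration applied to the real simulated or measured pulse shapes, repeated per node and per transition direction, is what builds up the cross section curve the talk refers to.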
And together with that, what follows is that supply voltages are usually getting smaller and smaller, as are the node capacitances. For Single Event Upsets this is very bad, because the critical charge required to flip a bit is getting less and less. At the same time, though, the physical dimensions of our transistors are getting smaller, which means the cross section for them being hit is also getting smaller. So overall, the effect really depends on the circuit topology and the radiation environment. Another protection method can be introduced at the cell level. Here we could imagine increasing the critical charge. The easiest way to do that is to increase the node capacitance, for example by using larger transistors. But of course, this also enlarges the collection electrode, which is not nice. Another way would be to just increase the capacitance by adding some extra metal capacitance, but that of course slows down the circuit. Yet another approach is to try to store the information on more than two nodes. I showed you that in a simple SRAM cell we store information on only two nodes, so you could try to come up with some other cell, for example like this one, in which the information is stored on four nodes. You can see that the architecture is very similar to the basic SRAM cell. But you should always be careful to simulate your design very thoroughly, because if you analyze this circuit, you will quickly realize that even though the information is stored on four different nodes, the same type of loop exists as in the basic circuit, meaning that in the end the circuit offers basically no hardening with respect to the previous cell. We can actually do better. Here you can see a typical dual interlocked cell. The number of transistors is exactly the same as in the previous example, but now they are interconnected slightly differently. 
Here you can see that this cell also has two stable configurations. But this time, the low level from a given node can propagate only to the left-hand side, while the high level can propagate only to the right-hand side. And since each stage is inverting, a fault cannot propagate further than one node. Of course, this cell has some drawbacks: it consumes more area than a simple SRAM cell, and write access requires driving at least two nodes at the same time to really change the state of the cell. So you may ask yourself, how effective is this cell? Here I will show you a cross section plot: the probability of having an error as a function of injected energy. As a reference, you can see a pink curve at the top, which is for a normal, unprotected cell. In green you can see the cross section for an error in the DICE cell. As you can see, it is one order of magnitude better than the normal cell, but the cross section is still far from negligible. The problem was identified to be caused by the fact that some sensitive nodes were very close together in the layout and could therefore be upset by the same particle. Because, as we mentioned, single devices are very small: we are talking about dimensions below a micron. After realizing that, we designed another cell in which we separated the sensitive nodes further, and we ended up with the blue curve. As you can see, the cross section was reduced by two more orders of magnitude and the threshold was increased significantly. If you don't want to redesign your standard cells, you can also apply mitigation techniques at the block level. Here we can use some encoding to protect our state better. As an example, I will show you a typical Hamming code: to protect four bits, we have to add three additional parity bits, which are calculated according to this formula. 
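The Hamming(7,4) parity formula referred to on the slide can be written out explicitly; this is the standard textbook construction, with the bit layout chosen here for illustration.

```python
# Hamming(7,4): three parity bits protect four data bits, and the
# recomputed parities form a syndrome that directly encodes the position
# of a single flipped bit.

def encode(d):
    """d = [d1, d2, d3, d4] -> 7-bit codeword."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # Bit positions 1..7: parity bits sit at the power-of-two positions.
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def correct(c):
    """Recompute parities; a nonzero syndrome points at the flipped bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error
    c = list(c)
    if syndrome:
        c[syndrome - 1] ^= 1          # repair the flipped bit
    return [c[2], c[4], c[5], c[6]]   # extract the data bits again

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                          # a single event upset on one bit
```

Running `correct(word)` recovers the original four data bits despite the upset.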
Once you have calculated the parity bits, you can use them to check the integrity of your internal state. If the recomputed parity bits are not all equal to zero, they form a syndrome indicating where the error happened, and you can use this information to correct the error. Of course, in this case the efficiency is not great, because we need three additional bits to protect only four bits of information. But as the state length increases, the protection becomes more efficient. Another approach would be to do even less, meaning that instead of changing anything in your design, you just triplicate it, or multiply it many times, and vote on which state is correct. This concept is called triple modular redundancy, and it is based around a voter cell: a cell which has an odd number of inputs and whose output is always equal to the majority of its inputs. As I mentioned, the idea is that you have, for example, three circuits, A, B and C, and during normal operation, when they are identical, the output is also the same. However, when there is a problem, for example in logic part B, and its output is affected, the problem is effectively masked by the voter cell and is not visible from outside the circuit. But you have to be careful not to take this picture as a design template. Let's analyze what would happen with a state machine similar to the one Stefan introduced if you were to just use this concept. Here you can see three state machines and a voter on the output. As we can see, if you have an upset in, for example, state register A, then that state is broken, but the output of the circuit, indicated by the letter s, is still correct, because the B and C registers are still fine. But what happens if some time later we have an upset in memory element B or C? Then of course the state of our system is broken and we cannot recover it. 
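The failure mode just described can be seen in a toy simulation: with plain triplication and no feedback, the first upset is masked but never repaired, so a later upset in a second copy defeats the majority vote. The code and names are our own minimal sketch.

```python
# Plain triplication without voter feedback: the first upset is masked,
# but the corrupted copy is never repaired, so a later upset in a second
# copy breaks the majority.

def vote(a, b, c):
    """Majority of three bits."""
    return (a & b) | (b & c) | (a & c)

state = [1, 1, 1]            # three copies of one state bit
state[0] ^= 1                # first upset, in copy A
after_first = vote(*state)   # still correct: two copies agree on 1

state[1] ^= 1                # later upset, in copy B (A was never repaired)
after_second = vote(*state)  # majority is now wrong
```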
So you can ask yourself: what can we do better to avoid this situation? And just to be sure: please do not use this technique to protect your circuits. The easiest mitigation is to use the output of the voter cell itself as the input to your logic. What this gives us is that whenever we have an upset in one of the memory elements, the next computation, for the next stage, always uses the voter output, which ensures that the error will be removed one clock cycle later. So unless you get another hit before then, it will not affect our state. Until now we have considered only upsets in our registers, but what happens if we have a transient in our voter? You can see that if there is no state change, a transient in the voter doesn't impact our system. But if you are really unlucky and the transient happens right at the clock transition, so at the moment we latch the data, we can corrupt the state in all three registers at the same time, which is less than ideal. To overcome this limitation, you can consider skewing the clocks by some time larger than the maximum transient duration. Now, because each register samples the output of the voter at a slightly different time, we can corrupt only one flip-flop at a time. Of course, if you are unlucky, you can still have problematic situations in which one register is already in the new state while another register is still in the old state, which can lead to nondeterministic results. So it is better, but still not ideal. As a general theme, you have seen that we kept adding more and more resources, so you can ask yourself what would happen if we triplicated everything. In this case, we triplicate the registers, we triplicate our logic and our voters. And now you can see that whenever we have an upset in a register, it can only affect one register at a time, and the error will be removed from the system one clock cycle later. 
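The voter feedback behavior can be sketched at the cycle level: the voter masks the upset immediately, and because every register reloads from the voter output on the next clock edge, the corrupted copy is flushed one cycle later. Again a minimal sketch of ours, not the actual circuit.

```python
# Triplication WITH voter feedback: each register samples the voter
# output, so a single upset is overwritten on the next clock edge.

def vote(a, b, c):
    """Majority of three bits."""
    return (a & b) | (b & c) | (a & c)

regs = [1, 1, 1]                 # three copies of one state bit
regs[0] ^= 1                     # single event upset in copy A -> [0, 1, 1]
masked = vote(*regs)             # the voter output is still correct

# Next clock edge: all three registers reload from the voter output,
# which removes the error from the system after one cycle.
regs = [masked, masked, masked]
```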
Also, if we have an upset in the voter or in the logic, it can be latched into only one register, which means that in principle we have created a system that is really robust. Unfortunately, nothing is for free. Here I compare different triplication schemes, and as you can see, the more protection you want to have, the more you have to pay in terms of resources, meaning power and area. And usually you also pay a small penalty in terms of maximum operating speed. So which flavor of protection you use really depends on the application. For the most sensitive circuits you probably want to use full TMR, and you may leave some other bits of logic unprotected. Alternatively, if your system is not mission critical and you can tolerate some downtime, you can consider scrubbing, which means periodically checking the state of your system and refreshing it if an error is detected, using some parity bits or a copy of the data kept in a safe place. Or you can have a watchdog which will find out that something went wrong and just reinitialize the whole system. So now, having covered the basics of all the effects we have to face, we would like to show you the basic flow which we follow when designing our radiation hardened circuits. Of course, we always start with specifications. We try to understand the radiation environment in which the circuit is meant to operate, and we come up with specifications for the total dose that could be accumulated and for the rate of single event upsets. At this stage, it is also not rare that we decide to move some functionality out of the detector volume to the outside, where we can use some sort of commercial equipment to do the number crunching. But let's assume that we go ahead with our ASIC. Having the specifications, we of course proceed with the functional implementation. This we typically do with hardware description languages, so Verilog or VHDL, which you may know from a typical FPGA flow. 
And of course we write a lot of simulations to understand whether we are meeting our functional goals and whether our circuit behaves as expected. Then we select some parts of the circuit which we want to protect from radiation effects. For example, we can decide to use triplication or some other method; these days we typically use triplication as the most straightforward and very effective method. So you can ask yourself: how do we triplicate the logic? The simplest way would be to just copy and paste the code three times, add some postfixes like A, B and C, and you are done. But of course this solution has some drawbacks: it is time consuming and it is very error prone. Maybe you have noticed that I had a typo there. So of course we don't want to do that. Instead, we developed our own tool, which we called TMRG, which automates the process of triplication and eliminates the two main drawbacks I just described. After we have our code triplicated, and of course after rerunning all the simulations to make sure that everything went as expected, we then proceed to the synthesis process, in which we convert our high level hardware description language to a gate level netlist, in which all the functions are mapped to gates, both combinatorial and sequential, as introduced by Stefan. And here we also have to be careful, because modern CAD tools have a tendency, of course, to optimize the logic as much as possible, and our logic in most cases is genuinely redundant, so from the tool's point of view it should be removed. We really have to make sure that it is not removed. That's why our tool also provides some constraints for the synthesizer to make sure that our design intent is clearly understood by the tool. 
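The idea of mechanical triplication can be illustrated with a toy generator. This is emphatically not the TMRG tool, just a sketch of why generating the copies and voters programmatically avoids the copy-paste typos mentioned above; the Verilog-like output format and all names are invented.

```python
# Toy illustration of automated triplication: every register name gets
# three suffixed copies plus a majority-voter assignment, generated
# mechanically so no copy can be forgotten or mistyped.

def triplicate(reg_names):
    lines = []
    for name in reg_names:
        copies = [f"{name}{suffix}" for suffix in ("A", "B", "C")]
        for copy in copies:
            lines.append(f"reg {copy};")
        a, b, c = copies
        # Majority vote of the three copies, in Verilog-like syntax.
        lines.append(
            f"assign {name}Voted = ({a} & {b}) | ({b} & {c}) | ({a} & {c});"
        )
    return lines

netlist = triplicate(["state", "counter"])
```

A real tool additionally has to parse the HDL, triplicate logic and voters, and emit synthesis constraints, which is exactly the part that is hard to get right by hand.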
And once we have the output netlist, we proceed to the place and route process, where this netlist representation is mapped to a layout of what will soon become our digital chip: we place all the cells and we route the connections between them. And here there is another danger, which I mentioned already: in modern technologies the cells are so small that several could easily be affected by a single particle at the same time. So we really have to space out the cells which are responsible for keeping the information about the state, to make sure that a single particle cannot upset, for example, copies A and B of the same register. Then, in the last step, we of course have to verify that everything we have done is correct. At this level we also try to introduce single event effects in our simulations: we can randomly flip bits in our system, and we can also inject transients. Typically we used to do that at the netlist level, which works fine and is very nice. But the problem with this approach is that we can only perform these checks very late in the design cycle, which is less than ideal. And also, if we find that there is a problem in our simulation, a typical netlist at this level has a few orders of magnitude more lines than our initial RTL code, so tracing back the problematic line of code is not so straightforward. So you can ask yourself: why not try to inject errors in the RTL design? And the answer is that it is not so trivial to map the hardware description language's high level constructs to what will become combinatorial or sequential logic. In order to eliminate this problem, we developed another open source approach: we decided to use Yosys, the open source synthesis tool from Clifford Wolf, which was presented at Congress several years ago.
So we use this tool to make a first pass through our RTL code to understand which elements will be mapped to sequential and which to combinatorial logic. And then, having this information, we use cocotb, a Python verification framework, which gives us programmatic access to these nodes, and we can effectively inject the errors in our simulations. And I forgot to mention that the TMRG tool is also open source. So if you are interested in any of the tools, please feel free to contact us. And of course, after our simulation is done, in the next step we really tape out: we submit our chip for manufacturing, and hopefully a few months later we receive our chip back. Stefan: All right. So after patiently waiting for a couple of months while your chip is in manufacturing, and spending the time preparing a test setup and preparing yourself to actually test whether your chip works as you expected, it's probably also a good time to think about how to actually validate whether all the measures that you've taken to protect your circuit from radiation effects are effective or not. And again, we will split this in two parts. You will probably want to start with testing for the total ionizing dose effects, so for the cumulative effects, and for that you typically use X-ray radiation, relatively similar to the one used in medical treatment. This radiation is relatively low-energy, which has the upside of not producing any single event effects, so you can really only accumulate radiation dose and focus on the cumulative effects. Typically you would use a machine that looks somewhat like this, a relatively compact thing you can have in your laboratory, and you can use it to accumulate large amounts of radiation dose on your circuit. And then you need some mechanism to verify, or to quantify, how much your circuit slows down due to this radiation dose.
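The error injection Szymon described can also be modeled outside a simulator. The following sketch is a standalone Python behavioral model (not cocotb code; the class and method names are ours): it flips one random bit per cycle in a triplicated register and checks that a single upset never corrupts the voted output, as long as scrubbing refreshes the copies before a second upset can accumulate:

```python
import random

class TMRRegister:
    """Behavioral model of a triplicated register behind a majority voter."""

    def __init__(self, width: int = 8):
        self.width = width
        self.copies = [0, 0, 0]

    def write(self, value: int) -> None:
        self.copies = [value] * 3

    def read(self) -> int:
        a, b, c = self.copies
        return (a & b) | (b & c) | (a & c)  # bitwise majority vote

    def inject_upset(self, rng: random.Random) -> None:
        """Flip one random bit in one random copy (one single event upset)."""
        i = rng.randrange(3)
        self.copies[i] ^= 1 << rng.randrange(self.width)

rng = random.Random(2019)
reg = TMRRegister()
reg.write(0xA5)
for _ in range(10_000):
    reg.inject_upset(rng)
    assert reg.read() == 0xA5   # a single upset is always outvoted
    reg.write(reg.read())       # scrub: refresh all copies from the voted value
```

Removing the scrubbing line makes the assertion fail eventually, because two independent upsets can land in different copies at the same bit position and outvote the clean copy — the same accumulation argument made earlier in the talk.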
And if you do that, you typically end up with a graph such as this one, where on the x axis you have the radiation dose your circuit was exposed to, and on the y axis you see that the frequency has gone down over time. You can use this information to say: "OK, in my final application I expect this level of radiation dose, and I can see that my circuit will still work fine under some given environmental or operating condition." So this is the test for the first class of effects. The test for the second class of effects, the single event effects, is a bit more involved. There you would typically start with a heavy ion test campaign. You would go to a specialized, relatively rare facility; we have a couple of those in Europe, and one would look perhaps somewhat like this. It's a small particle accelerator. They typically have different types of heavy ions at their disposal that they can accelerate and then shoot at your chip, which you place in a vacuum chamber, and these ions deposit very well known amounts of energy in your circuit, and you can use that information to characterize your circuit. The downside is that these facilities tend to be relatively expensive to access and also a bit hard to access, so typically you need to book them a long time in advance, and that's sometimes not very easy. But what it offers you is that you can use different types of ions with different energies, so you can make a very well-defined sensitivity curve, similar to the one Szymon described that you can get from simulations, and really characterize how often single event effects will appear in the final application, if there are any remaining effects left, if you have left something unprotected. The problem here is that these particle accelerators typically just bombard your circuit with thousands of particles per second, and they hit basically the whole area in a random fashion.
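The way such a dose-versus-frequency curve is used can be sketched numerically: given a few measured (dose, maximum frequency) points, interpolate the dose at which the circuit no longer meets the clock frequency the application requires. All numbers below are made up for illustration:

```python
# Hypothetical measurements: (total ionizing dose in Mrad, max clock in MHz)
measurements = [(0, 200.0), (100, 190.0), (200, 175.0), (400, 150.0)]

def max_tolerable_dose(f_required: float) -> float:
    """Linearly interpolate the dose at which the maximum operating
    frequency degrades down to the frequency the application needs."""
    for (d0, f0), (d1, f1) in zip(measurements, measurements[1:]):
        if f1 <= f_required <= f0:
            return d0 + (f0 - f_required) * (d1 - d0) / (f0 - f1)
    raise ValueError("required frequency is outside the measured range")

# With this (invented) data, a design clocked at 160 MHz stays
# within spec up to an interpolated dose of 320 Mrad.
print(max_tolerable_dose(160.0))
```

In practice the comparison would also fold in operating conditions (temperature, supply voltage) as the speaker notes, but the basic margin calculation is this interpolation.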
So you don't really have a way of steering those particles or measuring their position. Typically you are a bit in the dark and have to know very carefully the behavior of your circuit, and all the quirks it has even without the radiation, to instantly notice when something has gone wrong. And this is typically not very easy; you can compare it with having some weird crash somewhere in your software stack and then having to first take a look and see what actually has happened. Typically you find something that has not been properly protected, you see some weird effect in your circuit, and then you try to get a better idea of where that problem is actually located. And the answer for these types of problems involving position is, of course, always lasers. So we have two types of laser experiments available that can be used to probe your circuit more selectively for these problems. The first one is the single photon absorption laser, and it is relatively simple in terms of setup: you just use a single laser beam that shoots straight up at your circuit from the back. While it does that, it deposits energy all along the silicon and also in the diffusions of your transistors, and is therefore able to inject energy there, potentially upsetting a bit of memory or exposing whatever other single event effects you have. And of course you can steer this beam across the surface of your chip, or whatever circuit you are testing, and find the sensitive locations. The problem here is that the amount of energy that is deposited is really large, due to the fact that the beam has to go through the whole silicon until it reaches the transistor, and therefore it's mostly used to find the destructive effects that really break something in your circuit. The more clever and somehow beautiful experiment is the two photon absorption laser experiment, in which you use two laser beams of a different wavelength.
And these individually do not have enough energy to cause any effect in your silicon if only one of the laser beams is present; but in the small location where the two beams intersect, the energy is actually large enough to produce the effect. This allows you to induce charge very selectively, in only a very small volume, and cause an effect in your circuit. When you do that, you can systematically scan both the X and Y directions across your chip, and also the Z direction, and really measure the volume of the sensitive area. And this is what you would typically get out of such an experiment. In black and white in the back you see an infrared image of your chip, where you can really make out the individual structural components, and overlaid in blue you can highlight all the sensitive points where you measured something you didn't expect, some weird bit flip in a register or something. You can then go to your layout software and find the register or the gate in your netlist that is responsible for this. And from there it's more like operating a debugger in a software environment: tracing back what the line of code responsible for this bug is. And to close out, it is always best to learn from mistakes, and we offer our mistakes as a guideline in case you ever feel the need to design radiation tolerant circuits yourself. So we want to present two or three small issues we had in circuits where we were convinced everything should have been working fine. The first one, which you will probably recognize, is the full triple modular redundancy scheme that Szymon presented. We made sure to triplicate everything, and we were relatively sure that everything should be fine. The only modification we made is that we added a reset to all the registers in our design, because we wanted to initialize the system to some known state at startup, which is a very obvious thing to do. Every CPU has a reset.
But of course, what we didn't think about here was that at some point there's a buffer driving this reset line somewhere, and there's only a single buffer. What happens if this buffer experiences a small transient event? Of course, the obvious thing happened: as soon as it did, all the registers were upset at the same time and were basically cleared, and all our fancy protection was invalidated. So next time we decided: let's be smarter this time. Of course we triplicate all the logic and all the voters and all the registers, so let's also triplicate the reset lines. And while the designer of that block probably had very good intentions, it turned out later, when we manufactured the chip, that it still sometimes showed a complete reset without any good explanation. What was left out of the scope of thinking here was that this reset was actually connected to the system reset of the chip. And pins on a chip are typically not available in huge quantities, so you typically don't want to spend three pins of your chip just for a stupid reset that you don't use ninety nine percent of the time. So at some point we just connected the reset lines again to a single input buffer that was connected to a pin of the chip. And of course, this also represented a small sensitive area in the chip, and again, a single upset here was able to destroy all three of our flip flops. All right. And the last lesson I'm bringing goes back to the implementation details that Szymon mentioned. This time it was a really simple circuit. We were absolutely convinced it must work, because it was basically the textbook example that Szymon was presenting, and the code was so small we were able to inspect everything, so we were very sure that nothing could go wrong.
And what we saw when we went for this laser testing experiment, in simplified form, was basically that only this first voter was sensitive: when it was hit, all our registers were always upset, while the other voters never showed anything strange. And it took us quite a while to actually look at the layout later on and figure out that what was in the chip was rather this: two of the voters were actually not there. Szymon mentioned the reason for that. Synthesis tools these days are really clever at identifying redundant logic, and because we forgot to tell the tool not to optimize away these redundant pieces of logic, which the voters really are, it just merged them into one. And that explains why we only saw this one voter being the sensitive one. And of course, if you have a transient event there, then you suddenly upset all your registers, without even knowing it, while being sure, having looked at every single line of Verilog code, that everything should have been fine. But that seems to be how this business goes. So we hope you were able to get some insight into what we do to make sure the experiments at the LHC work fine, and what you can do to make sure the satellite you are working on might work OK, even before launching it into space. If you're interested in some more information on this topic, feel free to pass by the assembly I mentioned at the beginning, or just meet us after the talk. Otherwise, thank you very much. Applause Herald: Thank you very much indeed. There's about 10 minutes left for Q and A, so if you have any questions go to a microphone. And as a cautious reminder, questions are short sentences that start with a question word and end with a question mark. The first question goes to the Internet. Internet: Well, hello. Um, do you also incorporate radiation as a source for randomness when that's needed? Stefan: So we personally don't. In our designs we don't.
But it is indeed done for random number generators: radioactive decay is sometimes used as a source of randomness. So this is done, but we don't do it in our experiments. We rather want deterministic data out of the things we build. Herald: Okay. Next question goes to microphone number four. Mic 4: Do you do your triplication before or after elaboration? Szymon: So currently we do it before elaboration. We decided that our tool works on Verilog input and produces Verilog output, because it offers much more flexibility in the way you can incorporate different triplication schemes. If you were to apply it only after elaboration, then of course doing a full triplication might be easy, but having really precise control over the types of triplication at different levels is much more difficult. Herald: Next question from microphone number two. Mic 2: Is it possible to use DCDC converters or switch mode power supplies within the radiation environment to power your logic? Or do you use only linear power? Szymon: Yes, we also have a dedicated program which develops radiation hardened DCDC converters that operate in our environments. They are available for space applications as well, as far as I'm aware, and they are hardened against total ionizing dose as well as single event upsets. Herald: Okay, next question goes to microphone number one. Mic 1: Thank you very much for the great talk. I'm just wondering, would it be possible to hook up every logic gate and every voter in a kind of mesh network? And what are the pitfalls and limitations of that? Stefan: So that is not something I'm aware of being done. So typically: no, I wouldn't say that that's something we would do. Szymon: I'm not really sure if I understood the question. Stefan: So maybe you can rephrase what your idea is? Mic 1: On the last slide, there was a lesson learned. Stefan: Yes. One of those? Mic 1: In here. Yeah.
Would you be able to connect everything interchangeably in a mesh network? Szymon: So what you are probably asking about is whether we can build our own FPGA-like programmable logic device. Mic 1: Probably. Szymon: Yeah. This we typically don't do, because in our experiments our power budget is also very limited, so we cannot really afford this level of complexity. Of course you can make your FPGA design radiation hard, but this is not what we would typically do in our experiments. Herald: Next question goes to microphone number two. Mic 2: Hi, I would like to ask if the orientation of the transistors in your chip is part of your design. So mostly you have something like a bounding box around your design, with an attack surface of different sizes. Do you use this orientation to minimize the attack surface of the radiation on the chips, if you know the source of the radiation? Szymon: No, I don't think we do that. Of course, we control the orientation of transistors during the design phase, but usually in our experiment the radiation is really perpendicular to the chip area, which means that if you rotate it by 90 degrees you don't really gain that much. And moreover, our chips are usually mounted in a bigger system where we don't control how they are oriented. Herald: Again, microphone number two. Mic 2: Do you take metastability into account when designing voters? Szymon: The voter itself is combinatorial. So... Mic 2: Yeah, but if the state of the registers can change at any time, then the voters can have glitches, yeah? Szymon: Correct. So to avoid this - we don't take it into account during the design phase, but if we use the scheme which is displayed here, we avoid this problem altogether, right? Because even if you have metastability in one of the blocks like A, B or C, it will be fixed in the next clock cycle.
Because usually our systems operate at relatively low clock frequencies, in the range of hundreds of megahertz, which means that any metastability should be resolved by the next clock cycle. Mic 2: Thank you. Herald: Next question, microphone number one. Mic 1: How do you handle the register duplication that can be performed by synthesis and place and route? The tools will sometimes try to optimize timing by adding registers, and these registers are not triplicated. Stefan: Yes. So what we do is - I mean, in a typical, let's say, standard ASIC design flow this is not what happens; you have to actually instruct the tool to do that, to do retiming and add additional registers. But for what we are doing, we have to, let's say, not do this optimization and instruct the tool to keep all the registers we described in our RTL code until the very end. And we really also constrain them to always keep their associated logic triplicated. Herald: The next question is from the internet. Internet: Do you have some simple tips for improving radiation tolerance? Stefan: Simple tips? Ahhhm... Szymon: Put your electronics inside a box. Stefan: Yes. some laughter There's just no single one-size-fits-all textbook recipe for this, as it really always comes down to analyzing your environment, really getting an awareness first of what rate and what number of events you are looking at, what type of particles cause them, and then taking the appropriate measures to mitigate them. So there is no one-size-fits-all thing, I'd say. Herald: Next question goes to microphone number two. Mic 2: Hi. Thanks for the talk. How much of the software you use to design is actually open source? I only know super expensive chip design software. Stefan: Right, the core of all the implementation tools, like the synthesis and place and route stages for the ASICs that we design, are actually commercial closed source tools. And if you're asking for the fraction, that's a bit hard to answer.
I cannot give a statement about the size of the commercial closed tools. But everything we develop ourselves we try to make available to the widest possible audience, and we therefore decided to make our extensions to this design flow available in public form. That's why the tools that we develop and share among the community of ASIC designers in this environment are open source. Herald: Microphone number four. Mic 4: Have you ever tried using steered ion beams for more localized radiation ingress testing? Stefan: Yes, indeed! The picture I showed - I actually didn't mention it, but the facility you saw here is a facility in Darmstadt in Germany, and it is actually a microbeam facility. So it's a facility that allows steering a heavy ion beam onto a single position with less than a micrometer of accuracy. So it provides probably exactly what you were asking for. But that's not the typical case; that is really a special thing, and it's probably also the only facility in Europe that can do that. Herald: Microphone number one. Mic 1: That was a very good talk, thank you very much. My question is, did you compare what you did to what is done for securing security chips? You know, with credit card chips you can perform fault attacks on them, so you can make them malfunction and extract the cryptographic key, for example, from the banking card. There are techniques to harden these chips against fault attacks, which are deliberate faults, whereas you have random faults caused by radiation. Can you explain whether you compared what you did to this? Stefan: Um, so no, we didn't explicitly compare it, but it is right that the techniques we present can also be used in a variety of different contexts.
So one thing that's not exactly what you are referring to, but on a relatively similar scale, is that currently, in very small technologies, you get problems with the reliability and yield of the manufacturing process itself, meaning that sometimes just the metal interconnection between two gates in your circuit might be broken after manufacturing, and then adding this sort of redundancy with the same kinds of techniques can be used to produce more working chips out of a manufacturing run. So in that context these sorts of techniques are used very often these days. But I'm pretty sure they can be applied to these security fault attack scenarios as well. Herald: Next question from microphone number two. Mic 2: Hi, you briefly mentioned the mitigation techniques on the cell level, and yesterday there was a very nice talk from the Libre Silicon people; they are trying to build an open source standard cell library. So are you in contact with them, or maybe you could help them improve the radiation hardness of their design? Stefan: No. We also saw the talk yesterday, but we are not yet in contact with them. Herald: Does the Internet have questions? Internet: Yes, I do. Two, in fact. The first one would be: would TTL or other BJT based logic be more resistant? Szymon: Uh, yeah. So it depends on which type of errors we are considering. For BJT transistors - Stefan mentioned in his part that displacement damage is not a problem for CMOS devices, but that is not the case for BJT devices. When they are exposed to high energy hadrons or protons, they degrade a lot. That's why we don't use them in our environment. They would probably be much more robust to single event effects, because their resistance everywhere is much lower, but they would have other problems.
And another problem which is worth mentioning is that those devices consume much, much more power, which we cannot afford in our applications. Internet: And the last one would be: how do I use the output of the full TMR setup? Is it still three signals? How do I know which one to use and to trust? Stefan: Um, yes. So with this architecture, what you could do is either really apply the full triplication scheme to your whole logic tree, and really triplicate everything, or - and that's going in the direction of one of the lessons learned I presented - at some point you of course have an interface to your chip, so you have pins left and right that are inputs and outputs. And then you have to decide: either you want to spend the effort and also have three dedicated input pins for each of the signals, or at some point you place the voter and say: okay, at this point all these signals are combined, but I was able to reduce the amount of sensitive area in my chip significantly and can live with the very small remaining sensitive area that just the input and output pins represent. Szymon: Maybe I will add one more thing: typically in our systems we of course triplicate our logic internally, but when we interface with the external world we can apply another protection mechanism. For example, for our high speed serialisers we use different types of encoding to add forward error correction codes, which allow us to recover from these types of faults in the backend later on. Herald: Okay. If we keep this very, very short, the last question goes to microphone number two. Mic 2: I don't know much about physics, so just the question: how important is the physical testing after the chip is manufactured? Isn't the computer simulation enough, if you just simulate shooting particles at it? Stefan: Yes and no. In principle, of course, you are right that you should be able to simulate all the effects we look at.
The problem is that the designs grow big - and they do grow bigger as the technologies shrink - so the final netlist that you end up with can have millions or billions of nodes, and it just is not feasible anymore to simulate it exhaustively, because there are so many dimensions you have to vary when you inject, for example, bit flips or transients in your design, in any of those nodes, at varying time offsets. The state space the circuit can be in is just too huge to capture in a full simulation. So it's not possible to exhaustively test it in simulation, and typically you end up having missed something that you discover only in the physical testing afterwards, which you always want to do before you put your chip into the final experiment or on your satellite and then realize it's not working as intended. So it has a big importance as well. Herald: Okay. Thank you. Time is up. All right. Thank you all very much. applause 36c3 postroll music Subtitles created by c3subtitles.de in the year 2021. Join, and help us!