music Herald: Good morning and welcome back to stage one. It's going to be the second talk about physics today already, and it's about big data and science. Big data became something like Uber in science: it's everywhere, every discipline has it. Axel Naumann is working for CERN, the accelerator lab in Switzerland, and he talks about how physics and computing bridge in this area. He works a lot with ROOT, a program that helps transform data into knowledge. A warm welcome. Axel Naumann: Thank you. applause AN: Thanks a lot. So, well, you know, when I was discussing this abstract with the science track people, they told me: "Well, you know, about three hundred people might be in the audience." But well, hey, you are huge, that's much more than three hundred people. So thank you so much for inviting me over, it's a real honor. And of course, originally, when talking to 300 people who are all interested in science, I thought, you know, I'd pick something with a fairly narrow focus, but then I learned I'm going to be in Saal 1 and that's different, so I decided to make the scope a little bit wider, and that's what I ended up with. I'll talk a little bit about CERN in society as well, if you so choose; you'll see what that means in a minute. So the things I'll cover here are, obviously, CERN, just a little bit of an introduction; how we do physics; how we do computing; what data means to us, and I can tell you it means everything, you heard about that already, right?; and how we do data analysis in high energy physics. And just because we've been doing it for a while, and because I've been doing it for more than ten years, I'm one of the guys who's providing the software to do data analysis in high energy physics, so, you know, because we know what we are doing and we have some experience, I thought maybe you might be interested in hearing what my forecast is for data analysis in general, in the future. So let's start with CERN. If you wonder what CERN is, well, you've all heard about CERN, about the fantastic fonts we love to use, and then you've probably also heard that we are doing science. We were founded right after the Second World War, or soon after the Second World War, basically as a way to entertain those freaky scientists. You know, that was the idea: Europe-wide peace. And damn, that's working out really well, so well that it's not just Europe anymore these days. We are located near Geneva, and we are doing only fundamental research, so we don't do any weapons, nuclear stuff, you know, these kinds of things. The WWW was invented at CERN, but that was just a, you know, side effect; it happens sometimes that we invent things. But usually we just do science. So what we do is, we take money, lots of it, and brains who like to discuss and think and come up with ideas, and from that we generate knowledge. It's really all about curiosity. The questions we try to answer are, for example: What is mass? Which is a funny question, right? Like, we all know what mass is, but actually we don't. We know what mass is in the universe: we understand that masses attract one another, gravity, which is beautifully correct. And at the small scale, with particles, we know that mass is energy and we can convert one into the other. But we don't understand how these two things go together. Like, there is no bridge; they contradict one another. So we are trying to understand what that bridge might be. Part of that mass thing is of course also: What's out there in the universe? That's a big question. We only understand a few percent of that.
Ninety-some percent is completely unknown to us, and that's scary, right? I mean, we know gravity really well, we can deal with freaky things like black holes, and yet we don't understand what's out there. Now, to do all these things we are probing nature at the smallest scale, as we call it, so that's particles; we are dealing with things like the Higgs particle and supersymmetry. Here's a little bit of a fact sheet. We have about 12,000 physicists who are working with CERN. We are basically the workbench that you saw in Andre's talk before; we are the table that physicists use, okay? And so they come to CERN once in a while, about 10,000 physicists a year, or they work remotely most of the time, from about 120 nations. So you see, it's not European anymore, this is a global thing. CERN itself has about 2,500 employees, you know, those scrubbing the table, setting things up and so on. And our table is right here. In the far end we have the Alps, it's in Switzerland as I said, so the Alps are always close, with Mont Blanc; we have Lake Geneva; we have the Jura, the French mountains, on the lower end here; it's just beautiful. It's really nice, but we needed to stick a 30-kilometer ring in there somewhere, and people would have hated us had we put it like this. But luckily people were smart back then in the 70s and built a tunnel instead, much better. So now we have this huge tunnel, and we send particles through in both directions near the speed of light, and the tunnel is filled with magnets, simply because if you don't use a magnet the particles will fly straight, but we need them to turn around (see the small calculation after this paragraph). Here you see what it's looking like; you also see these big halls there that have access shafts from the top, and that's where the experiments are. That's sort of a sketch of one of the experiments. So the LHC is one of the, no, is the biggest particle accelerator at the moment. It's a ring with 27 kilometers circumference, 100 meters below Switzerland and France; it has four big experiments and several small ones, and we are expected to run until 2030. So you see that all of that is large-scale, simply because we're trying to make good use of the money we have. Here you see one of these caverns that are used by the experiments, while it was still empty. The experiment was then lowered through this hole in the roof, piece by piece, and these things are humongous. To give you an impression of how big it is, I put Waldo in there, so your job for the next three slides is to find Waldo. You know, that gives you the scale. He's waving at you in a friendly way, so it should be easy to find him. So then we put a detector in there. Here it's pulled apart a little bit, so it looks nicer, you can actually see something. You can for example see the beam pipe, so that's where the particles are flying through; they're coming from both directions and colliding in the center of the detector, and then things happen, and we try to understand what is happening. That's yet another view, a frontal view of one of the detectors, and now you have to imagine that, you know, you can't just open up Amazon and order an LHC experiment, right, that's not how it works. We do this stuff ourselves: PhD students, postdocs, engineers. You know, that's all done by hand, just like the microscope you saw before. Of course you order the parts, but you know, the design, the whole conception, and actually screwing these things together, making sure that it all fits, is all done by hand.
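As a rough aside on why the ring is filled with magnets: a charged particle's momentum, the magnetic field, and the bending radius are tied together by the standard relation p [GeV/c] ≈ 0.3 · B [T] · r [m]. A minimal sketch, with approximate numbers that are not from the talk:

```cpp
// Back-of-the-envelope sketch (illustrative numbers, not from the talk):
// why bending multi-TeV protons needs a kilometers-wide ring of magnets.
#include <cstdio>

int main() {
    const double momentum_GeV = 6500.0; // one LHC proton beam in Run 2
    const double field_T = 8.33;        // design field of the LHC dipoles
    // p ≈ 0.3 * B * r for unit charge  =>  r = p / (0.3 * B)
    const double radius_m = momentum_GeV / (0.3 * field_T);
    std::printf("bending radius ~ %.0f m\n", radius_m); // about 2600 m
    // The LHC's actual bending radius is roughly 2800 m; the dipoles only
    // fill part of the 27 km circumference, hence the slightly larger ring.
    return 0;
}
```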
And that hand-built construction, I find just beautiful, I mean that's close to a miracle, right? That nations, like, people no matter what nation, people across the globe work together to build such a huge thing, and then you turn it on and it works. More or less, but you get it to work. That's not my applause, that's your applause, because you make this possible. Really. It's huge; this is for me one of the things I love most about CERN: this international thing that just works smoothly. Now, the detectors are like a massive camera. We have lots of pixels and we take many, many pictures a second. We do this to identify particles and then sort of estimate what has happened during the collision. Now, life at CERN is of course an important ingredient for scientists as well, and if you live at CERN then actually it's just work at CERN, and that's what it's about. But it's not that bad: we hang out together in our control rooms, make sure that the experiments work correctly. We also, you know, study the forces. laughter We have scientific discourse, in the sun, with a view of the Mont Blanc, with a good coffee. We have lectures and we are lectured, and of course, like you, we have more laptops than people. And then we do stuff, and so this presentation is going to introduce you to some of the things we are doing, more on the computing and the society side, as I said. But because I have so much to talk about, I decided that you just build your own talk; you tell me what you want to hear. So let's do this: you can choose between A, physics, and B, model, simulation and data. You remember these books from the old days when we were all young? It's that kind of thing, okay? You design your own talk here. So, by applause: do you want to hear about physics? applause Okay. Or the model, simulation and data part? louder applause Okay, there we go. So this is what we skip. Model, simulation and data it is. You're a strange crowd, the first time I meet people who don't want to hear about physics... no, I'm kidding. laughter Audience: inaudible interjection laughter So, model, simulation and data it is. So, our theory is actually incredibly precise. It's so precise that our basic job is really, really boring, because we already understand everything. Whenever there is a collision, we know what's going to happen. Except for these very rare things. So we are trying to find these very rare things in this haystack of fairly boring things that we really understand well. And the weird things are, for example, monopoles, supersymmetry, or black holes. Now, the theorists' job is to tell us what we should be seeing in the detector, given some fancy physics. Then we use simulation to see how our detector would respond to that. Now, of course, we are just counting, basically, when we do experiments, and the question is: How often do we need to see something to say: "Well, that's not just the ordinary. That is something new, that's something that could be explained by a weird theory."? We use the detector simulation, as I said, to basically predict how often we expect to see things. We use reconstruction software, which tells us what has happened, or might have happened, in the detector, to count how often we saw something. And then we use statistics to compare these two and to say whether something is expected or not. Now, that's fairly abstract, but it's a fairly common approach.
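To make that counting logic concrete, here is a minimal sketch (an illustration, not code from the talk): compare an observed event count against the background expected from known physics, and express the excess in standard deviations using the simple Gaussian approximation.

```cpp
// Minimal illustration of the counting logic described above (a sketch,
// not CERN code). We expect `b` background events from known physics and
// observe `n`; in the Gaussian approximation the significance of the
// excess is (n - b) / sqrt(b). Particle physics conventionally claims a
// discovery at roughly 5 sigma.
#include <cmath>
#include <cstdio>

int main() {
    const double expected_background = 100.0; // from theory + detector simulation
    const double observed = 160.0;            // counted by the reconstruction software
    const double significance =
        (observed - expected_background) / std::sqrt(expected_background);
    std::printf("excess: %.1f sigma\n", significance); // 6.0 sigma here
    return 0;
}
```

Real analyses use proper Poisson statistics, likelihoods, and systematic uncertainties, but the principle is exactly this comparison of observed versus expected counts.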
For example, if you look at climate versus weather, right, I mean, we always have temperature fluctuations because of weather, and the question is: Is that rise in temperature because of a weather effect or because of a climate effect? Is that large-scale or just a short-term fluctuation? So there we have a very similar problem, and here what you do is you measure temperatures, and you want to detect abnormal variations, and you can improve that by measuring longer, like, for 300 years instead of 20 years. That gives you a better prediction of what you would expect in the future. Also, larger deviations help, right? If you look for something that is just 0.1 degree, then you might not be able to find it. If there is a deviation of 5 degrees, you will definitely find it. And for us it's very similar. So here we have a plot, one of the first Higgs discovery plots, and you can see that we have many ingredients there. The black dots are what we measure, and they have a certain uncertainty, because when we measure, we count, and we might have, you know, not seen something, or we might have seen more than we should have seen, so there's always an uncertainty. And then we also have theory, which tells us: you should have seen so many. So the red part, that's something that we know exists; it's nothing spectacular. It's simply what theory is telling us we should be seeing. And you can see the data follows the red part fairly well. But then there is this other bump in our dots, on the right-hand side, or in the center, and that does not make sense unless you take the Higgs into account, right, which is the light blue part. And so here you can see how this interplay between different sources of physics and statistics works for us. Now, just as for the climate, more data helps. And there are two versions of more data: either more data by having more collisions, which is why we are running 24/7, or more data by combining different analyses, which is what's happening here. So here you see all these different analyses. If you combine them, of course, you get a much stronger prediction of, in this case, the Higgs mass than if you just take any single one of them. You see how similar what we are doing is to, you know, any of the big data analyses out there. Okay, so that was that part. Now comes the obligatory part again: computing. When we were designing the LHC, not me, when people were designing the LHC, they needed to project computing power from 1990 to 2000, 2010 and so on. And then they said: "Well, we need massive amounts of computers", and for you that's now "Ughhh, everybody has it"; we have it as well, we have our racks of computers. This is something that the big companies usually don't show: you know, there is actually a ramp where the trucks arrive and they offload the things, and then someone needs to screw them together, and then it looks shiny. This is how we are spending our CPU time: we have about 60,000 cores that are spinning all the time for us, and they are distributed around the world. You can see that CERN, for example, is the red part there near the bottom. Yeah, so we make good use of that. We also monitor the efficiency, and because 100 percent efficient is for beginners, we are actually about 700 percent efficient. Don't ask why. They decided if you are multi-threading, then we, you know, multiply your efficiency by the number of threads you have. Makes no sense to me. We also have storage; currently we use about 0.7 exabytes.
We have 1.7 exabytes available, so that's good, we make good use of the storage we have. That's, you know, tera-, peta-, exa-, so it's a lot. And here, on the right-hand side, you see for example the tape usage on the bottom, and you see this dip: that was before we were starting the accelerator again, we needed to make some space. So we monitor our storage usage all the time. Hey, here comes the next decision point: do you want to hear about, 1, distributed computing, or, 2, measuring the effects of bugs? So, 1, distributed computing applause and, 2, measuring the effects of bugs similar amount of applause Okay, so that's my call, and I would say we do... measuring the effects of bugs, because it's shorter. laughter So this is one of the views you can, you know, the electronic views you can get from a detector, and you see how we trace the particles that fly through the detector. Now, that's software, right, that's the result of software, and, you might not believe it, we have bugs in there, in that software. And you know, these bugs are sometimes wrong coordinate transformations, so things don't go this way but that way, it's kind of weird if you look at it, and the result is that our particles don't take the path that they should have been taking, but we attribute a different path to them. Now, the nice thing is that we are doing this a million times, right? So all of that is smeared. We are not systematically doing this wrong; it's just, we are always doing it a little bit wrong. And so the net result is that if we measure our particles, we will not measure the right thing, but always a little bit wobbly left, wobbly right, you know? Things are not as precise. That's simply an uncertainty. So for us, just like counting has an uncertainty and predictions have an uncertainty, software bugs introduce another source of uncertainty. And here you can see how we are tracking uncertainties for all of our analyses. We are trying to understand the different sources of uncertainty. And again, bugs are only one of the sources here, so if we find a bug, then we reduce our uncertainty and we can find new physics earlier, instead of having to wait and collect more data. So for us, finding bugs is really key; we really love finding bugs, because it brings physics closer. I thought that was interesting. It's kind of rare that you're in an environment where you're able to measure the effect of bugs. Okay, so now we'll be talking about data. I told you that we are trying to find particle traces in our data, and the way we do this is by using reconstruction programs, and these are multiple gigabytes of binaries in shared libraries and stuff. They're huge, they're experiment specific and they are curated by the experiments, open source for some of them, and we want them to be correct and efficient. The data format we use is not comma-separated values; it's binary, and for some strange reason it's our own custom binary format. The reason is that it's really targeted at the kind of data we have. We have collisions that are independent, so we only need one in memory at any time, and we have nested collections, which makes the regular table layout a non-starter. We actually generate the format from C++ objects, so from classes, class definitions, C++ class definitions, and we can read the data back into C++, but also into JavaScript or Scala. Databases just didn't do it for us.
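To give a flavor of what that looks like in practice, here is a hedged sketch in ROOT (toy classes of my own, not an experiment's real event model): a C++ class with nested collections is written out one collision at a time, with the byte layout derived automatically from the class definition.

```cpp
// Toy sketch of ROOT-style serialization from a C++ class definition.
// Assumes ROOT can generate a dictionary for these classes (it does so
// on the fly when this runs in ROOT's C++ interpreter).
#include <vector>
#include "TFile.h"
#include "TTree.h"

struct Track {          // one reconstructed particle track
    float pt, eta, phi; // transverse momentum and direction
};

struct Event {                 // one collision ("event")
    std::vector<Track> tracks; // nested collection: the track count
};                             // differs per event, so no fixed table

int main() {
    TFile file("events.root", "RECREATE"); // ROOT's own binary format
    TTree tree("events", "toy collisions");
    Event event;
    tree.Branch("event", &event); // streaming code comes from the class
                                  // definition, not hand-written
    for (int i = 0; i < 1000; ++i) {
        event.tracks.clear();
        event.tracks.push_back({42.f, 1.1f, 0.3f}); // fake data
        tree.Fill(); // serialize this event; only one is in memory
    }
    tree.Write();
    return 0;
}
```

Because the layout is derived from the class definition rather than hand-written, adding a member to Track later does not orphan the old files; that is the schema evolution point coming up next.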
Databases have the wrong model of data access, they don't scale, it's just not the kind of system that works for us. Also, using a file system as a storage back-end might sound really very traditional and boring, but it works amazingly well and seems to be future-proof as well, so that's just the way to go for us. There are many other structured data formats out there; many of those did not exist when we started ROOT, our own data format. But they also miss many things. For example, we wanted to make sure that we have schema evolution support: we can change the class layout and still read back all data. We don't want to throw away all data just because we're changing the class. Also, we do not trust people. That is, you know, as a computer scientist or whatever, you probably know what I'm talking about, right? If people have to write their own streaming algorithm, there will be bugs and we will lose data. We really don't want to do this, so we were trying to automate this, based on the class definition. So, last decision point for the story. Do you want to hear about cling, our C++ interpreter, or about Open Data and Applied Science? Let's start with option 1, the C++ interpreter applause Okay, and Open Data and Applied Science? more applause than before Yeah. I'm heading there. You missed a fish. You can look at the slides later. Okay, so there we go. Really? No. The slide number is wrong. Oh, a bug! So, Open Data and Applied Science. Okay, you really wanted to know about our budget, I understand that. So, we get from you about 1 billion a year, and the currency doesn't really matter anymore at this point of time. laughter And that is a lot of money. And you know, we try to do really wonderful things; I mean, we really enjoy our job, we love it. It's fantastic to work in such an environment. And thank you very much for making that possible. Really, I mean it. But it also means that you decided as a society to enable something like CERN. Which I think really deserves my applause, and yours probably as well. I think it's a great decision to do something like this. applause So we realize this, right? We realized that we can do what we do because of you, and we are trying to react to that by giving back what we do: software, research results, hardware and data. So, the way we share research results is through open access. We have it, finally. It took us a long time to fight with publishers and, you know, the establishment, but now we have it. We also, yes, thank you. applause We also put a lot of effort into communicating our results and what we are doing. And if you're in the region, it's definitely worth a visit. I mean, the URL is really easy to remember, it's visit.cern, and, you know, it works. And you should go there by April, actually, if you can, because then you can ask people how to get underground, because the accelerator is off at the moment. We also do applied research. For example, we have this super cool experiment where we try to study how clouds form, based on cosmic rays. So, the influence of cosmic rays on cloud formation. Which is a key element in the uncertainty of climate models. We are trying to think about, you know, how to make energy from nuclear waste. So, getting rid of nuclear waste while making energy from it. And we are trying to repurpose detectors that we, you know, develop.
We have something called open hardware, for example White Rabbit: deterministic Ethernet. We have Open Data, and we have LHC@home and some other programs where you can donate either compute power or your brain and help us get better results. We explicitly try to use open source as much as possible, and also feed back whenever we see issues. But we also create open source. For example, we create Geant, which is a program that allows you to simulate how particles fly through matter, used for example by NASA. We have Indico, which allows us to schedule meetings, upload slides, you know, these kinds of things. Across the globe, lots of people, with access protection, all these kinds of things. And it's open source. We have DaviX; did I mention we love HTTP? That's the NeXT machine of Tim Berners-Lee, and that's his futile effort at trying to prevent the cleaning personnel from switching it off; they don't speak English, they did not back then at least. So we use DaviX to transfer files over HTTP, with high bandwidth. Or we have CVMFS, which allows us to distribute our binaries across the globe, and not rely on admins downloading stuff and making sure it actually runs, these kinds of things. That is a lifesaver, it's really fantastic, it's a great tool. But nobody knows it. And we have ROOT, but that's coming up. So now, the last official part of this presentation: how do we do data analysis? Not like that. laughter applause We use C++, and actually physicists need to write their own analysis in C++. We have very few people who have an actual education in programming, so that's sort of a clash. As I said, we need to keep one collision in memory. And what matters to us is throughput: we want to analyze as many collisions as possible per second. What we can do is specialize our data format to match the analysis, because we don't want to waste I/O cycles if we can make use of the CPU better (there's a small sketch of this below). ROOT has allowed us to do this for twenty years. It's really the workhorse for the analysis in high energy physics. And it's also an interface to complex software. We have serialization facilities, we have the statistical tools that people need, and we have graphics, because once you have done your analysis you need to communicate that to your peers and convince people, and publish, and so on, so that's part of the game. All of that is open source, and, of course, all of that is not just used by high energy physics. So, to conclude: We are here because you make it possible. Thank you very much. It's fantastic to have you. applause We want to share, and we have great people for science outreach, but we have nobody for software outreach, basically. So maybe it's worth a look to see what CERN is producing software-wise. Scientific computing is nothing new, it has existed for a long time, but we had to start fairly early on a large scale. So when we were building it up, we were trying to take pieces that existed, and did not find much. So now we ended up with C++ data serialization, efficient computing for non-computer-scientists even... In the part that I skipped, you know, one of the alternate tracks, you would have seen that we have a Python binding as well for the whole software stack in C++. And for us, what matters most is scale.
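To make the throughput point concrete, here is a minimal sketch of reading the toy file from the earlier example back in ROOT (again my illustration, not an experiment's analysis code): the analysis loop keeps one event in memory and reads only what it needs.

```cpp
// Toy sketch of a throughput-oriented ROOT analysis loop, reading back
// the hypothetical events.root written in the earlier sketch.
#include <cstdio>
#include <vector>
#include "TFile.h"
#include "TTree.h"

struct Track { float pt, eta, phi; };          // same toy classes as in
struct Event { std::vector<Track> tracks; };   // the writing sketch

int main() {
    TFile file("events.root", "READ");
    TTree* tree = nullptr;
    file.GetObject("events", tree);

    Event* event = nullptr;                    // one event in memory at a time
    tree->SetBranchAddress("event", &event);

    // With ROOT's branch splitting one could go further and enable only
    // the members the analysis touches (e.g. just the tracks' pt via
    // SetBranchStatus), skipping the other bytes on disk entirely.
    double sumPt = 0;
    Long64_t nTracks = 0;
    const Long64_t nEvents = tree->GetEntries();
    for (Long64_t i = 0; i < nEvents; ++i) {
        tree->GetEntry(i);                     // deserialize just this event
        for (const Track& trk : event->tracks) {
            sumPt += trk.pt;
            ++nTracks;
        }
    }
    std::printf("mean track pt: %g GeV\n", sumPt / nTracks);
    return 0;
}
```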
Now we are seeing that we are not the only ones: there are many more natural sciences arriving at a similar challenge of having to analyze large amounts of data. Now, I promised you that I'll be bold and try to make a few statements about what will happen with data analysis, not just in science. Because what we see is that we actually educate the people who will do data analysis, not just in science. What we see is that in the past, data volume mattered most: more data meant more power. Now that's not the complete truth anymore. It's a lot about finding correlations. So even with the amount of data not growing anymore, because it's already humongous, we try to squeeze more knowledge out of it. And for that, I/O becomes important and CPU limitations become the crucial factor. We see that multivariate techniques are still rising, and they will just be part of the toolchain of statistical tools; except for the generative parts, which, I believe, will change the way we model. Now, based on what I just described, this is not a big surprise anymore: as we need throughput, we need to have a language for the core analysis part that is close to the metal, so something like C++. On the other hand, writing analyses is still complex, so you need a higher-level language, and for that people could, for example, use Python. So now language bindings become relevant all of a sudden; they will be much more important in the future. And we need to tailor I/O to the actual analysis to not waste CPU cycles. So throughput is king, and, in my point of view, also in the future we will see much more effort in increasing throughput. Okay, so that was it. In case you want to discuss anything with me, like "That's just wrong!", that's fine; I probably have several bugs in there. I'm still here until tomorrow. I don't know where yet, so I'll wander around, and you can contact me by email or Twitter. Thank you very much for your attention. Thank you. applause music subtitles created by c3subtitles.de in the year 2017. Join, and help us!