WEBVTT 00:00:00.000 --> 00:00:13.119 music 00:00:13.119 --> 00:00:17.190 Herald: Good morning and welcome back to stage one. It's kind of going to be the 00:00:17.190 --> 00:00:21.490 second talk about physics on this day already and it's about big data and 00:00:21.490 --> 00:00:27.150 science and big data became something like Uber in science. It's everywhere every 00:00:27.150 --> 00:00:33.370 discipline has it. Axel Naumann's working for CERN, the accelerator in Switzerland 00:00:33.370 --> 00:00:39.160 and he talks about how physics and computing bridge in this area and he works 00:00:39.160 --> 00:00:43.183 a lot with ROOT, a program that helps transform data into knowledge. A warm 00:00:43.183 --> 00:00:44.650 welcome. 00:00:44.650 --> 00:00:45.262 Axel Naumann: Thank you. 00:00:45.262 --> 00:00:51.260 applause 00:00:51.260 --> 00:00:57.850 AN: Thanks a lot. So, well you know, when, when I was discussing this abstract with 00:00:57.850 --> 00:01:00.950 the science track people they tell me: "Well, you know about three hundred people 00:01:00.950 --> 00:01:06.000 might be in the audience." But well, hey, you are huge that's much more than three 00:01:06.000 --> 00:01:10.940 hundred people. So thank you so much for inviting me over it's a real honor. And of 00:01:10.940 --> 00:01:15.310 course originally when talking to 300 people are all science interested I 00:01:15.310 --> 00:01:20.590 thought you know I pick something fairly narrow focuswise but then I learned I'm 00:01:20.590 --> 00:01:24.690 going to be in Saal one and that's different, so I decided to make the scope 00:01:24.690 --> 00:01:30.670 a little bit wider and that's what I ended up with. I'll talk a little bit about 00:01:30.670 --> 00:01:37.540 CERN in society as well if you so choose, you'll see what that means in a minute. 
So 00:01:37.540 --> 00:01:41.680 the things I'll cover here are, obviously, CERN, just a little bit of an introduction, 00:01:41.680 --> 00:01:46.100 how we do physics, how we do computing, what data means to us and I can tell you 00:01:46.100 --> 00:01:51.810 it means everything, you heard about that already, right? How we do data analysis in 00:01:51.810 --> 00:01:56.159 high energy physics and just because we've been doing it for a while and 00:01:56.159 --> 00:02:00.530 because I've been doing it for more than ten years, I'm one of the guys who's 00:02:00.530 --> 00:02:07.250 providing the software to do data analysis in high energy physics, so, you 00:02:07.250 --> 00:02:11.360 know, because we know what we are doing and we have some experience, I thought 00:02:11.360 --> 00:02:18.110 maybe you might be interested in hearing what my forecast is for data analysis in 00:02:18.110 --> 00:02:25.430 general, in the future. So let's start with CERN. And so if you wonder what CERN 00:02:25.430 --> 00:02:31.510 is, you've all heard about CERN, about the fantastic fonts we love to use, then 00:02:31.510 --> 00:02:36.960 you've probably also heard that we are doing science. We were founded right after 00:02:36.960 --> 00:02:41.450 the Second World War or soon after the Second World War, basically as a way to 00:02:41.450 --> 00:02:47.458 entertain those freaky scientists. You know that was the idea: peace Europe-wide. 00:02:47.458 --> 00:02:52.349 And damn, that's working out really well and so well there's not just Europe 00:02:52.349 --> 00:02:57.530 anymore these days. We are located near Geneva, we are doing only fundamental 00:02:57.530 --> 00:03:02.269 research, so we don't do any weapons, nuclear stuff you 00:03:02.269 --> 00:03:10.230 know, these kinds of things. The WWW was invented at CERN but that was just a, you 00:03:10.230 --> 00:03:14.586 know, side effect; it happens sometimes that we invent things. But usually we just do 00:03:14.586 --> 00:03:22.500 science.
So what we do is, we take money, lots of it, and brains who like to discuss 00:03:22.500 --> 00:03:27.210 and think and come up with ideas and from that we generate knowledge. It's really 00:03:27.210 --> 00:03:33.000 all about curiosity. One thing we try to answer is: what is mass? Which is a funny 00:03:33.000 --> 00:03:37.371 question, right? Like we all know what mass is but actually we don't. We know what 00:03:37.371 --> 00:03:42.360 mass is in the universe. We understand that masses attract one another: gravity. 00:03:42.360 --> 00:03:48.730 Which is beautifully correct. And at the small scale, for particles, we know that 00:03:48.730 --> 00:03:52.940 mass is energy and we can convert them. But we don't understand how these two 00:03:52.940 --> 00:03:58.319 things go together. Like there is no bridge, they contradict one another. So we 00:03:58.319 --> 00:04:04.930 are trying to understand what that bridge might be. Part of that mass thing is of 00:04:04.930 --> 00:04:08.650 course also: what's out there in the universe? That's a big question. We only 00:04:08.650 --> 00:04:14.230 understand a few percent of that. 90 and some percent are completely unknown to 00:04:14.230 --> 00:04:20.349 us, and that's scary, right? I mean we know gravity really well, we can deal with 00:04:20.349 --> 00:04:27.560 freaky things like black holes and yet we don't understand what's out there. Now to 00:04:27.560 --> 00:04:31.850 do all these things we are probing nature at the smallest scale as we call it, so 00:04:31.850 --> 00:04:36.190 that's particles, we are dealing with things like the Higgs particle and 00:04:36.190 --> 00:04:43.900 supersymmetry. Here's a little bit of a fact sheet. We have about 12,000 00:04:43.900 --> 00:04:47.500 physicists who are working with CERN. We are basically the workbench that you saw 00:04:47.500 --> 00:04:54.661 in Andre's talk before. We are the table that physicists use, okay?
And, so they 00:04:54.661 --> 00:04:59.050 come to CERN and once a while about 10,000 physicists a year, or they work 00:04:59.050 --> 00:05:02.810 remotely most of the time from about 120 nations. So you're seeing it's not 00:05:02.810 --> 00:05:10.650 European anymore, this is a global thing. CERN in itself has about 2,500 employees, 00:05:10.650 --> 00:05:15.490 you know those scrubbing the table, setting things up and so on. And our 00:05:15.490 --> 00:05:21.190 table is right here. In the far end we have the Alps, it's in Switzerland 00:05:21.190 --> 00:05:25.990 as I said, so the Alps are always close, with Mont Blanc, we have the 00:05:25.990 --> 00:05:31.639 Lake Geneva we have the Jura, the French Mountains on the lower end here, it's just 00:05:31.639 --> 00:05:37.410 beautiful. It's really nice, but we needed to stick a 30-kilometer ring in 00:05:37.410 --> 00:05:43.861 there somewhere and people would have hated us had we put it like this. But 00:05:43.861 --> 00:05:49.671 luckily people were smart back then in the 70s, and built a tunnel much better. So 00:05:49.671 --> 00:05:55.229 now we have this huge tunnel, and we send particles through in both directions near 00:05:55.229 --> 00:06:00.351 the speed of light and the tunnel is filled with magnets simply because if you 00:06:00.351 --> 00:06:08.110 don't use a magnet the particles will fly straight but we need them to turn around. 00:06:08.110 --> 00:06:13.560 Here you see what it's looking like, you also see these big halls there that have 00:06:13.560 --> 00:06:21.880 access shafts from the top and that's where the experiments are. That's sort of 00:06:21.880 --> 00:06:29.210 a sketch of one of the experiments. 
So the the LHC is one of the, no, is the biggest 00:06:29.210 --> 00:06:35.889 particle accelerator at the moment, it's a ring with 27 kilometers circumference, 100 00:06:35.889 --> 00:06:40.300 meters below Switzerland and France, it has four big experiments and several 00:06:40.300 --> 00:06:45.270 small ones and we are expected to run until 2030. So you see that all of that 00:06:45.270 --> 00:06:50.150 is large-scale simply because we're trying to make good use of the money we have. 00:06:50.150 --> 00:06:56.020 Here, you see one of these caverns that are used by the experiments while it was 00:06:56.020 --> 00:07:01.490 empty. The experiment was then lowered through this hole by the roof, piece by 00:07:01.490 --> 00:07:07.190 piece, and these things are humongous. To give you an impression of how big it is, I 00:07:07.190 --> 00:07:12.520 put Waldo in there, so your job for the next three slides is to find Waldo. You 00:07:12.520 --> 00:07:15.800 know, that gives you the scale. He's friendlily waving at you, so it should be 00:07:15.800 --> 00:07:21.990 easy to find him. So then we put a detector in there. Here it's pulled apart 00:07:21.990 --> 00:07:26.160 a little bit, so it looks nicer, you can actually see something. You can for 00:07:26.160 --> 00:07:31.039 example see the beam pipe, so that's where the particles are flying through, and then 00:07:31.039 --> 00:07:34.880 they're coming from both directions and colliding in the center of the detector 00:07:34.880 --> 00:07:38.490 and then things happen we try to understand what 00:07:38.490 --> 00:07:44.790 is happening. That's yet another view, frontal view on one of the detectors and 00:07:44.790 --> 00:07:51.060 now you have to imagine that, you know, you can't just open up Amazon and order an 00:07:51.060 --> 00:07:56.210 LHC experiment, right, that's not how it works. We do this stuff ourselves, like 00:07:56.210 --> 00:08:02.669 PhD students, postdocs, engineers. 
You know, that's all done by hand, just like 00:08:02.669 --> 00:08:06.940 the microscope you saw before. Of course you order the parts, but you know the 00:08:06.940 --> 00:08:11.060 design, the whole conception and actually screwing these things together, making 00:08:11.060 --> 00:08:16.970 sure that all fits, is all done by hand. And I find that just beautiful, I mean 00:08:16.970 --> 00:08:21.760 that's close to a miracle, right? That nations, like people no matter what 00:08:21.760 --> 00:08:26.819 nation, people across the globe work together to build such a huge thing and 00:08:26.819 --> 00:08:39.490 then you turn it on and it works. More or less, but you get it to work. That's not 00:08:39.490 --> 00:08:44.310 my applause, that's your applause, because you make this possible. Really, but it's, 00:08:44.310 --> 00:08:49.690 it's huge this is for me one of the things I love most about CERN: That is this 00:08:49.690 --> 00:08:55.279 international thing that just works smoothly. Now the detectors are like a 00:08:55.279 --> 00:09:01.310 massive camera. We have lots of pixels and we take many, many pictures a second. We 00:09:01.310 --> 00:09:06.680 do this to identify particles and then sort of estimate what has happened during 00:09:06.680 --> 00:09:15.470 the collision. Now, life at CERN is of course an important ingredient for 00:09:15.470 --> 00:09:19.529 scientists as well, and if you live at CERN then actually it's just work at CERN 00:09:19.529 --> 00:09:23.980 and that's what it's about. But it's not that bad, so we hang out together in our 00:09:23.980 --> 00:09:30.040 control rooms, make sure that the experiments work correctly. We also, you 00:09:30.040 --> 00:09:33.720 know, study the forces. laughter 00:09:33.720 --> 00:09:38.740 We have scientific discourse, in the sun, view on the Mont Blanc, with a good 00:09:38.740 --> 00:09:45.430 coffee. 
We have lectures and we are lectured and of course, as you, we have 00:09:45.430 --> 00:09:54.570 more laptops than people. And, then we do stuff and so this presentation is going to 00:09:54.570 --> 00:09:58.580 introduce you to some of the things we are doing, and more on the computing and the 00:09:58.580 --> 00:10:04.100 society side as I said. But because I have so much to talk to about I decided that 00:10:04.100 --> 00:10:08.810 you just build your own talk, you tell me what you want to hear. So let's do this, 00:10:08.810 --> 00:10:14.410 you can choose between A, physics, and B, model simulation and data. You remember 00:10:14.410 --> 00:10:18.620 these books like from the old days when we were all young? It's that kind of thing, 00:10:18.620 --> 00:10:24.450 ok? You decide/design your own talk here. So, by applause, do you want to hear about 00:10:24.450 --> 00:10:27.720 physics? applause 00:10:27.720 --> 00:10:35.730 Okay. Or the model simulation data part? louder applause 00:10:35.730 --> 00:10:45.101 Okay, there we go. So, this is what we skip. Model simulation data it is. You're 00:10:45.101 --> 00:10:49.700 a strange crowd, first time I meet people who don't want to hear about physics... no 00:10:49.700 --> 00:10:51.450 I'm kidding. laughter 00:10:51.450 --> 00:10:53.800 Audience: inaudible interjection laughter 00:10:53.800 --> 00:11:00.079 So model simulation data it is. So our theory is actually incredibly precise. 00:11:00.079 --> 00:11:04.450 It's so precise that our basic job is really really boring, because we already 00:11:04.450 --> 00:11:10.514 understand everything. Whenever there is a collision, we know what's going to happen. 00:11:10.514 --> 00:11:15.430 Except for these very rare things. So we are trying to find these very rare things 00:11:15.430 --> 00:11:19.580 out of this haystack of fairly boring things that we really understand well. 
And 00:11:19.580 --> 00:11:25.589 the weird things are, for example, monopoles, supersymmetry, or black holes. 00:11:25.589 --> 00:11:32.060 Now the theorists' job is to tell us what we should be seeing in the detector, given 00:11:32.060 --> 00:11:42.347 some fancy physics. Then we use simulation to see how our detector would respond to 00:11:42.347 --> 00:11:53.476 that. Now, of course the question is: We are just counting, basically, when we do 00:11:53.476 --> 00:11:58.102 experiments and the question is: How often do we need to see something to say: "Well, 00:11:58.102 --> 00:12:03.310 that's not just the ordinary. That is something new, that's something that could 00:12:03.310 --> 00:12:09.870 be explained by a weird theory." We use the detector simulation, as I said, to basically 00:12:09.870 --> 00:12:15.029 predict how often we expect to see things. We use reconstruction software, which 00:12:15.029 --> 00:12:20.680 tells us what has happened, or might have happened in the detector, to count how 00:12:20.680 --> 00:12:25.400 often we saw something. And then we use statistics to compare these two and to say 00:12:25.400 --> 00:12:31.610 whether something is expected or not. Now, that's fairly abstract but it's fairly 00:12:31.610 --> 00:12:36.905 common, a fairly common approach. For example, if you look at climate versus 00:12:36.905 --> 00:12:40.331 weather, right, I mean we always have temperature fluctuations because of 00:12:40.331 --> 00:12:46.480 weather, and the question is: Is that rise in temperature because of a weather effect 00:12:46.480 --> 00:12:50.375 or because of a climate effect? Is that large-scale or just a short-term 00:12:50.375 --> 00:12:55.610 fluctuation? So there, we have a very similar problem and here what you do is 00:12:55.610 --> 00:13:00.880 you measure temperatures, and you want to detect abnormal variations, and you can 00:13:00.880 --> 00:13:06.420 improve that by measuring longer, like, for 300 years instead of 20 years.
That 00:13:06.420 --> 00:13:11.930 gives you a better prediction of what you would expect in the future. Also, larger 00:13:11.930 --> 00:13:14.170 deviations help, right? If you look for something that 00:13:14.170 --> 00:13:19.700 is just 0.1 degree, then you might not be able to find it. If there is a deviation 00:13:19.700 --> 00:13:25.230 of 5 degrees, you will definitely find it. And for us it's very similar. So here we 00:13:25.230 --> 00:13:31.610 have a plot, one of the first Higgs discovery plots, and you can see that we 00:13:31.610 --> 00:13:38.800 have many ingredients there. So, the black dots are what we measure and they have 00:13:38.800 --> 00:13:43.829 a certain uncertainty, because when we measure, we count and we might have, you 00:13:43.829 --> 00:13:48.977 know, not seen something, or we might have seen more than we should have seen, so 00:13:48.977 --> 00:13:54.970 there's always an uncertainty. And then we also have theory, which tells us you 00:13:54.970 --> 00:14:00.079 should have seen so many and so for the red part that's something that we know 00:14:00.079 --> 00:14:04.889 exists, it's nothing spectacular. It's simply what theory is telling us we 00:14:04.889 --> 00:14:10.660 should be seeing. And you can see the data follows the red part fairly well. But then 00:14:10.660 --> 00:14:15.980 there is this other bump in our dots on the right-hand side or in the center and 00:14:15.980 --> 00:14:21.230 that does not make sense, unless you take the Higgs into account, right, which is 00:14:21.230 --> 00:14:26.889 the light blue part and so here you can see how this interplay between different 00:14:26.889 --> 00:14:38.280 sources of physics and statistics works for us. Now just as for the climate, more 00:14:38.280 --> 00:14:43.690 data helps.
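A minimal sketch of the counting argument described above, assuming plain Poisson counting and the rule-of-thumb significance s/sqrt(b). Neither the function names nor the numbers come from the talk, and the experiments' actual statistical procedure is not described there; this only illustrates why larger deviations and more data both help.

```python
import math


def poisson_p_value(observed: int, expected: float) -> float:
    """Chance of counting `observed` or more events when `expected` are predicted."""
    term = math.exp(-expected)  # P(N = 0)
    below = 0.0                 # accumulates P(N < observed)
    for k in range(observed):
        below += term
        term *= expected / (k + 1)  # P(N = k + 1) derived from P(N = k)
    return max(0.0, 1.0 - below)


def rough_significance(signal: float, background: float) -> float:
    """Rule of thumb: the excess divided by the typical background fluctuation."""
    return signal / math.sqrt(background)


# Illustrative numbers only: 100 background events expected, 130 counted.
print(poisson_p_value(130, 100.0))   # small probability of a pure fluctuation
print(rough_significance(30, 100))   # 3.0
# Four times the data scales signal and background by four.
print(rough_significance(120, 400))  # 6.0
```

With four times as many collisions the same relative excess is twice as significant, which is the quantitative version of "more data helps".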
And there are two versions of more data: Either by having more 00:14:43.690 --> 00:14:48.079 collisions, which is why we are running 24/7, or more data by combining different 00:14:48.079 --> 00:14:52.060 analyses which is what's happening here. So here you see all these different 00:14:52.060 --> 00:14:56.990 analyses. If you combine them, of course you get a much stronger prediction of, in 00:14:56.990 --> 00:15:03.300 this case, the Higgs mass, than if you just take any single one of them. You see 00:15:03.300 --> 00:15:08.540 how similar what we are doing is to, you know, any of the big data analyses out 00:15:08.540 --> 00:15:16.414 there. Okay, so that was that part. Now comes the obligatory part again, 00:15:16.414 --> 00:15:22.930 computing. When we were designing the LHC, not me, when people were designing the 00:15:22.930 --> 00:15:31.120 LHC, they needed to project computing power from 1990 to 2000, 2010 and so on. 00:15:31.120 --> 00:15:34.140 And then they said: "Well, we need a massive amount of computers" and for you 00:15:34.140 --> 00:15:38.420 there's now "Ughhh - everybody has it, we have it as well, we have our racks of 00:15:38.420 --> 00:15:44.240 computers". This is something that the big companies usually don't show: You know, 00:15:44.240 --> 00:15:48.509 there is actually a ramp where the trucks arrive and they offload the things and 00:15:48.509 --> 00:15:53.820 then someone needs to screw them together and then it looks shiny. This is how we are 00:15:53.820 --> 00:16:00.870 spending our CPU time: We have about 60,000 cores that are spinning all the 00:16:00.870 --> 00:16:06.680 time for us, and they are distributed around the world. You can see that CERN, 00:16:06.680 --> 00:16:14.529 for example, is the red part there near the bottom. Yeah, so we make good use of 00:16:14.529 --> 00:16:20.829 that.
We also monitor the efficiency, and because 100 percent efficient is for 00:16:20.829 --> 00:16:29.300 beginners we are actually about 700 percent efficient. Don't ask why. They 00:16:29.300 --> 00:16:33.920 decided if you are multi-threading, then we, you know, we multiply your efficiency 00:16:33.920 --> 00:16:39.950 by the number of threads you have. Makes no sense to me. We also have storage, 00:16:39.950 --> 00:16:44.930 currently we use about 0.7 exabytes. We also have available at one point seven 00:16:44.930 --> 00:16:49.130 exabytes, so that's good, we make use of the storage we have. Where it's, you know, 00:16:49.130 --> 00:16:55.529 tera- peta- exa-, so it's a lot, and here you can see on the right hand side you 00:16:55.529 --> 00:16:59.610 see, for example, the tape usage on the bottom and you see this dip that was 00:16:59.610 --> 00:17:04.270 before we were starting the accelerator again, we needed to make some space so we 00:17:04.270 --> 00:17:09.089 monitor our hard disk usage all the time. Hey, here comes the next decision point: 00:17:09.089 --> 00:17:13.630 So, do you want to hear about, 1, distributed computing or 2, measure 00:17:13.630 --> 00:17:17.839 effects of bugs. So, 1, distributed computing 00:17:17.839 --> 00:17:26.470 applause and 2, measure the effects of bugs 00:17:26.470 --> 00:17:35.560 similar amount of applause Okay, so that's my call, and I would say 00:17:35.560 --> 00:17:41.455 we do we do... Measure the effects of bugs, because it's shorter. 00:17:41.455 --> 00:17:47.130 laughter So this is one of the views you can, you 00:17:47.130 --> 00:17:50.740 know, electronic views you can get from a detector and you see how we trace the 00:17:50.740 --> 00:17:55.380 particles that fly through the detector. Now, that software right, that's the 00:17:55.380 --> 00:17:59.927 result of software, and you might not believe it, if you have bugs in there, in 00:17:59.927 --> 00:18:00.808 that software. 
00:18:02.849 --> 00:18:07.260 And you know, these bugs are sometimes wrong coordinate transformations, so 00:18:07.260 --> 00:18:12.590 things don't go this way but that way, it's kind of weird if you look at it, and 00:18:12.590 --> 00:18:17.470 the result is that our particles don't go through the path that they should have 00:18:17.470 --> 00:18:25.190 been going, but we are attributing a different path to them. Now, the nice thing 00:18:25.190 --> 00:18:30.960 is that we are doing this a million times, right? So all of that is smeared. We are 00:18:30.960 --> 00:18:35.730 not systematically doing this wrong; it's just, we are always doing it a little bit 00:18:35.730 --> 00:18:41.669 wrong. And so the net result is that if we measure our particles, we will not measure 00:18:41.669 --> 00:18:46.861 the right thing but always a little bit wobbly left, wobbly right, you know? Things 00:18:46.861 --> 00:18:53.809 are not as precise. That's simply an uncertainty. So for us, just like counting 00:18:53.809 --> 00:18:59.059 has an uncertainty and predictions have an uncertainty, software bugs introduce 00:18:59.059 --> 00:19:05.559 another source of uncertainty. And here you can see how we are tracking 00:19:05.559 --> 00:19:09.370 uncertainties for all of our analyses. We are trying to understand the 00:19:09.370 --> 00:19:16.220 different sources of uncertainty. And again, bugs are only one of the sources 00:19:16.220 --> 00:19:22.880 here, so if we find a bug then we reduce our uncertainty and we can find new 00:19:22.880 --> 00:19:27.760 physics earlier, instead of having to wait and collect more data. So for us 00:19:27.760 --> 00:19:32.210 finding bugs is really key, we really love finding bugs because it brings 00:19:32.210 --> 00:19:36.710 physics closer. I thought that was interesting. It's kind of rare that you're 00:19:36.710 --> 00:19:42.140 in an environment where you're able to measure the effect of bugs.
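A minimal sketch of how a bug can be treated as one more source of uncertainty, assuming the sources are independent and therefore combine in quadrature. That is a common convention, but the talk does not say how the experiments combine them. The source names and the numbers are made up for illustration.

```python
import math
from typing import Dict


def total_uncertainty(sources: Dict[str, float]) -> float:
    """Combine independent uncertainties in quadrature (assumed convention)."""
    return math.sqrt(sum(value * value for value in sources.values()))


# Hypothetical uncertainty budget, in arbitrary units.
budget = {"counting": 3.0, "prediction": 4.0, "software_bug": 12.0}
print(total_uncertainty(budget))        # 13.0

# Finding and fixing the bug removes its contribution entirely.
after_fix = {name: value for name, value in budget.items() if name != "software_bug"}
print(total_uncertainty(after_fix))     # 5.0
```

In this toy budget, fixing the bug shrinks the total from 13 to 5 without collecting a single extra collision, which is the sense in which finding bugs brings physics closer.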
Okay, so now 00:19:42.140 --> 00:19:47.870 we'll be talking about data. I told you that we are 00:19:47.870 --> 00:19:52.690 trying to find particle traces in our data and the way we do this is by using 00:19:52.690 --> 00:19:56.700 reconstruction programs, and these are multiple gigabytes of binaries in shared 00:19:56.700 --> 00:20:01.799 libraries and stuff. They're huge, they're experiment specific and they are curated 00:20:01.799 --> 00:20:06.270 by the experiments, open-source for some of them, and we want them to be correct 00:20:06.270 --> 00:20:14.140 and efficient. The data format we use is not comma separated values, it's binary 00:20:14.140 --> 00:20:21.080 and for some strange reason it's our own custom binary format. The reason is that 00:20:21.080 --> 00:20:26.990 it's really targeted at the kind of data we have. We have collisions 00:20:26.990 --> 00:20:32.230 that are independent, so we only need one in memory at any time and we have nested 00:20:32.230 --> 00:20:38.590 collections which makes the regular table layout a non-starter. We actually generate 00:20:38.590 --> 00:20:44.430 them from C++ objects so from classes, class definitions, C++ class definitions 00:20:44.430 --> 00:20:51.320 and we can read them back into C++ but also into JavaScript or Scala. Databases 00:20:51.320 --> 00:20:56.840 just didn't do it for us. They have the wrong model of data access, they don't 00:20:56.840 --> 00:21:02.940 scale, it's just not the kind of system that works for us. Also using a file 00:21:02.940 --> 00:21:09.390 system as a storage back-end might sound really very traditional and boring but it 00:21:09.390 --> 00:21:13.890 works amazingly well and seems to be future proof as well, so that's just the 00:21:13.890 --> 00:21:20.360 way to go for us. There are many other structured data formats out there, many of 00:21:20.360 --> 00:21:26.000 those did not exist when we started ROOT, our own data format.
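A minimal sketch of the two properties mentioned above: independent collisions that can be read back one at a time, and nested collections that do not fit a fixed-width table. The record layout, the `Event` and `Particle` classes, and the helper names are hypothetical; this is not ROOT's file format, only an illustration of the idea.

```python
import io
import struct
from dataclasses import dataclass, field
from typing import BinaryIO, Iterator, List


@dataclass
class Particle:
    px: float
    py: float
    pz: float


@dataclass
class Event:
    event_id: int
    # Variable-length nested collection: the part a flat table handles badly.
    particles: List[Particle] = field(default_factory=list)


HEADER = struct.Struct("<QI")    # event id, number of particles
PARTICLE = struct.Struct("<3d")  # three doubles per particle


def write_event(stream: BinaryIO, event: Event) -> None:
    """Append one event as a length-prefixed binary record."""
    stream.write(HEADER.pack(event.event_id, len(event.particles)))
    for p in event.particles:
        stream.write(PARTICLE.pack(p.px, p.py, p.pz))


def read_events(stream: BinaryIO) -> Iterator[Event]:
    """Yield events one by one, so only a single event is in memory at a time."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # end of stream
        event_id, count = HEADER.unpack(header)
        particles = [
            Particle(*PARTICLE.unpack(stream.read(PARTICLE.size)))
            for _ in range(count)
        ]
        yield Event(event_id, particles)


buffer = io.BytesIO()
write_event(buffer, Event(1, [Particle(0.5, -1.0, 2.0), Particle(1.5, 0.0, -3.0)]))
write_event(buffer, Event(2, []))  # an event with no particles is still valid
buffer.seek(0)
for event in read_events(buffer):
    print(event.event_id, len(event.particles))
```

The reader is a generator on purpose: each event is decoded, handed to the caller, and can be discarded before the next one is read.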
But they also miss 00:21:26.000 --> 00:21:30.250 many things. For example, we wanted to make sure that we have schema evolution 00:21:30.250 --> 00:21:33.970 support. We can change the class layout and still read back all data. We don't 00:21:33.970 --> 00:21:38.750 want to throw away all data just because we're changing the class. Also we do not 00:21:38.750 --> 00:21:43.370 trust people. That is a, you know, as a computer scientist or whatever you 00:21:43.370 --> 00:21:46.750 probably know what I'm talking about, right? If people have to write their own 00:21:46.750 --> 00:21:50.630 streaming algorithm, there will be bugs and we will lose data. 00:21:50.630 --> 00:21:54.610 We really don't want to do this, so we were trying to automate this, based on the 00:21:54.610 --> 00:22:03.070 class definition. So, last decision point for the story. Do you want to hear about 00:22:03.070 --> 00:22:10.409 Cling, our C++ interpreter, or about Open Data and Applied Science? Let's start with 00:22:10.409 --> 00:22:14.860 option 1, the C++ interpreter. applause 00:22:14.860 --> 00:22:21.106 Okay, and Open Data and Applied Science? 00:22:21.106 --> 00:22:29.679 more applause than before Yeah. I'm heading there. You miss a fish. 00:22:29.679 --> 00:22:35.299 You can look at the slides later. Okay, so there we go. Really? No. The slide number 00:22:35.299 --> 00:22:41.140 is wrong. Oh a bug! So, Open Data and Applied Science. Okay, you really wanted 00:22:41.140 --> 00:22:47.700 to know about our budget, I understand that. So we get from you about 1 billion 00:22:47.700 --> 00:22:50.719 a year and the currency doesn't really matter anymore at this point in 00:22:50.719 --> 00:22:54.200 time. laughter 00:22:54.200 --> 00:23:01.230 And that is a lot of money. And you know? We try to do really wonderful things, I 00:23:01.230 --> 00:23:04.943 mean we really enjoy our job, we love it. It's fantastic to work in such an 00:23:04.943 --> 00:23:09.248 environment.
And thank you very much for making that possible. Really, I mean it. 00:23:11.110 --> 00:23:16.691 But it also means that you decided as society to enable something like CERN. 00:23:17.473 --> 00:23:22.140 Which I think really deserves my applause and yours probably as well. I think it's a 00:23:22.140 --> 00:23:24.425 great decision to do something like this. 00:23:24.425 --> 00:23:30.211 applause 00:23:31.325 --> 00:23:35.690 So we realize this, right? We realized that we are basically, that we can do what 00:23:35.690 --> 00:23:40.210 we do because of you, and we are trying to react to that by giving back what we do. 00:23:40.210 --> 00:23:47.460 Software, research results, hardware and data. So the way we share research results 00:23:47.460 --> 00:23:52.600 is through open access. We have it, finally. It took us a long time to fight 00:23:52.600 --> 00:23:57.570 with publishers and, you know, the establishment, but now we have it. We 00:23:57.570 --> 00:23:59.220 also, yes thank you. 00:23:59.220 --> 00:24:03.395 applause 00:24:03.395 --> 00:24:07.520 We also put a lot of effort in communicating our results and what we are 00:24:07.520 --> 00:24:12.680 doing. And if you're in the region, it's definitely worth a visit. I mean the URL 00:24:12.680 --> 00:24:17.590 is really easy to remember, it's visit.cern, and you know, works. And you 00:24:17.590 --> 00:24:22.270 should go there by April, actually, if you can because then you can ask people how to 00:24:22.270 --> 00:24:27.580 get on the ground, because the accelerator is off at the moment. We also do applied 00:24:27.580 --> 00:24:32.320 research, for example we have this super cool experiment where we try to study how 00:24:32.320 --> 00:24:39.630 clouds form, based on cosmic rays. So the the influence of cosmic rays and cloud 00:24:39.630 --> 00:24:45.770 formation. Which is a key element in the uncertainty of climate models. 
We are 00:24:45.770 --> 00:24:50.440 trying to think about, you know, how to make energy from nuclear waste. So 00:24:50.440 --> 00:24:54.830 getting rid of nuclear waste while making energy from it. And we are trying to 00:24:54.830 --> 00:25:02.070 repurpose detectors that we have and, you know, develop. We have something called 00:25:02.070 --> 00:25:08.330 open hardware, for example White Rabbit: deterministic Ethernet, we have Open Data, 00:25:08.330 --> 00:25:12.789 and we have LHC@home and some other programs, where either you can donate 00:25:12.789 --> 00:25:21.250 compute power or your brain and help us get better results. We explicitly try to 00:25:21.250 --> 00:25:25.747 use open source as much as possible, and also feed back, whenever we see issues. 00:25:27.700 --> 00:25:33.620 But we also create open source. For example, we create Geant, which is a 00:25:33.620 --> 00:25:37.831 program that allows you to simulate how particles fly through matter, for 00:25:37.831 --> 00:25:44.610 example used by NASA. We have Indico, which allows us to schedule meetings, 00:25:44.610 --> 00:25:48.940 upload slides, you know, these kinds of things. Across the globe, lots of people, 00:25:48.940 --> 00:25:52.970 with access protection, all these kinds of things. And it's open source. We have 00:25:52.970 --> 00:25:58.919 DaviX, did I mention we love HTTP? That's the NeXT machine of Tim Berners-Lee. And 00:25:58.919 --> 00:26:03.140 that's his futile effort in trying to prevent the cleaning personnel from 00:26:03.140 --> 00:26:07.530 switching it off. They don't speak English, they did not back then at least. 00:26:09.337 --> 00:26:15.500 So we use DaviX to transfer files over HTTP, with a high bandwidth.
Or we 00:26:15.500 --> 00:26:21.241 have CernVM-FS, which allows us to distribute our binaries across the globe, and not 00:26:21.241 --> 00:26:26.570 rely on admins downloading stuff and making sure it actually runs, and these 00:26:26.570 --> 00:26:31.581 kinds of things. That is a lifesaver, it's really fantastic, it's a great tool. But 00:26:31.581 --> 00:26:37.730 nobody knows it. And we have ROOT, but that's coming up. So now, the last 00:26:37.730 --> 00:26:42.534 official part of this presentation, how do we do data analysis? 00:26:42.534 --> 00:26:44.950 Not like that. laughter 00:26:44.950 --> 00:26:52.210 applause We use C++ and actually physicists 00:26:52.210 --> 00:26:58.140 need to write their own analysis in C++. We have very few people who have an actual 00:26:58.140 --> 00:27:03.876 education in programming. So that's sort of a clash. As I said, we only need to keep one 00:27:04.607 --> 00:27:08.460 collision in memory. And, you know, what matters to us is throughput. We 00:27:08.460 --> 00:27:13.340 want to analyze as many collisions as possible per second. What we 00:27:13.340 --> 00:27:17.390 can do is specialize our data format to match the analysis, because we don't want 00:27:17.390 --> 00:27:23.419 to waste I/O cycles, if we can, you know, if we can make use of the CPU better. ROOT 00:27:23.419 --> 00:27:29.110 has allowed us to do this for twenty years. It's really the workhorse for the analysis 00:27:29.110 --> 00:27:35.200 in high energy physics. And it's also an interface to complex software. We have 00:27:35.200 --> 00:27:40.950 serialization facilities, we have the statistical tools that people need, and 00:27:40.950 --> 00:27:44.480 we have graphics, because once you have done your analysis you need to communicate 00:27:44.480 --> 00:27:48.500 that to your peers and convince people, and publish, and so on, so that's part of 00:27:48.500 --> 00:27:54.169 the game.
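A minimal sketch of what "specialize the data format to match the analysis" could mean, assuming a column-wise layout in which each quantity is stored contiguously, so that an analysis touching one quantity does not have to read the others. The talk does not describe the actual layout; the column names and helper functions here are hypothetical.

```python
from array import array
from typing import Dict, Iterable, Sequence, Tuple

# Hypothetical per-collision quantities; the names are illustrative only.
COLUMNS: Tuple[str, ...] = ("energy", "px", "py", "pz")


def to_columns(rows: Sequence[Tuple[float, ...]]) -> Dict[str, array]:
    """Turn row-wise records into one contiguous array of doubles per quantity."""
    return {
        name: array("d", (row[index] for row in rows))
        for index, name in enumerate(COLUMNS)
    }


def bytes_needed(columns: Dict[str, array], needed: Iterable[str]) -> int:
    """Bytes an analysis has to load if it only touches the `needed` quantities."""
    return sum(columns[name].itemsize * len(columns[name]) for name in needed)


rows = [(10.0, 1.0, 2.0, 3.0), (20.0, 4.0, 5.0, 6.0), (30.0, 7.0, 8.0, 9.0)]
columns = to_columns(rows)

all_bytes = bytes_needed(columns, COLUMNS)
energy_bytes = bytes_needed(columns, ["energy"])
mean_energy = sum(columns["energy"]) / len(columns["energy"])

print(all_bytes, energy_bytes)  # the energy-only analysis reads a quarter of the data
print(mean_energy)              # 20.0
```

Reading a quarter of the bytes for the same answer is the kind of saving that turns directly into more collisions analyzed per second.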
All of that is open source, and, of course, all of that is not just used by 00:27:54.169 --> 00:28:03.370 high energy physics. So, to conclude: We are here because you make it possible. 00:28:03.370 --> 00:28:05.223 Thank you very much. It's fantastic to have you. 00:28:05.223 --> 00:28:10.860 applause We want to share and we have great people 00:28:10.860 --> 00:28:17.080 for science outreach, but we have nobody for software outreach, basically. So maybe 00:28:17.080 --> 00:28:24.570 it's worth a look to see what CERN is producing software-wise. Scientific 00:28:24.570 --> 00:28:29.940 computing is nothing new, it has existed for a long time, but we had to start fairly 00:28:29.940 --> 00:28:35.490 early on a large scale. So when we were building it up, we had to take... we were 00:28:35.490 --> 00:28:39.960 trying to take pieces that existed and did not find much. So now we ended up 00:28:39.960 --> 00:28:45.179 with C++ data serialization, efficient computing for non-computer scientists 00:28:45.179 --> 00:28:49.660 even... In the part that I skipped and, you know, one of the alternate tracks, you 00:28:49.660 --> 00:28:54.289 would have seen that we have a Python binding as well for the whole software 00:28:54.289 --> 00:28:59.970 stack in C++. And for us, what matters most is scale. Now we are seeing that we 00:28:59.970 --> 00:29:04.309 are not the only ones. There are many more natural sciences arriving at a similar 00:29:04.309 --> 00:29:09.120 challenge of having to analyze large amounts of data. Now I promised you 00:29:09.120 --> 00:29:12.480 that I'll be bold and I'll try to make a few statements of what will happen with 00:29:12.480 --> 00:29:16.750 data analysis, not just in science. Because what we see is that we actually 00:29:16.750 --> 00:29:22.610 educate the people who will do data analysis, not just in science. What we see 00:29:22.610 --> 00:29:30.990 is that in the past, data volume mattered most. So more data meant more power.
Now 00:29:30.990 --> 00:29:35.929 that's not the complete truth anymore. It's a lot about finding correlations. So 00:29:35.929 --> 00:29:40.880 even with the amount of data not growing anymore, because it's already humongous, 00:29:40.880 --> 00:29:46.320 we try to squeeze more knowledge out of it. And for that, I/O becomes important 00:29:46.320 --> 00:29:53.900 and CPU limitations are the crucial factor. We see that multivariate techniques are 00:29:53.900 --> 00:29:59.029 still rising and they will just be part of the toolchain of the statistical tools; 00:29:59.852 --> 00:30:06.681 except for generative parts, which, I believe, will change the way we model. 00:30:10.232 --> 00:30:16.361 Now, based on what I just described, this is not a big surprise anymore. As we need 00:30:16.361 --> 00:30:21.210 throughput, we need to have a language for the core analysis part that is close to 00:30:21.210 --> 00:30:26.970 the metal, so something like C++. On the other hand, writing analyses is 00:30:26.970 --> 00:30:31.791 still complex, so you need a higher-level language and for that people could, for 00:30:31.791 --> 00:30:35.929 example, use Python. So, now language binding becomes relevant all of a sudden. 00:30:35.929 --> 00:30:42.010 It will be much more important in the future. And we need to tailor I/O to the actual 00:30:42.010 --> 00:30:48.910 analysis to not waste CPU cycles. So throughput is king and, from my point of 00:30:48.910 --> 00:30:54.331 view, also in the future we will see much more effort in increasing the throughput. 00:30:55.600 --> 00:31:03.115 Okay, so that was it. In case you want to discuss anything with me, like "That's 00:31:03.115 --> 00:31:07.970 just wrong!", that's fine. I probably have several bugs in there. I'm still here 00:31:07.970 --> 00:31:12.909 until tomorrow. I don't know where yet, so I'll wander around and you can contact 00:31:12.909 --> 00:31:16.818 me by email or Twitter. Thank you very much for your attention. Thank you.
00:31:16.818 --> 00:31:20.525 applause 00:31:20.525 --> 00:31:27.990 music 00:31:27.990 --> 00:31:45.000 subtitles created by c3subtitles.de in the year 2017. Join, and help us!