0:00:00.000,0:00:13.119 music 0:00:13.119,0:00:17.190 Herald: Good morning and welcome back to[br]stage one. It's kind of going to be the 0:00:17.190,0:00:21.490 second talk about physics on this day[br]already and it's about big data and 0:00:21.490,0:00:27.150 science and big data became something like[br]Uber in science. It's everywhere every 0:00:27.150,0:00:33.370 discipline has it. Axel Naumann's working[br]for CERN, the accelerator in Switzerland 0:00:33.370,0:00:39.160 and he talks about how physics and[br]computing bridge in this area and he works 0:00:39.160,0:00:43.183 a lot with ROOT, a program that helps[br]transform data into knowledge. A warm 0:00:43.183,0:00:44.650 welcome. 0:00:44.650,0:00:45.262 Axel Naumann: Thank you. 0:00:45.262,0:00:51.260 applause 0:00:51.260,0:00:57.850 AN: Thanks a lot. So, well you know, when,[br]when I was discussing this abstract with 0:00:57.850,0:01:00.950 the science track people they tell me:[br]"Well, you know about three hundred people 0:01:00.950,0:01:06.000 might be in the audience." But well, hey,[br]you are huge that's much more than three 0:01:06.000,0:01:10.940 hundred people. So thank you so much for[br]inviting me over it's a real honor. And of 0:01:10.940,0:01:15.310 course originally when talking to 300[br]people are all science interested I 0:01:15.310,0:01:20.590 thought you know I pick something fairly[br]narrow focuswise but then I learned I'm 0:01:20.590,0:01:24.690 going to be in Saal one and that's[br]different, so I decided to make the scope 0:01:24.690,0:01:30.670 a little bit wider and that's what I ended[br]up with. I'll talk a little bit about 0:01:30.670,0:01:37.540 CERN in society as well if you so choose,[br]you'll see what that means in a minute. 
So 0:01:37.540,0:01:41.680 the things I'll cover here is obviously[br]CERN just a little bit of an introduction 0:01:41.680,0:01:46.100 how we do physics, how we do computing,[br]what data means to us and I can tell you 0:01:46.100,0:01:51.810 it means everything, you heard about that[br]already, right? How we do data analysis in 0:01:51.810,0:01:56.159 high energy physics and just because[br]we've been doing it for a while and 0:01:56.159,0:02:00.530 because I've been doing it for more than[br]ten years, I'm one of the guys who's 0:02:00.530,0:02:07.250 providing the software to do data[br]analysis in high energy physics, so, you 0:02:07.250,0:02:11.360 know, because we know what we are doing[br]and we have some experience, I thought 0:02:11.360,0:02:18.110 maybe you might be interested in hearing[br]what my forecast is for data analysis in 0:02:18.110,0:02:25.430 general, in the future. So let's start[br]with CERN. And so if you wonder what CERN 0:02:25.430,0:02:31.510 is, you've all heard about CERN, about[br]the fantastic fonts we love to use, then 0:02:31.510,0:02:36.960 you've probably also heard that we are[br]doing science. We were founded right after 0:02:36.960,0:02:41.450 the Second World War or soon after the[br]Second World War, basically as a way to 0:02:41.450,0:02:47.458 entertain those freaky scientists. You[br]know that was the idea: peace Europe-wide. 0:02:47.458,0:02:52.349 And damn, that's working out really well[br]and so well there's not just Europe 0:02:52.349,0:02:57.530 anymore these days. We are located near[br]Geneva, we are doing only fundamental 0:02:57.530,0:03:02.269 research, so we don't do any weapons,[br]nuclear stuff you 0:03:02.269,0:03:10.230 know, these kind of things. The WWW was[br]invented at CERN but that was just a, you 0:03:10.230,0:03:14.586 know, side effect. Happens sometimes, that[br]we invent things. But usually we just do 0:03:14.586,0:03:22.500 science. 
So what we do is, we take money,[br]lots of it, and brains who like to discuss 0:03:22.500,0:03:27.210 and think and come up with ideas and from[br]that we generate knowledge. It's really 0:03:27.210,0:03:33.000 all about curiosity. The things we try to[br]answer is what is mass? Which is a funny 0:03:33.000,0:03:37.371 question, right? Like we all know what mass[br]is but actually we don't. We know what 0:03:37.371,0:03:42.360 mass is in the universe. We understand[br]that masses attract one another: gravity. 0:03:42.360,0:03:48.730 Which is beautifully correct. And in the[br]small scale, our particles, we know that 0:03:48.730,0:03:52.940 mass is energy and we can convert them.[br]But we don't understand how these two 0:03:52.940,0:03:58.319 things go together. Like there is no[br]bridge, they contradict one another. So we 0:03:58.319,0:04:04.930 are trying to understand what that bridge[br]might be. Part of that mass thing is of 0:04:04.930,0:04:08.650 course also what's out there in the[br]universe? That's a big question. We only 0:04:08.650,0:04:14.230 understand a few percent of that. 90 and[br]some percent are completely unknown to 0:04:14.230,0:04:20.349 us, and that's scary, right? I mean we know[br]gravity really well, we can deal with 0:04:20.349,0:04:27.560 freaky things like black holes and yet we[br]don't understand what's out there. Now to 0:04:27.560,0:04:31.850 do all these things we are probing nature[br]at the smallest scale as we call it, so 0:04:31.850,0:04:36.190 that's particles, we are dealing with[br]things like the Higgs particle and 0:04:36.190,0:04:43.900 supersymmetry. Here's a little bit of a[br]fact sheet. We have about 12,000 0:04:43.900,0:04:47.500 physicists who are working with CERN. We[br]are basically the workbench that you saw 0:04:47.500,0:04:54.661 in Andre's talk before. We are the table[br]that physicists use, okay? 
And, so they 0:04:54.661,0:04:59.050 come to CERN and once a while about[br]10,000 physicists a year, or they work 0:04:59.050,0:05:02.810 remotely most of the time from about 120[br]nations. So you're seeing it's not 0:05:02.810,0:05:10.650 European anymore, this is a global thing.[br]CERN in itself has about 2,500 employees, 0:05:10.650,0:05:15.490 you know those scrubbing the table,[br]setting things up and so on. And our 0:05:15.490,0:05:21.190 table is right here. In the far end we[br]have the Alps, it's in Switzerland 0:05:21.190,0:05:25.990 as I said, so the Alps are[br]always close, with Mont Blanc, we have the 0:05:25.990,0:05:31.639 Lake Geneva we have the Jura, the French[br]Mountains on the lower end here, it's just 0:05:31.639,0:05:37.410 beautiful. It's really nice, but we[br]needed to stick a 30-kilometer ring in 0:05:37.410,0:05:43.861 there somewhere and people would have[br]hated us had we put it like this. But 0:05:43.861,0:05:49.671 luckily people were smart back then in the[br]70s, and built a tunnel much better. So 0:05:49.671,0:05:55.229 now we have this huge tunnel, and we send[br]particles through in both directions near 0:05:55.229,0:06:00.351 the speed of light and the tunnel is[br]filled with magnets simply because if you 0:06:00.351,0:06:08.110 don't use a magnet the particles will fly[br]straight but we need them to turn around. 0:06:08.110,0:06:13.560 Here you see what it's looking like, you[br]also see these big halls there that have 0:06:13.560,0:06:21.880 access shafts from the top and that's[br]where the experiments are. That's sort of 0:06:21.880,0:06:29.210 a sketch of one of the experiments. 
So the[br]the LHC is one of the, no, is the biggest 0:06:29.210,0:06:35.889 particle accelerator at the moment, it's a[br]ring with 27 kilometers circumference, 100 0:06:35.889,0:06:40.300 meters below Switzerland and France, it[br]has four big experiments and several 0:06:40.300,0:06:45.270 small ones and we are expected to run[br]until 2030. So you see that all of that 0:06:45.270,0:06:50.150 is large-scale simply because we're trying[br]to make good use of the money we have. 0:06:50.150,0:06:56.020 Here, you see one of these caverns that[br]are used by the experiments while it was 0:06:56.020,0:07:01.490 empty. The experiment was then lowered[br]through this hole by the roof, piece by 0:07:01.490,0:07:07.190 piece, and these things are humongous. To[br]give you an impression of how big it is, I 0:07:07.190,0:07:12.520 put Waldo in there, so your job for the[br]next three slides is to find Waldo. You 0:07:12.520,0:07:15.800 know, that gives you the scale. He's[br]friendlily waving at you, so it should be 0:07:15.800,0:07:21.990 easy to find him. So then we put a[br]detector in there. Here it's pulled apart 0:07:21.990,0:07:26.160 a little bit, so it looks nicer, you can[br]actually see something. You can for 0:07:26.160,0:07:31.039 example see the beam pipe, so that's where[br]the particles are flying through, and then 0:07:31.039,0:07:34.880 they're coming from both directions and[br]colliding in the center of the detector 0:07:34.880,0:07:38.490 and then things happen we try to[br]understand what 0:07:38.490,0:07:44.790 is happening. That's yet another view,[br]frontal view on one of the detectors and 0:07:44.790,0:07:51.060 now you have to imagine that, you know,[br]you can't just open up Amazon and order an 0:07:51.060,0:07:56.210 LHC experiment, right, that's not how it[br]works. We do this stuff ourselves, like 0:07:56.210,0:08:02.669 PhD students, postdocs, engineers. 
You[br]know, that's all done by hand, just like 0:08:02.669,0:08:06.940 the microscope you saw before. Of course[br]you order the parts, but you know the 0:08:06.940,0:08:11.060 design, the whole conception and actually[br]screwing these things together, making 0:08:11.060,0:08:16.970 sure that all fits, is all done by hand.[br]And I find that just beautiful, I mean 0:08:16.970,0:08:21.760 that's close to a miracle, right? That[br]nations, like people no matter what 0:08:21.760,0:08:26.819 nation, people across the globe work[br]together to build such a huge thing and 0:08:26.819,0:08:39.490 then you turn it on and it works. More or[br]less, but you get it to work. That's not 0:08:39.490,0:08:44.310 my applause, that's your applause, because[br]you make this possible. Really, but it's, 0:08:44.310,0:08:49.690 it's huge this is for me one of the things[br]I love most about CERN: That is this 0:08:49.690,0:08:55.279 international thing that just works[br]smoothly. Now the detectors are like a 0:08:55.279,0:09:01.310 massive camera. We have lots of pixels and[br]we take many, many pictures a second. We 0:09:01.310,0:09:06.680 do this to identify particles and then[br]sort of estimate what has happened during 0:09:06.680,0:09:15.470 the collision. Now, life at CERN is of[br]course an important ingredient for 0:09:15.470,0:09:19.529 scientists as well, and if you live at[br]CERN then actually it's just work at CERN 0:09:19.529,0:09:23.980 and that's what it's about. But it's not[br]that bad, so we hang out together in our 0:09:23.980,0:09:30.040 control rooms, make sure that the[br]experiments work correctly. We also, you 0:09:30.040,0:09:33.720 know, study the forces.[br]laughter 0:09:33.720,0:09:38.740 We have scientific discourse, in the sun,[br]view on the Mont Blanc, with a good 0:09:38.740,0:09:45.430 coffee. We have lectures and we are[br]lectured and of course, as you, we have 0:09:45.430,0:09:54.570 more laptops than people. 
And, then we do[br]stuff and so this presentation is going to 0:09:54.570,0:09:58.580 introduce you to some of the things we are[br]doing, and more on the computing and the 0:09:58.580,0:10:04.100 society side as I said. But because I have[br]so much to talk to about I decided that 0:10:04.100,0:10:08.810 you just build your own talk, you tell me[br]what you want to hear. So let's do this, 0:10:08.810,0:10:14.410 you can choose between A, physics, and B,[br]model simulation and data. You remember 0:10:14.410,0:10:18.620 these books like from the old days when we[br]were all young? It's that kind of thing, 0:10:18.620,0:10:24.450 ok? You decide/design your own talk here.[br]So, by applause, do you want to hear about 0:10:24.450,0:10:27.720 physics?[br]applause 0:10:27.720,0:10:35.730 Okay. Or the model simulation data part?[br]louder applause 0:10:35.730,0:10:45.101 Okay, there we go. So, this is what we[br]skip. Model simulation data it is. You're 0:10:45.101,0:10:49.700 a strange crowd, first time I meet people[br]who don't want to hear about physics... no 0:10:49.700,0:10:51.450 I'm kidding.[br]laughter 0:10:51.450,0:10:53.800 Audience: inaudible interjection[br]laughter 0:10:53.800,0:11:00.079 So model simulation data it is. So our[br]theory is actually incredibly precise. 0:11:00.079,0:11:04.450 It's so precise that our basic job is[br]really really boring, because we already 0:11:04.450,0:11:10.514 understand everything. Whenever there is a[br]collision, we know what's going to happen. 0:11:10.514,0:11:15.430 Except for these very rare things. So we[br]are trying to find these very rare things 0:11:15.430,0:11:19.580 out of this haystack of fairly boring[br]things that we really understand well. And 0:11:19.580,0:11:25.589 the weird things are, for example,[br]monopoles, supersymmetry, or black holes. 0:11:25.589,0:11:32.060 Now the theorists job is to tell us what[br]we should be seeing in the detector, given 0:11:32.060,0:11:42.347 some fancy physics. 
Then we use simulation[br]to see how our detector would respond to 0:11:42.347,0:11:53.476 that. Now, of course the question is: We[br]are just counting, basically, when we do 0:11:53.476,0:11:58.102 experiments and the question is: How often[br]do we need to see something to say: "Well, 0:11:58.102,0:12:03.310 that's not just the ordinary. That is[br]something new, that's something that could 0:12:03.310,0:12:09.870 be explained by a weird theory." We use the[br]detector simulation as I said to basically 0:12:09.870,0:12:15.029 predict how much we expect to see things.[br]We use reconstruction software which 0:12:15.029,0:12:20.680 tells us what has happened, or might have[br]happened in the detector to count how 0:12:20.680,0:12:25.400 often we saw something. And then we use[br]statistics to compare these two and to say 0:12:25.400,0:12:31.610 whether something is expected or not. Now,[br]that's fairly abstract but it's fairly 0:12:31.610,0:12:36.905 common, a fairly common approach. For[br]example, if you look at climate versus 0:12:36.905,0:12:40.331 weather, right, I mean we always have[br]temperature fluctuations because of 0:12:40.331,0:12:46.480 weather, and the question is: Is that rise[br]in temperature because of a weather effect 0:12:46.480,0:12:50.375 or because of a climate effect? Is that[br]large-scale or just a short-term 0:12:50.375,0:12:55.610 fluctuation? So there, we have a very[br]similar problem and here what you do is 0:12:55.610,0:13:00.880 you measure temperatures, and you want to[br]detect abnormal variations, and you can 0:13:00.880,0:13:06.420 improve that by measuring longer, like,[br]for 300 years instead of 20 years. That 0:13:06.420,0:13:11.930 gives you a better prediction of what you[br]would expect in the future. Also, larger 0:13:11.930,0:13:14.170 deviations help, right? If you look for[br]something that 0:13:14.170,0:13:19.700 is just 0.1 degree, then you might not be[br]able to find it. 
If there is a deviation 0:13:19.700,0:13:25.230 of 5 degrees, you will definitely find it.[br]And for us it's very similar. So here we 0:13:25.230,0:13:31.610 have a plot, one of the first Higgs[br]discovery plots, and you can see that we 0:13:31.610,0:13:38.800 have many ingredients there. So, the black[br]dots are what we measure and they have a 0:13:38.800,0:13:43.829 certain uncertainty, because when we[br]measure, we count and we might have, you 0:13:43.829,0:13:48.977 know, not seen something, or we might have[br]seen more than we should have seen, so 0:13:48.977,0:13:54.970 there's always an uncertainty. And then we[br]also have theory, which tells us you 0:13:54.970,0:14:00.079 should have seen so many and so for the[br]red part that's something that we know 0:14:00.079,0:14:04.889 exists, it's nothing spectacular. It's[br]simply what theory is telling us we 0:14:04.889,0:14:10.660 should be seeing. And you can see the data[br]follows the red part fairly well. But then 0:14:10.660,0:14:15.980 there is this other bump in our dots on[br]the right-hand side or in the center and 0:14:15.980,0:14:21.230 that does not make sense, unless you take[br]the Higgs into account, right, which is 0:14:21.230,0:14:26.889 the light blue part and so here you can[br]see how this interplay between different 0:14:26.889,0:14:38.280 sources of physics and statistics works[br]for us. Now just as for the climate, more 0:14:38.280,0:14:43.690 data helps. And there are two versions of[br]more data: Either by having more 0:14:43.690,0:14:48.079 collisions, which is why we are running[br]24/7, or more data by combining different 0:14:48.079,0:14:52.060 analyses which is what's happening here.[br]So here you see all these different 0:14:52.060,0:14:56.990 analyses. If you combine them, of course[br]you get a much stronger prediction of, in 0:14:56.990,0:15:03.300 this case, the Higgs mass, than if you[br]just take any single one of them. 
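The two ways in which "more data helps" can be sketched with a few lines of Python. This is only an illustration under simplifying assumptions, not the statistical machinery the experiments actually use: the excess is measured in units of the Poisson spread of the expected background, and independent measurements are combined with an inverse-variance weighted average. The function names and all numbers are made up for the example.

```python
import math


def naive_significance(observed, expected_background):
    """Excess over the expected count, in units of the Poisson spread sqrt(b).

    Simplified assumption: pure counting, no systematic uncertainties.
    """
    return (observed - expected_background) / math.sqrt(expected_background)


def combine(measurements):
    """Inverse-variance weighted average of independent (value, sigma) pairs."""
    weights = [1.0 / sigma ** 2 for _, sigma in measurements]
    total = sum(weights)
    mean = sum(w * value for w, (value, _) in zip(weights, measurements)) / total
    return mean, math.sqrt(1.0 / total)


# Same relative excess (10 % above background), but 100 times more collisions.
small = naive_significance(observed=110, expected_background=100)
large = naive_significance(observed=11000, expected_background=10000)

# Two hypothetical, independent mass measurements (value, uncertainty).
mass, sigma = combine([(125.0, 0.4), (125.6, 0.3)])

print(f"significance with little data: {small:.1f}, with 100x data: {large:.1f}")
print(f"combined: {mass:.3f} +/- {sigma:.2f}")
```

With a hundred times more data the same relative excess stands out ten times more clearly, and the combined uncertainty is smaller than that of either single measurement.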
You see 0:15:03.300,0:15:08.540 how similar what we are doing is to, you[br]know, any of the big data analyses out 0:15:08.540,0:15:16.414 there. Okay, so that was that part. Now[br]comes the obligatory part again, 0:15:16.414,0:15:22.930 computering. When we were designing the[br]LHC, not me, when people were designing the 0:15:22.930,0:15:31.120 LHC, they needed to project computing[br]power from 1990 to 2000, 2010 and so on. 0:15:31.120,0:15:34.140 And then they said: "Well, we need a[br]massive amount of computers" and for you 0:15:34.140,0:15:38.420 there's now "Ughhh - everybody has it, we[br]have it as well, we have our racks of 0:15:38.420,0:15:44.240 computers". This is something that the big[br]companies usually don't show: You know, 0:15:44.240,0:15:48.509 there is actually a ramp where the trucks[br]arrive and they offload the things and 0:15:48.509,0:15:53.820 then someone needs to screw them together[br]and then it looks shiny. This is how we are 0:15:53.820,0:16:00.870 spending our CPU time: We have about[br]60,000 cores that are spinning all the 0:16:00.870,0:16:06.680 time for us, and they are distributed[br]around the world. You can see that CERN, 0:16:06.680,0:16:14.529 for example, is the red part there near[br]the bottom. Yeah, so we make good use of 0:16:14.529,0:16:20.829 that. We also monitor the efficiency, and[br]because 100 percent efficient is for 0:16:20.829,0:16:29.300 beginners we are actually about 700[br]percent efficient. Don't ask why. They 0:16:29.300,0:16:33.920 decided if you are multi-threading, then[br]we, you know, we multiply your efficiency 0:16:33.920,0:16:39.950 by the number of threads you have. Makes[br]no sense to me. We also have storage, 0:16:39.950,0:16:44.930 currently we use about 0.7 exabytes. We[br]also have available at one point seven 0:16:44.930,0:16:49.130 exabytes, so that's good, we make use of[br]the storage we have. 
Where it's, you know, 0:16:49.130,0:16:55.529 tera- peta- exa-, so it's a lot, and here[br]you can see on the right hand side you 0:16:55.529,0:16:59.610 see, for example, the tape usage on the[br]bottom and you see this dip that was 0:16:59.610,0:17:04.270 before we were starting the accelerator[br]again, we needed to make some space so we 0:17:04.270,0:17:09.089 monitor our hard disk usage all the time.[br]Hey, here comes the next decision point: 0:17:09.089,0:17:13.630 So, do you want to hear about, 1,[br]distributed computing or 2, measure 0:17:13.630,0:17:17.839 effects of bugs. So, 1, distributed[br]computing 0:17:17.839,0:17:26.470 applause[br]and 2, measure the effects of bugs 0:17:26.470,0:17:35.560 similar amount of applause[br]Okay, so that's my call, and I would say 0:17:35.560,0:17:41.455 we do we do... Measure the effects of[br]bugs, because it's shorter. 0:17:41.455,0:17:47.130 laughter[br]So this is one of the views you can, you 0:17:47.130,0:17:50.740 know, electronic views you can get from a[br]detector and you see how we trace the 0:17:50.740,0:17:55.380 particles that fly through the detector.[br]Now, that software right, that's the 0:17:55.380,0:17:59.927 result of software, and you might not[br]believe it, if you have bugs in there, in 0:17:59.927,0:18:00.808 that software. 0:18:02.849,0:18:07.260 And you know, these bugs are sometimes[br]wrong coordinate transformations, so 0:18:07.260,0:18:12.590 things don't go this way but that way,[br]it's kind of weird if you look at it, and 0:18:12.590,0:18:17.470 the result is that our particles don't go[br]through the path that they should have 0:18:17.470,0:18:25.190 been going, but we are attributing them a[br]different path. Now, the the nice thing 0:18:25.190,0:18:30.960 is that we are doing this a million times,[br]right? So all of that is smeared. We are 0:18:30.960,0:18:35.730 not systematically doing this wrong it's[br]just, we are always doing it a little bit 0:18:35.730,0:18:41.669 wrong. 
And so the net result is that if we[br]measure our particles, we will not measure 0:18:41.669,0:18:46.861 the right thing but always a little bit[br]wobbly left wobbly right you know? Things 0:18:46.861,0:18:53.809 are not as precise. That's simply an[br]uncertainty. So for us just like counting 0:18:53.809,0:18:59.059 has an uncertainty and predictions have[br]an uncertainty, software bugs introduce 0:18:59.059,0:19:05.559 another source of uncertainties. And here[br]you can see how we are tracking 0:19:05.559,0:19:09.370 uncertainties for all of our[br]analyses. We are trying to understand the 0:19:09.370,0:19:16.220 different sources of uncertainties. And[br]again, bugs are only one of the sources 0:19:16.220,0:19:22.880 here, so if we find a bug then we[br]reduce our uncertainty and we can find new 0:19:22.880,0:19:27.760 physics earlier, instead of having to[br]wait and collect more data. So for us 0:19:27.760,0:19:32.210 finding bugs is really key, we really[br]love finding bugs because it brings 0:19:32.210,0:19:36.710 physics closer. I thought that was[br]interesting. It's kind of rare that you're 0:19:36.710,0:19:42.140 in an environment where you're able to[br]measure the effect of bugs. Okay, so now 0:19:42.140,0:19:47.870 we are talking, we'll be talking about[br]data. I talked, told you that we are 0:19:47.870,0:19:52.690 trying to find particle traces in our[br]data and the way we do this is by using 0:19:52.690,0:19:56.700 reconstruction programs and there are[br]multiple gigabytes of binaries in shared 0:19:56.700,0:20:01.799 libraries and stuff. They're huge, they're[br]experiment specific and they are curated 0:20:01.799,0:20:06.270 by the experiments, open-source for some[br]of them, and we want them to be correct 0:20:06.270,0:20:14.140 and efficient. The data format we use is[br]not comma separated values, it's binary 0:20:14.140,0:20:21.080 and for some strange reason it's our own[br]custom binary format. 
The reason is that 0:20:21.080,0:20:26.990 it's really targeted at the kind of[br]data we have. We have collisions 0:20:26.990,0:20:32.230 that are independent, so we only need one[br]in memory at any time and we have nested 0:20:32.230,0:20:38.590 collections which makes the regular table[br]layout a non-starter. We actually generate 0:20:38.590,0:20:44.430 them from C++ objects so from classes,[br]class definitions, C++ class definitions 0:20:44.430,0:20:51.320 and we can read them back into C++ but[br]also into JavaScript or Scala. Databases 0:20:51.320,0:20:56.840 just didn't do it for us. They have the[br]wrong model of data access, they don't 0:20:56.840,0:21:02.940 scale, it's just not the kind of system[br]that works for us. Also using a file 0:21:02.940,0:21:09.390 system as a storage back-end might sound[br]really very traditional and boring but it 0:21:09.390,0:21:13.890 works amazingly well and seems to be[br]future proof as well, so that's just the 0:21:13.890,0:21:20.360 way to go for us. There are many other[br]structured data formats out there, many of 0:21:20.360,0:21:26.000 those did not exist when we started ROOT,[br]our own data format. But they also miss 0:21:26.000,0:21:30.250 many things. For example, we wanted to[br]make sure that we have schema evolution 0:21:30.250,0:21:33.970 support. We can change the class layout[br]and still read back all data. We don't 0:21:33.970,0:21:38.750 want to throw away all data just because[br]we're changing the class. Also we do not 0:21:38.750,0:21:43.370 trust people. That is a, you know, as a[br]computer scientist or whatever you 0:21:43.370,0:21:46.750 probably know what I'm talking about[br]right? If people have to write their own 0:21:46.750,0:21:50.630 streaming algorithm, there will be bugs[br]and we will lose data. 0:21:50.630,0:21:54.610 We really don't want to do this, so we[br]were trying to automate this, based on the 0:21:54.610,0:22:03.070 class definition. 
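The idea of deriving the streaming code from the class definition, and of still reading old data after the class has changed, can be illustrated with a small Python sketch. This is not ROOT's format or API: it assumes JSON text instead of a binary layout, and the `Track` and `Collision` classes, their members, and the helper names `write` and `read_collision` are hypothetical, chosen only for the example.

```python
import json
from dataclasses import asdict, dataclass, field, fields


@dataclass
class Track:
    pt: float
    eta: float


@dataclass
class Collision:
    event_id: int
    tracks: list = field(default_factory=list)  # nested collection of Track
    n_vertices: int = 0  # member added in a later version of the class


def write(collision):
    """Serialize from the class definition; nobody writes a streamer by hand."""
    return json.dumps(asdict(collision))


def read_collision(text):
    """Read a record written by any version of the Collision class."""
    raw = json.loads(text)
    known = {f.name for f in fields(Collision)}
    # Members that were dropped from the class are ignored ...
    kwargs = {name: value for name, value in raw.items() if name in known}
    kwargs["tracks"] = [Track(**t) for t in kwargs.get("tracks", [])]
    # ... and members missing in old records fall back to their defaults.
    return Collision(**kwargs)


# A record written before n_vertices existed can still be read.
old_record = '{"event_id": 7, "tracks": [{"pt": 41.5, "eta": -0.3}]}'
event = read_collision(old_record)
print(event)
```

Because the reader and writer are both driven by the dataclass fields, changing the class layout does not require touching the serialization code, which is the property described above.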
So, last decision point[br]for the story. Do you want to hear about 0:22:03.070,0:22:10.409 cling, our C++ interpreter or about Open[br]Data and Applied Science? Let's start with 0:22:10.409,0:22:14.860 option 1, the C++ interpreter[br]applause 0:22:14.860,0:22:21.106 Okay, and Open Data and Applied[br]Science? 0:22:21.106,0:22:29.679 more applause than before[br]Yeah. I'm heading there. You miss a fish. 0:22:29.679,0:22:35.299 You can look at the slides later. Okay, so[br]there we go. Really? No. The slide number 0:22:35.299,0:22:41.140 is wrong. Oh a bug! So, Open Data and[br]Applied Science. Okay, you really wanted 0:22:41.140,0:22:47.700 to know about our budget, I understand[br]that. So we get from you about 1 billion a 0:22:47.700,0:22:50.719 year and the currency doesn't really[br]matter anymore at this, at this point of 0:22:50.719,0:22:54.200 time.[br]laughter 0:22:54.200,0:23:01.230 And that is a lot of money. And you know?[br]We try to do really wonderful things, I 0:23:01.230,0:23:04.943 mean we really enjoy our job, we love it.[br]It's fantastic to work in such an 0:23:04.943,0:23:09.248 environment. And thank you very much for[br]making that possible. Really, I mean it. 0:23:11.110,0:23:16.691 But it also means that you decided as[br]society to enable something like CERN. 0:23:17.473,0:23:22.140 Which I think really deserves my applause[br]and yours probably as well. I think it's a 0:23:22.140,0:23:24.425 great decision to do something like this. 0:23:24.425,0:23:30.211 applause 0:23:31.325,0:23:35.690 So we realize this, right? We realized[br]that we are basically, that we can do what 0:23:35.690,0:23:40.210 we do because of you, and we are trying to[br]react to that by giving back what we do. 0:23:40.210,0:23:47.460 Software, research results, hardware and[br]data. So the way we share research results 0:23:47.460,0:23:52.600 is through open access. We have it,[br]finally. 
It took us a long time to fight 0:23:52.600,0:23:57.570 with publishers and, you know, the[br]establishment, but now we have it. We 0:23:57.570,0:23:59.220 also, yes thank you. 0:23:59.220,0:24:03.395 applause 0:24:03.395,0:24:07.520 We also put a lot of effort in[br]communicating our results and what we are 0:24:07.520,0:24:12.680 doing. And if you're in the region, it's[br]definitely worth a visit. I mean the URL 0:24:12.680,0:24:17.590 is really easy to remember, it's[br]visit.cern, and you know, works. And you 0:24:17.590,0:24:22.270 should go there by April, actually, if you[br]can because then you can ask people how to 0:24:22.270,0:24:27.580 get on the ground, because the accelerator[br]is off at the moment. We also do applied 0:24:27.580,0:24:32.320 research, for example we have this super[br]cool experiment where we try to study how 0:24:32.320,0:24:39.630 clouds form, based on cosmic rays. So the[br]the influence of cosmic rays and cloud 0:24:39.630,0:24:45.770 formation. Which is a key element in the[br]uncertainty of climate models. We are 0:24:45.770,0:24:50.440 trying to, to think about, you know, how[br]to make energy from nuclear waste. So 0:24:50.440,0:24:54.830 getting rid of nuclear waste while making[br]energy from it. And we are trying to 0:24:54.830,0:25:02.070 repurpose detectors that we have and you[br]know develop. We have something called 0:25:02.070,0:25:08.330 open hardware, for example White Rabbit:[br]deterministic ethernet, we have Open Data, 0:25:08.330,0:25:12.789 and we have the LHC@home and some other[br]programs, where either you can donate 0:25:12.789,0:25:21.250 compute power or your brain and help us[br]get better results. We explicitly try to 0:25:21.250,0:25:25.747 use open source as much as possible, and[br]also feed back, whenever we see issues. 0:25:27.700,0:25:33.620 But we also create open source. 
For[br]example, we create Geant, which is a 0:25:33.620,0:25:37.831 program that allows you to simulate how[br]particles fly through matter, for 0:25:37.831,0:25:44.610 example used by NASA. We have Indico,[br]which allows us to schedule meetings, 0:25:44.610,0:25:48.940 upload slides, you know, these kind of[br]things. Across the globe, lots of people, 0:25:48.940,0:25:52.970 with access protection, all these kind of[br]things. And it's open source. We have 0:25:52.970,0:25:58.919 DaviX, the dimension we love HTTP. That's[br]the NeXT machine of Tim Berners-Lee. And 0:25:58.919,0:26:03.140 that's his futile effort in trying to[br]prevent the cleaning personnel from 0:26:03.140,0:26:07.530 switching it off. They don't speak[br]English, they did not back then at least. 0:26:09.337,0:26:15.500 So we use DaviX to transfer files[br]over HTTP, with a high bandwidth. Or we 0:26:15.500,0:26:21.241 have CVM-FS, which allows us to distribute[br]our binaries across the globe, and not 0:26:21.241,0:26:26.570 rely on admins downloading stuff and[br]making sure it actually runs, and these 0:26:26.570,0:26:31.581 kind of things. That is a lifesaver, it's[br]really fantastic, it's a great tool. But 0:26:31.581,0:26:37.730 nobody knows it. And we have ROOT, but[br]that's coming up. So now, the last 0:26:37.730,0:26:42.534 official part of this, of this[br]presentation, how do we do data analysis? 0:26:42.534,0:26:44.950 Not like that.[br]laughter 0:26:44.950,0:26:52.210 applause[br]We use, we use C++ and actually physicists 0:26:52.210,0:26:58.140 need to write their own analysis in C++.[br]We have very few people who have an actual 0:26:58.140,0:27:03.876 education in programming, so that's sort[br]of a clash. As I said, we need to keep one 0:27:04.607,0:27:08.460 collision in memory. And for what, you[br]know, what matters to us is throughput. We 0:27:08.460,0:27:13.340 want to have, we want to analyze as many[br]collisions as possible per second. 
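The pattern of keeping one collision in memory at a time and judging the analysis by its throughput can be sketched as a simple event loop. This is a toy illustration in Python, not how an actual C++ analysis is written: the event content, the generator `collisions`, and the selection in `analyze` are hypothetical.

```python
import time


def collisions(n):
    """Yield hypothetical collisions one by one; only one exists at a time."""
    for i in range(n):
        # Three made-up track momenta per collision, derived from the index.
        yield {"event_id": i,
               "track_pts": [float((i * 7 + k) % 50) for k in range(3)]}


def analyze(events, min_pt):
    """Count all events and those with at least one track above min_pt."""
    total = 0
    selected = 0
    for event in events:  # independent collisions, processed as a stream
        total += 1
        if max(event["track_pts"]) > min_pt:
            selected += 1
    return total, selected


start = time.perf_counter()
total, selected = analyze(collisions(100_000), min_pt=40.0)
elapsed = time.perf_counter() - start

print(f"{total / elapsed:.0f} collisions per second, {selected} selected")
```

Because the collisions are independent, the loop never needs more than the current event, so memory use stays flat while the number of collisions per second becomes the figure to optimize.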
What we 0:27:13.340,0:27:17.390 can do, is specialize our data format to[br]match the analysis, because we don't want 0:27:17.390,0:27:23.419 to waste I/O cycles, if we can, you know,[br]if we can make use of the CPU better. ROOT 0:27:23.419,0:27:29.110 has allowed us to do this for twenty years.[br]It's really the workhorse for the analysis 0:27:29.110,0:27:35.200 in high energy physics. And it's also an[br]interface to complex software. We have 0:27:35.200,0:27:40.950 serialization facilities, we have the[br]statistical tools, that people need, and 0:27:40.950,0:27:44.480 we have graphics, because once you have[br]done your analysis you need to communicate 0:27:44.480,0:27:48.500 that to your peers and convince people,[br]and publish, and so on, so that's part of 0:27:48.500,0:27:54.169 the game. All of that is open source, and,[br]of course, all of that is not just used by 0:27:54.169,0:28:03.370 high energy physics. So, to conclude: We[br]are here, because you make it possible. 0:28:03.370,0:28:05.223 Thank you very much. It's fantastic to[br]have you. 0:28:05.223,0:28:10.860 applause[br]We want to share and we have great people 0:28:10.860,0:28:17.080 for science outreach, but we have nobody[br]for software outreach, basically. So maybe 0:28:17.080,0:28:24.570 it's worth a look to see what CERN is[br]producing software-wise. Scientific 0:28:24.570,0:28:29.940 computing is nothing new, it has existed for[br]a long time, but we had to start fairly 0:28:29.940,0:28:35.490 early on a large scale. So when we were[br]building it up, we had to take... we were 0:28:35.490,0:28:39.960 trying to take pieces that existed and did[br]not find much. So now we ended up 0:28:39.960,0:28:45.179 with C++ data serialization, efficient[br]computing for non computer scientists 0:28:45.179,0:28:49.660 even... 
In the part that I skipped and,[br]you know, one of the alternate tracks, you 0:28:49.660,0:28:54.289 would have seen that we have a Python[br]binding as well for the whole software 0:28:54.289,0:28:59.970 stack in C++. And for us, what matters[br]most is scale. Now we are seeing that we 0:28:59.970,0:29:04.309 are not the only ones. There are many more[br]natural sciences arriving at a similar 0:29:04.309,0:29:09.120 challenge of having to analyze large[br]amounts of data. Now I promised to you 0:29:09.120,0:29:12.480 that I'll be bold and I'll try to make a[br]few statements of what will happen with 0:29:12.480,0:29:16.750 data analysis, not just in science.[br]Because what we see is that we actually 0:29:16.750,0:29:22.610 educate the people who will do data[br]analysis, not just in science. What we see 0:29:22.610,0:29:30.990 is that in the past, data volume mattered[br]most. So more data meant more power. Now 0:29:30.990,0:29:35.929 that's not the complete truth anymore.[br]It's a lot about finding correlations. So 0:29:35.929,0:29:40.880 even with the amount of data not growing[br]anymore, because it's already humongous, 0:29:40.880,0:29:46.320 we try to squeeze more knowledge out of[br]it. And for that, I/O becomes important 0:29:46.320,0:29:53.900 and CPU limitations is the crucial factor.[br]We see that multivariate techniques are 0:29:53.900,0:29:59.029 still rising and they will just be part of[br]the toolchain of the statistical tools; 0:29:59.852,0:30:06.681 except for generative parts, which, I[br]believe, will change the way we model. 0:30:10.232,0:30:16.361 Now, based on what I just described, this[br]is not a big surprise anymore. 
As we need 0:30:16.361,0:30:21.210 throughput, we need to have a language for[br]the core analysis part, that is close to 0:30:21.210,0:30:26.970 the metal, so something like C++.[br]On the other hand writing analyses is 0:30:26.970,0:30:31.791 still complex, so you need a higher-level[br]language and for that people could, for 0:30:31.791,0:30:35.929 example, use Python. So, now language[br]binding becomes relevant all of a sudden. 0:30:35.929,0:30:42.010 It's much more important in the future.[br]And we need to tailor I/O to the actual 0:30:42.010,0:30:48.910 analysis to not waste CPU cycles. So[br]throughput is the king and, in my point of 0:30:48.910,0:30:54.331 view, also in the future we will see much[br]more effort in increasing the throughput. 0:30:55.600,0:31:03.115 Okay, so that was it. In case you want to[br]discuss anything with me, like "That's 0:31:03.115,0:31:07.970 just wrong!", that's fine. I[br]probably have several bugs in there. I'm still here 0:31:07.970,0:31:12.909 until tomorrow. I don't know where yet,[br]so I'll wander around and you can contact 0:31:12.909,0:31:16.818 me by email or Twitter. Thank you very[br]much for your attention. Thank you. 0:31:16.818,0:31:20.525 applause 0:31:20.525,0:31:27.990 music 0:31:27.990,0:31:45.000 subtitles created by c3subtitles.de[br]in the year 2017. Join, and help us!