0:00:00.000,0:00:13.119
music
0:00:13.119,0:00:17.190
Herald: Good morning and welcome back to[br]stage one. It's kind of going to be the
0:00:17.190,0:00:21.490
second talk about physics on this day[br]already and it's about big data and
0:00:21.490,0:00:27.150
science and big data became something like[br]Uber in science. It's everywhere every
0:00:27.150,0:00:33.370
discipline has it. Axel Naumann's working[br]for CERN, the accelerator in Switzerland
0:00:33.370,0:00:39.160
and he talks about how physics and[br]computing bridge in this area and he works
0:00:39.160,0:00:43.183
a lot with ROOT, a program that helps[br]transform data into knowledge. A warm
0:00:43.183,0:00:44.650
welcome.
0:00:44.650,0:00:45.262
Axel Naumann: Thank you.
0:00:45.262,0:00:51.260
applause
0:00:51.260,0:00:57.850
AN: Thanks a lot. So, well you know, when,[br]when I was discussing this abstract with
0:00:57.850,0:01:00.950
the science track people they tell me:[br]"Well, you know about three hundred people
0:01:00.950,0:01:06.000
might be in the audience." But well, hey,[br]you are huge, that's much more than three
0:01:06.000,0:01:10.940
hundred people. So thank you so much for[br]inviting me over it's a real honor. And of
0:01:10.940,0:01:15.310
course originally, when talking to 300[br]people who are all science-interested, I
0:01:15.310,0:01:20.590
thought you know I pick something fairly[br]narrow focuswise but then I learned I'm
0:01:20.590,0:01:24.690
going to be in Saal one and that's[br]different, so I decided to make the scope
0:01:24.690,0:01:30.670
a little bit wider and that's what I ended[br]up with. I'll talk a little bit about
0:01:30.670,0:01:37.540
CERN in society as well if you so choose,[br]you'll see what that means in a minute. So
0:01:37.540,0:01:41.680
the things I'll cover here is obviously[br]CERN just a little bit of an introduction
0:01:41.680,0:01:46.100
how we do physics, how we do computing,[br]what data means to us and I can tell you
0:01:46.100,0:01:51.810
it means everything, you heard about that[br]already, right? How we do data analysis in
0:01:51.810,0:01:56.159
high energy physics and just because[br]we've been doing it for a while and
0:01:56.159,0:02:00.530
because I've been doing it for more than[br]ten years, I'm one of the guys who's
0:02:00.530,0:02:07.250
providing the software to do data[br]analysis in high energy physics, so, you
0:02:07.250,0:02:11.360
know, because we know what we are doing[br]and we have some experience, I thought
0:02:11.360,0:02:18.110
maybe you might be interested in hearing[br]what my forecast is for data analysis in
0:02:18.110,0:02:25.430
general, in the future. So let's start[br]with CERN. And so if you wonder what CERN
0:02:25.430,0:02:31.510
is, you've all heard about CERN, about[br]the fantastic fonts we love to use, then
0:02:31.510,0:02:36.960
you've probably also heard that we are[br]doing science. We were founded right after
0:02:36.960,0:02:41.450
the Second World War or soon after the[br]Second World War, basically as a way to
0:02:41.450,0:02:47.458
entertain those freaky scientists. You[br]know that was the idea: peace europewide.
0:02:47.458,0:02:52.349
And damn, that's working out really well[br]and so well there's not just Europe
0:02:52.349,0:02:57.530
anymore these days. We are located near[br]Geneva, we are doing only fundamental
0:02:57.530,0:03:02.269
research, so we don't do any weapons,[br]nuclear stuff you
0:03:02.269,0:03:10.230
know, these kind of things. The WWW was[br]invented at CERN but that was just a, you
0:03:10.230,0:03:14.586
know, side effect happens sometimes, that[br]we invent things. But usually we just do
0:03:14.586,0:03:22.500
science. So what we do is, we take money,[br]lots of it, and brains who like to discuss
0:03:22.500,0:03:27.210
and think and come up with ideas and from[br]that we generate knowledge. It's really
0:03:27.210,0:03:33.000
all about curiosity. The things we try to[br]answer are: What is mass? Which is a funny
0:03:33.000,0:03:37.371
question right? Like we all know what mass[br]is but actually we don't. We know what
0:03:37.371,0:03:42.360
mass is in the universe. We understand[br]that masses attract one another: gravity.
0:03:42.360,0:03:48.730
Which is beautifully correct. And in the[br]small scale, our particles, we know that
0:03:48.730,0:03:52.940
mass is energy and we can convert them.[br]But we don't understand how these two
0:03:52.940,0:03:58.319
things go together. Like there is no[br]bridge, they contradict one another. So we
0:03:58.319,0:04:04.930
are trying to understand what that bridge[br]might be. Part of that mass thing is of
0:04:04.930,0:04:08.650
course also what's out there in the[br]universe? That's a big question. We only
0:04:08.650,0:04:14.230
understand a few percent of that. 90 and[br]some percent are completely unknown to
0:04:14.230,0:04:20.349
us, and that's scary right? I mean we know[br]gravity really well, we can deal with
0:04:20.349,0:04:27.560
freaky things like black holes and yet we[br]don't understand what's out there. Now to
0:04:27.560,0:04:31.850
do all these things we are probing nature[br]at the smallest scale as we call it, so
0:04:31.850,0:04:36.190
that's particles, we are dealing with[br]things like the Higgs particle and
0:04:36.190,0:04:43.900
supersymmetry. Here's a little bit of a[br]fact sheet. We have about 12,000
0:04:43.900,0:04:47.500
physicists who are working with CERN. We[br]are basically the workbench that you saw
0:04:47.500,0:04:54.661
in Andre's talk before. We are the table[br]that physicists use, okay? And, so they
0:04:54.661,0:04:59.050
come to CERN once in a while, about[br]10,000 physicists a year, or they work
0:04:59.050,0:05:02.810
remotely most of the time from about 120[br]nations. So you're seeing it's not
0:05:02.810,0:05:10.650
European anymore, this is a global thing.[br]CERN in itself has about 2,500 employees,
0:05:10.650,0:05:15.490
you know those scrubbing the table,[br]setting things up and so on. And our
0:05:15.490,0:05:21.190
table is right here. In the far end we[br]have the Alps, it's in Switzerland
0:05:21.190,0:05:25.990
as I said, so the Alps are[br]always close, with Mont Blanc, we have the
0:05:25.990,0:05:31.639
Lake Geneva, we have the Jura, the French[br]mountains on the lower end here, it's just
0:05:31.639,0:05:37.410
beautiful. It's really nice, but we[br]needed to stick a 30-kilometer ring in
0:05:37.410,0:05:43.861
there somewhere and people would have[br]hated us had we put it like this. But
0:05:43.861,0:05:49.671
luckily people were smart back then in the[br]70s, and built a tunnel. Much better. So
0:05:49.671,0:05:55.229
now we have this huge tunnel, and we send[br]particles through in both directions near
0:05:55.229,0:06:00.351
the speed of light and the tunnel is[br]filled with magnets simply because if you
0:06:00.351,0:06:08.110
don't use a magnet the particles will fly[br]straight but we need them to turn around.
0:06:08.110,0:06:13.560
Here you see what it's looking like, you[br]also see these big halls there that have
0:06:13.560,0:06:21.880
access shafts from the top and that's[br]where the experiments are. That's sort of
0:06:21.880,0:06:29.210
a sketch of one of the experiments. So[br]the LHC is one of the, no, is the biggest
0:06:29.210,0:06:35.889
particle accelerator at the moment, it's a[br]ring with 27 kilometers circumference, 100
0:06:35.889,0:06:40.300
meters below Switzerland and France, it[br]has four big experiments and several
0:06:40.300,0:06:45.270
small ones and we are expected to run[br]until 2030. So you see that all of that
0:06:45.270,0:06:50.150
is large-scale simply because we're trying[br]to make good use of the money we have.
0:06:50.150,0:06:56.020
Here, you see one of these caverns that[br]are used by the experiments while it was
0:06:56.020,0:07:01.490
empty. The experiment was then lowered[br]through this hole in the roof, piece by
0:07:01.490,0:07:07.190
piece, and these things are humongous. To[br]give you an impression of how big it is, I
0:07:07.190,0:07:12.520
put Waldo in there, so your job for the[br]next three slides is to find Waldo. You
0:07:12.520,0:07:15.800
know, that gives you the scale. He's[br]friendlily waving at you, so it should be
0:07:15.800,0:07:21.990
easy to find him. So then we put a[br]detector in there. Here it's pulled apart
0:07:21.990,0:07:26.160
a little bit, so it looks nicer, you can[br]actually see something. You can for
0:07:26.160,0:07:31.039
example see the beam pipe, so that's where[br]the particles are flying through, and then
0:07:31.039,0:07:34.880
they're coming from both directions and[br]colliding in the center of the detector
0:07:34.880,0:07:38.490
and then things happen, and we try to[br]understand what
0:07:38.490,0:07:44.790
is happening. That's yet another view,[br]frontal view on one of the detectors and
0:07:44.790,0:07:51.060
now you have to imagine that, you know,[br]you can't just open up Amazon and order an
0:07:51.060,0:07:56.210
LHC experiment, right, that's not how it[br]works. We do this stuff ourselves, like
0:07:56.210,0:08:02.669
PhD students, postdocs, engineers. You[br]know, that's all done by hand, just like
0:08:02.669,0:08:06.940
the microscope you saw before. Of course[br]you order the parts, but you know the
0:08:06.940,0:08:11.060
design, the whole conception and actually[br]screwing these things together, making
0:08:11.060,0:08:16.970
sure that all fits, is all done by hand.[br]And I find that just beautiful, I mean
0:08:16.970,0:08:21.760
that's close to a miracle, right? That[br]nations, like people no matter what
0:08:21.760,0:08:26.819
nation, people across the globe work[br]together to build such a huge thing and
0:08:26.819,0:08:39.490
then you turn it on and it works. More or[br]less, but you get it to work. That's not
0:08:39.490,0:08:44.310
my applause, that's your applause, because[br]you make this possible. Really, but it's,
0:08:44.310,0:08:49.690
it's huge. This is for me one of the things[br]I love most about CERN: that it is this
0:08:49.690,0:08:55.279
international thing that just works[br]smoothly. Now the detectors are like a
0:08:55.279,0:09:01.310
massive camera. We have lots of pixels and[br]we take many, many pictures a second. We
0:09:01.310,0:09:06.680
do this to identify particles and then[br]sort of estimate what has happened during
0:09:06.680,0:09:15.470
the collision. Now, life at CERN is of[br]course an important ingredient for
0:09:15.470,0:09:19.529
scientists as well, and if you live at[br]CERN then actually it's just work at CERN
0:09:19.529,0:09:23.980
and that's what it's about. But it's not[br]that bad, so we hang out together in our
0:09:23.980,0:09:30.040
control rooms, make sure that the[br]experiments work correctly. We also, you
0:09:30.040,0:09:33.720
know, study the forces.[br]laughter
0:09:33.720,0:09:38.740
We have scientific discourse, in the sun,[br]view on the Mont Blanc, with a good
0:09:38.740,0:09:45.430
coffee. We have lectures and we are[br]lectured and of course, as you, we have
0:09:45.430,0:09:54.570
more laptops than people. And, then we do[br]stuff and so this presentation is going to
0:09:54.570,0:09:58.580
introduce you to some of the things we are[br]doing, and more on the computing and the
0:09:58.580,0:10:04.100
society side as I said. But because I have[br]so much to talk about, I decided that
0:10:04.100,0:10:08.810
you just build your own talk, you tell me[br]what you want to hear. So let's do this,
0:10:08.810,0:10:14.410
you can choose between A, physics, and B,[br]model simulation and data. You remember
0:10:14.410,0:10:18.620
these books like from the old days when we[br]were all young? It's that kind of thing,
0:10:18.620,0:10:24.450
ok? You decide/design your own talk here.[br]So, by applause, do you want to hear about
0:10:24.450,0:10:27.720
physics?[br]applause
0:10:27.720,0:10:35.730
Okay. Or the model simulation data part?[br]louder applause
0:10:35.730,0:10:45.101
Okay, there we go. So, this is what we[br]skip. Model simulation data it is. You're
0:10:45.101,0:10:49.700
a strange crowd, first time I meet people[br]who don't want to hear about physics... no
0:10:49.700,0:10:51.450
I'm kidding.[br]laughter
0:10:51.450,0:10:53.800
Audience: inaudible interjection[br]laughter
0:10:53.800,0:11:00.079
So model simulation data it is. So our[br]theory is actually incredibly precise.
0:11:00.079,0:11:04.450
It's so precise that our basic job is[br]really really boring, because we already
0:11:04.450,0:11:10.514
understand everything. Whenever there is a[br]collision, we know what's going to happen.
0:11:10.514,0:11:15.430
Except for these very rare things. So we[br]are trying to find these very rare things
0:11:15.430,0:11:19.580
out of this haystack of fairly boring[br]things that we really understand well. And
0:11:19.580,0:11:25.589
the weird things are, for example,[br]monopoles, supersymmetry, or black holes.
0:11:25.589,0:11:32.060
Now the theorists' job is to tell us what[br]we should be seeing in the detector, given
0:11:32.060,0:11:42.347
some fancy physics. Then we use simulation[br]to see how our detector would respond to
0:11:42.347,0:11:53.476
that. Now, of course the question is: We[br]are just counting, basically, when we do
0:11:53.476,0:11:58.102
experiments and the question is: How often[br]do we need to see something to say: "Well,
0:11:58.102,0:12:03.310
that's not just the ordinary. That is[br]something new, that's something that could
0:12:03.310,0:12:09.870
be explained by a weird theory"? We use the[br]detector simulation, as I said, to basically
0:12:09.870,0:12:15.029
predict how much we expect to see things.[br]We use reconstruction software which
0:12:15.029,0:12:20.680
tells us what has happened, or might have[br]happened in the detector to count how
0:12:20.680,0:12:25.400
often we saw something. And then we use[br]statistics to compare these two and to say
0:12:25.400,0:12:31.610
whether something is expected or not. Now,[br]that's fairly abstract but it's fairly
0:12:31.610,0:12:36.905
common, a fairly common approach. For[br]example, if you look at climate versus
0:12:36.905,0:12:40.331
weather, right, I mean we always have[br]temperature fluctuations because of
0:12:40.331,0:12:46.480
weather, and the question is: Is that rise[br]in temperature because of a weather effect
0:12:46.480,0:12:50.375
or because of a climate effect? Is that[br]large-scale or just a short-term
0:12:50.375,0:12:55.610
fluctuation? So there, we have a very[br]similar problem and here what you do is
0:12:55.610,0:13:00.880
you measure temperatures, and you want to[br]detect abnormal variations, and you can
0:13:00.880,0:13:06.420
improve that by measuring longer, like,[br]for 300 years instead of 20 years. That
0:13:06.420,0:13:11.930
gives you a better prediction what you[br]would expect in the future. Also, larger
0:13:11.930,0:13:14.170
deviations help, right? If you look for[br]something that
0:13:14.170,0:13:19.700
is just 0.1 degree, then you might not be[br]able to find it. If there is a deviation
0:13:19.700,0:13:25.230
of 5 degrees, you will definitely find it.[br]And for us it's very similar. So here we
0:13:25.230,0:13:31.610
have a plot, one of the first Higgs[br]discovery plots, and you can see that we
0:13:31.610,0:13:38.800
have many ingredients there. So, the black[br]dots are what we measure and they have
0:13:38.800,0:13:43.829
a certain uncertainty, because when we[br]measure, we count and we might have, you
0:13:43.829,0:13:48.977
know, not seen something, or we might have[br]seen more than we should have seen, so
0:13:48.977,0:13:54.970
there's always an uncertainty. And then we[br]also have theory, which tells us you
0:13:54.970,0:14:00.079
should have seen so many and so for the[br]red part that's something that we know
0:14:00.079,0:14:04.889
exists, it's nothing spectacular. It's[br]simply what theory is telling us we
0:14:04.889,0:14:10.660
should be seeing. And you can see the data[br]follows the red part fairly well. But then
0:14:10.660,0:14:15.980
there is this other bump in our dots on[br]the right-hand side or in the center and
0:14:15.980,0:14:21.230
that does not make sense, unless you take[br]the Higgs into account, right, which is
0:14:21.230,0:14:26.889
the light blue part and so here you can[br]see how this interplay between different
0:14:26.889,0:14:38.280
sources of physics and statistics works[br]for us. Now just as for the climate, more
0:14:38.280,0:14:43.690
data helps. And there are two versions of[br]more data: either by having more
0:14:43.690,0:14:48.079
collisions, which is why we are running[br]24/7, or more data by combining different
0:14:48.079,0:14:52.060
analyses which is what's happening here.[br]So here you see all these different
0:14:52.060,0:14:56.990
analyses. If you combine them, of course[br]you get a much stronger prediction of, in
0:14:56.990,0:15:03.300
this case, the Higgs mass, than if you[br]just take any single one of them. You see
0:15:03.300,0:15:08.540
how similar what we are doing is to, you[br]know, any of the big data analyses out
0:15:08.540,0:15:16.414
there. Okay, so that was that part. Now[br]comes the obligatory part again,
0:15:16.414,0:15:22.930
computering. When we were designing the[br]LHC, not me, when people were designing the
0:15:22.930,0:15:31.120
LHC, they needed to project computing[br]power from 1990 to 2000, 2010, and so on.
0:15:31.120,0:15:34.140
And then they said: "Well, we need a[br]massive amount of computers", and for you
0:15:34.140,0:15:38.420
there's now "Ughhh - everybody has it, we[br]have it as well, we have our racks of
0:15:38.420,0:15:44.240
computers". This is something that the big[br]companies usually don't show: You know,
0:15:44.240,0:15:48.509
there is actually a ramp where the trucks[br]arrive and they offload the things and
0:15:48.509,0:15:53.820
then someone needs to screw them together[br]and then it looks shiny. This is how we are
0:15:53.820,0:16:00.870
spending our CPU time: We have about[br]60,000 cores that are spinning all the
0:16:00.870,0:16:06.680
time for us, and they are distributed[br]around the world. You can see that CERN,
0:16:06.680,0:16:14.529
for example, is the red part there near[br]the bottom. Yeah, so we make good use of
0:16:14.529,0:16:20.829
that. We also monitor the efficiency, and[br]because 100 percent efficient is for
0:16:20.829,0:16:29.300
beginners we are actually about 700[br]percent efficient. Don't ask why. They
0:16:29.300,0:16:33.920
decided if you are multi-threading, then[br]we, you know, we multiply your efficiency
0:16:33.920,0:16:39.950
by the number of threads you have. Makes[br]no sense to me. We also have storage,
0:16:39.950,0:16:44.930
currently we use about 0.7 exabytes. We[br]also have available about 1.7
0:16:44.930,0:16:49.130
exabytes, so that's good, we make use of[br]the storage we have. Where it's, you know,
0:16:49.130,0:16:55.529
tera- peta- exa-, so it's a lot, and here[br]you can see on the right hand side you
0:16:55.529,0:16:59.610
see, for example, the tape usage on the[br]bottom and you see this dip that was
0:16:59.610,0:17:04.270
before we were starting the accelerator[br]again, we needed to make some space so we
0:17:04.270,0:17:09.089
monitor our hard disk usage all the time.[br]Hey, here comes the next decision point:
0:17:09.089,0:17:13.630
So, do you want to hear about, 1,[br]distributed computing or 2, measure
0:17:13.630,0:17:17.839
effects of bugs. So, 1, distributed[br]computing
0:17:17.839,0:17:26.470
applause[br]and 2, measure the effects of bugs
0:17:26.470,0:17:35.560
similar amount of applause[br]Okay, so that's my call, and I would say
0:17:35.560,0:17:41.455
we do... measure the effects of[br]bugs, because it's shorter.
0:17:41.455,0:17:47.130
laughter[br]So this is one of the views you can, you
0:17:47.130,0:17:50.740
know, electronic views you can get from a[br]detector and you see how we trace the
0:17:50.740,0:17:55.380
particles that fly through the detector.[br]Now, that's software, right, that's the
0:17:55.380,0:17:59.927
result of software, and you might not[br]believe it, but we have bugs in there, in
0:17:59.927,0:18:00.808
that software.
0:18:02.849,0:18:07.260
And you know, these bugs are sometimes[br]wrong coordinate transformations, so
0:18:07.260,0:18:12.590
things don't go this way but that way,[br]it's kind of weird if you look at it, and
0:18:12.590,0:18:17.470
the result is that our particles don't go[br]through the path that they should have
0:18:17.470,0:18:25.190
been going, but we are attributing them a[br]different path. Now, the nice thing
0:18:25.190,0:18:30.960
is that we are doing this a million times,[br]right? So all of that is smeared. We are
0:18:30.960,0:18:35.730
not systematically doing this wrong, it's[br]just, we are always doing it a little bit
0:18:35.730,0:18:41.669
wrong. And so the net result is that if we[br]measure our particles, we will not measure
0:18:41.669,0:18:46.861
the right thing but always a little bit[br]wobbly left, wobbly right, you know? Things
0:18:46.861,0:18:53.809
are not as precise. That's simply an[br]uncertainty. So for us just like counting
0:18:53.809,0:18:59.059
has an uncertainty and predictions have[br]an uncertainty, software bugs introduce
0:18:59.059,0:19:05.559
another source of uncertainties. And here[br]you can see how we are tracking
0:19:05.559,0:19:09.370
uncertainties for all of our[br]analyses. We are trying to understand the
0:19:09.370,0:19:16.220
different sources of uncertainties. And[br]again, bugs are only one of the sources
0:19:16.220,0:19:22.880
here, so if we find the bug then we[br]reduce our uncertainty and we can find new
0:19:22.880,0:19:27.760
physics earlier, instead of having to[br]wait and collect more data. So for us
0:19:27.760,0:19:32.210
finding bugs is really key, we really[br]love finding bugs because it brings
0:19:32.210,0:19:36.710
physics closer. I thought that was[br]interesting. It's kind of rare that you're
0:19:36.710,0:19:42.140
in an environment where you're able to[br]measure the effect of bugs. Okay, so now
0:19:42.140,0:19:47.870
we are talking, we'll be talking about[br]data. I talked, told you that we are
0:19:47.870,0:19:52.690
trying to find particle traces in our[br]data and the way we do this is by using
0:19:52.690,0:19:56.700
reconstruction programs, and these are[br]multiple gigabytes of binaries in shared
0:19:56.700,0:20:01.799
libraries and stuff. They're huge, they're[br]experiment specific and they are curated
0:20:01.799,0:20:06.270
by the experiments, open-source for some[br]of them, and we want them to be correct
0:20:06.270,0:20:14.140
and efficient. The data format we use is[br]not comma separated values, it's binary
0:20:14.140,0:20:21.080
and for some strange reason it's our own[br]custom binary format. The reason is that
0:20:21.080,0:20:26.990
it's really targeted at the kind of[br]data we have. We have collisions
0:20:26.990,0:20:32.230
that are independent, so we only need one[br]in memory at any time and we have nested
0:20:32.230,0:20:38.590
collections which makes the regular table[br]layout a non-starter. We actually generate
0:20:38.590,0:20:44.430
them from C++ objects so from classes,[br]class definitions, C++ class definitions
0:20:44.430,0:20:51.320
and we can read them back into C++ but[br]also into JavaScript or Scala. Databases
0:20:51.320,0:20:56.840
just didn't do it for us. They have the[br]wrong model of data access, they don't
0:20:56.840,0:21:02.940
scale, it's just not the kind of system[br]that works for us. Also using a file
0:21:02.940,0:21:09.390
system as a storage back-end might sound[br]really very traditional and boring but it
0:21:09.390,0:21:13.890
works amazingly well and seems to be[br]future proof as well, so that's just the
0:21:13.890,0:21:20.360
way to go for us. There are many other[br]structured data formats out there, many of
0:21:20.360,0:21:26.000
those did not exist when we started ROOT,[br]our own data format. But they also miss
0:21:26.000,0:21:30.250
many things. For example, we wanted to[br]make sure that we have schema evolution
0:21:30.250,0:21:33.970
support. We can change the class layout[br]and still read back all data. We don't
0:21:33.970,0:21:38.750
want to throw away all data just because[br]we're changing the class. Also we do not
0:21:38.750,0:21:43.370
trust people. That is a, you know, as a[br]computer scientist or whatever you
0:21:43.370,0:21:46.750
probably know what I'm talking about[br]right? If people have to write their own
0:21:46.750,0:21:50.630
streaming algorithm, there will be bugs[br]and we will lose data.
0:21:50.630,0:21:54.610
We really don't want to do this, so we[br]were trying to automate this, based on the
0:21:54.610,0:22:03.070
class definition. So, last decision point[br]for the story. Do you want to hear about
0:22:03.070,0:22:10.409
Cling, our C++ interpreter, or about Open[br]Data and Applied Science? Let's start with
0:22:10.409,0:22:14.860
option 1, the C++ interpreter[br]applause
0:22:14.860,0:22:21.106
Okay, and Open Data and Applied[br]Science?
0:22:21.106,0:22:29.679
more applause than before[br]Yeah. I'm heading there. You miss a fish.
0:22:29.679,0:22:35.299
You can look at the slides later. Okay, so[br]there we go. Really? No. The slide number
0:22:35.299,0:22:41.140
is wrong. Oh a bug! So, Open Data and[br]Applied Science. Okay, you really wanted
0:22:41.140,0:22:47.700
to know about our budget, I understand[br]that. So we get from you about 1 billion
0:22:47.700,0:22:50.719
a year, and the currency doesn't really[br]matter anymore at this, at this point in
0:22:50.719,0:22:54.200
time.[br]laughter
0:22:54.200,0:23:01.230
And that is a lot of money. And you know?[br]We try to do really wonderful things, I
0:23:01.230,0:23:04.943
mean we really enjoy our job, we love it.[br]It's fantastic to work in such an
0:23:04.943,0:23:09.248
environment. And thank you very much for[br]making that possible. Really, I mean it.
0:23:11.110,0:23:16.691
But it also means that you decided as[br]society to enable something like CERN.
0:23:17.473,0:23:22.140
Which I think really deserves my applause[br]and yours probably as well. I think it's a
0:23:22.140,0:23:24.425
great decision to do something like this.
0:23:24.425,0:23:30.211
applause
0:23:31.325,0:23:35.690
So we realize this, right? We realized[br]that we are basically, that we can do what
0:23:35.690,0:23:40.210
we do because of you, and we are trying to[br]react to that by giving back what we do.
0:23:40.210,0:23:47.460
Software, research results, hardware and[br]data. So the way we share research results
0:23:47.460,0:23:52.600
is through open access. We have it,[br]finally. It took us a long time to fight
0:23:52.600,0:23:57.570
with publishers and, you know, the[br]establishment, but now we have it. We
0:23:57.570,0:23:59.220
also, yes thank you.
0:23:59.220,0:24:03.395
applause
0:24:03.395,0:24:07.520
We also put a lot of effort in[br]communicating our results and what we are
0:24:07.520,0:24:12.680
doing. And if you're in the region, it's[br]definitely worth a visit. I mean the URL
0:24:12.680,0:24:17.590
is really easy to remember, it's[br]visit.cern, and you know, works. And you
0:24:17.590,0:24:22.270
should go there by April, actually, if you[br]can because then you can ask people how to
0:24:22.270,0:24:27.580
get underground, because the accelerator[br]is off at the moment. We also do applied
0:24:27.580,0:24:32.320
research, for example we have this super[br]cool experiment where we try to study how
0:24:32.320,0:24:39.630
clouds form, based on cosmic rays. So the[br]influence of cosmic rays on cloud
0:24:39.630,0:24:45.770
formation. Which is a key element in the[br]uncertainty of climate models. We are
0:24:45.770,0:24:50.440
trying to, to think about, you know, how[br]to make energy from nuclear waste. So
0:24:50.440,0:24:54.830
getting rid of nuclear waste while making[br]energy from it. And we are trying to
0:24:54.830,0:25:02.070
repurpose detectors that we have and you[br]know develop. We have something called
0:25:02.070,0:25:08.330
open hardware, for example White Rabbit:[br]deterministic ethernet, we have Open Data,
0:25:08.330,0:25:12.789
and we have the LHC@home and some other[br]programs, where either you can donate
0:25:12.789,0:25:21.250
compute power or your brain and help us[br]get better results. We explicitly try to
0:25:21.250,0:25:25.747
use open source as much as possible, and[br]also feed back, whenever we see issues.
0:25:27.700,0:25:33.620
But we also create open source. For[br]example, we create Geant, which is a
0:25:33.620,0:25:37.831
program that allows you to simulate how[br]particles fly through matter, for
0:25:37.831,0:25:44.610
example used by NASA. We have Indico,[br]which allows us to schedule meetings,
0:25:44.610,0:25:48.940
upload slides, you know, these kind of[br]things. Across the globe, lots of people,
0:25:48.940,0:25:52.970
with access protection, all these kind of[br]things. And it's open source. We have
0:25:52.970,0:25:58.919
DaviX. Did I mention we love HTTP? That's[br]the NeXT machine of Tim Berners-Lee. And
0:25:58.919,0:26:03.140
that's his futile effort in trying to[br]prevent the cleaning personnel from
0:26:03.140,0:26:07.530
switching it off. They don't speak[br]English, they did not back then at least.
0:26:09.337,0:26:15.500
So we use DaviX to transfer files[br]over HTTP, with high bandwidth. Or we
0:26:15.500,0:26:21.241
have CVMFS, which allows us to distribute[br]our binaries across the globe, and not
0:26:21.241,0:26:26.570
rely on admins downloading stuff and[br]making sure it actually runs, and these
0:26:26.570,0:26:31.581
kind of things. That is a lifesaver, it's[br]really fantastic, it's a great tool. But
0:26:31.581,0:26:37.730
nobody knows it. And we have ROOT, but[br]that's coming up. So now, the last
0:26:37.730,0:26:42.534
official part of this, of this[br]presentation, how do we do data analysis?
0:26:42.534,0:26:44.950
Not like that.[br]laughter
0:26:44.950,0:26:52.210
applause[br]We use, we use C++ and actually physicists
0:26:52.210,0:26:58.140
need to write their own analysis in C++.[br]We have very few people who have an actual
0:26:58.140,0:27:03.876
education in programming. So that's sort[br]of a clash. As I said, we need to keep one
0:27:04.607,0:27:08.460
collision in memory. And for what, you[br]know, what matters to us is throughput. We
0:27:08.460,0:27:13.340
want to have, we want to analyze as many[br]collisions as possible per second. What we
0:27:13.340,0:27:17.390
can do, is specialize our data format to[br]match the analysis, because we don't want
0:27:17.390,0:27:23.419
to waste I/O cycles, if we can, you know,[br]if we can make use of the CPU better. ROOT
0:27:23.419,0:27:29.110
has allowed us to do this for twenty years.[br]It's really the workhorse for the analysis
0:27:29.110,0:27:35.200
in high energy physics. And it's also an[br]interface to complex software. We have
0:27:35.200,0:27:40.950
serialization facilities, we have the[br]statistical tools, that people need, and
0:27:40.950,0:27:44.480
we have graphics, because once you have[br]done your analysis you need to communicate
0:27:44.480,0:27:48.500
that to your peers and convince people,[br]and publish, and so on, so that's part of
0:27:48.500,0:27:54.169
the game. All of that is open source, and,[br]of course, all of that is not just used by
0:27:54.169,0:28:03.370
high energy physics. So, to conclude: We[br]are here, because you make it possible.
0:28:03.370,0:28:05.223
Thank you very much. It's fantastic to[br]have you.
0:28:05.223,0:28:10.860
applause[br]We want to share and we have great people
0:28:10.860,0:28:17.080
for science outreach, but we have nobody[br]for software outreach, basically. So maybe
0:28:17.080,0:28:24.570
it's worth a look to see what CERN is[br]producing software-wise. Scientific
0:28:24.570,0:28:29.940
computing is nothing new, it has existed for[br]a long time, but we had to start fairly
0:28:29.940,0:28:35.490
early on a large scale. So when we were[br]building it up, we had to take... we were
0:28:35.490,0:28:39.960
trying to take pieces that existed and did[br]not find much. So now we ended up
0:28:39.960,0:28:45.179
with C++ data serialization, efficient[br]computing for non-computer-scientists
0:28:45.179,0:28:49.660
even... In the part that I skipped and,[br]you know, one of the alternate tracks, you
0:28:49.660,0:28:54.289
would have seen that we have a Python[br]binding as well for the whole software
0:28:54.289,0:28:59.970
stack in C++. And for us, what matters[br]most is scale. Now we are seeing that we
0:28:59.970,0:29:04.309
are not the only ones. There are many more[br]natural sciences arriving at a similar
0:29:04.309,0:29:09.120
challenge of having to analyze large[br]amounts of data. Now I promised to you
0:29:09.120,0:29:12.480
that I'll be bold and I'll try to make a[br]few statements of what will happen with
0:29:12.480,0:29:16.750
data analysis, not just in science.[br]Because what we see is that we actually
0:29:16.750,0:29:22.610
educate the people who will do data[br]analysis, not just in science. What we see
0:29:22.610,0:29:30.990
is that in the past, data volume mattered[br]most. So more data meant more power. Now
0:29:30.990,0:29:35.929
that's not the complete truth anymore.[br]It's a lot about finding correlations. So
0:29:35.929,0:29:40.880
even with the amount of data not growing[br]anymore, because it's already humongous,
0:29:40.880,0:29:46.320
we try to squeeze more knowledge out of[br]it. And for that, I/O becomes important
0:29:46.320,0:29:53.900
and CPU limitations are the crucial factor.[br]We see that multivariate techniques are
0:29:53.900,0:29:59.029
still rising and they will just be part of[br]the toolchain of the statistical tools;
0:29:59.852,0:30:06.681
except for generative parts, which, I[br]believe, will change the way we model.
0:30:10.232,0:30:16.361
Now, based on what I just described, this[br]is not a big surprise anymore. As we need
0:30:16.361,0:30:21.210
throughput, we need to have a language for[br]the core analysis part, that is close to
0:30:21.210,0:30:26.970
the metal, so something like C++.[br]On the other hand, writing analyses is
0:30:26.970,0:30:31.791
still complex, so you need a higher-level[br]language and for that people could, for
0:30:31.791,0:30:35.929
example, use Python. So, now language[br]binding becomes relevant all of a sudden.
0:30:35.929,0:30:42.010
It's much more important in the future.[br]And we need to tailor I/O to the actual
0:30:42.010,0:30:48.910
analysis to not waste CPU cycles. So[br]throughput is king and, from my point of
0:30:48.910,0:30:54.331
view, also in the future we will see much[br]more effort in increasing the throughput.
0:30:55.600,0:31:03.115
Okay, so that was it. In case you want to[br]discuss anything with me, like "That's
0:31:03.115,0:31:07.970
just wrong!", that's fine. I probably[br]have several bugs in there. I'm still here
0:31:07.970,0:31:12.909
until tomorrow. I don't know where yet,[br]so I'll wander around and you can contact
0:31:12.909,0:31:16.818
me by email or Twitter. Thank you very[br]much for your attention. Thank you.
0:31:16.818,0:31:20.525
applause
0:31:20.525,0:31:27.990
music
0:31:27.990,0:31:45.000
subtitles created by c3subtitles.de[br]in the year 2017. Join, and help us!