WEBVTT
00:00:00.000 --> 00:00:13.119
music
00:00:13.119 --> 00:00:17.190
Herald: Good morning and welcome back to
stage one. It's kind of going to be the
00:00:17.190 --> 00:00:21.490
second talk about physics on this day
already and it's about big data and
00:00:21.490 --> 00:00:27.150
science and big data became something like
Uber in science. It's everywhere every
00:00:27.150 --> 00:00:33.370
discipline has it. Axel Naumann's working
for CERN, the accelerator in Switzerland
00:00:33.370 --> 00:00:39.160
and he talks about how physics and
computing bridge in this area and he works
00:00:39.160 --> 00:00:43.183
a lot with ROOT, a program that helps
transform data into knowledge. A warm
00:00:43.183 --> 00:00:44.650
welcome.
00:00:44.650 --> 00:00:45.262
Axel Naumann: Thank you.
00:00:45.262 --> 00:00:51.260
applause
00:00:51.260 --> 00:00:57.850
AN: Thanks a lot. So, well you know, when,
when I was discussing this abstract with
00:00:57.850 --> 00:01:00.950
the science track people they tell me:
"Well, you know about three hundred people
00:01:00.950 --> 00:01:06.000
might be in the audience." But well, hey,
you are huge that's much more than three
00:01:06.000 --> 00:01:10.940
hundred people. So thank you so much for
inviting me over it's a real honor. And of
00:01:10.940 --> 00:01:15.310
course originally when talking to 300
people who are all science-interested, I
00:01:15.310 --> 00:01:20.590
thought you know I pick something fairly
narrow focuswise but then I learned I'm
00:01:20.590 --> 00:01:24.690
going to be in Saal one and that's
different, so I decided to make the scope
00:01:24.690 --> 00:01:30.670
a little bit wider and that's what I ended
up with. I'll talk a little bit about
00:01:30.670 --> 00:01:37.540
CERN and society as well if you so choose,
you'll see what that means in a minute. So
00:01:37.540 --> 00:01:41.680
the things I'll cover here is obviously
CERN just a little bit of an introduction
00:01:41.680 --> 00:01:46.100
how we do physics, how we do computing,
what data means to us and I can tell you
00:01:46.100 --> 00:01:51.810
it means everything, you heard about that
already, right? How we do data analysis in
00:01:51.810 --> 00:01:56.159
high energy physics and just because
we've been doing it for a while and
00:01:56.159 --> 00:02:00.530
because I've been doing it for more than
ten years, I'm one of the guys who's
00:02:00.530 --> 00:02:07.250
providing the software to do data
analysis in high energy physics, so, you
00:02:07.250 --> 00:02:11.360
know, because we know what we are doing
and we have some experience, I thought
00:02:11.360 --> 00:02:18.110
maybe you might be interested in hearing
what my forecast is for data analysis in
00:02:18.110 --> 00:02:25.430
general, in the future. So let's start
with CERN. And so if you wonder what CERN
00:02:25.430 --> 00:02:31.510
is, you've all heard about CERN, about
the fantastic fonts we love to use, then
00:02:31.510 --> 00:02:36.960
you've probably also heard that we are
doing science. We were founded right after
00:02:36.960 --> 00:02:41.450
the Second World War or soon after the
Second World War, basically as a way to
00:02:41.450 --> 00:02:47.458
entertain those freaky scientists. You
know that was the idea: peace Europe-wide.
00:02:47.458 --> 00:02:52.349
And damn, that's working out really well
and so well there's not just Europe
00:02:52.349 --> 00:02:57.530
anymore these days. We are located near
Geneva, we are doing only fundamental
00:02:57.530 --> 00:03:02.269
research, so we don't do any weapons,
nuclear stuff you
00:03:02.269 --> 00:03:10.230
know, these kind of things. The WWW was
invented at CERN but that was just a, you
00:03:10.230 --> 00:03:14.586
know, side effect happens sometimes, that
we invent things. But usually we just do
00:03:14.586 --> 00:03:22.500
science. So what we do is, we take money,
lots of, and brains who like to discuss
00:03:22.500 --> 00:03:27.210
and think and come up with ideas and from
that we generate knowledge. It's really
00:03:27.210 --> 00:03:33.000
all about curiosity. The things we try to
answer is what is mass? Which is funny
00:03:33.000 --> 00:03:37.371
question right? Like we all know what mass
is but actually we don't. We know what
00:03:37.371 --> 00:03:42.360
mass is in the universe. We understand
that masses attract one another: gravity.
00:03:42.360 --> 00:03:48.730
Which is beautifully correct. And in the
small scale, our particles, we know that
00:03:48.730 --> 00:03:52.940
mass is energy and we can convert them.
But we don't understand how these two
00:03:52.940 --> 00:03:58.319
things go together. Like there is no
bridge, they contradict one another. So we
00:03:58.319 --> 00:04:04.930
are trying to understand what that bridge
might be. Part of that mass thing is of
00:04:04.930 --> 00:04:08.650
course also what's out there in the
universe? That's a big question. We only
00:04:08.650 --> 00:04:14.230
understand a few percent of that. 90 and
some percent are completely unknown to
00:04:14.230 --> 00:04:20.349
us, and that's scary right? I mean we know
gravity really well, we can deal with
00:04:20.349 --> 00:04:27.560
freaky things like black holes and yet we
don't understand what's out there. Now to
00:04:27.560 --> 00:04:31.850
do all these things we are probing nature
at the smallest scale as we call it, so
00:04:31.850 --> 00:04:36.190
that's particles, we are dealing with
things like the Higgs particle and
00:04:36.190 --> 00:04:43.900
supersymmetry. Here's a little bit of a
fact sheet. We have about 12,000
00:04:43.900 --> 00:04:47.500
physicists who are working with CERN. We
are basically the workbench that you saw
00:04:47.500 --> 00:04:54.661
in Andre's talk before. We are the table
that physicists use, okay? And, so they
00:04:54.661 --> 00:04:59.050
come to CERN once in a while, about
10,000 physicists a year, or they work
00:04:59.050 --> 00:05:02.810
remotely most of the time from about 120
nations. So you're seeing it's not
00:05:02.810 --> 00:05:10.650
European anymore, this is a global thing.
CERN in itself has about 2,500 employees,
00:05:10.650 --> 00:05:15.490
you know those scrubbing the table,
setting things up and so on. And our
00:05:15.490 --> 00:05:21.190
table is right here. In the far end we
have the Alps, it's in Switzerland
00:05:21.190 --> 00:05:25.990
as I said, so the Alps are
always close, with Mont Blanc, we have the
00:05:25.990 --> 00:05:31.639
Lake Geneva we have the Jura, the French
Mountains on the lower end here, it's just
00:05:31.639 --> 00:05:37.410
beautiful. It's really nice, but we
needed to stick a 30-kilometer ring in
00:05:37.410 --> 00:05:43.861
there somewhere and people would have
hated us had we put it like this. But
00:05:43.861 --> 00:05:49.671
luckily people were smart back then in the
70s, and built a tunnel much better. So
00:05:49.671 --> 00:05:55.229
now we have this huge tunnel, and we send
particles through in both directions near
00:05:55.229 --> 00:06:00.351
the speed of light and the tunnel is
filled with magnets simply because if you
00:06:00.351 --> 00:06:08.110
don't use a magnet the particles will fly
straight but we need them to turn around.
00:06:08.110 --> 00:06:13.560
Here you see what it's looking like, you
also see these big halls there that have
00:06:13.560 --> 00:06:21.880
access shafts from the top and that's
where the experiments are. That's sort of
00:06:21.880 --> 00:06:29.210
a sketch of one of the experiments. So the
LHC is one of the, no, is the biggest
00:06:29.210 --> 00:06:35.889
particle accelerator at the moment, it's a
ring with 27 kilometers circumference, 100
00:06:35.889 --> 00:06:40.300
meters below Switzerland and France, it
has four big experiments and several
00:06:40.300 --> 00:06:45.270
small ones and we are expected to run
until 2030. So you see that all of that
00:06:45.270 --> 00:06:50.150
is large-scale simply because we're trying
to make good use of the money we have.
00:06:50.150 --> 00:06:56.020
Here, you see one of these caverns that
are used by the experiments while it was
00:06:56.020 --> 00:07:01.490
empty. The experiment was then lowered
through this hole by the roof, piece by
00:07:01.490 --> 00:07:07.190
piece, and these things are humongous. To
give you an impression of how big it is, I
00:07:07.190 --> 00:07:12.520
put Waldo in there, so your job for the
next three slides is to find Waldo. You
00:07:12.520 --> 00:07:15.800
know, that gives you the scale. He's
friendlily waving at you, so it should be
00:07:15.800 --> 00:07:21.990
easy to find him. So then we put a
detector in there. Here it's pulled apart
00:07:21.990 --> 00:07:26.160
a little bit, so it looks nicer, you can
actually see something. You can for
00:07:26.160 --> 00:07:31.039
example see the beam pipe, so that's where
the particles are flying through, and then
00:07:31.039 --> 00:07:34.880
they're coming from both directions and
colliding in the center of the detector
00:07:34.880 --> 00:07:38.490
and then things happen we try to
understand what
00:07:38.490 --> 00:07:44.790
is happening. That's yet another view,
frontal view on one of the detectors and
00:07:44.790 --> 00:07:51.060
now you have to imagine that, you know,
you can't just open up Amazon and order an
00:07:51.060 --> 00:07:56.210
LHC experiment, right, that's not how it
works. We do this stuff ourselves, like
00:07:56.210 --> 00:08:02.669
PhD students, postdocs, engineers. You
know, that's all done by hand, just like
00:08:02.669 --> 00:08:06.940
the microscope you saw before. Of course
you order the parts, but you know the
00:08:06.940 --> 00:08:11.060
design, the whole conception and actually
screwing these things together, making
00:08:11.060 --> 00:08:16.970
sure that all fits, is all done by hand.
And I find that just beautiful, I mean
00:08:16.970 --> 00:08:21.760
that's close to a miracle, right? That
nations, like people no matter what
00:08:21.760 --> 00:08:26.819
nation, people across the globe work
together to build such a huge thing and
00:08:26.819 --> 00:08:39.490
then you turn it on and it works. More or
less, but you get it to work. That's not
00:08:39.490 --> 00:08:44.310
my applause, that's your applause, because
you make this possible. Really, but it's,
00:08:44.310 --> 00:08:49.690
it's huge this is for me one of the things
I love most about CERN: That is this
00:08:49.690 --> 00:08:55.279
international thing that just works
smoothly. Now the detectors are like a
00:08:55.279 --> 00:09:01.310
massive camera. We have lots of pixels and
we take many, many pictures a second. We
00:09:01.310 --> 00:09:06.680
do this to identify particles and then
sort of estimate what has happened during
00:09:06.680 --> 00:09:15.470
the collision. Now, life at CERN is of
course an important ingredient for
00:09:15.470 --> 00:09:19.529
scientists as well, and if you live at
CERN then actually it's just work at CERN
00:09:19.529 --> 00:09:23.980
and that's what it's about. But it's not
that bad, so we hang out together in our
00:09:23.980 --> 00:09:30.040
control rooms, make sure that the
experiments work correctly. We also, you
00:09:30.040 --> 00:09:33.720
know, study the forces.
laughter
00:09:33.720 --> 00:09:38.740
We have scientific discourse, in the sun,
view on the Mont Blanc, with a good
00:09:38.740 --> 00:09:45.430
coffee. We have lectures and we are
lectured and of course, as you, we have
00:09:45.430 --> 00:09:54.570
more laptops than people. And, then we do
stuff and so this presentation is going to
00:09:54.570 --> 00:09:58.580
introduce you to some of the things we are
doing, and more on the computing and the
00:09:58.580 --> 00:10:04.100
society side as I said. But because I have
so much to talk about, I decided that
00:10:04.100 --> 00:10:08.810
you just build your own talk, you tell me
what you want to hear. So let's do this,
00:10:08.810 --> 00:10:14.410
you can choose between A, physics, and B,
model simulation and data. You remember
00:10:14.410 --> 00:10:18.620
these books like from the old days when we
were all young? It's that kind of thing,
00:10:18.620 --> 00:10:24.450
ok? You decide/design your own talk here.
So, by applause, do you want to hear about
00:10:24.450 --> 00:10:27.720
physics?
applause
00:10:27.720 --> 00:10:35.730
Okay. Or the model simulation data part?
louder applause
00:10:35.730 --> 00:10:45.101
Okay, there we go. So, this is what we
skip. Model simulation data it is. You're
00:10:45.101 --> 00:10:49.700
a strange crowd, first time I meet people
who don't want to hear about physics... no
00:10:49.700 --> 00:10:51.450
I'm kidding.
laughter
00:10:51.450 --> 00:10:53.800
Audience: inaudible interjection
laughter
00:10:53.800 --> 00:11:00.079
So model simulation data it is. So our
theory is actually incredibly precise.
00:11:00.079 --> 00:11:04.450
It's so precise that our basic job is
really really boring, because we already
00:11:04.450 --> 00:11:10.514
understand everything. Whenever there is a
collision, we know what's going to happen.
00:11:10.514 --> 00:11:15.430
Except for these very rare things. So we
are trying to find these very rare things
00:11:15.430 --> 00:11:19.580
out of this haystack of fairly boring
things that we really understand well. And
00:11:19.580 --> 00:11:25.589
the weird things are, for example,
monopoles, supersymmetry, or black holes.
00:11:25.589 --> 00:11:32.060
Now the theorists' job is to tell us what
we should be seeing in the detector, given
00:11:32.060 --> 00:11:42.347
some fancy physics. Then we use simulation
to see how our detector would respond to
00:11:42.347 --> 00:11:53.476
that. Now, of course the question is: We
are just counting, basically, when we do
00:11:53.476 --> 00:11:58.102
experiments and the question is: How often
do we need to see something to say: "Well,
00:11:58.102 --> 00:12:03.310
that's not just the ordinary. That is
something new, that's something that could
00:12:03.310 --> 00:12:09.870
be explained by a weird theory." We use the
detector simulation as I said to basically
00:12:09.870 --> 00:12:15.029
predict how much we expect to see things.
We use reconstruction software which
00:12:15.029 --> 00:12:20.680
tells us what has happened, or might have
happened in the detector to count how
00:12:20.680 --> 00:12:25.400
often we saw something. And then we use
statistics to compare these two and to say
00:12:25.400 --> 00:12:31.610
whether something is expected or not. Now,
that's fairly abstract but it's fairly
00:12:31.610 --> 00:12:36.905
common, a fairly common approach. For
example, if you look at climate versus
00:12:36.905 --> 00:12:40.331
weather, right, I mean we always have
temperature fluctuations because of
00:12:40.331 --> 00:12:46.480
weather, and the question is: Is that rise
in temperature because of a weather effect
00:12:46.480 --> 00:12:50.375
or because of a climate effect? Is that
large-scale or just a short-term
00:12:50.375 --> 00:12:55.610
fluctuation? So there, we have a very
similar problem and here what you do is
00:12:55.610 --> 00:13:00.880
you measure temperatures, and you want to
detect abnormal variations, and you can
00:13:00.880 --> 00:13:06.420
improve that by measuring longer, like,
for 300 years instead of 20 years. That
00:13:06.420 --> 00:13:11.930
gives you a better prediction what you
would expect in the future. Also, larger
00:13:11.930 --> 00:13:14.170
deviations help, right? If you look for
something that
00:13:14.170 --> 00:13:19.700
is just 0.1 degree, then you might not be
able to find it. If there is a deviation
00:13:19.700 --> 00:13:25.230
of 5 degrees, you will definitely find it.
And for us it's very similar. So here we
00:13:25.230 --> 00:13:31.610
have a plot, one of the first Higgs
discovery plots, and you can see that we
00:13:31.610 --> 00:13:38.800
have many ingredients there. So, the black
dots are what we measure and they have
00:13:38.800 --> 00:13:43.829
certain uncertainty, because when we
measure, we count and we might have, you
00:13:43.829 --> 00:13:48.977
know, not seen something, or we might have
seen more than we should have seen, so
00:13:48.977 --> 00:13:54.970
there's always an uncertainty. And then we
also have theory, which tells us you
00:13:54.970 --> 00:14:00.079
should have seen so many and so for the
red part that's something that we know
00:14:00.079 --> 00:14:04.889
exists, it's nothing spectacular. It's
simply what theory is telling us what we
00:14:04.889 --> 00:14:10.660
should be seeing. And you can see the data
follows the red part fairly well. But then
00:14:10.660 --> 00:14:15.980
there is this other bump in our dots on
the right-hand side or in the center and
00:14:15.980 --> 00:14:21.230
that does not make sense, unless you take
the Higgs into account, right, which is
00:14:21.230 --> 00:14:26.889
the light blue part and so here you can
see how this interplay between different
00:14:26.889 --> 00:14:38.280
sources of physics and statistics works
for us. Now just as for the climate, more
00:14:38.280 --> 00:14:43.690
data helps. And there are two versions of
more data: either by having more
00:14:43.690 --> 00:14:48.079
collisions, which is why we are running
24/7, or more data by combining different
00:14:48.079 --> 00:14:52.060
analyses which is what's happening here.
So here you see all these different
00:14:52.060 --> 00:14:56.990
analyses. If you combine them, of course
you get a much stronger prediction of, in
00:14:56.990 --> 00:15:03.300
this case, the Higgs mass, than if you
just take any single one of them. You see
00:15:03.300 --> 00:15:08.540
how similar what we are doing is to, you
know, any of the big data analyses out
00:15:08.540 --> 00:15:16.414
there. Okay, so that was that part. Now
comes the obligatory part again,
00:15:16.414 --> 00:15:22.930
computing. When we were designing the
LHC, not me, when people were designing the
00:15:22.930 --> 00:15:31.120
LHC, they needed to project computing
power from 1990 to 2000, 2010, and so on.
00:15:31.120 --> 00:15:34.140
And then they said: "Well, we need
massive amount of computers" and for you
00:15:34.140 --> 00:15:38.420
there's now "Ughhh - everybody has it, we
have it as well, we have our racks of
00:15:38.420 --> 00:15:44.240
computers". This is something that the big
companies usually don't show: You know,
00:15:44.240 --> 00:15:48.509
there is actually a ramp where the trucks
arrive and they offload the things and
00:15:48.509 --> 00:15:53.820
then someone needs to screw them together
and then it looks shiny. This is how we are
00:15:53.820 --> 00:16:00.870
spending our CPU time: We have about
60,000 cores that are spinning all the
00:16:00.870 --> 00:16:06.680
time for us, and they are distributed
around the world. You can see that CERN,
00:16:06.680 --> 00:16:14.529
for example, is the red part there near
the bottom. Yeah, so we make good use of
00:16:14.529 --> 00:16:20.829
that. We also monitor the efficiency, and
because 100 percent efficient is for
00:16:20.829 --> 00:16:29.300
beginners we are actually about 700
percent efficient. Don't ask why. They
00:16:29.300 --> 00:16:33.920
decided if you are multi-threading, then
we, you know, we multiply your efficiency
00:16:33.920 --> 00:16:39.950
by the number of threads you have. Makes
no sense to me. We also have storage,
00:16:39.950 --> 00:16:44.930
currently we use about 0.7 exabytes. We
also have available at one point seven
00:16:44.930 --> 00:16:49.130
exabytes, so that's good, we make use of
the storage we have. Where it's, you know,
00:16:49.130 --> 00:16:55.529
tera- peta- exa-, so it's a lot, and here
you can see on the right hand side you
00:16:55.529 --> 00:16:59.610
see, for example, the tape usage on the
bottom and you see this dip that was
00:16:59.610 --> 00:17:04.270
before we were starting the accelerator
again, we needed to make some space so we
00:17:04.270 --> 00:17:09.089
monitor our hard disk usage all the time.
Hey, here comes the next decision point:
00:17:09.089 --> 00:17:13.630
So, do you want to hear about, 1,
distributed computing or 2, measure
00:17:13.630 --> 00:17:17.839
effects of bugs. So, 1, distributed
computing
00:17:17.839 --> 00:17:26.470
applause
and 2, measure the effects of bugs
00:17:26.470 --> 00:17:35.560
similar amount of applause
Okay, so that's my call, and I would say
00:17:35.560 --> 00:17:41.455
we do we do... Measure the effects of
bugs, because it's shorter.
00:17:41.455 --> 00:17:47.130
laughter
So this is one of the views you can, you
00:17:47.130 --> 00:17:50.740
know, electronic views you can get from a
detector and you see how we trace the
00:17:50.740 --> 00:17:55.380
particles that fly through the detector.
Now, that software right, that's the
00:17:55.380 --> 00:17:59.927
result of software, and you might not
believe it, but we have bugs in there, in
00:17:59.927 --> 00:18:00.808
that software.
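
NOTE
A minimal numerical sketch of the argument that follows, not taken from the talk:
a bug that is wrong by a random amount on each collision widens the measured
distribution instead of shifting it, so it behaves like one more uncertainty.
All names and numbers (TRUE_VALUE, DETECTOR_SIGMA, BUG_SIGMA) are assumptions
chosen for illustration; a Gaussian wobble is only one possible model of such a bug.
```python
import random
import statistics
#
random.seed(42)            # fixed seed so the sketch is repeatable
TRUE_VALUE = 100.0         # hypothetical true quantity, arbitrary units
DETECTOR_SIGMA = 1.0       # assumed intrinsic measurement spread
BUG_SIGMA = 0.5            # assumed random wobble added by the bug
N_COLLISIONS = 100_000
#
# Measurements from a bug-free reconstruction.
clean = [random.gauss(TRUE_VALUE, DETECTOR_SIGMA) for _ in range(N_COLLISIONS)]
# The same measurements, each pushed a little left or right by the bug.
buggy = [x + random.gauss(0.0, BUG_SIGMA) for x in clean]
#
spread_clean = statistics.stdev(clean)
spread_buggy = statistics.stdev(buggy)
# Independent random effects add in quadrature.
expected_buggy = (DETECTOR_SIGMA**2 + BUG_SIGMA**2) ** 0.5
#
print(f"spread without bug: {spread_clean:.3f}")
print(f"spread with bug:    {spread_buggy:.3f}")
print(f"quadrature sum:     {expected_buggy:.3f}")
```
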
00:18:02.849 --> 00:18:07.260
And you know, these bugs are sometimes
wrong coordinate transformations, so
00:18:07.260 --> 00:18:12.590
things don't go this way but that way,
it's kind of weird if you look at it, and
00:18:12.590 --> 00:18:17.470
the result is that our particles don't go
through the path that they should have
00:18:17.470 --> 00:18:25.190
been going, but we are attributing them a
different path. Now, the nice thing
00:18:25.190 --> 00:18:30.960
is that we are doing this a million times,
right? So all of that is smeared. We are
00:18:30.960 --> 00:18:35.730
not systematically doing this wrong it's
just, we are always doing it a little bit
00:18:35.730 --> 00:18:41.669
wrong. And so the net result is that if we
measure our particles, we will not measure
00:18:41.669 --> 00:18:46.861
the right thing but always a little bit
wobbly left wobbly right you know? Things
00:18:46.861 --> 00:18:53.809
are not as precise. That's simply an
uncertainty. So for us just like counting
00:18:53.809 --> 00:18:59.059
has an uncertainty and predictions have
an uncertainty, software bugs introduce
00:18:59.059 --> 00:19:05.559
another source of uncertainties. And here
you can see how we are tracking
00:19:05.559 --> 00:19:09.370
uncertainties for all of our
analyses. We are trying to understand the
00:19:09.370 --> 00:19:16.220
different sources of uncertainties. And
again, bugs are only one of the sources
00:19:16.220 --> 00:19:22.880
here, so if we find the bug then we
reduce our uncertainty and we can find new
00:19:22.880 --> 00:19:27.760
physics earlier, instead of having to
wait and collect more data. So for us
00:19:27.760 --> 00:19:32.210
finding bugs is really key, we really
love finding bugs because it brings
00:19:32.210 --> 00:19:36.710
physics closer. I thought that was
interesting. It's kind of rare that you're
00:19:36.710 --> 00:19:42.140
in an environment where you're able to
measure the effect of bugs. Okay, so now
00:19:42.140 --> 00:19:47.870
we are talking, we'll be talking about
data. I talked, told you that we are
00:19:47.870 --> 00:19:52.690
trying to find particle traces in our
data and the way we do this is by using
00:19:52.690 --> 00:19:56.700
reconstruction programs and there are
multiple gigabytes of binaries in shared
00:19:56.700 --> 00:20:01.799
libraries and stuff. They're huge, they're
experiment specific and they are curated
00:20:01.799 --> 00:20:06.270
by the experiments, open-source for some
of them, and we want them to be correct
00:20:06.270 --> 00:20:14.140
and efficient. The data format we use is
not comma separated values, it's binary
00:20:14.140 --> 00:20:21.080
and for some strange reason it's our own
custom binary format. The reason is that
00:20:21.080 --> 00:20:26.990
it's really targeted at the kind of
data we are having. We have collisions
00:20:26.990 --> 00:20:32.230
that are independent, so we only need one
in memory at any time and we have nested
00:20:32.230 --> 00:20:38.590
collections which makes the regular table
layout a non-starter. We actually generate
00:20:38.590 --> 00:20:44.430
them from C++ objects so from classes,
class definitions, C++ class definitions
00:20:44.430 --> 00:20:51.320
and we can read them back into C++ but
also into JavaScript or Scala. Databases
00:20:51.320 --> 00:20:56.840
just didn't do it for us. They have the
wrong model of data access, they don't
00:20:56.840 --> 00:21:02.940
scale, it's just not the kind of system
that works for us. Also using a file
00:21:02.940 --> 00:21:09.390
system as a storage back-end might sound
really very traditional and boring but it
00:21:09.390 --> 00:21:13.890
works amazingly well and seems to be
future proof as well, so that's just the
00:21:13.890 --> 00:21:20.360
way to go for us. There are many other
structured data formats out there, many of
00:21:20.360 --> 00:21:26.000
those did not exist when we started ROOT,
our own data format. But they also miss
00:21:26.000 --> 00:21:30.250
many things. For example, we wanted to
make sure that we have schema evolution
00:21:30.250 --> 00:21:33.970
support. We can change the class layout
and still read back all data. We don't
00:21:33.970 --> 00:21:38.750
want to throw away all data just because
we're changing the class. Also we do not
00:21:38.750 --> 00:21:43.370
trust people. That is a, you know, as a
computer scientist or whatever you
00:21:43.370 --> 00:21:46.750
probably know what I'm talking about
right? If people have to write their own
00:21:46.750 --> 00:21:50.630
streaming algorithm, there will be bugs
and we will lose data.
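
NOTE
A minimal sketch of the schema-evolution idea described above, not ROOT's actual
implementation: the reader and writer are derived from the class definition, so
nobody hand-writes streaming code, and data written with an older class layout
can still be read after the class changes. TrackV1, TrackV2, write and read are
hypothetical names; JSON stands in for the binary format purely for readability.
```python
from dataclasses import dataclass, fields, asdict
import json
#
# Version 1 of a hypothetical event class.
@dataclass
class TrackV1:
    px: float
    py: float
#
# Version 2 adds a member with a default; files written with V1 lack it.
@dataclass
class TrackV2:
    px: float
    py: float
    charge: int = 0
#
def write(obj):
    """Serialize every field listed in the class definition."""
    return json.dumps(asdict(obj))
#
def read(cls, payload):
    """Rebuild cls: stored members it no longer has are dropped, new ones take defaults."""
    stored = json.loads(payload)
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in stored.items() if k in known})
#
old_file = write(TrackV1(px=1.5, py=-0.5))   # data written before the class changed
track = read(TrackV2, old_file)              # read back with the new layout
print(track)  # TrackV2(px=1.5, py=-0.5, charge=0)
```
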
00:21:50.630 --> 00:21:54.610
We really don't want to do this, so we
were trying to automate this, based on the
00:21:54.610 --> 00:22:03.070
class definition. So, last decision point
for the story. Do you want to hear about
00:22:03.070 --> 00:22:10.409
cling, our C++ interpreter or about Open
Data and Applied Science? Let's start with
00:22:10.409 --> 00:22:14.860
option 1, the C++ interpreter
applause
00:22:14.860 --> 00:22:21.106
Okay and and Open Data and Applied
Science?
00:22:21.106 --> 00:22:29.679
more applause than before
Yeah. I'm heading there. You miss a fish.
00:22:29.679 --> 00:22:35.299
You can look at the slides later. Okay, so
there we go. Really? No. The slide number
00:22:35.299 --> 00:22:41.140
is wrong. Oh a bug! So, Open Data and
Applied Science. Okay, you really wanted
00:22:41.140 --> 00:22:47.700
to know about our budget, I understand
that. So we get from you about 1 billion a
00:22:47.700 --> 00:22:50.719
year and the currency doesn't really
matter anymore at this, at this point of
00:22:50.719 --> 00:22:54.200
time.
laughter
00:22:54.200 --> 00:23:01.230
And that is a lot of money. And you know?
We try to do really wonderful things, I
00:23:01.230 --> 00:23:04.943
mean we really enjoy our job, we love it.
It's fantastic to work in such an
00:23:04.943 --> 00:23:09.248
environment. And thank you very much for
making that possible. Really, I mean it.
00:23:11.110 --> 00:23:16.691
But it also means that you decided as
society to enable something like CERN.
00:23:17.473 --> 00:23:22.140
Which I think really deserves my applause
and yours probably as well. I think it's a
00:23:22.140 --> 00:23:24.425
great decision to do something like this.
00:23:24.425 --> 00:23:30.211
applause
00:23:31.325 --> 00:23:35.690
So we realize this, right? We realized
that we are basically, that we can do what
00:23:35.690 --> 00:23:40.210
we do because of you, and we are trying to
react to that by giving back what we do.
00:23:40.210 --> 00:23:47.460
Software, research results, hardware and
data. So the way we share research results
00:23:47.460 --> 00:23:52.600
is through open access. We have it,
finally. It took us a long time to fight
00:23:52.600 --> 00:23:57.570
with publishers and, you know, the
establishment, but now we have it. We
00:23:57.570 --> 00:23:59.220
also, yes thank you.
00:23:59.220 --> 00:24:03.395
applause
00:24:03.395 --> 00:24:07.520
We also put a lot of effort in
communicating our results and what we are
00:24:07.520 --> 00:24:12.680
doing. And if you're in the region, it's
definitely worth a visit. I mean the URL
00:24:12.680 --> 00:24:17.590
is really easy to remember, it's
visit.cern, and you know, works. And you
00:24:17.590 --> 00:24:22.270
should go there by April, actually, if you
can because then you can ask people how to
00:24:22.270 --> 00:24:27.580
get on the ground, because the accelerator
is off at the moment. We also do applied
00:24:27.580 --> 00:24:32.320
research, for example we have this super
cool experiment where we try to study how
00:24:32.320 --> 00:24:39.630
clouds form, based on cosmic rays. So the
influence of cosmic rays on cloud
00:24:39.630 --> 00:24:45.770
formation. Which is a key element in the
uncertainty of climate models. We are
00:24:45.770 --> 00:24:50.440
trying to, to think about, you know, how
to make energy from nuclear waste. So
00:24:50.440 --> 00:24:54.830
getting rid of nuclear waste while making
energy from it. And we are trying to
00:24:54.830 --> 00:25:02.070
repurpose detectors that we have and you
know develop. We have something called
00:25:02.070 --> 00:25:08.330
open hardware, for example White Rabbit:
deterministic ethernet, we have Open Data,
00:25:08.330 --> 00:25:12.789
and we have the LHC@home and some other
programs, where either you can donate
00:25:12.789 --> 00:25:21.250
compute power or your brain and help us
get better results. We explicitly try to
00:25:21.250 --> 00:25:25.747
use open source as much as possible, and
also feed back, whenever we see issues.
00:25:27.700 --> 00:25:33.620
But we also create open source. For
example, we create Geant, which is a
00:25:33.620 --> 00:25:37.831
program that allows you to simulate how
particles fly through matter, for
00:25:37.831 --> 00:25:44.610
example used by NASA. We have Indico,
which allows us to schedule meetings,
00:25:44.610 --> 00:25:48.940
upload slides, you know, these kind of
things. Across the globe, lots of people,
00:25:48.940 --> 00:25:52.970
with access protection, all these kind of
things. And it's open source. We have
00:25:52.970 --> 00:25:58.919
DaviX, did I mention we love HTTP? That's
the NeXT machine of Tim Berners-Lee. And
00:25:58.919 --> 00:26:03.140
that's his futile effort in trying to
prevent the cleaning personnel from
00:26:03.140 --> 00:26:07.530
switching it off. They don't speak
English, they did not back then at least.
00:26:09.337 --> 00:26:15.500
So we use DaviX to transfer files
over HTTP, with a high bandwidth. Or we
00:26:15.500 --> 00:26:21.241
have CVM-FS, which allows us to distribute
our binaries across the globe, and not
00:26:21.241 --> 00:26:26.570
rely on admins downloading stuff and
making sure it actually runs, and these
00:26:26.570 --> 00:26:31.581
kind of things. That is a lifesaver, it's
really fantastic, it's a great tool. But
00:26:31.581 --> 00:26:37.730
nobody knows it. And we have ROOT, but
that's coming up. So now, the last
00:26:37.730 --> 00:26:42.534
official part of this, of this
presentation, how do we do data analysis?
00:26:42.534 --> 00:26:44.950
Not like that.
laughter
00:26:44.950 --> 00:26:52.210
applause
We use, we use C++ and actually physicists
00:26:52.210 --> 00:26:58.140
need to write their own analysis in C++.
We have very few people who have an actual
00:26:58.140 --> 00:27:03.876
education in programming, so that's sort
of a clash. As I said, we need to keep one
00:27:04.607 --> 00:27:08.460
collision in memory. And for what, you
know, what matters to us is throughput. We
00:27:08.460 --> 00:27:13.340
want to have, we want to analyze as many
collisions as possible per second. What we
00:27:13.340 --> 00:27:17.390
can do, is specialize our data format to
match the analysis, because we don't want
00:27:17.390 --> 00:27:23.419
to waste I/O cycles, if we can, you know,
if we can make use of the CPU better. ROOT
00:27:23.419 --> 00:27:29.110
has allowed us to do this for twenty years.
It's really the workhorse for the analysis
00:27:29.110 --> 00:27:35.200
in high energy physics. And it's also an
interface to complex software. We have
00:27:35.200 --> 00:27:40.950
serialization facilities, we have the
statistical tools, that people need, and
00:27:40.950 --> 00:27:44.480
we have graphics, because once you have
done your analysis you need to communicate
00:27:44.480 --> 00:27:48.500
that to your peers and convince people,
and publish, and so on, so that's part of
00:27:48.500 --> 00:27:54.169
the game. All of that is open source, and,
of course, all of that is not just used by
00:27:54.169 --> 00:28:03.370
high energy physics. So, to conclude: We
are here, because you make it possible.
00:28:03.370 --> 00:28:05.223
Thank you very much. It's fantastic to
have you.
00:28:05.223 --> 00:28:10.860
applause
We want to share and we have great people
00:28:10.860 --> 00:28:17.080
for science outreach, but we have nobody
for software outreach, basically. So maybe
00:28:17.080 --> 00:28:24.570
it's worth a look to see what CERN is
producing software-wise. Scientific
00:28:24.570 --> 00:28:29.940
computing is nothing new, it has existed for
a long time, but we had to start fairly
00:28:29.940 --> 00:28:35.490
early on a large scale. So when we were
building it up, we had to take... we were
00:28:35.490 --> 00:28:39.960
trying to take pieces that existed and did
not find much. So now we ended up
00:28:39.960 --> 00:28:45.179
with C++ data serialization, efficient
computing for non-computer scientists
00:28:45.179 --> 00:28:49.660
even... In the part that I skipped and,
you know, one of the alternate tracks, you
00:28:49.660 --> 00:28:54.289
would have seen that we have a Python
binding as well for the whole software
00:28:54.289 --> 00:28:59.970
stack in C++. And for us, what matters
most is scale. Now we are seeing that we
00:28:59.970 --> 00:29:04.309
are not the only ones. There are many more
natural sciences arriving at a similar
00:29:04.309 --> 00:29:09.120
challenge of having to analyze large
amounts of data. Now I promised to you
00:29:09.120 --> 00:29:12.480
that I'll be bold and I'll try to make a
few statements of what will happen with
00:29:12.480 --> 00:29:16.750
data analysis, not just in science.
Because what we see is that we actually
00:29:16.750 --> 00:29:22.610
educate the people who will do data
analysis, not just in science. What we see
00:29:22.610 --> 00:29:30.990
is that in the past, data volume mattered
most. So more data meant more power. Now
00:29:30.990 --> 00:29:35.929
that's not the complete truth anymore.
It's a lot about finding correlations. So
00:29:35.929 --> 00:29:40.880
even with the amount of data not growing
anymore, because it's already humongous,
00:29:40.880 --> 00:29:46.320
we try to squeeze more knowledge out of
it. And for that, I/O becomes important
00:29:46.320 --> 00:29:53.900
and CPU limitations are the crucial factor.
We see that multivariate techniques are
00:29:53.900 --> 00:29:59.029
still rising and they will just be part of
the toolchain of the statistical tools;
00:29:59.852 --> 00:30:06.681
except for generative parts, which, I
believe, will change the way we model.
00:30:10.232 --> 00:30:16.361
Now, based on what I just described, this
is not a big surprise anymore. As we need
00:30:16.361 --> 00:30:21.210
throughput, we need to have a language for
the core analysis part, that is close to
00:30:21.210 --> 00:30:26.970
the metal, so something like C++.
On the other hand writing analyses is
00:30:26.970 --> 00:30:31.791
still complex, so you need a higher-level
language and for that people could, for
00:30:31.791 --> 00:30:35.929
example, use Python. So, now language
binding becomes relevant all of a sudden.
00:30:35.929 --> 00:30:42.010
It will be much more important in the future.
And we need to tailor I/O to the actual
00:30:42.010 --> 00:30:48.910
analysis to not waste CPU cycles. So
throughput is king and, from my point of
00:30:48.910 --> 00:30:54.331
view, also in the future we will see much
more effort in increasing the throughput.
00:30:55.600 --> 00:31:03.115
Okay, so that was it. In case you want to
discuss anything with me, like "That's
00:31:03.115 --> 00:31:07.970
just wrong!", that's fine. I probably
have several bugs in there. I'm still here
00:31:07.970 --> 00:31:12.909
until tomorrow. I don't know where yet,
so I'll wander around and you can contact
00:31:12.909 --> 00:31:16.818
me by email or Twitter. Thank you very
much for your attention. Thank you.
00:31:16.818 --> 00:31:20.525
applause
00:31:20.525 --> 00:31:27.990
music
00:31:27.990 --> 00:31:45.000
subtitles created by c3subtitles.de
in the year 2017. Join, and help us!