1
00:00:00,000 --> 00:00:13,119
music
2
00:00:13,119 --> 00:00:17,190
Herald: Good morning and welcome back to
stage one. It's kind of going to be the
3
00:00:17,190 --> 00:00:21,490
second talk about physics on this day
already and it's about big data and
4
00:00:21,490 --> 00:00:27,150
science and big data became something like
Uber in science. It's everywhere, every
5
00:00:27,150 --> 00:00:33,370
discipline has it. Axel Naumann's working
for CERN, the accelerator in Switzerland
6
00:00:33,370 --> 00:00:39,160
and he talks about how physics and
computing bridge in this area and he works
7
00:00:39,160 --> 00:00:43,183
a lot with ROOT, a program that helps
transform data into knowledge. A warm
8
00:00:43,183 --> 00:00:44,650
welcome.
9
00:00:44,650 --> 00:00:45,262
Axel Naumann: Thank you.
10
00:00:45,262 --> 00:00:51,260
applause
11
00:00:51,260 --> 00:00:57,850
AN: Thanks a lot. So, well you know, when,
when I was discussing this abstract with
12
00:00:57,850 --> 00:01:00,950
the science track people they tell me:
"Well, you know about three hundred people
13
00:01:00,950 --> 00:01:06,000
might be in the audience." But well, hey,
you are huge, that's much more than three
14
00:01:06,000 --> 00:01:10,940
hundred people. So thank you so much for
inviting me over it's a real honor. And of
15
00:01:10,940 --> 00:01:15,310
course originally, when talking to 300
people who are all science interested, I
16
00:01:15,310 --> 00:01:20,590
thought, you know, I'd pick something fairly
narrow focus-wise, but then I learned I'm
17
00:01:20,590 --> 00:01:24,690
going to be in Saal one and that's
different, so I decided to make the scope
18
00:01:24,690 --> 00:01:30,670
a little bit wider and that's what I ended
up with. I'll talk a little bit about
19
00:01:30,670 --> 00:01:37,540
CERN and society as well if you so choose,
you'll see what that means in a minute. So
20
00:01:37,540 --> 00:01:41,680
the things I'll cover here is obviously
CERN just a little bit of an introduction
21
00:01:41,680 --> 00:01:46,100
how we do physics, how we do computing,
what data means to us and I can tell you
22
00:01:46,100 --> 00:01:51,810
it means everything, you heard about that
already, right? How we do data analysis in
23
00:01:51,810 --> 00:01:56,159
high energy physics and just because
we've been doing it for a while and
24
00:01:56,159 --> 00:02:00,530
because I've been doing it for more than
ten years, I'm one of the guys who's
25
00:02:00,530 --> 00:02:07,250
providing the software to do data
analysis in high energy physics, so, you
26
00:02:07,250 --> 00:02:11,360
know, because we know what we are doing
and we have some experience, I thought
27
00:02:11,360 --> 00:02:18,110
maybe you might be interested in hearing
what my forecast is for data analysis in
28
00:02:18,110 --> 00:02:25,430
general, in the future. So let's start
with CERN. And so if you wonder what CERN
29
00:02:25,430 --> 00:02:31,510
is, you've all heard about CERN, about
the fantastic fonts we love to use, then
30
00:02:31,510 --> 00:02:36,960
you've probably also heard that we are
doing science. We were founded right after
31
00:02:36,960 --> 00:02:41,450
the Second World War or soon after the
Second World War, basically as a way to
32
00:02:41,450 --> 00:02:47,458
entertain those freaky scientists. You
know, that was the idea: peace Europe-wide.
33
00:02:47,458 --> 00:02:52,349
And damn, that's working out really well
and so well there's not just Europe
34
00:02:52,349 --> 00:02:57,530
anymore these days. We are located near
Geneva, we are doing only fundamental
35
00:02:57,530 --> 00:03:02,269
research, so we don't do any weapons,
nuclear stuff you
36
00:03:02,269 --> 00:03:10,230
know, these kind of things. The WWW was
invented at CERN but that was just a, you
37
00:03:10,230 --> 00:03:14,586
know, side effect. Happens sometimes, that
we invent things. But usually we just do
38
00:03:14,586 --> 00:03:22,500
science. So what we do is, we take money,
lots of it, and brains who like to discuss
39
00:03:22,500 --> 00:03:27,210
and think and come up with ideas and from
that we generate knowledge. It's really
40
00:03:27,210 --> 00:03:33,000
all about curiosity. The things we try to
answer is what is mass? Which is funny
41
00:03:33,000 --> 00:03:37,371
question right? Like we all know what mass
is but actually we don't. We know what
42
00:03:37,371 --> 00:03:42,360
mass is in the universe. We understand
that masses attract one another: gravity.
43
00:03:42,360 --> 00:03:48,730
Which is beautifully correct. And in the
small scale, our particles, we know that
44
00:03:48,730 --> 00:03:52,940
mass is energy and we can convert them.
But we don't understand how these two
45
00:03:52,940 --> 00:03:58,319
things go together. Like there is no
bridge, they contradict one another. So we
46
00:03:58,319 --> 00:04:04,930
are trying to understand what that bridge
might be. Part of that mass thing is of
47
00:04:04,930 --> 00:04:08,650
course also what's out there in the
universe? That's a big question. We only
48
00:04:08,650 --> 00:04:14,230
understand a few percent of that. 90 and
some percent are completely unknown to
49
00:04:14,230 --> 00:04:20,349
us, and that's scary right? I mean we know
gravity really well, we can deal with
50
00:04:20,349 --> 00:04:27,560
freaky things like black holes and yet we
don't understand what's out there. Now to
51
00:04:27,560 --> 00:04:31,850
do all these things we are probing nature
at the smallest scale as we call it, so
52
00:04:31,850 --> 00:04:36,190
that's particles, we are dealing with
things like the Higgs particle and
53
00:04:36,190 --> 00:04:43,900
supersymmetry. Here's a little bit of a
fact sheet. We have about 12,000
54
00:04:43,900 --> 00:04:47,500
physicists who are working with CERN. We
are basically the workbench that you saw
55
00:04:47,500 --> 00:04:54,661
in Andre's talk before. We are the table
that physicists use, okay? And, so they
56
00:04:54,661 --> 00:04:59,050
come to CERN once in a while, about
10,000 physicists a year, or they work
57
00:04:59,050 --> 00:05:02,810
remotely most of the time from about 120
nations. So you're seeing it's not
58
00:05:02,810 --> 00:05:10,650
European anymore, this is a global thing.
CERN in itself has about 2,500 employees,
59
00:05:10,650 --> 00:05:15,490
you know those scrubbing the table,
setting things up and so on. And our
60
00:05:15,490 --> 00:05:21,190
table is right here. In the far end we
have the Alps, it's in Switzerland
61
00:05:21,190 --> 00:05:25,990
as I said, so the Alps are
always close, with Mont Blanc, we have the
62
00:05:25,990 --> 00:05:31,639
Lake Geneva we have the Jura, the French
Mountains on the lower end here, it's just
63
00:05:31,639 --> 00:05:37,410
beautiful. It's really nice, but we
needed to stick a 30-kilometer ring in
64
00:05:37,410 --> 00:05:43,861
there somewhere and people would have
hated us had we put it like this. But
65
00:05:43,861 --> 00:05:49,671
luckily people were smart back then in the
70s, and built a tunnel. Much better. So
66
00:05:49,671 --> 00:05:55,229
now we have this huge tunnel, and we send
particles through in both directions near
67
00:05:55,229 --> 00:06:00,351
the speed of light and the tunnel is
filled with magnets simply because if you
68
00:06:00,351 --> 00:06:08,110
don't use a magnet the particles will fly
straight but we need them to turn around.
69
00:06:08,110 --> 00:06:13,560
Here you see what it's looking like, you
also see these big halls there that have
70
00:06:13,560 --> 00:06:21,880
access shafts from the top and that's
where the experiments are. That's sort of
71
00:06:21,880 --> 00:06:29,210
a sketch of one of the experiments. So the
LHC is one of the, no, is the biggest
72
00:06:29,210 --> 00:06:35,889
particle accelerator at the moment, it's a
ring with 27 kilometers circumference, 100
73
00:06:35,889 --> 00:06:40,300
meters below Switzerland and France, it
has four big experiments and several
74
00:06:40,300 --> 00:06:45,270
small ones and we are expected to run
until 2030. So you see that all of that
75
00:06:45,270 --> 00:06:50,150
is large-scale simply because we're trying
to make good use of the money we have.
76
00:06:50,150 --> 00:06:56,020
Here, you see one of these caverns that
are used by the experiments while it was
77
00:06:56,020 --> 00:07:01,490
empty. The experiment was then lowered
through this hole by the roof, piece by
78
00:07:01,490 --> 00:07:07,190
piece, and these things are humongous. To
give you an impression of how big it is, I
79
00:07:07,190 --> 00:07:12,520
put Waldo in there, so your job for the
next three slides is to find Waldo. You
80
00:07:12,520 --> 00:07:15,800
know, that gives you the scale. He's
friendlily waving at you, so it should be
81
00:07:15,800 --> 00:07:21,990
easy to find him. So then we put a
detector in there. Here it's pulled apart
82
00:07:21,990 --> 00:07:26,160
a little bit, so it looks nicer, you can
actually see something. You can for
83
00:07:26,160 --> 00:07:31,039
example see the beam pipe, so that's where
the particles are flying through, and then
84
00:07:31,039 --> 00:07:34,880
they're coming from both directions and
colliding in the center of the detector
85
00:07:34,880 --> 00:07:38,490
and then things happen we try to
understand what
86
00:07:38,490 --> 00:07:44,790
is happening. That's yet another view,
frontal view on one of the detectors and
87
00:07:44,790 --> 00:07:51,060
now you have to imagine that, you know,
you can't just open up Amazon and order an
88
00:07:51,060 --> 00:07:56,210
LHC experiment, right, that's not how it
works. We do this stuff ourselves, like
89
00:07:56,210 --> 00:08:02,669
PhD students, postdocs, engineers. You
know, that's all done by hand, just like
90
00:08:02,669 --> 00:08:06,940
the microscope you saw before. Of course
you order the parts, but you know the
91
00:08:06,940 --> 00:08:11,060
design, the whole conception and actually
screwing these things together, making
92
00:08:11,060 --> 00:08:16,970
sure that all fits, is all done by hand.
And I find that just beautiful, I mean
93
00:08:16,970 --> 00:08:21,760
that's close to a miracle, right? That
nations, like people no matter what
94
00:08:21,760 --> 00:08:26,819
nation, people across the globe work
together to build such a huge thing and
95
00:08:26,819 --> 00:08:39,490
then you turn it on and it works. More or
less, but you get it to work. That's not
96
00:08:39,490 --> 00:08:44,310
my applause, that's your applause, because
you make this possible. Really, but it's,
97
00:08:44,310 --> 00:08:49,690
it's huge. This is for me one of the things
I love most about CERN: That is this
98
00:08:49,690 --> 00:08:55,279
international thing that just works
smoothly. Now the detectors are like a
99
00:08:55,279 --> 00:09:01,310
massive camera. We have lots of pixels and
we take many, many pictures a second. We
100
00:09:01,310 --> 00:09:06,680
do this to identify particles and then
sort of estimate what has happened during
101
00:09:06,680 --> 00:09:15,470
the collision. Now, life at CERN is of
course an important ingredient for
102
00:09:15,470 --> 00:09:19,529
scientists as well, and if you live at
CERN then actually it's just work at CERN
103
00:09:19,529 --> 00:09:23,980
and that's what it's about. But it's not
that bad, so we hang out together in our
104
00:09:23,980 --> 00:09:30,040
control rooms, make sure that the
experiments work correctly. We also, you
105
00:09:30,040 --> 00:09:33,720
know, study the forces.
laughter
106
00:09:33,720 --> 00:09:38,740
We have scientific discourse, in the sun,
view on the Mont Blanc, with a good
107
00:09:38,740 --> 00:09:45,430
coffee. We have lectures and we are
lectured and of course, as you, we have
108
00:09:45,430 --> 00:09:54,570
more laptops than people. And, then we do
stuff and so this presentation is going to
109
00:09:54,570 --> 00:09:58,580
introduce you to some of the things we are
doing, and more on the computing and the
110
00:09:58,580 --> 00:10:04,100
society side as I said. But because I have
so much to talk about, I decided that
111
00:10:04,100 --> 00:10:08,810
you just build your own talk, you tell me
what you want to hear. So let's do this,
112
00:10:08,810 --> 00:10:14,410
you can choose between A, physics, and B,
model simulation and data. You remember
113
00:10:14,410 --> 00:10:18,620
these books like from the old days when we
were all young? It's that kind of thing,
114
00:10:18,620 --> 00:10:24,450
ok? You design your own talk here.
So, by applause, do you want to hear about
115
00:10:24,450 --> 00:10:27,720
physics?
applause
116
00:10:27,720 --> 00:10:35,730
Okay. Or the model simulation data part?
louder applause
117
00:10:35,730 --> 00:10:45,101
Okay, there we go. So, this is what we
skip. Model simulation data it is. You're
118
00:10:45,101 --> 00:10:49,700
a strange crowd, first time I meet people
who don't want to hear about physics... no
119
00:10:49,700 --> 00:10:51,450
I'm kidding.
laughter
120
00:10:51,450 --> 00:10:53,800
Audience: inaudible interjection
laughter
121
00:10:53,800 --> 00:11:00,079
So model simulation data it is. So our
theory is actually incredibly precise.
122
00:11:00,079 --> 00:11:04,450
It's so precise that our basic job is
really really boring, because we already
123
00:11:04,450 --> 00:11:10,514
understand everything. Whenever there is a
collision, we know what's going to happen.
124
00:11:10,514 --> 00:11:15,430
Except for these very rare things. So we
are trying to find these very rare things
125
00:11:15,430 --> 00:11:19,580
out of this haystack of fairly boring
things that we really understand well. And
126
00:11:19,580 --> 00:11:25,589
the weird things are, for example,
monopoles, supersymmetry, or black holes.
127
00:11:25,589 --> 00:11:32,060
Now the theorists' job is to tell us what
we should be seeing in the detector, given
128
00:11:32,060 --> 00:11:42,347
some fancy physics. Then we use simulation
to see how our detector would respond to
129
00:11:42,347 --> 00:11:53,476
that. Now, of course the question is: We
are just counting, basically, when we do
130
00:11:53,476 --> 00:11:58,102
experiments and the question is: How often
do we need to see something to say: "Well,
131
00:11:58,102 --> 00:12:03,310
that's not just the ordinary. That is
something new, that's something that could
132
00:12:03,310 --> 00:12:09,870
be explained by a weird theory." We use the
detector simulation as I said to basically
133
00:12:09,870 --> 00:12:15,029
predict how much we expect to see things.
We use reconstruction software which
134
00:12:15,029 --> 00:12:20,680
tells us what has happened, or might have
happened in the detector to count how
135
00:12:20,680 --> 00:12:25,400
often we saw something. And then we use
statistics to compare these two and to say
136
00:12:25,400 --> 00:12:31,610
whether something is expected or not. Now,
that's fairly abstract but it's fairly
137
00:12:31,610 --> 00:12:36,905
common, a fairly common approach. For
example, if you look at climate versus
138
00:12:36,905 --> 00:12:40,331
weather, right, I mean we always have
temperature fluctuations because of
139
00:12:40,331 --> 00:12:46,480
weather, and the question is: Is that rise
in temperature because of a weather effect
140
00:12:46,480 --> 00:12:50,375
or because of a climate effect? Is that
large-scale or just a short-term
141
00:12:50,375 --> 00:12:55,610
fluctuation? So there, we have a very
similar problem and here what you do is
142
00:12:55,610 --> 00:13:00,880
you measure temperatures, and you want to
detect abnormal variations, and you can
143
00:13:00,880 --> 00:13:06,420
improve that by measuring longer, like,
for 300 years instead of 20 years. That
144
00:13:06,420 --> 00:13:11,930
gives you a better prediction of what you
would expect in the future. Also, larger
145
00:13:11,930 --> 00:13:14,170
deviations help, right? If you look for
something that
146
00:13:14,170 --> 00:13:19,700
is just 0.1 degree, then you might not be
able to find it. If there is a deviation
147
00:13:19,700 --> 00:13:25,230
of 5 degrees, you will definitely find it.
And for us it's very similar. So here we
148
00:13:25,230 --> 00:13:31,610
have a plot, one of the first Higgs
discovery plots, and you can see that we
149
00:13:31,610 --> 00:13:38,800
have many ingredients there. So, the black
dots are what we measure and they have
150
00:13:38,800 --> 00:13:43,829
a certain uncertainty, because when we
measure, we count and we might have, you
151
00:13:43,829 --> 00:13:48,977
know, not seen something, or we might have
seen more than we should have seen, so
152
00:13:48,977 --> 00:13:54,970
there's always an uncertainty. And then we
also have theory, which tells us you
153
00:13:54,970 --> 00:14:00,079
should have seen so many and so for the
red part that's something that we know
154
00:14:00,079 --> 00:14:04,889
exists, it's nothing spectacular. It's
simply what theory is telling us we
155
00:14:04,889 --> 00:14:10,660
should be seeing. And you can see the data
follows the red part fairly well. But then
156
00:14:10,660 --> 00:14:15,980
there is this other bump in our dots on
the right-hand side or in the center and
157
00:14:15,980 --> 00:14:21,230
that does not make sense, unless you take
the Higgs into account, right, which is
158
00:14:21,230 --> 00:14:26,889
the light blue part and so here you can
see how this interplay between different
159
00:14:26,889 --> 00:14:38,280
sources of physics and statistics works
for us. Now just as for the climate, more
160
00:14:38,280 --> 00:14:43,690
data helps. And there are two versions of
more data: more data either by having more
161
00:14:43,690 --> 00:14:48,079
collisions, which is why we are running
24/7, or more data by combining different
162
00:14:48,079 --> 00:14:52,060
analyses which is what's happening here.
So here you see all these different
163
00:14:52,060 --> 00:14:56,990
analyses. If you combine them, of course
you get a much stronger prediction of, in
164
00:14:56,990 --> 00:15:03,300
this case, the Higgs mass, than if you
just take any single one of them. You see
165
00:15:03,300 --> 00:15:08,540
how similar what we are doing is to, you
know, any of the big data analyses out
166
00:15:08,540 --> 00:15:16,414
there. Okay, so that was that part. Now
comes the obligatory part again,
167
00:15:16,414 --> 00:15:22,930
computing. When we were designing the
LHC, not me, when people were designing the
168
00:15:22,930 --> 00:15:31,120
LHC, they needed to project computing
power from 1990 to 2000, 2010 and so on.
169
00:15:31,120 --> 00:15:34,140
And then they said: "Well, we need
a massive amount of computers", and for you
170
00:15:34,140 --> 00:15:38,420
there's now "Ughhh - everybody has it, we
have it as well, we have our racks of
171
00:15:38,420 --> 00:15:44,240
computers". This is something that the big
companies usually don't show: You know,
172
00:15:44,240 --> 00:15:48,509
there is actually a ramp where the trucks
arrive and they offload the things and
173
00:15:48,509 --> 00:15:53,820
then someone needs to screw them together
and then it looks shiny. This is how we are
174
00:15:53,820 --> 00:16:00,870
spending our CPU time: We have about
60,000 cores that are spinning all the
175
00:16:00,870 --> 00:16:06,680
time for us, and they are distributed
around the world. You can see that CERN,
176
00:16:06,680 --> 00:16:14,529
for example, is the red part there near
the bottom. Yeah, so we make good use of
177
00:16:14,529 --> 00:16:20,829
that. We also monitor the efficiency, and
because 100 percent efficient is for
178
00:16:20,829 --> 00:16:29,300
beginners we are actually about 700
percent efficient. Don't ask why. They
179
00:16:29,300 --> 00:16:33,920
decided if you are multi-threading, then
we, you know, we multiply your efficiency
180
00:16:33,920 --> 00:16:39,950
by the number of threads you have. Makes
no sense to me. We also have storage,
181
00:16:39,950 --> 00:16:44,930
currently we use about 0.7 exabytes. We
also have available about 1.7
182
00:16:44,930 --> 00:16:49,130
exabytes, so that's good, we make use of
the storage we have. Where it's, you know,
183
00:16:49,130 --> 00:16:55,529
tera-, peta-, exa-, so it's a lot, and here
you can see on the right hand side you
184
00:16:55,529 --> 00:16:59,610
see, for example, the tape usage on the
bottom and you see this dip that was
185
00:16:59,610 --> 00:17:04,270
before we were starting the accelerator
again, we needed to make some space so we
186
00:17:04,270 --> 00:17:09,089
monitor our hard disk usage all the time.
Hey, here comes the next decision point:
187
00:17:09,089 --> 00:17:13,630
So, do you want to hear about, 1,
distributed computing or 2, measure
188
00:17:13,630 --> 00:17:17,839
effects of bugs. So, 1, distributed
computing
189
00:17:17,839 --> 00:17:26,470
applause
and 2, measure the effects of bugs
190
00:17:26,470 --> 00:17:35,560
similar amount of applause
Okay, so that's my call, and I would say
191
00:17:35,560 --> 00:17:41,455
we do we do... Measure the effects of
bugs, because it's shorter.
192
00:17:41,455 --> 00:17:47,130
laughter
So this is one of the views you can, you
193
00:17:47,130 --> 00:17:50,740
know, electronic views you can get from a
detector and you see how we trace the
194
00:17:50,740 --> 00:17:55,380
particles that fly through the detector.
Now, that's software, right, that's the
195
00:17:55,380 --> 00:17:59,927
result of software, and you might not
believe it, but we have bugs in there, in
196
00:17:59,927 --> 00:18:00,808
that software.
197
00:18:02,849 --> 00:18:07,260
And you know, these bugs are sometimes
wrong coordinate transformations, so
198
00:18:07,260 --> 00:18:12,590
things don't go this way but that way,
it's kind of weird if you look at it, and
199
00:18:12,590 --> 00:18:17,470
the result is that our particles don't go
through the path that they should have
200
00:18:17,470 --> 00:18:25,190
been going, but we are attributing them a
different path. Now, the nice thing
201
00:18:25,190 --> 00:18:30,960
is that we are doing this a million times,
right? So all of that is smeared. We are
202
00:18:30,960 --> 00:18:35,730
not systematically doing this wrong, it's
just, we are always doing it a little bit
203
00:18:35,730 --> 00:18:41,669
wrong. And so the net result is that if we
measure our particles, we will not measure
204
00:18:41,669 --> 00:18:46,861
the right thing but always a little bit
wobbly left, wobbly right, you know? Things
205
00:18:46,861 --> 00:18:53,809
are not as precise. That's simply an
uncertainty. So for us just like counting
206
00:18:53,809 --> 00:18:59,059
has an uncertainty and predictions have
an uncertainty, software bugs introduce
207
00:18:59,059 --> 00:19:05,559
another source of uncertainties. And here
you can see how we are tracking
208
00:19:05,559 --> 00:19:09,370
uncertainties for all of our
analyses. We are trying to understand the
209
00:19:09,370 --> 00:19:16,220
different sources of uncertainties. And
again, bugs are only one of the sources
210
00:19:16,220 --> 00:19:22,880
here, so if we find the bug then we
reduce our uncertainty and we can find new
211
00:19:22,880 --> 00:19:27,760
physics earlier, instead of having to
wait and collect more data. So for us
212
00:19:27,760 --> 00:19:32,210
finding bugs is really key, we really
love finding bugs because it brings
213
00:19:32,210 --> 00:19:36,710
physics closer. I thought that was
interesting. It's kind of rare that you're
214
00:19:36,710 --> 00:19:42,140
in an environment where you're able to
measure the effect of bugs. Okay, so now
215
00:19:42,140 --> 00:19:47,870
we are talking, we'll be talking about
data. I talked, told you that we are
216
00:19:47,870 --> 00:19:52,690
trying to find particle traces in our
data and the way we do this is by using
217
00:19:52,690 --> 00:19:56,700
reconstruction programs and there are
multiple gigabytes of binaries in shared
218
00:19:56,700 --> 00:20:01,799
libraries and stuff. They're huge, they're
experiment specific and they are curated
219
00:20:01,799 --> 00:20:06,270
by the experiments, open-source for some
of them, and we want them to be correct
220
00:20:06,270 --> 00:20:14,140
and efficient. The data format we use is
not comma separated values, it's binary
221
00:20:14,140 --> 00:20:21,080
and for some strange reason it's our own
custom binary format. The reason is that
222
00:20:21,080 --> 00:20:26,990
it's really targeted at the kind of
data we are having. We have collisions
223
00:20:26,990 --> 00:20:32,230
that are independent, so we only need one
in memory at any time and we have nested
224
00:20:32,230 --> 00:20:38,590
collections which makes the regular table
layout a non-starter. We actually generate
225
00:20:38,590 --> 00:20:44,430
them from C++ objects so from classes,
class definitions, C++ class definitions
226
00:20:44,430 --> 00:20:51,320
and we can read them back into C++ but
also into JavaScript or Scala. Databases
227
00:20:51,320 --> 00:20:56,840
just didn't do it for us. They have the
wrong model of data access, they don't
228
00:20:56,840 --> 00:21:02,940
scale, it's just not the kind of system
that works for us. Also using a file
229
00:21:02,940 --> 00:21:09,390
system as a storage back-end might sound
really very traditional and boring but it
230
00:21:09,390 --> 00:21:13,890
works amazingly well and seems to be
future proof as well, so that's just the
231
00:21:13,890 --> 00:21:20,360
way to go for us. There are many other
structured data formats out there, many of
232
00:21:20,360 --> 00:21:26,000
those did not exist when we started ROOT,
our own data format. But they also miss
233
00:21:26,000 --> 00:21:30,250
many things. For example, we wanted to
make sure that we have schema evolution
234
00:21:30,250 --> 00:21:33,970
support. We can change the class layout
and still read back all data. We don't
235
00:21:33,970 --> 00:21:38,750
want to throw away all data just because
we're changing the class. Also we do not
236
00:21:38,750 --> 00:21:43,370
trust people. That is a, you know, as a
computer scientist or whatever you
237
00:21:43,370 --> 00:21:46,750
probably know what I'm talking about
right? If people have to write their own
238
00:21:46,750 --> 00:21:50,630
streaming algorithm, there will be bugs
and we will lose data.
239
00:21:50,630 --> 00:21:54,610
We really don't want to do this, so we
were trying to automate this, based on the
240
00:21:54,610 --> 00:22:03,070
class definition. So, last decision point
for the story. Do you want to hear about
241
00:22:03,070 --> 00:22:10,409
cling, our C++ interpreter or about Open
Data and Applied Science? Let's start with
242
00:22:10,409 --> 00:22:14,860
option 1, the C++ interpreter
applause
243
00:22:14,860 --> 00:22:21,106
Okay, and Open Data and Applied
Science?
244
00:22:21,106 --> 00:22:29,679
more applause than before
Yeah. I'm heading there. You miss a fish.
245
00:22:29,679 --> 00:22:35,299
You can look at the slides later. Okay, so
there we go. Really? No. The slide number
246
00:22:35,299 --> 00:22:41,140
is wrong. Oh a bug! So, Open Data and
Applied Science. Okay, you really wanted
247
00:22:41,140 --> 00:22:47,700
to know about our budget, I understand
that. So we get from you about 1 billion
248
00:22:47,700 --> 00:22:50,719
a year, and the currency doesn't really
matter anymore at this, at this point of
249
00:22:50,719 --> 00:22:54,200
time.
laughter
250
00:22:54,200 --> 00:23:01,230
And that is a lot of money. And you know?
We try to do really wonderful things, I
251
00:23:01,230 --> 00:23:04,943
mean we really enjoy our job, we love it.
It's fantastic to work in such an
252
00:23:04,943 --> 00:23:09,248
environment. And thank you very much for
making that possible. Really, I mean it.
253
00:23:11,110 --> 00:23:16,691
But it also means that you decided as
society to enable something like CERN.
254
00:23:17,473 --> 00:23:22,140
Which I think really deserves my applause
and yours probably as well. I think it's a
255
00:23:22,140 --> 00:23:24,425
great decision to do something like this.
256
00:23:24,425 --> 00:23:30,211
applause
257
00:23:31,325 --> 00:23:35,690
So we realize this, right? We realized
that we are basically, that we can do what
258
00:23:35,690 --> 00:23:40,210
we do because of you, and we are trying to
react to that by giving back what we do.
259
00:23:40,210 --> 00:23:47,460
Software, research results, hardware and
data. So the way we share research results
260
00:23:47,460 --> 00:23:52,600
is through open access. We have it,
finally. It took us a long time to fight
261
00:23:52,600 --> 00:23:57,570
with publishers and, you know, the
establishment, but now we have it. We
262
00:23:57,570 --> 00:23:59,220
also, yes thank you.
263
00:23:59,220 --> 00:24:03,395
applause
264
00:24:03,395 --> 00:24:07,520
We also put a lot of effort in
communicating our results and what we are
265
00:24:07,520 --> 00:24:12,680
doing. And if you're in the region, it's
definitely worth a visit. I mean the URL
266
00:24:12,680 --> 00:24:17,590
is really easy to remember, it's
visit.cern, and you know, works. And you
267
00:24:17,590 --> 00:24:22,270
should go there by April, actually, if you
can because then you can ask people how to
268
00:24:22,270 --> 00:24:27,580
get underground, because the accelerator
is off at the moment. We also do applied
269
00:24:27,580 --> 00:24:32,320
research, for example we have this super
cool experiment where we try to study how
270
00:24:32,320 --> 00:24:39,630
clouds form, based on cosmic rays. So the
influence of cosmic rays on cloud
271
00:24:39,630 --> 00:24:45,770
formation. Which is a key element in the
uncertainty of climate models. We are
272
00:24:45,770 --> 00:24:50,440
trying to, to think about, you know, how
to make energy from nuclear waste. So
273
00:24:50,440 --> 00:24:54,830
getting rid of nuclear waste while making
energy from it. And we are trying to
274
00:24:54,830 --> 00:25:02,070
repurpose detectors that we have and you
know develop. We have something called
275
00:25:02,070 --> 00:25:08,330
open hardware, for example White Rabbit:
deterministic ethernet, we have Open Data,
276
00:25:08,330 --> 00:25:12,789
and we have the LHC@home and some other
programs, where either you can donate
277
00:25:12,789 --> 00:25:21,250
compute power or your brain and help us
get better results. We explicitly try to
278
00:25:21,250 --> 00:25:25,747
use open source as much as possible, and
also feed back, whenever we see issues.
279
00:25:27,700 --> 00:25:33,620
But we also create open source. For
example, we create Geant, which is a
280
00:25:33,620 --> 00:25:37,831
program that allows you to simulate how
particles fly through matter, for
281
00:25:37,831 --> 00:25:44,610
example used by NASA. We have Indico,
which allows us to schedule meetings,
282
00:25:44,610 --> 00:25:48,940
upload slides, you know, these kind of
things. Across the globe, lots of people,
283
00:25:48,940 --> 00:25:52,970
with access protection, all these kind of
things. And it's open source. We have
284
00:25:52,970 --> 00:25:58,919
DaviX. Did I mention we love HTTP? That's
the NeXT machine of Tim Berners-Lee. And
285
00:25:58,919 --> 00:26:03,140
that's his futile effort in trying to
prevent the cleaning personnel from
286
00:26:03,140 --> 00:26:07,530
switching it off. They don't speak
English, they did not back then at least.
287
00:26:09,337 --> 00:26:15,500
So we use DaviX to transfer files
over HTTP, with a high bandwidth. Or we
288
00:26:15,500 --> 00:26:21,241
have CVMFS, which allows us to distribute
our binaries across the globe, and not
289
00:26:21,241 --> 00:26:26,570
rely on admins downloading stuff and
making sure it actually runs, and these
290
00:26:26,570 --> 00:26:31,581
kind of things. That is a lifesaver, it's
really fantastic, it's a great tool. But
291
00:26:31,581 --> 00:26:37,730
nobody knows it. And we have ROOT, but
that's coming up. So now, the last
292
00:26:37,730 --> 00:26:42,534
official part of this, of this
presentation, how do we do data analysis?
293
00:26:42,534 --> 00:26:44,950
Not like that.
laughter
294
00:26:44,950 --> 00:26:52,210
applause
We use C++ and actually physicists
295
00:26:52,210 --> 00:26:58,140
need to write their own analysis in C++.
We have very few people who have an actual
296
00:26:58,140 --> 00:27:03,876
education in programming. So that's sort
of a clash. As I said, we need to keep one
297
00:27:04,607 --> 00:27:08,460
collision in memory. And, you know,
what matters to us is throughput. We
298
00:27:08,460 --> 00:27:13,340
want to have, we want to analyze as many
collisions as possible per second. What we
299
00:27:13,340 --> 00:27:17,390
can do, is specialize our data format to
match the analysis, because we don't want
300
00:27:17,390 --> 00:27:23,419
to waste I/O cycles, if we can, you know,
if we can make use of the CPU better. ROOT
301
00:27:23,419 --> 00:27:29,110
has allowed us to do this for twenty years.
It's really the workhorse for the analysis
302
00:27:29,110 --> 00:27:35,200
in high energy physics. And it's also an
interface to complex software. We have
303
00:27:35,200 --> 00:27:40,950
serialization facilities, we have the
statistical tools, that people need, and
304
00:27:40,950 --> 00:27:44,480
we have graphics, because once you have
done your analysis you need to communicate
305
00:27:44,480 --> 00:27:48,500
that to your peers and convince people,
and publish, and so on, so that's part of
306
00:27:48,500 --> 00:27:54,169
the game. All of that is open source, and,
of course, all of that is not just used by
307
00:27:54,169 --> 00:28:03,370
high energy physics. So, to conclude: We
are here, because you make it possible.
308
00:28:03,370 --> 00:28:05,223
Thank you very much. It's fantastic to
have you.
309
00:28:05,223 --> 00:28:10,860
applause
We want to share and we have great people
310
00:28:10,860 --> 00:28:17,080
for science outreach, but we have nobody
for software outreach, basically. So maybe
311
00:28:17,080 --> 00:28:24,570
it's worth a look to see what CERN is
producing software-wise. Scientific
312
00:28:24,570 --> 00:28:29,940
computing is nothing new, it has existed for
a long time, but we had to start fairly
313
00:28:29,940 --> 00:28:35,490
early on a large scale. So when we were
building it up, we had to take... we were
314
00:28:35,490 --> 00:28:39,960
trying to take pieces that existed and did
not find much. So now we ended up
315
00:28:39,960 --> 00:28:45,179
with C++ data serialization, efficient
computing for non computer scientists
316
00:28:45,179 --> 00:28:49,660
even... In the part that I skipped and,
you know, one of the alternate tracks, you
317
00:28:49,660 --> 00:28:54,289
would have seen that we have a Python
binding as well for the whole software
318
00:28:54,289 --> 00:28:59,970
stack in C++. And for us, what matters
most is scale. Now we are seeing that we
319
00:28:59,970 --> 00:29:04,309
are not the only ones. There are many more
natural sciences arriving at a similar
320
00:29:04,309 --> 00:29:09,120
challenge of having to analyze large
amounts of data. Now I promised to you
321
00:29:09,120 --> 00:29:12,480
that I'll be bold and I'll try to make a
few statements of what will happen with
322
00:29:12,480 --> 00:29:16,750
data analysis, not just in science.
Because what we see is that we actually
323
00:29:16,750 --> 00:29:22,610
educate the people who will do data
analysis, not just in science. What we see
324
00:29:22,610 --> 00:29:30,990
is that in the past, data volume mattered
most. So more data meant more power. Now
325
00:29:30,990 --> 00:29:35,929
that's not the complete truth anymore.
It's a lot about finding correlations. So
326
00:29:35,929 --> 00:29:40,880
even with the amount of data not growing
anymore, because it's already humongous,
327
00:29:40,880 --> 00:29:46,320
we try to squeeze more knowledge out of
it. And for that, I/O becomes important
328
00:29:46,320 --> 00:29:53,900
and CPU limitations are the crucial factor.
We see that multivariate techniques are
329
00:29:53,900 --> 00:29:59,029
still rising and they will just be part of
the toolchain of the statistical tools;
330
00:29:59,852 --> 00:30:06,681
except for generative parts, which, I
believe, will change the way we model.
331
00:30:10,232 --> 00:30:16,361
Now, based on what I just described, this
is not a big surprise anymore. As we need
332
00:30:16,361 --> 00:30:21,210
throughput, we need to have a language for
the core analysis part, that is close to the
333
00:30:21,210 --> 00:30:26,970
metal, so something like C++.
On the other hand writing analyses is
334
00:30:26,970 --> 00:30:31,791
still complex, so you need a higher-level
language and for that people could, for
335
00:30:31,791 --> 00:30:35,929
example, use Python. So, now language
binding becomes relevant all of a sudden.
336
00:30:35,929 --> 00:30:42,010
It will be much more important in the future.
And we need to tailor I/O to the actual
337
00:30:42,010 --> 00:30:48,910
analysis to not waste CPU cycles. So
throughput is king and, from my point of
338
00:30:48,910 --> 00:30:54,331
view, also in the future we will see much
more effort in increasing the throughput.
339
00:30:55,600 --> 00:31:03,115
Okay, so that was it. In case you want to
discuss anything with me, like "That's
340
00:31:03,115 --> 00:31:07,970
just wrong!", that's fine. I probably
have several bugs in there. I'm still here
341
00:31:07,970 --> 00:31:12,909
until tomorrow. I don't know where yet,
so I'll wander around and you can contact
342
00:31:12,909 --> 00:31:16,818
me by email or Twitter. Thank you very
much for your attention. Thank you.
343
00:31:16,818 --> 00:31:20,525
applause
344
00:31:20,525 --> 00:31:27,990
music
345
00:31:27,990 --> 00:31:45,000
subtitles created by c3subtitles.de
in the year 2017. Join, and help us!