music
Herald: Good morning and welcome back to stage one. This is already going to be the second talk about physics today, and it's about big data and science. Big data became something like Uber in science: it's everywhere, every discipline has it. Axel Naumann works for CERN, the particle physics laboratory in Switzerland, and he talks about how physics and computing meet in this area. He works a lot on ROOT, a program that helps transform data into knowledge. A warm welcome.
Axel Naumann: Thank you.
applause
AN: Thanks a lot. So, well, you know, when I was discussing this abstract with the science track people, they told me: "Well, you know, about three hundred people might be in the audience." But, well, hey, you are huge, that's much more than three hundred people. So thank you so much for inviting me over, it's a real honor. And of course, originally, when talking to 300 people who are all interested in science, I thought, you know, I'd pick something fairly narrow in focus, but then I learned I'm going to be in Saal 1, and that's different, so I decided to make the scope a little bit wider, and that's what I ended up with. I'll talk a little bit about CERN in society as well, if you so choose; you'll see what that means in a minute. So the things I'll cover here are, obviously, CERN, just a little bit of an introduction:
how we do physics, how we do computing,
what data means to us and I can tell you
it means everything, you heard about that
already, right? How we do data analysis in
high energy physics and just because
we've been doing it for a while and
because I've been doing it for more than
ten years, I'm one of the guys who's
providing the software to do data
analysis in high energy physics, so, you
know, because we know what we are doing
and we have some experience, I thought
maybe you might be interested in hearing
what my forecast is for data analysis in
general, in the future. So let's start
with CERN. And so, if you wonder what CERN is, well, you've all heard about CERN, about the fantastic funds we love to use, and you've probably also heard that we are
doing science. We were founded soon after the Second World War, basically as a way to entertain those freaky scientists. You know, that was the idea: Europe-wide peace. And damn, that's working out really well, so well that it's not just Europe anymore these days. We are located near Geneva, we do only fundamental research, so we don't do any weapons, nuclear stuff, you know, these kinds of things. The WWW was invented at CERN, but that was just a, you know, side effect; it happens sometimes that we invent things. But usually we just do science. So what we do is, we take money, lots of it, and brains who like to discuss and think and come up with ideas, and from that we generate knowledge. It's really
all about curiosity. The things we try to
answer are: what is mass? Which is a funny question, right? Like, we all know what mass is, but actually we don't. We know what mass is in the universe: we understand that masses attract one another, gravity. Which is beautifully correct. And on the small scale, for our particles, we know that mass is energy and we can convert one into the other. But we don't understand how these two things go together. Like, there is no bridge, they contradict one another. So we are trying to understand what that bridge might be. Part of that mass thing is of course also: what's out there in the universe? That's a big question. We only understand a few percent of that. Ninety-some percent is completely unknown to us, and that's scary, right? I mean, we know gravity really well, we can deal with freaky things like black holes, and yet we don't understand what's out there. Now, to
do all these things we are probing nature
at the smallest scale as we call it, so
that's particles, we are dealing with
things like the Higgs particle and
supersymmetry. Here's a little bit of a
fact sheet. We have about 12,000
physicists who are working with CERN. We
are basically the workbench that you saw
in Andre's talk before. We are the table
that physicists use, okay? And, so they
come to CERN and once a while about
10,000 physicists a year, or they work
remotely most of the time from about 120
nations. So you're seeing it's not
European anymore, this is a global thing.
CERN in itself has about 2,500 employees,
you know those scrubbing the table,
setting things up and so on. And our
table is right here. In the far end we
have the Alps, it's in Switzerland as I said, so the Alps are always close, with Mont Blanc; we have Lake Geneva, we have the Jura, the French mountains, on the lower end here, it's just beautiful. It's really nice, but we needed to stick a roughly 30-kilometer ring in there somewhere, and people would have hated us had we put it like this. But luckily people were smart back then, in the 70s, and built a tunnel instead, much better. So
now we have this huge tunnel, and we send
particles through in both directions near
the speed of light and the tunnel is
filled with magnets simply because if you
don't use a magnet the particles will fly
straight but we need them to turn around.
Here you see what it's looking like, you
also see these big halls there that have
access shafts from the top and that's
where the experiments are. That's sort of
a sketch of one of the experiments. So the LHC is one of the, no, is the biggest particle accelerator at the moment; it's a ring with 27 kilometers circumference, 100
meters below Switzerland and France, it
has four big experiments and several
small ones and we are expected to run
until 2030. So you see that all of that
is large-scale simply because we're trying
to make good use of the money we have.
Here, you see one of these caverns that
are used by the experiments while it was
empty. The experiment was then lowered through this hole in the roof, piece by piece, and these things are humongous. To give you an impression of how big it is, I put Waldo in there, so your job for the next three slides is to find Waldo. You know, that gives you the scale. He's waving at you in a friendly way, so it should be easy to find him. So then we put a
detector in there. Here it's pulled apart
a little bit, so it looks nicer, you can
actually see something. You can for
example see the beam pipe, so that's where
the particles are flying through, and then
they're coming from both directions and
colliding in the center of the detector
and then things happen, and we try to understand what is happening. That's yet another view, a frontal view of one of the detectors,
now you have to imagine that, you know,
you can't just open up Amazon and order an
LHC experiment, right, that's not how it
works. We do this stuff ourselves, like
PhD students, postdocs, engineers. You
know, that's all done by hand, just like
the microscope you saw before. Of course
you order the parts, but you know the
design, the whole conception and actually
screwing these things together, making
sure that all fits, is all done by hand.
And I find that just beautiful, I mean
that's close to a miracle, right? That
nations, like people no matter what
nation, people across the globe work
together to build such a huge thing and
then you turn it on and it works. More or
less, but you get it to work. That's not
my applause, that's your applause, because
you make this possible. Really. But it's huge; this is for me one of the things I love most about CERN: this international thing that just works smoothly. Now, the detectors are like a
massive camera. We have lots of pixels and
we take many, many pictures a second. We
do this to identify particles and then
sort of estimate what has happened during
the collision. Now, life at CERN is of
course an important ingredient for
scientists as well, and if you live at CERN, then actually it's just work at CERN, and that's what it's about. But it's not
that bad, so we hang out together in our
control rooms, make sure that the
experiments work correctly. We also, you
know, study the forces.
laughter
We have scientific discourse, in the sun, with a view of the Mont Blanc and a good coffee. We have lectures and we are lectured, and of course, like you, we have more laptops than people. And then we do stuff, and this presentation is going to introduce you to some of the things we are doing, more on the computing and the
society side, as I said. But because I have so much to talk about, I decided that you just build your own talk: you tell me what you want to hear. So let's do this, you can choose between A, physics, and B, model, simulation and data. You remember these books from the old days, when we were all young? It's that kind of thing, okay? You design your own talk here.
So, by applause, do you want to hear about
physics?
applause
Okay. Or the model, simulation and data part?
louder applause
Okay, there we go. So, this is what we skip. Model, simulation and data it is. You're a strange crowd, it's the first time I meet people who don't want to hear about physics... no, I'm kidding.
laughter
Audience: inaudible interjection
laughter
So, model, simulation and data it is. So our theory is actually incredibly precise. It's so precise that our basic job is really, really boring, because we already
understand everything. Whenever there is a
collision, we know what's going to happen.
Except for these very rare things. So we
are trying to find these very rare things
out of this haystack of fairly boring
things that we really understand well. And
the weird things are, for example,
monopoles, supersymmetry, or black holes.
Now the theorists' job is to tell us what we should be seeing in the detector, given some fancy physics. Then we use simulation to see how our detector would respond to that. Now, of course, the question is: we are basically just counting when we do experiments, and the question is, how often do we need to see something to say: "Well, that's not just the ordinary. That is something new, that's something that could be explained by a weird theory." We use the detector simulation, as I said, to basically predict how often we expect to see things. We use reconstruction software, which tells us what has happened, or might have happened, in the detector, to count how often we saw something. And then we use statistics to compare these two and to say whether something is expected or not.
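To make that counting idea concrete, here is a toy sketch with invented numbers; it is nothing like the real LHC statistics machinery (which uses profile-likelihood fits), just the textbook back-of-the-envelope version of "is this excess more than a fluctuation?".

```cpp
// Toy counting experiment: is an observed excess significant?
// A minimal sketch with made-up numbers, not the real LHC machinery.
#include <cmath>
#include <cstdio>

int main() {
    double expected = 1000.0;  // events predicted by well-understood physics
    double observed = 1130.0;  // events we actually counted
    // For large counts a Poisson fluctuation has width ~sqrt(expected),
    // so a naive significance is the excess in units of that width.
    double z = (observed - expected) / std::sqrt(expected);
    std::printf("excess: %.0f events, significance: %.1f sigma\n",
                observed - expected, z);
    // Particle physics convention: about 5 sigma before claiming a discovery.
    return 0;
}
```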
Now, that's fairly abstract, but it's a fairly common approach. For
example, if you look at climate versus
weather, right, I mean we always have
temperature fluctuations because of
weather, and the question is: is that rise in temperature because of a weather effect or because of a climate effect? Is that large-scale, or just a short-term fluctuation? So there we have a very similar problem, and here what you do is you measure temperatures, and you want to detect abnormal variations, and you can improve that by measuring longer, like for 300 years instead of 20 years. That gives you a better prediction of what you would expect in the future. Also, larger deviations help, right? If you look for
something that
is just 0.1 degree, then you might not be
able to find it. If there is a deviation
of 5 degrees, you will definitely find it.
And for us it's very similar. So here we
have a plot, one of the first Higgs
discovery plots, and you can see that we
have many ingredients there. So, the black
dots are what we measure, and they have a certain uncertainty, because when we measure, we count, and we might have, you know, not seen something, or we might have seen more than we should have seen, so there's always an uncertainty. And then we also have theory, which tells us how many we should have seen. The red part is something that we know exists, it's nothing spectacular; it's simply what theory tells us we should be seeing. And you can see the data follows the red part fairly well. But then there is this other bump in our dots, on the right-hand side or in the center, and that does not make sense unless you take the Higgs into account, right, which is the light blue part. And so here you can see how this interplay between different sources of physics and statistics works
for us. Now, just as for the climate, more data helps. And there are two versions of more data: either more data by having more collisions, which is why we are running 24/7, or more data by combining different analyses, which is what's happening here. So here you see all these different analyses. If you combine them, of course, you get a much stronger prediction of, in this case, the Higgs mass, than if you just take any single one of them. You see how similar what we are doing is to, you know, any of the big data analyses out there.
Okay, so that was that part. Now comes the obligatory part again: computing. When we were designing the LHC, not me, when people were designing the LHC, they needed to project computing power from 1990 to 2000, 2010, and so on.
And then they said: "Well, we need a massive amount of computers", and for you that's now "ugh, everybody has that"; well, we have it as well, we have our racks of computers. This is something that the big companies usually don't show: you know, there is actually a ramp where the trucks arrive and they offload the things, and then someone needs to screw them together, and then it looks shiny. This is how we are
spending our CPU time: We have about
60,000 cores that are spinning all the
time for us, and they are distributed
around the world. You can see that CERN,
for example, is the red part there near
the bottom. Yeah, so we make good use of
that. We also monitor the efficiency, and because 100 percent efficient is for beginners, we are actually about 700 percent efficient. Don't ask why. They decided that if you are multi-threading, then, you know, we multiply your efficiency by the number of threads you have. Makes no sense to me. We also have storage; currently we use about 0.7 exabytes, and we have about 1.7 exabytes available, so that's good, we make use of the storage we have. That's, you know, tera-, peta-, exa-, so it's a lot. And here, on the right-hand side, you see for example the tape usage on the bottom, and you see this dip: that was before we were starting the accelerator again, we needed to make some space, so we monitor our hard disk usage all the time.
Hey, here comes the next decision point. So, do you want to hear about, 1, distributed computing, or 2, measuring the effects of bugs? So, 1, distributed computing
applause
and 2, measuring the effects of bugs
similar amount of applause
Okay, so that's my call, and I would say we do, we do... measuring the effects of bugs, because it's shorter.
laughter
So this is one of the views you can get, you know, electronic views you can get from a detector, and you see how we trace the particles that fly through the detector. Now, that's software, right, that's the result of software, and, you might not believe it, we have bugs in there, in that software. And you know, these bugs are sometimes wrong coordinate transformations, so things don't go this way but that way, it's kind of weird if you look at it, and the result is that our particles don't go along the path that they should have been going, but we are attributing a different path to them. Now, the nice thing is that we are doing this a million times, right? So all of that is smeared. We are not systematically doing this wrong, it's just that we are always doing it a little bit wrong. And so the net result is that if we measure our particles, we will not measure the right thing, but always a little bit wobbly left, wobbly right, you know? Things are not as precise. That's simply an
uncertainty. So for us, just like counting has an uncertainty and predictions have an uncertainty, software bugs introduce another source of uncertainty. And here you can see how we are tracking uncertainties for all of our analyses. We are trying to understand the different sources of uncertainty. And again, bugs are only one of the sources here, so if we find a bug, then we reduce our uncertainty and we can find new physics earlier, instead of having to wait and collect more data. So for us, finding bugs is really key, we really love finding bugs, because it brings
physics closer. I thought that was interesting; it's kind of rare that you're in an environment where you're able to measure the effect of bugs.
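To illustrate how a per-event random error widens a measurement instead of shifting it, here is a toy Monte Carlo sketch; the numbers and the "bug" are entirely made up.

```cpp
// Toy Monte Carlo: a bug that randomly perturbs each event smears the
// measurement (wider distribution) without shifting its mean.
// All numbers are invented for illustration.
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> detector(100.0, 2.0); // true value +- resolution
    std::normal_distribution<double> bug(0.0, 3.0);        // extra random wobble from a bug

    const int n = 1000000;
    double sum = 0.0, sum2 = 0.0;
    for (int i = 0; i < n; ++i) {
        double measured = detector(rng) + bug(rng); // each event is wrong a bit differently
        sum += measured;
        sum2 += measured * measured;
    }
    double mean = sum / n;
    double width = std::sqrt(sum2 / n - mean * mean);
    // The mean stays near 100 (no systematic shift), but the width grows
    // from 2.0 to sqrt(2*2 + 3*3) ~ 3.6: a larger uncertainty, reduced by
    // fixing the bug rather than by collecting more data.
    std::printf("mean = %.2f, width = %.2f\n", mean, width);
    return 0;
}
```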
Okay, so now we'll be talking about data. I told you that we are
trying to find particle traces in our
data and the way we do this is by using
reconstruction programs and there are
multiple gigabytes of binaries in shared
libraries and stuff. They're huge, they're
experiment specific and they are curated
by the experiments, open-source for some
of them, and we want them to be correct
and efficient. The data format we use is not comma-separated values, it's binary, and for some strange reason it's our own custom binary format. The reason is that it's really targeted at the kind of data we have. We have collisions that are independent, so we only need one in memory at any time, and we have nested collections, which makes the regular table layout a non-starter. We actually generate the format from C++ objects, so from classes, C++ class definitions, and we can read the data back into C++, but also into JavaScript or Scala. Databases just didn't do it for us. They have the wrong model of data access, they don't scale, it's just not the kind of system
that works for us. Also using a file
system as a storage back-end might sound
really very traditional and boring but it
works amazingly well and seems to be
future proof as well, so that's just the
way to go for us. There are many other structured data formats out there; many of those did not exist when we started ROOT, our own data format. But they also miss many things. For example, we wanted to make sure that we have schema evolution support: we can change the class layout and still read back all the data. We don't want to throw away all data just because we're changing a class. Also, we do not trust people. As a computer scientist, or whatever you are, you probably know what I'm talking about, right? If people have to write their own streaming algorithm, there will be bugs and we will lose data. We really don't want that, so we automate it, based on the class definition.
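For a flavor of what that looks like in practice, here is a minimal sketch using ROOT's TFile and TTree: the streaming code is derived from the definitions, so nested collections work and nobody writes byte-level I/O by hand. The event layout here is hypothetical; real experiment classes are far richer.

```cpp
// Minimal sketch of ROOT's class-driven serialization: a hypothetical
// event with a nested collection, written to ROOT's binary format.
// No hand-written byte-level I/O anywhere.
#include <vector>
#include "TFile.h"
#include "TTree.h"

int main() {
    TFile file("events.root", "RECREATE");
    TTree tree("events", "toy collision data");

    int run = 0;
    std::vector<float> trackPt;        // one entry per reconstructed track
    tree.Branch("run", &run);
    tree.Branch("trackPt", &trackPt);  // nested collection: no flat table needed

    for (int i = 0; i < 100; ++i) {    // one collision in memory at a time
        run = 1;
        trackPt = {10.0f + i, 25.5f};
        tree.Fill();
    }
    tree.Write();
    return 0;
}
```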
So, last decision point for the story. Do you want to hear about
cling, our C++ interpreter or about Open
Data and Applied Science? Let's start with
option 1, the C++ interpreter
applause
Okay and and Open Data and Applied
Science?
more applause than before
Yeah, I'm heading there. You'll miss a fish; you can look at the slides later. Okay, so there we go. Really? No. The slide number is wrong. Oh, a bug! So, Open Data and Applied Science. Okay, you really wanted to know about our budget, I understand that. So we get from you about 1 billion a year, and the currency doesn't really matter anymore at this point in time.
laughter
And that is a lot of money. And, you know, we try to do really wonderful things, I mean we really enjoy our job, we love it.
It's fantastic to work in such an
environment. And thank you very much for
making that possible. Really, I mean it.
But it also means that you decided as
society to enable something like CERN.
Which I think really deserves my applause
and yours probably as well. I think it's a
great decision to do something like this.
applause
So we realize this, right? We realized that we are basically... that we can do what we do because of you, and we try to react to that by giving back what we do: software, research results, hardware and
data. So the way we share research results
is through open access. We have it,
finally. It took us a long time to fight
with publishers and, you know, the
establishment, but now we have it. We
also, yes thank you.
applause
We also put a lot of effort into communicating our results and what we are doing. And if you're in the region, it's definitely worth a visit. I mean, the URL is really easy to remember, it's visit.cern, and, you know, it works. And you should go there by April, actually, if you can, because then you can ask people how to get underground, because the accelerator is off at the moment. We also do applied
research, for example we have this super cool experiment where we try to study how clouds form, based on cosmic rays. So the influence of cosmic rays on cloud formation, which is a key element in the uncertainty of climate models. We are trying to think about, you know, how to make energy from nuclear waste, so getting rid of nuclear waste while making energy from it. And we are trying to repurpose detectors that we have and, you know, develop. We have something called
open hardware, for example White Rabbit: deterministic Ethernet. We have Open Data,
and we have the LHC@home and some other
programs, where either you can donate
compute power or your brain and help us
get better results. We explicitly try to
use open source as much as possible, and
also feed back, whenever we see issues.
But we also create open source. For
example, we create Geant, which is a program that allows you to simulate how particles fly through matter, used for example by NASA. We have Indico, which allows us to schedule meetings, upload slides, you know, these kinds of things, across the globe, with lots of people, with access protection, all these kinds of things. And it's open source. We have DaviX, which demonstrates how much we love HTTP. That's the NeXT machine of Tim Berners-Lee, and that's his futile effort at trying to prevent the cleaning personnel from switching it off; they don't speak English, or they did not back then at least. So we use, we used DaviX to transfer files
over HTTP, with a high bandwidth. Or we
have CVM-FS, which allows us to distribute
our binaries across the globe, and not
rely on admins downloading stuff and
making sure it actually runs, and these kinds of things. That is a lifesaver, it's really fantastic, it's a great tool. But nobody knows it. And we have ROOT, but
that's coming up. So now, the last
official part of this, of this
presentation, how do we do data analysis?
Not like that.
laughter
applause
We use, we use C++, and actually physicists need to write their own analyses in C++. We have very few people who have an actual education in programming, so that's sort of a clash. As I said, we need to keep one collision in memory, and what matters to us is throughput: we want to analyze as many collisions as possible per second. What we can do is specialize our data format to match the analysis, because we don't want to waste I/O cycles if we can make use of the CPU better. ROOT has allowed us to do this for twenty years. It's really the workhorse for analysis in high energy physics. And it's also an interface to complex software. We have serialization facilities, we have the statistical tools that people need, and we have graphics, because once you have done your analysis, you need to communicate it to your peers and convince people, and publish, and so on, so that's part of the game. All of that is open source and, of course, all of that is not just used by high energy physics.
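For a feel of what such an analysis looks like today, here is a minimal sketch using ROOT's RDataFrame, reading the hypothetical toy file from the earlier sketch. It also hints at how I/O gets tailored to the analysis: only the branches an analysis touches are read from disk.

```cpp
// Minimal sketch of a modern ROOT analysis with ROOT::RDataFrame,
// reading the hypothetical toy file from the earlier sketch.
#include "ROOT/RDataFrame.hxx"
#include "TCanvas.h"

int main() {
    // Loops over collisions one at a time; only the branches the
    // analysis touches are read from disk.
    ROOT::RDataFrame df("events", "events.root");
    auto hist = df.Filter("run == 1")
                  .Define("nTracks", "trackPt.size()")
                  .Histo1D("nTracks");
    // The graphics part: communicate the result to your peers.
    TCanvas canvas;
    hist->Draw();
    canvas.SaveAs("ntracks.png");
    return 0;
}
```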
So, to conclude: we are here because you make it possible.
Thank you very much. It's fantastic to
have you.
applause
We want to share, and we have great people for science outreach, but we have basically nobody for software outreach. So maybe it's worth a look to see what CERN is producing software-wise. Scientific computing is nothing new, it has existed for a long time, but we had to start fairly early on a large scale. So when we were building it up, we had to take... we were trying to take pieces that existed, and did not find much. So now we ended up with C++ data serialization, efficient computing even for non-computer-scientists... In the part that I skipped, you know, one of the alternate tracks, you would have seen that we have a Python binding as well for the whole C++ software stack. And for us, what matters
most is scale. Now we are seeing that we
are not the only ones. There are many more
natural sciences arriving at a similar
challenge of having to analyze large
amounts of data. Now, I promised you that I'd be bold and try to make a few statements about what will happen with data analysis, not just in science. Because what we see is that we actually educate the people who will do data analysis, also outside science. What we see is that in the past, data volume mattered most, so more data meant more power. Now that's not the complete truth anymore. It's a lot about finding correlations. So even with the amount of data not growing anymore, because it's already humongous, we try to squeeze more knowledge out of it. And for that, I/O becomes important and CPU limitations become the crucial factor. We see that multivariate techniques are still rising, and they will just be part of the toolchain of statistical tools; except for the generative parts, which, I believe, will change the way we model.
Now, based on what I just described, this
is not a big surprise anymore. As we need
throughput, we need to have a language for
the core analysis part, that is close to
metal, so something like C++.
On the other hand, writing analyses is still complex, so you need a higher-level language, and for that people could, for example, use Python. So now language binding becomes relevant all of a sudden; it will be much more important in the future. And we need to tailor I/O to the actual analysis, to not waste CPU cycles. So throughput is king and, from my point of view, also in the future we will see much more effort going into increasing throughput.
Okay, so that was it. In case you want to discuss anything with me, like "That's just wrong!", that's fine. I probably have several bugs in there. I'm still here until tomorrow. I don't know where yet, so I'll wander around, and you can contact me by email or Twitter. Thank you very much for your attention. Thank you.
applause
music
subtitles created by c3subtitles.de
in the year 2017. Join, and help us!