music Herald: Good morning and welcome back to stage one. It's going to be the second talk about physics today already, and it's about big data and science. Big data became something like Uber in science: it's everywhere, every discipline has it. Axel Naumann is working for CERN, the accelerator lab in Switzerland, and he talks about how physics and computing bridge in this area. He works a lot with ROOT, a program that helps transform data into knowledge. A warm welcome. Axel Naumann: Thank you. applause AN: Thanks a lot. So, well, you know, when I was discussing this abstract with the science track people, they told me: "Well, you know, about three hundred people might be in the audience." But well, hey, you are huge, that's much more than three hundred people. So thank you so much for inviting me over, it's a real honor. And of course, originally, when talking to 300 people who are all interested in science, I thought, you know, I'd pick something with a fairly narrow focus, but then I learned I'm going to be in Saal 1 and that's different, so I decided to make the scope a little bit wider, and that's what I ended up with. I'll talk a little bit about CERN in society as well, if you so choose; you'll see what that means in a minute. So the things I'll cover here are, obviously, CERN, just a little bit of an introduction; how we do physics; how we do computing; what data means to us, and I can tell you it means everything, you heard about that already, right?; and how we do data analysis in high energy physics. And just because we've been doing it for a while, and because I've been doing it for more than ten years, I'm one of the guys who's providing the software to do data analysis in high energy physics, so, you know, because we know what we are doing and we have some experience, I thought maybe you might be interested in hearing what my forecast is for data analysis in general, in the future. So let's start with CERN. If you wonder what CERN is, well, you've all heard about CERN, about the fantastic fonts we love to use, and then you've probably also heard that we are doing science. We were founded right after the Second World War, or soon after the Second World War, basically as a way to entertain those freaky scientists. You know, that was the idea: Europe-wide peace. And damn, that's working out really well, so well that it's not just Europe anymore these days. We are located near Geneva, and we are doing only fundamental research, so we don't do any weapons, nuclear stuff, you know, these kinds of things. The WWW was invented at CERN, but that was just a, you know, side effect; it happens sometimes that we invent things. But usually we just do science. So what we do is, we take money, lots of it, and brains who like to discuss and think and come up with ideas, and from that we generate knowledge. It's really all about curiosity. The questions we try to answer are, for example: What is mass? Which is a funny question, right? Like, we all know what mass is, but actually we don't. We know what mass is in the universe: we understand that masses attract one another, gravity, which is beautifully correct. And at the small scale, with particles, we know that mass is energy and we can convert one into the other. But we don't understand how these two things go together. Like, there is no bridge; they contradict one another. So we are trying to understand what that bridge might be. Part of that mass thing is of course also: What's out there in the universe? That's a big question. We only understand a few percent of that.
Ninety-some percent is completely unknown to us, and that's scary, right? I mean, we know gravity really well, we can deal with freaky things like black holes, and yet we don't understand what's out there. Now, to do all these things we are probing nature at the smallest scale, as we call it, so that's particles; we are dealing with things like the Higgs particle and supersymmetry. Here's a little bit of a fact sheet. We have about 12,000 physicists who are working with CERN. We are basically the workbench that you saw in Andre's talk before; we are the table that physicists use, okay? And so they come to CERN once in a while, about 10,000 physicists a year, or they work remotely most of the time, from about 120 nations. So you see, it's not European anymore, this is a global thing. CERN itself has about 2,500 employees, you know, those scrubbing the table, setting things up and so on. And our table is right here. In the far end we have the Alps, it's in Switzerland as I said, so the Alps are always close, with Mont Blanc; we have Lake Geneva; we have the Jura, the French mountains, on the lower end here; it's just beautiful. It's really nice, but we needed to stick a 30-kilometer ring in there somewhere, and people would have hated us had we put it like this. But luckily people were smart back then in the 70s and built a tunnel instead, much better. So now we have this huge tunnel, and we send particles through in both directions near the speed of light, and the tunnel is filled with magnets, simply because if you don't use a magnet the particles will fly straight, but we need them to turn around (see the small calculation after this paragraph). Here you see what it's looking like; you also see these big halls there that have access shafts from the top, and that's where the experiments are. That's sort of a sketch of one of the experiments. So the LHC is one of the, no, is the biggest particle accelerator at the moment. It's a ring with 27 kilometers circumference, 100 meters below Switzerland and France; it has four big experiments and several small ones, and we are expected to run until 2030. So you see that all of that is large-scale, simply because we're trying to make good use of the money we have. Here you see one of these caverns that are used by the experiments, while it was still empty. The experiment was then lowered through this hole in the roof, piece by piece, and these things are humongous. To give you an impression of how big it is, I put Waldo in there, so your job for the next three slides is to find Waldo. You know, that gives you the scale. He's waving at you in a friendly way, so it should be easy to find him. So then we put a detector in there. Here it's pulled apart a little bit, so it looks nicer, you can actually see something. You can for example see the beam pipe, so that's where the particles are flying through; they're coming from both directions and colliding in the center of the detector, and then things happen, and we try to understand what is happening. That's yet another view, a frontal view of one of the detectors, and now you have to imagine that, you know, you can't just open up Amazon and order an LHC experiment, right, that's not how it works. We do this stuff ourselves: PhD students, postdocs, engineers. You know, that's all done by hand, just like the microscope you saw before. Of course you order the parts, but you know, the design, the whole conception, and actually screwing these things together, making sure that it all fits, is all done by hand.
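As a rough aside on why the ring is filled with magnets: a charged particle's momentum, the magnetic field, and the bending radius are tied together by the standard relation p [GeV/c] ≈ 0.3 · B [T] · r [m]. A minimal sketch, with approximate numbers that are not from the talk:

```cpp
// Back-of-the-envelope sketch (illustrative numbers, not from the talk):
// why bending multi-TeV protons needs a kilometers-wide ring of magnets.
#include <cstdio>

int main() {
    const double momentum_GeV = 6500.0; // one LHC proton beam in Run 2
    const double field_T = 8.33;        // design field of the LHC dipoles
    // p ≈ 0.3 * B * r for unit charge  =>  r = p / (0.3 * B)
    const double radius_m = momentum_GeV / (0.3 * field_T);
    std::printf("bending radius ~ %.0f m\n", radius_m); // about 2600 m
    // The LHC's actual bending radius is roughly 2800 m; the dipoles only
    // fill part of the 27 km circumference, hence the slightly larger ring.
    return 0;
}
```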
And that hand-built construction, I find just beautiful, I mean that's close to a miracle, right? That nations, like, people no matter what nation, people across the globe work together to build such a huge thing, and then you turn it on and it works. More or less, but you get it to work. That's not my applause, that's your applause, because you make this possible. Really. It's huge; this is for me one of the things I love most about CERN: this international thing that just works smoothly. Now, the detectors are like a massive camera. We have lots of pixels and we take many, many pictures a second. We do this to identify particles and then sort of estimate what has happened during the collision. Now, life at CERN is of course an important ingredient for scientists as well, and if you live at CERN then actually it's just work at CERN, and that's what it's about. But it's not that bad: we hang out together in our control rooms, make sure that the experiments work correctly. We also, you know, study the forces. laughter We have scientific discourse, in the sun, with a view of the Mont Blanc, with a good coffee. We have lectures and we are lectured, and of course, like you, we have more laptops than people. And then we do stuff, and so this presentation is going to introduce you to some of the things we are doing, more on the computing and the society side, as I said. But because I have so much to talk about, I decided that you just build your own talk; you tell me what you want to hear. So let's do this: you can choose between A, physics, and B, model, simulation and data. You remember these books from the old days when we were all young? It's that kind of thing, okay? You design your own talk here. So, by applause: do you want to hear about physics? applause Okay. Or the model, simulation and data part? louder applause Okay, there we go. So this is what we skip. Model, simulation and data it is. You're a strange crowd, the first time I meet people who don't want to hear about physics... no, I'm kidding. laughter Audience: inaudible interjection laughter So, model, simulation and data it is. So, our theory is actually incredibly precise. It's so precise that our basic job is really, really boring, because we already understand everything. Whenever there is a collision, we know what's going to happen. Except for these very rare things. So we are trying to find these very rare things in this haystack of fairly boring things that we really understand well. And the weird things are, for example, monopoles, supersymmetry, or black holes. Now, the theorists' job is to tell us what we should be seeing in the detector, given some fancy physics. Then we use simulation to see how our detector would respond to that. Now, of course, we are just counting, basically, when we do experiments, and the question is: How often do we need to see something to say: "Well, that's not just the ordinary. That is something new, that's something that could be explained by a weird theory."? We use the detector simulation, as I said, to basically predict how often we expect to see things. We use reconstruction software, which tells us what has happened, or might have happened, in the detector, to count how often we saw something. And then we use statistics to compare these two and to say whether something is expected or not. Now, that's fairly abstract, but it's a fairly common approach.
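To make that counting logic concrete, here is a minimal sketch (an illustration, not code from the talk): compare an observed event count against the background expected from known physics, and express the excess in standard deviations using the simple Gaussian approximation.

```cpp
// Minimal illustration of the counting logic described above (a sketch,
// not CERN code). We expect `b` background events from known physics and
// observe `n`; in the Gaussian approximation the significance of the
// excess is (n - b) / sqrt(b). Particle physics conventionally claims a
// discovery at roughly 5 sigma.
#include <cmath>
#include <cstdio>

int main() {
    const double expected_background = 100.0; // from theory + detector simulation
    const double observed = 160.0;            // counted by the reconstruction software
    const double significance =
        (observed - expected_background) / std::sqrt(expected_background);
    std::printf("excess: %.1f sigma\n", significance); // 6.0 sigma here
    return 0;
}
```

Real analyses use proper Poisson statistics, likelihoods, and systematic uncertainties, but the principle is exactly this comparison of observed versus expected counts.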
For example, if you look at climate versus weather, right, I mean, we always have temperature fluctuations because of weather, and the question is: Is that rise in temperature because of a weather effect or because of a climate effect? Is that large-scale or just a short-term fluctuation? So there we have a very similar problem, and here what you do is you measure temperatures, and you want to detect abnormal variations, and you can improve that by measuring longer, like, for 300 years instead of 20 years. That gives you a better prediction of what you would expect in the future. Also, larger deviations help, right? If you look for something that is just 0.1 degree, then you might not be able to find it. If there is a deviation of 5 degrees, you will definitely find it. And for us it's very similar. So here we have a plot, one of the first Higgs discovery plots, and you can see that we have many ingredients there. The black dots are what we measure, and they have a certain uncertainty, because when we measure, we count, and we might have, you know, not seen something, or we might have seen more than we should have seen, so there's always an uncertainty. And then we also have theory, which tells us: you should have seen so many. So the red part, that's something that we know exists; it's nothing spectacular. It's simply what theory is telling us we should be seeing. And you can see the data follows the red part fairly well. But then there is this other bump in our dots, on the right-hand side, or in the center, and that does not make sense unless you take the Higgs into account, right, which is the light blue part. And so here you can see how this interplay between different sources of physics and statistics works for us. Now, just as for the climate, more data helps. And there are two versions of more data: either more data by having more collisions, which is why we are running 24/7, or more data by combining different analyses, which is what's happening here. So here you see all these different analyses. If you combine them, of course, you get a much stronger prediction of, in this case, the Higgs mass than if you just take any single one of them. You see how similar what we are doing is to, you know, any of the big data analyses out there. Okay, so that was that part. Now comes the obligatory part again: computing. When we were designing the LHC, not me, when people were designing the LHC, they needed to project computing power from 1990 to 2000, 2010 and so on. And then they said: "Well, we need massive amounts of computers", and for you that's now "Ughhh, everybody has it"; we have it as well, we have our racks of computers. This is something that the big companies usually don't show: you know, there is actually a ramp where the trucks arrive and they offload the things, and then someone needs to screw them together, and then it looks shiny. This is how we are spending our CPU time: we have about 60,000 cores that are spinning all the time for us, and they are distributed around the world. You can see that CERN, for example, is the red part there near the bottom. Yeah, so we make good use of that. We also monitor the efficiency, and because 100 percent efficient is for beginners, we are actually about 700 percent efficient. Don't ask why. They decided if you are multi-threading, then we, you know, multiply your efficiency by the number of threads you have. Makes no sense to me. We also have storage; currently we use about 0.7 exabytes.
We have 1.7 exabytes available, so that's good, we make good use of the storage we have. That's, you know, tera-, peta-, exa-, so it's a lot. And here, on the right-hand side, you see for example the tape usage on the bottom, and you see this dip: that was before we were starting the accelerator again, we needed to make some space. So we monitor our storage usage all the time. Hey, here comes the next decision point: do you want to hear about, 1, distributed computing, or, 2, measuring the effects of bugs? So, 1, distributed computing applause and, 2, measuring the effects of bugs similar amount of applause Okay, so that's my call, and I would say we do... measuring the effects of bugs, because it's shorter. laughter So this is one of the views you can, you know, the electronic views you can get from a detector, and you see how we trace the particles that fly through the detector. Now, that's software, right, that's the result of software, and, you might not believe it, we have bugs in there, in that software. And you know, these bugs are sometimes wrong coordinate transformations, so things don't go this way but that way, it's kind of weird if you look at it, and the result is that our particles don't take the path that they should have been taking, but we attribute a different path to them. Now, the nice thing is that we are doing this a million times, right? So all of that is smeared. We are not systematically doing this wrong; it's just, we are always doing it a little bit wrong. And so the net result is that if we measure our particles, we will not measure the right thing, but always a little bit wobbly left, wobbly right, you know? Things are not as precise. That's simply an uncertainty. So for us, just like counting has an uncertainty and predictions have an uncertainty, software bugs introduce another source of uncertainty. And here you can see how we are tracking uncertainties for all of our analyses. We are trying to understand the different sources of uncertainty. And again, bugs are only one of the sources here, so if we find a bug, then we reduce our uncertainty and we can find new physics earlier, instead of having to wait and collect more data. So for us, finding bugs is really key; we really love finding bugs, because it brings physics closer. I thought that was interesting. It's kind of rare that you're in an environment where you're able to measure the effect of bugs. Okay, so now we'll be talking about data. I told you that we are trying to find particle traces in our data, and the way we do this is by using reconstruction programs, and these are multiple gigabytes of binaries in shared libraries and stuff. They're huge, they're experiment specific and they are curated by the experiments, open source for some of them, and we want them to be correct and efficient. The data format we use is not comma-separated values; it's binary, and for some strange reason it's our own custom binary format. The reason is that it's really targeted at the kind of data we have. We have collisions that are independent, so we only need one in memory at any time, and we have nested collections, which makes the regular table layout a non-starter. We actually generate the format from C++ objects, so from classes, class definitions, C++ class definitions, and we can read the data back into C++, but also into JavaScript or Scala. Databases just didn't do it for us.
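To give a flavor of what that looks like in practice, here is a hedged sketch in ROOT (toy classes of my own, not an experiment's real event model): a C++ class with nested collections is written out one collision at a time, with the byte layout derived automatically from the class definition.

```cpp
// Toy sketch of ROOT-style serialization from a C++ class definition.
// Assumes ROOT can generate a dictionary for these classes (it does so
// on the fly when this runs in ROOT's C++ interpreter).
#include <vector>
#include "TFile.h"
#include "TTree.h"

struct Track {          // one reconstructed particle track
    float pt, eta, phi; // transverse momentum and direction
};

struct Event {                 // one collision ("event")
    std::vector<Track> tracks; // nested collection: the track count
};                             // differs per event, so no fixed table

int main() {
    TFile file("events.root", "RECREATE"); // ROOT's own binary format
    TTree tree("events", "toy collisions");
    Event event;
    tree.Branch("event", &event); // streaming code comes from the class
                                  // definition, not hand-written
    for (int i = 0; i < 1000; ++i) {
        event.tracks.clear();
        event.tracks.push_back({42.f, 1.1f, 0.3f}); // fake data
        tree.Fill(); // serialize this event; only one is in memory
    }
    tree.Write();
    return 0;
}
```

Because the layout is derived from the class definition rather than hand-written, adding a member to Track later does not orphan the old files; that is the schema evolution point coming up next.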
Databases have the wrong model of data access, they don't scale, it's just not the kind of system that works for us. Also, using a file system as a storage back-end might sound really very traditional and boring, but it works amazingly well and seems to be future-proof as well, so that's just the way to go for us. There are many other structured data formats out there; many of those did not exist when we started ROOT, our own data format. But they also miss many things. For example, we wanted to make sure that we have schema evolution support: we can change the class layout and still read back all data. We don't want to throw away all data just because we're changing the class. Also, we do not trust people. That is, you know, as a computer scientist or whatever, you probably know what I'm talking about, right? If people have to write their own streaming algorithm, there will be bugs and we will lose data. We really don't want to do this, so we were trying to automate this, based on the class definition. So, last decision point for the story. Do you want to hear about cling, our C++ interpreter, or about Open Data and Applied Science? Let's start with option 1, the C++ interpreter applause Okay, and Open Data and Applied Science? more applause than before Yeah. I'm heading there. You missed a fish. You can look at the slides later. Okay, so there we go. Really? No. The slide number is wrong. Oh, a bug! So, Open Data and Applied Science. Okay, you really wanted to know about our budget, I understand that. So, we get from you about 1 billion a year, and the currency doesn't really matter anymore at this point of time. laughter And that is a lot of money. And you know, we try to do really wonderful things; I mean, we really enjoy our job, we love it. It's fantastic to work in such an environment. And thank you very much for making that possible. Really, I mean it. But it also means that you decided as a society to enable something like CERN. Which I think really deserves my applause, and yours probably as well. I think it's a great decision to do something like this. applause So we realize this, right? We realized that we can do what we do because of you, and we are trying to react to that by giving back what we do: software, research results, hardware and data. So, the way we share research results is through open access. We have it, finally. It took us a long time to fight with publishers and, you know, the establishment, but now we have it. We also, yes, thank you. applause We also put a lot of effort into communicating our results and what we are doing. And if you're in the region, it's definitely worth a visit. I mean, the URL is really easy to remember, it's visit.cern, and, you know, it works. And you should go there by April, actually, if you can, because then you can ask people how to get underground, because the accelerator is off at the moment. We also do applied research. For example, we have this super cool experiment where we try to study how clouds form, based on cosmic rays. So, the influence of cosmic rays on cloud formation. Which is a key element in the uncertainty of climate models. We are trying to think about, you know, how to make energy from nuclear waste. So, getting rid of nuclear waste while making energy from it. And we are trying to repurpose detectors that we, you know, develop.
We have something called open hardware, for example White Rabbit: deterministic Ethernet. We have Open Data, and we have LHC@home and some other programs where you can donate either compute power or your brain and help us get better results. We explicitly try to use open source as much as possible, and also feed back whenever we see issues. But we also create open source. For example, we create Geant, which is a program that allows you to simulate how particles fly through matter, used for example by NASA. We have Indico, which allows us to schedule meetings, upload slides, you know, these kinds of things. Across the globe, lots of people, with access protection, all these kinds of things. And it's open source. We have DaviX; did I mention we love HTTP? That's the NeXT machine of Tim Berners-Lee, and that's his futile effort at trying to prevent the cleaning personnel from switching it off; they don't speak English, they did not back then at least. So we use DaviX to transfer files over HTTP, with high bandwidth. Or we have CVMFS, which allows us to distribute our binaries across the globe, and not rely on admins downloading stuff and making sure it actually runs, these kinds of things. That is a lifesaver, it's really fantastic, it's a great tool. But nobody knows it. And we have ROOT, but that's coming up. So now, the last official part of this presentation: how do we do data analysis? Not like that. laughter applause We use C++, and actually physicists need to write their own analysis in C++. We have very few people who have an actual education in programming, so that's sort of a clash. As I said, we need to keep one collision in memory. And what matters to us is throughput: we want to analyze as many collisions as possible per second. What we can do is specialize our data format to match the analysis, because we don't want to waste I/O cycles if we can make use of the CPU better (there's a small sketch of this below). ROOT has allowed us to do this for twenty years. It's really the workhorse for the analysis in high energy physics. And it's also an interface to complex software. We have serialization facilities, we have the statistical tools that people need, and we have graphics, because once you have done your analysis you need to communicate that to your peers and convince people, and publish, and so on, so that's part of the game. All of that is open source, and, of course, all of that is not just used by high energy physics. So, to conclude: We are here because you make it possible. Thank you very much. It's fantastic to have you. applause We want to share, and we have great people for science outreach, but we have nobody for software outreach, basically. So maybe it's worth a look to see what CERN is producing software-wise. Scientific computing is nothing new, it has existed for a long time, but we had to start fairly early on a large scale. So when we were building it up, we were trying to take pieces that existed, and did not find much. So now we ended up with C++ data serialization, efficient computing for non-computer-scientists even... In the part that I skipped, you know, one of the alternate tracks, you would have seen that we have a Python binding as well for the whole software stack in C++. And for us, what matters most is scale.
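To make the throughput point concrete, here is a minimal sketch of reading the toy file from the earlier example back in ROOT (again my illustration, not an experiment's analysis code): the analysis loop keeps one event in memory and reads only what it needs.

```cpp
// Toy sketch of a throughput-oriented ROOT analysis loop, reading back
// the hypothetical events.root written in the earlier sketch.
#include <cstdio>
#include <vector>
#include "TFile.h"
#include "TTree.h"

struct Track { float pt, eta, phi; };          // same toy classes as in
struct Event { std::vector<Track> tracks; };   // the writing sketch

int main() {
    TFile file("events.root", "READ");
    TTree* tree = nullptr;
    file.GetObject("events", tree);

    Event* event = nullptr;                    // one event in memory at a time
    tree->SetBranchAddress("event", &event);

    // With ROOT's branch splitting one could go further and enable only
    // the members the analysis touches (e.g. just the tracks' pt via
    // SetBranchStatus), skipping the other bytes on disk entirely.
    double sumPt = 0;
    Long64_t nTracks = 0;
    const Long64_t nEvents = tree->GetEntries();
    for (Long64_t i = 0; i < nEvents; ++i) {
        tree->GetEntry(i);                     // deserialize just this event
        for (const Track& trk : event->tracks) {
            sumPt += trk.pt;
            ++nTracks;
        }
    }
    std::printf("mean track pt: %g GeV\n", sumPt / nTracks);
    return 0;
}
```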
Now we are seeing that we are not the only ones: there are many more natural sciences arriving at a similar challenge of having to analyze large amounts of data. Now, I promised you that I'll be bold and try to make a few statements about what will happen with data analysis, not just in science. Because what we see is that we actually educate the people who will do data analysis, not just in science. What we see is that in the past, data volume mattered most: more data meant more power. Now that's not the complete truth anymore. It's a lot about finding correlations. So even with the amount of data not growing anymore, because it's already humongous, we try to squeeze more knowledge out of it. And for that, I/O becomes important and CPU limitations become the crucial factor. We see that multivariate techniques are still rising, and they will just be part of the toolchain of statistical tools; except for the generative parts, which, I believe, will change the way we model. Now, based on what I just described, this is not a big surprise anymore: as we need throughput, we need to have a language for the core analysis part that is close to the metal, so something like C++. On the other hand, writing analyses is still complex, so you need a higher-level language, and for that people could, for example, use Python. So now language bindings become relevant all of a sudden; they will be much more important in the future. And we need to tailor I/O to the actual analysis to not waste CPU cycles. So throughput is king, and, in my point of view, also in the future we will see much more effort in increasing throughput. Okay, so that was it. In case you want to discuss anything with me, like "That's just wrong!", that's fine; I probably have several bugs in there. I'm still here until tomorrow. I don't know where yet, so I'll wander around, and you can contact me by email or Twitter. Thank you very much for your attention. Thank you. applause music subtitles created by c3subtitles.de in the year 2017. Join, and help us!