
How physicists analyze massive data: LHC + brain + ROOT = Higgs (33c3)

  • 0:00 - 0:13
    music
  • 0:13 - 0:17
    Herald: Good morning and welcome back to
    stage one. It's kind of going to be the
  • 0:17 - 0:21
    second talk about physics on this day
    already and it's about big data and
  • 0:21 - 0:27
    science and big data became something like
    Uber in science. It's everywhere every
  • 0:27 - 0:33
    discipline has it. Axel Naumann's working
    for CERN, the accelerator in Switzerland
  • 0:33 - 0:39
    and he talks about how physics and
    computing bridge in this area and he works
  • 0:39 - 0:43
    a lot with ROOT, a program that helps
    transform data into knowledge. A warm
  • 0:43 - 0:45
    welcome.
  • 0:45 - 0:45
    Axel Naumann: Thank you.
  • 0:45 - 0:51
    applause
  • 0:51 - 0:58
    AN: Thanks a lot. So, well you know, when,
    when I was discussing this abstract with
  • 0:58 - 1:01
    the science track people, they told me:
    "Well, you know about three hundred people
  • 1:01 - 1:06
    might be in the audience." But well, hey,
    you are huge, that's much more than three
  • 1:06 - 1:11
    hundred people. So thank you so much for
    inviting me over it's a real honor. And of
  • 1:11 - 1:15
    course originally, when talking to 300
    people who are all science-interested, I
  • 1:15 - 1:21
    thought you know I pick something fairly
    narrow focuswise but then I learned I'm
  • 1:21 - 1:25
    going to be in Saal one and that's
    different, so I decided to make the scope
  • 1:25 - 1:31
    a little bit wider and that's what I ended
    up with. I'll talk a little bit about
  • 1:31 - 1:38
    CERN in society as well if you so choose,
    you'll see what that means in a minute. So
  • 1:38 - 1:42
    the things I'll cover here is obviously
    CERN just a little bit of an introduction
  • 1:42 - 1:46
    how we do physics, how we do computing,
    what data means to us and I can tell you
  • 1:46 - 1:52
    it means everything, you heard about that
    already, right? How we do data analysis in
  • 1:52 - 1:56
    high energy physics and just because
    we've been doing it for a while and
  • 1:56 - 2:01
    because I've been doing it for more than
    ten years, I'm one of the guys who's
  • 2:01 - 2:07
    providing the software to do data
    analysis in high energy physics, so, you
  • 2:07 - 2:11
    know, because we know what we are doing
    and we have some experience, I thought
  • 2:11 - 2:18
    maybe you might be interested in hearing
    what my forecast is for data analysis in
  • 2:18 - 2:25
    general, in the future. So let's start
    with CERN. And so if you wonder what CERN
  • 2:25 - 2:32
    is, you've all heard about CERN, about
    the fantastic fonts we love to use, then
  • 2:32 - 2:37
    you've probably also heard that we are
    doing science. We were founded right after
  • 2:37 - 2:41
    the Second World War or soon after the
    Second World War, basically as a way to
  • 2:41 - 2:47
    entertain those freaky scientists. You
    know that was the idea: peace Europe-wide.
  • 2:47 - 2:52
    And damn, that's working out really well
    and so well there's not just Europe
  • 2:52 - 2:58
    anymore these days. We are located near
    Geneva, we are doing only fundamental
  • 2:58 - 3:02
    research, so we don't do any weapons,
    nuclear stuff you
  • 3:02 - 3:10
    know, these kind of things. The WWW was
    invented at CERN but that was just a, you
  • 3:10 - 3:15
    know, side effect happens sometimes, that
    we invent things. But usually we just do
  • 3:15 - 3:22
    science. So what we do is, we take money,
    lots of it, and brains who like to discuss
  • 3:22 - 3:27
    and think and come up with ideas and from
    that we generate knowledge. It's really
  • 3:27 - 3:33
    all about curiosity. The things we try to
    answer is: what is mass? Which is a funny
  • 3:33 - 3:37
    question right? Like we all know what mass
    is but actually we don't. We know what
  • 3:37 - 3:42
    mass is in the universe. We understand
    that masses attract one another: gravity.
  • 3:42 - 3:49
    Which is beautifully correct. And in the
    small scale, our particles, we know that
  • 3:49 - 3:53
    mass is energy and we can convert them.
    But we don't understand how these two
  • 3:53 - 3:58
    things go together. Like there is no
    bridge, they contradict one another. So we
  • 3:58 - 4:05
    are trying to understand what that bridge
    might be. Part of that mass thing is of
  • 4:05 - 4:09
    course also what's out there in the
    universe? That's a big question. We only
  • 4:09 - 4:14
    understand a few percent of that. 90 and
    some percent are completely unknown to
  • 4:14 - 4:20
    us, and that's scary right? I mean we know
    gravity really well, we can deal with
  • 4:20 - 4:28
    freaky things like black holes and yet we
    don't understand what's out there. Now to
  • 4:28 - 4:32
    do all these things we are probing nature
    at the smallest scale as we call it, so
  • 4:32 - 4:36
    that's particles, we are dealing with
    things like the Higgs particle and
  • 4:36 - 4:44
    supersymmetry. Here's a little bit of a
    fact sheet. We have about 12,000
  • 4:44 - 4:48
    physicists who are working with CERN. We
    are basically the workbench that you saw
  • 4:48 - 4:55
    in Andre's talk before. We are the table
    that physicists use, okay? And, so they
  • 4:55 - 4:59
    come to CERN once in a while, about
    10,000 physicists a year, or they work
  • 4:59 - 5:03
    remotely most of the time from about 120
    nations. So you're seeing it's not
  • 5:03 - 5:11
    European anymore, this is a global thing.
    CERN in itself has about 2,500 employees,
  • 5:11 - 5:15
    you know those scrubbing the table,
    setting things up and so on. And our
  • 5:15 - 5:21
    table is right here. In the far end we
    have the Alps, it's in Switzerland
  • 5:21 - 5:26
    as I said, so the Alps are
    always close, with Mont Blanc, we have the
  • 5:26 - 5:32
    Lake Geneva we have the Jura, the French
    Mountains on the lower end here, it's just
  • 5:32 - 5:37
    beautiful. It's really nice, but we
    needed to stick a 30-kilometer ring in
  • 5:37 - 5:44
    there somewhere and people would have
    hated us had we put it like this. But
  • 5:44 - 5:50
    luckily people were smart back then in the
    70s, and built a tunnel. Much better. So
  • 5:50 - 5:55
    now we have this huge tunnel, and we send
    particles through in both directions near
  • 5:55 - 6:00
    the speed of light and the tunnel is
    filled with magnets simply because if you
  • 6:00 - 6:08
    don't use a magnet the particles will fly
    straight but we need them to turn around.
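As a back-of-the-envelope sketch of why the ring is full of magnets (the formula is the standard textbook bending-radius relation; the numbers are public LHC figures, not quoted in the talk):

```python
# Why the tunnel is filled with magnets: a singly charged particle in a
# magnetic field follows a circle with radius
#   r [m] ~ p [GeV/c] / (0.3 * B [T])

def bending_radius_m(momentum_gev_per_c: float, field_tesla: float) -> float:
    """Bending radius in meters for a unit-charge particle."""
    return momentum_gev_per_c / (0.3 * field_tesla)

# Illustrative public LHC numbers (not from the talk):
# 6.5 TeV protons, 8.33 T dipole field.
radius = bending_radius_m(6500.0, 8.33)
print(f"bending radius ~ {radius / 1000:.1f} km")  # ~2.6 km
```

Weaker magnets would mean a bigger ring, which is why beam energy, dipole field, and the 27-kilometer circumference are tied together.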
  • 6:08 - 6:14
    Here you see what it's looking like, you
    also see these big halls there that have
  • 6:14 - 6:22
    access shafts from the top and that's
    where the experiments are. That's sort of
  • 6:22 - 6:29
    a sketch of one of the experiments. So
    the LHC is one of the, no, is the biggest
  • 6:29 - 6:36
    particle accelerator at the moment, it's a
    ring with 27 kilometers circumference, 100
  • 6:36 - 6:40
    meters below Switzerland and France, it
    has four big experiments and several
  • 6:40 - 6:45
    small ones and we are expected to run
    until 2030. So you see that all of that
  • 6:45 - 6:50
    is large-scale simply because we're trying
    to make good use of the money we have.
  • 6:50 - 6:56
    Here, you see one of these caverns that
    are used by the experiments while it was
  • 6:56 - 7:01
    empty. The experiment was then lowered
    through this hole in the roof, piece by
  • 7:01 - 7:07
    piece, and these things are humongous. To
    give you an impression of how big it is, I
  • 7:07 - 7:13
    put Waldo in there, so your job for the
    next three slides is to find Waldo. You
  • 7:13 - 7:16
    know, that gives you the scale. He's
    waving at you in a friendly way, so it should be
  • 7:16 - 7:22
    easy to find him. So then we put a
    detector in there. Here it's pulled apart
  • 7:22 - 7:26
    a little bit, so it looks nicer, you can
    actually see something. You can for
  • 7:26 - 7:31
    example see the beam pipe, so that's where
    the particles are flying through, and then
  • 7:31 - 7:35
    they're coming from both directions and
    colliding in the center of the detector
  • 7:35 - 7:38
    and then things happen we try to
    understand what
  • 7:38 - 7:45
    is happening. That's yet another view,
    frontal view on one of the detectors and
  • 7:45 - 7:51
    now you have to imagine that, you know,
    you can't just open up Amazon and order an
  • 7:51 - 7:56
    LHC experiment, right, that's not how it
    works. We do this stuff ourselves, like
  • 7:56 - 8:03
    PhD students, postdocs, engineers. You
    know, that's all done by hand, just like
  • 8:03 - 8:07
    the microscope you saw before. Of course
    you order the parts, but you know the
  • 8:07 - 8:11
    design, the whole conception and actually
    screwing these things together, making
  • 8:11 - 8:17
    sure that all fits, is all done by hand.
    And I find that just beautiful, I mean
  • 8:17 - 8:22
    that's close to a miracle, right? That
    nations, like people no matter what
  • 8:22 - 8:27
    nation, people across the globe work
    together to build such a huge thing and
  • 8:27 - 8:39
    then you turn it on and it works. More or
    less, but you get it to work. That's not
  • 8:39 - 8:44
    my applause, that's your applause, because
    you make this possible. Really, but it's,
  • 8:44 - 8:50
    it's huge this is for me one of the things
    I love most about CERN: That is this
  • 8:50 - 8:55
    international thing that just works
    smoothly. Now the detectors are like a
  • 8:55 - 9:01
    massive camera. We have lots of pixels and
    we take many, many pictures a second. We
  • 9:01 - 9:07
    do this to identify particles and then
    sort of estimate what has happened during
  • 9:07 - 9:15
    the collision. Now, life at CERN is of
    course an important ingredient for
  • 9:15 - 9:20
    scientists as well, and if you live at
    CERN then actually it's just work at CERN
  • 9:20 - 9:24
    and that's what it's about. But it's not
    that bad, so we hang out together in our
  • 9:24 - 9:30
    control rooms, make sure that the
    experiments work correctly. We also, you
  • 9:30 - 9:34
    know, study the forces.
    laughter
  • 9:34 - 9:39
    We have scientific discourse, in the sun,
    view on the Mont Blanc, with a good
  • 9:39 - 9:45
    coffee. We have lectures and we are
    lectured and of course, as you, we have
  • 9:45 - 9:55
    more laptops than people. And, then we do
    stuff and so this presentation is going to
  • 9:55 - 9:59
    introduce you to some of the things we are
    doing, and more on the computing and the
  • 9:59 - 10:04
    society side as I said. But because I have
    so much to talk about, I decided that
  • 10:04 - 10:09
    you just build your own talk, you tell me
    what you want to hear. So let's do this,
  • 10:09 - 10:14
    you can choose between A, physics, and B,
    model simulation and data. You remember
  • 10:14 - 10:19
    these books like from the old days when we
    were all young? It's that kind of thing,
  • 10:19 - 10:24
    ok? You decide/design your own talk here.
    So, by applause, do you want to hear about
  • 10:24 - 10:28
    physics?
    applause
  • 10:28 - 10:36
    Okay. Or the model simulation data part?
    louder applause
  • 10:36 - 10:45
    Okay, there we go. So, this is what we
    skip. Model simulation data it is. You're
  • 10:45 - 10:50
    a strange crowd, first time I meet people
    who don't want to hear about physics... no
  • 10:50 - 10:51
    I'm kidding.
    laughter
  • 10:51 - 10:54
    Audience: inaudible interjection
    laughter
  • 10:54 - 11:00
    So model simulation data it is. So our
    theory is actually incredibly precise.
  • 11:00 - 11:04
    It's so precise that our basic job is
    really really boring, because we already
  • 11:04 - 11:11
    understand everything. Whenever there is a
    collision, we know what's going to happen.
  • 11:11 - 11:15
    Except for these very rare things. So we
    are trying to find these very rare things
  • 11:15 - 11:20
    out of this haystack of fairly boring
    things that we really understand well. And
  • 11:20 - 11:26
    the weird things are, for example,
    monopoles, supersymmetry, or black holes.
  • 11:26 - 11:32
    Now the theorists' job is to tell us what
    we should be seeing in the detector, given
  • 11:32 - 11:42
    some fancy physics. Then we use simulation
    to see how our detector would respond to
  • 11:42 - 11:53
    that. Now, of course the question is: We
    are just counting, basically, when we do
  • 11:53 - 11:58
    experiments and the question is: How often
    do we need to see something to say: "Well,
  • 11:58 - 12:03
    that's not just the ordinary. That is
    something new, that's something that could
  • 12:03 - 12:10
    be explained by a weird theory." We use the
    detector simulation as I said to basically
  • 12:10 - 12:15
    predict how often we expect to see things.
    We use reconstruction software which
  • 12:15 - 12:21
    tells us what has happened, or might have
    happened in the detector to count how
  • 12:21 - 12:25
    often we saw something. And then we use
    statistics to compare these two and to say
  • 12:25 - 12:32
    whether something is expected or not. Now,
    that's fairly abstract but it's fairly
  • 12:32 - 12:37
    common, a fairly common approach. For
    example, if you look at climate versus
  • 12:37 - 12:40
    weather, right, I mean we always have
    temperature fluctuations because of
  • 12:40 - 12:46
    weather, and the question is: Is that rise
    in temperature because of a weather effect
  • 12:46 - 12:50
    or because of a climate effect? Is that
    large-scale or just a short-term
  • 12:50 - 12:56
    fluctuation? So there, we have a very
    similar problem and here what you do is
  • 12:56 - 13:01
    you measure temperatures, and you want to
    detect abnormal variations, and you can
  • 13:01 - 13:06
    improve that by measuring longer, like,
    for 300 years instead of 20 years. That
  • 13:06 - 13:12
    gives you a better prediction what you
    would expect in the future. Also, larger
  • 13:12 - 13:14
    deviations help, right? If you look for
    something that
  • 13:14 - 13:20
    is just 0.1 degree, then you might not be
    able to find it. If there is a deviation
  • 13:20 - 13:25
    of 5 degrees, you will definitely find it.
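The counting logic behind "how big does a deviation have to be?" can be sketched like this; note it is only the naive Gaussian approximation to Poisson counting, not the likelihood-based machinery the experiments actually use:

```python
import math

# Naive significance of an excess: observed counts vs. expected background.
# z ~ (observed - expected) / sqrt(expected) is the Gaussian approximation
# to Poisson counting; real LHC analyses use full likelihood fits.

def naive_significance(observed: int, expected: float) -> float:
    """Rough z-score of an excess over an expected count."""
    return (observed - expected) / math.sqrt(expected)

# A 30-event excess is striking on a background of 100 ...
print(round(naive_significance(130, 100.0), 2))   # 3.0
# ... but an ordinary fluctuation on a background of 1000.
print(round(naive_significance(1030, 1000.0), 2)) # 0.95
```

Relative fluctuations shrink as counts grow, which is exactly why running 24/7 and combining analyses both sharpen the result.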
    And for us it's very similar. So here we
  • 13:25 - 13:32
    have a plot, one of the first Higgs
    discovery plots, and you can see that we
  • 13:32 - 13:39
    have many ingredients there. So, the black
    dots are what we measure and they have
  • 13:39 - 13:44
    certain uncertainty, because when we
    measure, we count and we might have, you
  • 13:44 - 13:49
    know, not seen something, or we might have
    seen more than we should have seen, so
  • 13:49 - 13:55
    there's always an uncertainty. And then we
    also have theory, which tells us you
  • 13:55 - 14:00
    should have seen so many and so for the
    red part that's something that we know
  • 14:00 - 14:05
    exists, it's nothing spectacular. It's
    simply what theory is telling us what we
  • 14:05 - 14:11
    should be seeing. And you can see the data
    follows the red part fairly well. But then
  • 14:11 - 14:16
    there is this other bump in our dots on
    the right-hand side or in the center and
  • 14:16 - 14:21
    that does not make sense, unless you take
    the Higgs into account, right, which is
  • 14:21 - 14:27
    the light blue part and so here you can
    see how this interplay between different
  • 14:27 - 14:38
    sources of physics and statistics works
    for us. Now just as for the climate, more
  • 14:38 - 14:44
    data helps. And there are two versions of
    more data: either by having more
  • 14:44 - 14:48
    collisions, which is why we are running
    24/7, or more data by combining different
  • 14:48 - 14:52
    analyses which is what's happening here.
    So here you see all these different
  • 14:52 - 14:57
    analyses. If you combine them, of course
    you get a much stronger prediction of, in
  • 14:57 - 15:03
    this case, the Higgs mass, than if you
    just take any single one of them. You see
  • 15:03 - 15:09
    how similar what we are doing is to, you
    know, any of the big data analyses out
  • 15:09 - 15:16
    there. Okay, so that was that part. Now
    comes the obligatory part again,
  • 15:16 - 15:23
    computering. When we were designing the
    LHC, not me, when people were designing the
  • 15:23 - 15:31
    LHC, they needed to project computing
    power from 1990 to 2000, 2010, and so on.
  • 15:31 - 15:34
    And then they said: "Well, we need
    massive amount of computers" and for you
  • 15:34 - 15:38
    there's now "Ughhh - everybody has it, we
    have it as well, we have our racks of
  • 15:38 - 15:44
    computers". This is something that the big
    companies usually don't show: you know,
  • 15:44 - 15:49
    there is actually a ramp where the trucks
    arrive and they offload the things and
  • 15:49 - 15:54
    then someone needs to screw them together
    and then it looks shiny. This is how we are
  • 15:54 - 16:01
    spending our CPU time: We have about
    60,000 cores that are spinning all the
  • 16:01 - 16:07
    time for us, and they are distributed
    around the world. You can see that CERN,
  • 16:07 - 16:15
    for example, is the red part there near
    the bottom. Yeah, so we make good use of
  • 16:15 - 16:21
    that. We also monitor the efficiency, and
    because 100 percent efficient is for
  • 16:21 - 16:29
    beginners we are actually about 700
    percent efficient. Don't ask why. They
  • 16:29 - 16:34
    decided if you are multi-threading, then
    we, you know, we multiply your efficiency
  • 16:34 - 16:40
    by the number of threads you have. Makes
    no sense to me. We also have storage,
  • 16:40 - 16:45
    currently we use about 0.7 exabytes. We
    also have available about 1.7
  • 16:45 - 16:49
    exabytes, so that's good, we make use of
    the storage we have. Where it's, you know,
  • 16:49 - 16:56
    tera- peta- exa-, so it's a lot, and here
    you can see on the right hand side you
  • 16:56 - 17:00
    see, for example, the tape usage on the
    bottom and you see this dip that was
  • 17:00 - 17:04
    before we were starting the accelerator
    again, we needed to make some space so we
  • 17:04 - 17:09
    monitor our hard disk usage all the time.
    Hey, here comes the next decision point:
  • 17:09 - 17:14
    So, do you want to hear about, 1,
    distributed computing or 2, measure
  • 17:14 - 17:18
    effects of bugs. So, 1, distributed
    computing
  • 17:18 - 17:26
    applause
    and 2, measure the effects of bugs
  • 17:26 - 17:36
    similar amount of applause
    Okay, so that's my call, and I would say
  • 17:36 - 17:41
    we do we do... Measure the effects of
    bugs, because it's shorter.
  • 17:41 - 17:47
    laughter
    So this is one of the views you can, you
  • 17:47 - 17:51
    know, electronic views you can get from a
    detector and you see how we trace the
  • 17:51 - 17:55
    particles that fly through the detector.
    Now, that software right, that's the
  • 17:55 - 18:00
    result of software, and you might not
    believe it, we have bugs in there, in
  • 18:00 - 18:01
    that software.
  • 18:03 - 18:07
    And you know, these bugs are sometimes
    wrong coordinate transformations, so
  • 18:07 - 18:13
    things don't go this way but that way,
    it's kind of weird if you look at it, and
  • 18:13 - 18:17
    the result is that our particles don't go
    through the path that they should have
  • 18:17 - 18:25
    been going, but we are attributing them a
    different path. Now, the nice thing
  • 18:25 - 18:31
    is that we are doing this a million times,
    right? So all of that is smeared. We are
  • 18:31 - 18:36
    not systematically doing this wrong, it's
    just, we are always doing it a little bit
  • 18:36 - 18:42
    wrong. And so the net result is that if we
    measure our particles, we will not measure
  • 18:42 - 18:47
    the right thing but always a little bit
    wobbly left, wobbly right, you know? Things
  • 18:47 - 18:54
    are not as precise. That's simply an
    uncertainty. So for us just like counting
  • 18:54 - 18:59
    has an uncertainty and predictions have
    an uncertainty, software bugs introduce
  • 18:59 - 19:06
    another source of uncertainties. And here
    you can see how we are tracking
  • 19:06 - 19:09
    uncertainties for all of our
    analyses. We are trying to understand the
  • 19:09 - 19:16
    different sources of uncertainties. And
    again, bugs are only one of the sources
  • 19:16 - 19:23
    here, so if we find a bug then we
    reduce our uncertainty and we can find new
  • 19:23 - 19:28
    physics earlier, instead of having to
    wait and collect more data. So for us
  • 19:28 - 19:32
    finding bugs is really key, we really
    love finding bugs because it brings
  • 19:32 - 19:37
    physics closer. I thought that was
    interesting. It's kind of rare that you're
  • 19:37 - 19:42
    in an environment where you're able to
    measure the effect of bugs. Okay, so now
  • 19:42 - 19:48
    we are talking, we'll be talking about
    data. I told you that we are
  • 19:48 - 19:53
    trying to find particle traces in our
    data and the way we do this is by using
  • 19:53 - 19:57
    reconstruction programs and there are
    multiple gigabytes of binaries in shared
  • 19:57 - 20:02
    libraries and stuff. They're huge, they're
    experiment specific and they are curated
  • 20:02 - 20:06
    by the experiments, open-source for some
    of them, and we want them to be correct
  • 20:06 - 20:14
    and efficient. The data format we use is
    not comma separated values, it's binary
  • 20:14 - 20:21
    and for some strange reason it's our own
    custom binary format. The reason is that
  • 20:21 - 20:27
    it's really targeted at the kind of
    data we have. We have collisions
  • 20:27 - 20:32
    that are independent, so we only need one
    in memory at any time and we have nested
  • 20:32 - 20:39
    collections which makes the regular table
    layout a non-starter. We actually generate
  • 20:39 - 20:44
    them from C++ objects so from classes,
    class definitions, C++ class definitions
  • 20:44 - 20:51
    and we can read them back into C++ but
    also into JavaScript or Scala. Databases
  • 20:51 - 20:57
    just didn't do it for us. They have the
    wrong model of data access, they don't
  • 20:57 - 21:03
    scale, it's just not the kind of system
    that works for us. Also using a file
  • 21:03 - 21:09
    system as a storage back-end might sound
    really very traditional and boring but it
  • 21:09 - 21:14
    works amazingly well and seems to be
    future proof as well, so that's just the
  • 21:14 - 21:20
    way to go for us. There are many other
    structured data formats out there, many of
  • 21:20 - 21:26
    those did not exist when we started ROOT,
    our own data format. But they also miss
  • 21:26 - 21:30
    many things. For example, we wanted to
    make sure that we have schema evolution
  • 21:30 - 21:34
    support. We can change the class layout
    and still read back all data. We don't
  • 21:34 - 21:39
    want to throw away all data just because
    we're changing the class. Also we do not
  • 21:39 - 21:43
    trust people. That is a, you know, as a
    computer scientist or whatever you
  • 21:43 - 21:47
    probably know what I'm talking about
    right? If people have to write their own
  • 21:47 - 21:51
    streaming algorithm, there will be bugs
    and we will lose data.
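The idea described here, deriving the streaming code from the class definition itself while still reading data written against an older class layout, can be sketched as a toy Python stand-in (ROOT does this for C++ classes with its own binary format; the `Collision` class and the JSON encoding below are purely illustrative):

```python
# Toy stand-in for dictionary-driven serialization with schema evolution.
# Nothing here is ROOT's actual format; it only illustrates the idea.
import dataclasses
import json

@dataclasses.dataclass
class Collision:           # hypothetical event class
    energy: float = 0.0
    n_tracks: int = 0
    vertex_z: float = 0.0  # imagine this field was added in "version 2"

def stream(obj) -> str:
    """Writer generated from the class definition: no hand-written code."""
    return json.dumps(dataclasses.asdict(obj))

def unstream(cls, data: str):
    """Reader that tolerates schema evolution: unknown fields are dropped,
    missing fields fall back to the class defaults."""
    known = {f.name for f in dataclasses.fields(cls)}
    raw = {k: v for k, v in json.loads(data).items() if k in known}
    return cls(**raw)

# "Version 1" data, written before vertex_z existed:
old_record = '{"energy": 13000.0, "n_tracks": 42}'
evt = unstream(Collision, old_record)
print(evt.vertex_z)  # 0.0 -> old data is still readable
```

Because both writer and reader are derived from the class, nobody hand-writes a streaming routine, and old files survive a change of class layout.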
  • 21:51 - 21:55
    We really don't want to do this, so we
    were trying to automate this, based on the
  • 21:55 - 22:03
    class definition. So, last decision point
    for the story. Do you want to hear about
  • 22:03 - 22:10
    cling, our C++ interpreter or about Open
    Data and Applied Science? Let's start with
  • 22:10 - 22:15
    option 1, the C++ interpreter
    applause
  • 22:15 - 22:21
    Okay and and Open Data and Applied
    Science?
  • 22:21 - 22:30
    more applause than before
    Yeah. I'm heading there. You miss a fish.
  • 22:30 - 22:35
    You can look at the slides later. Okay, so
    there we go. Really? No. The slide number
  • 22:35 - 22:41
    is wrong. Oh a bug! So, Open Data and
    Applied Science. Okay, you really wanted
  • 22:41 - 22:48
    to know about our budget, I understand
    that. So we get from you about 1 billion
  • 22:48 - 22:51
    a year, and the currency doesn't really
    matter anymore at this, at this point of
  • 22:51 - 22:54
    time.
    laughter
  • 22:54 - 23:01
    And that is a lot of money. And you know?
    We try to do really wonderful things, I
  • 23:01 - 23:05
    mean we really enjoy our job, we love it.
    It's fantastic to work in such an
  • 23:05 - 23:09
    environment. And thank you very much for
    making that possible. Really, I mean it.
  • 23:11 - 23:17
    But it also means that you decided as
    society to enable something like CERN.
  • 23:17 - 23:22
    Which I think really deserves my applause
    and yours probably as well. I think it's a
  • 23:22 - 23:24
    great decision to do something like this.
  • 23:24 - 23:30
    applause
  • 23:31 - 23:36
    So we realize this, right? We realized
    that we are basically, that we can do what
  • 23:36 - 23:40
    we do because of you, and we are trying to
    react to that by giving back what we do.
  • 23:40 - 23:47
    Software, research results, hardware and
    data. So the way we share research results
  • 23:47 - 23:53
    is through open access. We have it,
    finally. It took us a long time to fight
  • 23:53 - 23:58
    with publishers and, you know, the
    establishment, but now we have it. We
  • 23:58 - 23:59
    also, yes thank you.
  • 23:59 - 24:03
    applause
  • 24:03 - 24:08
    We also put a lot of effort in
    communicating our results and what we are
  • 24:08 - 24:13
    doing. And if you're in the region, it's
    definitely worth a visit. I mean the URL
  • 24:13 - 24:18
    is really easy to remember, it's
    visit.cern, and you know, works. And you
  • 24:18 - 24:22
    should go there by April, actually, if you
    can because then you can ask people how to
  • 24:22 - 24:28
    get on the ground, because the accelerator
    is off at the moment. We also do applied
  • 24:28 - 24:32
    research, for example we have this super
    cool experiment where we try to study how
  • 24:32 - 24:40
    clouds form, based on cosmic rays. So the
    influence of cosmic rays on cloud
  • 24:40 - 24:46
    formation. Which is a key element in the
    uncertainty of climate models. We are
  • 24:46 - 24:50
    trying to, to think about, you know, how
    to make energy from nuclear waste. So
  • 24:50 - 24:55
    getting rid of nuclear waste while making
    energy from it. And we are trying to
  • 24:55 - 25:02
    repurpose detectors that we have and you
    know develop. We have something called
  • 25:02 - 25:08
    open hardware, for example White Rabbit:
    deterministic Ethernet, we have Open Data,
  • 25:08 - 25:13
    and we have the LHC@home and some other
    programs, where either you can donate
  • 25:13 - 25:21
    compute power or your brain and help us
    get better results. We explicitly try to
  • 25:21 - 25:26
    use open source as much as possible, and
    also feed back, whenever we see issues.
  • 25:28 - 25:34
    But we also create open source. For
    example, we create Geant, which is a
  • 25:34 - 25:38
    program that allows you to simulate how
    particles fly through matter, for
  • 25:38 - 25:45
    example used by NASA. We have Indico,
    which allows us to schedule meetings,
  • 25:45 - 25:49
    upload slides, you know, these kind of
    things. Across the globe, lots of people,
  • 25:49 - 25:53
    with access protection, all these kind of
    things. And it's open source. We have
  • 25:53 - 25:59
    DaviX, which shows to what extent we love
    HTTP. That's the NeXT machine of Tim Berners-Lee. And
  • 25:59 - 26:03
    that's his futile effort in trying to
    prevent the cleaning personnel from
  • 26:03 - 26:08
    switching it off. They don't speak
    English, they did not back then at least.
  • 26:09 - 26:16
    So we use DaviX to transfer files
    over HTTP, with a high bandwidth. Or we
  • 26:16 - 26:21
    have CVMFS, which allows us to distribute
    our binaries across the globe, and not
  • 26:21 - 26:27
    rely on admins downloading stuff and
    making sure it actually runs, and these
  • 26:27 - 26:32
    kind of things. That is a lifesaver, it's
    really fantastic, it's a great tool. But
  • 26:32 - 26:38
    nobody knows it. And we have ROOT, but
    that's coming up. So now, the last
  • 26:38 - 26:43
    official part of this, of this
    presentation, how do we do data analysis?
  • 26:43 - 26:45
    Not like that.
    laughter
  • 26:45 - 26:52
    applause
    We use, we use C++ and actually physicists
  • 26:52 - 26:58
    need to write their own analysis in C++.
    We have very few people who have an actual
  • 26:58 - 27:04
    education in programming, so that's sort
    of a clash. As I said, we need to keep one
  • 27:05 - 27:08
    collision in memory. And, you
    know, what matters to us is throughput. We
  • 27:08 - 27:13
    want to have, we want to analyze as many
    collisions as possible per second. What we
  • 27:13 - 27:17
    can do, is specialize our data format to
    match the analysis, because we don't want
  • 27:17 - 27:23
    to waste I/O cycles, if we can, you know,
    if we can make use of the CPU better. ROOT
  • 27:23 - 27:29
    has allowed us to do this for twenty years.
    It's really the workhorse for the analysis
  • 27:29 - 27:35
    in high energy physics. And it's also an
    interface to complex software. We have
  • 27:35 - 27:41
    serialization facilities, we have the
    statistical tools, that people need, and
  • 27:41 - 27:44
    we have graphics, because once you have
    done your analysis you need to communicate
  • 27:44 - 27:48
    that to your peers and convince people,
    and publish, and so on, so that's part of
  • 27:48 - 27:54
    the game. All of that is open source, and,
    of course, all of that is not just used by
  • 27:54 - 28:03
    high energy physics. So, to conclude: We
    are here, because you make it possible.
  • 28:03 - 28:05
    Thank you very much. It's fantastic to
    have you.
  • 28:05 - 28:11
    applause
    We want to share and we have great people
  • 28:11 - 28:17
    for science outreach, but we have nobody
    for software outreach, basically. So maybe
  • 28:17 - 28:25
    it's worth a look to see what what CERN is
    producing software-wise. Scientific
  • 28:25 - 28:30
    computing is nothing new, it has existed for
    a long time, but we had to start fairly
  • 28:30 - 28:35
    early on a large scale. So when we were
    building it up, we had to take... we were
  • 28:35 - 28:40
    trying to take pieces that existed and did
    not find much. So now we ended up
  • 28:40 - 28:45
    with C++ data serialization, efficient
    computing for non-computer scientists
  • 28:45 - 28:50
even... In the part that I skipped, in
one of the alternate tracks, you
  • 28:50 - 28:54
    would have seen that we have a Python
    binding as well for the whole software
  • 28:54 - 29:00
    stack in C++. And for us, what matters
    most is scale. Now we are seeing that we
  • 29:00 - 29:04
    are not the only ones. There are many more
    natural sciences arriving at a similar
  • 29:04 - 29:09
    challenge of having to analyze large
    amounts of data. Now I promised to you
  • 29:09 - 29:12
    that I'll be bold and I'll try to make a
few statements about what will happen with
  • 29:12 - 29:17
    data analysis, not just in science.
    Because what we see is that we actually
  • 29:17 - 29:23
    educate the people who will do data
    analysis, not just in science. What we see
  • 29:23 - 29:31
    is that in the past, data volume mattered
    most. So more data meant more power. Now
  • 29:31 - 29:36
    that's not the complete truth anymore.
    It's a lot about finding correlations. So
  • 29:36 - 29:41
    even with the amount of data not growing
    anymore, because it's already humongous,
  • 29:41 - 29:46
    we try to squeeze more knowledge out of
    it. And for that, I/O becomes important
  • 29:46 - 29:54
and CPU limitations are the crucial factor.
    We see that multivariate techniques are
  • 29:54 - 29:59
still rising, and they will simply become
part of the statistical toolchain;
  • 30:00 - 30:07
    except for generative parts, which, I
    believe, will change the way we model.
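As a minimal, hedged illustration of the statistical-tool side mentioned above — finding correlations in data — here is a Pearson correlation coefficient in plain Python (the sample values are invented for the example; a real analysis would use ROOT's or NumPy's optimized routines):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related samples correlate to 1.0:
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # -> 1.0
```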
  • 30:10 - 30:16
    Now, based on what I just described, this
    is not a big surprise anymore. As we need
  • 30:16 - 30:21
    throughput, we need to have a language for
    the core analysis part, that is close to
  • 30:21 - 30:27
    metal, so something like C++.
On the other hand, writing analyses is
  • 30:27 - 30:32
    still complex, so you need a higher-level
    language and for that people could, for
  • 30:32 - 30:36
    example, use Python. So, now language
    binding becomes relevant all of a sudden.
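To give a self-contained flavor of what language binding means here (ROOT's actual binding is PyROOT, which is not sketched below), this snippet calls a compiled C routine from Python via the standard ctypes module. It assumes a Linux system where the C math library is loadable under the name "libm.so.6" — that library name is an assumption and is platform-specific:

```python
import ctypes

# Load the C math library and describe the C signature of cos():
# double cos(double). Without argtypes/restype, ctypes would default
# to int and silently return garbage.
libm = ctypes.CDLL("libm.so.6")  # Linux-specific library name
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # -> 1.0
```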
  • 30:36 - 30:42
It will be much more important in the future.
    And we need to tailor I/O to the actual
  • 30:42 - 30:49
    analysis to not waste CPU cycles. So
throughput is king and, from my point of
view, in the future we will see much
    view, also in the future we will see much
    more effort in increasing the throughput.
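A toy sketch of tailoring I/O to the analysis (this is not ROOT's actual on-disk format; the event fields and numbers are invented): storing each quantity as a contiguous column lets an analysis that only needs one quantity touch a fraction of the bytes a row-wise scan would.

```python
import struct

# Toy "events": (pt, eta, phi) per collision. Row-wise storage writes
# each event contiguously; column-wise storage groups each quantity.
events = [(10.0, 0.5, 1.0), (20.0, -1.2, 2.0), (15.0, 0.1, -2.5)]

row_blob = b"".join(struct.pack("<3d", *e) for e in events)
col_blobs = {
    name: struct.pack(f"<{len(events)}d", *(e[i] for e in events))
    for i, name in enumerate(("pt", "eta", "phi"))
}

# An analysis that only needs pt unpacks just the pt column,
# reading 1/3 of the bytes a row-wise scan would touch:
pts = struct.unpack(f"<{len(events)}d", col_blobs["pt"])
print(sum(pts) / len(pts))  # -> 15.0
```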
  • 30:56 - 31:03
    Okay, so that was it. In case you want to
    discuss anything with me, like "That's
  • 31:03 - 31:08
just wrong!", that's fine. I probably
have several bugs in there. I'm still here
  • 31:08 - 31:13
    until tomorrow. I don't know where yet,
    so I'll wander around and you can contact
  • 31:13 - 31:17
    me by email or Twitter. Thank you very
    much for your attention. Thank you.
  • 31:17 - 31:21
    applause
  • 31:21 - 31:28
    music
  • 31:28 - 31:45
    subtitles created by c3subtitles.de
    in the year 2017. Join, and help us!
Title:
How physicists analyze massive data: LHC + brain + ROOT = Higgs (33c3)
Description:

https://media.ccc.de/v/33c3-8083-how_physicists_analyze_massive_data_lhc_brain_root_higgs

Physicists are not computer scientists. But at CERN and worldwide, they need to analyze petabytes of data, efficiently. Since more than 20 years now, ROOT helps them with interactive development of analysis algorithms (in the context of the experiments' multi-gigabyte software libraries), serialization of virtually any C++ object, fast statistical and general math tools, and high quality graphics for publications. I.e. ROOT helps physicists transform data into knowledge.

The presentation will introduce the life of data, the role of computing for physicists and how physicists analyze data with ROOT. It will sketch out how some of us foresee the development of data analysis given that the rest of the world all of a sudden also has big data tools: where they fit, where they don't, and what's missing.

Speaker: Axel Naumann

Video Language:
English
Duration:
31:45
