-
Title:
SLAC and Big Data - Intro to Computer Science
-
Description:
-
We're here at SLAC National Accelerator Lab,
-
and we're going to see how they use computing to understand the mysteries of the universe.
-
[Spencer Gessner:] We're standing in the klystron gallery, formerly the longest building in the world.
-
[Richard Mount:] You're here at SLAC National Accelerator Laboratory.
-
This is a 50-year-old laboratory, as all the flags on the lampposts around the lab are telling you.
-
It was founded to build a 2-mile-long linear accelerator.
-
SLAC is an accelerator laboratory still.
-
Its main science is based on accelerating particles and creating new states of matter
-
or exploring the nature of matter with the accelerated particles.
-
This always has generated a lot of data, a lot of information.
-
It's very data-intensive experimental science.
-
From the earliest days of SLAC,
-
to analyze data has been a major part of the activity here.
-
You really can only study the cosmos by studying it in a computer.
-
You get one chance to look at it,
-
but to understand how it evolved into the state it's in now,
-
you have to do all this in the computer.
-
There are massive computations going on for that sort of simulation,
-
massive computations in catalysis and material science
-
and massive data analysis going on here as well.
-
The particular particle physics experiment
-
that I am involved in right now has some 300 petabytes of disk space--
-
some 300,000 terabytes, some 300 million gigabytes of disk space
-
around the world to do this analysis.
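-
To make those figures concrete, here is a quick unit check in Python (assuming decimal prefixes, and using only the numbers quoted above):
```python
PB = 10**15  # bytes, decimal prefix
TB = 10**12
GB = 10**9

disk = 300 * PB        # the roughly 300 petabytes quoted above
print(disk // TB)      # 300000 terabytes
print(disk // GB)      # 300000000 gigabytes, i.e. 300 million
```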
-
Of course, we are far from understanding everything about the universe,
-
but this is probably one of the most data-intensive activities in science today.
-
The raw data rate coming out of the ATLAS detector that I'm involved in
-
is about a petabyte a second.
-
That's 1 million gigabytes a second.
-
You can't store that with any budget known to man,
-
so most of it is inspected on the fly and reduced to a much smaller, but still large, storable amount of data.
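-
As a rough illustration of that on-the-fly inspection, here is a minimal sketch in Python. The event fields, the selection threshold, and the simulated stream are hypothetical stand-ins; real experiments use dedicated hardware and software trigger systems for this step.
```python
import random

def raw_event_stream():
    """Hypothetical stand-in for the raw detector readout."""
    while True:
        yield {"energy_gev": random.expovariate(1 / 50.0)}

def keep(event, threshold_gev=500.0):
    """Inspect an event on the fly; only unusually energetic ones survive.
    The threshold is illustrative, not a real trigger setting."""
    return event["energy_gev"] > threshold_gev

stored = []
for i, event in enumerate(raw_event_stream()):
    if keep(event):
        stored.append(event)        # the much smaller, storable fraction
    if i == 999_999:                # look at a bounded sample for this sketch
        break

print(f"stored {len(stored)} of 1,000,000 events")
```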
-
Right now we are sifting through these many, many petabytes of data
-
to look for signals of the Higgs boson, as no doubt people have heard in the news.
-
There are tantalizing hints that I'm not holding my breath about at all right now,
-
but this is the way we do it.
-
You need to have those vast amounts of data
-
just to pick out the things that will really revolutionize physics in there,
-
and you need to understand all of it in detail, because what you're looking for
-
is something slightly unusual compared with everything else.
-
If you don't understand everything else perfectly then you don't understand anything.
-
[Max Swiatlowski:] We're looking at one of the racks that contains
-
the ATLAS PROOF cluster at SLAC.
-
ATLAS is an experiment at the Large Hadron Collider in Geneva, Switzerland,
-
that collides protons, fundamental building blocks of nature,
-
traveling at very, very, very close to the speed of light
-
with trillions of times the energy that they have at room temperature.
-
You get many, many of these collisions happening at once,
-
and this enormous machine reads out trillions of data channels.
-
At the end of the day, you have this enormous amount of data--petabytes of data--
-
that you have to analyze, looking for very rare, very particular signatures inside of it.
-
If I want to look for a rare signature--something that had a lot of energy
-
and a lot of really strange particles at once--
-
there are trillions and trillions of these events stored on this machine.
-
To look for them in any reasonable amount of time,
-
I have to do many searches at once.
-
I have to use all the cores on the computers--
-
the hundreds of cores on the machine all running at full speed at the same time--
-
to have any hope of doing it in any reasonable amount of time.
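-
A minimal sketch of that kind of many-cores-at-once search, assuming Python's standard multiprocessing module and a made-up event sample (the field names and selection cuts are illustrative, not real ATLAS criteria):
```python
from multiprocessing import Pool
import random

def make_chunk(seed, n=100_000):
    """Hypothetical chunk of stored events; a real search reads recorded data."""
    rng = random.Random(seed)
    return [{"energy": rng.expovariate(1 / 100.0),
             "strange_particles": rng.randint(0, 6)} for _ in range(n)]

def search_chunk(seed):
    """Scan one chunk for the rare signature: lots of energy and several
    unusual particles at once (illustrative cuts)."""
    return sum(1 for e in make_chunk(seed)
               if e["energy"] > 800 and e["strange_particles"] >= 4)

if __name__ == "__main__":
    with Pool() as pool:                       # one worker per available core
        counts = pool.map(search_chunk, range(32))
    print("candidate events:", sum(counts))
```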
-
[Richard Mount:] This isn't the sort of thing that search engines currently do.
-
They're looking for text strings and indexing all the text strings that they find
-
in some way like this.
-
What we have is very, very structured.
-
We know the structure of these data.
-
We know exactly how to go to anything that we want to get to in these data,
-
because the way in which everything is linked together is very well understood.
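-
A toy illustration of the difference, assuming a much-simplified event structure (the field names and numbers are invented): because the layout is known, the analysis can jump straight to the quantity it needs instead of scanning text.
```python
# Invented, heavily simplified "structured" event record.
events = {
    42: {"tracks": [{"pt": 35.2, "charge": -1}, {"pt": 61.0, "charge": +1}],
         "missing_energy": 120.5},
    43: {"tracks": [{"pt": 12.1, "charge": +1}],
         "missing_energy": 8.3},
}

# Direct navigation: event number -> list of tracks -> momentum of a track.
# No text search or indexing is needed; the links between pieces are known.
pt_of_leading_track = events[42]["tracks"][0]["pt"]
print(pt_of_leading_track)   # 35.2
```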
-
Things will go wrong all the time.
-
You cannot assume you won't lose data from the disk.
-
You send it by network from one computer center to another.
-
You cannot assume it arrives undamaged.
-
You cannot assume your computers don't die in the middle of calculations.
-
Everything can go wrong, so the computing we do for the LHC
-
has many layers of error correction and retry.
-
Some of the basic failure rates are quite high,
-
but by the time everything has been fairly automatically retried
-
and errors have been corrected, we get high throughput and a high success rate.
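-
As a sketch of what one such layer of retry and verification might look like (the read/send callables and the checksum choice are placeholders, not the LHC computing grid's actual interfaces):
```python
import hashlib
import time

def transfer_with_retry(read_block, send_block, expected_sha256, max_attempts=5):
    """Keep retrying a transfer until the data survives disk, network, and
    checksum verification, or until we give up and escalate."""
    for attempt in range(1, max_attempts + 1):
        data = read_block()
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            continue                 # disk handed us damaged data; reread it
        if send_block(data):         # True means the far end's checksum matched
            return True
        time.sleep(2 ** attempt)     # back off before the next network retry
    return False                     # repeated failure: flag for human attention
```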