0:00:00.200,0:00:04.030
So let's start by looking at how we might implement this as simply as possible,
0:00:04.030,0:00:09.399
and we're going to begin by thinking about this computation as working on an N by N matrix.
0:00:09.399,0:00:12.662
The N bodies on the X-axis here are the source of each force,
0:00:12.662,0:00:17.078
and the N bodies on the Y-axis here are the destination of each force.
0:00:17.078,0:00:19.555
The source and destination objects are the same objects, of course.
0:00:19.555,0:00:22.058
This is just how we're organizing the computation.
0:00:22.058,0:00:24.740
So the matrix element at the position X,Y
0:00:24.740,0:00:30.213
represents the force that source element X exerts on destination element Y,
0:00:30.213,0:00:33.267
and what we want to do is calculate the total amount of force
0:00:33.267,0:00:36.480
from all source objects on each destination object.
0:00:36.480,0:00:40.150
This formulation of the problem exhibits a lot of parallelism.
0:00:40.150,0:00:43.680
Each of the N-squared forces can be computed in parallel.
0:00:43.680,0:00:48.983
We just load the parameters of the source and destination thread to do one particular computation,
0:00:48.983,0:00:52.360
and compute the force, say, using gravitational attraction.
0:00:52.360,0:00:54.864
This requires specifying parameters for each object.
0:00:54.864,0:00:58.830
For gravity, this would be, say, the mass and the position, X, Y, Z.
0:00:58.830,0:01:03.998
Then each object, say object Y, must add up all N forces that act on it.
0:01:03.998,0:01:07.830
Then, we have to calculate N reductions of N forces also in parallel.
0:01:07.830,0:01:12.835
Since we want to run this computation on tens to hundreds of thousands, maybe even millions of elements,
0:01:12.835,0:01:15.721
we can see that we have a lot of parallelism to exploit.
0:01:15.721,0:01:18.226
In this case, the computation would be very straightforward.
0:01:18.226,0:01:23.286
We will now see that if we run it in this way, we would be making poor use of our memory bandwidth,
0:01:23.286,0:01:29.130
so if we assume that all of our N-squared force computations running in parallel must go to global memory,
0:01:29.130,0:01:32.682
how many times must we fetch each element as a function of N?