So let's start by looking at how we might implement this as simply as possible,
and we're going to begin by thinking about this computation as working on an N by N matrix.
The N bodies along the X-axis here are the sources of the forces,
and the N bodies along the Y-axis here are the destinations of the forces.
The source and destination objects are the same objects, of course.
This is just how we're organizing the computation.
So the matrix element at position (X, Y)
represents the force that source element X exerts on destination element Y,
and what we want to do is calculate the total amount of force
from all source objects on each destination object.
This formulation of the problem exhibits a lot of parallelism.
Each of the N-squared forces can be computed in parallel.
We just load the parameters of the source and destination objects into a thread to do one particular computation,
and compute the force, say, using gravitational attraction.
This requires specifying parameters for each object.
For gravity, this would be, say, the mass and the position, X, Y, Z.
Then each object, say object Y, must add up all N forces that act on it,
so we also have to calculate N reductions of N forces each, again in parallel.
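The N reductions can be sketched as follows; this serial C version just shows the shape of the work, with each outer iteration being one of the N independent reductions. The flattened `forces` layout, where `forces[i * n + j]` holds the force of source j on destination i, is an assumption for illustration.

```c
#include <stddef.h>

/* One reduction per destination object: sum the N forces acting on it.
   forces[i * n + j] = force that source j exerts on destination i
   (a hypothetical flattened layout of the N-by-N matrix). */
static void reduce_forces(const float *forces, float *total, size_t n) {
    for (size_t i = 0; i < n; ++i) {       /* N reductions... */
        float sum = 0.0f;
        for (size_t j = 0; j < n; ++j)     /* ...of N forces each */
            sum += forces[i * n + j];
        total[i] = sum;
    }
}
```

On a GPU, each of these row sums would itself typically be done as a parallel tree reduction rather than a serial loop.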
Since we want to run this computation on tens to hundreds of thousands, maybe even millions of elements,
we can see that we have a lot of parallelism to exploit.
Organized this way, the computation is very straightforward.
We will now see, though, that running it in this way makes poor use of our memory bandwidth,
so if we assume that all of our N-squared force computations running in parallel must go to global memory,
how many times must we fetch each element as a function of N?