So let's start by looking at how we might implement this as simply as possible,
and we're going to begin by thinking about this computation as working on an N by N matrix.
The N bodies on the X-axis here are the source of each force,
and the N bodies on the Y-axis here are the destination of each force.
The source and destination objects are the same objects, of course.
This is just how we're organizing the computation.
So the matrix element at the position X,Y
represents the force that source element X exerts on destination element Y,
and what we want to do is calculate the total amount of force
from all source objects on each destination object.
This formulation of the problem exhibits a lot of parallelism.
Each of the N-squared forces can be computed in parallel.
We just load the parameters of the source and destination thread to do one particular computation,
and compute the force, say, using gravitational attraction.
This requires specifying parameters for each object.
For gravity, this would be, say, the mass and the position, X, Y, Z.
Then each object, say object Y, must add up all N forces that act on it.
Then, we have to calculate N reductions of N forces also in parallel.
Since we want to run this computation on tens to hundreds of thousands, maybe even millions of elements,
we can see that we have a lot of parallelism to exploit.
In this case, the computation would be very straightforward.
We will now see that if we run it in this way, we would be making poor use of our memory bandwidth,
so if we assume that all of our N-squared force computations running in parallel must go to global memory,
how many times must we fetch each element as a function of N?
我们开始看看我们如何能尽可能简单地实现这个,
我们开始把这计算想成在NxN矩阵上工作。
在X轴上的N体是每个力的源,
在Y轴上的N体是每个力的目标。
源和终点对象当然是相同的对象。
这就是我们如何组织计算。
所以在位置X、Y的矩阵元素
代表源元素X运用在目标元素Y的力,
我们要做的是计算
来自每个目标对象的所有源对象的总力。
问题的公式化展示了很多并行度。
每个N平方的力可以并行计算。
我们刚加载源和目标线程的参数,进行一种特殊计算,
计算力,例如用万有引力。
这需要为每个对象指定参数。
对于重力,参数是,例如质量和位置X、Y、Z。
然后每个对象,例如对象Y,必须加和所有对它作用的力。
接着,我们得计算N个力的N个下降,也是并行的。
因为我们要在几万到几十万,可能甚至几百万个元素上运行,
我们可以看到我们有大量并行度可以开发。
在这种情况下,计算很简单。
你会看到如果我们以这种方式运行,那么我们没有很好地利用我们的内存带宽,
所以如果我们假设所有的N平方的力的并行计算必须到全局内存,
用N的函数表示,每个元素我们必须取多少次?