1 00:00:00,200 --> 00:00:04,030 So let's start by looking at how we might implement this as simply as possible, 2 00:00:04,030 --> 00:00:09,399 and we're going to begin by thinking about this computation as working on an N by N matrix. 3 00:00:09,399 --> 00:00:12,662 The N bodies on the X-axis here are the source of each force, 4 00:00:12,662 --> 00:00:17,078 and the N bodies on the Y-axis here are the destination of each force. 5 00:00:17,078 --> 00:00:19,555 The source and destination objects are the same objects, of course. 6 00:00:19,555 --> 00:00:22,058 This is just how we're organizing the computation. 7 00:00:22,058 --> 00:00:24,740 So the matrix element at the position X,Y 8 00:00:24,740 --> 00:00:30,213 represents the force that source element X exerts on destination element Y, 9 00:00:30,213 --> 00:00:33,267 and what we want to do is calculate the total amount of force 10 00:00:33,267 --> 00:00:36,480 from all source objects on each destination object. 11 00:00:36,480 --> 00:00:40,150 This formulation of the problem exhibits a lot of parallelism. 12 00:00:40,150 --> 00:00:43,680 Each of the N-squared forces can be computed in parallel. 13 00:00:43,680 --> 00:00:48,983 We just load the parameters of the source and destination thread to do one particular computation, 14 00:00:48,983 --> 00:00:52,360 and compute the force, say, using gravitational attraction. 15 00:00:52,360 --> 00:00:54,864 This requires specifying parameters for each object. 16 00:00:54,864 --> 00:00:58,830 For gravity, this would be, say, the mass and the position, X, Y, Z. 17 00:00:58,830 --> 00:01:03,998 Then each object, say object Y, must add up all N forces that act on it. 18 00:01:03,998 --> 00:01:07,830 Then, we have to calculate N reductions of N forces also in parallel. 19 00:01:07,830 --> 00:01:12,835 Since we want to run this computation on tens to hundreds of thousands, maybe even millions of elements, 20 00:01:12,835 --> 00:01:15,721 we can see that we have a lot of parallelism to exploit. 21 00:01:15,721 --> 00:01:18,226 In this case, the computation would be very straightforward. 22 00:01:18,226 --> 00:01:23,286 We will now see that if we run it in this way, we would be making poor use of our memory bandwidth, 23 00:01:23,286 --> 00:01:29,130 so if we assume that all of our N-squared force computations running in parallel must go to global memory, 24 00:01:29,130 --> 00:01:32,682 how many times must we fetch each element as a function of N?