0:00:00.230,0:00:04.320 In this problem you need to divide your work up into chunks; in this case, tiles.
0:00:04.320,0:00:08.013 We have a continuum between tiny tiles--lots of them,
0:00:08.013,0:00:13.784 and fewer tiles where each tile is sized to the maximum that can fit in a single thread block.
0:00:13.784,0:00:15.123 And in this particular problem,
0:00:15.123,0:00:18.019 bigger tiles mean less memory bandwidth; this is good.
0:00:18.019,0:00:22.397 Generally then you want to make your tiles as big as can fit into a single thread block,
0:00:22.397,0:00:24.970 because that minimizes overall memory bandwidth.
0:00:24.970,0:00:26.929 But note the following 2 caveats.
0:00:26.929,0:00:29.264 One, you need to have at least as many thread blocks
0:00:29.264,0:00:31.266 as you have SMs in your GPU,
0:00:31.266,0:00:33.437 because otherwise you'll have SMs sitting idle.
0:00:33.437,0:00:36.319 Definitely you want to make sure you fill the machine
0:00:36.319,0:00:38.387 with enough work to keep all the SMs busy,
0:00:38.387,0:00:41.368 even if you have to move a little bit this way on the continuum
0:00:41.368,0:00:43.982 and size your tiles just a little bit smaller.
0:00:43.982,0:00:46.720 Two, if you're sitting at the right end of this continuum,
0:00:46.720,0:00:48.592 it's best for overall memory bandwidth,
0:00:48.592,0:00:50.798 but often it turns out that you would actually prefer
0:00:50.798,0:00:52.969 to be just maybe 1 tick to the left.
0:00:52.969,0:00:54.897 This allows a small number,
0:00:54.897,0:00:57.670 say, 2 blocks, to both be resident at a time,
0:00:57.670,0:01:00.595 and that potentially gives better latency-hiding characteristics,
0:01:00.595,0:01:03.164 because you have more warps that may be in flight at the same time
0:01:03.164,0:01:05.677 from slightly different pieces of the program.
0:01:05.677,0:01:08.642 It's certainly something that you would want to tune carefully
0:01:08.642,0:01:12.249 if you needed the fastest possible implementation.
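
The tile-size reasoning above can be sketched as a small host-side calculation. This is only an illustration under stated assumptions, not the lecture's code: it assumes an n-by-n problem, square power-of-two tiles, one thread per tile element, and one tile per thread block. The function names and the hardware limits passed in (`max_threads_per_block`, `max_threads_per_sm`) are hypothetical parameters chosen for the example. Real residency also depends on register and shared-memory usage, which this sketch ignores.

```python
import math


def tile_candidates(max_threads_per_block):
    """Power-of-two tile widths T whose T*T threads fit in one thread block."""
    widths = []
    t = 1
    while t * t <= max_threads_per_block:
        widths.append(t)
        t *= 2
    return widths


def num_blocks(n, tile):
    """Thread blocks needed to cover an n-by-n problem with tile-by-tile tiles."""
    per_side = math.ceil(n / tile)
    return per_side * per_side


def pick_tile(n, num_sms, max_threads_per_block, max_threads_per_sm,
              resident_blocks=1):
    """Return the largest tile width that still satisfies both caveats.

    Caveat one: at least as many thread blocks as SMs.
    Caveat two (optional): `resident_blocks` blocks fit on one SM at once,
    judged here by thread count only (an assumption of this sketch).
    """
    for tile in sorted(tile_candidates(max_threads_per_block), reverse=True):
        enough_blocks = num_blocks(n, tile) >= num_sms
        fits_on_sm = resident_blocks * tile * tile <= max_threads_per_sm
        if enough_blocks and fits_on_sm:
            return tile
    # No candidate meets both constraints; fall back to the smallest tile.
    return 1
```

With hypothetical limits of 1024 threads per block and 1536 threads per SM, asking for two resident blocks moves the choice one step left on the continuum, from a 32-wide tile to a 16-wide tile. A small problem (n = 64 on 16 SMs) also forces a smaller tile, because 32-wide tiles would produce only 4 blocks. In practice the result of a calculation like this would be a starting point for tuning by measurement, not a final answer.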