As for too big: if P needs more threads than a single thread block is allowed to have,
then we can't use shared memory to share data among all P threads.
Shared memory is visible only within one block, and we would have to split that tile across multiple thread blocks.
Another consideration is making sure that we launch at least as many thread blocks as there are SMs,
or else some SMs will sit idle.
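
Both constraints can be checked on the host before launching anything. The following is a minimal sketch, not a definitive implementation: the helper `check_tile` and the `TileCheck` struct are hypothetical names introduced here, and it assumes a one-dimensional problem with one thread per work item and one tile per thread block. The limits are passed in as parameters; in a real program they would come from the `maxThreadsPerBlock` and `multiProcessorCount` fields that `cudaGetDeviceProperties` fills in.

```cpp
#include <cassert>

// Hypothetical result type: what a given tile size implies for the launch.
struct TileCheck {
    bool fits_in_block;   // can one block hold the whole tile (so shared memory works)?
    long long num_blocks; // number of thread blocks needed to cover the problem
    bool fills_all_sms;   // is there at least one block per SM?
};

// Hypothetical helper. Assumes one thread per work item and one tile per block.
//   total_threads         - total work items in the problem
//   threads_per_tile      - the tile size P, in threads
//   max_threads_per_block - device limit (e.g. cudaDeviceProp::maxThreadsPerBlock)
//   num_sms               - SM count (e.g. cudaDeviceProp::multiProcessorCount)
TileCheck check_tile(long long total_threads, long long threads_per_tile,
                     long long max_threads_per_block, long long num_sms) {
    assert(total_threads > 0 && threads_per_tile > 0);
    assert(max_threads_per_block > 0 && num_sms > 0);

    // Round up so a partial tile at the end still gets its own block.
    long long num_blocks = (total_threads + threads_per_tile - 1) / threads_per_tile;

    TileCheck result;
    result.fits_in_block = threads_per_tile <= max_threads_per_block;
    result.num_blocks = num_blocks;
    result.fills_all_sms = num_blocks >= num_sms;
    return result;
}
```

For example, assuming a block limit of 1024 threads and 80 SMs (illustrative numbers only), a tile of 256 threads over about a million work items fits in one block and produces far more blocks than SMs, while a tile of 2048 threads over 4096 work items fails both checks: it is too large for one block and yields only two blocks.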