  1. So we have two approaches here: thread per row and thread per element. Which is better?
  2. The two might perform differently on matrices where every row has a similar number of elements,
  3. and differently again on matrices where the number of elements
  4. per row varies, or even varies wildly.
  5. So which approach is comparatively better on each of these kinds of matrices?
  6. I'd like you to fill in a couple of checkboxes.
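
To make the two mappings concrete before answering, here is a minimal sketch in Python that models them sequentially. It is an illustration under assumptions, not the lecture's implementation: it assumes the matrix is stored in compressed sparse row (CSR) form, the function names and the `work` counters are hypothetical, and the final per-row sum in the thread-per-element version stands in for whatever parallel reduction a real GPU version would use. Each loop iteration marked "one thread" represents what a single GPU thread would do.

```python
def spmv_thread_per_row(values, col_idx, row_ptr, x):
    """Model of thread-per-row: one simulated thread handles a whole row."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    work = [0] * num_rows  # multiply-adds performed by each simulated thread
    for row in range(num_rows):  # one thread per row
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
            work[row] += 1
    return y, work


def spmv_thread_per_element(values, col_idx, row_ptr, x):
    """Model of thread-per-element: one simulated thread per nonzero."""
    num_rows = len(row_ptr) - 1
    # Expand row_ptr so that each nonzero knows which row it belongs to.
    row_of = []
    for row in range(num_rows):
        row_of.extend([row] * (row_ptr[row + 1] - row_ptr[row]))
    # One thread per nonzero: each computes exactly one product.
    products = [values[k] * x[col_idx[k]] for k in range(len(values))]
    work = [1] * len(values)
    # Combine the products row by row (stand-in for a parallel reduction).
    y = [0.0] * num_rows
    for k, p in enumerate(products):
        y[row_of[k]] += p
    return y, work


# Example: a 3x4 matrix whose rows hold 1, 4, and 1 nonzeros.
#   [1 0 0 0]
#   [2 3 4 5]
#   [0 0 6 0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
col_idx = [0, 0, 1, 2, 3, 2]
row_ptr = [0, 1, 5, 6]
x = [1.0, 1.0, 1.0, 1.0]

y_row, work_row = spmv_thread_per_row(values, col_idx, row_ptr, x)
y_elem, work_elem = spmv_thread_per_element(values, col_idx, row_ptr, x)
```

Both versions produce the same result vector; what differs is how the work is spread across threads. Comparing `work_row` with `work_elem` for matrices with even and uneven rows is one way to reason about the question above.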