So which approach will launch more threads: thread per row or thread per element?
The answer is thread per element.
Thread per element will launch a thread for every X in this matrix,
but thread per row, on the other hand, will only launch a thread for every row in this matrix.
So thread per element will certainly launch at least as many threads as thread per row, and usually many more.
Which approach will require communication between threads?
Well, that's also going to be thread per element.
What's going to happen there is
we compute a partial product for each element in a particular row
and then we need to add up those partial products across those elements to create the final value.
And if you remember, we used the segmented scan operation to do that in Unit 4.
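To make the thread-per-element idea concrete, here is a minimal sequential Python sketch. It assumes the matrix is stored in CSR (compressed sparse row) form, which the lecture passage does not specify; the names `values`, `col_idx`, `row_ptr`, and `spmv_thread_per_element` are made up for illustration, and an ordinary loop stands in for the parallel segmented scan.

```python
def spmv_thread_per_element(values, col_idx, row_ptr, x):
    # Step 1: one "thread" per stored element computes one partial product.
    partial = [values[k] * x[col_idx[k]] for k in range(len(values))]

    # Step 2: combine the partial products that belong to the same row.
    # row_ptr[r] .. row_ptr[r + 1] delimits row r's segment; this sequential
    # sum stands in for the segmented scan a GPU would run across threads.
    y = []
    for r in range(len(row_ptr) - 1):
        y.append(sum(partial[row_ptr[r]:row_ptr[r + 1]]))
    return partial, y


# Hypothetical 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 5, 6]] in assumed CSR form.
values = [1, 2, 3, 4, 5, 6]
col_idx = [0, 2, 1, 0, 1, 2]
row_ptr = [0, 2, 3, 6]
x = [1, 1, 2]

partial, y = spmv_thread_per_element(values, col_idx, row_ptr, x)
```

Step 2 is where the communication happens: no single thread holds all the partial products for a row, so they have to be combined across threads.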
The final question, which approach will do more work per thread?
Well, that's going to be thread per row.
Each thread has to compute all the partial products for its row and add them up,
whereas in thread per element each thread only does 1 partial product
and then combines the work across multiple threads to be able to get a final result.
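For comparison, the thread-per-row approach might look like the following sequential Python sketch. It again assumes a CSR layout, and `spmv_thread_per_row` and the `work` list are hypothetical names used only to count how many multiply-adds each simulated thread performs.

```python
def spmv_thread_per_row(values, col_idx, row_ptr, x):
    y, work = [], []
    for r in range(len(row_ptr) - 1):  # one "thread" per row
        acc = 0
        # This thread computes every partial product in its row and sums them,
        # so it never needs to talk to another thread.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
        work.append(row_ptr[r + 1] - row_ptr[r])  # multiply-adds done by this thread
    return y, work


# Hypothetical 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 5, 6]] in assumed CSR form.
values = [1, 2, 3, 4, 5, 6]
col_idx = [0, 2, 1, 0, 1, 2]
row_ptr = [0, 2, 3, 6]
x = [1, 1, 2]

y, work = spmv_thread_per_row(values, col_idx, row_ptr, x)
threads_per_row = len(row_ptr) - 1    # one thread for each row
threads_per_element = len(values)     # one thread for each stored element
```

In this small example, thread per row uses 3 threads doing 2, 1, and 3 multiply-adds each, while thread per element would use 6 threads doing one multiply each.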