So to answer this, we want to look at these for loops
and decide how many times each warp is going to have to execute the for loop.
A warp has to run an iteration whenever at least one of its threads still needs to execute it.
So looking at this modulo expression, across the 32 threads of a single warp its value will vary from 0 to 31.
So there will be one thread in that warp for whom this modulo expression evaluates to 31.
And that means that the entire warp is going to go through the motions
of executing this bar function 31 times.
Now, some of those times some of the threads will be deactivated.
So on the very first iteration, thread 0 will not execute the bar function.
It will be deactivated because i, which is 0, is not less than 0.
And on the next iteration threads 0 and 1 will be deactivated, and so forth.
Ultimately, the total amount of time that the warp has to spend in this loop
depends on the largest number of iterations that any one thread has to execute.
Each warp will execute this loop 31 times.
This next loop, though, is different.
In this case the integer divide means that for threads 0 through 31
this expression will evaluate to 0, and therefore they're not going to execute the bar function at all.
And threads 32 through 63 will evaluate this expression to 1,
and they'll execute the loop once, and so forth.
So there will be one warp which executes the loop 0 times, one warp which executes it 1 time,
one that executes it 2 times and so forth, all the way up to a single warp
which executes it 31 times.
So averaged over all of the warps, the number of times a warp executes this loop is 15.5.
So now we know what we need to know to answer the question.
Clearly, the second loop will execute faster, and it will be about twice as fast
because, on average, the number of times that the expensive bar function gets evaluated, 15.5 per warp,
is half the 31 times that bar function gets evaluated per warp during the first loop.