So, let's look at Little's Law for GPUs.
To recap, Little's Law states that the number of useful bytes in flight
is equal to the average latency of a memory transaction times the bandwidth.
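As a quick sanity check on what that product means, here's a sketch of the arithmetic. The 400 ns latency and 320 GB/s bandwidth are made-up illustrative numbers, not the specs of any particular GPU:

```python
# Little's Law for memory systems: bytes_in_flight = latency * bandwidth.
# Both figures below are hypothetical, for illustration only.
latency_s = 400e-9        # assumed 400 ns average memory-transaction latency
bandwidth_bps = 320e9     # assumed 320 GB/s sustained DRAM bandwidth

bytes_in_flight = latency_s * bandwidth_bps
print(bytes_in_flight)    # 128000.0 -> about 128 KB must be in flight
                          # at all times to saturate the bandwidth
```

The point is that keeping the memory system busy requires a large amount of data in transit at once, which no single thread can supply on its own.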
Now, what are some implications of this?
First of all, there's a minimum latency to take a signal or piece of data all the way from an SM
to somewhere on the DRAM, or to take information from the DRAM and pull it into an SM.
Okay, you can find the details for your particular GPU online,
but in general any DRAM transaction is going to take hundreds of clock cycles.
And by the way, this isn't a GPU thing. This is true of all modern processors.
A clock cycle on a modern chip takes half a nanosecond, for example, on a 2 gigahertz chip.
And even the speed of light--you know, light doesn't go very far in half a nanosecond.
And electricity is even slower, especially on the tiny wires that you find in computer chips.
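To put a number on "doesn't go very far," here's the back-of-the-envelope calculation for light in a vacuum over one clock cycle:

```python
# How far does light travel in one clock cycle of a 2 GHz chip?
speed_of_light_m_s = 3.0e8    # approximate speed of light in a vacuum
cycle_s = 0.5e-9              # half a nanosecond per cycle at 2 GHz

distance_m = speed_of_light_m_s * cycle_s
print(distance_m)             # 0.15 -> about 15 cm per clock cycle,
                              # and signals on chip wires move slower still
```

So even at the speed of light, a single cycle only covers about 15 centimeters, which is why a round trip off the chip to DRAM and back costs so many cycles.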
So to go from somewhere inside the GPU off the chip,
over a wire somewhere on the board into the DRAM, get a result,
go all the way back, hundreds and hundreds of clock cycles, many, many nanoseconds.
So this means that a thread that's trying to read or write global memory
is going to have to wait hundreds of clock cycles, time that it could otherwise
be spending doing actual computation.
And this, in turn, is why we have so many threads in flight.
We deal with this high latency of hundreds of clocks on
memory accesses by having many, many threads able to run at any one time,
so that after one thread requests a piece of data from global memory
or initiates a store to global memory, another thread can step in and do some computation.
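The same Little's Law arithmetic tells you roughly how much thread-level concurrency it takes to hide that latency. This is a sketch under assumed numbers (400-cycle latency, 32 bytes of DRAM bandwidth per clock, one 4-byte load per thread), not figures for any real GPU:

```python
# How many outstanding loads does it take to hide a hypothetical
# 400-cycle DRAM latency? All numbers are illustrative assumptions.
latency_cycles = 400     # assumed round-trip latency to DRAM and back
bytes_per_cycle = 32     # assumed sustained DRAM bandwidth per clock
bytes_per_load = 4       # each thread loads one 32-bit word

loads_in_flight = latency_cycles * bytes_per_cycle // bytes_per_load
print(loads_in_flight)   # 3200 -> thousands of loads must be
                         # outstanding, hence many threads in flight
```

Under these assumptions you'd need on the order of thousands of memory requests in flight simultaneously, which is exactly why GPUs run so many threads at once.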