One of the questions I ask the students in the classes that I teach is,
"What are you going to do with 100 times more compute?"
And sometimes that's a really hard question for them.
There's a lot of head scratching, both over what we can do with a supercomputer that's 100 times more powerful
and over what you can do with something on your desk or in your pocket.
-Where do you see us going in this direction?
-Yeah, well, I have an insatiable appetite for FLOPs.
I would have no trouble using 100 or even 1,000 or even 10,000 times more compute.
A lot of what I do is designing computers,
and a lot of that involves prototyping and simulating new computer designs.
I'm always frustrated by how slowly those simulations run.
So if I could run RTL simulations of a new computer 100 times faster,
it would enable me to be much more productive in trying out new ideas for computer design.
Same for circuit simulations.
I spend a lot of time waiting for circuit simulation to converge.
If I could run it 100 times faster, I could run not just one simulation but whole parameter sweeps at once,
and do optimizations at the same time I'm simulating.
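A sweep-plus-optimization loop like that can be sketched in a few lines. The `simulate` function and its parameters here are hypothetical stand-ins for a real circuit simulation:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def simulate(params):
    width, voltage = params
    # Hypothetical stand-in for a slow circuit simulation:
    # returns some figure of merit (say, energy) for the design point.
    return width * voltage ** 2

# With 100x the compute, the whole sweep runs at once instead of
# one design point at a time.
grid = list(product([1, 2, 4, 8], [0.7, 0.8, 0.9, 1.0]))
with ThreadPoolExecutor() as pool:
    results = dict(zip(grid, pool.map(simulate, grid)))

# Optimize over the sweep as the simulations complete.
best = min(results, key=results.get)
```

With the whole grid in hand, the "optimization" is just a selection over the sweep results rather than a sequence of single runs.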
Another thing is, you look at the computers in your car.
I mean our Tegra processors are actually designed into lots of different automobiles,
including the Tesla Model S, the Motor Trend Car of the Year,
but also Audis and BMWs and all sorts of Fords have Tegras in them.
And the applications people are starting to use for these mobile processors in cars
involve having lots of computer vision to look at what people inside the car are doing,
look at what people outside of the car are doing.
And in many ways it makes your cars much safer by having the car aware of what's going on around it.
It can in many ways compensate for the driver not being completely alert
or perhaps texting or doing something they shouldn't be doing.
And in mobile devices I think there are a lot of compelling applications
in both computational photography and augmented reality.
If your mobile device is constantly aware of what's around you, it can be informing you.
Oh, I think you're hungry.
Here's a place that has gyros that I know you like
because I have your profile of your likes and dislikes. Maybe you should stop for lunch.
Or a block away is this guy who you really don't like.
Maybe you should turn right at this corner and avoid running into him.
In many ways, I think it sort of evolves toward your computing devices becoming your personal assistant.
I always liked Jarvis in the Iron Man movies.
I would like to have a device I can kind of talk to that is aware
of the environment around me and can be basically a brain amplifier for me.
It can sort of remember things that I forget
and tell me about things in my environment and basically assist me in going through my day,
on both a professional and a personal basis.
So one of the goals of the supercomputer industry is to get up to what they call exascale,
that is, they'd like to do 10^18 FLOPs per second.
Certainly, Nvidia is going to be interested in being in those computers. What are we going to use that for?
Well, I think first of all, there's nothing magical about exascale.
It's like, you know, when we first made
petascale machines just a few years ago,
it wasn't like breaking the sound barrier; nothing really qualitatively changed,
but it enabled better science. And there's always—
You look at the fidelity of the simulations we're able to do today,
say, simulating a more efficient engine for automobiles to improve gas mileage,
and we're making lots of approximations to fit them onto the supercomputers we have today.
As we get to higher fidelity by resolving grids more finely
and modeling effects like turbulence directly
rather than using macro models for them, we'll get more accurate simulations.
And that will enable a better understanding of combustion, of some of the biotech applications
like how proteins fold, and of various other climate—
-Climate modeling.
-Sure.
Climate as well.
Basically, as we get better computing capacity—
and it's not that you reach magic exascale and wonderful things happen,
but at every step along the way, we get better science,
we are able to design better products.
And computing is a big driver of both scientific understanding
and economic progress across the board.
And I think it's very important that we maintain that steady march forward,
and exascale is just one milestone along that march.
-And my understanding is that power is really an enormously crucial thing for them to get right
to enable exascale; we don't want machines
that are going to cost $2 million a month just to plug in.
-Right.
It's really an economic argument.
I mean if you really wanted an exascale machine today, you could build one.
You just have to write a really big check and locate it right next to the nuclear power plant,
the entire output of which it will consume.
But I think if there were some application so compelling
that people were willing to write the multi-billion-dollar check required to do that, you would do it.
I think that the real question of exascale is an economical exascale,
because in the total cost of ownership the power bill is a tremendous fraction.
So it's not actually an economical exascale machine
unless you can do it at a reasonable power level,
and the number that's been thrown out is 20 megawatts.
So that's $20 million a year.
Yeah, a $20 million a year power bill if you're paying roughly 10 cents a kilowatt-hour.
In fact, the bill usually winds up being a little bit higher than that,
because the cost of provisioning energy, amortized over, say, a 30-year lifetime of the facility,
is usually about equal to the annual bill for the energy.
There is also something called the PUE,
which is basically the efficiency of providing the energy.
Even for a very good installation today, it's maybe on the order of 1.1 to 1.2.
So you pay another, say, 20% to run the air conditioners and fans and things like that in the facility,
and that's basically energy you're consuming that isn't being consumed by the computer.
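The arithmetic behind those numbers works out as follows; the 10-cent kilowatt-hour and the PUE of 1.2 are the figures discussed here:

```python
power_mw = 20                  # exascale power budget, megawatts
hours_per_year = 365 * 24      # 8760 hours
cost_per_kwh = 0.10            # roughly 10 cents per kilowatt-hour

kwh_per_year = power_mw * 1000 * hours_per_year   # 175.2 million kWh
annual_bill = kwh_per_year * cost_per_kwh         # ~$17.5 million, i.e. ~$20M/year

# A PUE of 1.2 means ~20% more energy goes to cooling and power delivery
# than is consumed by the computer itself.
pue = 1.2
annual_bill_with_pue = annual_bill * pue          # ~$21 million
```

The provisioning cost mentioned above, amortized over the facility lifetime, would roughly double these annual figures again.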
But it's a big challenge for us to get from, say, Sandy Bridge today,
at about 1.5 nanojoules per instruction, down to exascale.
To do an exaFLOP you might have to do more than an exa-instruction per second,
but even if you take that as the number, at 20 megawatts
that's 20 picojoules per instruction.
And that's not just the processor; that's everything.
That's the memory system, the network, the I/O storage system.
It's the whole ball of wax.
So you maybe get 10 picojoules per instruction to actually use in the processor.
-And even Nvidia isn't quite close to that yet?
-Yeah, well, compared to Sandy Bridge that's a factor of 150 down,
and process isn't going to help you much.
So that's why conventional CPUs are not going to get there.
It's going to require a hybrid multi-core approach
with most of the work being done in a GPU like throughput processor to get there.
But even we have a ways to go.
We're probably off by close to an order of magnitude, and we might get a factor of three from process.
We need to be very clever to come up with the other factor of 3 or 4 that we need.
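The factors quoted here line up roughly as follows; treating the throughput-processor starting point as about an order of magnitude off the target is an assumption consistent with the discussion:

```python
sandy_bridge_pj = 1500     # ~1.5 nJ per instruction for a conventional CPU
target_pj = 10             # processor's share of the 20 pJ/instruction budget

total_gap = sandy_bridge_pj / target_pj   # 150x for a conventional CPU

# A throughput (GPU-style) processor starts much closer:
# within roughly an order of magnitude of the target.
gpu_gap = 10
process_scaling = 3                       # expected from process technology
remaining = gpu_gap / process_scaling     # ~3.3x must come from clever design
```

That is the factor of 3 or 4 that has to come from architectural cleverness rather than process.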
-Titan does have CPUs in it, yes?
-That's correct.
-So is there a vision where that won't even be the case?
-No, I think there are always pieces of the code
where you have a critical path: a piece of single-threaded code that you need to run very quickly.
And so you always need a latency-optimized processor around to do that.
But most of the work, it's kind of like a cache memory, where most of your accesses
are to this little memory that runs really fast, but you still need the capacity
of the big memory sitting behind it, right?
And so it's the same thing with throughput versus latency.
Most of your work is done in the throughput processors,
but when you do have a latency-critical piece, you run it on the latency-optimized processors.
And so you wind up getting the critical path performance of the CPU
with the bulk of the energy consumption of the GPU.
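The cache analogy can be made concrete with a toy Amdahl-style model; the fractions and relative speeds below are illustrative assumptions, not measured numbers:

```python
def hybrid_time(serial_frac, cpu_speed, gpu_speed):
    """Time for a workload split between a latency-optimized CPU
    (running the serial critical path) and throughput-optimized GPUs
    (running the parallel bulk). Speeds are relative work rates."""
    return serial_frac / cpu_speed + (1 - serial_frac) / gpu_speed

# Illustrative numbers: 5% of the work is latency-critical; the GPU
# does bulk work 10x faster, while a GPU core alone would run the
# serial path at only 0.2x the CPU's single-thread speed.
t_hybrid = hybrid_time(0.05, cpu_speed=1.0, gpu_speed=10.0)
t_gpu_only = hybrid_time(0.05, cpu_speed=0.2, gpu_speed=10.0)
```

In this toy model the hybrid machine wins because the CPU keeps the critical path short while the GPUs still absorb the bulk of the work, and hence the bulk of the energy.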
-And the bulk of the FLOPs in Titan is certainly going to the GPUs.
-Right.
The bulk of the FLOPs will be in the GPUs.