It's important to be able to reason about this the way that I just described to you, right?
So we sort of walked our way through.
We figured out what kind of bandwidth we were getting and what percentage of theoretical peak that was.
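Just to make that arithmetic concrete, here's a minimal sketch of it. The matrix size, the timing, the memory clock, and the bus width below are hypothetical numbers chosen for illustration, not measurements from this lecture, and the function names are mine. It assumes every element of an n-by-n matrix is read once and written once.

```python
def effective_bandwidth_gbs(n, bytes_per_element, elapsed_ms):
    """Achieved bandwidth in GB/s, assuming each element of an
    n x n matrix is read once and written once."""
    bytes_moved = 2 * n * n * bytes_per_element
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9


def peak_bandwidth_gbs(memory_clock_hz, bus_width_bits, transfers_per_clock=1):
    """Theoretical peak in GB/s: clock rate times bytes moved per clock.
    transfers_per_clock is an assumption; set it to match how your
    device reports its memory clock."""
    return memory_clock_hz * transfers_per_clock * (bus_width_bits / 8) / 1e9


# Hypothetical values, for illustration only.
achieved = effective_bandwidth_gbs(n=1024, bytes_per_element=4, elapsed_ms=0.67)
peak = peak_bandwidth_gbs(memory_clock_hz=2.5e9, bus_width_bits=128)
percent_of_peak = 100.0 * achieved / peak

print(f"achieved: {achieved:.1f} GB/s")
print(f"peak:     {peak:.1f} GB/s")
print(f"fraction: {percent_of_peak:.0f}% of theoretical peak")
```

If that last number comes out well under half, that's your cue to go looking at the memory access pattern.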
We saw that it was really quite low, and we asked: why would we be getting low bandwidth to global memory?
Well, the first thing you always look at there is coalescing.
And then we inspected the code and convinced ourselves that, yes,
there's bad coalescing happening when we write to the output matrix.
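If you want to convince yourself of that without a GPU in front of you, here's a rough sketch that just counts how many memory segments one warp touches. It assumes a simplified model: 32 threads per warp, 128-byte segments, 4-byte floats, and a row-major n-by-n matrix. Real hardware has more detail than this, and the function names are hypothetical.

```python
WARP_SIZE = 32        # threads in a warp
SEGMENT_BYTES = 128   # assumed size of one memory transaction
ELEM_BYTES = 4        # size of a float


def segments_touched(byte_addresses):
    """Number of distinct segments covered by a set of byte addresses."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def coalesced_write_addresses(n, row):
    """Adjacent threads write adjacent elements: out[row * n + t]."""
    return [(row * n + t) * ELEM_BYTES for t in range(WARP_SIZE)]


def strided_write_addresses(n, col):
    """Adjacent threads write n elements apart: out[t * n + col]."""
    return [(t * n + col) * ELEM_BYTES for t in range(WARP_SIZE)]


n = 1024
good = segments_touched(coalesced_write_addresses(n, row=0))
bad = segments_touched(strided_write_addresses(n, col=0))

print(f"coalesced write: {good} segment(s) per warp")
print(f"strided write:   {bad} segment(s) per warp")
```

Under this model the strided pattern touches one segment per thread, so the warp moves far more memory than it actually uses.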
But, I also want to make the point that you don't have to do this from scratch every time.
Right? Doing all these calculations is a little bit like rubbing two sticks together to start a fire;
it's good to know how, but there are tools to help you do this.
The tool that we're going to be using is called Nsight.
This is an NVIDIA product; there are also third-party products.
Maybe I'll give some links to those in supplementary material.
And if you're using Linux or a Mac like I'm using, then you'll be using the Nsight Eclipse Edition.
If you were using Windows, you'd be using the Nsight Visual Studio Edition.
These are integrated debuggers and profilers; they're full-blown development environments.
The part that we're going to use is called the NVIDIA Visual Profiler, or NVVP.
Let's fire that up now.