AI: How Do GPUs Work?

This article is from the WeChat official account "科文路". You are welcome to follow and interact. Please credit the source when reposting.

Following the earlier posts AI: The von Neumann Bottleneck and AI: How Do Neural Networks Work?, this post introduces how a GPU works.

This post translates part of What makes TPUs fine-tuned for deep learning? | Google Cloud Blog.

To get higher throughput than a CPU, the GPU uses a simple strategy: why not put a few thousand ALUs into a single processor? A modern GPU usually packs 2,500–5,000 ALUs into one processor, which means it can execute thousands of multiplications and additions simultaneously.
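
To make this concrete, here is a minimal CUDA sketch (the kernel name `muladd` and the launch configuration are my own choices for illustration, not from the quoted article). Each thread performs a single multiply-add, and we launch far more threads than the hardware has ALUs, so thousands of multiply-adds really do execute at the same time:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one y[i] = a[i] * x[i] + b[i].
// With thousands of threads in flight, thousands of ALUs
// execute these multiply-adds simultaneously.
__global__ void muladd(const float* a, const float* x, const float* b,
                       float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a[i] * x[i] + b[i];  // one multiply-add per thread
}

int main() {
    const int n = 1 << 20;  // 1M elements
    float *a, *x, *b, *y;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 2.0f; x[i] = 1.0f; b[i] = 0.5f; }

    // 4096 blocks of 256 threads: far more threads than ALUs,
    // which keeps the arithmetic units saturated.
    muladd<<<(n + 255) / 256, 256>>>(a, x, b, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);  // expect 2.500000

    cudaFree(a); cudaFree(x); cudaFree(b); cudaFree(y);
    return 0;
}
```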

This GPU architecture works well for applications with massive parallelism, such as the matrix multiplications in a neural network. In fact, on a typical deep learning training workload, a GPU delivers an order of magnitude higher throughput than a CPU. That is why, at the time of writing, the GPU is the most popular processor architecture for deep learning.
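
The matrix multiplication case maps onto the hardware especially well. Below is a sketch of a naive CUDA matrix multiply (kernel name, matrix size, and launch shape are mine, for illustration): one thread per output element, giving N² completely independent dot products to spread across the ALUs:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Naive N x N matrix multiply: one thread per output element.
// Every C[row][col] is an independent dot product, so the whole
// computation is massively parallel.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];  // multiply-add
        C[row * N + col] = acc;
    }
}

int main() {
    const int N = 512;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * N * sizeof(float));
    cudaMallocManaged(&B, N * N * sizeof(float));
    cudaMallocManaged(&C, N * N * sizeof(float));
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(16, 16);                       // 256 threads per block
    dim3 grid((N + 15) / 16, (N + 15) / 16);  // one thread per C element
    matmul_naive<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0f * N);  // 1024.0

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```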

However, the GPU is still a "general-purpose" processor, because it has to support a wide variety of applications and software. That brings us back to our fundamental problem: the von Neumann bottleneck. For every single calculation across those thousands of ALUs, the GPU must access registers or shared memory to read operands and store intermediate results. And because the GPU runs so many calculations in parallel on its thousands of ALUs, it spends proportionally more energy on memory accesses, while the complex wiring also enlarges the GPU's footprint.
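
The inner loop above already hints at this: every multiply-add first reads its operands from memory. A textbook mitigation is to stage tiles in on-chip shared memory, as in the sketch below (a standard tiled kernel, not from the quoted article; it assumes N is a multiple of TILE and reuses the host setup from the previous example). Yet even then, each multiply-add still performs two shared-memory reads and a register accumulation, exactly the per-operation memory traffic the bottleneck describes:

```cpp
#define TILE 16

// Tiled variant of the multiply above (assumes N % TILE == 0; launch with
// dim3 block(TILE, TILE) and grid(N / TILE, N / TILE)). Tiles of A and B
// are staged in on-chip shared memory, but each multiply-add still reads
// two operands from shared memory and accumulates into a register.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];   // staged operands live in shared memory
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // intermediate result held in a register

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread copies one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // 2 reads per FMA
        __syncthreads();
    }
    C[row * N + col] = acc;            // write the result back once
}
```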

Since you've read this far, why not follow "科文路" for the daily posts and join the conversation~

At least leave a like before you go~

Author: xlindo
Posted on: 2022-07-14
Updated on: 2023-05-10
