AI: How Does a TPU Work?

This article comes from the WeChat public account "科文路". You are welcome to follow and interact. Please credit the source when reposting.

Following the earlier posts "AI: The von Neumann Bottleneck", "AI: How Do Neural Networks Work?", and "AI: How Do GPUs Work?", this post explains how a TPU works.

This post translates part of "What makes TPUs fine-tuned for deep learning?" from the Google Cloud Blog.

When Google designed the TPU, they built a DSA (domain-specific architecture). That is, instead of building a general-purpose processor, they designed it as a matrix processor specialized for neural network workloads. A TPU cannot run a word processor, control a rocket engine, or execute bank transactions, but it can handle the massive multiply and add operations of neural networks at astonishing speed, while consuming much less power and generating far less memory traffic.

The key enabler is a major reduction of the von Neumann bottleneck. Because the TPU's primary task is matrix computation, its hardware designers knew every concrete step of those calculations. They were therefore able to place thousands of multipliers and adders and wire them directly to one another, forming a large physical matrix of these arithmetic units. This is called a systolic array architecture. In the case of Cloud TPU v2, there are two 128 x 128 systolic arrays, aggregating 32,768 ALUs for 16-bit floating-point values in a single processor.
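As a quick sanity check on those figures, here is a minimal plain-Python sketch (not TPU code, just the arithmetic) showing how the quoted ALU count follows from the array dimensions:

```python
# Cloud TPU v2 figures quoted above: two 128 x 128 systolic arrays,
# each cell holding one ALU that operates on 16-bit floating-point values.
arrays_per_processor = 2
rows, cols = 128, 128

alus = arrays_per_processor * rows * cols
print(alus)  # 32768 -- matches the 32,768 ALUs cited above
```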

How the systolic array works

Let's look at how a systolic array executes neural network calculations.

First, the TPU loads the parameters from memory into the matrix processor.

Then, the TPU loads data from memory. As each multiplication is executed, its result is passed directly to the next multiplier, where it is summed at the same time. The final output is therefore the result of all the multiply-and-add operations between the data and the parameters.

Throughout this entire process of computation and data passing, no memory access is required at all.
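To make that data flow concrete, here is a small sketch in plain Python (a toy, sequential simulation of the dataflow idea, not actual TPU code or any real TPU API) of a weight-stationary systolic computation of y = W·x: the weights are loaded once and stay in place, the inputs stream through the cells, and each cell adds its product to the partial sum handed over by its neighbour, so no intermediate result is ever written back to memory.

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy weight-stationary systolic computation of y = W @ x.

    Each cell (i, j) permanently holds one weight W[i, j] (step 1:
    parameters are loaded into the array). Inputs then stream in
    (step 2): every cell multiplies its stationary weight by the
    incoming value and adds the product to the partial sum passed
    in from its neighbour, forwarding the result onward. Only the
    final sums leave the array.
    """
    n_rows, n_cols = W.shape
    y = np.zeros(n_rows)
    for i in range(n_rows):
        partial_sum = 0.0                        # enters the row at the edge
        for j in range(n_cols):
            product = W[i, j] * x[j]             # multiply in cell (i, j)
            partial_sum = partial_sum + product  # accumulate and pass on
        y[i] = partial_sum                       # only the result exits
    return y

# Tiny usage example with made-up numbers.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])
print(systolic_matvec(W, x))                      # [17. 39.]
print(np.allclose(systolic_matvec(W, x), W @ x))  # True
```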

This is why the TPU can achieve such high computational throughput on neural network calculations while consuming less power and generating less memory traffic.

For reference, the original English text from the Google Cloud Blog post follows.

When Google designed the TPU, we built a domain-specific architecture. That means, instead of designing a general purpose processor, we designed it as a matrix processor specialized for neural network workloads. TPUs can’t run word processors, control rocket engines, or execute bank transactions, but they can handle the massive multiplications and additions for neural networks, at blazingly fast speeds while consuming much less power and inside a smaller physical footprint.

The key enabler is a major reduction of the von Neumann bottleneck. Because the primary task for this processor is matrix processing, hardware designer of the TPU knew every calculation step to perform that operation. So they were able to place thousands of multipliers and adders and connect them to each other directly to form a large physical matrix of those operators. This is called systolic array architecture. In case of Cloud TPU v2, there are two systolic arrays of 128 x 128, aggregating 32,768 ALUs for 16 bit floating point values in a single processor.

Let’s see how a systolic array executes the neural network calculations. At first, TPU loads the parameters from memory into the matrix of multipliers and adders.

Then, the TPU loads data from memory. As each multiplication is executed, the result will be passed to next multipliers while taking summation at the same time. So the output will be the summation of all multiplication result between data and parameters. During the whole process of massive calculations and data passing, no memory access is required at all.

This is why the TPU can achieve a high computational throughput on neural network calculations with much less power consumption and smaller footprint.

Since you've read this far, why not follow "科文路" for its daily posts and join the conversation~

At least leave a like before you go~

Author: xlindo
Posted on: 2022-07-21
Updated on: 2023-05-10
