Google's Tensor Processing Unit (TPU) is the first custom accelerator Application-Specific Integrated Circuit (ASIC) for machine learning. It is designed specifically to deliver high performance and power efficiency when running TensorFlow, Google's open-source machine learning library.
Services such as Google Search, Street View, Google Photos and Google Translate all use Google’s Tensor Processing Unit to accelerate their neural network computations.
With the increasing use of machine learning and neural networks, the need for processing power outstripped the ability of conventional Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Neural networks require large amounts of computation, with millions of matrix multiplication operations. In 2013, Google estimated that the computational demands of neural networks could require it to double its number of data centers. The TPU project, led by Norm Jouppi, designed, verified, built and deployed the processor to data centers in just 15 months.
Google unveiled the TPU at its I/O conference in 2016. In 2018, the company made TPUs commercially available through its cloud computing service.
TPUs use an optimization technique called quantization, which converts 32-bit or 16-bit floating-point values into 8-bit integers while maintaining an appropriate level of accuracy. Quantization reduces the cost, memory usage, and hardware footprint of neural network predictions.
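To make the idea concrete, here is a minimal sketch of quantization in plain Python. It assumes a generic affine scheme (a scale factor plus an integer zero point) and is not the TPU's or TensorFlow's actual implementation. The function names `quantize` and `dequantize` and the sample weights are hypothetical, chosen only for illustration.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers with an affine (scale + zero-point) scheme.

    Illustrative only: an assumed, generic scheme, not the TPU's own method.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # The represented range must include 0.0 so that zero maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(qmin - lo / scale)
    # Round each value to the nearest integer step, then clamp to the valid range.
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [scale * (x - zero_point) for x in q]


weights = [-1.5, -0.2, 0.0, 0.7, 2.3]  # hypothetical float weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored in one byte instead of four, and the reconstruction error in this sketch stays within half a quantization step (`scale / 2`), which is the sense in which accuracy is preserved.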
Most CPUs follow the Reduced Instruction Set Computer (RISC) design style, which defines simple instructions commonly used by the majority of applications and executes them as fast as possible. TPUs instead use the Complex Instruction Set Computer (CISC) style, which focuses on high-level instructions that each carry out a more complex task. This allows software to translate API calls from TensorFlow graphs directly into TPU instructions.
Unlike CPUs, which are typically scalar processors, and GPUs, which are effectively vector processors, the TPU is built around a matrix processor that Google designed to process hundreds of thousands of operations in a single clock cycle.
To implement such a large-scale Matrix Multiplier Unit (MXU), TPUs feature an architecture drastically different from that of typical CPUs and GPUs, called a systolic array. Matrix multiplication reuses both inputs many times as part of producing the output. Each input value can be read once and then used for many different operations without being stored back to a register. Wires connect only spatially adjacent Arithmetic Logic Units (ALUs), which keeps them short and energy-efficient. The ALUs perform only multiplications and additions in fixed patterns, which simplifies their design.
The particular kind of systolic array in the MXU is optimized for power and area efficiency in performing matrix multiplications. It is not well suited to general-purpose computation: the design trades registers, control logic and operational flexibility for efficiency and much higher operation density.
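The data flow described above can be illustrated with a small cycle-by-cycle simulation in plain Python. This is a sketch under stated assumptions, not a model of the MXU itself: it assumes an output-stationary arrangement, in which each cell keeps its own running sum while operands stream past, and the actual MXU dataflow may differ. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells.

    Assumed output-stationary layout: rows of A stream in from the left edge,
    columns of B stream in from the top edge, and each cell only ever talks
    to its immediate neighbours.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]  # value each cell hands to its right neighbour
    b_reg = [[0] * m for _ in range(n)]  # value each cell hands to the cell below
    for t in range(n + m + k - 2):       # enough cycles for the last operands to meet
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # Edge cells read from memory once, skewed by row or column index;
                # interior cells reuse the value their neighbour held last cycle.
                if j == 0:
                    a_in = A[i][t - i] if 0 <= t - i < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    b_in = B[t - j][j] if 0 <= t - j < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # the only arithmetic: multiply and add
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Every element of `A` and `B` is fetched from memory exactly once, at the edge of the grid, and is then passed from neighbour to neighbour. This is the reuse that lets a systolic array avoid repeated register or memory accesses.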
Technical Lead for the TPU project
An in-depth look at Google's first Tensor Processing Unit (TPU)
Kaz Sato, Cliff Young, and David Patterson.
May 12th, 2017
Post-training quantization | TensorFlow Lite