June 26, 2017 | Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Geib, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Haggman, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacey, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon
This paper evaluates a custom ASIC, the Tensor Processing Unit (TPU), deployed in datacenters since 2015 to accelerate neural network (NN) inference. The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit with a peak throughput of 92 TOPS and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of NN applications than the time-varying optimizations of CPUs and GPUs, and the absence of those features helps explain why, despite its many MACs and big memory, the TPU is relatively small and low power. The TPU is compared to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are deployed in the same datacenters. The TPU is on average 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
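As a sanity check on the headline figure, the 92 TOPS peak follows directly from the MAC count and the clock rate. The minimal sketch below assumes the TPU's 700 MHz clock (reported in the full paper, not stated in this summary) and the usual convention of counting each multiply-accumulate as two operations.

```python
# Back-of-the-envelope check of the TPU's quoted 92 TOPS peak.
# Assumption: 700 MHz clock (from the full paper, not stated in this summary);
# each 8-bit MAC counts as 2 operations (one multiply + one add).
macs = 256 * 256        # 65,536 MACs in the matrix multiply unit
clock_hz = 700e6        # assumed clock rate
ops_per_mac = 2

peak_tops = macs * clock_hz * ops_per_mac / 1e12
print(f"peak ~ {peak_tops:.1f} TOPS")  # ~91.8 TOPS, quoted as 92 TOPS
```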
The TPU is designed as a coprocessor on the PCIe I/O bus, allowing it to plug into existing servers just as a GPU does; in spirit it is closer to an FPU coprocessor than to a GPU. The host sends TPU instructions over the PCIe Gen3 x16 bus into an instruction buffer, and the internal blocks are typically connected by 256-byte-wide paths. The Matrix Multiply Unit is the heart of the TPU: it contains 256x256 MACs that perform 8-bit multiply-and-adds on signed or unsigned integers, and the 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit. The matrix unit produces one 256-element partial sum per clock cycle.

The TPU's software stack is compatible with those developed for CPUs and GPUs, so applications can be ported to it quickly. Measured against its contemporary CPUs and GPUs, the TPU is significantly faster and more energy-efficient. Its roofline has a long "slanted" region, where performance is limited by memory bandwidth rather than by peak compute (attainable throughput is the minimum of peak compute and memory bandwidth times operational intensity). Five of the six applications are memory bound; the CNNs are compute bound. The TPU's performance/Watt is 30X-80X that of contemporary products, and with the K80's GDDR5 memory it would be 70X-200X better. The TPU also runs much closer to its peak throughput than CPUs and GPUs do, and it has none of the sophisticated microarchitectural features that consume transistors and energy to improve the average case but not the 99th-percentile case.
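To make the matrix unit's dataflow concrete, the sketch below models one 256x256 tile of signed 8-bit weights being streamed against 8-bit activation vectors, producing one 256-element partial sum per simulated cycle and adding it into 32-bit accumulators. This is an illustrative functional model in plain NumPy (no systolic timing); the tile size, int8 inputs, and int32 accumulation mirror the description above, while names such as `matrix_unit_cycle` are invented for the example.

```python
import numpy as np

TILE = 256  # the matrix unit is a 256x256 grid of 8-bit MACs

def matrix_unit_cycle(weights_i8, activation_i8, accumulators_i32, acc_row):
    """Functional model of one clock cycle of the matrix unit.

    One 256-element vector of 8-bit activations enters the unit; the unit
    produces one 256-element partial sum, which is added into a row of the
    32-bit accumulators (the real hardware keeps 4 MiB of these).
    """
    # 8-bit x 8-bit products fit in 16 bits; summing 256 of them fits
    # comfortably in 32 bits, so accumulate in int32.
    partial_sum = weights_i8.astype(np.int32).T @ activation_i8.astype(np.int32)
    accumulators_i32[acc_row] += partial_sum
    return partial_sum

# Toy usage: stream a batch of 8 activation vectors through one weight tile.
rng = np.random.default_rng(0)
weights = rng.integers(-128, 128, size=(TILE, TILE), dtype=np.int8)
activations = rng.integers(-128, 128, size=(8, TILE), dtype=np.int8)
accumulators = np.zeros((8, TILE), dtype=np.int32)

for cycle, row in enumerate(activations):
    matrix_unit_cycle(weights, row, accumulators, acc_row=cycle)

# Check against a straightforward int32 matrix multiply.
assert np.array_equal(accumulators,
                      activations.astype(np.int32) @ weights.astype(np.int32))
```

Accumulating in 32 bits is what lets many 8-bit multiply-adds be chained without overflow, which is why the accumulators sit directly below the matrix unit in the design.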