This paper provides a comprehensive benchmarking study of the latest Nvidia Hopper GPU architecture, focusing on its microarchitectural intricacies and performance characteristics. The study aims to unveil the novel attributes of the Hopper GPU, such as the new Tensor Cores with FP8 support, the DPX instructions, and distributed shared memory, through an examination of the instruction-set architecture (ISA) and new CUDA APIs. The research involves two main aspects: conventional latency and throughput comparisons across the Hopper, Ada, and Ampere architectures, and a detailed analysis of the Hopper-specific features.
Key findings include:
1. **Tensor Cores**: The Hopper GPU introduces new warp-group-level Tensor Core instructions, the asynchronous `wgmma` family, which also support sparse matrix multiplication. The `wgmma` instructions achieve high throughput for large matrix sizes, but on smaller matrices shared-memory access latency prevents the sparse capability from being fully exploited (see the `wgmma` sketch after this list).
2. **DPX Instructions**: DPX instructions accelerate dynamic programming algorithms and deliver significant speedups on the H800 GPU relative to the other architectures. Not every DPX function is hardware-accelerated, however: some simple operations show no significant performance difference (see the Smith-Waterman sketch below).
3. **Asynchronous Data Movement**: The Hopper GPU strengthens the asynchronous execution mechanism with the Tensor Memory Accelerator (TMA), improving data-movement efficiency. The *AsyncPipe* implementation generally outperforms the *SyncShare* implementation at smaller block sizes, but its advantage diminishes as block sizes grow (see the pipeline sketch below).
4. **Distributed Shared Memory**: The Hopper GPU adds a direct SM-to-SM communication network that reduces data-transfer overhead by up to 7x. Performance depends on cluster and block sizes: larger blocks and more parallelizable instructions yield higher throughput (see the cluster sketch below).
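To make the `wgmma` discussion concrete, below is a minimal sketch of how a single asynchronous warp-group MMA (a 64x8 FP32 tile from FP16 inputs) is issued and drained at the PTX level. This is illustrative rather than the paper's benchmark code: the 64-bit shared-memory matrix descriptors `desc_a` and `desc_b` are assumed to be built elsewhere (e.g., by a library such as CUTLASS), the function is not a complete kernel, and it must be compiled with `-arch=sm_90a`.

```cuda
#include <cstdint>

// Issue one asynchronous warp-group MMA and wait for it to finish.
// Assumption: desc_a/desc_b describe FP16 tiles already staged in shared
// memory; d0..d3 are this thread's share of the 64x8 FP32 accumulator.
__device__ void wgmma_m64n8k16(float &d0, float &d1, float &d2, float &d3,
                               uint64_t desc_a, uint64_t desc_b) {
    asm volatile("wgmma.fence.sync.aligned;\n" ::: "memory");
    asm volatile(
        "{\n"
        ".reg .pred p;\n"
        "setp.ne.b32 p, %6, 0;\n"  // scale-d != 0 => accumulate into D
        "wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 "
        "{%0, %1, %2, %3}, %4, %5, p, 1, 1, 0, 0;\n"
        "}\n"
        : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
        : "l"(desc_a), "l"(desc_b), "r"(1));
    asm volatile("wgmma.commit_group.sync.aligned;\n" ::: "memory");
    // Drain all pending wgmma groups before d0..d3 may be consumed.
    asm volatile("wgmma.wait_group.sync.aligned 0;\n" ::: "memory");
}
```

The fence/commit/wait protocol is what makes `wgmma` asynchronous: a warp group can issue several MMAs and overlap them with other work before waiting on the results.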
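A canonical DPX beneficiary is the Smith-Waterman inner recurrence, whose cell update is a fused maximum of three terms clamped at zero. The sketch below is my own illustration, not the paper's benchmark; it uses the CUDA 12 intrinsic `__vimax3_s32_relu`, which computes `max(a, b, c, 0)` in a single hardware-accelerated instruction on Hopper and falls back to software emulation on older architectures.

```cuda
#include <cuda_runtime.h>

// Smith-Waterman cell update: H[i][j] = max(0, diag+score, up-gap, left-gap).
// On sm_90 this compiles to one DPX instruction; elsewhere the CUDA headers
// emulate it with ordinary integer max operations.
__device__ __forceinline__ int sw_cell(int diag, int up, int left,
                                       int score, int gap) {
    return __vimax3_s32_relu(diag + score, up - gap, left - gap);
}
```

This also illustrates the paper's caveat: a DPX intrinsic only pays off when it fuses several operations; a plain two-input `max` gains little from the new hardware.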
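The *AsyncPipe* pattern can be illustrated with the standard `cuda::pipeline` staging idiom, a plausible reading of the paper's implementation rather than its actual code; the kernel and names below are mine. The `cuda::memcpy_async` copy moves data from global to shared memory without staging it in registers, letting the thread do other work between commit and wait.

```cuda
#include <cuda/pipeline>

// Single-stage pipeline sketch: stage one element per thread into shared
// memory asynchronously, then compute on it. Launch with
//   async_scale<<<grid, block, block * sizeof(float)>>>(in, out, alpha);
__global__ void async_scale(const float *in, float *out, float alpha) {
    extern __shared__ float tile[];
    const unsigned i = blockIdx.x * blockDim.x + threadIdx.x;

    auto pipe = cuda::make_pipeline();  // thread-scope pipeline
    pipe.producer_acquire();
    // The copy bypasses the register file; on sm_80+ it maps to the
    // dedicated async-copy path, which Hopper extends with TMA for bulk
    // and tensor transfers.
    cuda::memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float), pipe);
    pipe.producer_commit();

    pipe.consumer_wait();  // block until the staged element has arrived
    out[i] = alpha * tile[threadIdx.x];
    pipe.consumer_release();
}
```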
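Distributed shared memory is programmed through thread-block clusters. The sketch below is illustrative (not the paper's benchmark, and it assumes compute capability 9.0): each block in a two-block cluster reads a word directly out of its partner block's shared memory via `map_shared_rank`, so the load travels over the SM-to-SM network instead of through global memory.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two-block cluster exchange. Launch with a grid that is a multiple of
// 2 blocks, e.g. dsm_exchange<<<2, 32>>>(d_out), compiled with -arch=sm_90.
__global__ void __cluster_dims__(2, 1, 1) dsm_exchange(int *out) {
    __shared__ int local;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) local = (int)cluster.block_rank();
    cluster.sync();  // publish every block's shared memory cluster-wide

    // Map the partner block's `local` into this block's address space.
    int *remote = cluster.map_shared_rank(&local, cluster.block_rank() ^ 1);
    if (threadIdx.x == 0) out[blockIdx.x] = *remote;

    cluster.sync();  // keep shared memory alive until remote reads finish
}
```

The trailing `cluster.sync()` matters: a block's shared memory must stay resident until every other block in the cluster has finished its remote accesses.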
The study concludes that the Hopper GPU offers significant advantages in memory bandwidth and Tensor Core performance, particularly for large-scale applications. The findings provide valuable guidance for optimizing GPU programs and accelerating AI applications.