21 Feb 2024 | Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu
This paper presents an extensive benchmarking study of the Nvidia Hopper GPU architecture, focusing on its microarchitectural characteristics, including the new tensor cores with FP8 support, DPX instructions, and distributed shared memory. Hopper is compared with the previous Ampere and Ada generations in terms of instruction latency and throughput, as well as the performance of key features such as tensor cores, asynchronous execution, and distributed shared memory. The results highlight Hopper's advantages in memory bandwidth and tensor core performance, which align with Nvidia's official claims, and show that low-precision data types offer greater benefits for large-scale operations. An analysis of Hopper's key features (DPX, asynchronous data movement, and distributed shared memory) provides insight into their performance and programmability: DPX and asynchronous data movement significantly enhance performance, while the tensor cores reach their full potential only through the new wgmma instructions. An energy-efficiency evaluation further shows that Hopper is significantly more efficient than previous generations. Overall, the study deepens the understanding of the Hopper architecture's characteristics and performance, supporting optimized algorithm design, improved application performance, and further advancements in GPU computing.