Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt
This paper evaluates the performance of twelve non-graphics applications written in NVIDIA's CUDA programming model on a detailed microarchitecture performance simulator. The simulator executes NVIDIA's parallel thread execution (PTX) virtual instruction set, enabling analysis of a range of microarchitecture design choices. The study focuses on applications that fall short of peak GPU performance relative to CPU-only sequential versions. Key findings include:
1. **Performance Sensitivity**: The applications are more sensitive to interconnect bisection bandwidth than to interconnect latency.
2. **Thread Concurrency**: For certain applications, running fewer concurrent threads can improve performance by reducing contention in the memory system.
3. **Application Characteristics**: The analysis covers dynamic instruction mix, SIMD warp branch divergence properties (see the sketch after this list), and DRAM locality characteristics.
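Warp branch divergence is easy to picture in code. The following is a minimal hypothetical kernel, not taken from the paper; `divergent_add` is an illustrative name, and the even/odd test stands in for any data-dependent branch:

```cuda
// Hypothetical sketch of SIMD warp branch divergence: lanes within one
// 32-thread warp take different paths, so the hardware serializes the
// two sides of the branch rather than executing all lanes at once.
__global__ void divergent_add(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)        // data-dependent test splits each warp:
        out[i] = in[i] * 2;    // lanes with even inputs run this side first,
    else
        out[i] = in[i] + 1;    // then lanes with odd inputs run this side,
                               // lowering SIMD lane utilization on the branch
}
```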
The paper also explores GPU architectural design options, such as interconnect topology, CTA (cooperative thread array) distribution, memory access coalescing (sketched at the end of this summary), and caching, and provides insights into their impact on application performance. The results highlight the importance of tuning these aspects to improve the efficiency and performance of CUDA applications on GPUs.
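To make the coalescing design option concrete, here is a hypothetical pair of kernels, not from the paper, contrasting an access pattern the hardware can merge into a few wide memory transactions with a strided pattern that defeats merging; `coalesced_copy`, `strided_copy`, and the `stride` parameter are illustrative names:

```cuda
// Adjacent threads read adjacent words, so each warp's 32 loads can be
// coalesced into a small number of wide DRAM transactions.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Adjacent threads read words `stride` elements apart, so a warp's loads
// cannot be merged and may each cost a separate memory transaction.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```

On coalescing-sensitive hardware, the strided variant can run many times slower even though both kernels move the same number of elements, which is the kind of memory-system effect the paper's simulator is built to expose.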