Analyzing CUDA Workloads Using a Detailed GPU Simulator

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt
This paper presents a detailed analysis of CUDA workloads using a GPU simulator. The authors evaluate twelve non-graphics CUDA applications to understand the performance characteristics of GPU hardware. They describe a detailed GPU architecture and simulation infrastructure, and evaluate a range of architectural design choices, including interconnect topology, CTA (cooperative thread array) distribution, memory access coalescing, caching, and memory controller design. For these non-graphics workloads, they find that performance is more sensitive to interconnect bisection bandwidth than to interconnect latency, and that reducing the number of concurrently running threads can sometimes improve performance by reducing memory-system contention. The study also shows that certain applications benefit from memory request coalescing and cache optimizations. These results highlight the trade-offs among architectural design choices when optimizing GPU applications, and the authors conclude that their findings provide useful guidance for future architecture and software research.
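The memory request coalescing the paper evaluates merges accesses from threads in the same half-warp that fall in the same aligned memory segment into a single transaction. A minimal sketch of this idea (a simplified model for illustration, not the simulator's actual logic; the function name and 64-byte segment size are assumptions):

```python
def num_transactions(addresses, segment_size=64):
    """Count memory transactions for one half-warp under a simplified
    coalescing model: each distinct aligned segment touched by any
    thread's address costs one transaction."""
    return len({addr // segment_size for addr in addresses})

# 16 threads reading consecutive 4-byte words fit in one 64-byte segment:
coalesced = [0x1000 + 4 * t for t in range(16)]
print(num_transactions(coalesced))   # 1 transaction

# The same threads striding 64 bytes apart each touch a different segment:
scattered = [0x1000 + 64 * t for t in range(16)]
print(num_transactions(scattered))   # 16 transactions
```

This is why contiguous per-thread access patterns can make far better use of memory bandwidth than strided ones, which is consistent with the paper's observation that some applications benefit substantially from coalescing.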