Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
2009 | Nathan Bell, Michael Garland
The paper "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors" by Nathan Bell and Michael Garland from NVIDIA Research explores efficient methods for sparse matrix-vector multiplication (SpMV) on throughput-oriented processors, particularly GPUs. The authors aim to harness the high computational throughput of GPUs by exposing fine-grained parallelism and ensuring regular execution paths and memory access patterns. They propose several SpMV methods tailored for GPUs, including the DIA, ELL, CSR, COO, and HYB formats, each designed to handle different types of sparse matrices. The techniques are evaluated on a GeForce GTX 285 GPU, achieving high bandwidth utilization and excellent throughput, with results ranging from 16 GFLOP/s to 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively. The paper also discusses the performance of these methods on various structured and unstructured matrices, comparing them to optimized kernels on multicore platforms. The authors conclude that their methods offer significant performance improvements over existing solutions, highlighting the potential of GPUs for sparse linear algebra tasks.The paper "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors" by Nathan Bell and Michael Garland from NVIDIA Research explores efficient methods for sparse matrix-vector multiplication (SpMV) on throughput-oriented processors, particularly GPUs. The authors aim to harness the high computational throughput of GPUs by exposing fine-grained parallelism and ensuring regular execution paths and memory access patterns. They propose several SpMV methods tailored for GPUs, including the DIA, ELL, CSR, COO, and HYB formats, each designed to handle different types of sparse matrices. The techniques are evaluated on a GeForce GTX 285 GPU, achieving high bandwidth utilization and excellent throughput, with results ranging from 16 GFLOP/s to 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively. The paper also discusses the performance of these methods on various structured and unstructured matrices, comparing them to optimized kernels on multicore platforms. The authors conclude that their methods offer significant performance improvements over existing solutions, highlighting the potential of GPUs for sparse linear algebra tasks.