This paper presents performance results for dense linear algebra using recent NVIDIA GPUs. The authors developed a matrix-matrix multiply routine (GEMM) that runs up to 60% faster than the vendor's implementation and approaches the peak hardware capabilities. Their LU, QR, and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Parallel LU on two GPUs achieves up to 540 Gflop/s. These results challenge the accepted view of GPU architecture and programming guidelines, arguing that modern GPUs should be viewed as multithreaded multicore vector units. The study includes detailed benchmarking of the GPU memory system, revealing cache and TLB sizes and latencies. Algorithmic optimizations increase parallelism and regularity, slightly improving performance.
The authors show that LU, QR, and Cholesky factorizations can achieve over 300 Gflop/s on a GPU, a significant result for dense linear algebra. They report performance levels on NVIDIA's 8-series GPUs that had not previously been achieved, and provide insights into programming these and newer GPUs, reaching performance in basic kernels such as matrix-matrix multiply that is up to 60% faster than the vendor's optimized library, CUBLAS 1.1. Some of their codes have been licensed by NVIDIA and included in CUBLAS 2.0.
The authors analyze the performance of their implementations, showing that all components of the final system run at nearly optimal rates. Their best speedups over a single quad-core CPU exceed 4× in all three factorizations. The paper is organized into sections on GPU architecture, microbenchmarks, the design and performance evaluation of matrix multiplication, and the design and performance evaluation of LU, QR, and Cholesky. The microbenchmarks reveal the structure of the GPU memory system, including cache and TLB sizes and latencies, as well as pipeline latency and memory bandwidth. The authors achieve 98% of the arithmetic peak in register-to-register multiply-and-add instructions, show that using shared memory can reduce throughput, and find that the optimal vector length for performance is 64 elements. The paper concludes that GPU-based dense linear algebra significantly outperforms CPU-based implementations for large matrices.
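The register-to-register multiply-and-add result above reflects a blocking strategy in which each thread accumulates a small tile of C in registers while streaming through A and B. The sketch below illustrates that idea in plain C with a hypothetical 2×2 tile (the paper's actual kernel is a CUDA kernel with larger, differently shaped tiles and shared-memory staging; nothing here is the authors' code):

```c
#include <assert.h>

/* C += A*B for n x n row-major matrices, accumulating a 2x2 tile of C
 * in scalar variables. Keeping the tile in registers for the whole
 * k-loop is what lets the multiply-and-add units run near peak: each
 * loaded element of A and B is reused across the tile. */
void gemm_blocked(int n, const float *A, const float *B, float *C) {
    assert(n % 2 == 0);                       /* sketch: no edge handling */
    for (int i = 0; i < n; i += 2) {
        for (int j = 0; j < n; j += 2) {
            float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < n; k++) {
                float a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                float b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0; c01 += a0 * b1;   /* multiply-and-add */
                c10 += a1 * b0; c11 += a1 * b1;
            }
            C[i * n + j]           += c00;
            C[i * n + j + 1]       += c01;
            C[(i + 1) * n + j]     += c10;
            C[(i + 1) * n + j + 1] += c11;
        }
    }
}
```

Each A and B element loaded is used twice here; with a larger register tile the reuse factor grows, which is why the GPU kernel can approach the hardware's arithmetic peak despite limited memory bandwidth.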