The paper presents performance results for dense linear algebra operations on recent NVIDIA GPUs, with significant improvements over the vendor's implementations. The authors focus on matrix-matrix multiplication (GEMM) and on the LU, QR, and Cholesky factorizations, running up to 60% faster than the vendor library and approaching the hardware's peak rates. They exploit the GPU's multithreaded, multicore vector capabilities, programming it much as one would a classic vector computer, and optimize memory access patterns accordingly. Detailed microbenchmarks expose bottlenecks such as on-chip memory access latency and kernel launch overhead. The study applies algorithmic optimizations, notably blocking and autotuning, to increase parallelism and regularity; a sketch of the blocking idea appears below. The best speedups over a quad-core CPU exceed 4x for all three factorizations. The paper also covers the GPU architecture, the microbenchmark suite, and the implementation details of the matrix-matrix multiply routine, offering insight into the GPU's memory system and performance characteristics.
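
To make the blocking strategy concrete, here is a minimal CUDA sketch of a shared-memory-tiled SGEMM. This is not the paper's tuned kernel: the tile size TILE, the kernel name sgemm_tiled, and the assumption that the matrix dimension n is a multiple of TILE are illustrative simplifications. The paper's actual routine goes further (for example, blocking in registers as well), so this only shows the general principle of staging tiles through fast on-chip memory to reduce global-memory traffic.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile size; a tuned kernel would pick this by autotuning

// Minimal sketch of a shared-memory-blocked SGEMM: C = A * B for square n x n
// matrices in row-major order, assuming n is a multiple of TILE (a simplification).
// Each thread block computes one TILE x TILE tile of C, staging tiles of A and B
// through on-chip shared memory so each global element is read n/TILE times fewer.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // global row of C owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // global column of C owned by this thread
    float acc = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: each thread brings in one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // tiles fully loaded before anyone reads them

        // Multiply the two on-chip tiles and accumulate into the register acc.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the tiles are overwritten
    }
    C[row * n + col] = acc;
}
```

A launch would use dim3 block(TILE, TILE) and dim3 grid(n / TILE, n / TILE). The design choice this illustrates is the one the summary attributes to the paper: restructuring the computation into blocks that fit in fast on-chip memory, trading a small amount of synchronization for a large reduction in off-chip bandwidth demand.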