18 Dec 2014 | Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer
cuDNN is a library of efficient implementations of deep learning primitives, designed to make deep learning workloads run well on parallel processors. These workloads are computationally intensive, and hand-optimizing their kernels is difficult and time-consuming. cuDNN provides optimized routines for them, similar in intent to BLAS, delivering high performance with low memory usage. It integrates easily into existing frameworks such as Caffe, improving performance while reducing memory consumption.
The library supports forward and backward propagation for convolution, pooling, and activation functions, with flexible data layouts and strides. It provides a low-level API for efficient integration with deep learning frameworks. cuDNN's convolution routines are optimized for performance, using minimal auxiliary memory and supporting various use cases, including small mini-batch sizes.
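To make the shape of that API concrete, here is a minimal sketch of a forward convolution call. The descriptor and launch functions are real cuDNN entry points, but the signatures shown follow a later (v7-era) release rather than the version the paper describes, and they have shifted across versions; error checking is omitted, and the wrapper name `conv_forward` is our own.

```cpp
#include <cudnn.h>

// Hedged sketch of a forward convolution with cuDNN (v7-era API; names
// and signatures differ across versions). Error checking elided.
void conv_forward(cudnnHandle_t handle,
                  const float* d_x, int n, int c, int h, int w,  // input, NCHW
                  const float* d_w, int k, int r, int s,         // filters, KCRS
                  float* d_y)                                    // output
{
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;

    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               k, c, r, s);

    cudnnCreateConvolutionDescriptor(&convDesc);
    // Zero padding, unit stride, unit dilation.
    cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Let cuDNN derive the output shape from the descriptors.
    int on, ok, oh, ow;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                          &on, &ok, &oh, &ow);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               on, ok, oh, ow);

    // y = alpha * conv(x, w) + beta * y. The implicit-GEMM algorithm
    // requires no workspace, matching the paper's low-memory design point.
    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, xDesc, d_x, wDesc, d_w, convDesc,
                            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                            /*workSpace=*/nullptr, /*workSpaceSizeInBytes=*/0,
                            &beta, yDesc, d_y);

    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
}
```

The same descriptor objects drive the corresponding backward-data and backward-filter routines, which is what lets a framework swap cuDNN in behind its existing layer abstractions.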
cuDNN computes convolutions by lowering them onto matrix multiplication, a primitive for which highly optimized GPU kernels already exist. Rather than materializing the large lowered input matrix in off-chip memory, it assembles tiles of that matrix lazily in on-chip memory as the multiplication proceeds, which reduces memory traffic and leaves only a small auxiliary footprint. The library is also designed to be portable across GPU architectures, providing consistent performance without requiring code re-tuning.
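For intuition about the lowering itself, the CPU-only sketch below makes the auxiliary matrix explicit: each column of the `lowered` buffer collects the C·R·S input values that one output position reads, and the convolution then collapses into a single matrix product. cuDNN computes the same product but never builds this buffer off-chip, instead streaming tiles of it through on-chip memory. This is an illustrative reimplementation, not cuDNN's kernel.

```cpp
#include <vector>

// Convolution lowered onto matrix multiplication (the classic im2col
// formulation). Single image, unit stride, no padding, for clarity.
void conv_as_gemm(const std::vector<float>& x, int C, int H, int W,  // input, CHW
                  const std::vector<float>& w, int K, int R, int S,  // filters, KCRS
                  std::vector<float>& y)                             // output, KPQ
{
    const int P = H - R + 1, Q = W - S + 1;  // output height/width

    // im2col: column (p,q) holds the C*R*S input values covered by the
    // receptive field of output position (p,q). cuDNN forms tiles of this
    // [C*R*S x P*Q] matrix lazily on-chip instead of materializing it.
    std::vector<float> lowered(C * R * S * P * Q);
    for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
                for (int p = 0; p < P; ++p)
                    for (int q = 0; q < Q; ++q)
                        lowered[((c * R + r) * S + s) * (P * Q) + p * Q + q] =
                            x[c * H * W + (p + r) * W + (q + s)];

    // GEMM: [K x C*R*S] * [C*R*S x P*Q] -> [K x P*Q]. This one large
    // multiplication is exactly what optimized GPU kernels excel at.
    y.assign(K * P * Q, 0.0f);
    for (int k = 0; k < K; ++k)
        for (int i = 0; i < C * R * S; ++i)
            for (int j = 0; j < P * Q; ++j)
                y[k * P * Q + j] += w[k * C * R * S + i] * lowered[i * (P * Q) + j];
}
```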
cuDNN has been integrated into frameworks like Caffe and Baidu's PADDLE, improving performance and memory efficiency. In Caffe, integrating cuDNN improved training performance by 36% on a standard model. cuDNN is also used in other domains beyond image processing, such as speech and language, due to its flexibility and efficiency.
Future work includes extending cuDNN to support 1D and 3D convolutions, local receptive fields, and multi-GPU training. The library is available now, and the authors welcome feedback. cuDNN provides a reliable and efficient foundation for deep learning, letting researchers focus on higher-level problems while benefiting from optimized performance and memory usage.