This paper introduces a new class of fast algorithms for convolutional neural networks (CNNs) based on Winograd's minimal filtering algorithms. These algorithms significantly reduce the arithmetic complexity of CNN layers, achieving up to a 4x reduction compared to direct convolution. They are particularly efficient for small filters (e.g., 3x3) and small batch sizes, making them well suited to low-latency applications such as pedestrian detection in self-driving cars and mobile image recognition. The algorithms reduce convolution to dense matrix multiplications, which remain efficient even at small batch sizes, and they require minimal workspace memory, making them practical to implement on GPUs. Benchmarking a GPU implementation on the VGG network, the authors report state-of-the-art throughput for batch sizes from 1 to 64 while using at most 16 MB of workspace memory. The 2D convolutions are computed by nesting 1D minimal filtering algorithms, which is the source of the reduced arithmetic complexity. Compared with FFT-based convolution, the Winograd-based approach is shown to be more efficient, especially for small batch sizes and small filters. The paper also discusses implementation details on NVIDIA Maxwell GPUs and argues that the approach can be extended to larger filters and batch sizes. Overall, the results indicate that these algorithms are a promising alternative to traditional convolution methods for CNNs.
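To make the minimal-filtering idea concrete, here is a small sketch (not the authors' GPU implementation) of the 1D algorithm F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 needed by direct convolution; it is this building block that is nested to form the 2D algorithms. The function name `winograd_f23` is my own label for illustration.

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap FIR filter using 4 multiplications.

    d: 4 consecutive input values, g: 3 filter taps.
    Returns [d0*g0 + d1*g1 + d2*g2, d1*g0 + d2*g1 + d3*g2],
    i.e. the same two outputs direct convolution would produce.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (in practice precomputed once per filter).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # Data transform and the 4 elementwise multiplications.
    m0 = (d0 - d2) * G0
    m1 = (d1 + d2) * G1
    m2 = (d2 - d1) * G2
    m3 = (d1 - d3) * G3
    # Inverse transform: combine the products into the 2 outputs.
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Nesting this 1D algorithm with itself yields a 2D algorithm F(2x2,3x3) that computes a 2x2 output tile of a 3x3 convolution with 16 multiplications instead of 36, which is the 2.25x per-tile arithmetic saving that compounds into the overall speedup.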