This paper introduces a new class of fast algorithms for convolutional neural networks (CNNs) based on Winograd's minimal filtering algorithms. These algorithms significantly reduce the arithmetic complexity of CNN layers, achieving up to a 4x reduction compared to direct convolution. They are particularly efficient for small filters (e.g., 3x3) and small batch sizes, making them well suited to low-latency applications such as pedestrian detection in self-driving cars and mobile image recognition. The algorithms reduce convolution to dense matrix multiplications, which remain efficient even at small batch sizes, and they require minimal workspace memory, making them practical to implement on GPUs. Benchmarking a GPU implementation on the VGG network, the authors report state-of-the-art throughput for batch sizes from 1 to 64 while using at most 16 MB of workspace memory. The 2D convolutions are computed by nesting 1D minimal filtering algorithms, which is the source of the reduced arithmetic complexity. Compared with FFT-based convolution, the Winograd-based approach is shown to be more efficient, especially for small batch sizes and small filters. The paper also discusses implementation details on NVIDIA Maxwell GPUs and argues that the approach can be extended to larger filters and batch sizes. Overall, the results indicate that these algorithms are a promising alternative to traditional convolution methods for CNNs.
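To make the minimal-filtering idea concrete, here is a small sketch (not the authors' GPU implementation) of the 1D algorithm F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 needed by direct convolution; it is this building block that is nested to form the 2D algorithms. The function name `winograd_f23` is my own label for illustration.

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap FIR filter using 4 multiplications.

    d: 4 consecutive input values, g: 3 filter taps.
    Returns [d0*g0 + d1*g1 + d2*g2, d1*g0 + d2*g1 + d3*g2],
    i.e. the same two outputs direct convolution would produce.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (in practice precomputed once per filter).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # Data transform and the 4 elementwise multiplications.
    m0 = (d0 - d2) * G0
    m1 = (d1 + d2) * G1
    m2 = (d2 - d1) * G2
    m3 = (d1 - d3) * G3
    # Inverse transform: combine the products into the 2 outputs.
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Nesting this 1D algorithm with itself yields a 2D algorithm F(2x2,3x3) that computes a 2x2 output tile of a 3x3 convolution with 16 multiplications instead of 36, which is the 2.25x per-tile arithmetic saving that compounds into the overall speedup.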