This paper presents a new method for parallelizing the training of convolutional neural networks (CNNs) across multiple GPUs that is more efficient than existing alternatives, particularly for modern CNN architectures. The approach combines data parallelism in the convolutional layers with model parallelism in the fully-connected layers: the convolutional layers, which account for most of the computation but relatively few of the parameters, are replicated so that each worker processes a different subset of the batch, while the fully-connected layers, which hold most of the parameters, are partitioned so that different workers handle different parts of the model. The paper describes three schemes for handling the fully-connected layers, with scheme (c) being the most efficient due to its constant communication-to-computation ratio.
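To make the division of labor concrete, the following is a minimal NumPy sketch of the hybrid layout, not the paper's actual implementation: the convolutional stage is data-parallel (each simulated worker applies a shared weight replica to its own shard of the batch) and the fully-connected stage is model-parallel (each worker owns one column slice of the FC weight matrix and computes its slice of the output for the whole batch). All names, shapes, and the single-process simulation of K workers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4          # number of simulated workers (GPUs)
b = 32         # per-worker batch size, so the global batch is K * b
d_in = 100     # toy input dimension standing in for image features
d_conv = 256   # size of the flattened conv-stage output per example
d_fc = 512     # number of units in the fully-connected layer

# Data parallelism: every worker holds a replica of the conv-stage weights
# and processes its own b examples. A real implementation would run these
# K computations concurrently on K GPUs; here they are a plain Python loop.
conv_weights = rng.normal(size=(d_in, d_conv))            # shared replica
shards = [rng.normal(size=(b, d_in)) for _ in range(K)]   # one data shard per worker
conv_out = [x @ conv_weights for x in shards]             # each: (b, d_conv)

# Model parallelism: the FC weight matrix is split column-wise, one slice per
# worker; each worker needs the conv activations of *all* examples, which is
# where the inter-GPU communication (schemes (a)-(c)) would come in.
fc_slices = [rng.normal(size=(d_conv, d_fc // K)) for _ in range(K)]
all_acts = np.concatenate(conv_out, axis=0)               # (K*b, d_conv), "communicated"
fc_out = np.concatenate([all_acts @ w for w in fc_slices], axis=1)

assert fc_out.shape == (K * b, d_fc)
```

In this sketch the communication step is reduced to a single concatenation; the three schemes the paper describes differ precisely in how and when those conv-stage activations are exchanged among workers.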
The proposed algorithm uses a variable batch size: a large effective batch in the convolutional layers and a smaller effective batch in the fully-connected layers, which preserves the favorable convergence behavior of smaller batches while still amortizing convolutional work over many examples. The algorithm is evaluated on the ImageNet 2012 dataset, achieving good results despite the accuracy cost associated with larger batch sizes. The experiments show that the method scales well, although not perfectly linearly, owing to communication overhead and to reduced efficiency of the dense matrix multiplications in the fully-connected layers.
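The sketch below illustrates the variable batch size idea under stated assumptions: the convolutional stage produces one large batch of K * b activations, while the fully-connected stage consumes it in K chunks of b examples, here taking a hypothetical SGD step per chunk with a squared-error placeholder loss. The learning rate, loss, and shapes are placeholders rather than values from the paper, and whether the FC weights are updated after each chunk or the gradients are accumulated across chunks is a detail this summary does not pin down.

```python
import numpy as np

rng = np.random.default_rng(1)

K, b = 4, 32             # K conv-stage shards of b examples each
d_conv, d_fc = 256, 512  # toy layer sizes
lr = 0.01                # placeholder learning rate

conv_acts = rng.normal(size=(K * b, d_conv))   # conv stage: one big batch of K*b examples
targets = rng.normal(size=(K * b, d_fc))       # placeholder regression targets
W_fc = rng.normal(size=(d_conv, d_fc)) * 0.01  # fully-connected weights

# FC stage: K sub-batches of b examples each, so the fully-connected layer is
# updated with an effective batch size of b rather than K*b.
for k in range(K):
    x = conv_acts[k * b:(k + 1) * b]           # (b, d_conv)
    y = targets[k * b:(k + 1) * b]             # (b, d_fc)
    pred = x @ W_fc                            # forward pass on the small batch
    grad = x.T @ (pred - y) / b                # gradient of the squared-error placeholder loss
    W_fc -= lr * grad                          # one small-batch update per chunk
```

The point of the split is that the per-chunk matrix multiplies in the fully-connected stage stay large enough to be efficient while the update frequency there corresponds to the smaller batch size.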
The paper compares the proposed method with other approaches and finds it more efficient. It also discusses potential further improvements, such as using restricted connectivity in the upper layers of the network and switching to scheme (c) for better performance. The method is effective for existing architectures and shows even more promise for designs tailored to multi-GPU training. The results demonstrate that the proposed approach achieves significant speedups while maintaining reasonable accuracy.