The paper introduces a novel approach to parallelizing the training of convolutional neural networks (CNNs) across multiple GPUs, one that scales significantly better than existing alternatives, particularly on modern CNN architectures. The author proposes two variants of the algorithm: one that exactly simulates synchronous SGD as run on a single core, and another that introduces an approximation for practical efficiency. The scheme exploits data parallelism in the convolutional layers and model parallelism in the fully-connected layers. In the forward pass, each worker computes the convolutional-layer activities on its own batch; the fully-connected-layer activities are then computed with model parallelism over the combined batch. The backward pass mirrors this: gradients are computed model-parallel in the fully-connected layers and then back-propagated data-parallel through the convolutional layers.

The paper also discusses weight synchronization and variable batch sizes, which can improve convergence and accuracy. Experiments on the ImageNet 2012 dataset show that the method achieves substantial speedups while maintaining or improving accuracy. The paper concludes by suggesting improvements for larger architectures and multi-GPU setups.
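The hybrid forward pass described above can be sketched in plain NumPy. This is a minimal single-process simulation, not the paper's GPU implementation: the "convolutional" stage is stood in for by a single dense layer with ReLU, the worker count and layer sizes are illustrative assumptions, and the gather step models what would be an all-gather between GPUs in a real multi-worker setup.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 2                            # number of workers (GPUs in the paper); illustrative
B = 8                            # per-worker batch size; illustrative
D_in, D_h, D_out = 16, 12, 10    # layer sizes; illustrative assumptions

# Data-parallel stage: every worker holds a full replica of the
# convolutional weights (here a stand-in dense layer) and processes
# only its own shard of the global batch.
W_conv = rng.standard_normal((D_in, D_h))

# Model-parallel stage: the fully-connected weight matrix is split
# column-wise, one slice per worker.
W_fc = rng.standard_normal((D_h, D_out))
W_fc_shards = np.split(W_fc, K, axis=1)

# Each worker's local shard of the global batch.
batches = [rng.standard_normal((B, D_in)) for _ in range(K)]

# Forward pass, data-parallel part: each worker computes the
# convolutional-layer activities on its own batch only.
conv_acts = [np.maximum(x @ W_conv, 0.0) for x in batches]   # ReLU

# Model-parallel part: the workers' activities are gathered (an
# all-gather across GPUs in a real setup) and each worker applies its
# slice of W_fc to the *entire* gathered batch.
gathered = np.concatenate(conv_acts, axis=0)           # (K*B, D_h)
fc_parts = [gathered @ w for w in W_fc_shards]         # each (K*B, D_out // K)
fc_out = np.concatenate(fc_parts, axis=1)              # (K*B, D_out)

# Sanity check: the hybrid scheme reproduces the single-device forward pass,
# matching the variant that exactly simulates single-core synchronous SGD.
reference = np.maximum(np.concatenate(batches) @ W_conv, 0.0) @ W_fc
assert np.allclose(fc_out, reference)
```

The assertion at the end illustrates why the exact variant can match single-core SGD: partitioning the batch for the convolutional layers and partitioning the weights for the fully-connected layers are both lossless decompositions of the same computation.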