25 Feb 2019 | Pavel Izmailov*, Dmitrii Podoprikhin*, Timur Garipov*, Dmitry Vetrov, Andrew Gordon Wilson
The paper introduces Stochastic Weight Averaging (SWA), a method that improves the generalization of deep neural networks by averaging multiple points along the trajectory of Stochastic Gradient Descent (SGD) with a cyclical or constant learning rate. SWA is shown to find flatter solutions than conventional SGD, leading to better test accuracy on various datasets and architectures, including ResNet, PyramidNet, DenseNet, and Shake-Shake networks. The method is easy to implement, has minimal computational overhead, and can be used as a drop-in replacement for standard SGD training. The paper also discusses the connection between SWA and Fast Geometric Ensembling (FGE), demonstrating that SWA can approximate FGE ensembles with a single model. Experimental results on CIFAR-10, CIFAR-100, and ImageNet show significant improvements in test accuracy compared to conventional SGD training.
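Below is a minimal PyTorch sketch of the two ingredients the summary refers to: the running average of weights, w_SWA ← (n·w_SWA + w)/(n + 1), and the linear cyclical learning-rate schedule that anneals from α₁ to α₂ within each cycle of length c. The network, batch data, and hyperparameter values are placeholders chosen for illustration, not the paper's experimental settings.

```python
import copy
import torch
import torch.nn as nn


@torch.no_grad()
def update_swa(swa_model, model, n_averaged):
    """Apply the SWA running-average update
        w_SWA <- (n * w_SWA + w) / (n + 1)
    to every parameter of swa_model, using the current weights of model."""
    for p_swa, p in zip(swa_model.parameters(), model.parameters()):
        if n_averaged == 0:
            p_swa.copy_(p)                       # first snapshot initializes the average
        else:
            p_swa.add_((p - p_swa) / (n_averaged + 1))
    return n_averaged + 1


def cyclical_lr(i, c, alpha_1, alpha_2):
    """Linear cyclical schedule: within each cycle of length c, the learning
    rate is annealed from alpha_1 down to alpha_2 as the iteration i advances."""
    t = ((i - 1) % c + 1) / c
    return (1 - t) * alpha_1 + t * alpha_2


# Sketch of an SWA training loop (model, data, and hyperparameters are stand-ins):
# run SGD with the cyclical schedule and fold the weights reached at the end of
# each cycle into the running average; with a constant learning rate the weights
# would instead be averaged at a fixed interval (e.g., once per epoch).
model = nn.Linear(10, 1)                          # placeholder network
swa_model = copy.deepcopy(model)                  # holds the averaged weights
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
n_averaged, c, alpha_1, alpha_2 = 0, 100, 0.05, 0.001

for i in range(1, 1001):                          # 1-indexed iteration counter
    for group in optimizer.param_groups:
        group["lr"] = cyclical_lr(i, c, alpha_1, alpha_2)
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in training batch
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % c == 0:                                # end of a cycle: average the weights
        n_averaged = update_swa(swa_model, model, n_averaged)

# For architectures with batch normalization, the running statistics of swa_model
# should be recomputed with a forward pass over the training data before evaluation,
# since the averaged weights were never used during training.
```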