Averaging Weights Leads to Wider Optima and Better Generalization


25 Feb 2019 | Pavel Izmailov*, Dmitrii Podoprikhin*, Timur Garipov*, Dmitry Vetrov, Andrew Gordon Wilson
Stochastic Weight Averaging (SWA) improves generalization in deep neural networks by averaging the weights visited along an SGD trajectory run with a cyclical or constant learning rate. The averaged solutions are notably flatter than those found by conventional SGD, whose iterates tend to lie near the boundary of wide, low-loss regions; SWA instead finds points near the center of these regions, yielding better test performance despite a slightly higher training loss. SWA can also be interpreted as an approximation to Fast Geometric Ensembling (FGE) with the convenience of a single model. Across a wide range of architectures and benchmarks, including CIFAR-10, CIFAR-100, and ImageNet, SWA delivers significant improvements in test accuracy while remaining easy to implement and adding negligible computational overhead over conventional training. These findings are supported by both theoretical analysis and empirical results.
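
To make the averaging procedure concrete, below is a minimal sketch of the SWA idea in PyTorch: run ordinary SGD, and once averaging begins, maintain a running mean of the weights as an equally weighted average of the iterates. The training loop, model, data loader, and hyperparameters (e.g. `swa_start`, the learning rate) are illustrative placeholders, not the authors' exact setup.

```python
# Minimal sketch of Stochastic Weight Averaging (SWA), assuming PyTorch.
# Hyperparameters and the training setup are illustrative, not the paper's exact configuration.
import copy
import torch
import torch.nn as nn


def train_with_swa(model, loader, epochs=10, swa_start=5, lr=0.05):
    """Run SGD and keep a running average of the weights after `swa_start` epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    swa_model = copy.deepcopy(model)  # holds the averaged weights
    n_averaged = 0

    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        if epoch >= swa_start:
            # Running average of the iterates: w_SWA <- (n * w_SWA + w) / (n + 1)
            n_averaged += 1
            with torch.no_grad():
                for p_swa, p in zip(swa_model.parameters(), model.parameters()):
                    p_swa.mul_(n_averaged - 1).add_(p).div_(n_averaged)

    return swa_model
```

In practice, batch-normalization statistics should be recomputed for the averaged weights before evaluation, since the running statistics collected during training correspond to the SGD iterates rather than their average; PyTorch's `torch.optim.swa_utils` (`AveragedModel`, `SWALR`, `update_bn`) packages this pattern as ready-made utilities.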