February 9, 2016 | Moritz Hardt, Benjamin Recht, Yoram Singer
This paper shows that any model trained with stochastic gradient descent (SGD) for a reasonable amount of time attains small generalization error. The authors prove that SGD is uniformly stable in the sense of Bousquet and Elisseeff: replacing a single training example changes the learned model's predictions only slightly. This stability holds for both convex and non-convex optimization problems under standard Lipschitz and smoothness assumptions on the loss.
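To make the notion concrete, here is a minimal sketch (not from the paper) of how one could estimate stability empirically: run SGD with the same seed, and hence the same sample order, on two datasets that differ in a single example, then measure how far the learned parameters drift apart. The least-squares objective and all names here are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=5, seed=0):
    """Run plain SGD on a least-squares loss and return the final weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):  # same seed => same sample order in both runs
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i . w - y_i)^2
            w -= lr * grad
    return w

rng = np.random.default_rng(42)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Neighboring dataset: identical except for one resampled example.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.standard_normal(d)
y2[0] = rng.standard_normal()

w1 = sgd(X, y)
w2 = sgd(X2, y2)
print("parameter divergence:", np.linalg.norm(w1 - w2))  # small value => stable run
```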
In the convex case, the authors show that even multiple epochs of SGD generalize well, because the algorithm remains stable throughout. In the non-convex case, they show that popular heuristics for training deep neural networks, such as dropout and weight decay, further improve the stability of SGD. These findings suggest that shorter training helps prevent overfitting: the fewer gradient steps taken, the more stable, and hence the better-generalizing, the resulting model.
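One intuition for why weight decay aids stability is that the penalty shrinks the parameters at every step, damping how far two runs on neighboring datasets can drift apart. A quick illustrative sketch of a single update (assuming an l2 penalty of strength wd; this is my paraphrase, not code from the paper):

```python
def sgd_step_weight_decay(w, grad, lr=0.01, wd=1e-4):
    """One SGD step on loss + (wd/2)*||w||^2.

    The gradient of the penalty is wd * w, so the update becomes
    (1 - lr*wd) * w - lr * grad: the (1 - lr*wd) factor contracts the
    parameters toward zero, pulling nearby trajectories together.
    """
    return (1 - lr * wd) * w - lr * grad
```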
The paper also provides generalization bounds for models learned with SGD. These bounds depend on the number of iterations, the step sizes, and the smoothness of the objective function; for convex optimization they are tight, and for non-convex optimization they are tight under certain conditions. Notably, the bounds do not depend on the number of model parameters, which is why SGD can achieve good generalization performance even for complex models with many parameters.
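For reference, the convex bound takes roughly the following form as I recall it from the paper (constants may differ): for an $L$-Lipschitz, $\beta$-smooth convex loss and step sizes $\alpha_t \le 2/\beta$, running $T$ steps of SGD on $n$ examples is uniformly stable with

$$\epsilon_{\text{stab}} \;\le\; \frac{2L^2}{n}\sum_{t=1}^{T}\alpha_t,$$

so with a constant step size the generalization gap grows at most linearly in $T$ and shrinks as $1/n$, independent of the model's dimension.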
The paper also discusses the relationship between stability and generalization, showing that stable algorithms tend to generalize better. The authors argue that minimizing training time is not only computationally beneficial but also has the important side effect of reducing generalization error. This suggests that practitioners should focus on minimizing training time, for example, by designing model architectures that allow SGD to converge quickly to a desired error level.