On Early Stopping in Gradient Descent Learning


2007 | Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto
This paper studies a family of gradient descent algorithms for approximating the regression function in reproducing kernel Hilbert spaces (RKHSs), characterized by polynomially decreasing step sizes. By solving a bias-variance trade-off, the paper derives an early stopping rule and probabilistic upper bounds on the convergence of the algorithms. The results are applied to classification, where fast convergence rates for plug-in classifiers are obtained. The paper also relates the methods to boosting, to Landweber iterations, and to online learning algorithms viewed as stochastic approximations of gradient descent.

The analysis rests on two iteration paths in the RKHS: the population iteration, which performs gradient descent on the expected risk, and the sample iteration, which performs gradient descent on the empirical risk. The population iteration converges to the regression function, while the sample iteration tends to overfit the data. Keeping the two paths close acts as a form of regularization that balances bias and variance: stopping the sample iteration too early keeps the variance small but leaves a large bias, whereas stopping too late does the opposite. Solving this trade-off yields the early stopping rule.

Under this rule, the proposed algorithms are shown to converge polynomially to the regression function under regularity assumptions on the target, with the constant step size algorithm being the fastest member of the family. The convergence rates obtained are not known to be optimal and may be improvable. Applied to classification, the results give fast convergence to the Bayes classifier under suitable noise assumptions. Early stopping regularization compares favorably with regularized least squares (e.g., ridge regression) in that it avoids the saturation phenomenon, where the convergence rate stops improving beyond a certain regularity of the target function. The algorithms can also be viewed as randomized discretizations of Landweber iterations from the theory of inverse problems.

The paper is organized into sections covering the main results, related work, proofs, and applications to classification, and concludes with a summary of findings and open problems. Overall, it contributes to the understanding of gradient descent algorithms in RKHSs and their role in learning and inverse problems.
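To make the sample iteration concrete, the following is a minimal sketch of kernel gradient descent on the empirical squared risk with a polynomially decaying step size. It is an illustration under assumed choices (Gaussian kernel, step-size constants gamma0 and theta, iteration budget), not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Gaussian (RBF) kernel matrix; any bounded Mercer kernel would serve here.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_gradient_descent(K, y, n_iters=200, gamma0=1.0, theta=0.0):
    """Sample iteration: gradient descent on the empirical squared risk in the RKHS.

    The t-th iterate is f_t = sum_i c[i] * k(x_i, .). The RKHS gradient of
    (1/n) * sum_i (f(x_i) - y_i)^2 gives the coefficient update
        c <- c - (gamma_t / n) * (K c - y),
    with step size gamma_t = gamma0 / (t + 1)^theta; theta = 0 is the constant
    step size member of the family. The whole coefficient path is returned so
    that the iterate at any stopping time can be recovered afterwards.
    """
    n = len(y)
    c = np.zeros(n)
    path = []
    for t in range(n_iters):
        gamma_t = gamma0 / (t + 1) ** theta
        c = c - (gamma_t / n) * (K @ c - y)
        path.append(c.copy())
    return path
```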
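The stopping rule in the paper is derived a priori from the bias-variance trade-off and depends on the sample size and the regularity of the regression function. As a purely practical stand-in, the sketch below (reusing the helpers above) picks the iterate with the smallest error on a hold-out set and then forms the plug-in classifier sign(f_t*); the hold-out split and the use of validation error are assumptions made for illustration, not the paper's rule.

```python
def early_stopped_classifier(X_tr, y_tr, X_val, y_val, sigma=1.0, **gd_kwargs):
    """Choose a stopping time by hold-out error (a proxy for the a priori rule),
    then classify new points by the sign of the early-stopped estimate."""
    K = gaussian_kernel(X_tr, X_tr, sigma)
    K_val = gaussian_kernel(X_val, X_tr, sigma)
    path = kernel_gradient_descent(K, y_tr, **gd_kwargs)

    # Hold-out squared error along the iteration path; its minimizer mimics
    # the bias-variance balance that the a priori rule solves analytically.
    val_err = [np.mean((K_val @ c - y_val) ** 2) for c in path]
    t_star = int(np.argmin(val_err))
    c_star = path[t_star]

    def classify(X_new):
        # Plug-in classifier: sign of the early-stopped regression estimate.
        return np.sign(gaussian_kernel(X_new, X_tr, sigma) @ c_star)

    return classify, t_star
```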
Understanding On Early Stopping in Gradient Descent Learning