SGDR: Stochastic Gradient Descent with Warm Restarts
Ilya Loshchilov and Frank Hutter
University of Freiburg, Germany
{ilya,fh}@cs.uni-freiburg.de
Abstract: Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence of accelerated gradient schemes on ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art test errors of 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR.
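As a rough illustration of the warm-restart idea referred to above, the following is a minimal sketch of a cosine-annealed learning-rate schedule with periodic restarts. The function name, default values, and parameters (eta_min, eta_max, T_0, T_mult) are placeholders chosen for this sketch, not settings prescribed here.

```python
import math

def sgdr_learning_rate(epoch, eta_min=0.0, eta_max=0.05, T_0=10, T_mult=2):
    """Cosine-annealed learning rate with warm restarts (illustrative sketch).

    Within a cycle of length T_i the rate decays from eta_max to eta_min;
    at the end of each cycle it is reset ("warm restart") and the next
    cycle is T_mult times longer.
    """
    T_i, t_cur = T_0, epoch
    while t_cur >= T_i:          # locate the current cycle and the position within it
        t_cur -= T_i
        T_i *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T_i))

# Example: print the schedule over the first three cycles (lengths 10, 20, 40).
print([round(sgdr_learning_rate(e), 4) for e in range(70)])
```

Each cycle starts at eta_max (the "warm restart") and decays toward eta_min; choosing T_mult > 1 makes successive cycles progressively longer.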
Introduction: Deep neural networks (DNNs) are currently the best-performing method for many classification problems, such as object recognition from images (Krizhevsky et al., 2012a; Donahue et al., 2014) or speech recognition from audio data (Deng et al., 2013). Their training on large datasets (where DNNs perform particularly well) is the main computational bottleneck: it often requires several days, even on high-performance GPUs, and any speedups would be of substantial value.
The training of a DNN with n free parameters can be formulated as the problem of minimizing a function f: R^n → R. The commonly used procedure to optimize f is to iteratively adjust x_t ∈ R^n (the parameter vector at time step t) using gradient information ∇f_t(x_t) obtained on a relatively small t-th batch of b datapoints. Stochastic Gradient Descent (SGD) then extends Gradient Descent (GD) to the stochastic optimization of f as follows:
x_{t+1} = x_t - η_t ∇f_t(x_t),
where η_t is a learning rate. One would like to consider second-order information
x_{t+1} = x_t - η_t H_t^{-1} ∇f_t(x_t),
but this is often infeasible since computing and storing the inverse Hessian H_t^{-1} is intractable for large n. The usual way of dealing with this problem, using limited-memory quasi-Newton methods such as L-BFGS (Liu & Nocedal, 1989), is not currently in favor in deep learning, not least due to (i) the stochasticity of ∇f_t(x_t), (ii) ill-conditioning
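To make the updates above concrete, here is a minimal NumPy sketch of the first-order SGD step x_{t+1} = x_t - η_t ∇f_t(x_t) on a toy least-squares problem. The loss, batch size, and learning rate are illustrative only; the Newton-style step is deliberately not implemented, since it would require forming and inverting an n × n Hessian.

```python
import numpy as np

def sgd_step(x, grad_fn, batch, eta):
    """One SGD step: x_{t+1} = x_t - eta * grad f_t(x_t), with the gradient
    estimated on a mini-batch of datapoints."""
    return x - eta * grad_fn(x, batch)

# Toy example: stochastic minimization of f(x) = 0.5/N * ||A x - y||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(256, 10))   # 256 datapoints, n = 10 parameters
y = rng.normal(size=256)

def grad_fn(x, idx):
    # Gradient of the mini-batch loss over the rows indexed by `idx`.
    A_b, y_b = A[idx], y[idx]
    return A_b.T @ (A_b @ x - y_b) / len(idx)

x = np.zeros(10)
for t in range(100):
    batch = rng.integers(0, 256, size=16)      # draw a mini-batch of b = 16 points
    x = sgd_step(x, grad_fn, batch, eta=0.05)  # first-order update with eta_t = 0.05
```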