ADADELTA: AN ADAPTIVE LEARNING RATE METHOD

22 Dec 2012 | Matthew D. Zeiler
ADADELTA is a novel adaptive learning rate method for gradient descent that dynamically adjusts the learning rate per dimension using only first-order information. It adds minimal computational overhead over standard stochastic gradient descent (SGD) and requires no manual tuning of a learning rate. The method is robust to noisy gradients, different model architectures, and various data modalities, and it performs well on tasks such as MNIST digit classification and large-scale speech recognition.

The method addresses two main drawbacks of existing approaches: the continual decay of learning rates throughout training and the need for a manually selected global learning rate. ADADELTA accumulates squared gradients over a window, implemented as an exponentially decaying average, and keeps a matching decaying average of squared parameter updates; the ratio of these two quantities gives the per-dimension learning rate. This allows the step size to adapt dynamically to the problem, improving convergence and performance.

In experiments, ADADELTA outperformed SGD, Momentum, and ADAGRAD on the MNIST dataset, achieving a lower test error rate. It also performed well on a large-scale speech recognition task, showing robustness to noise and to varying hyperparameters. The effective learning rates adapt to the problem, with larger learning rates in the lower layers of the network to compensate for vanishing gradients.

ADADELTA is computationally efficient and can be applied in both local and distributed environments. It does not require an explicit annealing schedule, which makes it suitable for a wide range of applications. Its ability to adapt learning rates per dimension and its robustness to different input data and hyperparameters make it a promising approach for training deep neural networks.
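As a concrete illustration of the update described above, the NumPy sketch below shows a single per-parameter ADADELTA step: squared gradients and squared updates are each tracked with an exponentially decaying average, and the ratio of their RMS values scales the raw gradient, so no global learning rate appears. The function name, the toy usage, and the defaults rho = 0.95 and eps = 1e-6 are illustrative assumptions, not code from the paper or the summary above.

```python
import numpy as np

def adadelta_update(x, grad, acc_grad, acc_update, rho=0.95, eps=1e-6):
    """One ADADELTA step for parameter array x given its gradient.

    acc_grad   -- decaying average of squared gradients, E[g^2]
    acc_update -- decaying average of squared updates,   E[dx^2]
    Returns the updated parameters and the two updated accumulators.
    """
    # Accumulate gradient: E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2
    # Compute update: scale the gradient by RMS of past updates over RMS of
    # gradients; note that no global learning rate is involved.
    delta = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    # Accumulate update: E[dx^2]_t = rho * E[dx^2]_{t-1} + (1 - rho) * dx_t^2
    acc_update = rho * acc_update + (1.0 - rho) * delta ** 2
    # Apply update
    return x + delta, acc_grad, acc_update

if __name__ == "__main__":
    # Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x.
    x = np.array([3.0, -2.0])
    acc_g = np.zeros_like(x)
    acc_dx = np.zeros_like(x)
    for _ in range(5000):
        x, acc_g, acc_dx = adadelta_update(x, 2.0 * x, acc_g, acc_dx)
    print(x)  # x moves toward the minimum at the origin
```

In this sketch the only free hyperparameters are the decay rate rho and the small constant eps, which mirrors the summary's claim that no learning rate needs to be hand tuned or annealed.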