7 Aug 2017 | Stephen Merity, Nitish Shirish Keskar, Richard Socher
This paper presents regularization and optimization strategies for LSTM-based language models. The authors propose the weight-dropped LSTM, which applies DropConnect regularization to the hidden-to-hidden recurrent weight matrices. They also introduce NT-ASGD, a variant of averaged stochastic gradient descent that begins averaging when a non-monotonic condition on validation performance is met, rather than on a user-tuned schedule. The proposed methods achieve state-of-the-art results on the Penn Treebank and WikiText-2 datasets, with perplexities of 57.3 and 65.8, respectively. Further, when combined with a neural cache, the model achieves even lower perplexities of 52.8 and 52.0.
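The two core ideas can be sketched in a few lines of plain Python. This is an illustration, not the authors' code: `drop_connect` and `nt_asgd_trigger` are hypothetical names, the mask here is applied to a raw weight matrix represented as nested lists, and the exact trigger window may differ slightly from the paper's algorithm.

```python
import random

def drop_connect(weights, p, rng=None):
    """DropConnect sketch: zero each weight independently with probability
    p and scale survivors by 1/(1-p), so each weight's expected value is
    unchanged (inverted-dropout convention). In the weight-dropped LSTM
    this mask is applied to the hidden-to-hidden weight matrices.
    `weights` is a list of rows (lists of floats)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed for reproducibility in this demo
    if p <= 0.0:
        return [row[:] for row in weights]
    keep = 1.0 - p
    return [[w / keep if rng.random() >= p else 0.0 for w in row]
            for row in weights]

def nt_asgd_trigger(val_losses, n=5):
    """Non-monotonic trigger sketch for NT-ASGD: switch from SGD to
    averaged SGD once the latest validation loss is worse than the best
    loss observed up to n evaluations ago."""
    if len(val_losses) <= n:
        return False
    return val_losses[-1] > min(val_losses[:-n])
```

A training loop would call `drop_connect` on the recurrent weights once per forward pass (so the same mask is shared across timesteps) and check `nt_asgd_trigger` after each validation run to decide when to start averaging iterates.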
The paper also explores further regularization techniques: variable-length backpropagation-through-time sequences, variational dropout, embedding dropout, weight tying between the embedding and softmax layers, and activation regularization (AR) together with its temporal variant (TAR). Experiments show that these techniques improve data efficiency and reduce overfitting, and an ablation analysis finds that the weight-dropped LSTM is crucial for achieving state-of-the-art performance. The paper concludes that its regularization and optimization strategies are effective for language modeling and may be applicable to other sequence learning tasks.
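The AR and TAR terms above are simple L2 penalties on the hidden states. A minimal sketch, assuming one hidden vector per timestep; the function name and the sum-based reduction are illustrative choices, not taken from the paper:

```python
def ar_tar_penalty(hidden_seq, alpha=2.0, beta=1.0):
    """Activation regularization (AR) penalizes large hidden activations;
    temporal activation regularization (TAR) penalizes large changes
    between consecutive hidden states. `hidden_seq` is a list of
    equal-length lists of floats (one hidden vector per timestep)."""
    def l2sq(v):
        return sum(x * x for x in v)
    # AR: L2 penalty on each timestep's activations.
    ar = alpha * sum(l2sq(h) for h in hidden_seq)
    # TAR: L2 penalty on the difference between consecutive hidden states.
    tar = beta * sum(l2sq([a - b for a, b in zip(h1, h2)])
                     for h1, h2 in zip(hidden_seq, hidden_seq[1:]))
    return ar + tar
```

The returned value is added to the cross-entropy loss before backpropagation; `alpha` and `beta` control how strongly large and rapidly changing activations are discouraged.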