7 Aug 2017 | Stephen Merity, Nitish Shirish Keskar, Richard Socher
This paper explores regularization and optimization strategies for LSTM-based language models, focusing on word-level language modeling. The authors propose the Weight-Dropped LSTM (WD-LSTM), which uses DropConnect on the hidden-to-hidden weights to prevent overfitting on the recurrent connections. They also introduce NT-ASGD, a variant of averaged stochastic gradient descent (ASGD) in which the averaging trigger is determined by a non-monotonic condition on validation performance. These techniques achieve state-of-the-art perplexities on the Penn Treebank and WikiText-2 datasets. Additionally, the paper investigates other regularization methods such as variable-length backpropagation through time (BPTT), embedding dropout, activation regularization (AR), and temporal activation regularization (TAR). The authors demonstrate that their approach outperforms custom RNN cells and complex regularization strategies, and improves further when combined with a neural cache. The paper concludes by discussing the effectiveness of the proposed strategies and their potential applicability to other sequence learning tasks.
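The key idea behind the weight-dropped LSTM is to apply DropConnect to the recurrent (hidden-to-hidden) weight matrices, so a single dropout mask on the weights is shared across every timestep of a sequence. The following is a minimal PyTorch sketch of that idea, written as a hand-rolled LSTM layer rather than the authors' wrapper around `nn.LSTM`; the class name, initialization, and single-layer structure are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropLSTMCell(nn.Module):
    """One LSTM layer with DropConnect on the hidden-to-hidden weights (sketch)."""

    def __init__(self, input_size, hidden_size, weight_dropout=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.weight_dropout = weight_dropout
        # Input-to-hidden and hidden-to-hidden weights for the four LSTM gates.
        self.w_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.1)
        self.w_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x, state=None):
        # x: (seq_len, batch, input_size)
        seq_len, batch, _ = x.shape
        if state is None:
            h = x.new_zeros(batch, self.hidden_size)
            c = x.new_zeros(batch, self.hidden_size)
        else:
            h, c = state
        # DropConnect: sample one dropout mask on the recurrent weight matrix
        # and reuse it for every timestep of this sequence.
        w_hh = F.dropout(self.w_hh, p=self.weight_dropout, training=self.training)
        outputs = []
        for t in range(seq_len):
            gates = x[t] @ self.w_ih.t() + h @ w_hh.t() + self.bias
            i, f, g, o = gates.chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs), (h, c)
```

Because the mask is applied to the weights rather than the hidden activations, the highly optimized cuDNN LSTM kernel can still be used in practice; dropping activations inside the recurrence would prevent that.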
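NT-ASGD runs plain SGD and only switches to averaged SGD once validation performance stops improving over a window of recent checks, avoiding a hand-tuned averaging trigger. A hedged sketch of that trigger logic is below; `train_one_epoch` and `evaluate` are assumed helper functions, and `torch.optim.ASGD` with `t0=0` is used so averaging starts immediately after the switch (setting `lambd=0.0` disables ASGD's built-in learning-rate decay, so averaging is the only change).

```python
import torch

def train_nt_asgd(model, train_one_epoch, evaluate, lr=30.0, nonmono=5, max_epochs=500):
    """Sketch of NT-ASGD training: SGD until the non-monotonic trigger fires,
    then averaged SGD for the remaining epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    val_history = []
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        val_loss = evaluate(model)
        # Non-monotonic trigger: switch when the current validation loss is
        # worse than the best loss seen more than `nonmono` checks ago.
        if (isinstance(optimizer, torch.optim.SGD)
                and len(val_history) > nonmono
                and val_loss > min(val_history[:-nonmono])):
            optimizer = torch.optim.ASGD(model.parameters(), lr=lr, t0=0, lambd=0.0)
        val_history.append(val_loss)
    return model
```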
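AR penalizes large activations of the final RNN layer (applied to its dropout-masked output), while TAR penalizes large changes between consecutive hidden states. A small sketch of how these two terms could be added to the language-model loss follows; the tensor shapes (`seq_len, batch, hidden`) and the coefficient defaults are assumptions.

```python
def ar_tar_loss(loss, dropped_output, raw_output, alpha=2.0, beta=1.0):
    """Add activation regularization (AR) and temporal activation
    regularization (TAR) terms to the cross-entropy loss (sketch)."""
    # AR: L2 penalty on the dropout-masked output of the final RNN layer.
    loss = loss + alpha * dropped_output.pow(2).mean()
    # TAR: L2 penalty on the difference between consecutive hidden states.
    loss = loss + beta * (raw_output[1:] - raw_output[:-1]).pow(2).mean()
    return loss
```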