4 Mar 2014 | Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, Tony Robinson
A new benchmark corpus for statistical language modeling is introduced, consisting of roughly one billion words of training data. The benchmark aims to evaluate and compare the performance of various language modeling techniques on a common footing. The dataset is derived from the monolingual English news text made available on the WMT11 website. The data was processed to remove duplicate sentences, normalize the text, and construct a fixed vocabulary, with out-of-vocabulary words mapped to an unknown-word token. The benchmark is distributed as a code.google.com project, providing scripts to rebuild the data exactly as well as log-probability values for each word in ten held-out data sets.
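To make the preprocessing step concrete, below is a minimal sketch of the kind of vocabulary construction and out-of-vocabulary mapping the rebuild scripts perform. The function names, the "<UNK>" token spelling, and the count threshold are illustrative assumptions, not the benchmark's actual code.

```python
from collections import Counter

def build_vocab(sentences, min_count=3, unk="<UNK>"):
    """Count whitespace-separated tokens and keep those seen at least min_count times."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add(unk)
    return vocab

def map_oov(sentence, vocab, unk="<UNK>"):
    """Replace out-of-vocabulary tokens with the unknown-word token."""
    return " ".join(w if w in vocab else unk for w in sentence.split())

# Toy usage: build a vocabulary from a few sentences, then map a new sentence onto it.
train = ["the cat sat on the mat", "the dog sat", "a cat ran", "the cat sat"]
vocab = build_vocab(train, min_count=2)
print(map_oov("the zebra sat on the mat", vocab))  # -> "the <UNK> sat <UNK> the <UNK>"
```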
Several baseline language models were evaluated, including an interpolated Kneser-Ney 5-gram model, which achieved a perplexity of 67.6. A combination of techniques reduced perplexity by 35% relative to this baseline, equivalent to roughly a 10% reduction in cross-entropy (bits per word). More advanced techniques such as normalized stupid backoff, binary maximum entropy, hierarchical softmax, and recurrent neural networks (RNNs) were also tested. RNN-based models significantly outperformed the other techniques, achieving the lowest perplexity of any single model on the benchmark.
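The two figures describe the same improvement in different units: perplexity is two raised to the cross-entropy in bits per word, so a 35% perplexity reduction from the 67.6 baseline corresponds to roughly a 10% drop in cross-entropy. The short sketch below works through that arithmetic and shows how perplexity can be computed from per-word log-probabilities like those the benchmark distributes; the assumption of base-2 log-probabilities and simple per-word averaging is mine, not the paper's exact evaluation script.

```python
import math

def perplexity(logprobs_base2):
    """Perplexity from per-word log2-probabilities: 2 ** (average negative log2 prob)."""
    n = len(logprobs_base2)
    cross_entropy = -sum(logprobs_base2) / n   # bits per word
    return 2 ** cross_entropy

# Worked check of the reported numbers:
baseline_ppl = 67.6
combined_ppl = baseline_ppl * (1 - 0.35)      # 35% perplexity reduction -> ~43.9
h_base = math.log2(baseline_ppl)              # ~6.08 bits/word
h_comb = math.log2(combined_ppl)              # ~5.46 bits/word
print(combined_ppl, 1 - h_comb / h_base)      # ~43.9, ~0.10 (about a 10% cross-entropy reduction)
```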
The best results came from combining models, with the interpolation weights for the individual models optimized on held-out data. The RNN models were made practical to train at this scale through techniques such as parallelization and a reduced parameter count, and contributed the largest improvements. Because the data is freely available and the preprocessing is reproducible, the benchmark allows a fair comparison of different techniques, and the choice of one billion words balances the relevance of the training data against the ease of training and evaluation. The paper concludes that the benchmark is a valuable tool for measuring progress in statistical language modeling and encourages further research and collaboration in this area.
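As a sketch of how interpolation weights can be tuned on held-out data, the following uses the standard EM update for the weights of a linear mixture of models. The array shapes, iteration count, and toy probabilities are illustrative assumptions; the paper does not specify the exact optimizer used.

```python
import numpy as np

def em_interpolation_weights(heldout_probs, iters=50):
    """
    heldout_probs: array of shape (n_words, n_models) giving each model's
    probability for every held-out word. Returns mixture weights that
    (locally) maximize the held-out log-likelihood of the interpolated model.
    """
    n_words, n_models = heldout_probs.shape
    w = np.full(n_models, 1.0 / n_models)
    for _ in range(iters):
        # E-step: posterior responsibility of each model for each held-out word
        weighted = heldout_probs * w                      # (n_words, n_models)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities
        w = resp.mean(axis=0)
    return w

# Toy example: two hypothetical models scored on five held-out words.
probs = np.array([[0.10, 0.02], [0.05, 0.04], [0.01, 0.08],
                  [0.20, 0.10], [0.03, 0.03]])
w = em_interpolation_weights(probs)
mixture_ppl = np.exp(-np.mean(np.log(probs @ w)))         # perplexity of the interpolated model
print(w, mixture_ppl)
```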