4 Mar 2014 | Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, Tony Robinson
A new benchmark corpus for statistical language modeling is introduced, consisting of roughly one billion words of training data. The benchmark aims to evaluate and compare the performance of various language modeling techniques on a common footing. The dataset is derived from the monolingual English news text made available on the WMT11 website. The data was processed to remove duplicate sentences, normalize the text, and construct a fixed vocabulary, with out-of-vocabulary words mapped to an unknown-word token. The benchmark is distributed as a code.google.com project, providing scripts to rebuild the data exactly as well as log-probability values for each word in ten held-out data sets.
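To make the preprocessing step concrete, below is a minimal sketch of the kind of vocabulary construction and out-of-vocabulary mapping the rebuild scripts perform. The function names, the "<UNK>" token spelling, and the count threshold are illustrative assumptions, not the benchmark's actual code.

```python
from collections import Counter

def build_vocab(sentences, min_count=3, unk="<UNK>"):
    """Count whitespace-separated tokens and keep those seen at least min_count times."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add(unk)
    return vocab

def map_oov(sentence, vocab, unk="<UNK>"):
    """Replace out-of-vocabulary tokens with the unknown-word token."""
    return " ".join(w if w in vocab else unk for w in sentence.split())

# Toy usage: build a vocabulary from a few sentences, then map a new sentence onto it.
train = ["the cat sat on the mat", "the dog sat", "a cat ran", "the cat sat"]
vocab = build_vocab(train, min_count=2)
print(map_oov("the zebra sat on the mat", vocab))  # -> "the <UNK> sat <UNK> the <UNK>"
```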
Several baseline language models were evaluated, including an interpolated Kneser-Ney 5-gram model, which achieved a perplexity of 67.6. A combination of techniques reduced perplexity by 35% relative to this baseline, equivalent to roughly a 10% reduction in cross-entropy (bits per word). More advanced techniques such as normalized stupid backoff, binary maximum entropy, hierarchical softmax, and recurrent neural networks (RNNs) were also tested. RNN-based models significantly outperformed the other techniques, achieving the lowest perplexity of any single model on the benchmark.
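The two figures describe the same improvement in different units: perplexity is two raised to the cross-entropy in bits per word, so a 35% perplexity reduction from the 67.6 baseline corresponds to roughly a 10% drop in cross-entropy. The short sketch below works through that arithmetic and shows how perplexity can be computed from per-word log-probabilities like those the benchmark distributes; the assumption of base-2 log-probabilities and simple per-word averaging is mine, not the paper's exact evaluation script.

```python
import math

def perplexity(logprobs_base2):
    """Perplexity from per-word log2-probabilities: 2 ** (average negative log2 prob)."""
    n = len(logprobs_base2)
    cross_entropy = -sum(logprobs_base2) / n   # bits per word
    return 2 ** cross_entropy

# Worked check of the reported numbers:
baseline_ppl = 67.6
combined_ppl = baseline_ppl * (1 - 0.35)      # 35% perplexity reduction -> ~43.9
h_base = math.log2(baseline_ppl)              # ~6.08 bits/word
h_comb = math.log2(combined_ppl)              # ~5.46 bits/word
print(combined_ppl, 1 - h_comb / h_base)      # ~43.9, ~0.10 (about a 10% cross-entropy reduction)
```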
The best results came from combining models, with the interpolation weights for the individual models optimized on held-out data. The RNN models were made practical to train at this scale through techniques such as parallelization and a reduced parameter count, and contributed the largest improvements. Because the data is freely available and the preprocessing is reproducible, the benchmark allows a fair comparison of different techniques, and the choice of one billion words balances the relevance of the training data against the ease of training and evaluation. The paper concludes that the benchmark is a valuable tool for measuring progress in statistical language modeling and encourages further research and collaboration in this area.
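As a sketch of how interpolation weights can be tuned on held-out data, the following uses the standard EM update for the weights of a linear mixture of models. The array shapes, iteration count, and toy probabilities are illustrative assumptions; the paper does not specify the exact optimizer used.

```python
import numpy as np

def em_interpolation_weights(heldout_probs, iters=50):
    """
    heldout_probs: array of shape (n_words, n_models) giving each model's
    probability for every held-out word. Returns mixture weights that
    (locally) maximize the held-out log-likelihood of the interpolated model.
    """
    n_words, n_models = heldout_probs.shape
    w = np.full(n_models, 1.0 / n_models)
    for _ in range(iters):
        # E-step: posterior responsibility of each model for each held-out word
        weighted = heldout_probs * w                      # (n_words, n_models)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new weights are the average responsibilities
        w = resp.mean(axis=0)
    return w

# Toy example: two hypothetical models scored on five held-out words.
probs = np.array([[0.10, 0.02], [0.05, 0.04], [0.01, 0.08],
                  [0.20, 0.10], [0.03, 0.03]])
w = em_interpolation_weights(probs)
mixture_ppl = np.exp(-np.mean(np.log(probs @ w)))         # perplexity of the interpolated model
print(w, mixture_ppl)
```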