Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens


4 Apr 2024 | Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
This paper presents a new approach to n-gram language models (LMs) that scales them to trillions of training tokens and removes the upper bound on n. The proposed model, called infini-gram, generalizes the traditional n-gram LM by allowing n to be arbitrarily large. Instead of precomputing n-gram count tables, which is infeasible for unbounded n, infini-gram computes probabilities on the fly from a suffix array with millisecond-level latency, enabling efficient next-token prediction over both human-written and machine-generated text.

The infini-gram engine is built over large text corpora totaling 5 trillion tokens, including the Pile dataset. The suffix array indexes the training data for fast counting queries with a storage overhead of 7 bytes per token, and it supports a wide range of query types, including conventional n-gram counting and infinite-gram (unbounded-n) language modeling.
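To make the suffix-array mechanism concrete, here is a minimal in-memory Python sketch of the core idea (the paper's engine builds the index on disk to reach trillion-token scale; the function names and toy corpus below are illustrative assumptions, not the released API): counting an n-gram reduces to locating the contiguous block of sorted suffixes that begin with it, and the infini-gram estimate backs off to the longest suffix of the context that occurs in the corpus before reading off next-token counts.

```python
import bisect

def build_suffix_array(tokens):
    # Sort all suffix start positions lexicographically (in-memory toy version;
    # the real index is constructed on disk over trillions of tokens).
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def suffix_range(tokens, sa, query):
    # Binary-search the suffix array for the block of suffixes whose first
    # len(query) tokens equal `query`; the block size is the n-gram count.
    q, n = tuple(query), len(query)
    lo = bisect.bisect_left(sa, q, key=lambda i: tuple(tokens[i:i + n]))
    hi = bisect.bisect_right(sa, q, key=lambda i: tuple(tokens[i:i + n]))
    return lo, hi

def infinigram_next_token(tokens, sa, context):
    # Back off to the longest suffix of `context` found in the corpus, then
    # return the empirical distribution of the tokens that follow it.
    for start in range(len(context) + 1):      # longest suffix first
        suffix = list(context[start:])
        lo, hi = suffix_range(tokens, sa, suffix)
        follow = {}
        for j in range(lo, hi):
            pos = sa[j] + len(suffix)
            if pos < len(tokens):
                follow[tokens[pos]] = follow.get(tokens[pos], 0) + 1
        if follow:
            total = sum(follow.values())
            return {tok: c / total for tok, c in follow.items()}
    return {}

# Toy example (word-level "tokens" for readability):
corpus = "the cat sat on the mat . the cat sat on the rug .".split()
sa = build_suffix_array(corpus)
print(infinigram_next_token(corpus, sa, "cat sat on the".split()))
# {'mat': 0.5, 'rug': 0.5}
```

The key-based bisect calls assume Python 3.10+. The real engine exposes the same two primitives, counting an arbitrary-length n-gram and reading off the next-token distribution after the longest matching suffix, as queries over the on-disk index.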
On human-written text, infini-gram achieves fairly high next-token prediction accuracy (47%), and its accuracy rises as the effective n, the length of the longest context suffix found in the corpus, grows. It also complements neural LMs: interpolating infini-gram estimates with neural LMs reduces their perplexity by up to 73%, with the largest gains on tokens where the neural LM alone performs poorly.

Analyzing machine-generated text with infini-gram, the paper finds that nucleus sampling produces text whose agreement with infini-gram most closely resembles that of human-written text, whereas greedy decoding shows large fluctuations in agreement as a function of suffix length, pointing to possible deficiencies in neural LM pretraining and in the positional embeddings of Transformers.

Overall, infini-gram is a powerful tool both for analyzing large text corpora and for improving neural LMs: it handles trillion-token datasets efficiently while providing accurate next-token statistics for human-written and machine-generated text. The paper concludes that infini-gram opens up many directions for future research and applications in natural language processing.
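The reported perplexity reductions come from mixing the infini-gram estimate with a neural LM's next-token distribution. Below is a minimal sketch of the simplest such combination, plain linear interpolation over two distributions represented as dictionaries; the fixed weight of 0.5 is an illustrative assumption, whereas the paper tunes the mixing on held-out data rather than fixing it.

```python
def interpolate(p_neural, p_infini, lam=0.5):
    # Linear interpolation of two next-token distributions:
    #   P(w | context) = lam * P_neural(w | context) + (1 - lam) * P_infini(w | context)
    # `lam` is a fixed illustrative value here; in practice it is tuned on held-out data.
    vocab = set(p_neural) | set(p_infini)
    return {w: lam * p_neural.get(w, 0.0) + (1.0 - lam) * p_infini.get(w, 0.0)
            for w in vocab}

# Example: the infini-gram estimate sharpens the neural LM's distribution when a
# long context suffix has been seen verbatim in the corpus.
p_neural = {"mat": 0.4, "rug": 0.3, "dog": 0.3}
p_infini = {"mat": 0.5, "rug": 0.5}   # e.g., from infinigram_next_token above
print(interpolate(p_neural, p_infini))
# {'mat': 0.45, 'rug': 0.4, 'dog': 0.15}  (key order may vary)
```

Because the count-based estimate is sparse, putting weight on it only helps when its longest matching suffix is long enough to be informative, which is consistent with the paper's finding that gains are largest where the neural LM alone is weak.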