Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens


4 Apr 2024 | Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
This paper presents a new approach to n-gram language models (LMs) that scales them to trillions of training tokens and removes the upper bound on n. The proposed model, called infini-gram, generalizes the traditional n-gram LM by allowing n to be arbitrarily large. Instead of precomputing n-gram count tables, which is infeasible for unbounded n, infini-gram computes probabilities on the fly from a suffix array with millisecond-level latency, enabling efficient next-token prediction over both human-written and machine-generated text.

The infini-gram engine is built over large text corpora totaling 5 trillion tokens, including the Pile dataset. The suffix array indexes the training data for fast counting queries with a storage overhead of 7 bytes per token, and it supports a wide range of query types, including conventional n-gram counting and infinite-gram (unbounded-n) language modeling.
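To make the suffix-array mechanism concrete, here is a minimal in-memory Python sketch of the core idea (the paper's engine builds the index on disk to reach trillion-token scale; the function names and toy corpus below are illustrative assumptions, not the released API): counting an n-gram reduces to locating the contiguous block of sorted suffixes that begin with it, and the infini-gram estimate backs off to the longest suffix of the context that occurs in the corpus before reading off next-token counts.

```python
import bisect

def build_suffix_array(tokens):
    # Sort all suffix start positions lexicographically (in-memory toy version;
    # the real index is constructed on disk over trillions of tokens).
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def suffix_range(tokens, sa, query):
    # Binary-search the suffix array for the block of suffixes whose first
    # len(query) tokens equal `query`; the block size is the n-gram count.
    q, n = tuple(query), len(query)
    lo = bisect.bisect_left(sa, q, key=lambda i: tuple(tokens[i:i + n]))
    hi = bisect.bisect_right(sa, q, key=lambda i: tuple(tokens[i:i + n]))
    return lo, hi

def infinigram_next_token(tokens, sa, context):
    # Back off to the longest suffix of `context` found in the corpus, then
    # return the empirical distribution of the tokens that follow it.
    for start in range(len(context) + 1):      # longest suffix first
        suffix = list(context[start:])
        lo, hi = suffix_range(tokens, sa, suffix)
        follow = {}
        for j in range(lo, hi):
            pos = sa[j] + len(suffix)
            if pos < len(tokens):
                follow[tokens[pos]] = follow.get(tokens[pos], 0) + 1
        if follow:
            total = sum(follow.values())
            return {tok: c / total for tok, c in follow.items()}
    return {}

# Toy example (word-level "tokens" for readability):
corpus = "the cat sat on the mat . the cat sat on the rug .".split()
sa = build_suffix_array(corpus)
print(infinigram_next_token(corpus, sa, "cat sat on the".split()))
# {'mat': 0.5, 'rug': 0.5}
```

The key-based bisect calls assume Python 3.10+. The real engine exposes the same two primitives, counting an arbitrary-length n-gram and reading off the next-token distribution after the longest matching suffix, as queries over the on-disk index.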
On human-written text, infini-gram achieves fairly high next-token prediction accuracy (47%), and its accuracy rises as the effective n, the length of the longest context suffix found in the corpus, grows. It also complements neural LMs: interpolating infini-gram estimates with neural LMs reduces their perplexity by up to 73%, with the largest gains on tokens where the neural LM alone performs poorly.

Analyzing machine-generated text with infini-gram, the paper finds that nucleus sampling produces text whose agreement with infini-gram most closely resembles that of human-written text, whereas greedy decoding shows large fluctuations in agreement as a function of suffix length, pointing to possible deficiencies in neural LM pretraining and in the positional embeddings of Transformers.

Overall, infini-gram is a powerful tool both for analyzing large text corpora and for improving neural LMs: it handles trillion-token datasets efficiently while providing accurate next-token statistics for human-written and machine-generated text. The paper concludes that infini-gram opens up many directions for future research and applications in natural language processing.
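The reported perplexity reductions come from mixing the infini-gram estimate with a neural LM's next-token distribution. Below is a minimal sketch of the simplest such combination, plain linear interpolation over two distributions represented as dictionaries; the fixed weight of 0.5 is an illustrative assumption, whereas the paper tunes the mixing on held-out data rather than fixing it.

```python
def interpolate(p_neural, p_infini, lam=0.5):
    # Linear interpolation of two next-token distributions:
    #   P(w | context) = lam * P_neural(w | context) + (1 - lam) * P_infini(w | context)
    # `lam` is a fixed illustrative value here; in practice it is tuned on held-out data.
    vocab = set(p_neural) | set(p_infini)
    return {w: lam * p_neural.get(w, 0.0) + (1.0 - lam) * p_infini.get(w, 0.0)
            for w in vocab}

# Example: the infini-gram estimate sharpens the neural LM's distribution when a
# long context suffix has been seen verbatim in the corpus.
p_neural = {"mat": 0.4, "rug": 0.3, "dog": 0.3}
p_infini = {"mat": 0.5, "rug": 0.5}   # e.g., from infinigram_next_token above
print(interpolate(p_neural, p_infini))
# {'mat': 0.45, 'rug': 0.4, 'dog': 0.15}  (key order may vary)
```

Because the count-based estimate is sparse, putting weight on it only helps when its longest matching suffix is long enough to be informative, which is consistent with the paper's finding that gains are largest where the neural LM alone is weak.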