30 Apr 2024 | Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
This paper introduces a training method for large language models (LLMs) that predicts several future tokens at once rather than only the next one. The model is trained to predict $n$ future tokens using $n$ independent output heads operating on a shared model trunk, which the authors argue improves sample efficiency and downstream capabilities at no extra cost in training time. The benefits grow with model size and persist over multiple training epochs, and are most pronounced on generative benchmarks such as coding: 13B-parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Multi-token prediction also enables self-speculative decoding, making inference up to 3 times faster, and the authors show that it promotes the development of induction heads and algorithmic reasoning capabilities. The paper includes a detailed analysis of the method, covering a memory-efficient implementation and self-speculative decoding, and closes by discussing the potential of multi-token prediction to improve the performance, coherence, and reasoning abilities of LLMs.
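To make the architecture concrete, here is a minimal PyTorch sketch of the described setup: a shared trunk with $n$ independent output heads, where head $i$ is trained to predict the token $i+1$ positions ahead. The class and parameter names are hypothetical, and the linear heads are a simplification of the paper's design, which uses a transformer layer per head feeding a shared unembedding matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Sketch: a shared trunk followed by n independent output heads,
    where head i predicts the token i+1 positions ahead."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk  # shared transformer trunk, assumed given
        self.n_future = n_future
        # one independent head per future offset (simplified to linear here;
        # the paper uses a transformer layer per head plus a shared unembedding)
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_future)]
        )

    def loss(self, input_ids: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """targets[:, t] is the token following position t (the usual shift);
        head i is then trained against targets shifted by i extra positions."""
        hidden = self.trunk(input_ids)  # (batch, seq, d_model), computed once
        total = torch.zeros((), device=input_ids.device)
        for i, head in enumerate(self.heads):
            # head i at position t predicts the token at position t + 1 + i
            logits = head(hidden[:, : hidden.size(1) - i])
            shifted = targets[:, i:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), shifted.reshape(-1)
            )
            # NB: the paper's memory-efficient variant runs forward + backward
            # per head sequentially so that only one head's logits are ever
            # materialized at a time; this loop keeps them all alive for clarity
        return total / self.n_future
```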
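The self-speculative decoding claim can likewise be sketched. Under greedy decoding, the extra heads draft $n$ candidate tokens in a single forward pass, and a verification pass with the ordinary next-token head keeps the longest agreeing prefix, in the style of blockwise parallel decoding (Stern et al., 2018), which the paper builds on. The function below is a hypothetical simplification reusing the `MultiTokenPredictor` sketch above and assuming batch size 1.

```python
import torch

@torch.no_grad()
def self_speculative_step(model: MultiTokenPredictor, input_ids: torch.Tensor) -> torch.Tensor:
    """One greedy self-speculative decoding step (hypothetical helper).
    The n heads draft n tokens from the last hidden state; one verification
    pass re-predicts each drafted position with head 0 (the ordinary
    next-token head) and keeps the longest agreeing prefix."""
    L = input_ids.size(1)
    # draft: heads 0..n-1 propose the tokens at positions L+1 .. L+n
    last_hidden = model.trunk(input_ids)[:, -1]
    draft = torch.stack([h(last_hidden).argmax(-1) for h in model.heads], dim=1)
    # verify: run the trunk once over prompt + draft; head 0 at index L-1+k
    # sees context input + draft[:, :k] and should reproduce draft[:, k]
    verify_hidden = model.trunk(torch.cat([input_ids, draft], dim=1))
    checks = model.heads[0](verify_hidden[:, L - 1 : -1]).argmax(-1)
    accepted = 0
    while accepted < draft.size(1) and bool(
        (checks[:, accepted] == draft[:, accepted]).all()
    ):
        accepted += 1
    if accepted < draft.size(1):
        # at the first mismatch, head 0's re-prediction is itself the exact
        # greedy token for that position, so commit it as a correction
        correction = checks[:, accepted : accepted + 1]
        return torch.cat([input_ids, draft[:, :accepted], correction], dim=1)
    return torch.cat([input_ids, draft], dim=1)
```

Each such step commits between one and $n$ tokens while the trunk, which dominates compute, runs once per verified block rather than once per token (in a tuned implementation the verification pass also seeds the next draft), which is the mechanism behind the reported up-to-3x inference speedup.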