Better & Faster Large Language Models via Multi-token Prediction

30 Apr 2024 | Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
This paper proposes training large language models (LLMs) to predict multiple future tokens at once, which improves sample efficiency and downstream performance. The architecture uses a shared model trunk with n independent output heads that predict the next n tokens in parallel. Training sums the cross-entropy losses of all heads, and a memory-efficient implementation that processes the heads sequentially keeps peak GPU memory and training time on par with a next-token baseline.

The gains are largest for bigger models and for generative tasks. On code benchmarks, multi-token prediction models outperform next-token baselines, with the 13B-parameter model solving 12% more problems on HumanEval and 17% more on MBPP. The additional heads also enable self-speculative decoding, making models trained with 4-token prediction up to 3× faster at inference, even at large batch sizes.

Multi-token prediction further improves algorithmic reasoning and generalization on arithmetic tasks, and it lifts natural language generative evaluations such as summarization without regressing on standard benchmarks. The paper concludes that multi-token prediction is a promising training objective, offering better performance, faster inference, and stronger reasoning at no extra training cost.
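A minimal sketch of the training setup described above, assuming a PyTorch-style implementation: the shared trunk runs once per batch, and each of the n heads is scored against targets shifted by its offset. All names here (MultiTokenPredictor, trunk, heads) are illustrative, not the authors' released code; the paper's heads are transformer layers sharing the unembedding matrix, whereas this sketch uses plain linear projections for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared trunk + n independent output heads (illustrative sketch)."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_future: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the shared transformer trunk.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        # One independent head per future offset (+1 ... +n).
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len); returns summed loss over all n offsets.
        causal = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)
        ).to(tokens.device)
        # The trunk is computed once and shared by all heads.
        h = self.trunk(self.embed(tokens), mask=causal)
        loss = 0.0
        for i, head in enumerate(self.heads, start=1):
            # Head i predicts the token i steps ahead, so it is scored only
            # on positions that still have a target i steps later.
            logits = head(h[:, :-i])   # (batch, seq_len - i, vocab)
            targets = tokens[:, i:]    # targets shifted by the head's offset
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return loss

# Usage: one training step on random token data.
model = MultiTokenPredictor(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 32))
loss = model(tokens)
loss.backward()
```

At inference, the extra heads can either be discarded, recovering a standard next-token model, or used as cheap draft predictions for self-speculative decoding, which is where the reported up-to-3× speedup comes from.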