The paper "The Pitfalls of Next-Token Prediction" argues that next-token prediction models, while effective at capturing token-level probabilities, may fail to acquire human-like planning abilities. The authors distinguish two issues that are often conflated: (1) the compounding of errors during autoregressive inference, and (2) the failure of teacher-forced training to learn an accurate next-token predictor in the first place on certain tasks. To isolate the second issue, they construct a minimal lookahead task, path-finding on a graph, on which both the Transformer and Mamba architectures fail to learn the correct solution despite the task being straightforward. They show that a simple modification, training the model to predict multiple tokens in advance, resolves this failure, and argue that next-token prediction during training is therefore at fault rather than the autoregressive inference process or the model architecture itself. The paper calls for a reevaluation of the next-token prediction paradigm and stresses the need to distinguish teacher-forcing from autoregressive inference. As a potential remedy, the authors propose teacherless training, in which the ground-truth answer tokens are withheld during training so that the model must predict many tokens in advance, circumventing the failure observed under teacher-forcing. The study suggests that next-token prediction may not be sufficient for complex planning tasks, such as story-writing, and that alternative training paradigms may be necessary to achieve human-like planning abilities.
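
To make the training-time distinction concrete, below is a minimal PyTorch sketch, not the authors' code, contrasting a standard teacher-forced loss with a teacherless loss in which the ground-truth answer tokens are replaced by a fixed dummy token so the model must predict them all in advance from the prefix alone. The function names, the ToyLM stand-in model, and the dummy-token convention are hypothetical illustrations, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def teacher_forced_loss(model, prefix, target):
    """Standard next-token prediction: ground-truth answer tokens are fed
    back as inputs, so each position only has to predict one step ahead."""
    inputs = torch.cat([prefix, target[:, :-1]], dim=1)
    logits = model(inputs)                              # (batch, seq, vocab)
    answer_logits = logits[:, prefix.size(1) - 1:, :]   # positions predicting the answer
    return F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.size(-1)),
        target.reshape(-1),
    )


def teacherless_loss(model, prefix, target, dummy_token_id):
    """Teacherless objective: answer positions are replaced by a dummy token,
    so the model receives no ground-truth hints and must predict the whole
    answer, many tokens in advance, from the prefix alone."""
    dummy = torch.full_like(target[:, :-1], dummy_token_id)
    inputs = torch.cat([prefix, dummy], dim=1)
    logits = model(inputs)
    answer_logits = logits[:, prefix.size(1) - 1:, :]
    return F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.size(-1)),
        target.reshape(-1),
    )


if __name__ == "__main__":
    # Toy (non-causal) stand-in for a language model, used only to exercise the shapes.
    vocab, dim = 32, 16

    class ToyLM(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(vocab, dim)
            self.head = torch.nn.Linear(dim, vocab)

        def forward(self, x):
            return self.head(self.emb(x))

    model = ToyLM()
    prefix = torch.randint(1, vocab, (4, 10))   # e.g. a graph description plus query
    target = torch.randint(1, vocab, (4, 5))    # e.g. the path the model must produce
    print(teacher_forced_loss(model, prefix, target).item())
    print(teacherless_loss(model, prefix, target, dummy_token_id=0).item())
```

Under this reading, the only change between the two objectives is what the model sees at the answer positions during training; the loss is computed over the same target tokens in both cases.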