Implicit Bias of Next-Token Prediction

February 29, 2024 | Christos Thrampoulidis
This paper studies the implicit bias of next-token prediction (NTP) training, the paradigm behind modern large language models. NTP predicts the next token in a sequence, and the paper frames training as cross-entropy minimization over $m$ distinct contexts, each associated with a sparse empirical probability vector over next tokens. The main contributions are:

1. **NTP-separability conditions**: The paper identifies conditions under which gradient descent (GD) can drive the cross-entropy loss to its lower bound, the empirical conditional entropy. These conditions hold, in particular, when the embedding dimension exceeds the number of distinct contexts, i.e., under sufficient overparameterization.
2. **Convergence to a unique solution**: For linear NTP models trained with GD, the parameters projected onto an appropriate data subspace converge to the unique solution of a system of linear equations. On the orthogonal subspace, the parameters diverge in norm and align in direction with the solution of a max-margin quadratic program.
3. **Implicit bias of GD**: Under NTP$_{\mathcal{H}}$-compatibility and NTP-separability, the GD iterates grow unboundedly in norm, yet converge to a finite solution within the data subspace while aligning with the max-margin classifier on the complementary subspace.

The paper also discusses the connection between NTP and soft-label classification, and validates the theory with experiments on synthetic data (see the sketches below). The findings open avenues for future research on optimization, generalization, and robustness in large language models.
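To make the setup concrete, the following is a sketch of the objective consistent with the summary above; the notation ($m$ distinct contexts $\bar{x}_j$ with embeddings $h(\bar{x}_j)$, empirical frequencies $\hat{\pi}_j$, and sparse next-token distributions $\hat{p}_j$ over the vocabulary $\mathcal{V}$) is reconstructed from the description rather than copied from the paper:

$$
\mathrm{CE}(W) \;=\; -\sum_{j=1}^{m} \hat{\pi}_j \sum_{z \in \mathcal{V}} \hat{p}_{j,z}\, \log \mathbb{S}_z\!\big(W\, h(\bar{x}_j)\big) \;\ge\; -\sum_{j=1}^{m} \hat{\pi}_j \sum_{z \in \mathcal{V}} \hat{p}_{j,z}\, \log \hat{p}_{j,z},
$$

where $\mathbb{S}$ denotes the softmax map and the right-hand side is the empirical conditional entropy, the lower bound that GD can approach under NTP-separability. The convergence result can then be read, loosely and by analogy with classical implicit-bias results for separable classification, as $W(t) \approx W^{\star} + \log(t)\, W^{\mathrm{mm}}$ for large $t$: the finite component $W^{\star}$ lies in the data subspace and solves the linear system matching the empirical probability ratios on each context's support, while $W^{\mathrm{mm}}$ solves the max-margin quadratic program on the orthogonal subspace (the paper's precise statement may differ in normalization and rate).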
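Below is a minimal numerical sketch of the kind of synthetic-data experiment described above, assuming random context embeddings, equal context frequencies, and two admissible next tokens per context; all names and hyperparameters here are illustrative, not the paper's. With embedding dimension $d$ larger than the number of contexts $m$, the cross-entropy loss approaches its entropy lower bound while the norm of the GD iterates keeps growing:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, m = 5, 10, 4          # vocab size, embedding dim (> m), distinct contexts

# Random context embeddings (rows) and sparse empirical next-token distributions.
H = rng.standard_normal((m, d))
P = np.zeros((m, V))
for j in range(m):
    support = rng.choice(V, size=2, replace=False)   # 2 admissible next tokens
    P[j, support] = rng.dirichlet(np.ones(2))

# Empirical conditional entropy: the lower bound of the (uniformly weighted) CE loss.
entropy = -(P[P > 0] * np.log(P[P > 0])).sum() / m

def ce_loss_and_grad(W):
    logits = H @ W.T                                  # (m, V)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)                 # softmax probabilities
    loss = -(P * np.log(S)).sum() / m
    grad = (S - P).T @ H / m                          # gradient w.r.t. W, shape (V, d)
    return loss, grad

W = np.zeros((V, d))
eta = 0.5
for t in range(20000):
    _, grad = ce_loss_and_grad(W)
    W -= eta * grad

loss, _ = ce_loss_and_grad(W)
print(f"CE loss {loss:.4f} vs entropy lower bound {entropy:.4f}")
# Under NTP-separability the iterates diverge; the norm keeps growing with more steps.
print(f"||W|| = {np.linalg.norm(W):.1f}")
```

The sketch only illustrates the loss approaching its lower bound and the norm growth; reproducing the full decomposition into $W^{\star}$ and $W^{\mathrm{mm}}$ would additionally require projecting the iterates onto the data subspace and its orthogonal complement.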