This paper investigates the implicit bias of next-token prediction (NTP), the dominant training paradigm for large language models. It frames NTP training as cross-entropy minimization over distinct contexts, each associated with a sparse empirical probability vector over next tokens, and asks whether gradient-based optimizers are biased towards specific solutions as the NTP training loss approaches its lower bound (the empirical entropy). For linear NTP models trained with gradient descent (GD), the paper makes two main contributions. First, it establishes NTP-separability conditions on the data under which GD can drive the loss to its lower bound. Second, it characterizes the implicit bias of GD under these conditions: although the GD iterates grow unboundedly in norm, their projection onto an appropriate data subspace converges to the unique solution of a system of linear equations, while their component in the orthogonal subspace diverges and aligns in direction with the solution of a max-margin quadratic program. The results are validated through experiments on synthetic data and are discussed as a starting point for future research on the optimization, generalization, and robustness principles of NTP-trained models. The paper also connects NTP to soft-label classification and relates its findings to prior work on implicit bias in one-hot prediction and in transformers.
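
To make the setup concrete, below is a minimal sketch (not taken from the paper's code) of the NTP objective for a linear model: a frequency-weighted cross-entropy over distinct contexts, each paired with a sparse empirical next-token distribution, together with its entropy lower bound and a plain GD loop. The function names, dimensions, step size, and synthetic data are illustrative assumptions; the paper's precise definitions of NTP-separability and the data subspace are not reproduced here.

```python
import numpy as np

def ntp_cross_entropy(W, X, P, pi):
    """NTP training loss for a linear model, as framed in the paper:
    a frequency-weighted cross-entropy over the m *distinct* contexts,
    each paired with a sparse empirical next-token probability vector.

    W  : (V, d) linear decoder (V = vocabulary size, d = embedding dim)
    X  : (m, d) embeddings of the distinct contexts
    P  : (m, V) empirical next-token distributions (rows are sparse)
    pi : (m,)   empirical frequency of each distinct context
    """
    logits = X @ W.T                                           # (m, V)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(pi @ (-(P * log_softmax).sum(axis=1)))

def ntp_entropy_lower_bound(P, pi):
    """Lower bound of the loss: the frequency-weighted entropy of the
    empirical next-token distributions."""
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.where(P > 0, P * np.log(P), 0.0).sum(axis=1)
    return float(pi @ h)

def ntp_grad(W, X, P, pi):
    """Gradient of ntp_cross_entropy with respect to W:
    sum_j pi_j (softmax(W x_j) - p_j) x_j^T."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)                          # softmax rows, (m, V)
    return ((S - P) * pi[:, None]).T @ X                       # (V, d)

# Tiny synthetic example: 3 distinct contexts, vocabulary of 4 tokens.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                    # context embeddings
P = np.array([[0.5, 0.0, 0.00, 0.50],          # sparse empirical next-token probs
              [0.0, 1.0, 0.00, 0.00],
              [0.0, 0.0, 0.25, 0.75]])
pi = np.array([0.5, 0.3, 0.2])                 # context frequencies
W = np.zeros((4, 5))

# Plain GD with a small step size. If the data are NTP-separable (in the
# paper's sense), the loss should creep toward the entropy lower bound while
# the norm of W keeps growing, as the summarized results describe.
for _ in range(50_000):
    W -= 0.1 * ntp_grad(W, X, P, pi)

print("loss        :", ntp_cross_entropy(W, X, P, pi))
print("lower bound :", ntp_entropy_lower_bound(P, pi))
print("||W||       :", np.linalg.norm(W))
```

Keeping only the distinct contexts with their empirical next-token distributions, rather than iterating over raw (context, token) pairs, mirrors the soft-label view of NTP emphasized in the summary; the GD loop is only a numerical probe of the claimed behavior, not a reproduction of the paper's experiments.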