The paper introduces Test-Time Training (TTT) layers, a new class of sequence modeling layers in which the hidden state is itself a model and the update rule is a step of self-supervised learning. TTT layers update the hidden state during inference, effectively training the model at test time. Two instantiations, TTT-Linear and TTT-MLP, are proposed, whose hidden states are a linear model and a two-layer MLP, respectively. Both layers are efficient in FLOPs and wall-clock time: TTT-Linear outperforms Transformers and Mamba in evaluations at model scales from 125M to 1.3B parameters, runs faster than a Transformer at 8k context, and matches Mamba in wall-clock time. TTT-MLP shows potential for long context but still faces challenges in memory I/O.
The paper also establishes theoretical connections between TTT layers and both linear attention and self-attention, and presents experiments showing that TTT layers perform well in both short- and long-context scenarios. The results suggest that TTT layers are a promising direction for future research in sequence modeling.
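To make the core idea concrete, here is a minimal sketch of a single TTT-Linear update in NumPy. It is an illustrative assumption-laden toy, not the paper's implementation: the hidden state `W` is a linear model, the inner self-supervised loss is a squared reconstruction error between projected views of the input token, and the update rule is one gradient step on that loss. The projection matrices `theta_K`, `theta_V`, `theta_Q` and the learning rate `lr` are hypothetical names standing in for the paper's learned "views".

```python
import numpy as np

def ttt_linear_step(W, x_t, theta_K, theta_V, theta_Q, lr=0.1):
    """One hypothetical TTT-Linear step: train the hidden state W on
    token x_t with a self-supervised loss, then produce the output."""
    k = theta_K @ x_t            # training view (inner-model input)
    v = theta_V @ x_t            # label view (reconstruction target)
    q = theta_Q @ x_t            # test view (used for the output)
    # Inner-loop self-supervised loss: L(W) = ||W k - v||^2
    grad = 2.0 * np.outer(W @ k - v, k)   # dL/dW
    W = W - lr * grad                     # one gradient step = state update
    z_t = W @ q                           # output token from the updated state
    return W, z_t
```

Processing a sequence then amounts to folding this step over the tokens, so the "hidden state" `W` accumulates a trained model of the context rather than a fixed-size summary vector.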