The paper introduces Test-Time Training (TTT) layers, a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself and to update it via self-supervised learning even at test time. Two instantiations are proposed: TTT-Linear, whose hidden state is a linear model, and TTT-MLP, whose hidden state is a two-layer MLP. Evaluations show that both match or exceed strong baselines such as Transformers and Mamba, a modern RNN. TTT-Linear is faster than the Transformer at 8k context and matches Mamba in wall-clock time, while TTT-MLP shows larger potential in long context. The paper also describes practical innovations for hardware efficiency, namely mini-batch TTT and a dual form of its operations, which make TTT-Linear a practical building block for LLMs.
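
As a rough illustration of the update rule described above, here is a minimal NumPy sketch of a TTT-Linear-style layer that processes one token at a time like an RNN: the hidden state is a small linear model that takes a gradient step on a self-supervised reconstruction loss for each token. The projection matrices `theta_K`, `theta_V`, `theta_Q`, the squared loss, and the learning rate are assumptions chosen for clarity; the sketch omits the mini-batch TTT and dual-form optimizations mentioned above and is not the paper's actual implementation.

```python
import numpy as np

def ttt_linear_step(W, x, theta_K, theta_V, theta_Q, lr=0.1):
    """One token step of a TTT-Linear-style layer (illustrative sketch).

    The hidden state W is itself a linear model. For each token we:
      1. form a self-supervised target (reconstruct one view of x from another),
      2. take one gradient step on W ("training" at test time),
      3. produce the output token with the updated W.
    """
    k = theta_K @ x          # training view of the token
    v = theta_V @ x          # label view of the token
    q = theta_Q @ x          # test view of the token

    # Self-supervised loss on the hidden state: 0.5 * ||W k - v||^2
    err = W @ k - v
    grad = np.outer(err, k)  # gradient of the loss w.r.t. W

    W_new = W - lr * grad    # update the hidden state (inner-loop learning)
    z = W_new @ q            # output computed with the updated hidden state
    return W_new, z

# Usage: scan over a sequence, carrying W like an RNN hidden state.
d = 16
rng = np.random.default_rng(0)
theta_K, theta_V, theta_Q = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W = np.zeros((d, d))
seq = rng.standard_normal((8, d))
outputs = []
for x in seq:
    W, z = ttt_linear_step(W, x, theta_K, theta_V, theta_Q)
    outputs.append(z)
```

Because each token costs only a fixed set of matrix-vector products regardless of how many tokens came before, the total cost grows linearly with sequence length, which is where the linear-complexity claim comes from; the mini-batch and dual-form tricks in the paper then make this update pattern efficient on real hardware.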