Learning to (Learn at Test Time): RNNs with Expressive Hidden States

2024-08-11 | Yu Sun*, Xinhao Li*, Karan Dalal*, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin
The paper introduces Test-Time Training (TTT) layers, a new class of sequence-modeling layers in which the hidden state is itself a model and the update rule is a step of self-supervised learning, so the hidden state is effectively trained at test time. Two instantiations are proposed: TTT-Linear, whose hidden state is a linear model, and TTT-MLP, whose hidden state is a two-layer MLP. Both are efficient in FLOPs, and in evaluations from 125M to 1.3B parameters TTT-Linear outperforms Transformers and Mamba. In wall-clock time, TTT-Linear is already faster than the Transformer at 8k context and matches Mamba. TTT-MLP shows further potential for long context but still faces challenges in memory I/O.
The paper also discusses the theoretical equivalence of TTT layers with linear attention and self-attention, and presents experiments showing that TTT layers perform well in both short- and long-context settings. The results suggest that TTT layers are a promising direction for future research in sequence modeling.
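To make the core idea concrete, here is a minimal NumPy sketch of the mechanism the summary describes: the hidden state is the weight matrix of an inner linear model, and processing each token takes one gradient step on a self-supervised reconstruction loss before producing the output. The projection names (`theta_K`, `theta_V`, `theta_Q`), shapes, learning rate, and initialization below are illustrative assumptions for a toy sketch, not the paper's exact implementation.

```python
import numpy as np

def ttt_linear_step(W, x, theta_K, theta_V, theta_Q, lr=0.1):
    """One Test-Time Training step with a linear hidden-state model.

    The hidden state W is itself a linear model. For each token x it is
    updated by one gradient step on the self-supervised loss
    0.5 * ||W @ (theta_K @ x) - theta_V @ x||^2, then queried with
    theta_Q @ x to produce the output token. All names and shapes here
    are illustrative, not the paper's exact parameterization.
    """
    k = theta_K @ x          # "training" view of the token
    v = theta_V @ x          # "label" view (reconstruction target)
    err = W @ k - v          # prediction error of the inner model
    grad = np.outer(err, k)  # gradient of the loss with respect to W
    W = W - lr * grad        # one step of learning at test time
    q = theta_Q @ x          # "test" view used to read the output
    return W, W @ q          # updated hidden state and output token

def ttt_linear_sequence(xs, d, lr=0.1, seed=0):
    """Run the toy TTT-Linear layer over a sequence of d-dim tokens."""
    rng = np.random.default_rng(seed)
    theta_K = rng.standard_normal((d, d)) / np.sqrt(d)
    theta_V = rng.standard_normal((d, d)) / np.sqrt(d)
    theta_Q = rng.standard_normal((d, d)) / np.sqrt(d)
    W = np.zeros((d, d))     # hidden state: weights of the inner model
    outputs = []
    for x in xs:
        W, z = ttt_linear_step(W, x, theta_K, theta_V, theta_Q, lr)
        outputs.append(z)
    return outputs
```

In this sketch the recurrence is the inner gradient step itself: unlike a fixed-size vector state, the hidden state compresses context by fitting a model to the tokens seen so far, which is what makes the state "expressive."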