Transformers are Multi-State RNNs

18 Jun 2024 | Matanel Oren*, Michael Hassid*, Nir Yarden, Yossi Adi, Roy Schwartz
Transformers are conceptualized as multi-state RNNs. This paper shows that decoder-only transformers can be viewed as unbounded multi-state RNNs (MSRNNs) with an unlimited number of states, and that limiting the number of states converts them into bounded MSRNNs, effectively compressing their key-value cache. The authors introduce Token Omission Via Attention (TOVA), a training-free compression policy that retains the states with the highest attention scores.

Experiments on four long-range tasks with several large language models (LLMs) show that TOVA outperforms existing compression policies, matching the full model while using only 1/8 of the original cache size and yielding a 4.8X increase in throughput. The results highlight the connection between transformers and RNNs, help mitigate the computational bottleneck of the LLM key-value cache, and enable processing of long inputs of up to 70K tokens. The analysis also shows that not all recent tokens need to be retained; some can be safely dropped. Overall, the findings suggest that while transformers are unbounded MSRNNs in principle, LLMs often behave as bounded MSRNNs in practice, with practical gains of up to an 88% reduction in cache size.
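As a rough illustration of the policy described above (a sketch, not the authors' reference implementation), the snippet below maintains a bounded key-value cache and, at each decoding step, keeps only the cached tokens that receive the highest attention weights from the newest query, averaged over heads. The function name `tova_evict`, the tensor shapes, and the averaging choice are assumptions made for this example.

```python
import torch

def tova_evict(keys, values, attn_weights, cache_budget):
    """One TOVA-style eviction step (illustrative sketch, not the paper's code).

    keys, values:  [num_heads, seq_len, head_dim] -- current key-value cache
    attn_weights:  [num_heads, seq_len]           -- attention of the newest
                                                     query over cached tokens
    cache_budget:  maximum number of tokens to keep
    """
    seq_len = keys.shape[1]
    if seq_len <= cache_budget:
        return keys, values  # cache still fits; nothing to drop

    # Average attention over heads, keep the highest-scoring tokens,
    # and restore their original order before re-indexing the cache.
    scores = attn_weights.mean(dim=0)                          # [seq_len]
    keep = torch.topk(scores, cache_budget).indices.sort().values
    return keys[:, keep, :], values[:, keep, :]


# Minimal usage example with random tensors (shapes are assumptions).
if __name__ == "__main__":
    num_heads, seq_len, head_dim, budget = 8, 512, 64, 256
    k = torch.randn(num_heads, seq_len, head_dim)
    v = torch.randn(num_heads, seq_len, head_dim)
    attn = torch.softmax(torch.randn(num_heads, seq_len), dim=-1)
    k, v = tova_evict(k, v, attn, budget)
    print(k.shape)  # torch.Size([8, 256, 64]) -- cache bounded to the budget
```

In the setting described by the paper this eviction would run per layer at every decoding step once the cache reaches its budget, so the cache size stays fixed while generation proceeds.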