18 Jun 2024 | Matanel Oren*, Michael Hassid*, Nir Yarden, Yossi Adi, Roy Schwartz
Transformers are often considered conceptually different from recurrent neural networks (RNNs). This work demonstrates that decoder-only transformers can be conceptualized as unbounded multi-state RNNs (MSRNNs), which have an unlimited number of states corresponding to the tokens in the sequence. By limiting the number of states, transformers can be converted into bounded MSRNNs, effectively compressing their key-value cache. The authors introduce Token Omission Via Attention (TOVA), a novel training-free compression policy that retains the states with the highest attention scores. Experiments on four long-range tasks and several large language models (LLMs) show that TOVA outperforms existing compression policies, achieving performance comparable to the full model while using only 1/8 of the original cache size, which translates to a 4.8X increase in throughput. The findings shed light on the connection between transformers and RNNs and help mitigate the computational bottleneck of LLMs' key-value cache.
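To make the policy concrete, here is a minimal sketch of a TOVA-style eviction step for a single attention head, not the authors' implementation; the function name `tova_evict`, the tensor shapes, and the per-head treatment are assumptions for illustration only.

```python
import torch

def tova_evict(keys, values, attn_weights, cache_limit):
    """Toy sketch of a TOVA-style eviction step (assumed interface).

    keys, values: [num_tokens, head_dim] cached states for one head.
    attn_weights: [num_tokens] attention scores of the current query
                  over the cached tokens (softmax-normalized).
    cache_limit:  maximum number of states (tokens) to keep.
    """
    num_tokens = keys.size(0)
    if num_tokens <= cache_limit:
        # Cache is still within the bound; nothing to drop.
        return keys, values
    # Keep the cache_limit states the current query attends to most strongly;
    # the rest are discarded. No training or fine-tuning is involved.
    keep = torch.topk(attn_weights, cache_limit).indices.sort().values
    return keys[keep], values[keep]
```

In this sketch the selection is applied at every decoding step once the cache exceeds the bound, so the cache behaves like the bounded multi-state RNN described above: a fixed number of states, refreshed according to attention.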