18 Jun 2024 | Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe
The paper "Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters" by Zhiyu Guo, Hidetaka Kamigaito, and Taro Watanabe from the Nara Institute of Science and Technology examines the limitations of relying solely on attention scores for token pruning in large language models (LLMs). The authors observe that the norms of value vectors are distributed non-uniformly across tokens, so attention scores alone can misjudge how much a token contributes to the attention output. To address this, they propose Value-Aware Token Pruning (VATP), which evaluates token importance by combining each token's attention score with the $\ell_1$ norm of its value vector. Experiments on LLaMA2-7B-chat and Vicuna-v1.5-7B across 16 LongBench tasks show that VATP outperforms attention-score-only pruning methods on the majority of tasks. The study highlights the critical role of value vectors in KV cache reduction, challenging the prevailing assumption that attention scores alone suffice as a token importance indicator.
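A minimal sketch of the scoring idea described above, assuming per-token accumulated attention scores and cached value vectors are already available; the function name and shapes here are illustrative, not the authors' released code:

```python
import torch

def vatp_scores(attn_scores: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Sketch of a value-aware token importance score.

    attn_scores: (num_tokens,) accumulated attention each cached token received.
    values:      (num_tokens, head_dim) cached value vectors for one head.

    Scales each token's attention score by the l1 norm of its value vector,
    so tokens with small value vectors are deprioritized even if they
    attract substantial attention.
    """
    value_l1 = values.abs().sum(dim=-1)  # (num_tokens,) l1 norm per token
    return attn_scores * value_l1

# Usage: keep the top-k scoring tokens when evicting from the KV cache.
num_tokens, head_dim, budget = 128, 64, 32
attn = torch.rand(num_tokens)             # stand-in accumulated attention scores
vals = torch.randn(num_tokens, head_dim)  # stand-in cached value vectors
keep = torch.topk(vatp_scores(attn, vals), k=budget).indices
```

Under this scheme, pruning decisions depend on both how often a token is attended to and how large its value vector is, which is the paper's core departure from attention-only indicators.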