21 Jun 2024 | Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
This paper proposes MoA, a training-free sparse attention method that automatically tailors distinct sparse attention configurations to different heads and layers in large language models (LLMs). MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5–7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%–36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2–1.4× GPU memory reduction and boosts decode throughput by 5.5–6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.
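The length-adaptive sparsity described above can be pictured as a per-head causal sliding-window mask whose span follows a head-specific rule of the input length. Below is a minimal sketch under stated assumptions: the linear form span = alpha + beta * N, the function names, and the two example heads are illustrative choices, not MoA's exact formulation.

```python
import torch

def elastic_span(alpha: int, beta: float, input_length: int) -> int:
    # Assumed elastic rule: the head's local attention span grows
    # linearly with the input length N, span = alpha + beta * N.
    return min(input_length, int(alpha + beta * input_length))

def sliding_window_mask(input_length: int, span: int) -> torch.Tensor:
    # Causal sliding-window mask: each query attends to itself and the
    # (span - 1) most recent key positions.
    idx = torch.arange(input_length)
    offset = idx[:, None] - idx[None, :]      # query index minus key index
    return (offset >= 0) & (offset < span)    # causal and within the window

# Two hypothetical heads, mirroring the observation above: one keeps a
# fixed local window regardless of N, the other expands with the input.
for name, (alpha, beta) in {"local head": (256, 0.0), "elastic head": (64, 0.5)}.items():
    for n in (1024, 8192):
        span = elastic_span(alpha, beta, n)
        mask = sliding_window_mask(n, span)
        print(f"{name}: N={n}, span={span}, kept fraction={mask.float().mean():.3f}")
```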
MoA introduces heterogeneous elastic rules for attention masks, assigning each attention head its own rule that scales the local attention span with the input length. It also emphasizes the importance of data engineering in LLM compression, using datasets with long-range dependencies and referencing the original LLM’s responses to accurately profile the influence of compression. Finally, MoA proposes an automatic pipeline that finds the optimal compression plan, i.e., the combination of elastic rules across attention heads, within several hours, for example, about two hours for compressing Vicuna-13B.
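As a hedged sketch of how such an automatic search could look: profile the accuracy loss of each candidate elastic rule for each head on the long-dependency calibration data (scored against the original dense model's responses), then pick one rule per head so that the average density stays under a budget. The greedy selection and the `profile_loss` and `density_of` helpers below are illustrative assumptions, not the paper's exact optimization procedure.

```python
from itertools import product

def choose_plan(num_heads, candidate_rules, profile_loss, density_of, density_budget):
    """Pick one elastic rule per head under an average-density budget.

    profile_loss(head, rule) -> profiled accuracy loss of applying `rule` to `head`
    density_of(rule)         -> expected density (kept fraction) of `rule`
    Greedy heuristic: start every head at the sparsest rule, then repeatedly
    apply the head/rule upgrade with the best loss reduction per unit of
    added density, as long as the average density stays within the budget.
    """
    sparsest = min(candidate_rules, key=density_of)
    plan = {h: sparsest for h in range(num_heads)}

    def avg_density(p):
        return sum(density_of(r) for r in p.values()) / num_heads

    improved = True
    while improved:
        improved = False
        best = None
        for h, r in product(range(num_heads), candidate_rules):
            extra = density_of(r) - density_of(plan[h])   # added density for this head
            gain = profile_loss(h, plan[h]) - profile_loss(h, r)  # loss reduction
            if extra <= 0 or gain <= 0:
                continue
            if avg_density(plan) + extra / num_heads > density_budget:
                continue
            score = gain / extra
            if best is None or score > best[0]:
                best = (score, h, r)
        if best is not None:
            _, h, r = best
            plan[h] = r
            improved = True
    return plan
```

In practice, `profile_loss` would be obtained by evaluating each candidate mask on the long-range-dependency calibration data against the dense model's responses, as described above.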
Experiments show that MoA achieves 5.5–6.7× throughput improvements on 7B and 13B dense LLMs at 50% density (density being the KV-cache length divided by the input length, averaged over attention heads), with only a 1% average relative degradation in retrieval accuracy. Additionally, MoA achieves over 90% retrieval accuracy at just 25% average density, far surpassing sparse attention baselines that need 75%–100% density for similar performance. On long-context understanding benchmarks, MoA performs comparably to dense models, with a maximum relative performance drop of less than 5%, about one-sixth of that observed with the uniform sparse attention baseline. The code is available at https://github.com/thu-nics/MoA.
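To make the density figures concrete, here is a small hedged example (the per-head KV-cache lengths are made up) of how heterogeneous spans translate into the average density reported above:

```python
# Hypothetical per-head KV-cache lengths for an 8192-token input.
input_length = 8192
kv_lengths = [1024, 2048, 2048, 3072]   # four heads with different spans

# Density of each head = KV-cache length / input length; the reported
# "average density" is the mean over heads (here 0.25, i.e., 25%).
densities = [k / input_length for k in kv_lengths]
average_density = sum(densities) / len(densities)
print(average_density)  # 0.25
```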