21 Jun 2024 | Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang
The paper "Mixture of Sparse Attention for Automatic Large Language Model Compression" introduces MoA (Mixture of Attention), a method that automatically tailors sparse attention configurations to different heads and layers in large language models (LLMs). MoA addresses the limitations of uniform sparse attention masks, which fail to capture the diverse attention patterns inherent in LLMs and ignore their distinct accuracy-latency trade-offs. By constructing and navigating a search space of various attention patterns and their scaling rules relative to input sequence lengths, MoA profiles the model, evaluates potential configurations, and identifies the optimal sparse attention compression plan. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosts retrieval accuracy by 1.5–7.1× over uniform-attention baselines across Vicuna-7B, Vicuna-13B, and Llama3-8B models, and narrows the capability gap between sparse and dense models, reducing the maximum relative performance drop from 9%–36% to within 5% on long-context understanding benchmarks. Additionally, MoA achieves a 1.2–1.4× GPU memory reduction and boosts decode throughput by 5.5–6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance. The method is evaluated on various benchmarks and shows superior performance and efficiency compared to state-of-the-art sparse attention methods.The paper "Mixture of Sparse Attention for Automatic Large Language Model Compression" introduces MoA (Mixture of Attention), a method that automatically tailors sparse attention configurations to different heads and layers in large language models (LLMs). MoA addresses the limitations of uniform sparse attention masks, which fail to capture the diverse attention patterns inherent in LLMs and ignore their distinct accuracy-latency trade-offs. By constructing and navigating a search space of various attention patterns and their scaling rules relative to input sequence lengths, MoA profiles the model, evaluates potential configurations, and identifies the optimal sparse attention compression plan. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosts retrieval accuracy by 1.5–7.1× over uniform-attention baselines across Vicuna-7B, Vicuna-13B, and Llama3-8B models, and narrows the capability gap between sparse and dense models, reducing the maximum relative performance drop from 9%–36% to within 5% on long-context understanding benchmarks. Additionally, MoA achieves a 1.2–1.4× GPU memory reduction and boosts decode throughput by 5.5–6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance. The method is evaluated on various benchmarks and shows superior performance and efficiency compared to state-of-the-art sparse attention methods.