SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

19 Jul 2024 | Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim
This paper introduces SLEB (Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks), a novel approach that streamlines large language models (LLMs) by eliminating redundant transformer blocks. The approach rests on the observation that the outputs of neighboring transformer blocks are often highly similar, indicating block-level redundancy that conventional pruning techniques fail to exploit: they rarely achieve significant real-world speedup without compromising linguistic performance. SLEB addresses this by selecting and removing redundant blocks using a refined metric that assesses each block's impact on token predictions.

The method is evaluated on various models, including OPT and LLaMA-2, and compared with existing techniques such as 2:4 pruning, channel-wise pruning, and early-exit strategies. Experimental results show that SLEB can remove up to 20% of transformer blocks while maintaining or improving perplexity and accuracy. SLEB also delivers superior speedup in end-to-end LLM inference, particularly in multi-batch scenarios, and is compatible with post-training quantization techniques. The code for SLEB is available at <https://github.com/jiwonsong-dev/SLEB>.
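To make the block-elimination idea concrete, below is a minimal, hypothetical sketch of greedy transformer-block removal guided by a calibration metric. It is not the authors' implementation: the metric here is plain causal-LM loss on a single calibration sentence (a stand-in for SLEB's token-prediction-based redundancy metric), and the model name, layer path (OPT-style `model.model.decoder.layers`), and helper functions `calib_loss` and `drop_block` are illustrative assumptions.

```python
# Hypothetical sketch of greedy block removal; NOT the official SLEB code.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-125m"   # small model chosen only for illustration
NUM_BLOCKS_TO_REMOVE = 2           # e.g. roughly 20% of a 12-block model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

calib_text = "Large language models are built from stacked transformer blocks."
inputs = tokenizer(calib_text, return_tensors="pt")

@torch.no_grad()
def calib_loss(m):
    """Causal-LM loss on the calibration text (proxy for a token-prediction impact metric)."""
    return m(**inputs, labels=inputs["input_ids"]).loss.item()

def drop_block(m, idx):
    """Return a copy of the model with transformer block `idx` removed (OPT-style layer list)."""
    pruned = copy.deepcopy(m)
    layers = pruned.model.decoder.layers
    del layers[idx]
    pruned.config.num_hidden_layers = len(layers)
    return pruned

# Greedy elimination: at each step, remove the block whose removal hurts the metric least.
for _ in range(NUM_BLOCKS_TO_REMOVE):
    n_layers = len(model.model.decoder.layers)
    best_idx, best_loss = None, float("inf")
    for idx in range(n_layers):
        loss = calib_loss(drop_block(model, idx))
        if loss < best_loss:
            best_idx, best_loss = idx, loss
    model = drop_block(model, best_idx)
    print(f"removed block {best_idx}, calibration loss = {best_loss:.3f}")
```

Because whole blocks are removed rather than individual weights being sparsified, the pruned model keeps dense matrix shapes, which is why this style of pruning translates directly into end-to-end inference speedup and composes with post-training quantization.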