19 Jul 2024 | Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim
SLEB is a novel approach for streamlining large language models (LLMs) by eliminating redundant transformer blocks without compromising their linguistic capabilities. The key observation is that the outputs of neighboring transformer blocks are often highly similar, which makes whole blocks a natural unit for pruning. SLEB scores each block with a significance metric and iteratively removes the least significant one, so that the remaining blocks preserve performance in terms of both perplexity and downstream accuracy. Experimental results show that SLEB outperforms previous pruning methods at accelerating LLM inference while maintaining superior quality. The method is implemented in PyTorch and evaluated on a range of LLMs, including OPT and LLaMA-2; it achieves significant speedups in both the prompt processing and token generation stages and is compatible with post-training quantization. The code is available at https://github.com/jiwonsong-dev/SLEB.
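The greedy remove-and-rescore loop described above can be sketched in a few lines of PyTorch. The sketch below is illustrative rather than SLEB's exact algorithm: it assumes a LLaMA-style Hugging Face causal LM whose blocks live in model.model.layers, and it uses plain calibration perplexity as a stand-in for the paper's significance metric. The calibration_perplexity helper and the argument names are hypothetical.

```python
import torch

def calibration_perplexity(model, calib_batches):
    """Hypothetical helper: mean perplexity of `model` over calibration batches."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for input_ids in calib_batches:
            out = model(input_ids, labels=input_ids)  # HF causal LM loss
            n = input_ids.numel()
            total_loss += out.loss.item() * n
            total_tokens += n
    return torch.exp(torch.tensor(total_loss / total_tokens)).item()

def greedy_block_prune(model, calib_batches, num_blocks_to_remove):
    """Iteratively drop the transformer block whose removal hurts
    calibration perplexity the least (one block per round, as in SLEB's
    iterative scheme; the scoring function here is a simplification)."""
    for _ in range(num_blocks_to_remove):
        layers = model.model.layers  # LLaMA-style; OPT keeps blocks elsewhere
        best_idx, best_ppl = None, float("inf")
        for i in range(len(layers)):
            # Temporarily remove block i and score the remaining network.
            model.model.layers = torch.nn.ModuleList(
                [blk for j, blk in enumerate(layers) if j != i]
            )
            ppl = calibration_perplexity(model, calib_batches)
            model.model.layers = layers  # restore before trying the next block
            if ppl < best_ppl:
                best_idx, best_ppl = i, ppl
        # Permanently remove the least significant block this round.
        model.model.layers = torch.nn.ModuleList(
            [blk for j, blk in enumerate(layers) if j != best_idx]
        )
    return model
```

Removing blocks one round at a time, rather than ranking all blocks once and deleting the bottom k in one shot, matters because significance scores shift after each removal: two neighboring blocks may each be individually redundant while jointly carrying information, and the iterative loop avoids pruning both.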