Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods


23 Jun 2024 | Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song
This paper presents a study on depth pruning for large language models (LLMs), demonstrating that simple depth pruning, which removes entire Transformer blocks, can compress LLMs while matching or exceeding the zero-shot performance of recent width pruning methods. Because it reduces the number of sequential layers rather than narrowing each layer, depth pruning significantly improves inference speed, especially under memory-constrained conditions that force small batch sizes.

The authors also compare retraining methods for the pruned models. Continued pretraining (CPT) on a large corpus outperforms LoRA-based tuning, particularly at severe pruning ratios, and CPT followed by LoRA retraining improves performance further. The study focuses on structured pruning, which removes whole groups of weights and therefore enables hardware-agnostic acceleration.

Evaluated on LLaMA-7B and Vicuna models, the proposed depth pruning achieves zero-shot performance comparable to, or better than, existing width pruning methods while delivering significantly faster inference, and the pruned models retrained with LoRA retain these speedups. The method is also compatible with quantization: 4-bit GPTQ can be applied to the pruned models without significant degradation in zero-shot performance. A comparison of pruning granularities shows that removing entire Transformer blocks generally yields better results than removing individual MHA and FFN modules.

The authors conclude that depth pruning is a compelling option for compressing LLMs and that, depending on the retraining setup, their method matches or surpasses prior studies. They highlight the importance of continued pretraining for severely pruned models and the effectiveness of combining CPT with LoRA retraining for performance recovery. As limitations, they note that models exceeding 13B parameters were not tested and that different training corpora and hyperparameters remain to be explored. Minimal sketches of the block-removal step and of LoRA-based retraining are given below.
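To make the block-removal idea concrete, the following is a minimal sketch using Hugging Face Transformers, not the authors' code: the checkpoint name, the pruning budget, the output path, and the per-block importance scores are placeholder assumptions. In the paper, block importance is estimated from a calibration set before any blocks are removed.

```python
# Minimal sketch of depth pruning for a LLaMA-style model (not the authors' code).
# Assumptions: a placeholder checkpoint name, dummy per-block importance scores,
# and a fixed pruning budget.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layers = model.model.layers                # nn.ModuleList of Transformer blocks
num_blocks = len(layers)

# Placeholder importance scores (one per block); the paper derives these from a
# calibration set, e.g. how much quality drops when a block is skipped.
importance = [float(i) for i in range(num_blocks)]  # dummy values only

num_to_remove = 8                          # assumed pruning budget
prune_idx = set(sorted(range(num_blocks), key=lambda i: importance[i])[:num_to_remove])

# Keep only the surviving blocks, preserving their original order.
kept = [layers[i] for i in range(num_blocks) if i not in prune_idx]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

# Recent transformers versions index the KV cache by layer; reassign if present.
for new_idx, layer in enumerate(model.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = new_idx

model.save_pretrained("llama-7b-depth-pruned")  # hypothetical output path
```

Because whole blocks are dropped, the remaining weights keep their dense shapes, which is why the resulting model runs faster on ordinary hardware without any sparse kernels.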
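LoRA retraining of the pruned model can then be set up with the PEFT library. This is again a minimal sketch rather than the authors' recipe: the rank, alpha, dropout, and target modules are assumed values, and the training loop itself (dataset, optimizer, schedule) is omitted.

```python
# Minimal sketch of LoRA retraining for the pruned model (not the authors' recipe).
# The rank, alpha, dropout, and target modules are assumed, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "llama-7b-depth-pruned",               # hypothetical pruned checkpoint from above
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=8,                                    # assumed LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable

# Retraining (e.g. with transformers.Trainer on a text corpus) is omitted here;
# per the paper, LoRA suffices at moderate pruning ratios, while continued
# pretraining on a large corpus is needed at severe ratios.
```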