Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods


23 Jun 2024 | Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song
This paper presents a study on depth pruning for large language models (LLMs), demonstrating that simple depth pruning, which removes entire Transformer blocks, can compress LLMs while matching or exceeding the zero-shot performance of recent width pruning methods. Because it reduces the number of sequential layers rather than narrowing each layer, depth pruning significantly improves inference speed, especially under memory-constrained conditions that force small batch sizes.

The authors also compare retraining methods for the pruned models. Continued pretraining (CPT) on a large corpus outperforms LoRA-based tuning, particularly at severe pruning ratios, and CPT followed by LoRA retraining improves performance further. The study focuses on structured pruning, which removes whole groups of weights and therefore enables hardware-agnostic acceleration.

Evaluated on LLaMA-7B and Vicuna models, the proposed depth pruning achieves zero-shot performance comparable to, or better than, existing width pruning methods while delivering significantly faster inference, and the pruned models retrained with LoRA retain these speedups. The method is also compatible with quantization: 4-bit GPTQ can be applied to the pruned models without significant degradation in zero-shot performance. A comparison of pruning granularities shows that removing entire Transformer blocks generally yields better results than removing individual MHA and FFN modules.

The authors conclude that depth pruning is a compelling option for compressing LLMs and that, depending on the retraining setup, their method matches or surpasses prior studies. They highlight the importance of continued pretraining for severely pruned models and the effectiveness of combining CPT with LoRA retraining for performance recovery. As limitations, they note that models exceeding 13B parameters were not tested and that different training corpora and hyperparameters remain to be explored. Minimal sketches of the block-removal step and of LoRA-based retraining are given below.
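To make the block-removal idea concrete, the following is a minimal sketch using Hugging Face Transformers, not the authors' code: the checkpoint name, the pruning budget, the output path, and the per-block importance scores are placeholder assumptions. In the paper, block importance is estimated from a calibration set before any blocks are removed.

```python
# Minimal sketch of depth pruning for a LLaMA-style model (not the authors' code).
# Assumptions: a placeholder checkpoint name, dummy per-block importance scores,
# and a fixed pruning budget.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layers = model.model.layers                # nn.ModuleList of Transformer blocks
num_blocks = len(layers)

# Placeholder importance scores (one per block); the paper derives these from a
# calibration set, e.g. how much quality drops when a block is skipped.
importance = [float(i) for i in range(num_blocks)]  # dummy values only

num_to_remove = 8                          # assumed pruning budget
prune_idx = set(sorted(range(num_blocks), key=lambda i: importance[i])[:num_to_remove])

# Keep only the surviving blocks, preserving their original order.
kept = [layers[i] for i in range(num_blocks) if i not in prune_idx]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

# Recent transformers versions index the KV cache by layer; reassign if present.
for new_idx, layer in enumerate(model.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = new_idx

model.save_pretrained("llama-7b-depth-pruned")  # hypothetical output path
```

Because whole blocks are dropped, the remaining weights keep their dense shapes, which is why the resulting model runs faster on ordinary hardware without any sparse kernels.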
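LoRA retraining of the pruned model can then be set up with the PEFT library. This is again a minimal sketch rather than the authors' recipe: the rank, alpha, dropout, and target modules are assumed values, and the training loop itself (dataset, optimizer, schedule) is omitted.

```python
# Minimal sketch of LoRA retraining for the pruned model (not the authors' recipe).
# The rank, alpha, dropout, and target modules are assumed, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "llama-7b-depth-pruned",               # hypothetical pruned checkpoint from above
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=8,                                    # assumed LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable

# Retraining (e.g. with transformers.Trainer on a text corpus) is omitted here;
# per the paper, LoRA suffices at moderate pruning ratios, while continued
# pretraining on a large corpus is needed at severe ratios.
```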