9 Feb 2024 | Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar
The paper "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" addresses the challenge of making large language models (LLMs) more accessible to practitioners with limited hardware resources. The authors propose Bonsai, a gradient-free, perturbative pruning method that can produce small, fast, and accurate pruned models using only forward passes through the original model. Bonsai aims to empower practitioners to prune models to a size that their hardware can handle for inference, addressing the gap between the resources available to lay practitioners and those endowed institutions.
Key contributions of Bonsai include:
1. **Memory-Friendly Pruning**: Unlike gradient-based pruning methods, Bonsai requires no backward passes and thus avoids their memory overhead, making it accessible to practitioners with limited resources.
2. **Efficient Pruning Decisions**: Bonsai estimates module importance by generating sub-models, evaluating their performance, and solving an under-determined regression problem to infer the relevance of individual modules (a minimal sketch of this idea appears after this list).
3. **Global Pruning**: Unlike layer-by-layer pruning, Bonsai takes a holistic view of the model, ensuring that modules across layers are removed and evaluated together to preserve accuracy.
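To make the perturbative idea concrete, here is a minimal sketch of forward-pass-only structured pruning in the spirit of Bonsai. It is not the authors' implementation: the module granularity, the random sub-model sampling scheme, the ridge regularizer, and the helper `evaluate_model` are illustrative assumptions. In practice, `evaluate_model` would run the masked LLM on calibration data and return a utility score (e.g., negative perplexity); here it is a toy stand-in so the snippet runs on its own.

```python
# Minimal sketch of perturbative, forward-pass-only structured pruning
# (Bonsai-inspired). Assumptions, not the paper's exact algorithm:
# module granularity, sampling scheme, and ridge regularizer are illustrative.
import numpy as np

rng = np.random.default_rng(0)
num_modules = 64          # prunable units across all layers (heads, MLP channels, ...)
num_submodels = 32        # fewer samples than modules -> under-determined regression
keep_fraction = 0.75      # fraction of modules kept in each sampled sub-model
target_sparsity = 0.5     # fraction of modules to remove globally

# Toy ground-truth relevances, used only so the stand-in evaluator returns something.
true_relevance = rng.exponential(scale=1.0, size=num_modules)

def evaluate_model(mask: np.ndarray) -> float:
    """Stand-in for forward passes of the masked model on calibration data."""
    return float(mask @ true_relevance + rng.normal(scale=0.1))

# 1) Sample sub-models (binary masks over modules) and score them with forward passes only.
masks = (rng.random((num_submodels, num_modules)) < keep_fraction).astype(float)
scores = np.array([evaluate_model(m) for m in masks])

# 2) Estimate per-module relevance via ridge-regularized least squares:
#    score_i ~ sum_j mask_ij * relevance_j. Ridge handles the under-determined case.
lam = 1e-2
relevance = np.linalg.solve(masks.T @ masks + lam * np.eye(num_modules), masks.T @ scores)

# 3) Prune globally: drop the lowest-relevance modules, regardless of layer.
num_to_prune = int(target_sparsity * num_modules)
pruned = np.argsort(relevance)[:num_to_prune]
final_mask = np.ones(num_modules)
final_mask[pruned] = 0.0
print(f"Pruned {num_to_prune}/{num_modules} modules; kept {int(final_mask.sum())}.")
```

Because fewer sub-models are sampled than there are modules, the regression is under-determined, which is why some regularization (here ridge, as an assumed choice) is needed; the final pruning step then removes the lowest-scoring modules across the whole model rather than layer by layer.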
Experiments demonstrate that Bonsai:
- Achieves comparable performance to semi-structured pruning methods like Wanda but with faster inference.
- Outperforms gradient-based structured pruning methods like LLM-Pruner and LoRAPrune on multiple evaluation settings.
- Can produce a sub-2B-parameter model that outperforms the best sub-2B model on the Hugging Face Open LLM Leaderboard on 4 out of 6 tasks.
The paper also discusses limitations and future work, including the need for adaptive sampling and dynamic fine-tuning during pruning. Overall, Bonsai represents a significant advancement in making LLMs more accessible and efficient for a broader range of users.