9 Feb 2024 | Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar
Bonsai is a gradient-free, perturbative pruning method for large language models (LLMs) that uses only forward passes to produce small, fast, and accurate pruned models. It is designed for practitioners with limited hardware who need to prune models too large to fit in their available memory: because it never computes gradients, it avoids the memory overhead of gradient-based optimization entirely.

Bonsai estimates the importance of each prunable module by sampling sub-models with subsets of modules removed, evaluating their performance with forward passes, and inferring per-module importance from those evaluations. Informative priors guide which sub-models are sampled, and pruning proceeds iteratively, removing the least important modules in stages. Bonsai also supports post-pruning adaptation through distillation for further performance gains. A sketch of the core loop follows below.

Bonsai outperforms existing structured pruning methods in both speed and accuracy. Using a single A6000 GPU, it produces a sub-2B-parameter model that achieves state-of-the-art performance on four of six tasks on the Hugging Face Open LLM leaderboard, and its pruned models are significantly faster and more accurate than those of competing methods while performing well across a variety of tasks.

Its low memory footprint makes Bonsai applicable to a wide range of LLMs, and the work highlights the role of efficient pruning techniques in making LLMs accessible to a broader audience.
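To make the procedure concrete, here is a minimal, self-contained sketch of the forward-pass-only importance estimation described above: sample sub-models by masking modules, score each with forward passes, and regress performance on module inclusion to rank modules. The toy model, the ridge-regression estimator, and all names and constants are illustrative assumptions, not Bonsai's actual implementation.

```python
# Minimal sketch of perturbative, forward-pass-only importance estimation
# in the spirit of Bonsai. The toy "model" below is a hypothetical stand-in;
# the real method masks LLM structures such as attention heads.
import numpy as np

rng = np.random.default_rng(0)

N_MODULES = 32          # prunable units in the model
N_SUBMODELS = 200       # perturbed sub-models to evaluate
KEEP_FRAC = 0.75        # fraction of modules kept in each sub-model

# Hypothetical ground-truth utilities, hidden from the pruner; the toy
# evaluation just measures how much utility a mask retains, plus noise.
true_utility = rng.exponential(scale=1.0, size=N_MODULES)

def evaluate(mask: np.ndarray) -> float:
    """Forward-pass proxy: returns sub-model performance under `mask`."""
    return float(mask @ true_utility + rng.normal(scale=0.1))

# 1. Sample sub-models from a prior over masks (uniform here; Bonsai uses
#    informative priors to pick more promising sub-models).
masks = (rng.random((N_SUBMODELS, N_MODULES)) < KEEP_FRAC).astype(float)

# 2. Evaluate each sub-model using forward passes only.
scores = np.array([evaluate(m) for m in masks])

# 3. Regress performance on module inclusion to estimate per-module
#    importance (ridge regularization keeps the solve well-conditioned).
lam = 1e-2
A = masks.T @ masks + lam * np.eye(N_MODULES)
importance = np.linalg.solve(A, masks.T @ scores)

# 4. Prune the least important modules; repeat in stages in practice.
n_prune = N_MODULES // 4
pruned = np.argsort(importance)[:n_prune]
print("pruned modules:", sorted(pruned.tolist()))
```

In the real setting, `evaluate` would run the masked LLM on a small calibration set, and steps 1 through 4 would repeat over several pruning rounds, re-estimating importance on the progressively smaller model.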