4 Jun 2024 | Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li
This study explores the trustworthiness of Large Language Models (LLMs) under compression, focusing on the interplay between efficiency and various dimensions of trustworthiness. The research evaluates three leading LLMs using five state-of-the-art (SoTA) compression techniques across eight trustworthiness dimensions. Key findings include:
1. **Quantization vs. Pruning**: Quantization is more effective than pruning at preserving both efficiency and trustworthiness. For example, a 4-bit quantized model retains the trustworthiness of its original counterpart, while pruning significantly degrades trustworthiness even at 50% sparsity (see the code sketches after this list).
2. **Extreme Compression**: Pushing quantization to very low bit widths (e.g., 3 bits) significantly reduces trustworthiness, particularly along the ethics and fairness dimensions. This risk is invisible when only benign task performance is examined, underscoring the need for comprehensive trustworthiness evaluation.
3. **Optimal Compression Rate**: Among the settings tested, 4-bit quantization is the sweet spot for maintaining trustworthiness, with minimal loss across all dimensions. At this rate, quantized models can even show joint improvements in efficiency and trustworthiness, especially in fairness and ethics.
4. **Practical Recommendations**: The study offers recommendations for achieving high utility, efficiency, and trustworthiness in LLMs, including preferring quantization over pruning and carefully selecting the compression rate.
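As a concrete illustration of the recommended setting, here is a minimal sketch that loads a chat LLM with 4-bit weight quantization. It assumes the Hugging Face `transformers` + `bitsandbytes` stack; the model name, NF4 configuration, and prompt are illustrative stand-ins rather than the paper's exact setup (the quantizers the paper evaluates, e.g. GPTQ and AWQ, differ in mechanism but target the same bit width).

```python
# Minimal sketch: load an LLM with 4-bit weight quantization.
# Assumes transformers, accelerate, and bitsandbytes are installed
# and a CUDA GPU is available. The model id is an illustrative example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed example model

# 4-bit NF4 quantization; chosen here for brevity, not because the
# paper uses bitsandbytes specifically.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "Briefly explain 4-bit weight quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```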
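For contrast with the pruning finding, the sketch below applies 50% unstructured magnitude pruning, the sparsity level at which the study already observes degraded trustworthiness, using PyTorch's built-in pruning utilities. The `magnitude_prune` helper is hypothetical, and calibration-based pruners such as SparseGPT or Wanda are more sophisticated than this simple stand-in.

```python
# Minimal sketch: 50% unstructured magnitude pruning of all Linear layers.
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, sparsity: float = 0.5) -> nn.Module:
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Mask the `amount` fraction of weights with smallest |w|.
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            # Fold the mask into the weight tensor permanently.
            prune.remove(module, "weight")
    return model

# Usage (on an unquantized, full-precision model):
# model = magnitude_prune(model, sparsity=0.5)
```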
The findings highlight the importance of a holistic approach to evaluating LLMs under compression, ensuring that both performance and trustworthiness are maintained.