Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity


5 Jun 2024 | Wentao Guo¹, Jikai Long², Yimeng Zeng³, Zirui Liu⁴, Xinyu Yang⁵, Yide Ran², Jacob R. Gardner³, Osbert Bastani³, Christopher De Sa⁶, Xiaodong Yu², Beidi Chen⁵, and Zhaozhou Xu²
This paper proposes an efficient zeroth-order (ZO) fine-tuning strategy for large language models (LLMs) by leveraging extreme sparsity. The key idea is to identify a small subset of "sensitive parameters" that are critical for ZO fine-tuning, allowing the majority of parameters to be quantized to reduce memory usage. The study demonstrates that fine-tuning only 0.1% of the parameters can achieve performance comparable to full ZO fine-tuning while significantly reducing computational and memory requirements. This approach enables efficient ZO fine-tuning on devices with limited memory, such as mobile phones and laptops, by combining ZO optimization with 4-bit quantization. The results show that this method achieves a 1.49× speedup in inference and a 2.15× speedup in sparse operations for the Llama2-7B model. The proposed method also allows for on-device personalization of LLMs by using pre-trained sensitive parameters to guide fine-tuning without requiring access to downstream task gradients. Theoretical analysis supports the effectiveness of this approach, showing that sensitive parameters maximize the gradient difference and cover a large fraction of the Hessian diagonal. The method is validated through extensive experiments on various LLMs and tasks, demonstrating its efficiency and performance.
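To make the core idea concrete, below is a minimal sketch (not the authors' released implementation) of a two-point zeroth-order (SPSA/MeZO-style) update restricted to a sparse mask of sensitive parameters. The function name `zo_sparse_step`, the `sensitive_mask` and `loss_fn` arguments, and the hyperparameter values are illustrative assumptions, not the paper's API.

```python
# Sketch: sparse zeroth-order (ZO) fine-tuning. Only the masked "sensitive"
# entries are perturbed and updated; the remaining (frozen, possibly 4-bit
# quantized) weights are never touched by the optimizer.
import torch

def zo_sparse_step(params, sensitive_mask, loss_fn, lr=1e-6, eps=1e-3, seed=0):
    """One two-point ZO (SPSA-style) update restricted to masked entries."""
    gen = torch.Generator(device=params.device).manual_seed(seed)
    # Random perturbation direction, zeroed outside the sensitive subset.
    z = torch.randn(params.shape, generator=gen, device=params.device) * sensitive_mask

    with torch.no_grad():
        params.add_(eps * z)               # theta + eps * z
        loss_plus = loss_fn(params)
        params.add_(-2 * eps * z)          # theta - eps * z
        loss_minus = loss_fn(params)
        params.add_(eps * z)               # restore theta

        # Scalar projected-gradient estimate and sparse ZO-SGD update.
        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        params.add_(-lr * grad_scale * z)
    return loss_plus
```

In the setting the summary describes, `sensitive_mask` would select roughly 0.1% of the weights, chosen from the pre-trained model via a sensitivity criterion (large gradient difference / Hessian-diagonal coverage), while the unmasked weights could be stored in 4-bit quantized form since they are never updated.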