5 Jun 2024 | Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, and Zhaozhuo Xu
This paper addresses the challenge of fine-tuning Large Language Models (LLMs) on memory-constrained devices such as mobile phones and laptops using zeroth-order (ZO) optimization, which does not require backpropagation. The authors propose integrating sparsity and quantization into ZO fine-tuning so that only a small subset of LLM parameters needs to be updated. They identify that fine-tuning just 0.1% of the parameters is sufficient for ZO fine-tuning, while the remaining weights can be quantized to reduce memory usage. The study demonstrates that fine-tuning 0.1% of sensitive parameters can outperform full ZO fine-tuning while significantly reducing wall-clock time. Additionally, combining ZO fine-tuning with 4-bit quantization enables efficient fine-tuning of a Llama2-7B model on a GPU with less than 8 GiB of memory, with a notable reduction in latency. The paper also explores the transferability of pre-training sparsity patterns across downstream tasks and provides a theoretical analysis of the convergence rate of sensitive sparse ZO-SGD. Finally, the authors validate their method through extensive experiments on various LLMs and tasks, showing superior performance and efficiency.
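For intuition, here is a minimal sketch (not the authors' released code) of a sparse zeroth-order SGD step in PyTorch: a two-sided SPSA-style gradient estimate is obtained from two forward passes and applied only to a fixed mask of sensitive parameters, so no backward pass is needed. The names `params`, `mask`, and `loss_fn` are hypothetical placeholders, and the hyperparameters are illustrative.

```python
import torch

def sparse_zo_sgd_step(params, mask, loss_fn, lr=1e-6, eps=1e-3, seed=0):
    """One sketch of a sparse ZO-SGD step over masked (sensitive) parameters.

    params : flat tensor of model weights
    mask   : 0/1 tensor of the same shape marking the ~0.1% sensitive entries
    loss_fn: callable that runs a forward pass and returns a scalar loss
    """
    gen = torch.Generator().manual_seed(seed)
    # Random perturbation restricted to the sensitive subset; frozen weights
    # (which could be kept 4-bit quantized) are never perturbed or updated.
    z = torch.randn(params.shape, generator=gen) * mask

    # Two function evaluations (forward passes) replace backpropagation.
    loss_plus = loss_fn(params + eps * z)
    loss_minus = loss_fn(params - eps * z)

    # Scalar projected-gradient estimate (SPSA-style finite difference).
    grad_est = (loss_plus - loss_minus) / (2 * eps)

    # Update only the masked parameters.
    return params - lr * grad_est * z
```

Because the update direction is the same random vector `z` used for the perturbation, only the random seed and the scalar `grad_est` need to be stored per step, which is what keeps the memory footprint close to inference.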