5 Jun 2024 | Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, and Zhaozhuo Xu
This paper addresses the challenge of fine-tuning Large Language Models (LLMs) on memory-constrained devices such as mobile phones and laptops using zeroth-order (ZO) optimization, which does not require backpropagation. The authors propose integrating sparsity and quantization into ZO fine-tuning so that only a small subset of LLM parameters needs to be updated. They identify that fine-tuning just 0.1% of the parameters is sufficient for ZO fine-tuning, while the remaining weights can be quantized to reduce memory usage. The study demonstrates that fine-tuning 0.1% of sensitive parameters can outperform full ZO fine-tuning while significantly reducing wall-clock time. Additionally, combining ZO fine-tuning with 4-bit quantization enables efficient fine-tuning of a Llama2-7B model on a GPU with less than 8 GiB of memory, with a notable reduction in latency. The paper also explores the transferability of pre-training sparsity patterns across downstream tasks and provides a theoretical analysis of the convergence rate of sensitive sparse ZO-SGD. Finally, the authors validate their method through extensive experiments on various LLMs and tasks, showing superior performance and efficiency.
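For intuition, here is a minimal sketch (not the authors' released code) of a sparse zeroth-order SGD step in PyTorch: a two-sided SPSA-style gradient estimate is obtained from two forward passes and applied only to a fixed mask of sensitive parameters, so no backward pass is needed. The names `params`, `mask`, and `loss_fn` are hypothetical placeholders, and the hyperparameters are illustrative.

```python
import torch

def sparse_zo_sgd_step(params, mask, loss_fn, lr=1e-6, eps=1e-3, seed=0):
    """One sketch of a sparse ZO-SGD step over masked (sensitive) parameters.

    params : flat tensor of model weights
    mask   : 0/1 tensor of the same shape marking the ~0.1% sensitive entries
    loss_fn: callable that runs a forward pass and returns a scalar loss
    """
    gen = torch.Generator().manual_seed(seed)
    # Random perturbation restricted to the sensitive subset; frozen weights
    # (which could be kept 4-bit quantized) are never perturbed or updated.
    z = torch.randn(params.shape, generator=gen) * mask

    # Two function evaluations (forward passes) replace backpropagation.
    loss_plus = loss_fn(params + eps * z)
    loss_minus = loss_fn(params - eps * z)

    # Scalar projected-gradient estimate (SPSA-style finite difference).
    grad_est = (loss_plus - loss_minus) / (2 * eps)

    # Update only the masked parameters.
    return params - lr * grad_est * z
```

Because the update direction is the same random vector `z` used for the perturbation, only the random seed and the scalar `grad_est` need to be stored per step, which is what keeps the memory footprint close to inference.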