Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

24 Feb 2024 | Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You
This paper introduces Sparse MeZO, a memory-efficient zeroth-order optimization method for fine-tuning large language models (LLMs). Instead of perturbing every parameter, Sparse MeZO applies zeroth-order optimization only to a carefully selected subset of parameters chosen by a sparse mask. This selective treatment improves both accuracy and convergence speed over standard zeroth-order fine-tuning (MeZO) while keeping memory usage low enough to fine-tune models as large as LLaMA-30b on a single A100 GPU; a sketch of such a masked zeroth-order update is given below.
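The following is a minimal PyTorch sketch of a two-point, SPSA-style zeroth-order step restricted to a masked subset of parameters, in the spirit of Sparse MeZO. The function names, the magnitude-based mask rule, and the hyperparameters are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch: masked zeroth-order (SPSA-style) update in the spirit of Sparse MeZO.
# The mask rule (keep small-magnitude weights) and all names/hyperparameters are
# assumptions for illustration, not the paper's exact implementation.
import torch


def build_masks(model, sparsity=0.75):
    """Hypothetical mask rule: update only the (1 - sparsity) fraction of each
    weight tensor with the smallest magnitude; freeze the rest."""
    masks = {}
    for name, p in model.named_parameters():
        k = max(1, int(p.numel() * (1.0 - sparsity)))
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() <= threshold).float()
    return masks


@torch.no_grad()
def sparse_zo_step(model, loss_fn, masks, eps=1e-3, lr=1e-6, seed=0):
    """One two-point SPSA step applied only to masked coordinates.
    `seed` should change every step; it lets us regenerate the same random
    direction z instead of storing it."""
    def perturb(scale):
        torch.manual_seed(seed)                    # same z each time perturb() runs
        for name, p in model.named_parameters():
            z = torch.randn_like(p)
            p.add_(scale * eps * z * masks[name])  # only masked entries move

    perturb(+1.0)
    loss_plus = loss_fn(model)                     # loss at theta + eps*z*m
    perturb(-2.0)
    loss_minus = loss_fn(model)                    # loss at theta - eps*z*m
    perturb(+1.0)                                  # restore original weights

    grad_proj = (loss_plus - loss_minus) / (2.0 * eps)

    torch.manual_seed(seed)                        # regenerate the same z again
    for name, p in model.named_parameters():
        z = torch.randn_like(p)
        p.add_(-lr * grad_proj * z * masks[name])  # masked SGD-style update
    return loss_plus.item()
```

Because the same seed regenerates the random direction z on demand, neither z nor a perturbed copy of the weights needs to be kept in memory, which is what makes this style of zeroth-order fine-tuning memory-friendly.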
On SuperGLUE tasks, Sparse MeZO consistently outperforms MeZO and other zeroth-order baselines, as well as zero-shot learning and in-context learning. On RTE it delivers a 9% absolute accuracy improvement and a 3.5x speedup over MeZO, and it reaches comparable accuracy in fewer optimization steps. A theoretical analysis argues that restricting optimization to a sub-network with a smaller gradient norm can accelerate zeroth-order convergence.

The paper also proposes a memory-efficient implementation that computes the sparse mask during the forward pass, eliminating the need to store perturbed parameters; one way this could look is sketched below. A memory-usage analysis shows that the method requires far less GPU memory than full-parameter fine-tuning, and it scales to larger models such as LLaMA-30b, where it again improves over MeZO. The authors conclude that Sparse MeZO is a promising approach to memory-efficient zeroth-order fine-tuning of LLMs, with room for further improvements in future research.
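Below is one way the forward-pass masking could be realized: each linear layer is perturbed just before its own forward call and restored immediately afterwards, so at most one layer's masked perturbation exists at any time and no perturbed copy of the whole model is kept. The hook-based design, the magnitude-based mask, and all identifiers are assumptions made for illustration; the paper's actual implementation may differ.

```python
# Sketch: compute the sparse mask on the fly during the forward pass by
# perturbing each nn.Linear right before it runs and restoring it right after.
# Hook design, mask rule, and names are illustrative assumptions.
import torch
import torch.nn as nn


class LayerwisePerturbation:
    """Applies a masked perturbation to each nn.Linear weight during its own
    forward call, so only one layer's perturbation is alive at a time."""

    def __init__(self, eps, seed, sparsity=0.75):
        self.eps, self.seed, self.sparsity = eps, seed, sparsity
        self._delta = {}  # holds at most one layer's perturbation at a time

    def _pre_hook(self, module, inputs):
        w = module.weight
        gen = torch.Generator(device=w.device)
        gen.manual_seed(self.seed + id(module) % 10_000)
        z = torch.randn(w.shape, generator=gen, device=w.device, dtype=w.dtype)
        k = max(1, int(w.numel() * (1.0 - self.sparsity)))
        thresh = w.detach().abs().flatten().kthvalue(k).values
        delta = self.eps * z * (w.detach().abs() <= thresh)  # mask built on the fly
        with torch.no_grad():
            w.add_(delta)
        self._delta[id(module)] = delta

    def _post_hook(self, module, inputs, output):
        with torch.no_grad():
            module.weight.sub_(self._delta.pop(id(module)))  # restore immediately
        return output

    def attach(self, model):
        handles = []
        for m in model.modules():
            if isinstance(m, nn.Linear):
                handles.append(m.register_forward_pre_hook(self._pre_hook))
                handles.append(m.register_forward_hook(self._post_hook))
        return handles  # call h.remove() on each handle to detach the hooks
```

The design trades a small amount of recomputation (regenerating z and the mask per layer) for memory, which matches the summary's point that the mask is derived during the forward pass rather than stored alongside the model.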