LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

May 28, 2024 | Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang
The paper "LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning" introduces Layerwise Importance Sampled AdamW (LISA), an optimization method that addresses the memory constraints of fine-tuning large language models (LLMs). The authors observe that Low-Rank Adaptation (LoRA) produces a skewed distribution of weight norms across layers, with the bottom and top layers carrying the majority of the update weight. This observation motivates LISA, which randomly freezes most of the middle layers during optimization, effectively emulating LoRA's update pattern while reducing memory consumption.

LISA scales to models with over 70 billion parameters and achieves performance similar to or better than LoRA and full-parameter training. Experimental results show that LISA outperforms LoRA by 10%-35% in MT-Bench scores while matching or improving performance on other benchmarks such as MMLU, AGIEval, and WinoGrande. LISA also exhibits superior convergence behavior and requires less GPU memory for fine-tuning tasks. The paper includes ablation studies that tune hyperparameters and validate LISA's effectiveness across settings including instruction-following tasks, continual pre-training, and large-scale fine-tuning. The authors conclude that LISA is a promising alternative to LoRA, offering significant improvements in memory efficiency and performance across different models and tasks.
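The core mechanism described above — periodically sampling a small subset of middle layers to keep trainable while freezing the rest — can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names (`lisa_sample_layers`, `apply_freeze_mask`) and the dict-based stand-in for per-layer parameter groups are assumptions for clarity; in a real setup the flag would map to each layer's `requires_grad`, and the embedding and output layers would typically stay trainable throughout.

```python
import random


def lisa_sample_layers(num_layers, num_active, seed=None):
    """Randomly choose which intermediate layers to unfreeze for the next
    optimization period (LISA-style layerwise sampling).

    Returns a sorted list of layer indices to train; all other layers
    remain frozen until the next resampling step.
    """
    rng = random.Random(seed)  # seed is optional, for reproducibility
    return sorted(rng.sample(range(num_layers), num_active))


def apply_freeze_mask(layer_groups, active_indices):
    """Mark sampled layers as trainable and freeze the rest.

    `layer_groups` is a list of dicts standing in for per-layer parameter
    groups; in practice this would toggle `requires_grad` on the layer's
    parameters instead.
    """
    active = set(active_indices)
    for i, group in enumerate(layer_groups):
        group["trainable"] = i in active
    return layer_groups


# Example: a hypothetical 32-layer model, training 2 sampled layers per
# period and resampling every `resample_interval` optimizer steps.
num_layers, num_active, resample_interval = 32, 2, 50
layer_groups = [{} for _ in range(num_layers)]
for step in range(200):
    if step % resample_interval == 0:
        active = lisa_sample_layers(num_layers, num_active, seed=step)
        apply_freeze_mask(layer_groups, active)
    # ... run the optimizer step on trainable parameters only ...
```

Because only `num_active` layers hold optimizer state (e.g., AdamW moments) at any time, memory usage stays far below full-parameter training while, over many periods, every layer still receives updates.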