LoRA Training in the NTK Regime has No Spurious Local Minima


2024 | Uijeong Jang, Jason D. Lee, Ernest K. Ryu
This paper presents a theoretical analysis of LoRA (Low-Rank Adaptation) fine-tuning in the Neural Tangent Kernel (NTK) regime. The authors show that full fine-tuning without LoRA admits a low-rank solution of rank $ r \lesssim \sqrt{N} $, where $ N $ is the number of training examples. Using LoRA with rank $ r \gtrsim \sqrt{N} $ then eliminates spurious local minima, allowing (stochastic) gradient descent to find low-rank solutions. Furthermore, the low-rank solution found using LoRA generalizes well.

The paper begins by introducing the problem setting and reviewing prior work on neural networks, covering expressive power, trainability, and generalization. It then discusses the NTK regime, which is a reasonable assumption for fine-tuning when the parameter changes are small. The authors define the linearized loss and show that LoRA fine-tuning amounts to optimizing this loss under a low-rank parameterization of the weight update.

The main results then establish the existence of low-rank solutions in the NTK regime: full fine-tuning admits a global minimizer of rank $ r \lesssim \sqrt{N} $, and LoRA with rank $ r \gtrsim \sqrt{N} $ has no spurious local minima, so gradient descent finds low-rank global minima. The low-rank solution found using LoRA is moreover shown to generalize well.
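To make the setup concrete, the following is a minimal sketch of the linearization and the LoRA parameterization; the notation ($ f $, $ W_0 $, $ \Delta W $, $ B $, $ A $, $ \ell $) is a reconstruction consistent with this summary rather than the paper's exact formulation:

$$ f^{\mathrm{lin}}(x;\, W_0 + \Delta W) = f(x; W_0) + \langle \nabla_W f(x; W_0),\, \Delta W \rangle, \qquad \mathcal{L}^{\mathrm{lin}}(\Delta W) = \sum_{i=1}^{N} \ell\big( f^{\mathrm{lin}}(x_i;\, W_0 + \Delta W),\, y_i \big). $$

LoRA restricts the update to rank at most $ r $ by writing $ \Delta W = BA $ with $ B \in \mathbb{R}^{d_1 \times r} $ and $ A \in \mathbb{R}^{r \times d_2} $. In this notation, the results say that $ \mathcal{L}^{\mathrm{lin}} $ has a global minimizer whose rank is $ \lesssim \sqrt{N} $, and that for $ r \gtrsim \sqrt{N} $ every local minimum of $ (B, A) \mapsto \mathcal{L}^{\mathrm{lin}}(BA) $ is global.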
The paper also provides experimental results on fine-tuning pre-trained models across different modalities, validating the theory and providing further insight. The experiments show that LoRA fine-tuning converges to the same globally optimal loss value across ranks, although the convergence rates differ; the authors hypothesize that lower LoRA ranks may create unfavorable regions in the loss landscape that slow the gradient descent dynamics.

In conclusion, the paper presents theoretical guarantees on the trainability and generalization of LoRA fine-tuning of pre-trained models, a first step toward theoretically analyzing LoRA fine-tuning dynamics. Future work includes more refined analyses under more specific assumptions, relaxing the linearization/NTK-regime assumption, better understanding the minimum rank requirement, and analyzing the tradeoff between training speed and LoRA rank.
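As a toy illustration of the reported phenomenon (equal final losses, rank-dependent convergence speed), here is a hypothetical minimal sketch, not the paper's code: gradient descent on a linearized least-squares objective with a LoRA update $ \Delta W = BA $, swept over ranks around $ \sqrt{N} $. The synthetic arrays `G` and `resid` stand in for the gradient features $ \nabla_W f(x_i; W_0) $ and residuals $ y_i - f(x_i; W_0) $.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, N = 20, 20, 16  # weight matrix shape and number of training examples

# Synthetic stand-ins: G[i] plays the role of grad_W f(x_i; W_0),
# resid[i] the role of y_i - f(x_i; W_0).
G = rng.normal(size=(N, d1, d2))
resid = rng.normal(size=N)

def loss_and_grads(B, A):
    """Linearized squared loss with LoRA update dW = B @ A, and its gradients."""
    dW = B @ A
    err = np.einsum("nij,ij->n", G, dW) - resid   # <G_i, dW> - resid_i
    loss = 0.5 * np.mean(err**2)
    g_dW = np.einsum("n,nij->ij", err, G) / N     # d(loss)/d(dW)
    return loss, g_dW @ A.T, B.T @ g_dW           # chain rule onto B and A

for r in (1, 2, 4, 8):  # ranks around sqrt(N) = 4
    B = rng.normal(size=(d1, r)) / np.sqrt(d1)
    A = rng.normal(size=(r, d2)) / np.sqrt(d2)
    for _ in range(4000):
        loss, gB, gA = loss_and_grads(B, A)
        B -= 0.005 * gB
        A -= 0.005 * gA
    print(f"rank {r}: final linearized loss {loss:.3e}")
```

On a toy problem like this, sufficiently large ranks typically reach essentially the same loss while smaller ranks take more steps, mirroring the rank/convergence-rate tradeoff described above; exact numbers depend on the seed and step size.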