16 Oct 2021 | Edward Hu*, Yelong Shen*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
**Low-Rank Adaptation of Large Language Models (LoRA)**
**Authors:** Edward Hu*, Yelong Shen*, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
**Affiliation:** Microsoft Corporation
**Contact Information:**
{edwardhu, yeshe, phwallis, zeyuana, yuanzhi1, swang, luw, wzchen}@microsoft.com
yuanzhi1@andrew.cmu.edu
**Version:** 2
**Abstract:**
Natural language processing often relies on pre-training large-scale models on general-domain data and then adapting them to specific tasks or domains. However, full fine-tuning, which re-trains all model parameters, becomes impractical as models grow larger. For instance, deploying independent fully fine-tuned instances of GPT-3, with 175 billion parameters each, is prohibitively expensive. LoRA addresses this by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to fine-tuning GPT-3 with Adam, LoRA reduces the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters and a higher training throughput, and, unlike adapters, it adds no inference latency.
**Introduction:**
LoRA addresses the challenge of adapting large-scale pre-trained models to multiple downstream applications. Traditional fine-tuning updates all model parameters, making it costly and impractical for large models. LoRA introduces a low-rank representation for the task-specific parameter increment, allowing efficient adaptation while retaining the pre-trained model's weights. This approach reduces storage and computational requirements, enabling efficient task switching and faster deployment.
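The task-switching claim can be made concrete with a small sketch (all names here are illustrative, not from the paper's code): the frozen base weight is shared across tasks, each task contributes only a pair of small low-rank factors, and those factors can be merged into the base weight for serving and swapped in memory without re-deploying the full model.

```python
# Illustrative sketch of why LoRA makes task switching cheap: each task is just a
# pair of small factors (B, A), merged into the frozen weight for serving and
# swapped without touching the shared pre-trained weight W0.
import torch

d, k, r = 768, 768, 4
W0 = torch.randn(d, k)                                         # frozen pre-trained weight (shared)
B1, A1 = torch.randn(d, r) * 0.01, torch.randn(r, k) * 0.01    # trained adapter for task 1
B2, A2 = torch.randn(d, r) * 0.01, torch.randn(r, k) * 0.01    # trained adapter for task 2

W = W0 + B1 @ A1             # merge task 1's update: inference uses one matmul, no extra latency
W = W - B1 @ A1 + B2 @ A2    # switch to task 2 by undoing one update and applying the other
```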
**Problem Statement:**
The problem is to adapt a pre-trained language model to downstream tasks without the high computational and storage costs of full fine-tuning. LoRA aims to achieve this by optimizing low-rank updates to the model weights.
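This can be stated compactly, following the standard LoRA formulation (the symbols $W_0$, $A$, $B$, and $r$ are introduced here for clarity): for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the task-specific update $\Delta W$ is constrained to a low-rank product, and only the factors are trained while $W_0$ stays frozen.

$$
W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)
$$

$$
h = W_0 x + BAx
$$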
**Our Method:**
LoRA represents the update to a weight matrix as a low-rank decomposition, freezing the pre-trained weights and optimizing only the rank decomposition matrices. This reduces the number of trainable parameters and improves training efficiency. In principle, LoRA can be applied to any dense layer in a deep learning model; the focus here is on the attention weights of Transformer models.
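A minimal sketch of such a layer, assuming PyTorch and using the common initialization choices (Gaussian $A$, zero $B$, scaling by $\alpha/r$); the class name and the `rank`/`alpha` arguments are illustrative, not the paper's released code:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-augmented linear layer (illustrative)."""

    def __init__(self, in_features, out_features, rank=4, alpha=8.0):
        super().__init__()
        # Frozen pre-trained weight W0 (in practice loaded from the base model, not random).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False
        # Trainable rank-decomposition factors: A is small-Gaussian, B is zero,
        # so BA = 0 at the start and training begins exactly from W0.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(768, 768, rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 6144 trainable parameters vs. 589824 for full fine-tuning of this layer
```

Because the update $BA$ can be added into $W_0$ after training, the merged layer computes a single matrix multiply at inference time, which is why no extra latency is incurred relative to the original model.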
**Empirical Experiments:**
LoRA is evaluated on a range of downstream tasks in natural language understanding and generation, using RoBERTa, DeBERTa, GPT-2, and GPT-3. Across these benchmarks, LoRA matches or exceeds the quality of full fine-tuning while training far fewer parameters, demonstrating its effectiveness and efficiency.
**Conclusion:**
LoRA provides a parameter-efficient approach to adapting large language models to specific tasks, offering significant savings in computation and storage. It retains high model quality and, unlike adapter-based methods, introduces no additional inference latency.