LoRA Meets Dropout under a Unified Framework

27 May 2024 | Sheng Wang, Liheng Chen, Jiyue Jiang, Boyang Xue, Lingpeng Kong, Chuan Wu
This paper examines the apparent tension between parameter-efficient LoRA and conventional dropout methods when fine-tuning large language models (LLMs). Although LoRA updates only a small fraction of parameters, it remains prone to overfitting, the very problem dropout is designed to mitigate. The study revisits transformer-specific dropout variants, namely DropKey, DropAttention, and HiddenCut, and establishes their mathematical and empirical equivalences and differences. A unified framework built on three dimensions, dropping position, structural pattern, and compensation measure, is introduced to analyze these methods, and it reveals new preferences and performance comparisons when they are applied in LoRA scenarios. Guided by this framework, the authors propose HiddenKey, a novel dropout method that drops attention logits column-wise and hidden representations element-wise, and augments the vanilla task loss with a KL-divergence loss. Extensive experiments show that HiddenKey outperforms existing methods across multiple models and tasks, demonstrating its effectiveness in mitigating overfitting in LoRA-based fine-tuning. The study also highlights the importance of compensation measures in narrowing the gap between training and inference. Overall, the results indicate that HiddenKey is a superior approach to high-performance, parameter-efficient fine-tuning of LLMs.
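For illustration only, the sketch below shows how the three ingredients named above (column-wise dropout on attention logits, element-wise dropout on hidden representations, and a KL-divergence term added to the task loss) could be wired together in PyTorch. This is not the authors' implementation: the function names (`column_dropout`, `element_dropout`, `kl_consistency_loss`), the tensor layouts, and the symmetric form of the KL term are assumptions made for the sketch.

```python
# Minimal sketch of a HiddenKey-style recipe, assuming a standard
# (batch, heads, q_len, k_len) attention-logit layout. Not the paper's code.
import torch
import torch.nn.functional as F


def column_dropout(attn_logits: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Drop whole key columns of the attention logits (column-wise pattern):
    every query position ignores the same randomly chosen keys."""
    if not training or p == 0.0:
        return attn_logits
    b, h, _, k_len = attn_logits.shape
    # One mask per (batch, head, key), broadcast across the query axis.
    keep = torch.rand(b, h, 1, k_len, device=attn_logits.device) >= p
    # Masked columns get -inf before softmax, so softmax itself renormalizes
    # the surviving keys; no explicit rescaling is applied on this path.
    return attn_logits.masked_fill(~keep, float("-inf"))


def element_dropout(hidden: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Element-wise dropout on hidden representations, with the usual
    1/(1-p) rescaling as the train/inference compensation."""
    return F.dropout(hidden, p=p, training=training)


def kl_consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between two stochastic forward passes,
    used here as the auxiliary term added to the vanilla task loss."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)
```

In a training step following this sketch, the same batch would be passed through the model twice with independent dropout masks to obtain `logits_a` and `logits_b`; the total loss would then be the averaged task loss plus a weighted `kl_consistency_loss`, which is one plausible reading of the KL-divergence compensation described in the summary.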