Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs


25 Jun 2024 | Ashwinee Panda, Berivan Isik, Xiangyu Qi, Sanmi Koyejo, Tsachy Weissman, Prateek Mittal
This paper introduces Lottery Ticket Adaptation (LoTA), a sparse adaptation method for large language models (LLMs) that mitigates destructive interference between tasks. Existing methods for adapting LLMs to new tasks typically modify all model weights, which causes catastrophic forgetting of earlier tasks. LoTA instead identifies and optimizes only a sparse subnetwork of the model, enabling adaptation to multiple tasks without significant performance loss.

LoTA is evaluated on a range of challenging tasks, including instruction following, reasoning, math, and summarization. It outperforms full fine-tuning (FFT) and low-rank adaptation (LoRA), and it maintains good performance even after subsequent training on other tasks. By extracting and fine-tuning over lottery tickets (sparse task vectors), LoTA also enables model merging across highly dissimilar tasks, outperforming existing merging methods that rely on post hoc sparsification.

The paper discusses three multi-task adaptation paradigms: storing and loading adapters, sequential training, and model merging. LoTA addresses the challenges in each paradigm through sparse task representations that minimize destructive interference between tasks. In sequential training, restricting task vectors to be sparse prevents catastrophic forgetting, because adaptation to new data does not overwrite the weights that encode previous tasks. In model merging, LoTA trains sparse task vectors directly, avoiding the need for post hoc sparsification and outperforming existing merging methods.

LoTA is implemented as a three-phase workflow: mask calibration, mask extraction, and sparse adaptation. Mask calibration fine-tunes the model for a number of iterations, yielding a fine-tuned model. Mask extraction derives a sparsity mask from the magnitudes of the updates in the resulting task vector. Sparse adaptation then fine-tunes only the weights selected by the mask, leaving the remaining parameters frozen; a minimal code sketch of this recipe appears below.

The paper also introduces Lottery Ticket Together Optimization (LoTTO), which learns mutually sparse masks for sequentially learned tasks. By restricting each new task's weight updates so they do not interfere with the weights important for previous adaptations, LoTTO lets the model learn new tasks without forgetting old ones.

Experiments cover instruction following, safety, math, coding, summarization, and reasoning. LoTA outperforms LoRA on the more challenging tasks while performing on par with or better than FFT, and it achieves higher performance than existing methods when merging models trained on heterogeneous datasets. The paper concludes that LoTA is a promising framework for multi-task adaptation: its sparse adaptation mitigates destructive interference and enables efficient model merging across diverse tasks.
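The following is a minimal PyTorch sketch of the three-phase recipe summarized above: calibrate, extract a magnitude-based mask from the task vector, then fine-tune only the masked weights. It is an illustrative reading of the summary, not the authors' implementation; `fine_tune(model, steps)` stands in for an ordinary dense fine-tuning loop, and the `sparsity` and `calib_steps` values are assumptions.

```python
import copy
import torch

def calibrate_and_extract_mask(model, fine_tune, sparsity=0.9, calib_steps=128):
    """Phases 1-2: mask calibration and mask extraction.

    fine_tune(model, steps) is a stand-in for an ordinary fine-tuning loop
    on the target task (an assumption of this sketch). `sparsity` is the
    fraction of weights that stay frozen.
    """
    # Snapshot the base weights so the task vector can be computed later.
    base = {n: p.detach().clone() for n, p in model.named_parameters()}

    # Phase 1 (mask calibration): briefly fine-tune a throwaway copy.
    calib = copy.deepcopy(model)
    fine_tune(calib, steps=calib_steps)

    # Phase 2 (mask extraction): keep the largest-magnitude entries of the
    # task vector (calibrated weights minus base weights), freeze the rest.
    masks = {}
    for name, p in calib.named_parameters():
        delta = (p.detach() - base[name]).abs().flatten()
        k = max(1, int(round((1.0 - sparsity) * delta.numel())))
        mask = torch.zeros_like(delta, dtype=torch.bool)
        mask[delta.topk(k).indices] = True
        masks[name] = mask.view(p.shape)
    return base, masks

def mask_gradients(model, masks):
    """Phase 3 helper: call after loss.backward() and before optimizer.step()
    so that only weights inside the lottery ticket are updated."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[name].to(p.grad.dtype))
```

For phase 3, one would restart from the base weights, rerun the usual training loop, and call `mask_gradients` at every step; weights outside the mask then keep their original values (in practice, optimizer behavior such as weight decay on frozen entries would also need to be disabled, or the frozen entries copied back from `base` after each step).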
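For sequential training, here is a similarly hedged sketch of the mutually sparse masks behind LoTTO: when a new task is calibrated, its mask is chosen only from weights that no earlier task has claimed, so later updates cannot disturb earlier lottery tickets. The names `new_task_delta` (the task vector estimated during the new task's calibration phase) and `previous_masks` (boolean masks from earlier tasks) are illustrative, not from the paper.

```python
import torch

def lotto_mask(new_task_delta, previous_masks, sparsity=0.9):
    """Pick a mask for a new task that is disjoint from earlier tasks' masks.

    new_task_delta: dict of parameter name -> task vector from calibration.
    previous_masks: list of dicts of parameter name -> boolean mask.
    """
    masks = {}
    for name, delta in new_task_delta.items():
        # Weights already claimed by earlier tasks are off limits.
        taken = torch.zeros_like(delta, dtype=torch.bool)
        for prev in previous_masks:
            taken |= prev[name]

        # Among the remaining weights, keep the largest-magnitude entries
        # of the new task vector.
        candidate = delta.abs().flatten() * (~taken).flatten()
        k = max(1, int(round((1.0 - sparsity) * candidate.numel())))
        mask = torch.zeros_like(candidate, dtype=torch.bool)
        mask[candidate.topk(k).indices] = True
        mask &= candidate > 0  # guarantees no overlap with already-taken weights
        masks[name] = mask.view(delta.shape)
    return masks
```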
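Finally, the merging claim can be illustrated with a tiny sketch: because each LoTA task vector is already sparse, merging can amount to adding the task vectors back onto the base weights. Weighting schemes and conflict resolution are omitted here, so this is a simplified illustration rather than the paper's exact merging procedure.

```python
import torch

def merge_sparse_task_vectors(base_state, task_vectors, scale=1.0):
    """Add already-sparse task vectors onto the base weights.

    base_state:   dict of parameter name -> base model tensor.
    task_vectors: list of dicts of parameter name -> task vector
                  (dense tensors that are zero outside each task's mask).
    """
    merged = {name: w.clone() for name, w in base_state.items()}
    for tv in task_vectors:
        for name, delta in tv.items():
            merged[name] += scale * delta
    return merged
```

If the masks were produced with LoTTO they are disjoint, so the added task vectors never collide on the same weight; this is one way to read the paper's point that directly training sparse task vectors removes the need for post hoc sparsification before merging.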