This paper introduces Conditional Mixture-of-LoRA (MixLoRA), a novel approach for multimodal instruction tuning that addresses task interference in parameter-efficient fine-tuning. Multimodal Large Language Models (MLLMs) have shown strong performance across diverse tasks, but their zero-shot generalization to new multimodal tasks remains a challenge. Multimodal instruction tuning, which fine-tunes pre-trained models on diverse multimodal tasks through instructions, has emerged as a promising strategy for achieving such generalization. However, conventional parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) struggle with the diversity and complexity of multimodal tasks, leading to task interference and degraded performance.
To address this, the paper proposes MixLoRA, which dynamically constructs low-rank adaptation matrices tailored to each input instance. Unlike conventional LoRA, which shares a single pair of low-rank matrices across all tasks, MixLoRA selects decomposition factors from two collections to build different adaptation matrices for different inputs. Task interference is reduced because the factors selected for the LoRA A and B matrices are not only tailored to the input but also cohesively aligned.
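To make the factor-selection mechanism concrete, the following PyTorch sketch illustrates one plausible reading of it: two learnable pools of rank-1 decomposition factors, per-instance routers that score each pool, and the top-r factors gathered and gated to form instance-specific LoRA A and B matrices. The class, parameter names (factor_pool_a, router_a, etc.), mean-pooled routing input, and softmax gating are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalMixtureLoRA(nn.Module):
    """Sketch of instance-conditioned low-rank adaptation.

    Two pools of decomposition factors are maintained; for every input
    instance, routers score the pools and the top-r factors are stacked
    into instance-specific LoRA A and B matrices. Names and routing
    details are illustrative assumptions, not the paper's implementation.
    """

    def __init__(self, d_in, d_out, rank=4, pool_size=32, scaling=1.0):
        super().__init__()
        self.rank = rank
        self.scaling = scaling
        # Pools of rank-1 decomposition factors (the two "collections").
        # B-side factors start at zero so the initial update is zero, as in LoRA.
        self.factor_pool_a = nn.Parameter(torch.randn(pool_size, d_in) * 0.02)  # rows of A
        self.factor_pool_b = nn.Parameter(torch.zeros(pool_size, d_out))        # columns of B
        # Routers score every factor from a pooled instance representation.
        self.router_a = nn.Linear(d_in, pool_size)
        self.router_b = nn.Linear(d_in, pool_size)

    def forward(self, x):
        # x: (batch, seq_len, d_in); the frozen base-layer output is added outside.
        instance_repr = x.mean(dim=1)                        # (batch, d_in)
        scores_a = self.router_a(instance_repr)              # (batch, pool_size)
        scores_b = self.router_b(instance_repr)

        # Select the top-r factors per instance and gate them by their
        # softmaxed scores so the routers remain trainable.
        top_a = scores_a.topk(self.rank, dim=-1)
        top_b = scores_b.topk(self.rank, dim=-1)
        gates_a = F.softmax(top_a.values, dim=-1).unsqueeze(-1)   # (batch, rank, 1)
        gates_b = F.softmax(top_b.values, dim=-1).unsqueeze(-1)
        A = self.factor_pool_a[top_a.indices] * gates_a          # (batch, rank, d_in)
        B = self.factor_pool_b[top_b.indices] * gates_b          # (batch, rank, d_out)

        # Instance-specific low-rank update: delta = (x A^T) B.
        delta = torch.einsum("bsd,brd->bsr", x, A)               # (batch, seq, rank)
        delta = torch.einsum("bsr,bro->bso", delta, B)           # (batch, seq, d_out)
        return self.scaling * delta


# Usage: add the instance-conditioned update to a frozen linear layer's output.
if __name__ == "__main__":
    base = nn.Linear(768, 768)
    for p in base.parameters():
        p.requires_grad_(False)
    mixlora = ConditionalMixtureLoRA(d_in=768, d_out=768, rank=4, pool_size=32)
    x = torch.randn(2, 16, 768)
    y = base(x) + mixlora(x)
    print(y.shape)  # torch.Size([2, 16, 768])
```

Gating the selected factors by their softmaxed router scores keeps the routers differentiable through the top-r selection; how the paper actually routes factors and keeps the A- and B-side selections cohesively aligned may differ from this sketch.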
The paper evaluates MixLoRA on MME and seven additional multimodal evaluation datasets. Experimental results show that MixLoRA consistently outperforms LoRA across these tasks, even when LoRA is given the same or a higher rank. The dynamic factor selection mechanism also enables the model to generalize to unseen tasks through adaptive factor activation, demonstrating its effectiveness in mitigating task interference. The study highlights the importance of effective adaptation strategies in parameter-efficient multimodal instruction tuning and underscores the potential of MixLoRA to improve robustness and versatility on complex multimodal tasks.