DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling
17 Jun 2024 | Pala Tej Deep, Rishabh Bhardwaj, Soujanya Poria
The paper introduces DELLA-Merging (Drop and rEscaLe via sampLing with mAgnitude), a model merging technique that combines the capabilities of multiple domain-specific models into a single multitasking model without additional training. DELLA-Merging employs a pruning technique called MAGPRUNE, which ranks delta parameters by magnitude and assigns higher dropout probabilities to parameters with lower magnitudes; the surviving parameters are then rescaled to approximate the original embeddings. The methodology consists of three steps, Drop, Elect, and Fuse, each designed to reduce interference between the models being merged (a sketch is given below).

On three expert models (LM, Math, Code) and their corresponding benchmark datasets (AlpacaEval, GSM8K, MBPP), DELLA-Merging shows an average improvement of 2.4 points over baselines that also prune delta parameters (3.6 points over TIES, 1.2 points over DARE) and 11.1 points over the no-pruning baseline (TA). The paper also demonstrates the importance of rescaling the unpruned delta parameters, which improves performance by 7.6 points on the Math+Code merged model. Overall, DELLA-Merging outperforms the other methods on both individual and aggregated benchmark scores, highlighting its effectiveness for model merging.
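The following is a minimal sketch of the Drop, Elect, and Fuse steps as described above, not the authors' implementation. It assumes PyTorch state dicts, a linear keep-probability schedule over magnitude ranks, inverse-probability rescaling of surviving deltas, and sign election via the summed deltas; the function names (`magprune`, `della_merge`) and hyperparameters (`keep_rate`, `eps`, `lam`) are illustrative assumptions.

```python
import torch

def magprune(delta: torch.Tensor, keep_rate: float = 0.3, eps: float = 0.1) -> torch.Tensor:
    """Drop step (sketch): sample a keep mask so that lower-magnitude delta
    parameters get higher drop probabilities, then rescale survivors so the
    expected delta is approximately preserved (assumed 1/p rescaling)."""
    flat = delta.flatten()
    # Rank parameters by magnitude: rank 0 = smallest magnitude.
    ranks = torch.argsort(torch.argsort(flat.abs())).float()
    n = flat.numel()
    # Assumed schedule: keep probability grows linearly with magnitude rank,
    # centred on keep_rate with spread eps.
    keep_prob = (keep_rate - eps / 2) + eps * ranks / max(n - 1, 1)
    keep_prob = keep_prob.clamp(1e-3, 1.0)
    mask = torch.bernoulli(keep_prob)
    # Rescale unpruned deltas by 1 / keep_prob.
    return (flat * mask / keep_prob).view_as(delta)

def della_merge(base: dict, experts: list, keep_rate: float = 0.3, lam: float = 1.0) -> dict:
    """Elect + Fuse steps (sketch): keep only expert deltas whose sign agrees
    with the elected (dominant) sign, average them, and add back to the base."""
    merged = {}
    for name, base_w in base.items():
        # Drop: magnitude-sampled, rescaled delta for each expert model.
        deltas = [magprune(exp[name] - base_w, keep_rate) for exp in experts]
        stacked = torch.stack(deltas)
        # Elect: dominant sign per parameter, taken from the summed deltas.
        elected_sign = torch.sign(stacked.sum(dim=0))
        agree = torch.sign(stacked) == elected_sign
        # Fuse: average the sign-consistent deltas, scale by lam, add to base.
        fused = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
        merged[name] = base_w + lam * fused
    return merged
```

In this sketch, `base` is the pretrained model's state dict and `experts` is a list of fine-tuned state dicts with matching keys; the merged state dict can then be loaded back into the base architecture.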