Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

16 Jul 2024 | Anton Alexandrov, Veselin Raychev, Mark Niklas Müller, Ce Zhang, Martin Vechev, Kristina Toutanova
The paper "Mitigating Catastrophic Forgetting in Language Transfer via Model Merging" addresses the issue of catastrophic forgetting when adapting large language models (LLMs) to different languages. The authors propose Branch-and-Merge (BAM), a novel adaptation method that iteratively merges multiple models fine-tuned on subsets of the training data. BAM aims to reduce the magnitude of weight changes while maintaining the quality of learning, thereby minimizing forgetting of the source domain while maintaining learning on the target domain. The method is based on the insight that lower magnitude but higher quality weight changes can reduce forgetting while preserving learning. The paper demonstrates the effectiveness of BAM through extensive empirical studies on Bulgarian and German, showing that it can significantly reduce forgetting while matching or improving target domain performance compared to standard continued pretraining and instruction fine-tuning across different model architectures. The authors also explore the impact of approximate experience replay and minimal experience replay on the effectiveness of BAM, finding that high-quality data mixtures are crucial for both effective language adaptation and reducing forgetting. Key contributions of the paper include the introduction of BAM, the development of a high-quality data mixture for approximate experience replay, and a comprehensive empirical investigation of BAM's effectiveness. The results highlight that BAM can improve benchmark performance in both the target and source languages, demonstrating its potential for practical language adaptation tasks.The paper "Mitigating Catastrophic Forgetting in Language Transfer via Model Merging" addresses the issue of catastrophic forgetting when adapting large language models (LLMs) to different languages. The authors propose Branch-and-Merge (BAM), a novel adaptation method that iteratively merges multiple models fine-tuned on subsets of the training data. BAM aims to reduce the magnitude of weight changes while maintaining the quality of learning, thereby minimizing forgetting of the source domain while maintaining learning on the target domain. The method is based on the insight that lower magnitude but higher quality weight changes can reduce forgetting while preserving learning. The paper demonstrates the effectiveness of BAM through extensive empirical studies on Bulgarian and German, showing that it can significantly reduce forgetting while matching or improving target domain performance compared to standard continued pretraining and instruction fine-tuning across different model architectures. The authors also explore the impact of approximate experience replay and minimal experience replay on the effectiveness of BAM, finding that high-quality data mixtures are crucial for both effective language adaptation and reducing forgetting. Key contributions of the paper include the introduction of BAM, the development of a high-quality data mixture for approximate experience replay, and a comprehensive empirical investigation of BAM's effectiveness. The results highlight that BAM can improve benchmark performance in both the target and source languages, demonstrating its potential for practical language adaptation tasks.