16 Jul 2024 | Anton Alexandrov, Veselin Raychev, Mark Niklas Müller, Ce Zhang, Martin Vechev, Kristina Toutanova
This paper introduces Branch-and-Merge (BAM), a method for mitigating catastrophic forgetting during language transfer. BAM iteratively merges multiple models, each fine-tuned on a subset of the training data, which reduces the magnitude of weight changes while preserving learning quality and thereby minimizes the loss of previously learned capabilities. Merging is implemented with linear and spherical interpolation of model weights, and a data mix serving as approximate experience replay helps retain base-model skills that are crucial for downstream tasks. The approach is validated on language transfer to Bulgarian and German, where BAM outperforms standard continued pretraining and instruction fine-tuning, yielding notable improvements on both the source and target languages while reducing forgetting and improving learning efficiency. These results suggest BAM's potential for broader application in language transfer and domain adaptation.
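To make the merging step concrete, below is a minimal sketch of linear and spherical interpolation over PyTorch state dicts. The function names, the pairwise left-fold, and the uniform running-average weighting are illustrative assumptions for this summary, not the paper's released implementation.

```python
import torch

def lerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Linear interpolation between two weight tensors."""
    return (1.0 - t) * a + t * b

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation, treating each tensor as one flat vector.
    Falls back to lerp when the two tensors are nearly colinear."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    cos_theta = torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm() + eps)
    theta = torch.acos(cos_theta.clamp(-1.0, 1.0))
    if theta.abs() < eps:  # angle ~ 0: slerp degenerates to lerp
        return lerp(a, b, t)
    sin_theta = torch.sin(theta)
    coeff_a = torch.sin((1.0 - t) * theta) / sin_theta
    coeff_b = torch.sin(t * theta) / sin_theta
    return (coeff_a * a_flat + coeff_b * b_flat).reshape(a.shape).to(a.dtype)

def merge_state_dicts(branches: list[dict], spherical: bool = False) -> dict:
    """Merge fine-tuned branch checkpoints by a pairwise left-fold.
    With lerp, the running-average weights give every branch equal weight;
    with slerp, the fold is only an approximation of a joint merge."""
    interp = slerp if spherical else lerp
    merged = branches[0]
    for i, other in enumerate(branches[1:], start=2):
        t = 1.0 / i  # running average: i-th branch gets weight 1/i
        merged = {k: interp(merged[k], other[k], t) for k in merged}
    return merged
```

In a BAM-style loop, each iteration would branch the current model, fine-tune every branch on its own slice of the training data (with the replay-oriented data mix), merge the resulting checkpoints with a routine like `merge_state_dicts`, and use the merged model as the starting point for the next iteration.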