Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models

19 Jan 2024 | Terra Blevins, Tomasz Limisiewicz, Suchin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, Luke Zettlemoyer
This paper introduces Cross-lingual Expert Language Models (X-ELM), a method to mitigate the "curse of multilinguality" in multilingual language models (LMs). X-ELM addresses inter-language competition for model parameters by training separate language models on subsets of a multilingual corpus, allowing each expert to specialize in different languages while the full set remains effective as a multilingual ensemble. The paper proposes x-BTM, an extension of the Branch-Train-Merge (BTM) paradigm, which improves on existing BTM techniques by introducing balanced clustering based on typological similarity and Hierarchical Multi-Round (HMR) training for efficiently training new experts on unseen languages.

X-ELMs are trained on 20 languages, including 4 unseen ones, with up to 21 billion training tokens. Experiments show that X-ELMs outperform jointly trained multilingual models across all considered languages and that these gains transfer to downstream tasks, including in-context learning, without disproportionately benefiting high-resource languages over low-resource ones. X-ELMs also offer practical advantages: new experts can be added iteratively without catastrophic forgetting, and asynchronous training reduces hardware requirements. Overall, the work shows that X-ELMs are a more efficient and effective approach to multilingual modeling than dense models.
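To make the ensembling idea concrete, the sketch below shows one plausible way to combine cross-lingual experts at inference time: each expert produces a next-token distribution, and the distributions are mixed with weights derived from how well each expert fits the current context. This is a minimal illustration, not the paper's released implementation; the function name, the softmax-over-log-likelihood weighting, and the temperature parameter are all assumptions for the example.

```python
# Illustrative sketch of ensembling cross-lingual expert LMs (assumed
# weighting scheme, not the authors' exact method): mix per-expert
# next-token distributions using a posterior over experts computed from
# each expert's log-likelihood on the context.
import numpy as np

def ensemble_next_token_probs(expert_probs, context_loglikes, temperature=1.0):
    """Mix per-expert next-token distributions.

    expert_probs:     (n_experts, vocab_size) next-token probabilities.
    context_loglikes: (n_experts,) log-likelihood each expert assigns to the
                      current context; better-fitting experts get more weight.
    """
    scores = np.asarray(context_loglikes, dtype=float) / temperature
    weights = np.exp(scores - scores.max())   # stable softmax over experts
    weights /= weights.sum()                  # posterior over experts
    mixed = weights @ np.asarray(expert_probs)  # weighted mixture of distributions
    return mixed / mixed.sum()

# Toy example: 3 experts over a 5-token vocabulary.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=3)       # one distribution per expert
loglikes = np.array([-12.0, -4.5, -9.0])        # expert 2 fits the context best
print(ensemble_next_token_probs(probs, loglikes))
```

In this kind of setup, an expert whose training cluster matches the input language dominates the mixture, while poorly matched experts contribute little, which is consistent with the paper's framing of specialization plus ensemble-level multilinguality.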