15 Mar 2024 | Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
This paper introduces MYTE, a morphology-driven byte encoding method that improves multilingual language modeling by creating more equitable text representations across languages. The method encodes text using morphemes rather than characters, since morpheme counts are far more balanced across languages than character or byte counts. MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This leads to better multilingual language model performance and reduces the perplexity gap between languages.
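The disparity the paper targets is easy to reproduce: under plain UTF-8, the same idea costs very different byte budgets depending on script. A minimal illustration (the Hindi word here is my own example, not drawn from the paper):

```python
# UTF-8 alone yields very different byte counts for words of comparable length:
# Latin characters take 1 byte each, Devanagari characters take 3 bytes each.
english = "hello"   # 5 characters, Latin script
hindi = "नमस्ते"      # 6 characters (codepoints), Devanagari script

print(len(english.encode("utf-8")))  # 5 bytes
print(len(hindi.encode("utf-8")))    # 18 bytes
```

A byte-level model therefore spends roughly three times as many steps per character on Devanagari text, which is the kind of imbalance MYTE's morpheme-sized segments are designed to even out.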
The paper discusses the challenges of multilingual language modeling, particularly the bias towards high-resource languages in current text encoding methods. It proposes a new encoding paradigm that assigns consistent segment sizes across languages, improving fairness and efficiency. The method is based on unsupervised morphological segmentation and uses a multilingual morpheme inventory derived from Wikipedia lexicons.
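To make the encoding paradigm concrete, here is a toy sketch of the general idea: segment a word greedily against a morpheme inventory, then emit a short fixed-size code for each known morpheme, falling back to raw UTF-8 bytes otherwise. The inventory, the greedy matcher, and the two-byte code scheme are all illustrative assumptions; MYTE's actual inventory comes from unsupervised morphological analysis of Wikipedia lexicons and uses its own byte-range layout.

```python
# Toy illustration of morpheme-driven byte encoding (NOT MYTE's real scheme).
TOY_INVENTORY = {"un": 0, "break": 1, "able": 2, "ing": 3}

def segment(word: str, inventory: dict) -> list:
    """Greedy longest-match segmentation into known morphemes or single chars."""
    segments, i = [], 0
    while i < len(word):
        match = None
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in inventory:
                match = word[i:j]
                break
        if match is None:
            match = word[i]  # unknown material falls back to one character
        segments.append(match)
        i += len(match)
    return segments

def encode(word: str, inventory: dict) -> bytes:
    out = []
    for seg in segment(word, inventory):
        if seg in inventory:
            # Hypothetical 2-byte morpheme code: a lead byte outside valid
            # UTF-8 sequences (0xF8) followed by the morpheme's inventory ID.
            out.append(bytes([0xF8, inventory[seg]]))
        else:
            out.append(seg.encode("utf-8"))  # fallback: plain UTF-8 bytes
    return b"".join(out)

print(segment("unbreakable", TOY_INVENTORY))       # ['un', 'break', 'able']
print(len(encode("unbreakable", TOY_INVENTORY)))   # 6 bytes vs. 11 in UTF-8
```

Because every morpheme costs the same small number of bytes regardless of script, a scheme like this equalizes sequence lengths across languages in the way the paper describes.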
The paper evaluates the effectiveness of MYTE in creating equitable text representations and its applicability to multilingual language modeling across 99 typologically diverse languages. The results show that MYTE improves language modeling performance, especially for low-resource and non-Latin-script languages, and provides efficiency benefits over traditional byte-level models. These gains hold at scale, with faster inference and stronger performance on downstream tasks.
The paper also discusses related work in fair representation across languages and tokenization-free language modeling. It highlights the importance of considering a wide range of languages and scripts when constructing morpheme inventories and notes the limitations of the method, including its dependence on data and the potential for over-segmentation in some languages.
Overall, MYTE bridges the gap in encoding efficiency between high- and low-resource languages, benefiting all 99 analyzed languages. The method is more efficient than traditional byte-level models and provides better performance on downstream tasks, particularly for low-resource languages. The paper concludes that MYTE is a fairer and more efficient byte-level representation for multilingual language modeling.