MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

15 Mar 2024 | Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
The paper introduces MYTE (Morphology-Driven Byte Encoding), a novel byte-level encoding method designed to address the biases and inefficiencies of current text encodings in multilingual language modeling. MYTE aims to provide more equitable segmentations across diverse languages and scripts by encoding text at the level of morphemes, whose inventories are more balanced across languages than characters are. The method relies on unsupervised morphological segmentation with the Morfessor algorithm, trained on lexicons derived from Wikipedia articles in 99 languages (a minimal code sketch of this pipeline follows the list of contributions below).

Key contributions of MYTE include:

1. **Equitable Segmentation**: MYTE produces shorter encodings for all 99 analyzed languages, with the largest improvements for non-European languages and non-Latin scripts.
2. **Improved Language Modeling Performance**: MYTE improves multilingual language models, particularly for low-resource and non-Latin-script languages, while reducing inference costs.
3. **Efficiency Benefits**: MYTE offers faster inference and better compression rates than traditional byte-level encodings such as plain UTF-8.
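The paper's actual codepage construction is more elaborate, but the overall pipeline can be sketched. Below is a minimal, illustrative Python sketch, assuming: the real `morfessor` package (its standard Morfessor 2.0 `BaselineModel` API) for unsupervised segmentation, a hypothetical lexicon file `wiki_lexicon.txt`, and a deliberately simplified flat two-byte codepage drawn from byte values that valid UTF-8 never uses (0xF8 and above). The actual MYTE layout described in the paper is a structured multi-byte scheme, not this flat table.

```python
# Illustrative sketch of a MYTE-style pipeline, not the paper's exact codepage.
# Hypothetical assumptions: the lexicon file "wiki_lexicon.txt", a flat
# two-byte codepage of at most 1024 morphemes, and lead bytes 0xF8-0xFB
# (valid UTF-8 never uses byte values >= 0xF8, so codes cannot collide
# with fallback bytes).
from collections import Counter

import morfessor  # pip install morfessor

# 1) Train an unsupervised morphological segmenter (standard Morfessor 2.0 API).
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(list(io.read_corpus_file("wiki_lexicon.txt")))
model.train_batch()

with open("wiki_lexicon.txt", encoding="utf-8") as f:
    words = f.read().split()

def top_morphemes(words, k=1024):
    """Collect the k most frequent morphemes over Viterbi segmentations."""
    counts = Counter()
    for word in words:
        segments, _cost = model.viterbi_segment(word)
        counts.update(segments)
    return [m for m, _ in counts.most_common(k)]

def build_codepage(morphemes):
    """Assign each morpheme a short 2-byte code; the real MYTE codepage is a
    structured multi-byte layout, this flat table is only for illustration."""
    assert len(morphemes) <= 4 * 256, "flat codepage holds at most 1024 codes"
    return {m: bytes([0xF8 + i // 256, i % 256]) for i, m in enumerate(morphemes)}

def encode(text, table):
    """Greedy longest-match morpheme encoding with plain UTF-8 fallback."""
    out = bytearray()
    longest = max(map(len, table), default=0)
    i = 0
    while i < len(text):
        for j in range(min(longest, len(text) - i), 0, -1):
            code = table.get(text[i:i + j])
            if code is not None:
                out += code
                i += j
                break
        else:  # no morpheme matched: emit the raw UTF-8 bytes of one character
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)

table = build_codepage(top_morphemes(words))
print(len("internationalization".encode("utf-8")),
      len(encode("internationalization", table)))
```

The property the sketch preserves is the one the paper exploits: frequent morphemes collapse to short byte codes, while unseen text degrades gracefully to plain UTF-8.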
The paper evaluates MYTE on the Flores 200 corpus, demonstrating more balanced sequence lengths and compression rates across languages: MYTE outperforms vanilla UTF-8, character-level, and subword tokenization in both equitability and compression. In addition, the proposed MyT5 model, trained with MYTE, outperforms the comparable ByT5 model in language modeling quality and efficiency.

The authors also discuss the limitations of their approach, such as the dependence of the learned segmentations on the quality of the underlying corpus and the potential for over-segmentation in some languages. Despite these limitations, MYTE is a significant step towards fairer and more efficient multilingual language modeling.
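To make the equitability comparison concrete, here is a hypothetical helper in the spirit of the paper's Flores 200 analysis (not their evaluation code): given parallel sentences, it compares per-language byte counts under plain UTF-8 against any alternative encoder, such as the `encode()` sketch above. The function name and report fields are my own.

```python
# Hypothetical parity check: compare sequence lengths across languages for
# parallel (translated) sentences under UTF-8 vs. an alternative encoder.
from typing import Callable, Dict, List

def parity_report(
    parallel: Dict[str, List[str]],
    encoder: Callable[[str], bytes],
) -> Dict[str, Dict[str, float]]:
    """parallel maps a language code to sentences that are mutual translations.
    Returns average bytes per sentence under UTF-8 and under the encoder,
    plus the compression rate (>1 means the encoder yields shorter input)."""
    report = {}
    for lang, sents in parallel.items():
        utf8_total = sum(len(s.encode("utf-8")) for s in sents)
        enc_total = sum(len(encoder(s)) for s in sents)
        report[lang] = {
            "utf8_bytes_per_sentence": utf8_total / len(sents),
            "encoded_bytes_per_sentence": enc_total / len(sents),
            "compression_rate": utf8_total / enc_total,
        }
    return report

# Usage with toy data (not Flores 200):
# report = parity_report({"eng": ["Hello world."], "ukr": ["Привіт, світе."]},
#                        lambda s: encode(s, table))
```

Closer per-language byte counts on parallel text indicate a more equitable encoding; a large spread reproduces the UTF-8 imbalance the paper sets out to fix.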