ROMANSETU: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization

23 Jun 2024 | Jaavid Akbar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan
The paper introduces RomanSetu, a method for extending the multilingual capabilities of Large Language Models (LLMs) by using romanized text as an interface. An English-centric LLM such as Llama 2 is continually pretrained on romanized text of non-English languages written in non-Roman scripts, and then instruction-tuned on romanized data. Romanization reduces token fertility by 2x-4x, and romanized representations match or outperform native-script representations across NLU, NLG, and machine translation tasks. Romanized embeddings also align more closely with the embeddings of their English translations than native-script embeddings do, suggesting stronger cross-lingual transfer.

The approach is validated on Indian languages spanning multiple language families and scripts. Romanized models are more efficient than native-script models and deliver competitive or superior task performance, with the benefits extending from understanding to generation tasks. The findings position romanization as a practical and efficient way to extend English LLMs to languages that are traditionally underrepresented in NLP and written in non-Roman scripts.
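To make the token-fertility claim concrete, the sketch below compares subword counts for a Devanagari sentence and its romanized form under a Llama-2-style tokenizer. This is a minimal illustration, not the paper's pipeline: the `indic_transliteration` package (ITRANS scheme) stands in for the transliteration model used in the paper, the example sentence is hypothetical, and any compatible tokenizer can be substituted for the gated Llama 2 checkpoint.

```python
# Hedged sketch: comparing token fertility (subword tokens per word) of
# native-script vs. romanized text, the efficiency gap RomanSetu exploits.
# Assumptions: indic_transliteration's rule-based ITRANS scheme approximates
# the paper's romanization step, and any Llama-2-compatible tokenizer works.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
from transformers import AutoTokenizer


def token_fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / max(len(words), 1)


# Hypothetical Hindi sentence in Devanagari (native script).
native = "भारत एक विशाल और विविधतापूर्ण देश है"

# Romanize with a rule-based scheme; the paper uses a learned transliteration
# model, so treat this as an approximation of its romanized input.
romanized = transliterate(native, sanscript.DEVANAGARI, sanscript.ITRANS)

# The official checkpoint is gated; substitute a locally available
# Llama-2-style tokenizer if needed.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print("native fertility:   ", round(token_fertility(tok, native), 2))
print("romanized fertility:", round(token_fertility(tok, romanized), 2))
```

Because the Llama 2 tokenizer largely falls back to byte-level pieces on non-Roman scripts, the native-script fertility is typically several times the romanized one, which is consistent with the 2x-4x reduction the paper reports.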