23 Jun 2024 | Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan
This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages that use non-Roman scripts. The authors propose an approach that utilizes romanized text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. The approach involves continually pretraining an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text reduces token fertility by 2x-4x and matches or outperforms native script representation across various NLU, NLG, and MT tasks. Additionally, the embeddings computed on romanized text exhibit closer alignment with their English translations. The approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP. The code is available on GitHub.
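Token fertility, as used above, is commonly measured as the average number of tokens a tokenizer produces per word. The following is a minimal sketch of that measurement, not the authors' code. It assumes a toy tokenizer that emits one token per UTF-8 byte (the hypothetical `byte_tokenize`), which is not the Llama 2 tokenizer; real subword tokenizers merge bytes into larger units, so absolute numbers will differ. The example word and the names `fertility`, `native`, and `romanized` are illustrative assumptions.

```python
from typing import Callable, List


def byte_tokenize(word: str) -> List[bytes]:
    """Toy tokenizer (an assumption, not Llama 2's): one token per UTF-8 byte."""
    return [bytes([b]) for b in word.encode("utf-8")]


def fertility(text: str, tokenize: Callable[[str], List[bytes]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return sum(len(tokenize(w)) for w in words) / len(words)


# The same Hindi word in Devanagari (six code points) and in romanized form.
native = "\u0928\u092e\u0938\u094d\u0924\u0947"
romanized = "namaste"

native_fertility = fertility(native, byte_tokenize)
romanized_fertility = fertility(romanized, byte_tokenize)
print(native_fertility, romanized_fertility)
```

Under this byte-level assumption the Devanagari form costs more tokens because each of its code points occupies three bytes in UTF-8, while each ASCII letter occupies one. Passing a real tokenizer's encode function as `tokenize` would give the corresponding measurement for that tokenizer.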