Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions

28 Feb 2024 | Kexun Zhang, Yee Man Choi, Zhenqiao Song, Taiqi He, William Yang Wang, Lei Li
This paper introduces LINGOLLM, a training-free approach that enables large language models (LLMs) to process and translate endangered languages that rarely appear in their pre-training data. The key idea is to place linguistic descriptions, such as dictionaries, grammar books, and morphologically analyzed text, in the prompt of an LLM. LINGOLLM first preprocesses the input text with a morphological analyzer and a dictionary to extract morphemes and their glosses. These are then passed to an LLM along with a grammar book to generate a translation in a high-resource language such as English, and the LLM uses this translation for downstream tasks.

LINGOLLM is implemented on top of two models, GPT-4 and Mixtral, and evaluated on five tasks across eight endangered or low-resource languages. Results show that it significantly improves translation performance, raising GPT-4's BLEU score from 0 to 10.5 for 10 language directions. It also raises mathematical reasoning accuracy from 18% to 75% and response selection accuracy from 43% to 63%. The method relies on external linguistic descriptions rather than the internal knowledge of LLMs, so the focus is on how LLMs can use information they do not already know.

The paper also discusses related work, including previous studies on low-resource languages and the challenges of translating truly extinct languages. It highlights the importance of linguistic descriptions in improving LLM performance on endangered languages and the potential of LLMs to make these languages more accessible. The study emphasizes the need for further research and development to extend LINGOLLM to more languages, while acknowledging limitations such as the difficulty of digitizing linguistic descriptions and the potential contamination of data from high-resource languages. The paper concludes that LINGOLLM has significant potential for preserving and promoting endangered languages through improved communication and understanding.
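
The preprocessing step described in the summary (morphological analysis, dictionary lookup, then a prompt that combines glosses with a grammar excerpt) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The lexicon, the suffix table, and the names `analyze`, `gloss_sentence`, and `build_prompt` are all hypothetical, the "language" is invented for the example, and the prompt wording is an assumption. The final call to an LLM is deliberately left out.

```python
from dataclasses import dataclass

# Toy, invented lexicon and suffix table (assumption: NOT data from any real language).
TOY_DICTIONARY = {"kala": "fish", "mina": "eat", "tu": "person"}
TOY_SUFFIXES = {"ri": "PL", "ko": "PST"}


@dataclass
class Morpheme:
    form: str   # surface form of the morpheme
    gloss: str  # dictionary meaning or grammatical tag


def analyze(word: str) -> list[Morpheme]:
    """Hypothetical analyzer: strip at most one known suffix, then look up the stem."""
    for suffix, tag in TOY_SUFFIXES.items():
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in TOY_DICTIONARY:
            return [Morpheme(stem, TOY_DICTIONARY[stem]), Morpheme(suffix, tag)]
    # No suffix matched: gloss the whole word, or "?" if it is unknown.
    return [Morpheme(word, TOY_DICTIONARY.get(word, "?"))]


def gloss_sentence(sentence: str) -> str:
    """Render each word as hyphen-joined glosses, e.g. 'fish-PL'."""
    return " ".join(
        "-".join(m.gloss for m in analyze(word))
        for word in sentence.lower().split()
    )


def build_prompt(sentence: str, grammar_excerpt: str) -> str:
    """Assemble an in-context prompt; the wording here is an assumption."""
    return (
        "Grammar notes:\n"
        f"{grammar_excerpt}\n\n"
        f"Source sentence: {sentence}\n"
        f"Morpheme glosses: {gloss_sentence(sentence)}\n\n"
        "Translate the source sentence into English."
    )


if __name__ == "__main__":
    notes = "Word order is subject, object, verb. -ri marks plural; -ko marks past tense."
    print(build_prompt("tu kalari minako", notes))
```

For the toy sentence `tu kalari minako`, `gloss_sentence` returns `person fish-PL eat-PST`. The string returned by `build_prompt` is what would be sent to an LLM, which is then expected to combine the glosses with the grammar notes to produce a fluent translation.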