Teaching Large Language Models an Unseen Language on the Fly

13 Jun 2024 | Chen Zhang, Xiao Liu, JiuHeng Lin, Yansong Feng
This paper investigates whether large language models (LLMs) can learn a new language on the fly using only prompting. We present ZHUANGBENCH, a research suite for Zhuang, an extremely low-resource language with no existing LLM support. ZHUANGBENCH includes a dictionary, a parallel corpus of 5,000 Zhuang-Chinese sentences, and a translation test set. We introduce DIPMT++, a framework for adapting LLMs to unseen languages through in-context learning (ICL). Using only the dictionary and the 5,000 parallel sentences, DIPMT++ significantly improves GPT-4's performance on Chinese-to-Zhuang and Zhuang-to-Chinese translation, reaching BLEU scores of 16 and 32, respectively. We also validate DIPMT++ on Kalamang, another unseen language, and demonstrate its effectiveness in aiding human translation of completely unseen languages: the framework improves both translation quality and efficiency for human translators, showing potential for preserving linguistic diversity. We evaluate DIPMT++ across various models and find that it outperforms other prompting baselines and smaller models; it also performs well on MTOB, a benchmark for translating between English and Kalamang. We further explore strategies to enhance lexical and syntactic knowledge acquisition in LLMs. Our contributions include ZHUANGBENCH, a challenging benchmark that requires LLMs to translate an unseen language with limited resources, and DIPMT++, an ICL framework for on-the-fly language learning. We also show that DIPMT++ can assist humans in translating unseen languages, which could benefit the preservation of linguistic diversity. Our code and data are publicly available.
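To make the in-context learning setup more concrete, the sketch below shows one plausible way to assemble a translation prompt from dictionary glosses and retrieved parallel sentences. The toy dictionary entries, the word-overlap retriever, and the prompt wording are illustrative assumptions for this sketch, not the authors' actual DIPMT++ implementation or the ZHUANGBENCH data.

```python
# Hypothetical sketch of dictionary-and-exemplar prompting for an unseen
# language, in the spirit of an ICL framework like DIPMT++.
# All data and helper names here are made up for illustration.

from typing import Dict, List, Tuple

# Toy bilingual dictionary: source-language word -> gloss (invented entries).
toy_dictionary: Dict[str, str] = {
    "gou": "I / me",
    "gwn": "eat",
    "haeux": "rice",
}

# Toy parallel corpus: (source sentence, reference translation) pairs.
toy_parallel_corpus: List[Tuple[str, str]] = [
    ("gou gwn haeux", "I eat rice."),
    ("mwngz gwn haeux", "You eat rice."),
]


def retrieve_exemplars(source: str,
                       corpus: List[Tuple[str, str]],
                       k: int = 2) -> List[Tuple[str, str]]:
    """Pick the k parallel sentences sharing the most words with the input.

    A simple lexical-overlap heuristic stands in for whatever exemplar
    selection a real system would use.
    """
    src_words = set(source.split())
    scored = sorted(
        corpus,
        key=lambda pair: len(src_words & set(pair[0].split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(source: str) -> str:
    """Assemble an ICL prompt: word glosses + retrieved exemplars + query."""
    gloss_lines = [
        f"{word}: {toy_dictionary[word]}"
        for word in source.split()
        if word in toy_dictionary
    ]
    exemplar_lines = [
        f"Source: {src}\nTranslation: {tgt}"
        for src, tgt in retrieve_exemplars(source, toy_parallel_corpus)
    ]
    return (
        "Translate the source sentence using the word glosses and examples.\n\n"
        "Word glosses:\n" + "\n".join(gloss_lines) + "\n\n"
        "Examples:\n" + "\n\n".join(exemplar_lines) + "\n\n"
        f"Source: {source}\nTranslation:"
    )


if __name__ == "__main__":
    # The resulting prompt would be sent to an LLM (e.g., GPT-4) to translate.
    print(build_prompt("gou gwn haeux"))
```

The design idea, under these assumptions, is that the dictionary supplies lexical knowledge while the retrieved parallel sentences hint at word order and morphology, so the model can translate a language it has never seen during pretraining.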