**INDICGENBENCH: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages**
**Authors:** Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, Partha Talukdar
**Affiliation:** Google Research India
**Emails:** {hrman, guptanitish, shikharop, dineshtewari, partha}@google.com
**Abstract:**
As large language models (LLMs) gain global adoption, it is crucial for these models to represent the linguistic diversity of the world. India, with its 1.4 billion people and 1369 mother tongues, presents a unique challenge. To address this, we introduce INDICGENBENCH, the largest benchmark for evaluating LLMs on user-facing generation tasks across 29 Indic languages, covering 13 scripts and 4 language families. INDICGENBENCH includes diverse tasks such as cross-lingual summarization, machine translation, and cross-lingual question answering. It extends existing benchmarks through human curation, providing multi-way parallel evaluation data for under-represented Indic languages for the first time. We evaluate various proprietary and open-source LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA, across different settings. The results show that while PaLM-2 models perform well on most tasks, there is a significant performance gap compared to English, highlighting the need for more inclusive multilingual models.
**Key Contributions:**
- Creation and release of INDICGENBENCH, a high-quality text benchmark for diverse language generation tasks in 29 Indic languages.
- Comprehensive evaluation of state-of-the-art LLMs on INDICGENBENCH, revealing significant performance gaps between English and Indic languages.
- Qualitative analysis to identify areas for future research and improvement.
**Experiments and Analysis:**
- **LLM Performance:** Larger models from the same LLM family generally perform better, with PaLM-2-L performing the best on most tasks.
- **Performance Across Language Categories:** There is a significant drop in performance from higher to medium and lower resource languages.
- **In-context Learning:** Increasing the amount of in-context supervision improves performance.
- **Transfer from High-resource Languages:** When evaluating on lower-resource Indic languages, Hindi in-context exemplars transfer better than English ones.
- **Fine-tuning vs. In-context Learning:** Fine-tuning outperforms in-context learning for some tasks, especially for larger models.
- **Tokenizer Analysis:** Token fertility (the average number of subword tokens per word) varies significantly across languages; higher fertility in low-resource languages correlates with weaker performance.
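The token-fertility metric in the tokenizer analysis can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer (e.g. SentencePiece), used only so the example runs without external dependencies.

```python
def toy_tokenize(text: str, max_piece: int = 4) -> list[str]:
    """Greedy fixed-width chunking as a stand-in for a subword tokenizer."""
    pieces = []
    for word in text.split():
        for i in range(0, len(word), max_piece):
            pieces.append(word[i:i + max_piece])
    return pieces

def token_fertility(texts: list[str], tokenize=toy_tokenize) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

# Short words stay whole; long words fragment into many pieces,
# which is the situation many Indic-language scripts face under
# tokenizers trained mostly on English text.
print(token_fertility(["the cat sat on the mat"]))              # 1.0
print(token_fertility(["internationalization characterization"]))  # 4.5
```

A fertility near 1.0 means most words survive as single tokens; a high fertility means the model spends more of its context window and compute per word, one plausible mechanism behind the low-resource performance gap.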
**Conclusion:**
INDICGENBENCH is a comprehensive benchmark for evaluating LLMs on Indic languages, covering a wide range of tasks and languages. It aims to guide the development of more inclusive multilingual models.