26 Feb 2024 | Jay Gala, Thanmay Jayakumar, Jaavid Akter Husain, Aswanth Kumar, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh M. Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan
The paper introduces Airavata, an instruction-tuned Hindi large language model (LLM) built by extending the OpenHathi base model. Airavata is fine-tuned on a diverse set of Hindi instruction-tuning datasets to improve its performance on assistive tasks. The work argues for a dedicated LLM ecosystem for Indian languages, since existing models and datasets focus predominantly on English. Airavata is built from human-curated, permissively licensed datasets, deliberately avoiding data generated by proprietary models such as GPT-4 in order to sidestep licensing restrictions and keep the pipeline sustainable.

The model is evaluated on a range of NLP benchmarks, including native Hindi test sets and translated versions of English benchmarks, covering both natural language understanding (NLU) and natural language generation (NLG). Airavata outperforms the OpenHathi base model on most tasks, particularly NLU, while leaving room for improvement on NLG. In human evaluations, it produces natural-sounding Hindi but still trails GPT-4 in instruction-following and content quality. On toxicity and misinformation, Airavata performs reasonably well at recognizing hate speech and handling factual questions, though further improvement is needed.

The authors release their datasets and resources to support further research on Hindi LLMs, emphasizing the importance of diverse instruction-tuning datasets and high-quality foundational models for Indian languages, and underscoring the need for continued work on cross-lingual alignment and on improving the performance of Hindi LLMs.
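As a rough illustration of the instruction-tuning recipe described above, the sketch below fine-tunes a causal LM on a Hindi instruction dataset using LoRA adapters via Hugging Face `transformers` and `peft`. The base-model id (`sarvamai/OpenHathi-7B-Hindi-v0.1-Base`), the local dataset file, the prompt template, and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal LoRA instruction-tuning sketch. Model id, data file, template,
# and hyperparameters are assumptions, not the paper's exact setup.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base_id = "sarvamai/OpenHathi-7B-Hindi-v0.1-Base"  # assumed HF id of the base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM"))

# Hypothetical local JSON file with "instruction" / "output" fields.
data = load_dataset("json", data_files="hindi_instructions.json")["train"]

def to_text(example):
    # Simple instruction template; the actual template may differ.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['output']}"}

tokenized = (data.map(to_text)
                 .map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                      remove_columns=data.column_names + ["text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hindi-sft", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```

LoRA keeps the base weights frozen and trains only small adapter matrices, which matches the resource-conscious, reproducible setup the paper advocates; the specific rank and target modules here are placeholder choices.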
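For the evaluation side, a hedged inference sketch is shown below: it loads the released checkpoint and generates a response to a Hindi instruction. The model id `ai4bharat/Airavata`, the prompt template, and the decoding settings are assumptions for illustration.

```python
# Minimal inference sketch for an instruction-tuned Hindi model.
# The model id and prompt template are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai4bharat/Airavata"  # assumed published checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# "What is the capital of India?" in the assumed instruction format.
prompt = "### Instruction:\nभारत की राजधानी क्या है?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```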