[slides and audio] IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

The paper introduces INDICVOICES, a comprehensive dataset of natural and spontaneous speech from 16,237 speakers across 145 Indian districts and 22 languages. The dataset includes 7,348 hours of read (9%), extempore (74%), and conversational (17%) audio, with 1,639 hours already transcribed. The authors detail their process of capturing cultural, linguistic, and demographic diversity, including standardized protocols, centralized tools, engaging questions, prompts, and conversation scenarios. They also outline a robust quality control mechanism and transcription guidelines to ensure data quality. The dataset is used to build IndicASR, the first ASR model supporting all 22 languages listed in the Indian Constitution. The paper aims to provide an open-source blueprint for data collection in multilingual regions, with all materials, tools, and guidelines made publicly available. The authors discuss the challenges and solutions in collecting diverse and high-quality data, emphasizing the importance of inclusive and representative datasets for improving speech recognition systems in India.
