INDICVOICES: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

4 Mar 2024 | Tahir Javed*αβ, Janki Atul Nawaleα, Eldho Ittan Georgeα, Sakshi Joshiαβ, Kaushal Santosh Bhogaleαβ, Deovrat Mehendaleα, Ishvinder Virender Sethiα, Aparna Ananthanarayananα, Hafsa Faquihα, Pratiti Palitα, Sneha Ravishankarα, Saranya Sukumaranα, Tripura Panchagnulaα, Sunjay Muraliα, Kunal Sharad Gandhiα, Ambujavalli Rα, Manickam K Mα, C Venkata Vaijayanthiα, Krishnan Srinivasa Raghavan Karunganniα, Pratyush Kumarβγ, Mitesh M Khapraαβ
The paper introduces INDICVOICES, a dataset of natural and spontaneous speech from 16,237 speakers across 145 Indian districts and 22 languages. The dataset contains 7,348 hours of read (9%), extempore (74%), and conversational (17%) audio, of which 1,639 hours have already been transcribed. The authors describe how they captured cultural, linguistic, and demographic diversity through standardized protocols, centralized tools, and a repository of engaging questions, prompts, and conversation scenarios. They also outline the quality control mechanism and transcription guidelines used to ensure data quality. The authors use the dataset to build IndicASR, which they describe as the first ASR model supporting all 22 languages listed in the Indian Constitution. The paper aims to provide an open-source blueprint for data collection in multilingual regions, with all materials, tools, and guidelines made publicly available. The authors also discuss the challenges they faced in collecting diverse, high-quality data and the solutions they adopted, emphasizing the importance of inclusive and representative datasets for improving speech recognition systems in India.
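
To make the reported proportions concrete, the short sketch below derives approximate hours per speech type from the figures quoted in the summary. The variable names are illustrative and do not come from the paper, and because the published percentages are rounded, the per-category hours are only approximations.

```python
# Figures quoted in the summary above.
TOTAL_HOURS = 7348
TRANSCRIBED_HOURS = 1639
SPEAKERS = 16237

# Rounded percentage share of each speech type (as reported).
SHARE_PERCENT = {"read": 9, "extempore": 74, "conversational": 17}

# Approximate hours per speech type; inexact because the shares are rounded.
hours_by_type = {
    kind: TOTAL_HOURS * pct / 100 for kind, pct in SHARE_PERCENT.items()
}

# Fraction of the collected audio that has been transcribed so far.
transcribed_share = TRANSCRIBED_HOURS / TOTAL_HOURS

# Average amount of audio per speaker, in minutes.
minutes_per_speaker = TOTAL_HOURS * 60 / SPEAKERS

if __name__ == "__main__":
    for kind, hours in hours_by_type.items():
        print(f"{kind}: about {hours:.0f} hours")
    print(f"transcribed: {transcribed_share:.1%} of all audio")
    print(f"average per speaker: about {minutes_per_speaker:.0f} minutes")
```

Under these figures, extempore speech accounts for most of the audio, and roughly a fifth of the collection has been transcribed so far.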