February 12, 2024 | Shivalika Singh, Freddie Vargas, Daniel D'souza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O'Mahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergin, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaei, Sara Hooker
The Aya Dataset is an open-access collection of multilingual instruction-following data, developed to narrow the language gap in AI research. It contains 204,114 human-curated annotations in 65 languages, contributed by fluent speakers worldwide. It is complemented by the Aya Collection, which contains 513 million instances across 114 languages, derived from 44 multilingual templated datasets and 19 translated datasets, all with licenses that permit redistribution. The Aya Evaluation Suite provides a framework for assessing the quality of multilingual open-ended generation. The open-source Aya Annotation Platform supported data collection and annotation by 2,997 contributors from 119 countries. The dataset and collection are released under the permissive Apache 2.0 license, so researchers can use them to advance multilingual models and applications.
The data covers tasks such as text classification, natural language generation, and question answering. The project takes a participatory approach to research so that the data reflects a wide range of languages and cultural contexts, and it addresses challenges in data quality, annotation consistency, and language representation. Together, the Aya Dataset and Collection are intended to support the development of more inclusive and effective multilingual AI systems.