9 Feb 2024 | Shivalika Singh, Freddie Vargas, Daniel D'souza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O'Mahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzeminski, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, and Sara Hooker
The Aya Dataset is an open-access collection designed to bridge the language gap in instruction fine-tuning (IFT) for natural language processing (NLP). The primary goal is to create a human-curated instruction-following dataset spanning 65 languages, addressing the lack of diverse and representative datasets in existing resources. The project involved collecting natural instances of instructions and completions from fluent speakers of various languages, resulting in the largest human-annotated multilingual IFT dataset to date, containing 204,114 instances.
Key contributions of the Aya initiative include:
1. **Aya Annotation Platform (Aya UI)**: A robust annotation tool supporting 182 languages, including dialects, designed to facilitate high-quality multilingual data collection.
2. **Aya Dataset**: The largest human-annotated multilingual IFT dataset, covering 65 languages.
3. **Aya Collection**: An extensive collection of 513 million instances across 114 languages, including 44 monolingual and multilingual templated datasets and 19 translated datasets.
4. **Aya Evaluation Suite**: A diverse evaluation suite for multilingual open-ended generation quality, consisting of 250 human-written prompts for 7 languages, 200 automatically translated prompts for 101 languages, and human-edited prompts for 6 languages.
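To make the structure of the Aya Dataset concrete, the sketch below models a single instruction–completion instance. The field names (`inputs`, `targets`, `language`, `language_code`, `annotation_type`) mirror the columns published with the dataset, but treat them as an assumption rather than a definitive schema:

```python
# Minimal sketch of one instance in a human-annotated multilingual IFT
# dataset such as Aya. Field names are assumed to follow the dataset's
# published columns; verify against the released data before relying on them.

from dataclasses import dataclass


@dataclass
class IFTInstance:
    inputs: str           # instruction/prompt written by a fluent speaker
    targets: str          # human-written completion for that instruction
    language: str         # human-readable language name, e.g. "French"
    language_code: str    # language code, e.g. "fra"
    annotation_type: str  # e.g. original annotation vs. re-annotation


# Hypothetical example instance for illustration only.
example = IFTInstance(
    inputs="Translate 'good morning' into French.",
    targets="Bonjour.",
    language="English",
    language_code="eng",
    annotation_type="original-annotations",
)
```

Keeping the completion (`targets`) separate from the prompt (`inputs`), with explicit language metadata per instance, is what lets a single dataset mix 65 languages while remaining filterable by language for training or evaluation.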
The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. The project aims to reduce linguistic inequality by providing a comprehensive and diverse dataset for training multilingual models, ensuring that models can better represent and respond to instructions in a wide range of languages.