SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages


31 May 2024 | Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine De Kock, Genet Shanko Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Hailegnaw Getaneh Tilaye, Krishnapriya Vishnubhotla, Genta Winata, Seid Muhie Yimam, Saif M. Mohammad
SemRel2024 is a collection of semantic textual relatedness datasets for 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages belong to five distinct language families and are predominantly spoken in Africa and Asia, regions with limited NLP resources. Each dataset consists of sentence pairs annotated by native speakers with scores representing the degree of semantic relatedness between the two sentences. The scores were obtained with Best-Worst Scaling (BWS), a comparative annotation framework known to avoid common limitations of traditional rating-scale annotation; the resulting datasets cover a wide range of relatedness scores, and the annotation process yields high reliability. The paper describes the data collection and annotation processes, the challenges involved, baseline experiments, and the datasets' impact and utility in NLP. The datasets are publicly released as part of a shared task to promote research in semantic relatedness, particularly for low-resource languages, and can be used for NLP tasks such as evaluating sentence representation methods, question answering, and summarization.
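
In BWS annotation, each annotator sees small tuples of items (here, sentence pairs) and marks the most and least related item in each tuple; an item's score is then derived from how often it was chosen as best versus worst. Below is a minimal, illustrative Python sketch of this standard aggregation step, not the paper's exact pipeline: the 4-item tuples, field names, and rescaling to [0, 1] are assumptions made for the example.

```python
from collections import defaultdict

def bws_scores(annotations):
    """Aggregate Best-Worst Scaling judgments into per-item scores.

    `annotations` is a list of dicts, one per annotated tuple, e.g.
    {"items": ["p1", "p2", "p3", "p4"], "best": "p2", "worst": "p4"}.
    The field names and 4-item tuples are illustrative assumptions,
    not the SemRel2024 annotation format.
    """
    best = defaultdict(int)   # times an item was judged most related
    worst = defaultdict(int)  # times an item was judged least related
    seen = defaultdict(int)   # times an item appeared in any tuple

    for ann in annotations:
        for item in ann["items"]:
            seen[item] += 1
        best[ann["best"]] += 1
        worst[ann["worst"]] += 1

    # Standard BWS score (%best - %worst) lies in [-1, 1];
    # rescale to [0, 1] so that higher means more related.
    return {item: 0.5 * ((best[item] - worst[item]) / n + 1.0)
            for item, n in seen.items()}

# Toy usage: three 4-tuples over five sentence-pair IDs.
anns = [
    {"items": ["p1", "p2", "p3", "p4"], "best": "p1", "worst": "p4"},
    {"items": ["p2", "p3", "p4", "p5"], "best": "p2", "worst": "p5"},
    {"items": ["p1", "p3", "p4", "p5"], "best": "p1", "worst": "p5"},
]
print(bws_scores(anns))
```
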
The paper further discusses the challenges of collecting and annotating data for low-resource languages and presents baseline experiments in monolingual and crosslingual settings; the results show that the datasets are useful for evaluating models in these settings, with high correlation scores for many languages. Acknowledged limitations include the lack of a formal definition of semantic relatedness and the limited number of data sources for some languages, and the authors stress that the goal is to capture common perceptions of semantic relatedness rather than "correct" or "right" rankings. The datasets are publicly available for research on semantic relatedness, low-resource languages, and annotation disagreements, and they constitute a valuable resource for researchers working on semantic relatedness in low-resource languages.
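
Because the baseline results are reported as correlations between system outputs and the gold relatedness scores, a minimal evaluation loop only needs a sentence-pair scorer and a rank correlation. The sketch below assumes Spearman correlation as the metric and scores each pair with cosine similarity over a toy stand-in embedding function; the embedding function, sentence pairs, and gold scores are hypothetical, and the paper's actual baselines rely on trained sentence encoders rather than this toy example.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(sentence_pairs, gold_scores, embed):
    """Spearman correlation between predicted and gold relatedness scores."""
    preds = [cosine(embed(s1), embed(s2)) for s1, s2 in sentence_pairs]
    rho, _ = spearmanr(preds, gold_scores)
    return rho

def dummy_embed(sentence):
    """Toy bag-of-letters 'embedding' so the example runs end to end."""
    vec = np.zeros(26)
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

# Hypothetical sentence pairs with hypothetical gold relatedness in [0, 1].
pairs = [
    ("a cat sat on the mat", "a cat slept on the mat"),
    ("a cat sat on the mat", "the weather is sunny today"),
    ("a cat sat on the mat", "stock markets fell sharply"),
]
gold = [0.9, 0.2, 0.1]
print(evaluate(pairs, gold, dummy_embed))
```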