31 May 2024 | Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine De Kock, Genet Shanko Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Hailegnaw Getaneh Tilaye, Krishnapriya Vishnubhotla, Genta Winata, Seid Muhie Yimam, Saif M. Mohammad
The paper introduces *SemRel*, a collection of semantic textual relatedness datasets for 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages are predominantly spoken in Africa and Asia, regions characterized by limited NLP resources. Each dataset contains sentence pairs annotated by native speakers with scores representing the degree of semantic textual relatedness between the two sentences. The sentence pairs were selected using several methods, including lexical overlap, contiguity, topic coverage, and random selection, to ensure a wide range of relatedness scores. The paper discusses the data collection and annotation processes, the challenges encountered, and baseline experiments conducted in monolingual and cross-lingual settings. The datasets are publicly released to promote research in semantic relatedness, particularly for low-resource languages. The experiments demonstrate the usefulness and potential of the dataset collection for various NLP tasks.