SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages

The paper introduces *SemRel*, a collection of semantic textual relatedness datasets for 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages are predominantly spoken in Africa and Asia, regions characterized by limited NLP resources. Each dataset contains sentence pairs annotated by native speakers with scores representing the degree of semantic textual relatedness between the two sentences. The sentence pairs are selected using several methods, including lexical overlap, contiguity, topic coverage, and random selection, so that the data covers a wide range of relatedness scores. The paper discusses the data collection and annotation processes, the challenges encountered, and baseline experiments conducted in monolingual and cross-lingual settings. The datasets are publicly released to promote research in semantic relatedness, particularly for low-resource languages. The experiments demonstrate the usefulness and potential of the dataset collection in various NLP tasks.
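
As a rough illustration of what selecting pairs by lexical overlap can look like, here is a minimal sketch. It is not the authors' pipeline: the whitespace tokenization, the Dice coefficient, and the function names `dice_overlap` and `rank_pairs_by_overlap` are all assumptions made for this example.

```python
from itertools import combinations


def dice_overlap(a: str, b: str) -> float:
    """Dice coefficient over lowercased whitespace tokens (an assumed measure)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty sentences share nothing measurable
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


def rank_pairs_by_overlap(sentences):
    """Score every unordered sentence pair and sort from most to least overlap."""
    scored = [(dice_overlap(a, b), a, b) for a, b in combinations(sentences, 2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sentences, used only to show the ranking.
    pool = ["the cat sat", "the cat ran", "dogs bark"]
    for score, first, second in rank_pairs_by_overlap(pool):
        print(f"{score:.2f}  {first!r}  {second!r}")
```

Sampling pairs from different bands of such a ranking, rather than only from the top, is one way to obtain candidates that span both low and high relatedness before human annotation.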
