31 Jul 2017 | Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia
SemEval-2017 Task 1 focused on Semantic Textual Similarity (STS), assessing the degree of meaning similarity between sentence pairs across multiple languages. The task covered four languages: Arabic, English, Spanish, and Turkish, with a primary emphasis on multilingual and cross-lingual pairs. The main evaluation combined performance across all language conditions except English-Turkish, which served as a surprise language track. The task attracted strong participation from 31 teams, with 17 teams submitting systems for all language tracks. The best overall system, ECNU, achieved a Pearson correlation of 0.7316, outperforming other top-ranked systems such as BIT, HCTI, MITRE, FCICU, CompiLIG, LIM-LIG, and DT_Team. An analysis of common errors highlighted persistent difficulties with word sense disambiguation, attribute importance, compositional meaning, and semantic blending. The STS Benchmark, a curated selection of English data from previous SemEval STS tasks, was introduced to support ongoing research and provide a standard for evaluating new models. The benchmark was used to evaluate selected participant systems and competitive baselines, showing that while state-of-the-art baselines perform reasonably well, they leave considerable room for improvement. The task's results, and its insights into the limitations of existing models, are valuable for advancing the field of STS.
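The ranking metric throughout is Pearson correlation between system scores and gold similarity annotations on the task's 0-5 scale. As a rough illustration only (not the official evaluation script or any participant system), the sketch below scores a few invented sentence pairs with a toy bag-of-words cosine baseline and evaluates it with Pearson correlation against made-up gold labels; all sentences, labels, and function names are hypothetical.

```python
import math
from collections import Counter

def cosine_bow(sent_a: str, sent_b: str) -> float:
    """Cosine similarity over unweighted bag-of-words counts (toy baseline)."""
    a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical sentence pairs with invented gold labels on the 0-5 STS scale.
pairs = [
    ("A man is playing a guitar.",    "A man plays the guitar.",      4.8),
    ("A dog runs in the park.",       "A cat sleeps on the sofa.",    0.8),
    ("Children are playing soccer.",  "Kids play football outside.",  3.6),
]
system_scores = [5 * cosine_bow(a, b) for a, b, _ in pairs]  # rescale to 0-5
gold_scores = [gold for _, _, gold in pairs]
print(f"Pearson r = {pearson(system_scores, gold_scores):.4f}")
```

Real submissions were scored the same way in principle (Pearson correlation per track, then combined), but against thousands of annotated pairs per language condition rather than a handful of toy examples.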