31 Jul 2017 | Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia
SemEval-2017 Task 1 focused on semantic textual similarity (STS) for multilingual and cross-lingual sentence pairs, with a sub-track on machine translation quality estimation (MTQE). The task attracted 31 teams, 17 of which participated in all language tracks. Its goal was to assess the state of the art in STS, a capability that underpins NLP applications such as machine translation, question answering, and semantic search. The sentence pairs spanned English, Arabic, Spanish, and Turkish, with several tracks involving cross-lingual comparisons. Systems were evaluated by the Pearson correlation between their scores and human judgments, where annotations range from 0 (no meaning overlap) to 5 (meaning equivalence).
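As a concrete illustration of this evaluation protocol, the sketch below computes the Pearson correlation between system-predicted similarity scores and gold human judgments on the 0-5 scale. The scores are made-up toy values, not data from the task.

```python
# Pearson correlation between hypothetical system scores and gold annotations (0-5 scale).
from scipy.stats import pearsonr

gold_scores = [5.0, 3.2, 1.0, 4.4, 0.0]    # hypothetical human annotations
system_scores = [4.8, 2.9, 1.5, 4.0, 0.3]  # hypothetical system predictions

r, p_value = pearsonr(system_scores, gold_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```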
The task included several tracks, covering pairs built from SNLI sentences, a sub-track based on WMT quality estimation data, and a surprise language track. Data was prepared by translating SNLI sentences into Arabic, Spanish, and Turkish, while the MTQE sub-track drew on WMT quality estimation data. Annotations were collected via crowdsourcing, with expert annotators used for some tracks. The STS Benchmark was introduced as a shared training and evaluation set, carefully selected from the English STS data of 2012-2017 to support ongoing research.
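Gold scores on the 0-5 scale are typically derived by aggregating several crowd annotations per sentence pair. The snippet below sketches a simple mean aggregation over hypothetical raw annotations; the organizers' actual aggregation and quality-control steps may differ.

```python
# Aggregate per-pair crowd annotations (rows = pairs, columns = annotators, 0-5 scale)
# into a single gold score by averaging. Values are illustrative only.
import numpy as np

raw_annotations = np.array([
    [5, 5, 4, 5, 4],
    [2, 3, 3, 2, 2],
    [0, 1, 0, 0, 1],
])

gold_scores = raw_annotations.mean(axis=1)
print(gold_scores)  # -> [4.6 2.4 0.4]
```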
The task highlighted common errors made by existing models, emphasizing the need for more robust methods. The best-performing systems combined feature engineering with deep learning and achieved high correlations with human judgments. The results showed that while English STS models performed well, multilingual and cross-lingual pairs posed greater challenges. The STS Benchmark provides a standardized dataset for evaluating new methods, and the results demonstrate the importance of fine-grained semantic similarity distinctions for downstream applications. The task also underscored the need for further research into models that can handle new languages and limited training data.
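For context on what a minimal reference point looks like next to these feature-engineered and neural systems, the sketch below implements a simple unsupervised bag-of-words cosine baseline rescaled to the 0-5 range. It is an illustration only, not a reconstruction of any participating system or of the official task baseline.

```python
# A simple unsupervised STS baseline: cosine similarity over bag-of-words
# count vectors, rescaled to the task's 0 (unrelated) to 5 (equivalent) range.
from collections import Counter
import math

def cosine_bow(s1: str, s2: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def sts_score(s1: str, s2: str) -> float:
    """Map cosine similarity onto the 0-5 STS scale."""
    return 5.0 * cosine_bow(s1, s2)

print(sts_score("A man is playing a guitar", "A man plays the guitar"))
```

Scores produced this way can be compared against gold annotations with the Pearson correlation shown earlier, which is how such baselines are ranked alongside participating systems.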