SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity


June 7-8, 2012 | Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre
This paper presents the results of the Semantic Textual Similarity (STS) pilot task at SemEval-2012. STS measures the degree of semantic equivalence between two texts, a capability directly applicable to NLP tasks such as machine translation, summarization, and question answering. The training data consisted of 2000 sentence pairs drawn from existing paraphrase datasets and machine translation evaluation resources; the test data comprised 2000 sentence pairs from the same sources, plus surprise datasets drawn from a different machine translation evaluation corpus and a lexical resource mapping exercise. Human judges recruited through Amazon Mechanical Turk rated the similarity of each sentence pair on a 0-5 scale, with inter-annotator agreement reaching Pearson correlations of around 90%. Thirty-five teams participated, submitting 88 runs; the best systems achieved Pearson correlations above 80%, significantly outperforming a simple lexical baseline. The paper discusses the evaluation metrics used and highlights the need for further research to refine the STS task and to develop a more satisfactory evaluation metric.
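Since evaluation in the task is based on Pearson correlation between system output and the 0-5 gold judgments, the sketch below shows how a run might be scored. The scores and variable names are illustrative placeholders, not values from the task data.

```python
# Minimal sketch of scoring an STS run, assuming gold and system
# similarity scores on the 0-5 scale described above. The sample
# values below are invented for illustration only.
from scipy.stats import pearsonr

# Hypothetical gold-standard similarity judgments (0 = unrelated, 5 = equivalent)
gold_scores = [4.8, 2.5, 0.4, 3.9, 1.2]

# Hypothetical scores produced by a participating system for the same pairs
system_scores = [4.5, 3.0, 0.9, 3.6, 1.0]

correlation, _p_value = pearsonr(gold_scores, system_scores)
print(f"Pearson correlation: {correlation:.3f}")
```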