June 7-8, 2012 | Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre
The SemEval-2012 Task 6 pilot explored Semantic Textual Similarity (STS), which measures the degree of semantic equivalence between two texts. The task drew 35 teams submitting 88 runs, with data taken from paraphrase datasets, machine translation evaluations, and lexical resources. The training data comprised 2000 sentence pairs; the test data comprised 2000 pairs from the same sources plus two surprise datasets of roughly 400 and 750 pairs. Human judges rated each sentence pair on a 0-5 similarity scale, with inter-annotator agreement around 90% Pearson correlation. The best systems exceeded 80% correlation with the gold standard, well above the simple lexical baseline of 31%.
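As a minimal sketch of how such an evaluation works, the snippet below computes the Pearson correlation between a system's predicted scores and gold 0-5 judgements; the numbers are illustrative, not actual task data.

```python
# Pearson correlation between gold judgements and system scores,
# as used to rank STS systems. Values here are made up for illustration.
from scipy.stats import pearsonr

gold = [4.8, 0.5, 3.2, 2.0, 5.0]    # human 0-5 similarity judgements
system = [4.5, 1.0, 2.8, 2.4, 4.9]  # a system's predicted scores

r, p_value = pearsonr(gold, system)
print(f"Pearson r = {r:.3f}")       # closer to 1.0 means better agreement
```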
STS differs from Textual Entailment (TE) and Paraphrase (PARA) in that it assumes a symmetric, graded notion of equivalence between sentence pairs: TE is directional and PARA is a binary decision, while STS scores run along a continuum. The task aimed to create a unified framework for evaluating semantic components across NLP tasks. The dataset included 1500 pairs each from MSRpar and MSRvid, 1500 from WMT machine translation evaluation data, and 750 from an OntoNotes-WordNet sense mapping. The data was split 50% train and 50% test, with the surprise test datasets drawn from different domains.
Annotation was crowdsourced via Amazon Mechanical Turk and yielded high-quality judgements. Systems were evaluated with three metrics: overall Pearson correlation, Pearson after per-dataset normalization, and a weighted mean of per-dataset correlations. The best results showed high correlation, with the highest per-dataset score on MSRvid (0.88). The baseline system used simple word overlap and scored substantially lower.
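Below is a minimal sketch of a word-overlap baseline in the spirit of the task's official baseline: cosine similarity over bag-of-words vectors, rescaled to the 0-5 range. The whitespace tokenization and the linear rescaling are assumptions, not the task's exact recipe.

```python
# Simple lexical baseline: cosine similarity over token counts,
# mapped onto the 0-5 STS scale. Tokenization/scaling are assumptions.
import math
from collections import Counter

def cosine_overlap(s1: str, s2: str) -> float:
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def sts_score(s1: str, s2: str) -> float:
    return 5.0 * cosine_overlap(s1, s2)  # map [0,1] cosine onto 0-5

print(sts_score("a man is playing a guitar", "a man plays the guitar"))
```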
Participants drew on a wide range of tools and resources, including WordNet, monolingual corpora, and Wikipedia, and machine learning was widely used to combine and tune similarity components (see the sketch below). The task highlighted the need for a more satisfactory evaluation metric and for closer analysis of the task definition. Future work includes analyzing STS scores in relation to paraphrase judgements and establishing an open framework for combining NLP components and similarity algorithms. The dataset is publicly available, and the pilot was a success in both participation and results.
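A minimal sketch of the feature-combination approach many teams adopted: treat individual similarity measures as features and fit a regressor to the gold 0-5 scores. The specific features, the ridge regressor, and all values here are assumptions for illustration, not any particular team's system.

```python
# Combine several similarity signals with a supervised regressor.
# Features, model choice, and data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge

# Each row: [word_overlap, wordnet_sim, char_ngram_sim] for one sentence pair
X_train = np.array([[0.9, 0.8, 0.85],
                    [0.1, 0.2, 0.15],
                    [0.6, 0.5, 0.55]])
y_train = np.array([4.8, 0.5, 3.0])  # gold 0-5 similarity judgements

model = Ridge(alpha=1.0).fit(X_train, y_train)
X_test = np.array([[0.7, 0.6, 0.65]])
print(model.predict(X_test).clip(0.0, 5.0))  # keep output on the 0-5 scale
```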