Automatically Constructing a Corpus of Sentential Paraphrases

The Microsoft Research Paraphrase Corpus (MSRP) is a dataset of 5801 sentence pairs, each labeled by human judges as a paraphrase or not. It was designed to support research in paraphrase identification and generation, and to help establish standards for constructing paraphrase corpora.

The corpus was built from a large database of sentence pairs extracted from topic-clustered news articles. Heuristics based on shared lexical properties, sentence position, and word-based edit distance narrowed the search space, and an SVM-based classifier then selected likely sentence-level paraphrases. Human judges evaluated the selected pairs and confirmed that 67% were semantically equivalent. The resulting corpus therefore mixes paraphrase pairs with near-miss negatives, and its labels reflect semantic equivalence rather than strict bidirectional entailment.

The corpus has been used to evaluate paraphrase recognition algorithms and to compare results across research efforts. It is a valuable resource for researchers in natural language processing, but it is limited in size and coverage and may not be suitable for direct use as a training corpus. The authors hope that others will use the corpus, find it useful, and provide feedback to improve it. The methodology used to create the corpus is adaptable and can be applied to other types of corpora. The authors suggest that future work could create a "virtual paraphrase corpus" by aggregating data from multiple sources to reduce selection bias and improve coverage.
The authors also suggest exploring new methods for collecting paraphrase data, such as using web volunteers to gather colloquial paraphrases.
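
The heuristic filtering step described above (shared lexical properties, sentence position, and word-based edit distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and every threshold value below are assumptions chosen only for illustration, and the SVM classification and human judgment stages are not shown.

```python
def word_edit_distance(a, b):
    """Levenshtein distance computed over word tokens instead of characters."""
    a_tokens = a.lower().split()
    b_tokens = b.lower().split()
    # prev[j] holds the distance between the processed prefix of a and b_tokens[:j].
    prev = list(range(len(b_tokens) + 1))
    for i, word_a in enumerate(a_tokens, start=1):
        curr = [i]
        for j, word_b in enumerate(b_tokens, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def shared_word_count(a, b):
    """Number of distinct lowercased words that appear in both sentences."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def is_candidate(a, b, pos_a, pos_b,
                 min_distance=1, max_distance=8,  # hypothetical thresholds
                 min_shared=3, max_position=3):   # hypothetical thresholds
    """Keep a pair only if it passes all three illustrative heuristics.

    pos_a and pos_b are the 0-based positions of each sentence in its article.
    """
    if pos_a >= max_position or pos_b >= max_position:
        return False  # sentence position: keep only sentences near the top
    if shared_word_count(a, b) < min_shared:
        return False  # shared lexical properties: require some word overlap
    distance = word_edit_distance(a, b)
    # Reject identical sentences (distance 0) and pairs that differ too much.
    return min_distance <= distance <= max_distance
```

A pair such as "The company said profits rose sharply in the third quarter" and "The company reported that profits rose sharply during the third quarter" passes this filter when both sentences open their articles, while two identical sentences are rejected because their edit distance is zero. In the pipeline the summary describes, pairs surviving the heuristics would then go to the classifier and finally to human judges.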

William B. Dolan and Chris Brockett