Automatically Constructing a Corpus of Sentential Paraphrases

William B. Dolan and Chris Brockett
The paper describes the creation of the Microsoft Research Paraphrase Corpus (MSRP), a corpus of 5801 sentence pairs, each labeled by human judges according to whether the two sentences are paraphrases. The corpus was constructed using heuristic extraction techniques together with an SVM-based classifier to select likely paraphrase pairs from a large corpus of news data. The selected pairs were then evaluated by human judges, who confirmed that 67% of them were semantically equivalent. The authors examine the difficulty of defining guidelines for human raters and discuss the limitations of the corpus, such as its small size and the need for more diverse paraphrase pairs. They also propose a "virtual paraphrase corpus" to address the lack of large, publicly available labeled paraphrase corpora. The paper concludes by highlighting the potential of the MSRP for research in paraphrase identification and generation.
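
To make the idea of heuristic extraction concrete, the following is a minimal sketch of a candidate filter that might run before a classifier such as the SVM mentioned above. It is not the authors' method: the summary does not say which heuristics were used, so the word-overlap (Jaccard) test, the length-ratio test, both thresholds, the grouping of sentences into a list, and the function names `tokens`, `is_candidate`, and `candidate_pairs` are all assumptions made for illustration.

```python
import re
from itertools import combinations


def tokens(sentence: str) -> list[str]:
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def is_candidate(a: str, b: str,
                 min_overlap: float = 0.4,
                 min_len_ratio: float = 0.66) -> bool:
    """Hypothetical heuristic (not taken from the paper): keep a pair if the
    sentences share enough words, have similar lengths, and are not the same
    token sequence. Both thresholds are illustrative assumptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb or ta == tb:
        return False
    # Ratio of the shorter sentence length to the longer one, in tokens.
    len_ratio = min(len(ta), len(tb)) / max(len(ta), len(tb))
    # Jaccard similarity over the two sets of distinct words.
    sa, sb = set(ta), set(tb)
    overlap = len(sa & sb) / len(sa | sb)
    return len_ratio >= min_len_ratio and overlap >= min_overlap


def candidate_pairs(sentences: list[str]) -> list[tuple[str, str]]:
    """Return every unordered sentence pair that passes the heuristic filter."""
    return [(a, b) for a, b in combinations(sentences, 2) if is_candidate(a, b)]


if __name__ == "__main__":
    news = [
        "The company reported higher profits on Tuesday.",
        "On Tuesday the company reported higher profits.",
        "Rain is expected in Seattle this weekend.",
    ]
    for first, second in candidate_pairs(news):
        print(first, "<->", second)
```

In a pipeline like the one the paper describes, pairs that survive a cheap filter of this kind would be passed to a trained classifier, and its output would then go to human judges for labeling.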