29 Mar 2019 | Siva Reddy*, Danqi Chen*, Christopher D. Manning
CoQA is a conversational question answering dataset designed to evaluate systems that can engage in natural dialogue. It contains 127,000 questions collected from 8,000 conversations about text passages drawn from seven diverse domains. The questions are conversational, and the answers are free-form text, each with corresponding evidence highlighted in the passage (the format is sketched below). Because later questions depend on the dialogue so far, CoQA introduces challenges such as coreference and pragmatic reasoning that are absent from traditional reading comprehension datasets.

Compared with SQuAD, CoQA's questions are shorter and more varied, and, unlike SQuAD 1.1, the dataset includes unanswerable questions. It exhibits a wide range of linguistic phenomena, including lexical matches, paraphrasing, and pragmatics. Pairing free-form answers with rationales enables evaluation that is both natural and reliable (a simplified version of the word-F1 metric is also sketched below). Five of the seven domains appear in training, while two are reserved for testing, so CoQA supports both in-domain and out-of-domain evaluation.

Strong reading comprehension and conversational models evaluated on CoQA fall well short of humans: the strongest baseline in the paper, a combination of a reading comprehension model and a conversational model, reaches an F1 score of 65.1%, and the best system achieves 65.4%, compared with human performance of 88.8%. CoQA thus serves as a benchmark for conversational question answering and highlights the need for further research in this area.
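To make the conversational structure concrete, here is a minimal sketch of walking the released JSON files and pairing each question with its conversation history. The field names (`story`, `questions`, `answers`, `input_text`, `span_start`, `span_end`) follow the publicly released CoQA files, but treat the exact schema, including the filename, as an assumption to verify against your copy of the data.

```python
import json

# Minimal sketch of iterating over the CoQA JSON release. Each entry holds
# one passage ("story") plus parallel lists of questions and answers; field
# names are assumptions based on the public files.
with open("coqa-train-v1.0.json") as f:
    data = json.load(f)["data"]

for dialog in data[:1]:
    passage = dialog["story"]
    history = []  # earlier (question, answer) turns in this conversation
    for q, a in zip(dialog["questions"], dialog["answers"]):
        question = q["input_text"]
        answer = a["input_text"]  # free-form answer text
        # The highlighted evidence is a character span into the passage;
        # unanswerable turns appear to use span_start == -1 in the release.
        start, end = a["span_start"], a["span_end"]
        rationale = passage[start:end] if start >= 0 else ""
        # A reader must resolve references against `history`: a follow-up
        # like "Where?" only makes sense given the earlier turns.
        print(question, "->", answer, "| evidence:", rationale[:40])
        history.append((question, answer))
```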
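CoQA scores systems with a SQuAD-style, macro-averaged word-level F1 against multiple human references. The sketch below shows the core single-reference word-overlap computation; the official script additionally macro-averages over several human answers per turn, which is omitted here.

```python
from collections import Counter
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the normalization used by SQuAD-style evaluation scripts)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def word_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(word_f1("white and black", "white"))  # 0.5
```

Word-level F1 rewards partial overlap, which is what makes free-form answers gradeable at all: an exact-match metric would unfairly penalize answers that differ from the reference only in phrasing.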