25 Sep 2018 | Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning
HOTPOTQA is a new dataset designed to train question answering (QA) systems to perform complex reasoning and provide explanations for their answers. It contains 113,000 Wikipedia-based question-answer pairs with four key features: (1) questions require reasoning over multiple supporting documents; (2) questions are diverse and not constrained to any pre-existing knowledge base; (3) sentence-level supporting facts are provided for reasoning and explanation; (4) a new type of factoid comparison question is included to test the ability to extract and compare relevant facts. The dataset is challenging for current QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
The dataset was collected through crowdsourcing based on Wikipedia articles, where crowd workers were shown multiple supporting context documents and asked to create questions requiring reasoning about all of them. This ensures that the questions are natural and not designed with any pre-existing knowledge base schema in mind. Additionally, crowd workers were asked to provide the supporting facts used to answer the questions, which are also part of the dataset. The data collection pipeline was carefully designed to ensure high-quality multi-hop questions.
The dataset includes a variety of question types, such as yes/no questions, comparison questions, and questions about entities, locations, events, dates, and numbers. Answer types likewise span entities, dates, numbers, and adjectives. The multi-hop reasoning types include comparing two entities, chain reasoning, and locating the answer entity by satisfying multiple properties simultaneously.
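As a rough illustration of how these question and reasoning types can be inspected, the sketch below loads a training file and tallies reasoning types and yes/no answers. The field names ("question", "answer", "type", "supporting_facts", "context") and the file name are assumptions based on my understanding of the public JSON release and should be verified against the downloaded data.

```python
import json
from collections import Counter

# Minimal inspection sketch; file name and field names are assumed, not confirmed.
with open("hotpot_train_v1.1.json") as f:
    examples = json.load(f)

# Tally reasoning types (e.g. bridge vs. comparison) and yes/no answers.
print(Counter(ex["type"] for ex in examples))
print(Counter(ex["answer"] for ex in examples if ex["answer"] in ("yes", "no")))

# Each supporting fact is assumed to be an (article title, sentence index)
# pair that points into the "context" paragraphs of the same example.
ex = examples[0]
print(ex["question"], "->", ex["answer"])
title_to_sents = {title: sents for title, sents in ex["context"]}
for title, sent_id in ex["supporting_facts"]:
    print(title, ":", title_to_sents[title][sent_id])
```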
The dataset was evaluated in two benchmark settings: the full wiki setting, where the model must find the answer within the entire Wikipedia corpus, and the distractor setting, where the gold paragraphs are mixed with distractor paragraphs. Baseline results fall significantly below human performance, indicating that further technical advances are needed in future work. The dataset also includes a new type of factoid comparison question to test systems' ability to extract and compare various entity properties in text. The dataset is publicly available at https://HotpotQA.github.io.
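For concreteness, here is a minimal sketch of how answers and supporting facts are typically scored in this kind of benchmark: SQuAD-style exact match and token-level F1 for answers, and set-based F1 over (title, sentence index) pairs for supporting facts. The official evaluation script also reports joint metrics, so treat this as an approximation rather than the reference implementation.

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    # Standard SQuAD-style normalization: lowercase, drop punctuation and
    # articles, collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return float(normalize_answer(prediction) == normalize_answer(gold))

def answer_f1(prediction, gold):
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def supporting_fact_f1(pred_sp, gold_sp):
    # Both arguments are collections of (title, sentence index) pairs.
    pred, gold = set(map(tuple, pred_sp)), set(map(tuple, gold_sp))
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```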