DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs


16 Apr 2019 | Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner
DROP is a reading comprehension benchmark that requires discrete reasoning over paragraph content. It consists of 96,567 questions over Wikipedia passages, focusing on numerical reasoning, counting, sorting, and other discrete operations, and it is deliberately more demanding than previous reading comprehension datasets: answering its questions requires understanding paragraph semantics rather than matching a single surface span. The questions were crowdsourced adversarially, so that workers could only submit questions that an existing reading comprehension model answered incorrectly. The best prior systems achieve only 32.7% F1 on the benchmark, while expert human performance is 96.4%; a new model that augments neural reading comprehension with numerical reasoning reaches 47.0% F1.

DROP covers a variety of question types, including addition, subtraction, counting, and comparison, requiring systems to pull information from one or more places in a paragraph and perform discrete operations over it. Many questions involve identifying multiple events, aggregating information, or doing arithmetic. Answers may be text spans, numbers, or dates, and evaluation uses Exact Match alongside a numeracy-focused F1 in which numeric answers must match exactly.
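To make that numeric rule concrete, here is a minimal sketch of a DROP-style scoring function. It is a simplification: the official metric also macro-averages over multi-span answers and sets of validated gold answers, while this version scores a single predicted string against a single gold string. The function names are illustrative, not part of any published API.

```python
import re
import string
from collections import Counter

def _normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    return " ".join(text.split())

def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def drop_style_f1(prediction, gold):
    """Token-level F1 with DROP's numeric rule: if either answer is a
    number, the two must match exactly or the score is zero."""
    pred, ref = prediction.strip(), gold.strip()
    if _is_number(pred) or _is_number(ref):
        same = _is_number(pred) and _is_number(ref) and float(pred) == float(ref)
        return 1.0 if same else 0.0
    pred_toks, ref_toks = _normalize(pred).split(), _normalize(ref).split()
    common = Counter(pred_toks) & Counter(ref_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

# A span answer can earn partial credit; a wrong number earns none.
assert drop_style_f1("Chicago Bears", "the Bears") > 0.0
assert drop_style_f1("21", "22") == 0.0
```

The exact-match requirement for numbers is what makes the metric sensitive to reasoning errors: an answer that is off by one scores zero, even though it may overlap heavily with the gold answer at the token level.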
The benchmark was evaluated with three families of systems: semantic parsers, neural reading comprehension models, and heuristic baselines. The best-performing system, NAQANet (Numerically-Augmented QANet), combines neural reading comprehension with symbolic reasoning to handle questions involving counting, addition, and subtraction, and achieves 47.0% F1, a significant improvement over previous models. DROP thus highlights the joint challenge of paragraph understanding and discrete reasoning, and is intended to encourage research on methods that integrate neural and symbolic reasoning over text. The dataset is publicly available and is accompanied by code for the baseline systems and a leaderboard with a hidden test set.
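As a rough illustration of how NAQANet's symbolic heads work, the sketch below decodes the counting and arithmetic answer types at inference time. It assumes the model has already produced count logits (counting is treated as classification over a small range, 0-9) and per-number sign logits (each number mentioned in the passage is assigned a sign in {-1, 0, +1}, and the signed numbers are summed). The function names and input shapes are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def decode_count(count_logits):
    """Counting as 10-way classification: return the argmax count (0-9)."""
    return int(np.argmax(count_logits))

def decode_arithmetic(passage_numbers, sign_logits):
    """Addition/subtraction as sign assignment over passage numbers.

    passage_numbers: numbers extracted from the paragraph, in order.
    sign_logits: array of shape (len(passage_numbers), 3) with scores
        for (minus, zero, plus) per number.
    """
    signs = np.argmax(sign_logits, axis=-1) - 1  # map {0,1,2} -> {-1,0,+1}
    return float(np.dot(signs, passage_numbers))

# Example: "How many more yards was the longest field goal than the
# shortest?" with passage numbers [43, 22]: signs (+1, -1) give 43 - 22 = 21.
print(decode_arithmetic([43.0, 22.0], np.array([[0.1, 0.2, 2.0],
                                                [1.5, 0.3, 0.1]])))
```

Framing arithmetic as per-number sign prediction keeps the output space small and differentiable to train against, while still producing answers (sums and differences) that never appear verbatim in the passage.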