2019 | Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov
The Natural Questions (NQ) dataset is a new benchmark for question answering (QA) research, consisting of 307,373 training examples, 7,830 development examples with 5-way annotations, and 7,842 test examples with 5-way annotations. Each example pairs a real, anonymized query issued to the Google search engine with a corresponding Wikipedia page. Annotators provide long and short answers based on the content of the page: the long answer is typically a paragraph or table containing the information required to answer the question, while the short answer is a set of entity spans within it or a boolean response. A subset of the development examples additionally carries 25-way annotations, which reveal the variability in human annotations and provide insight into answer quality.
NQ is designed to supply both large-scale end-to-end training data for QA systems and a benchmark for natural language understanding (NLU). Questions span a variety of types, including those asking for a single fact, for multiple entities, or for a categorical noun phrase. Annotation was performed by a pool of 50 annotators: each annotator selects a long answer as the smallest HTML bounding box containing all the information needed to answer the question and, where the page supports one, a short answer consisting of entity spans or a boolean. A sketch of how such an example might be represented follows.
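The sketch below shows one plausible in-code representation of a single NQ example; the field names and token offsets are illustrative assumptions, only loosely modeled on the released JSON format.

```python
# Hypothetical, simplified representation of a single NQ example.
# Field names and offsets are illustrative; the released format differs in detail.
example = {
    "question_text": "when was the last time anyone was on the moon",
    "document_tokens": ["..."],  # tokenized Wikipedia page (elided here)
    "annotations": [
        {
            # Long answer: token span of the smallest HTML bounding box
            # (e.g. a <p> or <table>) containing the needed information.
            "long_answer": {"start_token": 212, "end_token": 310},
            # Short answer: entity span(s) inside the long answer; may be empty.
            "short_answers": [{"start_token": 235, "end_token": 239}],
            # Boolean answers are recorded separately: "YES", "NO", or "NONE".
            "yes_no_answer": "NONE",
        },
        # ...one entry per annotator (5 on the dev and test sets).
    ],
}
```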
Evaluation uses metrics that account for the variability in acceptable answers: a prediction is scored against the full set of reference annotations rather than a single gold answer. Under these metrics, single human annotators achieve high upper bounds on precision and recall, and a "super-annotator" upper bound, obtained by aggregating the 25-way annotations, significantly exceeds the performance of any single annotator.
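A minimal sketch of this style of evaluation, assuming a simplified data layout, exact-match span comparison, and a two-of-five threshold for treating an example as answerable (the official scorer's matching rules are more involved):

```python
def evaluate(predictions, annotations, min_gold=2):
    """Score predictions against multi-way annotations.

    predictions: {question_id: span or None}
    annotations: {question_id: list of 5 spans-or-None, one per annotator}
    An example counts as having a gold answer if at least `min_gold`
    annotators gave a non-null answer; a prediction is correct if it
    matches any annotator's non-null answer.
    """
    tp = fp = fn = 0
    for qid, gold in annotations.items():
        has_gold = sum(a is not None for a in gold) >= min_gold
        pred = predictions.get(qid)
        if pred is not None:
            if has_gold and any(pred == a for a in gold if a is not None):
                tp += 1  # answered, and the answer matches some annotator
            else:
                fp += 1  # answered an unanswerable example, or answered wrongly
        elif has_gold:
            fn += 1      # stayed silent on an answerable example
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Precision is thus computed over the examples the system chooses to answer, and recall over the examples the annotators deemed answerable.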
NQ is used to evaluate the performance of QA systems against two baselines: Document-QA and a custom pipeline (DecAtt + DocReader). Document-QA performs significantly worse than the custom pipeline at long answer identification, and the pipeline achieves the higher F1 scores on both the long and short answer tasks. The paper also presents example questions with their annotations, expert judgments of annotation quality, and statistics drawn from the 25-way annotations. The NQ dataset is a valuable resource for evaluating QA systems and advancing NLU research.