2019 | Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov
The paper introduces the Natural Questions (NQ) corpus, a benchmark dataset for open-domain question answering (QA). NQ consists of real, anonymized, aggregated queries issued to the Google search engine, each paired with a Wikipedia page. Annotators are asked to provide a long answer and a short answer to each question, with the long answer typically being a paragraph from the Wikipedia page. The dataset comprises 307,373 training examples, 7,830 development examples, and 7,842 test examples. The paper describes the annotation process, evaluates annotation quality, and analyzes the variability in human annotations. It also introduces evaluation metrics that are robust to this variability and reports baseline results from competitive methods. The goal is to provide a large-scale dataset for end-to-end training and to drive research in natural language understanding.
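Because the development and test sets carry multiple annotations per question, evaluation has to aggregate over annotators: an example only counts as having a gold answer when enough annotators supplied one, and a prediction is scored against the full set of annotator answers. The sketch below illustrates that style of aggregation; the 5-way annotation layout, the 2-annotator threshold, and the names (AnnotatedExample, score_example) are illustrative assumptions, not the paper's exact specification.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class AnnotatedExample:
        # One entry per annotator; None means the annotator marked "no answer".
        long_answers: List[Optional[str]]

    def has_gold_answer(example: AnnotatedExample, min_annotators: int = 2) -> bool:
        """Treat the example as answerable only if enough annotators gave a long answer."""
        return sum(a is not None for a in example.long_answers) >= min_annotators

    def score_example(example: AnnotatedExample, prediction: Optional[str]) -> str:
        """Label one prediction as tp / fp / fn / tn against the annotator pool."""
        gold = has_gold_answer(example)
        if prediction is not None:
            return "tp" if gold and prediction in example.long_answers else "fp"
        return "fn" if gold else "tn"

    def precision_recall(examples, predictions):
        counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
        for ex, pred in zip(examples, predictions):
            counts[score_example(ex, pred)] += 1
        precision = counts["tp"] / max(counts["tp"] + counts["fp"], 1)
        recall = counts["tp"] / max(counts["tp"] + counts["fn"], 1)
        return precision, recall

In this sketch, precision is computed over all examples where the system predicted an answer, and recall over all examples the annotator pool deemed answerable, which mirrors the robustness-to-variability idea the paper motivates.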