MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

31 Oct 2018 | Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang
The paper introduces MS MARCO, a large-scale machine reading comprehension (MRC) dataset. The dataset comprises 1,010,916 anonymized questions, each paired with a human-generated answer, of which 182,669 answers were completely rewritten by human editors into well-formed, natural-language responses. It also includes 8,841,823 passages extracted from 3,563,555 web documents. The questions are sampled from Bing's search query logs, making them more representative of real-world information needs than questions authored specifically for a benchmark. The dataset is designed to address limitations of existing MRC datasets, such as their lack of real-world complexity and the absence of noisy or conflicting evidence that models must learn to handle. The paper proposes three tasks: predicting whether a question is answerable from a set of retrieved passages and synthesizing the answer, generating a well-formed answer, and ranking retrieved passages for a given query. The authors present benchmarking results with several models and discuss the challenges and future directions for the dataset.
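For the passage ranking task, systems are typically scored with Mean Reciprocal Rank at a cutoff (MRR@10 on the MS MARCO passage ranking leaderboard). The sketch below is an illustrative Python implementation of that metric, not the official evaluation script; the query and passage identifiers are hypothetical.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked_passages: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean Reciprocal Rank at cutoff k.

    ranked_passages: query_id -> passage_ids ordered best-first (system output).
    relevant:        query_id -> set of passage_ids judged relevant.
    Only the first relevant passage within the top k contributes to each query's score.
    """
    if not ranked_passages:
        return 0.0
    total = 0.0
    for qid, ranking in ranked_passages.items():
        rels = relevant.get(qid, set())
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in rels:
                total += 1.0 / rank
                break  # credit only the highest-ranked relevant hit
    return total / len(ranked_passages)


# Toy example: the relevant passage "p3" appears at rank 2, so MRR@10 = 0.5.
print(mrr_at_k({"q1": ["p9", "p3", "p7"]}, {"q1": {"p3"}}))
```

In practice, the official MS MARCO evaluation tools should be used for reporting results; this sketch only illustrates how the reciprocal-rank score rewards placing a relevant passage as close to the top of the ranking as possible.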