31 Oct 2018 | Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang
The MS MARCO dataset is a large-scale, real-world machine reading comprehension (MRC) dataset containing 1,010,916 anonymized questions sampled from Bing's search query logs, each paired with a human-generated answer; 182,669 of the answers are additionally rewritten by humans into well-formed answers. It also includes 8,841,823 passages extracted from 3,563,535 web documents retrieved by Bing. The dataset is designed to address the limitations of existing MRC and question-answering datasets by providing a more realistic and diverse set of questions and answers: because the questions are derived from real user search queries, they are more representative of actual information needs.

The dataset defines three tasks: (i) predicting whether a question is answerable from a set of context passages and, if so, extracting the answer; (ii) generating a well-formed answer based on the context; and (iii) ranking a set of retrieved passages for a given question. Compared with other MRC datasets such as SQuAD, NewsQA, DuReader, and RACE, MS MARCO is notable for its larger size and more realistic nature, and its passage ranking task provides a benchmark for evaluating retrieval models.

The dataset has been released in multiple versions. Benchmark experiments with a range of MRC models, including generative and cloze-style models, show that v2.1 is more challenging than v1.1. The dataset is also used to explore new metrics and evaluation strategies for MRC and neural information retrieval, making MS MARCO a valuable resource for benchmarking and improving MRC and question-answering models.
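The passage ranking task is typically scored with mean reciprocal rank at a cutoff, commonly reported as MRR@10 on the MS MARCO leaderboard. Below is a minimal sketch of that metric; it is an illustrative implementation, not the official evaluation script, and the input layout (per-query ranked passage IDs plus per-query relevance sets) is assumed for the example.

```python
from typing import Dict, List, Set

def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean Reciprocal Rank at cutoff k.

    rankings: query id -> passage ids, best-ranked first.
    relevant: query id -> set of passage ids judged relevant.
    Each query contributes 1/rank of its first relevant passage
    within the top k, or 0 if none appears there.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        rel = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0

# Toy example: q1 finds a relevant passage at rank 2 (score 0.5),
# q2 has no relevant passage in its list (score 0.0).
rankings = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p4"]}
relevant = {"q1": {"p7"}, "q2": {"p2"}}
print(mrr_at_k(rankings, relevant))  # (0.5 + 0.0) / 2 = 0.25
```

Breaking at the first relevant passage is what distinguishes reciprocal rank from precision-style metrics: only the position of the first correct passage affects a query's score.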