11 Jun 2018 | Pranav Rajpurkar*, Robin Jia*, Percy Liang
SQuAD 2.0 is a new version of the Stanford Question Answering Dataset (SQuAD) designed to challenge existing models by mixing answerable and unanswerable questions. Unlike SQuAD 1.1, SQuAD 2.0 adds over 50,000 unanswerable questions written by crowdworkers to look similar to answerable ones: each is relevant to its paragraph, and the paragraph contains a plausible (but incorrect) answer, which makes these questions hard for models to tell apart from answerable ones. The dataset tests whether models can recognize when no answer is supported by the text and abstain rather than guess.
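To give a concrete sense of what abstaining looks like in practice, here is a minimal sketch of one common strategy (not prescribed by the dataset itself): compare the model's best span score against a no-answer score and only answer when the margin clears a threshold tuned on the development set. All names below are illustrative, not from the paper.

```python
def answer_or_abstain(best_span, span_score, no_answer_score, threshold=0.0):
    """Return the best span only if it beats the no-answer option by a margin.

    span_score / no_answer_score: scores from some hypothetical QA model.
    threshold: tuned on the SQuAD 2.0 dev set to trade answer recall
    against the ability to abstain on unanswerable questions.
    """
    if span_score - no_answer_score > threshold:
        return best_span
    return ""  # the empty string stands for "no answer" in SQuAD 2.0 evaluation
```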
SQuAD 2.0 is substantially harder than SQuAD 1.1: a strong model that achieved 86% F1 on SQuAD 1.1 reached only 66% F1 on SQuAD 2.0. The best existing model scored 66.3% F1 on SQuAD 2.0, while human performance was 89.5% F1. The crowdworker-written unanswerable questions also cover a wide range of phenomena and are harder for models than automatically generated unanswerable questions. The dataset is intended to encourage the development of reading comprehension systems that recognize when they do not know the answer.
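The F1 numbers above use SQuAD's token-overlap metric. The sketch below follows the general logic of the official evaluation script in simplified form (the real script also normalizes text more carefully and takes a maximum over multiple gold answers per question): an unanswerable question has an empty gold answer, so a model earns credit only by abstaining.

```python
import collections
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable questions have an empty gold string: full credit only
        # if the prediction is also empty, otherwise zero.
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```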
The dataset was created by having crowdworkers write unanswerable questions about SQuAD 1.1 articles. Each question was written to be relevant to its paragraph and to have a plausible answer in the paragraph. The dataset includes both answerable and unanswerable questions, with a roughly one-to-one ratio in the development and test sets. It is publicly available and serves as the primary benchmark on the official SQuAD leaderboard. The authors are optimistic that SQuAD 2.0 will lead to the development of reading comprehension systems that know when they do not know the answer.
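For readers who want to inspect the split themselves, here is a small sketch that downloads the public dev set and counts answerable versus unanswerable questions, assuming the standard SQuAD 2.0 JSON layout in which each question carries an is_impossible flag; the URL below is the one listed on the SQuAD explorer site and may change.

```python
import json
import urllib.request

# Official SQuAD 2.0 dev set (URL assumed; check the SQuAD explorer site).
DEV_URL = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"

with urllib.request.urlopen(DEV_URL) as f:
    dataset = json.load(f)

answerable, unanswerable = 0, 0
for article in dataset["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            # Unanswerable questions are flagged is_impossible and list
            # "plausible_answers" instead of gold "answers".
            if qa.get("is_impossible", False):
                unanswerable += 1
            else:
                answerable += 1

print(f"answerable: {answerable}, unanswerable: {unanswerable}")
```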