11 Oct 2016 | Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang
The Stanford Question Answering Dataset (SQuAD) is a large reading comprehension dataset consisting of 107,785 question-answer pairs posed by crowdworkers on 536 Wikipedia articles. Each answer is a span of text from the corresponding passage. The dataset is designed to challenge machine comprehension by requiring reasoning over the text: answers must be extracted from the passage rather than selected from a list of candidate choices. SQuAD is freely available at https://stanford-qa.com.
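To make the span-based format concrete, here is a minimal Python sketch of walking the released JSON, assuming the v1.1 layout of data → paragraphs → qas → answers (the filename is illustrative):

```python
import json

# Walk the released SQuAD JSON (v1.1 layout: data -> paragraphs -> qas -> answers).
# The filename is illustrative; the download provides separate train/dev files.
with open("train-v1.1.json") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            for answer in qa["answers"]:
                # Each answer is a span: its text plus a character offset into context.
                start = answer["answer_start"]
                assert context[start:start + len(answer["text"])] == answer["text"]
```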
SQuAD is significantly larger than previous reading comprehension datasets such as MCTest, and like TREC-QA its answers are open-ended rather than multiple-choice. It covers a diverse range of answer types, including dates, numbers, proper nouns, and common noun phrases. The data was collected in three stages: curating passages, crowdsourcing questions and answers, and obtaining additional answers for the development and test sets. Performance is evaluated with exact match and F1 score; human performance reaches 86.8% F1.
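The two metrics are easy to state precisely: exact match compares normalized answer strings, and F1 measures bag-of-tokens overlap between prediction and gold answer. The sketch below is a simplified version of the evaluation (the official script additionally takes the maximum score over the multiple reference answers collected per question):

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def f1(prediction, gold):
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())  # tokens in common
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(f1("the 11th of October", "October 11th"))  # 0.8: partial credit for overlap
```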
A logistic regression model was developed and compared against baseline methods. It achieved an F1 score of 51.0%, far better than a simple sliding-window baseline (20%), yet well below human performance, indicating that the dataset poses a challenging problem for future research. The model's errors were analyzed by answer type and by the syntactic divergence between question and answer sentence: it performed best on answer types with clear surface cues, such as dates and named entities, and struggled with the remaining types.
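For intuition, here is a toy sketch of the candidate-span approach: featurize every span of the passage, train a logistic regression to classify spans as answers, and pick the highest-scoring span at test time. The features here are illustrative stand-ins; the paper's model uses a far richer set (constituent labels, lexicalized features, dependency-tree paths, and so on):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def span_features(question, tokens, start, end):
    """A few illustrative features per candidate span (the real model uses many more)."""
    q_words = set(question.lower().split())
    span = tokens[start:end]
    return {
        "span_len": end - start,
        "span_overlaps_question": int(any(t.lower() in q_words for t in span)),
        # Crude answer-type cue: "when"/"what year" questions want numeric spans.
        "numeric_span_for_when": int(question.lower().startswith(("when", "what year"))
                                     and all(t.isdigit() for t in span)),
    }

# Toy training set: one example per candidate span, labeled 1 iff it is the answer.
question = "What year was the dataset released"
tokens = ["The", "dataset", "was", "released", "in", "2016"]
spans = [(s, e) for s in range(len(tokens)) for e in range(s + 1, len(tokens) + 1)]
X = [span_features(question, tokens, s, e) for s, e in spans]
y = [int((s, e) == (5, 6)) for s, e in spans]  # gold answer span: "2016"

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# At test time, score every candidate span and return the most probable one.
probs = clf.predict_proba(vec.transform(X))[:, 1]
s, e = spans[probs.argmax()]
print(tokens[s:e])  # ['2016'] -- the answer-type cue singles out the numeric span
```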
The paper also introduces a method for quantifying the syntactic divergence between a question and the sentence containing its answer, which makes it possible to stratify the dataset by difficulty. The logistic regression model proved effective at selecting the correct sentence, but the main remaining challenge is identifying the exact answer span within that sentence.
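The divergence is measured as the edit distance between unlexicalized dependency paths in the question and in the answer sentence. A minimal Levenshtein sketch over label sequences (the paths below are hypothetical; the real paths come from a dependency parser):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed with a single DP row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete
                                     dp[j - 1] + 1,    # insert
                                     prev + (x != y))  # substitute (free if equal)
    return dp[-1]

# Hypothetical dependency-edge label paths for a question and its answer sentence.
question_path = ["nsubj", "prep", "pobj"]
answer_path = ["nsubj", "dobj"]
print(edit_distance(question_path, answer_path))  # 2: one substitution, one deletion
```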
The results show that while the logistic regression model outperforms the baselines, a significant gap remains between model and human performance. The dataset has been widely adopted, and subsequent neural network-based models have pushed performance higher, though still well short of human level. SQuAD is thus a valuable resource both for advancing machine comprehension and for understanding what makes reading comprehension hard for machines.