24 May 2019 | Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova
This paper explores the challenge of natural yes/no questions, which arise in unprompted and unconstrained settings. The authors build a reading comprehension dataset, BoolQ, and find that these questions often ask for complex, non-factoid information and require difficult entailment-like inference to solve. They also investigate the effectiveness of various transfer learning baselines, finding that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data. The best method, which trains BERT on MultiNLI and then re-trains it on the BoolQ train set, reaches 80.4% accuracy, compared to 90% accuracy for human annotators and 62% for a majority baseline. The paper discusses the nature of the questions, the annotation quality, and the types of inference required to answer them. It also explores different ways of leveraging extractive QA data and the effectiveness of unsupervised pre-training with language models such as BERT. The results highlight the difficulty of the BoolQ dataset and suggest that crowd-sourced entailment datasets can be leveraged to boost performance even on top of powerful pre-trained language models.
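Below is a minimal sketch of the two-stage transfer recipe the summary describes: fine-tune BERT on MultiNLI for entailment first, then re-train the resulting model on the BoolQ train set. This is not the authors' implementation; the use of Hugging Face `transformers`/`datasets`, the dataset identifiers, the base checkpoint, and the hyperparameters are all assumptions made for illustration.

```python
# Sketch of MultiNLI -> BoolQ transfer (assumptions noted above; not the authors' code).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def finetune(model, raw, text_a, text_b, label_fn, out_dir):
    """Sentence-pair fine-tuning loop shared by the MultiNLI and BoolQ stages."""
    def encode(ex):
        enc = tokenizer(ex[text_a], ex[text_b], truncation=True, max_length=512)
        enc["label"] = label_fn(ex)
        return enc

    train = raw["train"].map(encode)
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    # Passing the tokenizer lets Trainer pad each batch dynamically.
    Trainer(model=model, args=args, train_dataset=train,
            tokenizer=tokenizer).train()
    model.save_pretrained(out_dir)


# Stage 1: entailment training on MultiNLI (3-way labels).
mnli_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
finetune(mnli_model, load_dataset("multi_nli"),
         "premise", "hypothesis", lambda ex: ex["label"], "bert-mnli")

# Stage 2: re-train on BoolQ, keeping the MultiNLI-adapted encoder but
# re-initialising the classification head for 2-way yes/no labels.
boolq_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-mnli", num_labels=2, ignore_mismatched_sizes=True)
finetune(boolq_model, load_dataset("boolq"),
         "question", "passage", lambda ex: int(ex["answer"]), "bert-boolq")
```

The key design point the paper's result rests on is that the encoder weights adapted on entailment data carry over to yes/no QA, while the task-specific classification head is simply replaced for the second stage.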