Reading Wikipedia to Answer Open-Domain Questions


28 Apr 2017 | Danqi Chen, Adam Fisch, Jason Weston & Antoine Bordes
This paper proposes using Wikipedia as the unique knowledge source for open-domain question answering (QA), where answers are text spans in Wikipedia articles. The task combines document retrieval (finding relevant articles) with machine comprehension (identifying answer spans within them). The authors develop DrQA, a system that pairs a Document Retriever, based on bigram hashing and TF-IDF weighting, with a Document Reader, a multi-layer recurrent neural network trained to detect answer spans.

Answering a question requires first retrieving the few relevant articles from over 5 million Wikipedia items, then scanning them to find the answer. The authors call this setting machine reading at scale (MRS): it demands both efficient document retrieval and deep text understanding. Experiments show both modules are competitive with existing methods: the Document Retriever outperforms the built-in Wikipedia search engine, and the Document Reader achieves state-of-the-art results on the SQuAD benchmark. Evaluated on multiple open-domain QA benchmarks, the full system improves further through multitask learning with distant supervision.

The paper also surveys related work, including earlier QA systems built on Wikipedia, and highlights the importance of machine comprehension in open-domain QA. Compared with systems such as YodaQA, DrQA performs well on SQuAD but faces challenges in full-Wikipedia QA, which requires both accurate document retrieval and context understanding. The authors conclude that MRS is a key challenge for researchers, requiring the integration of search, distant supervision, and multitask learning to build effective QA systems. Future work aims to improve DrQA through multi-document training and end-to-end training across the retrieval and reading pipeline.
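To make the retrieval step concrete, here is a minimal sketch of TF-IDF scoring over hashed bigram features. This is an illustration of the general technique, not the authors' implementation: the tokenizer, hash-space size (`NUM_BUCKETS`), and weighting formula are simplified assumptions (DrQA uses a much larger hash space and murmur3 hashing).

```python
import math
import re
from collections import Counter

NUM_BUCKETS = 2 ** 16  # assumed hash space; DrQA's is far larger


def features(text):
    """Lowercased unigram + bigram features (a rough stand-in for DrQA's)."""
    toks = re.findall(r"\w+", text.lower())
    return toks + [f"{a} {b}" for a, b in zip(toks, toks[1:])]


def hashed_counts(text):
    """Feature hashing: map each feature to a fixed-size bucket."""
    return Counter(hash(f) % NUM_BUCKETS for f in features(text))


def tfidf_vectors(docs):
    """Build TF-IDF vectors over hashed features for a small collection."""
    counts = [hashed_counts(d) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n = len(docs)
    vecs = [{k: tf * math.log((n + 1) / (df[k] + 1)) for k, tf in c.items()}
            for c in counts]
    return vecs, df, n


def score(query, vec, df, n):
    """Dot product of the query's TF-IDF vector with a document vector."""
    return sum(tf * math.log((n + 1) / (df.get(k, 0) + 1)) * vec.get(k, 0.0)
               for k, tf in hashed_counts(query).items())


def retrieve(query, docs, k=1):
    """Return the indices of the k highest-scoring documents."""
    vecs, df, n = tfidf_vectors(docs)
    ranked = sorted(range(len(docs)),
                    key=lambda i: score(query, vecs[i], df, n), reverse=True)
    return ranked[:k]
```

Hashing keeps the feature space a fixed size regardless of vocabulary, which is what makes bigram features tractable over millions of articles.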
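The reading step can likewise be sketched. A span-prediction reader like DrQA's scores every token as a possible span start and end, then picks the best valid pair; the snippet below shows only that final selection step, with the scores assumed to come from a trained network and `max_len` an arbitrary cap on span length.

```python
import math


def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start_score * end_score,
    subject to start <= end < start + max_len. The per-token scores are
    assumed to be produced by a trained reader network."""
    best, best_score = None, -math.inf
    for s, p_start in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            sc = p_start * end_scores[e]
            if sc > best_score:
                best, best_score = (s, e), sc
    return best, best_score
```

Scoring starts and ends independently and combining them at decode time keeps the output layer linear in the paragraph length rather than quadratic.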
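Distant supervision, mentioned above as key to training the full system, can be illustrated in miniature: given only a question and its answer string (no labeled span), training examples are generated by string-matching the answer inside retrieved paragraphs. The keyword-overlap filter below is a simplified assumption standing in for the paper's actual heuristics.

```python
def distant_examples(question_keywords, answer, paragraphs):
    """Create (paragraph, start, end) training triples by locating the known
    answer string in retrieved paragraphs. No human-labeled span is used:
    a string match plus a crude keyword-overlap filter (an assumption here)
    decides which paragraphs become training examples."""
    examples = []
    for p in paragraphs:
        idx = p.lower().find(answer.lower())
        # keep paragraphs that contain the answer and share a question keyword
        if idx >= 0 and any(k.lower() in p.lower() for k in question_keywords):
            examples.append((p, idx, idx + len(answer)))
    return examples
```

Such automatically generated spans are noisy (the matched string may appear in an irrelevant context), which is why the paper combines distant supervision with multitask learning over cleanly labeled datasets like SQuAD.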