27 Jun 2019 | Kenton Lee, Ming-Wei Chang, Kristina Toutanova
This paper introduces ORQA, an end-to-end open-domain question answering system that jointly learns the retriever and reader from question-answer string pairs alone, without relying on a black-box information retrieval (IR) system. ORQA treats evidence retrieval from Wikipedia as a latent variable and pre-trains the retriever with an Inverse Cloze Task (ICT), in which a sentence serves as a pseudo-query and the model learns to predict the context it was taken from. Both components are built on BERT encoders: the retriever scores question-evidence pairs with an inner product of their dense encodings, and the reader identifies the answer span within the retrieved evidence block. Because retrieval is learned rather than fixed, ORQA can retrieve any text in the open corpus instead of being limited to the closed set returned by a black-box IR system. Evaluated on open versions of five QA datasets, learned retrieval outperforms BM25 by up to 19 points in exact match on datasets where the question writers do not already know the answer, demonstrating that ORQA can learn to retrieve evidence directly from an open corpus without a pre-defined IR system.
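To make the ICT pre-training objective concrete, here is a minimal sketch of its in-batch softmax loss. A hashed bag-of-words encoder (`toy_encode`, a hypothetical stand-in) replaces the separate BERT towers ORQA actually uses; only the shape of the objective is meant to match the paper.

```python
import zlib
import numpy as np

def toy_encode(text, dim=32):
    """Toy stand-in for a BERT encoder: L2-normalized hashed bag-of-words."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def ict_batch_loss(sentences, contexts, dim=32):
    """In-batch softmax loss for the Inverse Cloze Task.

    Each sentence acts as a pseudo-query whose true surrounding context is
    the positive; the other contexts in the batch serve as negatives.
    """
    q = np.stack([toy_encode(s, dim) for s in sentences])  # (B, d) pseudo-queries
    c = np.stack([toy_encode(t, dim) for t in contexts])   # (B, d) contexts
    scores = q @ c.T                                       # inner-product retrieval scores
    # Log-softmax over the batch; the target for row i is column i.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In the paper, this pre-training gives the retriever a useful notion of semantic relevance before any question-answer supervision is seen.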
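The learned retrieval step itself can be sketched as scoring every evidence block by an inner product with the question encoding and keeping the top-k blocks. Again, a hashed bag-of-words encoder is a hypothetical stand-in for ORQA's BERT encoders, and `retrieve_top_k` is an illustrative name, not the paper's API.

```python
import zlib
import numpy as np

def toy_encode(text, dim=32):
    """Toy stand-in for a BERT encoder: L2-normalized hashed bag-of-words."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve_top_k(question, blocks, k=2, dim=32):
    """Dense retrieval over an open corpus: score all evidence blocks by
    inner product with the question encoding, return (index, score) pairs
    sorted by descending score."""
    q = toy_encode(question, dim)
    block_vecs = np.stack([toy_encode(b, dim) for b in blocks])
    scores = block_vecs @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```

In ORQA this scoring is run over precomputed encodings of all Wikipedia blocks, so retrieval is not restricted to a closed candidate set; the reader then extracts an answer span from the top-scoring blocks.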