WIKIQA is a new publicly available dataset for open-domain question answering (QA), consisting of 3,047 questions sampled from Bing query logs. Each question is associated with a Wikipedia page, and the sentences of the page's summary paragraph serve as candidate answers. Crowd workers annotated each sentence to indicate whether it correctly answers the question, yielding 29,258 sentences in total, of which 1,473 are labeled as correct answers. Unlike earlier datasets such as QASENT, whose candidate sentences were selected partly through keyword matching with the question, WIKIQA includes questions with no correct answer, enabling research on answer triggering: detecting whether any correct answer exists among the candidates at all, a critical component of practical QA systems.
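The annotation scheme above pairs each question with several labeled candidate sentences, and some questions have no positively labeled candidate. A minimal sketch of that representation, using invented example records rather than actual WIKIQA data, shows how such unanswered questions can be identified:

```python
from collections import defaultdict

# Hypothetical in-memory representation of WIKIQA-style annotations:
# each record pairs a question with one candidate sentence and a 0/1 label.
records = [
    ("Q1", "how are glacier caves formed?",
     "A glacier cave is a cave formed within the ice of a glacier.", 1),
    ("Q1", "how are glacier caves formed?",
     "Glacier caves are often called ice caves.", 0),
    ("Q2", "what is an example question?",
     "This sentence does not answer it.", 0),
]

# Group candidate labels by question; a question with no positive label
# is exactly the case that makes answer triggering necessary.
labels_by_question = defaultdict(list)
for qid, _question, _sentence, label in records:
    labels_by_question[qid].append(label)

unanswered = [qid for qid, labels in labels_by_question.items()
              if not any(labels)]
print(unanswered)  # ['Q2']
```

Grouping by question rather than treating sentences independently matters later as well, since the evaluation for answer triggering is defined at the question level.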
The WIKIQA dataset is significantly larger than QASENT, containing more than three times as many answer sentences. It also covers a more diverse range of question types, with a larger share of description or definition questions, which are harder to answer. For evaluation, the questions were split into training (70%), development (10%), and test (20%) sets.
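Because candidate sentences belong to questions, a 70/10/20 split must be made over question IDs so that all candidates for a question land in the same portion. A sketch of such a split (illustrative only; the released corpus ships with a fixed official split, and these sizes are just what 70/10/20 of 3,047 works out to, not the official counts):

```python
import random

def split_questions(question_ids, seed=0):
    """Shuffle question IDs and cut them into ~70/10/20 train/dev/test
    portions, keeping every candidate sentence of a question together."""
    qids = sorted(set(question_ids))
    rng = random.Random(seed)  # fixed seed for reproducibility
    rng.shuffle(qids)
    n = len(qids)
    n_train = int(0.7 * n)
    n_dev = int(0.1 * n)
    return qids[:n_train], qids[n_train:n_train + n_dev], qids[n_train + n_dev:]

train, dev, test = split_questions(range(3047))
print(len(train), len(dev), len(test))  # 2132 304 611
```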
Several systems were evaluated on both datasets, including lexical semantic models and sentence semantic models. Lexical semantic methods performed better on QASENT, while sentence semantic approaches, such as convolutional neural networks (CNNs), outperformed the lexical methods on WIKIQA. The best F1 scores on WIKIQA were only slightly above 30%, leaving substantial room for improvement.
The answer triggering task was evaluated using question-level precision, recall, and F1 scores. The best system, CNN-Cnt (a CNN model combined with word-count features), achieved an F1 score of around 30%, suggesting that deeper semantic understanding is needed for strong performance on WIKIQA. Additional features, such as question length and question class, were also studied; adding the question length feature significantly improved the F1 score.
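Question-level evaluation for answer triggering can be sketched as follows. This is a common reading of the setup, assuming the system answers a question only when its top-scored candidate clears a threshold and is credited only when that candidate is labeled correct; exact details may differ from the paper's implementation:

```python
def answer_triggering_prf(predictions, threshold):
    """Question-level precision, recall, and F1 for answer triggering.

    `predictions` maps each question ID to a list of (score, label)
    pairs over its candidate sentences (label 1 = correct answer).
    """
    triggered = correct = 0
    # Recall is measured over questions that have at least one correct answer.
    answerable = sum(1 for cands in predictions.values()
                     if any(label == 1 for _score, label in cands))
    for cands in predictions.values():
        top_score, top_label = max(cands)  # highest-scoring candidate
        if top_score > threshold:          # system decides to answer
            triggered += 1
            correct += top_label           # credited only if label is 1
    precision = correct / triggered if triggered else 0.0
    recall = correct / answerable if answerable else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

preds = {
    "Q1": [(0.9, 1), (0.4, 0)],  # triggered, correct
    "Q2": [(0.8, 0), (0.3, 1)],  # triggered, wrong sentence on top
    "Q3": [(0.2, 0), (0.1, 0)],  # correctly left unanswered
}
print(answer_triggering_prf(preds, threshold=0.5))  # (0.5, 0.5, 0.5)
```

The threshold trades precision against recall: raising it answers fewer questions more accurately, which is why the metric is reported as a question-level F1 rather than per-sentence accuracy.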
The results show that WIKIQA is a more challenging dataset than QASENT, as it includes questions with no correct answers and requires deeper semantic understanding for accurate answer selection. The study highlights the importance of developing models that can handle such challenges and provides a new benchmark for QA research.