COMMONSENSEQA: A Question Answering Challenge Targeting Commonsense Knowledge

15 Mar 2019 | Alon Talmor*, Jonathan Herzig*, Nicholas Lourie, Jonathan Berant
CommonsenseQA is a dataset for commonsense question answering, developed to evaluate models' ability to answer questions that require prior world knowledge. It is built on CONCEPTNET, a graph-based knowledge base containing 32 million triples of concepts and the relations between them, and comprises 12,247 multiple-choice questions, each with five candidate answers: one correct answer and four distractors.

The dataset was constructed by first extracting subgraphs from CONCEPTNET, each containing one source concept and three target concepts connected to it by the same semantic relation. Crowd workers then authored one question per target concept (three per subgraph), phrasing each question so that only its target concept is a correct answer and the other two targets serve as distractors; this design pushes workers toward questions that require commonsense knowledge rather than surface-level clues. Additional distractors were added, partly selected from CONCEPTNET and partly hand-written. Finally, the questions were augmented with textual context from web snippets, which provides supporting evidence for reading-comprehension baselines. A minimal sketch of the subgraph-extraction step appears below.
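To make the subgraph-extraction step concrete, here is a minimal Python sketch of how question seeds could be pulled from a ConceptNet-style triple store. The triples, relation names, and helper function are illustrative assumptions, not the authors' actual pipeline or real ConceptNet data.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative ConceptNet-style triples: (source, relation, target).
# Toy examples only, not actual ConceptNet content.
TRIPLES = [
    ("river", "AtLocation", "waterfall"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "valley"),
    ("river", "AtLocation", "canyon"),
    ("dog", "CapableOf", "bark"),
]

def extract_question_subgraphs(triples):
    """Group targets by (source, relation) and emit subgraphs with one
    source concept and three target concepts sharing the same relation,
    mirroring the seed structure described in the paper."""
    grouped = defaultdict(list)
    for source, relation, target in triples:
        grouped[(source, relation)].append(target)

    subgraphs = []
    for (source, relation), targets in grouped.items():
        if len(targets) < 3:
            continue
        # Each 3-subset of targets yields one candidate subgraph; a crowd
        # worker would later write one question per target concept.
        for subset in combinations(targets, 3):
            subgraphs.append({"source": source,
                              "relation": relation,
                              "targets": list(subset)})
    return subgraphs

if __name__ == "__main__":
    for sg in extract_question_subgraphs(TRIPLES):
        print(sg)
```

Each emitted subgraph corresponds to three crowdsourced questions, one per target concept, with the remaining two targets reused as distractors.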
A range of models was evaluated on the dataset, including pre-trained language models such as BERT-LARGE and GPT alongside other baselines. BERT-LARGE achieved the highest accuracy, 55.9%, far below human performance of 88.9%, which underscores how difficult commonsense reasoning remains for current NLU models. The multiple-choice evaluation format is sketched below.

CommonsenseQA thus provides a benchmark that tests whether models can understand and reason about the world rather than exploit surface-level clues or distributional biases. The dataset is available for download and is accompanied by a detailed analysis of the questions and of the commonsense skills required to answer them. While current models achieve non-trivial accuracy, they still fall far short of human performance, leaving substantial room for improvement in commonsense reasoning.
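The multiple-choice setup can be sketched as scoring each of the five candidate answers against the question and picking the highest-scoring one. The snippet below uses the Hugging Face transformers multiple-choice head for BERT; it is a simplified zero-shot illustration, not the authors' training setup, and the example question and candidates are invented.

```python
import torch
from transformers import AutoTokenizer, BertForMultipleChoice

# Hypothetical example in the CommonsenseQA format: one question and
# five candidate answers, exactly one of which is correct.
question = "Where would you expect to see a waterfall that is not natural?"
candidates = ["river", "theme park", "garden", "countryside", "ocean"]

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = BertForMultipleChoice.from_pretrained("bert-large-uncased")
model.eval()

# Pair the question with every candidate; the model sees five
# (question, candidate) sequences and produces one score per choice.
encoding = tokenizer([question] * len(candidates), candidates,
                     return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}  # (1, 5, seq_len)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 5)

predicted = candidates[logits.argmax(dim=-1).item()]
print("Predicted answer:", predicted)
```

Note that without fine-tuning on the CommonsenseQA training split the multiple-choice head is randomly initialized, so this sketch only demonstrates the input and scoring format; the reported 55.9% accuracy comes from fine-tuning BERT-LARGE on the dataset.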