14 Mar 2018 | Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord
The AI2 Reasoning Challenge (ARC) is a new question answering benchmark designed to evaluate advanced reasoning and knowledge-based question answering. It consists of 7,787 natural, grade-school science questions, drawn primarily from standardized tests and divided into a Challenge Set (2,590 questions) and an Easy Set (5,197 questions). The Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm, so it demands more advanced reasoning and knowledge. ARC is the largest public-domain dataset of its kind.
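To make that filtering criterion concrete, here is a toy Python sketch in the spirit of a word co-occurrence guesser. It is a deliberate simplification, not the paper's actual IR or PMI solvers; the stopword list and the helper names are illustrative assumptions. A Challenge question is, by construction, one that defeats guessers of this kind, while the worked example below is the sort of question the Easy Set retains.

```python
# Toy word-overlap guesser, loosely in the spirit of the word co-occurrence
# baselines used to define the Challenge Set. This is an illustrative
# simplification, NOT the paper's actual IR or PMI solvers.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "on", "for"}

def content_words(text: str) -> set[str]:
    """Lowercased tokens with common function words removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def pick_answer(question: str, options: list[str], evidence: str) -> str:
    """Choose the option sharing the most content words with the
    question plus a retrieved evidence sentence."""
    context = content_words(question) | content_words(evidence)
    return max(options, key=lambda opt: len(content_words(opt) & context))

# Surface-level overlap suffices here, so a question like this one
# lands in the Easy Set rather than the Challenge Set.
print(pick_answer(
    "Which property of a mineral can be determined just by looking at it?",
    ["luster", "mass", "weight", "hardness"],
    "luster is the way a mineral reflects light from its surface",
))  # -> "luster"
```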
The ARC Corpus, a large collection of 14 million science-related sentences, is also released to support the challenge. Three neural baseline models—DecompAttn, BiDAF, and DGEM—are provided for testing. These models perform well on the Easy Set but fail to significantly outperform a random baseline on the Challenge Set, highlighting the difficulty of the task.
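As a quick way to see what those baselines are up against, the sketch below scores a uniform random guesser on the Challenge Set test split. The Hugging Face mirror name allenai/ai2_arc and its field names (choices, answerKey) are assumptions about a community-hosted copy; the official release is distributed by AI2 at http://data.allenai.org/arc/. With mostly 4-way questions, accuracy lands near the 25% floor that the neural baselines fail to significantly beat.

```python
import random

from datasets import load_dataset  # pip install datasets

# Load the Challenge Set test split. The mirror name "allenai/ai2_arc"
# is an assumption; the official release is hosted by AI2.
challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

# Score a uniform random guesser; with mostly 4-way multiple-choice
# questions this lands near the ~25% floor reported in the paper.
random.seed(0)
correct = sum(
    random.choice(q["choices"]["label"]) == q["answerKey"] for q in challenge
)
print(f"random baseline accuracy: {correct / len(challenge):.3f}")
```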
The ARC challenge aims to spur research into more complex question answering tasks that require reasoning, commonsense knowledge, and deeper text comprehension. It differs from previous challenges by including a Challenge Set that is hard for simple algorithms by construction, and by releasing a science corpus and baseline models for the community to build on. This design addresses a limitation of earlier datasets, in which most questions could be solved by surface-level retrieval and so did not adequately exercise advanced reasoning methods. The dataset, corpus, and models are all publicly available, and the challenge is open to the research community; the emphasis is on more sophisticated approaches that can combine multiple facts and apply domain-specific knowledge to answer complex questions.