17 Oct 2022 | Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan
This paper introduces SCIENCEQA, a new benchmark for science question answering comprising 21,208 multimodal multiple-choice questions that span diverse science topics and carry detailed answer annotations, including lectures and explanations. The dataset is designed to evaluate the multi-hop reasoning ability and interpretability of AI systems, and a comparison with existing datasets shows that SCIENCEQA is larger, more diverse, and more richly annotated.

The authors propose chain-of-thought (CoT) reasoning to generate explanations that mimic human reasoning processes: language models are trained or prompted to produce a lecture and an explanation alongside the answer. Experiments show that CoT improves performance on SCIENCEQA by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA, with UnifiedQA (CoT) reaching 74.11% accuracy (up from a 70.12% baseline) and GPT-3 (CoT) reaching 75.17%. To probe the upper bound of how much explanations can help, the authors also feed explanations into the input, which improves GPT-3's few-shot performance by 18.96%. Further analysis shows that explanations help language models learn from less data, matching baseline performance with only 40% of the training data.

The authors evaluate a range of baselines, including VQA models and large language models, and find that CoT significantly improves reasoning ability. The study highlights the importance of explanations for model performance and the potential of CoT to enhance reasoning in language models.
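To make the prompting recipe concrete, below is a minimal sketch of how a few-shot CoT prompt for a multiple-choice science question might be assembled, where each in-context example ends with the answer followed by its lecture and explanation. The field names (`question`, `choices`, `lecture`, `explanation`), the `BECAUSE:` delimiter, and the toy data are assumptions for illustration and do not reproduce the paper's exact template.

```python
# Sketch of few-shot chain-of-thought prompt construction for a
# multimodal multiple-choice science question (illustrative format only).

def format_example(ex, with_solution=True):
    """Render one question; solved examples include the CoT solution."""
    letters = "ABCDE"
    options = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(ex["choices"]))
    text = (
        f"Question: {ex['question']}\n"
        f"Context: {ex.get('context', 'N/A')}\n"
        f"Options: {options}\n"
    )
    if with_solution:
        # Answer first, then the lecture and explanation as the thought chain.
        text += (
            f"Answer: The answer is ({letters[ex['answer']]}). "
            f"BECAUSE: {ex['lecture']} {ex['explanation']}\n"
        )
    else:
        text += "Answer:"
    return text

def build_cot_prompt(train_examples, test_example):
    """Concatenate a few solved examples, then the unsolved test question."""
    shots = "\n".join(format_example(ex) for ex in train_examples)
    return shots + "\n" + format_example(test_example, with_solution=False)

# Usage with a hypothetical item (not taken from the dataset):
demo = {
    "question": "Which property do these objects have in common?",
    "context": "Three objects are shown: a rubber band, a spring, a trampoline.",
    "choices": ["stretchy", "fragile", "opaque"],
    "answer": 0,
    "lecture": "An elastic object returns to its original shape after being stretched.",
    "explanation": "All three objects stretch and then return to their original shape.",
}
print(build_cot_prompt([demo], {**demo, "question": "Which object is elastic?"}))
```

The completion the model produces for the final, unsolved question would then contain both the predicted answer and a generated lecture-plus-explanation, which is what allows the accuracy and explanation quality reported above to be evaluated together.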