CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

20 Dec 2016 | Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick
CLEVR is a diagnostic dataset designed to evaluate the visual reasoning abilities of visual question answering (VQA) systems. It contains 100,000 rendered images and roughly one million automatically generated questions, of which about 853,000 are unique. The dataset is designed to minimize biases and to provide detailed annotations for each question, enabling in-depth analysis of visual reasoning. The images contain only simple 3D shapes, and every question can be answered from the image alone, so models cannot fall back on external knowledge.

CLEVR provides structured ground-truth representations for both images and questions: each question is represented as a functional program that can be executed against the scene to produce the answer (a minimal sketch of this idea appears below). This enables analyses of reasoning abilities that are not possible with traditional VQA datasets. The questions test a range of skills, including counting, comparison, logical reasoning, and storing information in short-term memory, and they challenge models to use factorized representations for generalization and compositional systems for questions composed of multiple subtasks.

The dataset is used to analyze a variety of VQA models and uncovers weaknesses that are not widely known; for example, current state-of-the-art VQA models struggle on tasks requiring short-term memory or compositional reasoning. CLEVR is also used to study how question size, relationship type, and question topology affect model performance, revealing that models struggle with long reasoning chains and spatial relationships and do not learn disentangled representations of object attributes. Tests of compositional generalization further show that models have not learned the semantics of spatial relationships and instead rely on absolute image positions. The CLEVR dataset, along with code for generating new images and questions, will be made publicly available to help guide future research in VQA and enable rapid progress on this important task.
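To make the functional-program representation concrete, here is a minimal sketch in Python of how a question such as "How many red rubber things are there?" could be encoded as a chain of elementary functions and executed against a ground-truth scene. The scene schema, function names, and execution loop below are illustrative assumptions, not the dataset's actual generation code.

```python
# Ground-truth scene: a list of objects with their attributes (illustrative schema).
scene = [
    {"shape": "cube",     "color": "red",  "size": "large", "material": "metal"},
    {"shape": "sphere",   "color": "blue", "size": "small", "material": "rubber"},
    {"shape": "cylinder", "color": "red",  "size": "small", "material": "rubber"},
]

def scene_objects(_prev):
    """Return every object in the scene (ignores the previous step's output)."""
    return list(scene)

def filter_attr(prev, attr, value):
    """Keep only objects from the previous step whose attribute matches the value."""
    return [obj for obj in prev if obj[attr] == value]

def count(prev):
    """Count the objects produced by the previous step."""
    return len(prev)

def execute(program):
    """Execute a chain of (function, extra_args) steps; each step receives
    the previous step's output as its first argument."""
    output = None
    for fn, extra_args in program:
        output = fn(output, *extra_args)
    return output

# "How many red rubber things are there?" as a functional program:
program = [
    (scene_objects, ()),
    (filter_attr, ("color", "red")),
    (filter_attr, ("material", "rubber")),
    (count, ()),
]

print(execute(program))  # -> 1 (only the red rubber cylinder matches)
```

The actual question programs also include functions for spatial relationships, attribute comparisons, and logical operations, and their structure can be chain-like or tree-like, which is what the analysis of question topology refers to.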