From Recognition to Cognition: Visual Commonsense Reasoning

Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi
Visual Commonsense Reasoning (VCR) challenges vision systems to understand images at a cognitive level, beyond simple object recognition. Given an image, a model must answer a question and also supply a rationale explaining why its answer is right, drawing on both the image content and background knowledge. The VCR dataset consists of 290,000 multiple-choice questions derived from 110,000 movie scenes.

A key challenge in building such a dataset is annotation bias: models can learn to exploit superficial patterns in the answer options rather than understand the content. To counter this, the authors introduce Adversarial Matching, which turns the collected annotations into multiple-choice problems by recycling each correct answer as a distractor for other questions. Distractors are chosen to be relevant to their new question yet distinct from its correct answer, so every response appears both as a right and as a wrong option and the label distribution stays balanced (a sketch of this matching procedure appears below).

The authors also present Recognition to Cognition Networks (R2C), a model that performs three inference steps: grounding the meaning of the query and each candidate response in the image, contextualizing the response with respect to the query and the image, and reasoning over this shared representation to score the response (also sketched below). R2C substantially outperforms state-of-the-art vision-and-language models on VCR, yet the task remains far from solved: humans achieve over 90% accuracy, well above model performance.

The paper situates VCR with respect to prior work on visual question answering (VQA), commonsense reasoning, and adversarial dataset construction. The authors stress that solving VCR requires language understanding, vision, and world knowledge together, and they argue for explainable AI in which models justify their answers with rationales. The VCR dataset and model are available for download at visualcommonsense.com.
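The paper casts distractor selection as a maximum-weight bipartite matching between questions and the pool of correct answers: a good distractor scores high under a relevance model (so it is plausible for the question) but low under a similarity model (so it is not an accidental second correct answer). The snippet below is a minimal sketch of that idea, assuming precomputed score matrices; `p_rel`, `p_sim`, and the trade-off weight `lam` are hypothetical stand-ins for the trained scoring models and tuned hyperparameter used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_matching(p_rel, p_sim, lam=0.5, eps=1e-8):
    """Pick one distractor per question via maximum-weight bipartite matching.

    p_rel[i, j]: relevance of answer j to question i (high => plausible distractor).
    p_sim[i, j]: similarity of answer j to question i's own correct answer
                 (high => risks being a second correct answer).
    lam:         trades plausibility against distinctness.
    """
    weights = np.log(p_rel + eps) + lam * np.log(1.0 - p_sim + eps)
    # Forbid assigning a question its own correct answer as a "distractor".
    np.fill_diagonal(weights, -1e9)
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return cols  # cols[i] = index of the answer recycled as question i's distractor

# Toy usage: 4 questions, 4 candidate answers drawn from other questions.
rng = np.random.default_rng(0)
p_rel = rng.uniform(size=(4, 4))
p_sim = rng.uniform(size=(4, 4))
print(adversarial_matching(p_rel, p_sim))
```

Running the matching several times while excluding previously selected pairs would yield the multiple distractors each question needs.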
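To make R2C's three inference steps concrete, here is a schematic PyTorch sketch of an R2C-style scorer. It is a simplification under stated assumptions, not the paper's implementation: the dimensions are illustrative, the dot-product attention is a plain stand-in for the paper's learned attention, and the inputs `q_emb`, `r_emb`, and `obj_feats` are assumed to be precomputed token embeddings (already concatenated with the image features of any objects the tokens mention) and projected object features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class R2CSketch(nn.Module):
    """Schematic of R2C's three steps: ground, contextualize, reason."""

    def __init__(self, d=512):
        super().__init__()
        # Step 1 (grounding): contextualize query/response tokens, whose
        # embeddings (dim d) are concatenated with image features (dim d).
        self.ground = nn.LSTM(2 * d, d // 2, bidirectional=True, batch_first=True)
        # Step 3 (reasoning): LSTM over the attended, grounded sequence.
        self.reason = nn.LSTM(3 * d, d // 2, bidirectional=True, batch_first=True)
        self.score = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def attend(self, response, context):
        # Step 2 (contextualization): each response token attends over the
        # query tokens (or the detected objects).
        logits = response @ context.transpose(1, 2)        # (B, Tr, Tc)
        return F.softmax(logits, dim=-1) @ context          # (B, Tr, d)

    def forward(self, q_emb, r_emb, obj_feats):
        # q_emb: (B, Tq, 2d), r_emb: (B, Tr, 2d), obj_feats: (B, K, d)
        q, _ = self.ground(q_emb)           # grounded query    (B, Tq, d)
        r, _ = self.ground(r_emb)           # grounded response (B, Tr, d)
        q_ctx = self.attend(r, q)           # query as seen by the response
        o_ctx = self.attend(r, obj_feats)   # objects as seen by the response
        fused = torch.cat([r, q_ctx, o_ctx], dim=-1)        # (B, Tr, 3d)
        out, _ = self.reason(fused)                          # (B, Tr, d)
        pooled = out.max(dim=1).values      # pool over response tokens
        return self.score(pooled)           # one logit per candidate response
```

Each of the four candidate responses is scored independently this way; a softmax over the four logits then gives the model's answer distribution.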