[slides and audio] From Recognition to Cognition: Visual Commonsense Reasoning

The paper introduces a new task called Visual Commonsense Reasoning (VCR), which challenges computer vision systems to understand images beyond object recognition. VCR involves answering questions about an image and providing a rationale for the answer. The authors present a large-scale dataset, VCR, consisting of 290k multiple-choice QA problems derived from 110k movie scenes. They introduce Adversarial Matching, a novel algorithm that transforms rich annotations into multiple-choice questions with minimal bias, ensuring high-quality and diverse problems. The paper also presents Recognition to Cognition Networks (R2C), a model that performs layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines, achieving 65% accuracy in question answering and 67% in answer justification. However, the task remains challenging, with humans scoring 90% accuracy. The paper provides detailed insights and ablation studies to guide future research.

From Recognition to Cognition: Visual Commonsense Reasoning

Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi