STAR: A Benchmark for Situated Reasoning in Real-World Videos

STAR is a benchmark for situated reasoning in real-world videos, designed to evaluate how well systems reason in dynamic, complex situations. It covers four question types: interaction, sequence, prediction, and feasibility. Situations are represented as situation hypergraphs, which capture entities, actions, and relationships. The benchmark contains 60,000 questions, 240,000 candidate answers (four per question), and 22,000 situation videos, along with 144,000 situation hypergraphs that provide a structured abstraction of the situations.

The benchmark is used to evaluate a range of existing models, including visual question answering and visual reasoning models, and none of them perform well on the situated reasoning tasks. To analyze where the difficulty lies, the authors propose a diagnostic neuro-symbolic model, NS-SR, which combines visual perception, situation abstraction, language understanding, and symbolic reasoning. NS-SR uses a video parser to extract entities, relationships, and human-object interactions, a transition model to process and predict situations, and a program executor to answer questions. The results show that situated reasoning in real-world scenarios remains challenging for current methods, and the benchmark offers insights into improving visual reasoning models as well as a new direction for research in this area.
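To make the idea of answering a question by executing a program over a situation hypergraph more concrete, here is a minimal sketch. It is not the authors' NS-SR implementation: the `Hyperedge` class, the operation names in `OPS`, the `execute` function, and the action and relation labels are all hypothetical, and the hand-written `frames` list stands in for what a video parser and transition model would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """One action tying together several (subject, relation, object) triplets."""
    action: str       # hypothetical action label, e.g. "take"
    triplets: tuple   # ((subject, relation, object), ...)


def filter_action(frames, verb):
    """Return all hyperedges labeled `verb`, in temporal order."""
    return [edge for frame in frames for edge in frame if edge.action == verb]


def query_objects(edges, _unused=None):
    """Collect the distinct objects mentioned in the given hyperedges."""
    objects = []
    for edge in edges:
        for _subject, _relation, obj in edge.triplets:
            if obj not in objects:
                objects.append(obj)
    return objects


def actions_after(frames, verb):
    """Return other actions seen after the first frame that contains `verb`."""
    for i, frame in enumerate(frames):
        if any(edge.action == verb for edge in frame):
            later = []
            for later_frame in frames[i + 1:]:
                for edge in later_frame:
                    if edge.action != verb and edge.action not in later:
                        later.append(edge.action)
            return later
    return []


# Hypothetical operation table: each op maps (state, argument) to a new state.
OPS = {
    "filter_action": filter_action,
    "query_objects": query_objects,
    "actions_after": actions_after,
}


def execute(program, frames):
    """Run a list of (op, argument) steps, threading the state through each op."""
    state = frames
    for op, arg in program:
        state = OPS[op](state, arg)
    return state


# Hand-written stand-in for parser output: one list of hyperedges per frame.
frames = [
    [Hyperedge("take", (("person", "hold", "cup"),))],
    [Hyperedge("take", (("person", "hold", "cup"),)),
     Hyperedge("open", (("person", "touch", "fridge"),))],
    [Hyperedge("drink", (("person", "hold", "cup"),))],
]

# Interaction-style question: "Which object did the person take?"
print(execute([("filter_action", "take"), ("query_objects", None)], frames))

# Sequence-style question: "What did the person do after taking something?"
print(execute([("actions_after", "take")], frames))
```

In this toy setting the answer is read directly off the graph, so any error would come from the graph itself. That mirrors the diagnostic role the summary describes for NS-SR: separating perception and abstraction from symbolic reasoning makes it possible to ask which stage limits performance.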

STAR: A Benchmark for Situated Reasoning in Real-World Videos

2021 | Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan