15 May 2024 | Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan
The paper introduces a new benchmark, Situated Reasoning in Real-World Videos (STAR), for evaluating situated reasoning in real-world videos. STAR focuses on capturing and reasoning about dynamic, compositional, and logical situations, using hyper-graphs to represent entities and relations. The benchmark includes four types of questions: interaction, sequence, prediction, and feasibility. The paper highlights the challenges of situated reasoning and compares various existing video reasoning models, finding that they struggle with the task. To address these challenges, the authors propose a diagnostic neuro-symbolic model (NS-SR) that disentangles visual perception, situation abstraction, language understanding, and symbolic reasoning.
The NS-SR model is designed to perform symbolic reasoning over structured situation graphs and dynamic clues from situations. The paper also provides a detailed analysis of the benchmark's construction, question generation, answer generation, and evaluation methods, demonstrating the effectiveness of the benchmark in uncovering the difficulties of situated reasoning.
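
To make the idea of symbolic reasoning over a structured situation graph more concrete, here is a minimal Python sketch. It is not the authors' NS-SR implementation. Every name in it (`Relation`, `Action`, `Situation`, `answer_interaction`, `answer_sequence`) is hypothetical, and an action that spans several frames is used as a simplified stand-in for a hyperedge. It covers only two of the four question types, interaction and sequence, and assumes the graph has already been extracted from the video.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) triple observed in a single frame."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Action:
    """Simplified hyperedge: an action label covering a span of frames."""
    verb: str
    obj: str
    start: int  # first frame index, inclusive
    end: int    # last frame index, inclusive


@dataclass
class Situation:
    """Per-frame relation sets plus the actions that span those frames."""
    frames: list[set[Relation]] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)


def answer_interaction(situation: Situation, obj: str) -> list[str]:
    """'What did the person do with <obj>?' -> verbs in temporal order."""
    hits = [a for a in situation.actions if a.obj == obj]
    return [a.verb for a in sorted(hits, key=lambda a: a.start)]


def answer_sequence(situation: Situation, verb: str, obj: str):
    """'What happened after <verb> <obj>?' -> next (verb, obj) or None."""
    anchors = [a for a in situation.actions if a.verb == verb and a.obj == obj]
    if not anchors:
        return None  # the anchor action never occurs in this situation
    anchor_end = anchors[0].end
    later = [a for a in situation.actions if a.start > anchor_end]
    if not later:
        return None  # nothing happens after the anchor action
    nxt = min(later, key=lambda a: a.start)
    return (nxt.verb, nxt.obj)


# A hand-written toy situation; the labels are invented for illustration.
example = Situation(
    frames=[
        {Relation("person", "holding", "cup")},
        {Relation("person", "holding", "cup")},
        {Relation("cup", "on", "table")},
        {Relation("cup", "on", "table")},
        {Relation("person", "holding", "book")},
        {Relation("person", "holding", "book")},
    ],
    actions=[
        Action("take", "cup", 0, 1),
        Action("put_down", "cup", 2, 3),
        Action("open", "book", 4, 5),
    ],
)

print(answer_interaction(example, "cup"))
print(answer_sequence(example, "take", "cup"))
```

Keeping perception (building the `Situation`) separate from the query functions mirrors the disentangled design described above at a very small scale: an error in an answer can be traced either to a wrong graph or to a wrong query, which is what makes such a model useful as a diagnostic.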