SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

17 May 2024 | Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan
SOK-Bench is a new benchmark for situated video reasoning with aligned open-world knowledge. It consists of 44,000 questions over 10,000 video situations with instance-level annotations, and it requires models to reason jointly over situated knowledge observed in the video and general open-world knowledge to answer questions.

The dataset was created with an automatic, scalable pipeline that combines large language models (LLMs) and multimodal large language models (MLLMs) to generate question-answer pairs, knowledge graphs, and rationales. The pipeline extracts observable entities, relations, and processes from videos, extends them with open-world knowledge, and generates questions through iterative dialogues and self-prompting. Each question comes with a detailed rationale and multiple-choice options, and quality is ensured through manual review. SOK-Bench includes three types of knowledge graphs (situated knowledge, general knowledge, and situated commonsense) and covers 12 question types, including spatiotemporal, causal, and dynamic reasoning. The videos are drawn from public datasets such as YouCook2 and HOMAGE.

The benchmark is used to evaluate recent large vision-language models, revealing their limitations in situated open-world reasoning: experiments show that current models struggle with these complex reasoning tasks, highlighting the need for further improvements in video reasoning capabilities. The structured, scalable generation method yields high-quality question-answer pairs and strengthens the evaluation of models on situated and open-world reasoning.
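To make the generation pipeline description concrete, below is a minimal, hypothetical Python sketch of the three stages it mentions: extracting situated knowledge from a video, extending it with open-world/commonsense knowledge, and self-prompting for a question, answer, rationale, and distractor options. The function names, prompts, triple format, and the `call_llm` stub are illustrative assumptions, not the authors' released code or exact prompts.

```python
# Hypothetical sketch of an LLM/MLLM-driven QA-generation pipeline in the
# spirit of SOK-Bench's described process. Prompts and data structures are
# assumptions for illustration only.
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Triple:
    subject: str
    relation: str
    obj: str


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM/MLLM call (wire this to a model of your choice)."""
    raise NotImplementedError


def parse_triples(raw: str) -> List[Triple]:
    """Parse 'subject | relation | object' lines into Triple records."""
    out = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            out.append(Triple(*parts))
    return out


def extract_situated_knowledge(video_description: str) -> List[Triple]:
    """Stage 1: observable entities, relations, and processes in the video."""
    prompt = (
        "List 'subject | relation | object' triples for the entities, actions, "
        f"and processes visible in this video:\n{video_description}"
    )
    return parse_triples(call_llm(prompt))


def extend_open_world(triples: List[Triple]) -> List[Triple]:
    """Stage 2: add general/commonsense knowledge linked to the situated facts."""
    facts = "\n".join(f"{t.subject} | {t.relation} | {t.obj}" for t in triples)
    prompt = (
        "Given these situated facts, add commonsense triples (causes, "
        f"preconditions, typical effects) in the same format:\n{facts}"
    )
    return triples + parse_triples(call_llm(prompt))


def generate_qa(triples: List[Triple], question_type: str) -> dict:
    """Stage 3: self-prompt for a question, answer, rationale, and options."""
    facts = "\n".join(f"{t.subject} | {t.relation} | {t.obj}" for t in triples)
    prompt = (
        f"Using the knowledge below, write one {question_type} question, its "
        "answer, a step-by-step rationale, and three plausible but incorrect "
        "options. Return JSON with keys question, answer, rationale, options.\n"
        f"{facts}"
    )
    return json.loads(call_llm(prompt))
```

In the actual benchmark, this loop also runs iterative dialogues between models and is followed by manual review of the generated question-answer pairs; the sketch omits those steps.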