17 May 2024 | Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan
The paper introduces SOK-Bench, a novel benchmark for evaluating situated and open-world commonsense reasoning in videos. The benchmark consists of 44K questions and 10K video clips, designed to assess models' ability to understand and apply situated knowledge and general knowledge for problem-solving. The authors propose an automatic and scalable method to generate question-answer pairs, knowledge graphs, and rationales by combining LLMs and MLLMs. The process involves extracting observable situated entities, relations, and processes from videos, extending to open-world knowledge, and generating questions through iterative dialogues. The benchmark includes diverse question types and covers 12 categories of questions, sourced from real-world activities.
The authors evaluate recent large vision-language models on the benchmark and find significant room for improvement, highlighting the need for better handling of situated open-world knowledge. The paper also discusses related works, the generation process, and experimental results, emphasizing the value of the benchmark in advancing AI's ability to reason in dynamic, real-world contexts.