VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool


7 Jul 2024 | Yan Wang1, Yawen Zeng2*, Jingsheng Zheng1, Xiaofen Xing1, Jin Xu1,3, Xiangmin Xu1
The paper introduces VideoCoT, a dataset for video chain-of-thought (CoT) reasoning, which aims to enhance the reasoning capabilities of multimodal large language models (MLLMs). The authors address the challenge of creating reliable video CoT datasets by developing an automatic annotation tool that combines machine and human experts under the active learning paradigm. This tool reduces the workload of human labeling and ensures dataset quality. The dataset collection process involves three steps: prompt generation, automatic scoring, and expert refinement. The resulting datasets, VideoCoT, TopicQA, and TopicCoT, are designed to improve MLLMs' reasoning abilities in video question answering (VQA) tasks. Extensive experiments demonstrate the effectiveness of these datasets in enhancing the performance of MLLMs, particularly in open-ended VQA tasks. The paper also proposes a benchmark based on the collected datasets to evaluate the models' reasoning capabilities.
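To make the three-step annotation pipeline concrete, here is a minimal sketch of a machine-plus-expert loop of the kind the summary describes: draft a CoT rationale per video question, score it automatically, and route only low-scoring samples to human experts. All function names, the scoring heuristic, and the 0.7 threshold are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical active-annotation loop: prompt generation -> automatic scoring
# -> expert refinement for samples the scorer flags. Stubs stand in for the
# LLM and the human annotator.
from dataclasses import dataclass


@dataclass
class Sample:
    video_id: str
    question: str
    rationale: str = ""
    score: float = 0.0
    refined_by_expert: bool = False


def generate_rationale(sample: Sample) -> str:
    """Step 1: prompt an LLM to draft a chain-of-thought rationale (stubbed)."""
    return f"Draft reasoning for {sample.video_id}: {sample.question}"


def auto_score(rationale: str) -> float:
    """Step 2: automatic quality score, e.g. an LLM- or rule-based estimate (stubbed)."""
    return min(1.0, len(rationale) / 100.0)  # placeholder heuristic


def expert_refine(sample: Sample) -> str:
    """Step 3: a human expert rewrites rationales that fall below the threshold (stubbed)."""
    return sample.rationale + " [refined by expert]"


def annotate(samples: list[Sample], threshold: float = 0.7) -> list[Sample]:
    for s in samples:
        s.rationale = generate_rationale(s)   # prompt generation
        s.score = auto_score(s.rationale)     # automatic scoring
        if s.score < threshold:               # only hard cases reach humans
            s.rationale = expert_refine(s)    # expert refinement
            s.refined_by_expert = True
    return samples


if __name__ == "__main__":
    data = [Sample("vid_001", "Why does the person open the umbrella?")]
    for s in annotate(data):
        print(s.video_id, round(s.score, 2), s.refined_by_expert)
```

The point of the design is the active-learning routing: the automatic scorer filters the bulk of machine-generated rationales, so expert effort is spent only on the samples the machine handles poorly, which is how the tool reduces human labeling workload while preserving dataset quality.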