7 Jul 2024 | Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, Xiangmin Xu
VideoCoT is a video chain-of-thought (CoT) dataset built with an active annotation tool. The paper introduces an automatic annotation tool that combines machine and human experts under the active learning paradigm, reducing the human annotation workload while ensuring dataset quality. The tool consists of a prompt generator that guides large language models (LLMs) to produce complex CoT from video information, and a quality score that evaluates the generated CoT sentences from multiple aspects. Low-quality sentences are revised by human experts, and the revised CoT is then used to train the prompt generator so that the LLMs produce more reasonable CoT in later rounds (a minimal sketch of this loop is given below).

With this tool, three datasets are contributed: VideoCoT, TopicQA, and TopicCoT. VideoCoT supplements existing datasets with CoT rationales between question and answer. TopicQA enables multimodal LLMs (MLLMs) to learn the relevance between videos and topics, while TopicCoT supports reasoning about that topic relevance.

Based on these datasets, a simple but effective benchmark is proposed that exploits CoT to maximize the complex reasoning capabilities of MLLMs. Extensive experiments demonstrate the effectiveness of both the datasets and the proposed solution.

The main contributions are:
1) An automatic annotation tool under the active learning paradigm for complex CoT generation in the video domain.
2) Three datasets, collected with this tool, that fill the gap in video CoT data.
3) A simple but effective benchmark, built on the collected datasets, that exploits CoT to achieve better reasoning ability.

The datasets are analyzed in terms of property, diversity, and visualization quality, and the results show superior effectiveness, diversity, and explainability. The paper also discusses the limitations of the active annotation tool and the impact of funding constraints on the number of annotation experts that could be invited. The datasets are expected to enhance the visual reasoning abilities of a broader range of models.
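To make the workflow concrete, the following is a minimal Python sketch of the machine-plus-human annotation loop described above. The component names (generate_prompt, llm_generate_cot, quality_score, human_revise, update_prompt_generator), the acceptance threshold, and the number of rounds are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CoTSample:
    video_id: str
    question: str
    answer: str
    cot: str = ""       # chain-of-thought rationale to be generated
    score: float = 0.0  # multi-aspect quality score of the rationale


def active_annotation_loop(
    samples: List[CoTSample],
    generate_prompt: Callable[[CoTSample], str],       # trainable prompt generator
    llm_generate_cot: Callable[[str], str],            # LLM that writes the CoT
    quality_score: Callable[[str, CoTSample], float],  # multi-aspect scorer in [0, 1]
    human_revise: Callable[[CoTSample], str],          # expert correction of a bad CoT
    update_prompt_generator: Callable[[List[CoTSample]], None],
    threshold: float = 0.8,  # assumed acceptance cut-off
    rounds: int = 3,         # assumed number of active-learning rounds
) -> List[CoTSample]:
    """Generate CoT, score it, route low-quality cases to humans, then re-train."""
    for _ in range(rounds):
        revised: List[CoTSample] = []
        for sample in samples:
            # 1. The prompt generator conditions the LLM on video information.
            sample.cot = llm_generate_cot(generate_prompt(sample))
            # 2. Score the generated rationale from multiple aspects.
            sample.score = quality_score(sample.cot, sample)
            # 3. Low-quality rationales are revised by human experts.
            if sample.score < threshold:
                sample.cot = human_revise(sample)
                revised.append(sample)
        # 4. Human-corrected CoT supervises the prompt generator for the next round.
        if revised:
            update_prompt_generator(revised)
    return samples
```

The point of the loop is that expert effort is concentrated on the lowest-scoring rationales, while the prompt generator improves from those corrections in each round.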
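For orientation, the records below illustrate what entries in the three datasets might contain, inferred only from the descriptions above; all field names and example values are hypothetical, not the released schema.

```python
# Hypothetical record layouts for the three datasets; field names and values
# are illustrative assumptions, not the actual released schema.
videocot_example = {
    "video_id": "v_000123",
    "question": "Why does the dog start barking?",
    "cot": (
        "The dog notices a stranger approaching the gate; dogs commonly bark "
        "at unfamiliar people entering their territory; therefore the barking "
        "is triggered by the stranger's arrival."
    ),
    "answer": "Because a stranger approaches the gate.",
}

topicqa_example = {
    "video_id": "v_000123",
    "topic": "pet behavior",
    "question": "Is this video relevant to the topic 'pet behavior'?",
    "answer": "Yes",
}

topiccot_example = {
    "video_id": "v_000123",
    "topic": "pet behavior",
    "cot": (
        "The video centers on a dog's reaction to a visitor, which is an "
        "instance of pet behavior, so the video is relevant to the topic."
    ),
    "relevant": True,
}
```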