CinePile: A Long Video Question Answering Dataset and Benchmark

14 Jun 2024 | Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein
CinePile is a long video question answering dataset and benchmark designed to address a key limitation of existing datasets: few of them pose genuine long-form comprehension challenges. The dataset comprises approximately 305,000 multiple-choice questions (MCQs) across 9,396 videos, covering visual and multimodal aspects such as temporal comprehension, human-object interactions, and event reasoning. CinePile was built through a multi-stage pipeline: movie clips were collected from YouTube and paired with audio descriptions and movie metadata, and questions were then generated automatically using large language models (LLMs). The dataset emphasizes question diversity and difficulty, and humans outperform both open-source and proprietary video-centric LLMs by significant margins. The paper also presents a detailed evaluation of a range of LLMs on the test split of CinePile, highlighting the gap between human and model performance, particularly on tasks requiring temporal and narrative reasoning. The dataset and evaluation methods are publicly available, aiming to narrow the gap between open-source and commercial video understanding models and to provide a comprehensive benchmark for future research.
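To make the evaluation setup concrete, below is a minimal sketch of how MCQ accuracy might be scored on a benchmark of this kind. The record schema, the field names, and the placeholder model call are illustrative assumptions, not the authors' actual data format or evaluation harness.

```python
# Minimal sketch of MCQ-style evaluation in the spirit of CinePile.
# The record layout and answer_fn below are hypothetical stand-ins.

from typing import Callable

# A hypothetical question record: one MCQ with five answer choices.
example = {
    "question": "What does the protagonist pick up before leaving the room?",
    "choices": ["A set of keys", "A photograph", "A revolver", "A letter", "A phone"],
    "answer_index": 3,  # ground-truth choice (0-based); here, "A letter"
}

LETTERS = "ABCDE"

def format_prompt(record: dict) -> str:
    """Render an MCQ as a single prompt string with A-E labeled options."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(record["choices"]))
    return (
        f"Question: {record['question']}\n{options}\n"
        "Answer with the letter of the correct option."
    )

def score(records: list[dict], answer_fn: Callable[[str], str]) -> float:
    """Accuracy of a model that maps a prompt to a letter A-E."""
    correct = 0
    for rec in records:
        # Take the first character of the reply as the predicted letter.
        pred = answer_fn(format_prompt(rec)).strip().upper()[:1]
        if pred == LETTERS[rec["answer_index"]]:
            correct += 1
    return correct / len(records)

# Stand-in "model" that always answers "A"; a real evaluation would
# replace this lambda with a call to a video-centric LLM.
print(score([example], lambda prompt: "A"))  # -> 0.0
```

A real harness would additionally feed the video frames (and, for multimodal models, subtitles or audio descriptions) alongside the prompt; the scoring logic, however, reduces to the same letter-matching accuracy shown here.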