CinePile: A Long Video Question Answering Dataset and Benchmark


14 Jun 2024 | Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein
CinePile is a large-scale long-form video understanding dataset and benchmark containing 305,000 multiple-choice questions (MCQs) across 9,396 videos. The questions are diverse and challenging, covering temporal understanding, visual analysis, complex reasoning, and narrative comprehension. The benchmark is designed to evaluate models' ability to understand long videos, which remains difficult for existing vision-language models (VLMs) that often rely on single frames or short clips. Questions require understanding of dialogue, visual elements, and temporal progression, and the dataset is structured to avoid overemphasizing purely visual or classification tasks.

The dataset was created with a novel pipeline that combines human-generated data with large language models (LLMs) for automated question generation and verification: audio descriptions of movies allow the pipeline to author complex questions without explicit video input. CinePile includes both training and test splits; the test split contains 148 videos and 4,941 questions and has been evaluated against several commercial and open-source VLMs, revealing that even state-of-the-art models lag significantly behind human performance. Quality checks eliminated trivial or poorly framed questions, and a human study identified systematic issues in the dataset, leading to improvements in question diversity and quality.

CinePile covers a wide range of question types, with a focus on temporal, narrative, and thematic reasoning. It is publicly available and includes a leaderboard for evaluating new video LLMs. The benchmark addresses several limitations of existing video understanding datasets, including the need for large-scale instruction-tuning data and the lack of diverse question types.
CinePile is thus a comprehensive benchmark for long-form video understanding, with an emphasis on multimodal reasoning and temporal understanding, providing a challenging and diverse set of questions for evaluating video understanding models.
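Since the benchmark scores models by their answers to multiple-choice questions, evaluation reduces to comparing a model's chosen option against the answer key for each question. The sketch below illustrates that scoring step; the function name and the letter-choice format are illustrative assumptions, not the dataset's actual schema or official evaluation code.

```python
# Minimal sketch of MCQ accuracy scoring, assuming each question has
# lettered answer choices and a single correct key (illustrative only;
# not CinePile's official evaluation harness).

def score_mcq(predictions, answer_keys):
    """Return the fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answer_keys):
        raise ValueError("predictions and answer keys must align")
    # Compare case-insensitively so "c" and "C" count as the same choice.
    correct = sum(
        p.strip().upper() == k.strip().upper()
        for p, k in zip(predictions, answer_keys)
    )
    return correct / len(answer_keys)

# Example: a model answered 3 of 4 questions correctly.
preds = ["A", "c", "B", "E"]
keys = ["A", "C", "D", "E"]
print(score_mcq(preds, keys))  # 0.75
```

Human accuracy and model accuracy reported on the leaderboard would be computed the same way, aggregated over all test-split questions.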