MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

24 Jun 2024 | Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, Tao Chen
MovieLLM is a novel framework for generating high-quality, consistent video data for instruction tuning, addressing the scarcity and bias of existing long-video datasets. It leverages GPT-4's linguistic capabilities together with textual inversion to create movie-level video instruction data in three stages: movie plot generation, style immobilization, and video instruction data generation (each stage is sketched in code below).

In the first stage, GPT-4 generates detailed movie plots with specific elements such as theme, overview, and style. In the second stage, textual inversion embeds the style description into the latent space of a diffusion model, fixing a consistent style across generated scenes. In the third stage, the generated plot is combined with the style-anchored diffusion model to produce consistent key frames and corresponding question-answer pairs.

This approach significantly improves the ability of multimodal models to understand complex video narratives, overcoming the limitations of existing datasets. Because a single description is enough to create a customized movie, the framework offers a scalable and efficient alternative to traditional data collection, and its automatic annotation reduces manual labor and the associated costs. Extensive experiments validate MovieLLM's effectiveness: it outperforms existing methods in generating high-quality key frames, and multimodal large language models trained on the generated data show improved long-video understanding.

MovieLLM's contributions include a novel pipeline for generating movie-level video instruction data, a comprehensive dataset for movie-level video understanding, and a benchmark for evaluating long-video comprehension. The results demonstrate that MovieLLM significantly enhances model performance on video understanding tasks.
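To make the first stage concrete, here is a minimal sketch of how one might prompt GPT-4 for a structured plot. The prompt wording, the JSON schema, and the seed description are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative sketch of stage 1: prompting GPT-4 for a structured movie
# plot. The schema and prompt text are assumptions for illustration only.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PLOT_PROMPT = (
    "Write a movie plot as JSON with keys: 'theme', 'overview', 'style', "
    "and 'scenes' (an ordered list of short visual scene descriptions). "
    "Return only the JSON object."
)

def generate_plot(seed_description: str) -> dict:
    """Ask GPT-4 to grow a one-line description into a full structured plot."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PLOT_PROMPT},
            {"role": "user", "content": seed_description},
        ],
    )
    # Assumes the model follows the instruction and returns valid JSON.
    return json.loads(response.choices[0].message.content)

plot = generate_plot("A lighthouse keeper discovers a message from the future.")
print(plot["style"], len(plot["scenes"]))
```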
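The second stage hinges on textual inversion: a single new token embedding is optimized while every pretrained weight stays frozen, which is what "immobilizes" the style. The sketch below illustrates that training loop with toy stand-ins for the text encoder and denoiser (an assumption made for brevity); a real implementation would target a Stable Diffusion U-Net.

```python
# Minimal sketch of the textual-inversion idea behind style immobilization:
# learn ONE new token embedding that reproduces a target style, keeping the
# pretrained model frozen. Toy modules stand in for the real CLIP text
# encoder and diffusion U-Net.
import torch
import torch.nn as nn

torch.manual_seed(0)
EMB, IMG = 32, 64  # toy embedding and "image" sizes

# Frozen stand-in for the pretrained conditional denoiser.
denoiser = nn.Sequential(nn.Linear(EMB + IMG, 128), nn.ReLU(), nn.Linear(128, IMG))
for p in denoiser.parameters():
    p.requires_grad_(False)

# The ONLY trainable parameter: the embedding of the new style pseudo-token.
style_token = nn.Parameter(torch.randn(EMB) * 0.02)
optimizer = torch.optim.AdamW([style_token], lr=5e-3)

style_images = torch.randn(16, IMG)  # stand-in for style reference frames

for step in range(200):
    x0 = style_images[torch.randint(0, 16, (4,))]
    noise = torch.randn_like(x0)
    t = torch.rand(4, 1)                    # toy continuous noise level
    noisy = (1 - t) * x0 + t * noise        # toy forward diffusion process
    cond = style_token.expand(4, EMB)       # condition on the learned token
    pred = denoiser(torch.cat([cond, noisy], dim=1))
    loss = nn.functional.mse_loss(pred, noise)  # standard noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final denoising loss: {loss.item():.4f}")
```

Because only the embedding vector is trained, the style can later be invoked in any prompt simply by inserting the pseudo-token, with no change to the diffusion model itself.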
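Finally, a hedged sketch of the third stage: rendering every scene with the style token learned in stage 2, so the key frames of one "movie" share a single consistent look. The model ID, embedding path, and token name below are placeholder assumptions, not artifacts released by the paper.

```python
# Hedged sketch of stage 3: generate stylistically consistent key frames by
# prefixing every scene prompt with the learned style pseudo-token.
import torch
from diffusers import StableDiffusionPipeline  # pip install diffusers

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the frozen style embedding from stage 2 as a new pseudo-token
# (placeholder file path; diffusers supports loading textual inversions).
pipe.load_textual_inversion("./style_embedding.bin", token="<movie-style>")

scenes = [
    "a lighthouse keeper reads a letter at dawn",
    "a storm closes over the harbor at night",
    "the keeper signals a distant ship with the lamp",
]

# Reusing the same learned token in every prompt is what keeps the key
# frames consistent across the whole generated plot.
for i, scene in enumerate(scenes):
    image = pipe(f"<movie-style>, {scene}").images[0]
    image.save(f"frame_{i:02d}.png")
```

Each rendered frame sequence would then be paired with GPT-4-generated question-answer pairs to form the final instruction-tuning records.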