VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

13 Jun 2024 | Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
**VideoGPT+** is a video understanding model that integrates image and video encoders to enhance video comprehension. It combines the strengths of both: an image encoder for rich spatial detail and a video encoder for global temporal context. To address the limitations of each encoder, VideoGPT+ employs a segment-wise sampling strategy that divides videos into smaller segments, and applies adaptive pooling to the resulting features to capture fine-grained temporal dynamics at a manageable token count. These visual features are fed to a large language model (LLM), which generates comprehensive responses to video-based questions.

**Contributions:**
1. **VideoGPT+**: The first video-conversation model to benefit from a dual-encoding scheme combining image and video features.
2. **VCG+ 112K**: An improved video-instruction tuning dataset built with a semi-automatic annotation pipeline, providing dense video captions and reasoning-based QA pairs.
3. **VCGBench-Diverse**: A benchmark covering 18 broad video categories for comprehensive evaluation of video LMMs.

**Methods:**
- **Overall Architecture**: Combines segment-wise sampling, dual vision encoders, vision-language adapters, and a large language model.
- **Segment-wise Sampling**: Divides videos into segments and samples frames within each to capture fine-grained temporal cues.
- **Dual Vision Encoder**: Uses an image encoder for spatial features and a video encoder for temporal context.
- **Visual Adapter**: Projects visual features into the language space and performs adaptive token pooling to reduce token length.

**Experiments:**
- **VCGBench**: Outperforms previous models.
- **VCGBench-Diverse**: Demonstrates superior performance across diverse video categories.
- **MVBench**: Shows significant improvements on a range of temporal understanding tasks.
- **Zero-shot Question-Answering**: Achieves superior results on multiple datasets.
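The segment-wise sampling and adaptive token pooling described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the uniform within-segment sampling rule, and the simple average-pooling of token windows are all assumptions chosen to mirror the described behavior.

```python
# Hedged sketch of segment-wise frame sampling and adaptive token pooling,
# in the spirit of VideoGPT+'s pipeline. Names and details are illustrative.

def segment_wise_sample(num_frames, num_segments, frames_per_segment):
    """Split a video into equal-length segments and uniformly sample
    frame indices within each segment (center-of-bin sampling)."""
    seg_len = num_frames / num_segments
    sampled = []
    for s in range(num_segments):
        start = s * seg_len
        step = seg_len / frames_per_segment
        sampled.append(
            [int(start + step * (i + 0.5)) for i in range(frames_per_segment)]
        )
    return sampled  # one list of frame indices per segment


def adaptive_pool_tokens(tokens, out_len):
    """Reduce a sequence of feature vectors to out_len tokens by
    average-pooling contiguous windows (a 1-D adaptive average pool)."""
    n, dim = len(tokens), len(tokens[0])
    pooled = []
    for j in range(out_len):
        lo = j * n // out_len
        hi = max(lo + 1, (j + 1) * n // out_len)
        window = tokens[lo:hi]
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Usage: 64 frames -> 4 segments x 4 frames; pool 4 tokens down to 2.
segments = segment_wise_sample(num_frames=64, num_segments=4, frames_per_segment=4)
pooled = adaptive_pool_tokens([[1.0], [2.0], [3.0], [4.0]], out_len=2)
```

Each segment would then be encoded independently (image encoder per frame, video encoder per segment) before its tokens are pooled and projected into the LLM's input space.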
**Conclusion:** VideoGPT+ enhances video understanding by leveraging the complementary strengths of image and video encoders. It outperforms previous models on multiple benchmarks, and introduces VCG+ 112K and VCGBench-Diverse to improve training-data quality and evaluation diversity.