**InternVideo2: Scaling Foundation Models for Multimodal Video Understanding**
**Authors:** Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang
**Affiliations:** OpenGVLab, Shanghai AI Laboratory, Nanjing University, Shenzhen Institutes of Advanced Technology, CAS
**GitHub:** [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)
**Abstract:**
InternVideo2 is a new family of video foundation models (ViFMs) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. The core design is a progressive training scheme that unifies masked video modeling, crossmodal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters. At the data level, spatiotemporal consistency is prioritized by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text. Extensive experiments validate these designs and demonstrate superior performance on over 60 video and audio tasks, particularly in video-related dialogue and long-video understanding.
**Introduction:**
Learning transferable spatiotemporal representations is crucial for a wide range of computer vision applications. InternVideo2 leverages masked reconstruction, crossmodal contrastive learning, and next-token prediction to enhance spatiotemporal perception, semantic alignment, and open-ended dialogue capabilities. The model is trained in three progressive stages: reconstructing unmasked video tokens, video-audio-speech-language contrastive learning, and connecting to a large language model (LLM) for joint training.
**Methods:**
- **Video Encoder:** Uses a Vision Transformer (ViT) with additional projection layers and attention pooling.
- **Stage 1:** Masks most video tokens and reconstructs the remaining unmasked tokens by aligning their features with expert models such as InternVL-6B and VideoMAEv2-g (see the distillation sketch after this list).
- **Stage 2:** Aligns video with audio, speech, and text through crossmodal contrastive and matching losses (see the contrastive-loss sketch after this list).
- **Stage 3:** Trains next-token prediction by connecting the encoder to an LLM within a video-centric dialogue system, enhancing long-term spatiotemporal capabilities (see the next-token sketch after this list).
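The following is a minimal, hypothetical sketch of the Stage 1 objective in PyTorch-style code: most video tokens are masked out, the student encoder sees only the visible tokens, and their features are regressed onto the corresponding features of frozen expert teachers (e.g. InternVL-6B or VideoMAEv2-g). The `student` and `teacher` callables, the masking ratio, and the MSE objective are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stage1_distill_loss(student, teacher, video_tokens, mask_ratio=0.8):
    """Align the student's unmasked-token features with a frozen teacher.

    video_tokens: (B, N, D) patch embeddings of a video clip.
    """
    B, N, _ = video_tokens.shape
    num_keep = int(N * (1 - mask_ratio))

    # Randomly select the subset of tokens that stay visible ("unmasked").
    keep_idx = torch.rand(B, N, device=video_tokens.device).argsort(dim=1)[:, :num_keep]
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, video_tokens.size(-1))
    visible = torch.gather(video_tokens, 1, gather_idx)

    # Student encodes only the visible tokens; the frozen teacher sees the full clip.
    student_feat = student(visible)                      # (B, num_keep, D)
    with torch.no_grad():
        teacher_feat = teacher(video_tokens)             # (B, N, D)
    teacher_at_visible = torch.gather(teacher_feat, 1, gather_idx)

    # Regress student features onto teacher features at the visible positions
    # (a projection head to match feature dimensions is omitted for brevity).
    return F.mse_loss(student_feat, teacher_at_visible)
```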
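The video-text part of the Stage 2 objective can be illustrated with a standard symmetric InfoNCE loss. This is a generic sketch of the technique rather than the paper's exact formulation; the audio and speech pairs, the matching loss, and any loss weighting are omitted.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) pooled and projected features."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    logits = v @ t.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)   # matching pairs on the diagonal

    # Symmetric cross-entropy over video-to-text and text-to-video directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```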
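For Stage 3, the sketch below shows one common way to connect a video encoder to a causal LLM for next-token prediction, assuming a HuggingFace-style interface: projected video tokens are prepended to the text embeddings and only the text positions are supervised. The `projector`, the token layout, and the loss masking are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def stage3_next_token_loss(video_feat, text_ids, projector, llm):
    """video_feat: (B, N, D_v) encoder outputs; text_ids: (B, T) dialogue token ids."""
    vis_embeds = projector(video_feat)                    # (B, N, D_llm)
    txt_embeds = llm.get_input_embeddings()(text_ids)     # (B, T, D_llm)
    inputs = torch.cat([vis_embeds, txt_embeds], dim=1)   # video prefix + text

    logits = llm(inputs_embeds=inputs).logits             # (B, N + T, vocab)

    # Supervise only the text positions, shifted by one for next-token prediction.
    text_logits = logits[:, vis_embeds.size(1):-1, :]     # predicts text_ids[:, 1:]
    targets = text_ids[:, 1:]
    return F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                           targets.reshape(-1))
```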
**Datasets:**
- **K-Mash:** A new unlabeled video collection spanning diverse perspectives and durations.
- **InternVid2:** A multimodal video dataset pairing videos with audio-speech information and textual descriptions.
- **VidCap:** A multimodal video annotation system for refining captions.
**Experiments:**
- **Video Classification:** Achieves state-of-the-art results on datasets like Kinetics, Something-Something V2, and Charades.
- **Temporal