InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

2024 | Yi Wang*, Kunchang Li*, Xinhao Li*, Jiashuo Yu*, Yinan He*, Chenting Wang*, Guo Chen*, Baoqi Pei*, Ziang Yan*, Rongkun Zheng*, Jilan Xu*, Zun Wang*, Yansong Shi*, Tianxiang Jiang*, Songze Li*, Hongjie Zhang*, Yifei Huang*, Yu Qiao*, Yali Wang*, Limin Wang*
InternVideo2 is a new family of video foundation models (ViFM) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Its core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next token prediction, scaling the video encoder to 6B parameters. At the data level, video-text alignment is improved by semantically segmenting videos and generating fused video-audio-speech captions.

The model is trained on a large-scale multimodal video-centric dataset of 402M data entries, including 2M videos, 50M video-text pairs, 50M video-audio-speech-text pairs, and 300M image-text pairs. Training proceeds in three stages: (1) capturing spatiotemporal structure via masked video token reconstruction, (2) aligning the video encoder with semantics from other modalities through cross-modal contrastive learning, and (3) enhancing its open-ended dialogue ability through video-based next token prediction. This progressive scheme scales the entire training process across the three stages; a sketch of what the stage-2 objective could look like is given below.

Extensive experiments on over 60 video and audio tasks, spanning action recognition, video retrieval, and question answering, show that InternVideo2 achieves state-of-the-art performance on multiple tasks and outperforms prior models on video-related dialogue and long video understanding benchmarks. It is able to analyze and reason over sequences of actions, and its strength in video-centric dialogue and long video understanding demonstrates its potential for modeling high-level world knowledge.
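To make the cross-modal alignment stage more concrete, here is a minimal, illustrative sketch of a CLIP-style symmetric InfoNCE loss between pooled video and text embeddings, the kind of objective used for video-text contrastive alignment in stage 2. The function name, embedding dimensions, temperature value, and mean-pooled inputs are assumptions for illustration only, not InternVideo2's exact implementation (the paper's stage 2 also aligns audio and speech captions with video, which this sketch omits).

```python
# Illustrative sketch: symmetric video-text contrastive (InfoNCE) loss.
# Assumptions: embeddings are already pooled to (B, D); names are hypothetical.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb: (B, D) pooled video features
    text_emb:  (B, D) pooled text features
    """
    # Normalize so dot products become cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the video-to-text and text-to-video cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Random features stand in for encoder outputs in this toy example.
    B, D = 8, 512
    video_emb = torch.randn(B, D)   # e.g. pooled video-encoder output
    text_emb = torch.randn(B, D)    # e.g. pooled text-encoder output
    print(video_text_contrastive_loss(video_emb, text_emb).item())
```

In an actual training loop, `video_emb` and `text_emb` would come from the 6B-parameter video encoder and a text encoder, and the temperature would typically be a learnable parameter rather than a fixed constant.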