2024 | Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong
VideoPrism is a general-purpose video encoder that achieves state-of-the-art performance across a wide range of video understanding tasks. It is pretrained on a large corpus of 36 million high-quality video-caption pairs and 582 million video clips with noisy parallel text. Pretraining proceeds in two stages: video-text contrastive learning first aligns the video and text encoders, and masked video modeling then improves upon standard masked autoencoding by incorporating global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while still leveraging the associated text. The model is evaluated on four broad groups of video-centric understanding tasks, spanning classification, localization, retrieval, captioning, and question answering, and achieves state-of-the-art performance on 31 out of 33 video understanding benchmarks. It is also tested on scientific video datasets, demonstrating its generalizability and potential for scientific applications. Compared with existing ViFMs, VideoPrism shows significant improvements on most benchmarks, indicating that it is a robust and effective video encoder for a wide range of video understanding tasks.
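To make the stage-2 objective more concrete, below is a minimal, hypothetical PyTorch sketch of how masked video modeling with global-local distillation and token shuffling might be wired together. All names (`Stage2Distiller`, `student_enc`, `decoder`, `teacher_enc`), the MSE losses, the mean pooling, and the dimensions are illustrative assumptions, not taken from the VideoPrism paper or codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Stage2Distiller(nn.Module):
    """Sketch of one masked-modeling step with global-local distillation and
    token shuffling. Names and defaults are illustrative assumptions."""

    def __init__(self, student_enc, decoder, teacher_enc, dim):
        super().__init__()
        self.student_enc = student_enc   # trainable video encoder (stage 2)
        self.decoder = decoder           # lightweight decoder head
        self.teacher_enc = teacher_enc   # frozen stage-1 video encoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, video_tokens, mask_ratio=0.8):
        # video_tokens: (B, N, D) spatiotemporal patch embeddings of one clip.
        B, N, D = video_tokens.shape
        num_keep = max(1, int(N * (1 - mask_ratio)))

        # Teacher sees the full, unmasked video; its token-wise and pooled
        # embeddings serve as distillation targets (no gradients to the teacher).
        with torch.no_grad():
            teacher_tokens = self.teacher_enc(video_tokens)        # (B, N, D)
            teacher_global = teacher_tokens.mean(dim=1)            # (B, D)

        # Masked modeling: the student only sees a random subset of tokens.
        keep = torch.rand(B, N, device=video_tokens.device).argsort(1)[:, :num_keep]
        visible = torch.gather(video_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        student_tokens = self.student_enc(visible)                 # (B, num_keep, D)

        # Token shuffling: permute the student's output tokens before decoding
        # so the decoder cannot exploit positional shortcuts.
        perm = torch.randperm(num_keep, device=video_tokens.device)
        shuffled = student_tokens[:, perm, :]

        # Pad to full length with learned mask tokens and predict the teacher's
        # token embeddings for every position.
        pad = self.mask_token.expand(B, N - num_keep, D)
        decoded = self.decoder(torch.cat([shuffled, pad], dim=1))  # (B, N, D)

        # Local (token-wise) plus global (pooled) distillation losses.
        local_loss = F.mse_loss(decoded, teacher_tokens)
        global_loss = F.mse_loss(student_tokens.mean(dim=1), teacher_global)
        return local_loss + global_loss


# Tiny stand-in transformers, purely as a smoke test of the sketch above.
dim = 64
layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
model = Stage2Distiller(
    student_enc=nn.TransformerEncoder(layer(), num_layers=2),
    decoder=nn.TransformerEncoder(layer(), num_layers=1),
    teacher_enc=nn.TransformerEncoder(layer(), num_layers=2),
    dim=dim,
)
loss = model(torch.randn(2, 16, dim))  # (batch=2, tokens=16, dim=64)
```

In this sketch the frozen stage-1 encoder plays the teacher. A full implementation would also handle the position correspondence between the shuffled decoder outputs and the target token embeddings (for example via decoder positional embeddings), which is omitted here for brevity.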