VideoPrism: A Foundational Visual Encoder for Video Understanding
16 Jun 2024 | Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong
VideoPrism is a general-purpose video encoder designed to tackle a wide range of video understanding tasks, including classification, localization, retrieval, captioning, and question answering. The model is pre-trained on a heterogeneous corpus containing 36 million high-quality video-caption pairs and 582 million video clips with noisy parallel text. The pre-training approach combines global-local distillation of semantic video embeddings with a token shuffling scheme, enabling VideoPrism to focus on the video modality while still leveraging the text associated with videos. Extensive evaluations spanning four broad categories of video understanding tasks, including web video question answering and computer vision for science, show that VideoPrism achieves state-of-the-art performance on 31 out of 33 benchmarks. Across these benchmarks, VideoPrism consistently outperforms prior baselines, underscoring its robust generalizability.
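To make the two pre-training components concrete, here is a minimal sketch in PyTorch of how a global-local distillation loss and a token shuffling step might look. This is an illustrative reading of the abstract, not the authors' implementation: the function names, the mean-pooling choice for the global term, and the exact loss forms are all assumptions.

```python
import torch
import torch.nn.functional as F

def global_local_distillation_loss(student_tokens, teacher_tokens):
    """Hypothetical global-local distillation objective.

    student_tokens: (B, N, D) per-token embeddings from the student,
        which sees only the unmasked portion of the video.
    teacher_tokens: (B, N, D) per-token embeddings from a frozen
        teacher that sees the full video.
    """
    # Global term: align video-level (mean-pooled) embeddings
    # via cosine distance.
    s_global = F.normalize(student_tokens.mean(dim=1), dim=-1)
    t_global = F.normalize(teacher_tokens.mean(dim=1), dim=-1)
    global_loss = (1.0 - (s_global * t_global).sum(dim=-1)).mean()

    # Local term: align individual token embeddings.
    local_loss = F.mse_loss(
        F.normalize(student_tokens, dim=-1),
        F.normalize(teacher_tokens, dim=-1),
    )
    return global_loss + local_loss

def shuffle_tokens(tokens):
    """Randomly permute each sample's token sequence before decoding,
    so the decoder cannot exploit positional shortcuts and must rely
    on token content to reconstruct the teacher's embeddings."""
    B, N, D = tokens.shape
    perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
    shuffled = tokens.gather(1, perm.unsqueeze(-1).expand(B, N, D))
    return shuffled, perm
```

Under this reading, the global term preserves the video-level semantics learned from paired text, while the local term and the shuffling together force the encoder to produce token embeddings that are informative on their own, which benefits localization-style tasks.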