VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

17 Jun 2024 | Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

VideoLLaMA 2 is a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building on its predecessor, VideoLLaMA 2 incorporates a custom Spatial-Temporal Convolution (STC) connector to effectively capture the intricate spatial and temporal dynamics of video data. Additionally, an Audio Branch is integrated through joint training, enriching the model's multi-modal understanding capabilities by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. The model also shows reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks, underscoring its strong multimodal comprehension. All models are publicly available to facilitate further research.
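
To make the STC connector idea concrete, below is a minimal PyTorch sketch of a spatial-temporal convolution connector: per-frame patch features from a vision encoder are compressed jointly across time and space with a 3D convolution, then projected into the LLM's embedding space. The class name, layer choices, and dimensions here are illustrative assumptions based on the abstract's description, not the released VideoLLaMA 2 implementation.

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Illustrative spatial-temporal convolution connector (not the official code).

    Compresses per-frame vision-encoder patch features across time and space
    with a 3D convolution, then projects the resulting video tokens to the
    LLM embedding size.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096, downsample=(2, 2, 2)):
        super().__init__()
        # 3D conv jointly downsamples time (T) and space (H, W);
        # the (2, 2, 2) factor is an assumed choice for illustration.
        self.conv = nn.Conv3d(
            vision_dim, vision_dim,
            kernel_size=downsample, stride=downsample,
        )
        # Project the compressed video tokens into the LLM embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (B, T, H, W, C) patch features from a per-frame image encoder.
        x = x.permute(0, 4, 1, 2, 3)      # -> (B, C, T, H, W) for Conv3d
        x = self.conv(x)                  # -> (B, C, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # -> (B, T'*H'*W', C) token sequence
        return self.proj(x)               # -> (B, N, llm_dim)

# Example: 8 frames of 24x24 patch features with dimension 1024.
feats = torch.randn(1, 8, 24, 24, 1024)
tokens = STCConnectorSketch()(feats)
print(tokens.shape)  # torch.Size([1, 576, 4096]) after 2x downsampling per axis
```

The design intuition, as described in the abstract, is that convolving over the temporal axis (rather than pooling frames independently) lets the connector preserve local motion cues while still reducing the number of video tokens fed to the language model.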