17 Jun 2024 | Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
VideoLLaMA 2 is a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a Spatial-Temporal Convolution (STC) connector to capture the intricate spatial and temporal dynamics of video data, and it integrates an Audio Branch through joint training, enriching the model's multimodal understanding with audio cues. Comprehensive evaluations on multiple tasks, including multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning, show that VideoLLaMA 2 consistently achieves competitive results among open-source models and approaches some proprietary models on several benchmarks. It also shows reasonable improvements on audio-only and audio-video question-answering benchmarks. These advancements highlight VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems, and the model is publicly available to facilitate further research.

Architecturally, VideoLLaMA 2 features a dual-branch framework with a Vision-Language Branch and an Audio-Language Branch, allowing visual and audio data to be processed independently. The Vision-Language Branch pairs a CLIP-based image encoder with the STC connector for spatial-temporal representation learning, while the Audio-Language Branch uses BEATs for audio encoding and a multilayer perceptron to align audio features with the large language model. The model is trained on large-scale, weakly labeled image-text and video-text pairs and then undergoes multi-task fine-tuning on a variety of downstream datasets.

Evaluated on multiple video and audio understanding benchmarks, VideoLLaMA 2 shows strong results compared to both proprietary and open-source models. It also demonstrates strong audio understanding, outperforming existing models on several audio-only and audio-video question-answering benchmarks. The model handles complex multimodal data and performs well in a range of video-centric conversations, including global scene understanding, spatial-temporal orientation awareness, commonsense reasoning, and spatial-temporal fine-grained recognition. As a generalist Video-LLM, VideoLLaMA 2 can be further developed to benefit specialized but challenging problems such as long video understanding, video agents, autonomous driving, motion understanding, and robotic manipulation.
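To make the dual-branch design more concrete, below is a minimal, hypothetical PyTorch-style sketch of how the two branches might map encoder outputs into the language model's token space. The module names, feature dimensions, and the 3D-convolution configuration of the STC connector are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class STCConnector(nn.Module):
    """Sketch of a spatial-temporal convolution (STC) connector: downsamples
    per-frame CLIP patch tokens with a 3D convolution, then projects them into
    the LLM embedding space. Kernel/stride sizes and dims are illustrative."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # 3D conv aggregates local spatial-temporal neighborhoods of patch tokens.
        self.conv = nn.Conv3d(vision_dim, vision_dim, kernel_size=2, stride=2)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, height, width, vision_dim) CLIP patch features.
        x = frame_tokens.permute(0, 4, 1, 2, 3)   # -> (B, C, T, H, W)
        x = self.conv(x)                          # spatial-temporal downsampling
        x = x.flatten(2).transpose(1, 2)          # -> (B, tokens, vision_dim)
        return self.proj(x)                       # -> (B, tokens, llm_dim)

class AudioProjector(nn.Module):
    """Sketch of the MLP that maps BEATs audio features into the LLM space."""
    def __init__(self, audio_dim=768, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, audio_tokens):              # (B, audio_tokens, audio_dim)
        return self.mlp(audio_tokens)

# Example: both branches emit token sequences that would be concatenated with
# the text prompt embeddings and fed to the language model.
video_feats = torch.randn(1, 8, 16, 16, 1024)     # 8 frames of 16x16 CLIP patches
audio_feats = torch.randn(1, 50, 768)             # BEATs-style audio tokens
vision_tokens = STCConnector()(video_feats)
audio_tokens = AudioProjector()(audio_feats)
llm_inputs = torch.cat([vision_tokens, audio_tokens], dim=1)
print(llm_inputs.shape)                           # (1, num_multimodal_tokens, 4096)
```

The point of the sketch is that each branch independently produces a sequence of LLM-dimensional tokens, which is what allows the visual and audio streams to be encoded separately before being fused in the language model.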