Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

25 Oct 2023 | Hang Zhang, Xin Li, Lidong Bing
Video-LLaMA is a multi-modal framework designed to enable large language models (LLMs) to understand both visual and auditory content in videos. Unlike previous works that augment LLMs to process only visual or only audio signals, Video-LLaMA enables video comprehension by bootstrapping cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. It addresses two key challenges: capturing temporal changes in visual scenes and integrating audio-visual signals. To tackle these challenges, Video-LLaMA introduces a Video Q-Former for visual scene understanding and an Audio Q-Former for audio signal processing. The model is pre-trained on massive video/image-caption pairs and fine-tuned on visual-instruction datasets to align the outputs of the visual and audio encoders with the LLM's embedding space. Experimental results demonstrate that Video-LLaMA can perceive and comprehend video content, generating meaningful responses grounded in both visual and auditory information. The paper also discusses related work and provides examples showcasing Video-LLaMA's multi-modal instruction-following capabilities.
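
To make the described pipeline concrete, below is a minimal PyTorch sketch of the visual branch: patch features from a frozen per-frame encoder receive temporal position embeddings, are pooled by a Video Q-Former-style module, and are projected into the LLM's embedding space. Module names, dimensions, query counts, and the use of a generic transformer decoder here are illustrative assumptions based only on the summary above, not the authors' released implementation.

```python
# Illustrative sketch of Video-LLaMA's visual branch (assumed shapes and names).
import torch
import torch.nn as nn

class VideoQFormerBranch(nn.Module):
    """Frozen frame encoder features -> Video Q-Former -> projection to the LLM embedding space."""

    def __init__(self, frame_dim=1408, qformer_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=2, num_frames=8):
        super().__init__()
        # Learnable query tokens that summarize the video (query count is an assumption).
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        # Temporal position embeddings so the model can capture changes across frames.
        self.frame_pos = nn.Embedding(num_frames, qformer_dim)
        self.frame_proj = nn.Linear(frame_dim, qformer_dim)
        # A small transformer decoder standing in for the Video Q-Former.
        layer = nn.TransformerDecoderLayer(d_model=qformer_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Linear layer aligning Q-Former outputs with the LLM's embedding space.
        self.to_llm = nn.Linear(qformer_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, tokens_per_frame, frame_dim),
        # produced by a frozen image encoder applied to each sampled frame.
        b, t, n, _ = frame_feats.shape
        x = self.frame_proj(frame_feats)                          # (b, t, n, qformer_dim)
        pos = self.frame_pos(torch.arange(t, device=x.device))    # (t, qformer_dim)
        x = x + pos[None, :, None, :]                             # inject temporal order
        x = x.reshape(b, t * n, -1)                               # flatten frames into one sequence
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # (b, num_queries, qformer_dim)
        video_tokens = self.qformer(tgt=q, memory=x)              # queries cross-attend to frames
        return self.to_llm(video_tokens)                          # (b, num_queries, llm_dim)

if __name__ == "__main__":
    branch = VideoQFormerBranch()
    fake_frames = torch.randn(1, 8, 257, 1408)   # e.g. ViT patch tokens for 8 sampled frames
    print(branch(fake_frames).shape)             # torch.Size([1, 32, 4096])
```

The audio branch described in the summary would follow the same pattern, with a frozen audio encoder feeding an Audio Q-Former whose outputs are likewise projected into the LLM's embedding space; during training only the Q-Formers and projection layers would be updated, keeping the encoders and the LLM frozen.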