2024 | Guangzhi Sun*¹, Wenyi Yu*¹, Changli Tang*¹, Xianzhao Chen², Tian Tan², Wei Li², Lu Lu², Zejun Ma², Yuxuan Wang², Chao Zhang¹
This paper introduces video-SALMONN, an end-to-end audio-visual large language model (av-LLM) designed to process and understand videos, including visual frame sequences, audio events, music, and speech. To enhance speech understanding while maintaining efficiency for other video elements, the paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure that aligns pre-trained audio-visual encoders with a large language model (LLM). The MRC Q-Former captures the fine-grained temporal information required for speech understanding while ensuring efficient processing of other video components. Additionally, dedicated training approaches, such as a diversity loss and unpaired audio-visual mixed training, are introduced to avoid modality dominance. On the Speech-Audio-Visual Evaluation (SAVE) benchmark, video-SALMONN achieves significant improvements, with over 25% absolute accuracy gains on video QA tasks and over 30% on audio-visual QA tasks involving human speech. The model demonstrates advanced video comprehension and reasoning abilities, particularly in tasks requiring speech understanding and causal reasoning. The training code and model checkpoints are available at <https://github.com/bytedance/SALMONN/>.
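To make the multi-resolution causal Q-Former idea concrete, the sketch below shows one plausible way such a structure could be organized: the same windowed Q-Former block is applied at several temporal window sizes, so small windows retain the fine-grained detail needed for speech while large windows give a cheap coarse summary, and the resulting query tokens are projected into the LLM embedding space. This is a minimal illustration written against the abstract only, not the released implementation; all module names, dimensions, window sizes, and the exact windowing and causality scheme are assumptions.

```python
# Minimal sketch (not the paper's released code) of a multi-resolution, windowed
# Q-Former-style aligner. Every name and hyperparameter here is illustrative.
import torch
import torch.nn as nn


class CausalQFormerBlock(nn.Module):
    """One cross-attention block: learnable queries attend to the audio-visual
    features inside a single temporal window."""

    def __init__(self, d_model: int = 768, n_heads: int = 8, n_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, window_len, d_model) fused audio-visual features for one window
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, feats, feats)
        x = self.norm1(q + attn_out)
        return self.norm2(x + self.ffn(x))


class MultiResolutionCausalQFormer(nn.Module):
    """Applies the windowed block at several window sizes (resolutions) and
    concatenates the resulting query tokens before projecting them into the
    LLM embedding space as soft prompts."""

    def __init__(self, d_model: int = 768, llm_dim: int = 4096,
                 window_sizes=(4, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        self.blocks = nn.ModuleList(CausalQFormerBlock(d_model) for _ in window_sizes)
        self.proj = nn.Linear(d_model, llm_dim)

    def forward(self, av_feats: torch.Tensor) -> torch.Tensor:
        # av_feats: (batch, T, d_model) time-aligned, concatenated audio + visual features
        outputs = []
        for w, block in zip(self.window_sizes, self.blocks):
            # Causality here is approximated by letting each window see only
            # the frames up to its own end time.
            for start in range(0, av_feats.size(1), w):
                outputs.append(block(av_feats[:, start:start + w]))
        tokens = torch.cat(outputs, dim=1)   # (batch, n_windows * n_queries, d_model)
        return self.proj(tokens)             # fed to the LLM alongside the text prompt
```

Under these assumptions, the fine-resolution branch dominates the token budget for speech-heavy segments, while the coarse branch keeps the sequence length manageable for long videos; the real MRC Q-Former may realize the causal constraint and resolution fusion differently.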