Video-SALMONN is a speech-enhanced audio-visual large language model (av-LLM) designed for video processing. It can understand visual frame sequences, audio events, music, and speech.

The model's core component is a novel multi-resolution causal Q-Former (MRC Q-Former) that aligns audio-visual input features with the text representation space at three different temporal scales, so that speech and non-speech audio are handled together in a single end-to-end model. Alongside the multi-resolution design, a causal attention module captures temporal correlations among frames, providing the temporally fine-grained modeling that speech understanding in video requires.
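To make the multi-resolution causal design concrete, below is a minimal PyTorch-style sketch of the idea. The class names, window sizes, query counts, and dimensions are illustrative assumptions rather than the paper's configuration: learned query tokens cross-attend to non-overlapping windows of fused per-frame features at several temporal scales, after a causal self-attention pass injects temporal order.

```python
# Minimal sketch of a multi-resolution causal Q-Former-style encoder.
# Names, window sizes, query counts, and dimensions are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class CausalFrameAttention(nn.Module):
    """Causal self-attention over per-frame features (temporal correlations)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim)
        t = frames.size(1)
        causal_mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=frames.device), diagonal=1
        )
        out, _ = self.attn(frames, frames, frames, attn_mask=causal_mask)
        return out


class MultiResolutionQFormer(nn.Module):
    """Cross-attend learned queries to frame windows at several temporal scales."""

    def __init__(self, dim: int, num_queries: int = 4, window_sizes=(1, 4, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        self.causal = CausalFrameAttention(dim)
        self.queries = nn.ParameterDict(
            {str(w): nn.Parameter(torch.randn(num_queries, dim)) for w in window_sizes}
        )
        self.cross_attn = nn.ModuleDict(
            {str(w): nn.MultiheadAttention(dim, 8, batch_first=True) for w in window_sizes}
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) fused audio-visual features per frame
        b, t, d = frames.shape
        frames = self.causal(frames)  # inject causal temporal context
        outputs = []
        for w in self.window_sizes:
            q = self.queries[str(w)].expand(b, -1, -1)       # (b, num_queries, d)
            # slide a non-overlapping window of w frames over the sequence
            for start in range(0, t, w):
                kv = frames[:, start:start + w, :]
                out, _ = self.cross_attn[str(w)](q, kv, kv)  # (b, num_queries, d)
                outputs.append(out)
        # concatenate window-level query outputs along the token axis
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 32, 256)  # 2 videos, 32 frames, 256-dim fused features
    tokens = MultiResolutionQFormer(dim=256)(feats)
    print(tokens.shape)  # torch.Size([2, 168, 256]) with these illustrative settings
```

The window-level query outputs would then be projected into the LLM's input embedding space, which is how Q-Former-style adapters typically connect modality encoders to a language model.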
Training combines a cross-entropy loss with a diversity loss, and uses a mixed training scheme over both single-modal and audio-visual paired data. The diversity loss and the mixed scheme together prevent any single frame or modality from dominating, keeping attention balanced between the audio and visual inputs.

Video-SALMONN is evaluated on the speech-audio-visual evaluation (SAVE) benchmark, which comprises six single-modal tasks and four audio-visual tasks, including audio-visual speech recognition, audio-visual question answering (QA), and audio-visual matching. Compared with other av-LLMs, it achieves accuracy improvements of over 25% on video QA tasks and over 30% on audio-visual QA tasks involving human speech, with particularly strong results on tasks requiring speech understanding and causal reasoning. It can understand speech in videos and perform zero-shot audio-visual co-reasoning, jointly reasoning over what is heard and what is seen, which underscores its broader video comprehension and reasoning abilities.
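To illustrate the training objective described above, here is a minimal sketch of a combined cross-entropy-plus-diversity loss, again in PyTorch style. The diversity term shown, a pairwise cosine-similarity penalty on the Q-Former output tokens, and the weight div_weight are assumptions for illustration; the paper defines its own formulation.

```python
# Minimal sketch of a combined objective: cross-entropy on the text targets plus
# a diversity term over the Q-Former output tokens. The diversity formulation
# below (pairwise cosine-similarity penalty) is an illustrative assumption.
import torch
import torch.nn.functional as F


def diversity_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity among output tokens.

    tokens: (batch, num_tokens, dim) Q-Former outputs fed to the LLM.
    """
    z = F.normalize(tokens, dim=-1)
    sim = torch.bmm(z, z.transpose(1, 2))              # (batch, n, n) cosine similarities
    n = sim.size(1)
    off_diag = sim - torch.eye(n, device=sim.device)   # drop the self-similarity diagonal
    return off_diag.clamp(min=0).mean()                # penalize only positive similarity


def training_loss(logits, targets, av_tokens, div_weight: float = 0.1):
    """Cross-entropy over next-token predictions plus a weighted diversity term."""
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100
    )
    return ce + div_weight * diversity_loss(av_tokens)


if __name__ == "__main__":
    logits = torch.randn(2, 10, 32000)           # (batch, seq_len, vocab_size)
    targets = torch.randint(0, 32000, (2, 10))   # next-token labels
    av_tokens = torch.randn(2, 64, 256)          # Q-Former output tokens
    print(training_loss(logits, targets, av_tokens).item())
```

The intuition is that if the output tokens collapse onto a single frame or modality, their pairwise similarity grows and so does the penalty, nudging the model to spread its attention across frames and across the audio and visual streams, which is the frame and modality dominance the diversity loss is meant to avoid.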