20 Aug 2024 | Hang Hua*, Yunlong Tang*, Chenliang Xu, Jiebo Luo
This paper introduces V2Xum-LLM, a novel cross-modal video summarization framework that unifies different video summarization tasks in a single pre-trained language decoder, eliminating the task-specific heads used in prior methods and enabling end-to-end processing of long video sequences. To address the lack of video-language data for fine-tuning large vision-language models (VLMs) on video summarization, we construct Instruct-V2Xum, a new instruction-following dataset for cross-modal video summarization. It contains 30k diverse YouTube videos ranging from 40 to 940 seconds in length and enables VLMs to generate modality-controllable video summaries via task prompts; experiments validate the effectiveness of the proposed dataset.

We also present a comprehensive analysis of the limitations of current video summarization tasks from the perspectives of data, methods, and evaluation. Based on this analysis, we propose F_CLIP and Cross-F_CLIP, enhanced evaluation metrics for the V2V and V2VT summarization tasks. Experimental results show that these metrics are highly consistent with traditional metrics, including the F1 score and Spearman's and Kendall's rank correlations.

The instantiated model, V2Xum-LLaMA, takes interleaved video frames and temporal prompts as input and converts videos into multimodal summaries. It is trained with a temporal-aware decoding strategy that enhances the alignment between text and video summaries. Results show that V2Xum-LLaMA outperforms all strong baseline models on mainstream V2V, V2T, and V2VT benchmarks, and the proposed dataset and evaluation metrics provide a more accurate and comprehensive assessment of video summarization performance.
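Conceptually, F_CLIP can be read as a CLIP-embedding-based F-score between predicted and reference summary keyframes. The sketch below is a minimal illustration of that idea, not the paper's exact formulation: the function name `f_clip`, the threshold `tau`, and the assumption that frames are matched by maximum cosine similarity over L2-normalized CLIP embeddings are all illustrative assumptions.

```python
import numpy as np

def f_clip(pred_emb: np.ndarray, ref_emb: np.ndarray, tau: float = 0.8) -> float:
    """Hypothetical CLIP-based F-score between predicted and reference keyframes.

    pred_emb: (P, D) L2-normalized CLIP embeddings of predicted summary frames.
    ref_emb:  (R, D) L2-normalized CLIP embeddings of reference summary frames.
    tau:      similarity threshold for counting a frame as matched (assumed value).
    """
    # Pairwise cosine similarity; embeddings are assumed unit-normalized.
    sim = pred_emb @ ref_emb.T                       # shape (P, R)
    # Precision: fraction of predicted frames matching some reference frame.
    precision = float(np.mean(sim.max(axis=1) >= tau))
    # Recall: fraction of reference frames covered by some predicted frame.
    recall = float(np.mean(sim.max(axis=0) >= tau))
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

In practice the embeddings would come from a CLIP image encoder applied to the selected keyframes; Cross-F_CLIP would analogously compare text-summary embeddings against video-frame embeddings for the V2VT setting. Both readings are hedged interpretations of the metric names rather than a definitive specification from the paper.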