27 Jun 2024 | Hao Fei, Member, IEEE, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua and Shuicheng Yan, Fellow, IEEE
This paper addresses the limitations of existing video-language models (VLMs) in terms of coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and detached video-language views. To enhance VLMs, the authors propose a fine-grained structural spatio-temporal alignment learning method called Finsta. Finsta represents input texts and videos using scene graphs (SGs) and unifies them into a holistic SG (HSG) to bridge the two modalities. The framework includes a graph Transformer for encoding the textual SG, a recurrent graph Transformer for modeling spatial and temporal features, and a spatial-temporal Gaussian differential graph Transformer for capturing changes in objects. Finsta performs object-centered spatial alignment and predicate-centered temporal alignment to enhance video-language grounding. The method is designed as a plug-and-play system that can be integrated into existing VLMs without retraining from scratch or relying on SG annotations in downstream applications. Extensive experiments on 12 datasets across 6 representative video-language modeling tasks show that Finsta consistently improves the performance of 13 strong-performing VLMs, both in fine-tuning and zero-shot settings.
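To make the idea of fine-grained cross-modal alignment concrete, the following is a minimal, illustrative sketch, not the paper's implementation: an InfoNCE-style contrastive loss that pulls each textual scene-graph node embedding toward its matching video scene-graph node embedding. The function names, the use of NumPy, and the assumption that matched node pairs share the same row index are all illustrative choices, not details from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between the row vectors of a and b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def alignment_loss(text_nodes, video_nodes, tau=0.1):
    """InfoNCE-style contrastive loss over node embeddings.

    text_nodes, video_nodes: (N, D) arrays; row i of each is assumed
    (for illustration) to be a matched text/video node pair.
    """
    sims = cosine_sim(text_nodes, video_nodes) / tau
    # Numerically stable softmax over video nodes for each text node.
    sims = sims - sims.max(axis=1, keepdims=True)
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    # Negative log-likelihood of the matched (diagonal) pairs.
    return float(-np.log(np.diag(probs)).mean())
```

Under this sketch, well-aligned node pairs yield a lower loss than mismatched ones, which is the training signal that a contrastive alignment objective of this kind provides.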