27 Jun 2024 | Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua and Shuicheng Yan
This paper proposes Finsta, a fine-grained structural spatio-temporal alignment learning method for enhancing video-language models (VLMs). Finsta represents input texts and videos with fine-grained scene graph (SG) structures, which are then unified into a holistic SG (HSG) for cross-modal alignment.

A graph Transformer (GTrm) encodes the textual SG (TSG), while a novel recurrent graph Transformer (R-GTrm) models the video's dynamic SG (DSG) and the HSG, propagating features both spatially and temporally. A spatial-temporal Gaussian differential graph Transformer (STGD-GTrm) further strengthens the perception of object changes across the spatial and temporal dimensions. On top of these encoders, object-centered spatial alignment and predicate-centered temporal alignment ground video and language in both spatiality and temporality.

Finsta is designed as an efficient, plug-and-play module that can be integrated into a wide range of existing VLMs for representation augmentation, without retraining them or relying on SG annotations. Across six representative VL tasks and 12 datasets covering both standard and long-form video scenarios (video action recognition, video captioning, video-text retrieval, video question answering, video-paragraph retrieval, and long-form video question answering), Finsta consistently improves 13 strong-performing VLMs and sets new state-of-the-art results in both fine-tuning and zero-shot settings. In doing so, it addresses three key limitations of existing VLMs: coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and insufficient video-language collaboration.
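To make the encoder idea more concrete, below is a minimal, hypothetical sketch of a recurrent graph Transformer (R-GTrm) layer in PyTorch: per-frame scene-graph nodes are updated by adjacency-masked self-attention (spatial propagation), and a GRU cell carries each node's state across frames (temporal propagation). It assumes, for simplicity, a fixed number of nodes per frame; all class and variable names are illustrative assumptions rather than the authors' implementation, and the STGD-GTrm's Gaussian differential term is omitted for brevity.

```python
# Hypothetical R-GTrm-style layer: graph attention within each frame,
# GRU recurrence across frames. Illustrative only.
import torch
import torch.nn as nn

class RGTrmLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_norm = nn.LayerNorm(dim)
        self.rnn = nn.GRUCell(dim, dim)  # carries node states across frames

    def forward(self, frames, masks):
        # frames: list of T tensors, each (B, N, dim): per-frame DSG nodes.
        # masks:  list of T bool tensors, each (N, N); True blocks attention.
        #         Assumes self-loops, i.e. the diagonal is False.
        h_prev, outputs = None, []
        for x, mask in zip(frames, masks):
            # Spatial propagation: attend only along scene-graph edges.
            attn_out, _ = self.attn(x, x, x, attn_mask=mask)
            x = self.attn_norm(x + attn_out)
            x = self.ffn_norm(x + self.ffn(x))
            # Temporal propagation: fuse with the previous frame's states
            # (the first frame simply initializes the recurrent state).
            flat = x.reshape(-1, x.size(-1))
            h_prev = flat if h_prev is None else self.rnn(flat, h_prev)
            outputs.append(h_prev.view_as(x))
        return outputs  # list of T tensors, each (B, N, dim)
```

The adjacency mask restricts message passing to SG edges within a frame, while the recurrence lets object states evolve across frames; this separation mirrors the spatial and temporal feature propagation the summary attributes to the R-GTrm.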
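The two alignment objectives described above can likewise be reduced to a simple contrastive form. The sketch below assumes matched textual and visual node embeddings have already been extracted: object-centered spatial alignment pulls paired text and video object nodes together, and predicate-centered temporal alignment does the same for predicate nodes after pooling their visual features over frames. This is an illustrative InfoNCE-style reduction, not the paper's exact losses; all function names and the mean-pooling step are assumptions.

```python
# Hedged, InfoNCE-style reduction of the two Finsta alignment objectives.
import torch
import torch.nn.functional as F

def info_nce(text_emb, vis_emb, tau: float = 0.07):
    # text_emb, vis_emb: (M, dim); row i on each side is a matched pair.
    text_emb = F.normalize(text_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)
    logits = text_emb @ vis_emb.t() / tau  # (M, M) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over both retrieval directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def alignment_loss(txt_obj, vis_obj, txt_pred, vis_pred_frames):
    # txt_obj, vis_obj:  (M, dim) matched textual/visual object nodes.
    # txt_pred:          (K, dim) textual predicate nodes.
    # vis_pred_frames:   (K, T, dim) visual predicate nodes over T frames.
    spatial = info_nce(txt_obj, vis_obj)     # object-centered alignment
    vis_pred = vis_pred_frames.mean(dim=1)   # pool predicates over time
    temporal = info_nce(txt_pred, vis_pred)  # predicate-centered alignment
    return spatial + temporal
```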