Context-Guided Spatio-Temporal Video Grounding

3 Jan 2024 | Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang
Context-guided Spatio-Temporal Video Grounding (CG-STVG) is a novel framework that improves spatio-temporal video grounding (STVG) by mining instance visual context from the video itself. It introduces two key modules: Instance Context Generation (ICG) and Instance Context Refinement (ICR). ICG extracts visual context (appearance and motion) for the target object, and ICR refines that context by removing irrelevant or harmful features. The refined context then guides target localization, improving the model's ability to locate the object accurately. The framework is transformer-based, with ICG and ICR deployed at each decoding stage so that the instance context is improved iteratively and guides localization throughout decoding.
Experimental results on three benchmarks (HCSTVG-v1, HCSTVG-v2, and VidSTG) show that CG-STVG achieves state-of-the-art performance in terms of m_tIoU and m_vIoU, demonstrating the effectiveness of instance context for STVG. The code is available at https://github.com/HengLan/CGSTVG.
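The summary names m_tIoU and m_vIoU without defining them. The sketch below follows the definitions commonly used in the STVG literature, which is an assumption here rather than something taken from the CG-STVG code: tIoU is the ratio of intersecting to united frames between the predicted and ground-truth tubes, vIoU sums the per-frame box IoU over the intersecting frames and divides by the number of united frames, and the m_ prefix denotes the mean over test samples. The tube representation (a dict from frame index to box) and all function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]
# Assumed tube format: frame index -> box for every frame in the tube.
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred: Tube, gt: Tube) -> float:
    """Temporal IoU: shared frames divided by frames covered by either tube."""
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    return len(set(pred) & set(gt)) / len(frames_union)


def v_iou(pred: Tube, gt: Tube) -> float:
    """Spatio-temporal IoU: box IoU summed over shared frames,
    normalized by the number of frames covered by either tube."""
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(frames_union)


def mean_metrics(samples: List[Tuple[Tube, Tube]]) -> Tuple[float, float]:
    """Return (m_tIoU, m_vIoU) averaged over (prediction, ground truth) pairs."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    m_tiou = sum(t_iou(p, g) for p, g in samples) / n
    m_viou = sum(v_iou(p, g) for p, g in samples) / n
    return m_tiou, m_viou
```

Because vIoU is normalized by the union of frames, a prediction with perfect boxes still loses score when its temporal boundaries are off, which is why the two metrics are usually reported together.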