Context-Guided Spatio-Temporal Video Grounding

3 Jan 2024 | Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang
The paper introduces Context-Guided Spatio-Temporal Video Grounding (CG-STVG), a framework for improving the accuracy of spatio-temporal video grounding (STVG). STVG aims to localize a specific instance in a video, in both space and time, given a text query. Existing methods often struggle with distractors or heavy object appearance variations because the text alone provides insufficient information about the target. To address this, CG-STVG mines discriminative instance context from the video and uses it as supplementary guidance for target localization. The key components of CG-STVG are two modules: Instance Context Generation (ICG) and Instance Context Refinement (ICR). ICG discovers visual context information (appearance and motion) of the instance, while ICR improves this context by eliminating irrelevant or even harmful information. During grounding, ICG and ICR are deployed at each decoding stage of a Transformer architecture to learn instance context. The learned context is fed to the next stage, where it enhances target-awareness and helps generate better new instance context. Experiments on three benchmarks (HCSTVG-v1/v2 and VidSTG) show that CG-STVG outperforms existing methods, setting new state-of-the-art results in m_tIoU and m_vIoU. The code for CG-STVG is available at <https://github.com/HengLan/CGSTVG>.
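
The stage-wise flow described above (generate context, refine it, pass it to the next decoding stage) can be illustrated with a toy control-flow sketch. This is not the authors' implementation: every function name, the dot-product scoring, the `top_k` selection, and the `min_relevance` threshold are hypothetical stand-ins chosen for illustration, and plain Python lists replace real video and text features. For the actual model, see the linked repository.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def decode_stage(frames: List[Vector], query: Vector,
                 context: Optional[Vector]) -> List[float]:
    """Toy stand-in for one decoding stage (assumption, not the paper's decoder).

    Scores each frame against the text query, augmented by the instance
    context from the previous stage when one is available.
    """
    guide = query if context is None else [q + c for q, c in zip(query, context)]
    return [dot(f, guide) for f in frames]


def generate_context(frames: List[Vector], scores: List[float],
                     top_k: int) -> List[Vector]:
    """Toy ICG: collect the features of the top_k highest-scoring frames."""
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return [frames[i] for i in ranked[:top_k]]


def refine_context(candidates: List[Vector], query: Vector,
                   min_relevance: float) -> Optional[Vector]:
    """Toy ICR: drop candidates weakly related to the query, pool the rest."""
    kept = [c for c in candidates if dot(c, query) >= min_relevance]
    return mean(kept) if kept else None


def ground(frames: List[Vector], query: Vector, num_stages: int = 3,
           top_k: int = 3, min_relevance: float = 0.5
           ) -> Tuple[List[float], Optional[Vector]]:
    """Run the stages in sequence, feeding each stage's refined context onward."""
    context: Optional[Vector] = None
    scores: List[float] = []
    for _ in range(num_stages):
        scores = decode_stage(frames, query, context)
        candidates = generate_context(frames, scores, top_k)
        context = refine_context(candidates, query, min_relevance)
    return scores, context


# Four made-up 2-D "frame features"; the first two resemble the query.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.0]
scores, context = ground(frames, query)
print(scores, context)
```

In this toy run the generation step proposes three frames, the refinement step discards the one that is weakly related to the query, and the pooled context then raises the scores of the query-like frames in later stages. The point is only the data flow between stages; how CG-STVG actually extracts appearance and motion context and decides what to discard is defined by the paper and its code.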