This paper proposes a method for referring video segmentation that decouples static perception from hierarchical motion perception to enhance temporal understanding. Static cues (image-level features) and motion cues (temporal features) are processed separately so that motion information in the video is captured more faithfully: static cues identify potential candidate objects frame by frame, while motion cues pinpoint the referred targets by aligning the motion-related parts of the expression with temporal features across the video.
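To make the decoupling concrete, the sketch below shows one plausible way to realize it in PyTorch: per-frame object queries attend only to static-word embeddings for candidate filtering, while per-object trajectories attend to motion-word embeddings for target pinpointing. The module name, feature shapes, and attention layout are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of decoupled static vs. motion perception (not the authors' code).
# Assumptions: per-frame object queries of shape (T, N, D); static-word and motion-word
# embeddings of shape (Ls, D) and (Lm, D) split from the referring expression.
import torch
import torch.nn as nn


class DecoupledPerception(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Static branch: frame-level cross-attention between object queries and static words.
        self.static_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Motion branch: temporal self-attention over each object's trajectory,
        # followed by cross-attention with motion words.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.motion_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, obj_queries, static_words, motion_words):
        # obj_queries: (T, N, D) -- N candidate objects per frame, T frames.
        T, N, D = obj_queries.shape

        # Static perception: each frame's queries attend to static cues only.
        static_ctx = static_words.unsqueeze(0).expand(T, -1, -1)             # (T, Ls, D)
        static_q, _ = self.static_attn(obj_queries, static_ctx, static_ctx)  # (T, N, D)
        static_score = self.score_head(static_q).mean(dim=0)                 # (N, 1), candidate filtering

        # Motion perception: reshape to per-object trajectories over time.
        traj = obj_queries.permute(1, 0, 2)                                   # (N, T, D)
        traj, _ = self.temporal_attn(traj, traj, traj)                        # temporal mixing
        motion_ctx = motion_words.unsqueeze(0).expand(N, -1, -1)              # (N, Lm, D)
        motion_q, _ = self.motion_attn(traj, motion_ctx, motion_ctx)          # (N, T, D)
        motion_score = self.score_head(motion_q.mean(dim=1))                  # (N, 1), target pinpointing

        return static_score, motion_score
```

The split keeps the static branch cheap (frame-level only) and concentrates temporal modeling in the motion branch, which is the part of the pipeline the paper argues matters most for motion-rich expressions.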
A key contribution is a hierarchical motion perception module that progressively captures motion information across increasing timescales, enabling the model to understand both short-term and long-term motions. The design mimics how humans comprehend video: short clips are perceived first, and long-term concepts are built up from the recollection of those short-term perceptions. In addition, the method employs contrastive learning to separate the motions of visually similar objects, sharpening the model's ability to differentiate them based on motion cues alone.
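A minimal illustration of these two ideas is sketched below: motion features are aggregated over progressively longer temporal windows (each level halves the temporal resolution so that later levels see longer "clips"), and an InfoNCE-style contrastive loss contrasts the referred object's motion embedding against those of other candidates in the same video. The pooling scheme, number of levels, and loss formulation are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical sketch: hierarchical motion aggregation over growing timescales, plus a
# simple InfoNCE-style contrastive loss on motion embeddings (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalMotion(nn.Module):
    def __init__(self, dim: int = 256, levels: int = 3):
        super().__init__()
        # One temporal self-attention layer per timescale level.
        self.levels = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True) for _ in range(levels)]
        )

    def forward(self, traj):
        # traj: (N, T, D) per-object features across T frames.
        feats = traj
        for attn in self.levels:
            # Attend within the current timescale, then halve the temporal
            # resolution so the next level reasons over longer-term clips.
            feats, _ = attn(feats, feats, feats)
            if feats.shape[1] > 1:
                feats = F.avg_pool1d(
                    feats.transpose(1, 2), kernel_size=2, ceil_mode=True
                ).transpose(1, 2)
        return feats.mean(dim=1)  # (N, D) long-term motion embedding per object


def motion_contrastive_loss(motion_emb, expr_emb, target_idx, temperature: float = 0.07):
    # motion_emb: (N, D) motion embeddings of all candidate objects in a video;
    # expr_emb:   (D,)   embedding of the motion expression (the anchor);
    # target_idx: index of the referred object (positive); other objects are negatives.
    obj = F.normalize(motion_emb, dim=-1)
    txt = F.normalize(expr_emb, dim=-1)
    logits = obj @ txt / temperature                      # (N,) object-to-expression similarity
    labels = torch.tensor([target_idx], device=obj.device)
    return F.cross_entropy(logits.unsqueeze(0), labels)
```

Because the negatives are other objects from the same video, visually similar instances that move differently supply exactly the hard negatives the contrastive objective needs.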
The proposed method, named DsHmp, is evaluated on five referring video segmentation datasets: MeViS, Ref-YouTubeVOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences. It achieves new state-of-the-art performance on all five, including a notable 9.2% J&F improvement on the challenging MeViS dataset, demonstrating that the approach is both effective at capturing motion information and generalizable across benchmarks. Qualitative visualizations further show that it handles complex, motion-rich language descriptions well.