The paper "Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation" addresses the challenge of referring video segmentation, which relies on natural language expressions to identify and segment objects in videos. The authors propose a method that decouples the understanding of referring expressions into static and motion perception, emphasizing the importance of enhancing temporal comprehension. They introduce an expression-decoupling module to separate static and motion cues, a hierarchical motion perception module to capture temporal information across varying timescales, and contrastive learning to distinguish visually similar objects using motion cues. The contributions of the paper include:
1. **Decoupling Static and Motion Perception**: The referring sentence is decoupled into static and motion cues, which play distinct and complementary roles in image-level and temporal-level understanding, respectively (a token-gating sketch of this idea follows the list).
2. **Hierarchical Motion Perception**: This module processes both short-term and long-term motion, capturing motion patterns across different frame intervals (see the second sketch below).
3. **Contrastive Learning**: This objective strengthens the model's ability to distinguish visually similar objects by producing discriminative motion representations from motion cues (see the contrastive-loss sketch below).
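To make the decoupling idea concrete, here is a minimal sketch (not the paper's code) of one way to split a referring expression's word features into static and motion cues with a learned soft gate. The class name `StaticMotionDecoupler`, the feature shapes, and the gating mechanism are all illustrative assumptions.

```python
# Hypothetical sketch: soft-gate word features into static vs. motion cues.
import torch
import torch.nn as nn

class StaticMotionDecoupler(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Per-word score estimating how "motion-like" a word is
        # (e.g., verbs and adverbs vs. nouns and adjectives).
        self.gate = nn.Linear(dim, 1)

    def forward(self, words: torch.Tensor):
        # words: (L, C) word embeddings of one expression.
        m = torch.sigmoid(self.gate(words))                          # (L, 1)
        motion_feat = (m * words).sum(0) / m.sum().clamp(min=1e-6)   # (C,)
        s = 1.0 - m
        static_feat = (s * words).sum(0) / s.sum().clamp(min=1e-6)   # (C,)
        return static_feat, motion_feat

words = torch.randn(7, 256)  # e.g., "the cat walking away from the camera"
static_feat, motion_feat = StaticMotionDecoupler(256)(words)
```

The static feature can then condition per-frame (image-level) grounding while the motion feature conditions temporal reasoning, matching the complementary roles described above.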
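For the hierarchical motion perception module, the following is a minimal sketch, under assumptions, of multi-timescale temporal modeling: the same temporal layer is applied to frame features subsampled at growing strides, so each level attends over a longer effective frame interval. The class name, the choice of strides, and the use of transformer layers are illustrative, not the paper's implementation.

```python
# Hypothetical sketch: temporal attention at several frame intervals.
import torch
import torch.nn as nn

class HierarchicalMotionPerception(nn.Module):
    def __init__(self, dim: int, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        # One temporal layer per timescale level.
        self.temporal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in strides]
        )
        self.fuse = nn.Linear(dim * len(strides), dim)

    def forward(self, x: torch.Tensor):
        # x: (B, T, C) per-frame object/query features.
        B, T, C = x.shape
        levels = []
        for stride, layer in zip(self.strides, self.temporal):
            xs = layer(x[:, ::stride])  # attend within a coarser timescale
            # Upsample back to T frames so levels can be concatenated.
            xs = xs.repeat_interleave(stride, dim=1)[:, :T]
            levels.append(xs)
        return self.fuse(torch.cat(levels, dim=-1))  # (B, T, C)

feats = torch.randn(2, 16, 256)
out = HierarchicalMotionPerception(256)(feats)
```

Small strides capture short-term motion (e.g., a sudden turn), while large strides expose long-term patterns (e.g., repeatedly circling an area), which is the intuition behind processing multiple frame intervals.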
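Finally, a minimal sketch of an InfoNCE-style contrastive loss that pulls the referred object's motion embedding toward the expression's motion cue and pushes away visually similar distractors in the same clip. The function name, shapes, and temperature are illustrative assumptions.

```python
# Hypothetical sketch: contrastive loss over object motion embeddings.
import torch
import torch.nn.functional as F

def motion_contrastive_loss(obj_motion: torch.Tensor,
                            text_motion: torch.Tensor,
                            pos_idx: int,
                            tau: float = 0.07) -> torch.Tensor:
    # obj_motion: (N, C) motion embeddings of N candidate objects in the clip.
    # text_motion: (C,) motion-cue embedding of the expression.
    # pos_idx: index of the object that actually matches the expression.
    obj = F.normalize(obj_motion, dim=-1)
    txt = F.normalize(text_motion, dim=-1)
    logits = obj @ txt / tau  # (N,) similarity of each object to the text
    target = torch.tensor(pos_idx)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

loss = motion_contrastive_loss(torch.randn(5, 256), torch.randn(256), pos_idx=2)
```

Because the negatives are objects that look alike but move differently, the loss forces the motion representations themselves to become discriminative, which is the stated goal of this component.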
The proposed approach, named DsHmp, achieves state-of-the-art performance on five datasets, including a significant 9.2% $\mathcal{J} \& \mathcal{F}$ improvement on the challenging MeViS dataset. The paper also supports each component with ablation studies and qualitative visualizations.