Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

This paper introduces Depth-aware Test-Time Training (DATTT), an approach to zero-shot video object segmentation (ZSVOS). The goal of ZSVOS is to segment the primary moving objects in a video without any human annotation at inference time. Existing methods often generalize poorly to unseen videos because large-scale training datasets are lacking. To address this, the authors propose a test-time training (TTT) strategy that requires the model to predict consistent depth maps during inference. The key idea is to train a single network that performs both segmentation and depth prediction; a depth modulation layer lets the depth prediction head and the mask prediction head interact. During TTT, the model is updated so that it predicts consistent depth maps for the same frame under different data augmentations. The authors compare several TTT weight-update strategies and find that momentum-based weight initialization and a looping-based training scheme give more stable improvements. Experiments on five widely used ZSVOS datasets (DAVIS-16, FBMS, LongVideos, MCL, and SegTrackV2) show that DATTT achieves significant improvements over state-of-the-art TTT methods and a clear advantage over existing ZSVOS approaches, which supports the value of performing TTT at inference time. The code for the proposed method is available at <https://nifangbaage.github.io/DATTT/>.
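The test-time update described in the summary can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the two-parameter linear `predict_depth` stands in for a deep network, the augmentation is a simple brightness/contrast change, and the way momentum-based weight initialization and looping are combined in `ttt_video` is one plausible reading of those terms. In this toy, minimizing consistency alone would eventually push the model toward a constant depth map, so only a few small steps are taken per frame.

```python
import random


def predict_depth(weights, frame):
    """Toy depth head (hypothetical): depth = w0 * intensity + w1 per pixel."""
    w0, w1 = weights
    return [w0 * p + w1 for p in frame]


def augment(frame, gain, bias):
    """Photometric augmentation: changes brightness/contrast, not geometry."""
    return [gain * p + bias for p in frame]


def consistency_loss(weights, frame, aug_a, aug_b):
    """Mean squared difference between depth maps of two augmented views."""
    d_a = predict_depth(weights, augment(frame, *aug_a))
    d_b = predict_depth(weights, augment(frame, *aug_b))
    return sum((a - b) ** 2 for a, b in zip(d_a, d_b)) / len(frame)


def numerical_grad(loss_fn, weights, eps=1e-5):
    """Central-difference gradient; a real system would use autodiff."""
    grad = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad


def adapt_on_frame(weights, frame, rng, steps=5, lr=0.05):
    """Take a few gradient steps on the depth-consistency loss for one frame."""
    weights = list(weights)
    for _ in range(steps):
        # Sample two different augmentations of the same frame.
        aug_a = (rng.uniform(0.8, 1.2), rng.uniform(-0.1, 0.1))
        aug_b = (rng.uniform(0.8, 1.2), rng.uniform(-0.1, 0.1))
        grad = numerical_grad(
            lambda w: consistency_loss(w, frame, aug_a, aug_b), weights
        )
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def ttt_video(frames, trained_weights, momentum=0.9, loops=2, seed=0):
    """Adapt per frame; the starting weights follow a momentum average
    (assumed reading of momentum-based weight initialization), and the
    video is traversed `loops` times (assumed reading of looping)."""
    rng = random.Random(seed)
    init = list(trained_weights)
    per_frame = [None] * len(frames)
    for _ in range(loops):
        for t, frame in enumerate(frames):
            adapted = adapt_on_frame(init, frame, rng)
            per_frame[t] = adapted  # weights used to predict frame t
            # Blend the adapted weights into the next frame's starting point.
            init = [momentum * i + (1 - momentum) * a
                    for i, a in zip(init, adapted)]
    return per_frame, init
```

Calling `ttt_video(frames, trained_weights)` with each frame given as a list of pixel intensities returns one adapted weight vector per frame plus the final momentum-averaged initialization. Setting `momentum=0.0` in this sketch starts each frame from the previous frame's adapted weights, while `momentum=1.0` always restarts from the trained weights.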

Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

7 Mar 2024 | Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi-Man Pun, Xiaodong Cun