Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

7 Mar 2024 | Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi-Man Pun, Xiaodong Cun
Depth-aware Test-Time Training (DATTT) is introduced for Zero-shot Video Object Segmentation (ZSVOS), aiming to improve performance by leveraging depth information at test time. The key idea is to enforce the model to predict consistent depth maps for the same frame under different data augmentations during test-time training.

The proposed framework combines segmentation and depth estimation in a single network, implemented with a shared image encoder, a flow encoder, and separate decoder heads for each task. A depth modulation layer enables interaction between the depth prediction and mask prediction heads. The model is first trained on a large-scale dataset to learn both tasks, and is then updated during test-time training by optimizing the consistency of its depth maps.

The results show significant improvements over state-of-the-art test-time training methods, with the momentum-based weight initialization and looping-based training scheme leading to more stable gains. Evaluated on five widely used ZSVOS datasets, the method improves segmentation performance and is competitive with other ZSVOS approaches; the depth modulation layer is found to be particularly effective for test-time training. The code is available at the provided link.
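To make the adaptation procedure concrete, here is a minimal PyTorch-style sketch of the idea described above. The model object, its image_encoder, predict_depth, and predict_mask members, the FiLM-style modulation layer, and all hyperparameters are hypothetical placeholders; the paper's momentum-based weight initialization is simplified here to a plain restore of the source weights. This is an illustration of the depth-consistency objective, not the authors' released implementation.

```python
# Minimal sketch of depth-aware test-time training (DATTT), assuming a
# hypothetical PyTorch model that exposes an `image_encoder` module, a
# `predict_depth(frame)` call, and a `predict_mask(frame, flow)` call.
# Names and hyperparameters are illustrative, not the authors' code.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthModulation(nn.Module):
    """Illustrative depth modulation layer: FiLM-style scale-and-shift of
    segmentation features conditioned on features from the depth head."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_scale = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_shift = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, mask_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        return mask_feat * torch.sigmoid(self.to_scale(depth_feat)) + self.to_shift(depth_feat)


def photometric_augment(frame: torch.Tensor) -> torch.Tensor:
    """One of two photometric views of the same frame (brightness/contrast
    jitter that keeps the geometry, and hence the depth, unchanged)."""
    brightness = 1.0 + 0.2 * (torch.rand(1, device=frame.device) - 0.5)
    contrast = 1.0 + 0.2 * (torch.rand(1, device=frame.device) - 0.5)
    mean = frame.mean(dim=(-2, -1), keepdim=True)
    return ((frame - mean) * contrast + mean) * brightness


def depth_aware_ttt(model, video_frames, flows, steps_per_frame=3, lr=1e-5):
    """Adapt the model to one test video by enforcing consistent depth
    predictions across augmented views of each frame, then segment it."""
    # Start every video from the source-trained weights so that updates made
    # on one video do not leak into the next one.
    source_state = copy.deepcopy(model.state_dict())
    optimizer = torch.optim.SGD(model.image_encoder.parameters(), lr=lr)

    masks = []
    for frame, flow in zip(video_frames, flows):
        # Looping-style scheme: several consistency updates on the same frame.
        for _ in range(steps_per_frame):
            depth_a = model.predict_depth(photometric_augment(frame))
            depth_b = model.predict_depth(photometric_augment(frame))

            # Self-supervised objective: both views show the same scene, so
            # their predicted depth maps should agree.
            loss = F.l1_loss(depth_a, depth_b)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Segment with the adapted weights; inside the model, the depth
        # modulation layer conditions the mask head on the depth features.
        with torch.no_grad():
            masks.append(model.predict_mask(frame, flow))

    model.load_state_dict(source_state)  # restore for the next video
    return masks
```

The point of the sketch is that only the self-supervised depth-consistency loss drives the test-time updates, so no ground-truth masks are needed at inference; the segmentation head benefits indirectly through the shared encoder and the depth modulation layer.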