UniVS: Unified and Universal Video Segmentation with Prompts as Queries


10 Jun 2024 | Minghan Li1,2*, Shuai Li1,2*, Xindong Zhang2 and Lei Zhang1,2†
The paper introduces UniVS, a novel unified video segmentation (VS) architecture that addresses the challenge of developing a single model to handle both category-specified and prompt-specified VS tasks. Category-specified VS tasks focus on detecting and tracking objects from predefined categories, while prompt-specified VS tasks require re-identifying targets with visual or textual prompts throughout the video. UniVS uses prompts as queries: it averages the prompt features from previous frames as initial queries and introduces a target-wise prompt cross-attention layer to integrate prompt features from the memory pool.

This approach converts different VS tasks into prompt-guided target segmentation, eliminating the need for heuristic inter-frame matching. UniVS achieves robust performance on 10 challenging VS benchmarks, including VIS, VSS, VPS, VOS, RefVOS, and PVOS, demonstrating a balance between performance and universality. The method is trained using a three-stage process: image-level training, video-level training, and long-video fine-tuning, ensuring universal training and testing. Experimental results show that UniVS outperforms existing unified and individual VS models, achieving state-of-the-art performance on VSS and VPS tasks.
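The core mechanism described above, averaging a target's prompt features from previous frames into an initial query, then refining that query with target-wise cross-attention over the memory pool, can be sketched in NumPy. This is an illustrative simplification, not the paper's implementation: the function names, the single-head attention, and the residual update are assumptions for clarity, and the real model operates on learned, multi-head transformer layers inside the decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def initial_prompt_query(memory_pool):
    """Initial query for one target: average of its prompt features
    stored from previous frames. memory_pool: (T, d) -> (d,)."""
    return memory_pool.mean(axis=0)

def target_wise_prompt_cross_attention(query, memory_pool):
    """Single-head cross-attention sketch: the query attends ONLY to its
    own target's prompt features (target-wise), so no cross-target
    matching is needed. query: (d,), memory_pool: (T, d) -> (d,)."""
    d = query.shape[-1]
    attn = softmax(memory_pool @ query / np.sqrt(d))  # (T,) weights over past frames
    return query + attn @ memory_pool                 # residual update of the query
```

A query updated this way can then be fed to a standard mask decoder to segment the same target in the current frame, which is how different VS tasks reduce to prompt-guided target segmentation.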