UniVS: Unified and Universal Video Segmentation with Prompts as Queries

10 Jun 2024 | Minghan Li, Shuai Li, Xindong Zhang, Lei Zhang
UniVS is a unified and universal video segmentation framework that handles both category-specified and prompt-specified video segmentation tasks. The framework uses prompts as queries to decode masks: it integrates prompt features from previous frames into each target's initial query and introduces a target-wise prompt cross-attention layer to enhance the decoding process. This design eliminates the need for heuristic inter-frame matching, so a single architecture can serve different tasks.

The architecture consists of an image encoder, a prompt encoder, and a unified video mask decoder, which together decode masks for any entity or prompt-guided target in the video. Training proceeds in multiple stages: image-level training, video-level training, and long-video fine-tuning.

Evaluated across diverse video segmentation benchmarks, UniVS achieves a strong balance between performance and universality: it is competitive on category-specified tasks and improves on prompt-specified tasks, reaching state-of-the-art results on video instance, semantic, panoptic, object, and referring segmentation. UniVS is the first model to unify all existing video segmentation tasks within a single framework, demonstrating strong generalization across scenarios. Its ability to handle both category-specified and prompt-specified tasks makes it a versatile solution for video segmentation.
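The target-wise prompt cross-attention described above can be pictured as follows. This is a simplified NumPy sketch under assumed shapes, not the paper's actual implementation (which runs inside a transformer decoder with learned Q/K/V projections): each target's query attends only to that same target's prompt features accumulated from previous frames, which is why no inter-frame matching heuristic is needed.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def target_wise_prompt_cross_attention(queries, prompt_feats):
    """Hypothetical sketch of target-wise prompt cross-attention.

    queries:      (T, d)    one decoding query per target
    prompt_feats: (T, P, d) P prompt tokens per target, gathered
                            from previous frames

    Each target t attends ONLY to its own prompt tokens
    prompt_feats[t], so targets never mix across identities.
    """
    T, d = queries.shape
    out = np.empty_like(queries)
    for t in range(T):
        q = queries[t]            # (d,)
        k = v = prompt_feats[t]   # (P, d) — keys and values share features here
        attn = softmax(q @ k.T / np.sqrt(d))  # (P,) attention over prompt tokens
        out[t] = attn @ v         # prompt-conditioned query update
    return out
```

In a full decoder this update would be added residually to the query before the usual self- and cross-attention over image features; the per-target restriction is what replaces post-hoc track association.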