VISA: Reasoning Video Object Segmentation via Large Language Models

2024-07-16 | Cilin Yan*, Haochen Wang*, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang†, Weidi Xie, Efstratios Gavves
This paper introduces ReasonVOS, a new video object segmentation task that requires reasoning over world knowledge and video context to produce binary mask sequences in response to implicit text queries. The task is more demanding than traditional referring video segmentation, which relies on explicit text descriptions.

To address this challenge, the authors propose VISA (Video-based large language Instructed Segmentation Assistant), which integrates long-term video features with complex text queries to enable reasoning-based video object segmentation. VISA leverages the reasoning capabilities of multi-modal large language models while retaining the ability to segment and track objects in videos through a mask decoder.

The authors also introduce ReVOS, a large-scale benchmark dataset consisting of 35,074 instruction-mask sequence pairs from 1,042 diverse videos. Its complex text instructions require reasoning over both video content and general world knowledge, and the dataset is used for instruction tuning and evaluation of ReasonVOS models.

Experiments on eight datasets demonstrate that VISA handles both complex reasoning segmentation and vanilla referring segmentation in the video and image domains, achieving state-of-the-art performance on the corresponding benchmarks. The code and dataset are available at https://github.com/cilinyan/VISA.
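The dataflow described above, where a multi-modal LLM reasons over video features and an implicit query, and its output then conditions a mask decoder that produces one binary mask per frame, can be sketched roughly as follows. Note that every class, shape, and scoring rule here is a hypothetical stand-in for illustration only, not the authors' implementation:

```python
import numpy as np

EMBED_DIM = 64  # illustrative feature dimension

class ToyMultiModalLLM:
    """Stand-in for the reasoning LLM: maps (video features, query)
    to a single target-object embedding."""
    def reason(self, video_feats: np.ndarray, query: str) -> np.ndarray:
        # Pool per-frame features and mix in a trivial "query" signal.
        pooled = video_feats.mean(axis=0)            # shape: (EMBED_DIM,)
        return pooled + len(query) * 1e-3

class ToyMaskDecoder:
    """Stand-in for the mask decoder: turns the target embedding
    into a binary mask for one frame."""
    def decode(self, frame_feats: np.ndarray, target: np.ndarray) -> np.ndarray:
        # Similarity between per-pixel features and the target embedding,
        # thresholded into a binary mask.
        scores = frame_feats @ target                # shape: (H, W)
        return (scores > scores.mean()).astype(np.uint8)

def segment_video(frames: np.ndarray, query: str) -> np.ndarray:
    """frames: (T, H, W, EMBED_DIM) per-pixel features for T frames.
    Returns a (T, H, W) binary mask sequence, one mask per frame."""
    llm, decoder = ToyMultiModalLLM(), ToyMaskDecoder()
    # Long-term video context: average pooled features across all frames.
    video_feats = frames.reshape(frames.shape[0], -1, EMBED_DIM).mean(axis=1)
    target = llm.reason(video_feats, query)
    return np.stack([decoder.decode(f, target) for f in frames])

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8, EMBED_DIM))
masks = segment_video(frames, "the animal most likely to be a pet")
print(masks.shape)  # (4, 8, 8)
```

The key structural point the sketch captures is the division of labor: the language model resolves the implicit query into a single target representation once, while the decoder applies that representation independently to every frame, which is what makes segmenting and tracking across the whole video tractable.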