VISA: Reasoning Video Object Segmentation via Large Language Models

16 Jul 2024 | Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, Efstratios Gavves
The paper introduces a new task, Reasoning Video Object Segmentation (ReasonVOS), which aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning over world knowledge and video context. To address this task, the authors propose VISA (Video-based large language Instructed Segmentation Assistant), a model that combines the world-knowledge reasoning capabilities of multi-modal large language models (MLLMs) with a mask decoder that segments and tracks objects across video frames. The paper also establishes a comprehensive benchmark dataset, ReVOS, consisting of 35,074 instruction-mask sequence pairs from 1,042 diverse videos, to evaluate ReasonVOS models. Experiments on eight datasets demonstrate the effectiveness of VISA on complex reasoning segmentation tasks in both the video and image domains. The main contributions are the introduction of ReasonVOS, the design of VISA, and the creation of the ReVOS dataset. The code and dataset are available at <https://github.com/cilinyan/VISA>.
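To make the described architecture concrete, below is a minimal sketch of how such a pipeline could be wired together. Everything in it is an illustrative assumption rather than the authors' actual implementation: the module names (`MultiModalLLMStub`, `MaskDecoderStub`), the feature dimensions, and the use of a special segmentation-token embedding as the bridge between the LLM and the decoder are all hypothetical. The intent is only to show the information flow: a multi-modal LLM fuses per-frame visual features with the text query and emits one embedding, which a promptable mask decoder turns into a mask for every frame.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a VISA-style pipeline. Module names, dimensions,
# and the segmentation-token mechanism are assumptions for illustration,
# not the paper's actual implementation.

class MultiModalLLMStub(nn.Module):
    """Stands in for a multi-modal LLM: fuses frame features with a text
    query and emits a hidden state for a special segmentation token."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(dim, dim)  # project visual tokens
        self.text_proj = nn.Linear(dim, dim)   # project text-query tokens
        self.fuse = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor):
        # frame_feats: (T, N, dim) patch tokens per frame; text_feats: (L, dim)
        tokens = torch.cat(
            [self.frame_proj(frame_feats).flatten(0, 1),
             self.text_proj(text_feats)],
            dim=0,
        ).unsqueeze(0)                  # (1, T*N + L, dim)
        fused = self.fuse(tokens)
        return fused[0, -1]             # last hidden state as the seg embedding

class MaskDecoderStub(nn.Module):
    """Stands in for a promptable mask decoder (SAM-style): turns the
    segmentation embedding plus per-frame features into one mask per frame."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.to_query = nn.Linear(dim, dim)

    def forward(self, seg_embed: torch.Tensor, frame_feats: torch.Tensor):
        # Dot-product the prompt embedding against each frame's patch
        # tokens, yielding per-frame mask logits over the patch grid.
        q = self.to_query(seg_embed)                       # (dim,)
        return torch.einsum("tnd,d->tn", frame_feats, q)   # (T, N) logits

# Toy usage: 8 frames, 196 patch tokens per frame, a 5-token query.
T, N, L, D = 8, 196, 5, 256
frame_feats = torch.randn(T, N, D)   # would come from a vision encoder
text_feats = torch.randn(L, D)       # would come from a text tokenizer/encoder

llm, decoder = MultiModalLLMStub(D), MaskDecoderStub(D)
seg_embed = llm(frame_feats, text_feats)       # reasoning -> seg embedding
mask_logits = decoder(seg_embed, frame_feats)  # (T, N) mask-sequence logits
masks = mask_logits.sigmoid() > 0.5            # binary mask per frame
print(masks.shape)                             # torch.Size([8, 196])
```

In the real system, the frame features would come from a vision encoder, the decoder would produce pixel-resolution masks, and the masks would be tracked consistently across frames; this stub only illustrates how the LLM's reasoning output can prompt a mask decoder.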