LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

12 Apr 2024 | Junchi Wang, Lei Ke
LLM-Seg is a two-stage method that combines vision language models (VLMs) and vision foundation models to enable reasoning segmentation: a task in which the segmentation system must interpret an implicit user intention through large language model (LLM) reasoning and then segment the corresponding target. The framework connects the Segment Anything Model (SAM) with LLMs through mask proposal selection. Alongside the model, the paper introduces LLM-Seg40K, a dataset constructed with an automatic data generation pipeline that serves as a new benchmark for training and evaluating reasoning segmentation approaches.

The model integrates multiple foundation models and is composed of four parts: a pretrained LLaVA-7B model, SAM, DINOv2, and a mask selection module. LLaVA perceives the input image and question and emits a special <SEG> token; the mask selection module then selects among SAM's mask proposals based on the <SEG> token's embedding. SAM and DINOv2 are kept completely frozen, while the vision language model is fine-tuned with LoRA. Training is supervised by a combination of losses from the IoU and IoP selection heads.
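To make the selection mechanism concrete, here is a minimal PyTorch sketch of the idea: the <SEG> hidden state is projected into the same space as per-proposal mask features, and two heads score each proposal by predicted IoU and IoP. The module name, dimensions, and fusion scheme are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only; shapes, names, and fusion are assumptions.
import torch
import torch.nn as nn

class MaskSelectionModule(nn.Module):
    """Scores SAM mask proposals against the <SEG> token embedding."""

    def __init__(self, seg_dim: int = 4096, mask_dim: int = 256, hidden: int = 256):
        super().__init__()
        # Project the LLaVA <SEG> hidden state into the mask-feature space.
        self.seg_proj = nn.Linear(seg_dim, hidden)
        self.mask_proj = nn.Linear(mask_dim, hidden)
        # Two scoring heads: one regresses IoU with the ground-truth mask,
        # the other IoP (intersection over prediction).
        self.iou_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.iop_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seg_token: torch.Tensor, mask_feats: torch.Tensor):
        # seg_token: (B, seg_dim), the hidden state of the <SEG> token
        # mask_feats: (B, N, mask_dim), one feature vector per SAM proposal
        q = self.seg_proj(seg_token).unsqueeze(1)      # (B, 1, hidden)
        k = self.mask_proj(mask_feats)                 # (B, N, hidden)
        fused = q * k                                  # element-wise fusion per proposal
        iou_scores = self.iou_head(fused).squeeze(-1)  # (B, N)
        iop_scores = self.iop_head(fused).squeeze(-1)  # (B, N)
        return iou_scores, iop_scores

# At inference, the proposal with the highest predicted score is selected:
# iou_scores, iop_scores = module(seg_token, mask_feats)
# best = iou_scores.argmax(dim=-1)
```

Each head's prediction can be supervised with a regression loss against the true IoU/IoP of each proposal, which matches the paper's description of combining losses from the two selection heads.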
The LLM-Seg40K dataset contains 14K images, split into training, validation, and test sets, and provides high-quality question-segmentation pairs for model training and validation. It is built with an automatic data generation pipeline that leverages GPT-4 to process existing semantic segmentation datasets into examples tailored for reasoning segmentation; a sketch of this pipeline idea follows below. Experiments show that LLM-Seg achieves competitive performance compared with existing methods. The code, models, and dataset are available at https://github.com/wangjunchi/LLMSeg.
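The following is a minimal sketch of how such an automatic pipeline could query an LLM to turn a semantic segmentation annotation into a reasoning-style question. It assumes the OpenAI chat API and a hypothetical prompt; the authors' actual prompts, model choice, and filtering steps are not reproduced here.

```python
# Illustrative sketch of automatic question generation; the prompt and
# helper function are hypothetical, not the paper's exact pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_reasoning_question(object_names: list[str], target: str) -> str:
    """Turn a segmentation annotation (object classes) into an implicit,
    reasoning-style question whose answer is the target object."""
    prompt = (
        "An image contains the following objects: "
        + ", ".join(object_names)
        + f". Write one question that requires reasoning to identify the "
        f"'{target}' without naming it directly."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example with an annotation listing three classes and a chosen target:
# generate_reasoning_question(["kettle", "mug", "stove"], "kettle")
# -> e.g., "Which object would you use to boil water for tea?"
```

Pairing each generated question with the target object's ground-truth mask yields the kind of question-segmentation pairs the dataset provides.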