LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

12 Apr 2024 | Junchi Wang, Lei Ke
LLM-Seg is a two-stage method that combines vision language models (VLMs) and vision foundation models to enable reasoning segmentation: a task in which the segmentation system must interpret an implicit user intention through large language model (LLM) reasoning and then segment the corresponding target. The framework connects the Segment Anything Model (SAM) with LLMs through mask proposal selection. Alongside the model, the paper introduces LLM-Seg40K, a dataset constructed with an automatic data generation pipeline that serves as a new benchmark for training and evaluating reasoning segmentation approaches.

The model integrates multiple foundation models and is composed of four parts: a pretrained LLaVA-7B model, SAM, DINOv2, and a mask selection module. LLaVA perceives the input image and question and emits a special <SEG> token; the mask selection module then selects among SAM's mask proposals based on the <SEG> token's embedding. SAM and DINOv2 are kept completely frozen, while the vision language model is fine-tuned with LoRA. Training is supervised by a combination of losses from the IoU and IoP selection heads.
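To make the selection mechanism concrete, here is a minimal PyTorch sketch of the idea: the <SEG> hidden state is projected into the same space as per-proposal mask features, and two heads score each proposal by predicted IoU and IoP. The module name, dimensions, and fusion scheme are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only; shapes, names, and fusion are assumptions.
import torch
import torch.nn as nn

class MaskSelectionModule(nn.Module):
    """Scores SAM mask proposals against the <SEG> token embedding."""

    def __init__(self, seg_dim: int = 4096, mask_dim: int = 256, hidden: int = 256):
        super().__init__()
        # Project the LLaVA <SEG> hidden state into the mask-feature space.
        self.seg_proj = nn.Linear(seg_dim, hidden)
        self.mask_proj = nn.Linear(mask_dim, hidden)
        # Two scoring heads: one regresses IoU with the ground-truth mask,
        # the other IoP (intersection over prediction).
        self.iou_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.iop_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seg_token: torch.Tensor, mask_feats: torch.Tensor):
        # seg_token: (B, seg_dim), the hidden state of the <SEG> token
        # mask_feats: (B, N, mask_dim), one feature vector per SAM proposal
        q = self.seg_proj(seg_token).unsqueeze(1)      # (B, 1, hidden)
        k = self.mask_proj(mask_feats)                 # (B, N, hidden)
        fused = q * k                                  # element-wise fusion per proposal
        iou_scores = self.iou_head(fused).squeeze(-1)  # (B, N)
        iop_scores = self.iop_head(fused).squeeze(-1)  # (B, N)
        return iou_scores, iop_scores

# At inference, the proposal with the highest predicted score is selected:
# iou_scores, iop_scores = module(seg_token, mask_feats)
# best = iou_scores.argmax(dim=-1)
```

Each head's prediction can be supervised with a regression loss against the true IoU/IoP of each proposal, which matches the paper's description of combining losses from the two selection heads.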
The LLM-Seg40K dataset contains 14K images, split into training, validation, and test sets, and provides high-quality question-segmentation pairs for model training and validation. It is built with an automatic data generation pipeline that leverages GPT-4 to process existing semantic segmentation datasets into examples tailored for reasoning segmentation; a sketch of this pipeline idea follows below. Experiments show that LLM-Seg achieves competitive performance compared with existing methods. The code, models, and dataset are available at https://github.com/wangjunchi/LLMSeg.
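The following is a minimal sketch of how such an automatic pipeline could query an LLM to turn a semantic segmentation annotation into a reasoning-style question. It assumes the OpenAI chat API and a hypothetical prompt; the authors' actual prompts, model choice, and filtering steps are not reproduced here.

```python
# Illustrative sketch of automatic question generation; the prompt and
# helper function are hypothetical, not the paper's exact pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_reasoning_question(object_names: list[str], target: str) -> str:
    """Turn a segmentation annotation (object classes) into an implicit,
    reasoning-style question whose answer is the target object."""
    prompt = (
        "An image contains the following objects: "
        + ", ".join(object_names)
        + f". Write one question that requires reasoning to identify the "
        f"'{target}' without naming it directly."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example with an annotation listing three classes and a chosen target:
# generate_reasoning_question(["kettle", "mug", "stove"], "kettle")
# -> e.g., "Which object would you use to boil water for tea?"
```

Pairing each generated question with the target object's ground-truth mask yields the kind of question-segmentation pairs the dataset provides.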