2024-04-12 | Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, Lin Ma
LaSagnA is a language-based segmentation assistant designed to handle complex queries that involve multiple targets and categories that do not exist in the image. Existing vision large language models (vLLMs) have two main limitations: they cannot process multiple targets in a single query, and they fail to recognize when a queried object is absent from the image. LaSagnA addresses both by introducing a general sequence format for complex queries and by incorporating a semantic segmentation task into the training pipeline, which allows the model to handle such queries and improves its segmentation performance. Its training strategies include sequence augmentation, a random classes list, and keeping the category order aligned with the query, which address incomplete predictions, lengthy input sequences, and inconsistent category names between queries and responses. The model is trained on semantic segmentation datasets such as MS-COCO and ADE20K. On both closed-set and open-set semantic segmentation datasets it achieves results comparable to conventional methods, and it outperforms existing vLLMs on referring segmentation and reasoning segmentation benchmarks. It also performs strongly in zero-shot scenarios. 
The model's ability to handle complex queries is further validated through qualitative results, showing its capability to perform multiple high-level understanding tasks simultaneously. LaSagnA is a significant advancement in the field of vLLM-based segmentation assistants, offering a more effective solution for handling complex queries in semantic segmentation tasks.
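
To make two of the training strategies mentioned above more concrete, the following is a minimal sketch of how a random classes list and query-aligned category order might be applied when building one training sample. It is an illustration under stated assumptions, not the authors' implementation: the function `build_sample`, the prompt wording, the response template, and the `[SEG]` placeholder are all hypothetical, and sequence augmentation is not covered.

```python
import random

# Hypothetical placeholder standing in for a mask token; not taken from the paper.
MASK_TOKEN = "[SEG]"


def build_sample(present, vocabulary, max_classes, rng):
    """Build one (query, response) pair for a semantic-segmentation style query.

    present:     category names that actually appear in the image (assumed unique)
    vocabulary:  full list of dataset category names
    max_classes: cap on how many categories are listed in the query
    rng:         a random.Random instance, passed in for reproducibility
    """
    present_set = set(present)
    absent = [c for c in vocabulary if c not in present_set]

    # Random classes list: keep present categories (up to the cap) and fill the
    # remaining slots with randomly chosen absent ones, so the query stays short
    # while still containing categories that do not exist in the image.
    kept = list(present)[:max_classes]
    n_negative = min(len(absent), max_classes - len(kept))
    queried = kept + rng.sample(absent, n_negative)
    rng.shuffle(queried)

    # Order alignment: answer only for categories that exist, in the order they
    # appear in the query, reusing the exact category names from the query.
    answered = [c for c in queried if c in present_set]

    query = "Segment the following categories if present: " + ", ".join(queried) + "."
    if answered:
        response = " ".join(f"{c}: {MASK_TOKEN}." for c in answered)
    else:
        response = "None of the queried categories appear in the image."
    return query, response, queried, answered


if __name__ == "__main__":
    vocab = ["person", "dog", "car", "tree", "sky", "boat"]
    q, r, _, _ = build_sample(["dog", "sky"], vocab, 4, random.Random(0))
    print(q)
    print(r)
```

In this sketch, absent categories are simply left out of the response, which is one plausible way to express that a queried object does not exist in the image; the actual response format used by LaSagnA may differ.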