1 Apr 2024 | Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li
The Draw-and-Understand project introduces SPHINX-V, a multimodal large language model (MLLM) designed to enhance pixel-level image understanding through visual prompts. Alongside the model, the project presents MDVP-Data, a multi-domain dataset of 1.6 million image-visual prompt-text instruction-following samples, and MDVP-Bench, a comprehensive benchmark for evaluating how well models follow visual prompting instructions. Each dataset sample records detailed attributes of the objects identified by the visual prompts, their relationships with nearby entities, and the surrounding background context, while the benchmark spans point-level and region-level captioning, inter-relationship analysis, and complex reasoning. On these tasks, SPHINX-V delivers significant improvements in detailed pixel-level description and question answering.

Architecturally, SPHINX-V combines a mixed vision encoder, a versatile visual prompt encoder, and an LLM, allowing points, bounding boxes, and free-form shapes to be interpreted jointly with natural-language instructions.
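The summary does not spell out the encoder's internals, so the following is only a minimal sketch of how a visual prompt encoder of this kind might project points and boxes into the LLM's token space; the module names, dimensions, and coordinate convention are illustrative assumptions, not SPHINX-V's actual implementation.

```python
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Illustrative sketch: embeds points and boxes into the LLM token space.

    A point is treated as a degenerate box (x, y, x, y), so both prompt
    types share one 4-d coordinate representation. All sizes and the
    normalization convention here are assumptions, not SPHINX-V's code.
    """

    def __init__(self, llm_dim: int = 4096, hidden_dim: int = 256):
        super().__init__()
        self.coord_mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )
        # Learned type embeddings tell the LLM whether the prompt
        # was a point or a box/free-form region.
        self.type_embed = nn.Embedding(2, llm_dim)  # 0 = point, 1 = box

    def forward(self, coords: torch.Tensor, prompt_type: torch.Tensor) -> torch.Tensor:
        # coords: (N, 4) boxes normalized to [0, 1]; points repeat (x, y, x, y)
        # prompt_type: (N,) integer type ids
        return self.coord_mlp(coords) + self.type_embed(prompt_type)


# Usage: embed one point and one box, then concatenate them with the
# image tokens before feeding the LLM (shown schematically).
encoder = VisualPromptEncoder(llm_dim=4096)
coords = torch.tensor([[0.5, 0.5, 0.5, 0.5],   # point at the image center
                       [0.1, 0.2, 0.4, 0.6]])  # bounding box
types = torch.tensor([0, 1])
prompt_tokens = encoder(coords, types)          # (2, 4096)
image_tokens = torch.randn(1, 576, 4096)        # placeholder vision features
llm_input = torch.cat([image_tokens, prompt_tokens.unsqueeze(0)], dim=1)
print(llm_input.shape)  # torch.Size([1, 578, 4096])
```

At inference, a stroke or scribble would first be collapsed to a box and then embedded the same way, which is consistent with the paper's choice to avoid modeling continuous prompts separately.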
SPHINX-V is trained with a two-stage strategy: pre-training for image-visual prompt-text alignment, followed by supervised fine-tuning on instruction-following data drawn from natural images, OCR images, document images, and multi-panel images. Rather than modeling continuous visual prompts such as strokes and scribbles separately, the model maps them onto bounding boxes at inference time, and noise-based augmentation during training (illustrated in the sketch below) keeps it robust to these free-form inputs.

Evaluated on LLaVA-Bench, Ferret-Bench, and MDVP-Bench, SPHINX-V outperforms existing visual prompting models across a range of tasks, including regional optical character recognition and detailed region captioning. These results underscore both its pixel-level understanding and its promise as a foundation for future work on intelligent visual interaction systems.
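As a concrete illustration of the free-form-to-box conversion with noise augmentation, here is a small sketch; the noise magnitude and the exact jitter scheme are assumptions for illustration and are not reported in the summary.

```python
import random

def scribble_to_noisy_box(stroke, noise_ratio=0.1, img_w=1.0, img_h=1.0):
    """Collapse a free-form stroke into a bounding box and jitter it.

    `stroke` is a list of (x, y) points normalized to [0, 1]. The jitter
    magnitude `noise_ratio` (a fraction of the box size) is an illustrative
    choice, not a value from the paper.
    """
    xs, ys = zip(*stroke)
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    bw, bh = x2 - x1, y2 - y1

    def jitter(v, scale):
        return v + random.uniform(-noise_ratio, noise_ratio) * scale

    # Perturb each corner independently, then clamp to the image bounds.
    x1, x2 = sorted((max(0.0, min(img_w, jitter(x1, bw))),
                     max(0.0, min(img_w, jitter(x2, bw)))))
    y1, y2 = sorted((max(0.0, min(img_h, jitter(y1, bh))),
                     max(0.0, min(img_h, jitter(y2, bh)))))
    return (x1, y1, x2, y2)


# Example: a short scribble over the upper-left quadrant of the image.
stroke = [(0.12, 0.20), (0.18, 0.22), (0.25, 0.30), (0.31, 0.28)]
print(scribble_to_noisy_box(stroke))
```

Scaling the jitter by the box's own width and height (rather than a fixed pixel offset) keeps the perturbation scale-invariant, which is one plausible way such augmentation could be realized.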