1 Apr 2024 | Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li
The Draw-and-Understand project introduces SPHINX-V, a multimodal large language model (MLLM) designed to enhance pixel-level image understanding through visual prompts. Alongside the model, the project presents MDVP-Data, a multi-domain dataset of 1.6 million image-visual prompt-text instruction-following samples, and MDVP-Bench, a comprehensive benchmark for evaluating how well models follow visual prompting instructions. Each dataset sample records detailed attributes of the objects identified by the visual prompts, their relationships with nearby entities, and the surrounding background context, while the benchmark spans point-level and region-level captioning, inter-relationship analysis, and complex reasoning. On these tasks, SPHINX-V delivers significant improvements in detailed pixel-level description and question answering.

Architecturally, SPHINX-V combines a mixed vision encoder, a versatile visual prompt encoder, and an LLM, allowing points, bounding boxes, and free-form shapes to be interpreted jointly with natural-language instructions.
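The summary does not spell out the encoder's internals, so the following is only a minimal sketch of how a visual prompt encoder of this kind might project points and boxes into the LLM's token space; the module names, dimensions, and coordinate convention are illustrative assumptions, not SPHINX-V's actual implementation.

```python
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Illustrative sketch: embeds points and boxes into the LLM token space.

    A point is treated as a degenerate box (x, y, x, y), so both prompt
    types share one 4-d coordinate representation. All sizes and the
    normalization convention here are assumptions, not SPHINX-V's code.
    """

    def __init__(self, llm_dim: int = 4096, hidden_dim: int = 256):
        super().__init__()
        self.coord_mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )
        # Learned type embeddings tell the LLM whether the prompt
        # was a point or a box/free-form region.
        self.type_embed = nn.Embedding(2, llm_dim)  # 0 = point, 1 = box

    def forward(self, coords: torch.Tensor, prompt_type: torch.Tensor) -> torch.Tensor:
        # coords: (N, 4) boxes normalized to [0, 1]; points repeat (x, y, x, y)
        # prompt_type: (N,) integer type ids
        return self.coord_mlp(coords) + self.type_embed(prompt_type)


# Usage: embed one point and one box, then concatenate them with the
# image tokens before feeding the LLM (shown schematically).
encoder = VisualPromptEncoder(llm_dim=4096)
coords = torch.tensor([[0.5, 0.5, 0.5, 0.5],   # point at the image center
                       [0.1, 0.2, 0.4, 0.6]])  # bounding box
types = torch.tensor([0, 1])
prompt_tokens = encoder(coords, types)          # (2, 4096)
image_tokens = torch.randn(1, 576, 4096)        # placeholder vision features
llm_input = torch.cat([image_tokens, prompt_tokens.unsqueeze(0)], dim=1)
print(llm_input.shape)  # torch.Size([1, 578, 4096])
```

At inference, a stroke or scribble would first be collapsed to a box and then embedded the same way, which is consistent with the paper's choice to avoid modeling continuous prompts separately.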
SPHINX-V is trained with a two-stage strategy: pre-training for image-visual prompt-text alignment, followed by supervised fine-tuning on instruction-following data drawn from natural images, OCR images, document images, and multi-panel images. Rather than modeling continuous visual prompts such as strokes and scribbles separately, the model maps them onto bounding boxes at inference time, and noise-based augmentation during training (illustrated in the sketch below) keeps it robust to these free-form inputs.

Evaluated on LLaVA-Bench, Ferret-Bench, and MDVP-Bench, SPHINX-V outperforms existing visual prompting models across a range of tasks, including regional optical character recognition and detailed region captioning. These results underscore both its pixel-level understanding and its promise as a foundation for future work on intelligent visual interaction systems.
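As a concrete illustration of the free-form-to-box conversion with noise augmentation, here is a small sketch; the noise magnitude and the exact jitter scheme are assumptions for illustration and are not reported in the summary.

```python
import random

def scribble_to_noisy_box(stroke, noise_ratio=0.1, img_w=1.0, img_h=1.0):
    """Collapse a free-form stroke into a bounding box and jitter it.

    `stroke` is a list of (x, y) points normalized to [0, 1]. The jitter
    magnitude `noise_ratio` (a fraction of the box size) is an illustrative
    choice, not a value from the paper.
    """
    xs, ys = zip(*stroke)
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    bw, bh = x2 - x1, y2 - y1

    def jitter(v, scale):
        return v + random.uniform(-noise_ratio, noise_ratio) * scale

    # Perturb each corner independently, then clamp to the image bounds.
    x1, x2 = sorted((max(0.0, min(img_w, jitter(x1, bw))),
                     max(0.0, min(img_w, jitter(x2, bw)))))
    y1, y2 = sorted((max(0.0, min(img_h, jitter(y1, bh))),
                     max(0.0, min(img_h, jitter(y2, bh)))))
    return (x1, y1, x2, y2)


# Example: a short scribble over the upper-left quadrant of the image.
stroke = [(0.12, 0.20), (0.18, 0.22), (0.25, 0.30), (0.31, 0.28)]
print(scribble_to_noisy_box(stroke))
```

Scaling the jitter by the box's own width and height (rather than a fixed pixel offset) keeps the perturbation scale-invariant, which is one plausible way such augmentation could be realized.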