The paper introduces the Draw-and-Understand project, which aims to enhance human-AI interaction through multimodal large language models (MLLMs). The project comprises a new model, SPHINX-V; a multi-domain dataset, MDVP-Data; and a benchmark, MDVP-Bench, for visual prompting. SPHINX-V is designed to understand pixel-level referring expressions by combining a vision encoder, a visual prompt encoder, and an LLM. MDVP-Data contains 1.6 million unique image-visual prompt-text instruction-following samples across various image types, while MDVP-Bench evaluates models' ability to understand visual prompts through tasks such as captioning, inter-relationship analysis, and complex reasoning. The paper demonstrates that SPHINX-V outperforms existing models on pixel-level understanding tasks, showcasing its effectiveness and robustness.
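To make the described architecture concrete, below is a minimal, hypothetical sketch of a SPHINX-V-style pipeline: an image encoder and a visual prompt encoder produce embeddings that are projected into the LLM's token space and fused with the text tokens before decoding. All module names, dimensions, and the toy stand-in "LLM" here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class VisualPromptEncoder(nn.Module):
    """Embeds pixel-level prompts given as normalized (x1, y1, x2, y2) boxes;
    a point prompt can be passed as a degenerate box with x1 == x2, y1 == y2."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (batch, num_prompts, 4) in [0, 1]
        return self.mlp(boxes)


class SphinxVSketch(nn.Module):
    """Illustrative fusion of vision features, visual prompt embeddings,
    and text tokens into one multimodal sequence for a language model."""

    def __init__(self, llm_dim: int = 256, vision_dim: int = 128, vocab: int = 1000):
        super().__init__()
        # Stand-in vision encoder: a patchifying conv instead of a real ViT.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, vision_dim, kernel_size=16, stride=16), nn.GELU()
        )
        self.prompt_encoder = VisualPromptEncoder(llm_dim)
        # Projection aligning vision features with the LLM embedding space.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Stand-in "LLM": a tiny Transformer over the fused sequence.
        self.tok_embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image, boxes, text_ids):
        vis = self.vision_encoder(image).flatten(2).transpose(1, 2)  # (b, patches, vision_dim)
        vis = self.vision_proj(vis)                                  # (b, patches, llm_dim)
        vp = self.prompt_encoder(boxes)                              # (b, prompts, llm_dim)
        txt = self.tok_embed(text_ids)                               # (b, tokens, llm_dim)
        fused = torch.cat([vis, vp, txt], dim=1)                     # multimodal sequence
        return self.lm_head(self.llm(fused))                         # per-position logits


model = SphinxVSketch()
logits = model(
    torch.randn(1, 3, 224, 224),               # image
    torch.tensor([[[0.1, 0.2, 0.4, 0.5]]]),    # one box prompt, normalized coords
    torch.randint(0, 1000, (1, 8)),            # tokenized instruction
)
print(logits.shape)  # (1, 196 + 1 + 8, 1000)
```

In practice the vision encoder and LLM would be large pretrained components, with the visual prompt encoder and projection layers trained to align prompt and image embeddings with the language model's token space; the sketch above only illustrates how the three inputs are fused into one sequence.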