Understanding GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

GenArtist is a unified image generation and editing system in which a multimodal large language model (MLLM) acts as an AI agent that coordinates and manages the entire process. The agent decomposes complex tasks into simpler sub-problems, constructs a planning tree with step-by-step verification, and carries out image generation or editing operations by invoking external tools. It then verifies the generated results and corrects them where needed. A comprehensive range of existing models is integrated into a tool library, from which the agent selects and executes the most suitable tool for each task. Position information is incorporated into the tool library, and auxiliary tools supply any missing position-related inputs. The system's core mechanisms are thus the decomposition of intricate text prompts, the planning tree with step-by-step verification, and position-aware tool execution. Extensive experiments show that GenArtist achieves state-of-the-art performance in both image generation and editing, outperforming existing models such as SDXL and DALL-E 3: it improves on DALL-E 3 by over 7% on T2I-CompBench and reaches state-of-the-art performance on the image editing benchmark MagicBrush. It excels at complex editing tasks and provides a unified interface for a variety of generation and editing tasks, which makes it a valuable solution for unified image generation and editing.
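
The three mechanisms named in the summary (decomposition, a planning tree with step-by-step verification, and position-aware tool execution) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not GenArtist's actual code or API: the "image" is a plain dictionary, the tools (`sloppy_generate`, `careful_generate`, `add_object`), the auxiliary `propose_box`, and the `verify` check are hypothetical stand-ins for real models, and the tree is much simpler than the one an MLLM agent would build.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Toy "image": object name -> (attribute, bounding box). All names are illustrative.
Box = Tuple[int, int, int, int]
Image = Dict[str, Tuple[str, Box]]


@dataclass
class Tool:
    run: Callable[[Image, str, str, Optional[Box]], Image]
    needs_box: bool = False  # position-aware tools require a bounding box


@dataclass
class PlanNode:
    obj: str                  # object this sub-problem is about
    attr: str                 # attribute the object must end up with
    tools: List[str]          # candidate tools, in order of preference
    children: List["PlanNode"] = field(default_factory=list)


def sloppy_generate(image, obj, attr, box):
    # Simulates a generator that draws the object but misses its attribute.
    return {**image, obj: ("unspecified", (0, 0, 64, 64))}


def careful_generate(image, obj, attr, box):
    return {**image, obj: (attr, (0, 0, 64, 64))}


def add_object(image, obj, attr, box):
    # Position-aware stand-in: places the new object at the supplied box.
    return {**image, obj: (attr, box)}


def propose_box(image: Image, obj: str) -> Box:
    # Auxiliary-tool stand-in: put the new object to the right of existing ones.
    right_edge = max((b[2] for _, b in image.values()), default=0)
    return (right_edge, 0, right_edge + 64, 64)


def verify(image: Image, obj: str, attr: str) -> bool:
    # Stand-in for the agent's verification of a single sub-requirement.
    return obj in image and image[obj][0] == attr


LIBRARY = {
    "sloppy_generate": Tool(sloppy_generate),
    "careful_generate": Tool(careful_generate),
    "add_object": Tool(add_object, needs_box=True),
}


def execute(node: PlanNode, image: Image, trace: List[str]) -> Optional[Image]:
    """Try each candidate tool; keep the first verified result, then recurse."""
    for name in node.tools:
        tool = LIBRARY[name]
        # Fill in the missing position input only for tools that need one.
        box = propose_box(image, node.obj) if tool.needs_box else None
        result = tool.run(image, node.obj, node.attr, box)
        passed = verify(result, node.obj, node.attr)
        trace.append(f"{name}:{'ok' if passed else 'rejected'}")
        if not passed:
            continue  # fall back to the next candidate tool
        current: Optional[Image] = result
        for child in node.children:
            current = execute(child, current, trace)
            if current is None:
                break
        if current is not None:
            return current
    return None


# Decomposition of "a black cat and a red hat" into two sub-problems.
plan = PlanNode("cat", "black", ["sloppy_generate", "careful_generate"],
                children=[PlanNode("hat", "red", ["add_object"])])
trace: List[str] = []
final = execute(plan, {}, trace)
print(trace)
print(final)
```

In this sketch the first generator is rejected at verification, the agent falls back to the second candidate, and the editing step receives its bounding box from the auxiliary helper rather than from the prompt. A real system would replace the dictionary checks with MLLM-based verification and the stand-in functions with actual generation, editing, and detection models.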

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

8 Jul 2024 | Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu