MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

8 Apr 2024 | Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, and Xiao Yang
MoMA is a tuning-free, open-vocabulary model for personalized image generation: it produces subject-driven images from a single reference image, with no per-subject fine-tuning. A Multimodal Large Language Model (MLLM) extracts and contextualizes image features from the reference image and the text prompt, and these features condition a diffusion model to produce high-quality, identity-preserving images. A self-attention shortcut additionally transfers fine-grained visual detail from the reference image into the generation process, improving detail accuracy and image fidelity. MoMA is designed as a plug-and-play module compatible with various diffusion models, and it is open-sourced for broader accessibility.

The model excels at recontextualization and texture editing: it can place a subject in new environments or change its texture while preserving the subject's identity. Compared with existing methods, it achieves higher detail fidelity, stronger identity preservation, and better prompt faithfulness.

MoMA is trained on a large dataset using a two-stage pre-training strategy. Evaluations across multiple tasks, subjects, and prompts show superior detail accuracy, background quality, and texture adaptation, and both qualitative and quantitative results confirm that it generates high-quality, realistic images without fine-tuning and with minimal computational overhead. By combining the strengths of MLLMs and diffusion models and releasing the model openly, MoMA supports broader adoption and further development in personalized image generation.
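The summary above describes MoMA's core mechanism at a high level: MLLM-derived subject features are injected into a diffusion UNet as a plug-and-play adapter alongside the usual text conditioning. The sketch below illustrates one common way such decoupled conditioning can be wired up, with a separate key/value branch for image features added next to the frozen text cross-attention. All class names, tensor dimensions, and the additive combination are illustrative assumptions for this sketch, not MoMA's actual implementation.

```python
# Conceptual sketch (not the official MoMA code): a decoupled cross-attention
# adapter that conditions a diffusion UNet block on both text tokens and
# MLLM-derived subject-image tokens. Dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledCrossAttentionAdapter(nn.Module):
    """Adds a second cross-attention branch for subject-image features,
    leaving the base text cross-attention projections untouched."""

    def __init__(self, dim: int, text_ctx_dim: int, img_ctx_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Stand-ins for the base UNet's (frozen) text key/value projections.
        self.to_k_text = nn.Linear(text_ctx_dim, dim, bias=False)
        self.to_v_text = nn.Linear(text_ctx_dim, dim, bias=False)
        # Trainable image branch: keys/values from MLLM subject features.
        self.to_k_img = nn.Linear(img_ctx_dim, dim, bias=False)
        self.to_v_img = nn.Linear(img_ctx_dim, dim, bias=False)
        self.img_scale = 1.0  # strength of the subject conditioning

    def forward(self, x, text_ctx, img_ctx):
        q = self.to_q(x)
        attn_text = F.scaled_dot_product_attention(
            q, self.to_k_text(text_ctx), self.to_v_text(text_ctx))
        attn_img = F.scaled_dot_product_attention(
            q, self.to_k_img(img_ctx), self.to_v_img(img_ctx))
        # Text and image conditioning are combined additively.
        return attn_text + self.img_scale * attn_img


if __name__ == "__main__":
    b, n_lat, n_txt, n_img = 2, 64, 77, 4
    dim, text_ctx_dim, img_ctx_dim = 320, 768, 1024
    adapter = DecoupledCrossAttentionAdapter(dim, text_ctx_dim, img_ctx_dim)
    latents = torch.randn(b, n_lat, dim)              # UNet hidden states
    text_ctx = torch.randn(b, n_txt, text_ctx_dim)    # text-encoder tokens
    img_ctx = torch.randn(b, n_img, img_ctx_dim)      # MLLM subject tokens
    print(adapter(latents, text_ctx, img_ctx).shape)  # torch.Size([2, 64, 320])
```

In an adapter setup along these lines, only the image-branch projections would typically be trained while the base UNet and text branch stay frozen, which is what makes such a module portable across diffusion backbones.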