8 Apr 2024 | Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
**Authors:** Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang
**Institutions:** ByteDance; Rutgers University
**Abstract:**
This paper introduces MoMA, an open-vocabulary, training-free personalized image generation model that excels in zero-shot capabilities. MoMA leverages a Multimodal Large Language Model (MLLM) to serve dual roles as a feature extractor and generator, effectively integrating reference images and text prompts to produce detailed and faithful images. The model introduces a novel self-attention shortcut method to efficiently transfer image features to an image diffusion model, enhancing the resemblance of the target object in generated images. MoMA requires only a single reference image and outperforms existing methods in detail fidelity, identity preservation, and prompt faithfulness. The authors commit to open-sourcing their work to provide universal access to these advancements.
**Keywords:** image generation, multimodal, personalization, LLM
- **Text-to-Image Diffusion Models:** These models generate images aligned with textual descriptions by iteratively denoising a sample drawn from a Gaussian distribution. Notable examples include GLIDE, DALL-E 2, Imagen, Stable Diffusion, and eDiff-I (a minimal sketch of the shared denoising objective follows this list).
- **Personalized Image Synthesis:** Previous approaches involve inverting input images into textual representations and using learnable text tokens to denote target concepts. Methods like DreamBooth, Textual Inversion, Custom Diffusion, LoRA, and SVDiff have been developed to optimize this process, but they require extensive resources for per-instance tuning.
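Below is a minimal sketch of the text-conditioned noise-prediction objective these diffusion models share. It is illustrative only: `unet`, `text_encoder`, and `scheduler` are hypothetical stand-ins, not MoMA's released modules, and the argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def denoising_training_step(unet, text_encoder, scheduler, x0, prompt_tokens):
    """One text-conditioned noise-prediction step: corrupt a clean latent x0
    with Gaussian noise at a random timestep, then train the UNet to recover
    that noise, conditioned on the encoded prompt."""
    t = torch.randint(0, scheduler.num_timesteps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)                 # sample from N(0, I)
    x_t = scheduler.add_noise(x0, noise, t)      # forward process q(x_t | x_0)
    text_emb = text_encoder(prompt_tokens)       # prompt conditioning c
    noise_pred = unet(x_t, t, context=text_emb)  # eps_theta(x_t, t, c)
    return F.mse_loss(noise_pred, noise)         # standard denoising loss
```

At inference, the same network is applied repeatedly to walk a pure-noise sample back to an image consistent with the prompt.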
MoMA is designed to address the limitations of existing methods by:
- **MLLM Feature Extraction:** Utilizing a pre-trained MLLM to extract and modify the reference image's features.
- **Multimodal LLM (MLLM):** Leveraging the strengths of both LLMs and vision transformers to process images and text prompts simultaneously.
- **Multimodal Generative Learning and Diffusion Learning:** Pre-training the multimodal image-feature decoder to compose image features with target prompts and converting contextualized image embeddings to images.
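The following is a hedged sketch of how contextualized subject features produced by such an image-feature decoder could be injected into the diffusion UNet alongside the ordinary text conditioning, via an extra cross-attention branch. The class and argument names are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class SubjectCrossAttention(nn.Module):
    """Adds a trainable image-feature branch next to the text cross-attention."""
    def __init__(self, dim, ctx_dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_k_img = nn.Linear(ctx_dim, dim, bias=False)  # new branch for subject features
        self.to_v_img = nn.Linear(ctx_dim, dim, bias=False)

    def forward(self, hidden, text_ctx, subject_ctx, img_scale=1.0):
        q = self.to_q(hidden)
        # text branch: behaves like the base model's cross-attention
        attn_t = (q @ self.to_k_txt(text_ctx).transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = attn_t @ self.to_v_txt(text_ctx)
        # image branch: attends to contextualized subject features from the MLLM decoder
        attn_i = (q @ self.to_k_img(subject_ctx).transpose(-2, -1) * self.scale).softmax(dim=-1)
        return out + img_scale * (attn_i @ self.to_v_img(subject_ctx))
```

Keeping the text branch unchanged and adding the image branch additively is one way to preserve the base model's prompt faithfulness while conditioning on the reference subject.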
- **Preliminaries:** Introduces text-to-image diffusion models and MLLMs.
- **Methodology:** Details the architecture and training process of MoMA, including the multimodal generative image-feature decoder and self-attention feature transfer.
- **Experiments:** Conducts qualitative and quantitative evaluations, showing superior performance in detail accuracy, background quality, and texture editing.
- **Ablation and Analysis:** Evaluates the effectiveness of the proposed subject-cross-attention modules and the self-attention masking mechanism (see the sketch after this list).
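A rough sketch of the self-attention feature transfer and masking idea referenced above: keys and values cached from a reference-image pass are concatenated into the generation pass's self-attention, and reference tokens outside the subject region receive a large negative bias so only the subject's features are transferred. Tensor shapes and names are assumptions for illustration only.

```python
import torch

def masked_self_attention_transfer(q, k, v, k_ref, v_ref, subject_mask, scale):
    """q, k, v: generation-branch self-attention tensors of shape [B, N, D].
    k_ref, v_ref: keys/values cached from the reference-image pass, [B, N_ref, D].
    subject_mask: [B, N_ref] floats, 1 inside the subject region, 0 elsewhere."""
    k_all = torch.cat([k, k_ref], dim=1)
    v_all = torch.cat([v, v_ref], dim=1)
    logits = q @ k_all.transpose(-2, -1) * scale          # [B, N, N + N_ref]
    # large negative bias on reference tokens outside the subject mask
    ref_bias = (1.0 - subject_mask) * torch.finfo(logits.dtype).min
    bias = torch.cat(
        [torch.zeros(q.shape[0], q.shape[1], k.shape[1], device=q.device, dtype=logits.dtype),
         ref_bias[:, None, :].expand(-1, q.shape[1], -1)],
        dim=-1,
    )
    attn = (logits + bias).softmax(dim=-1)
    return attn @ v_all                                    # [B, N, D]
```

In this sketch the generation tokens always attend to themselves, so the reference features act only as an additional source of subject detail, which is presumably the role the ablation attributes to the masking mechanism.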
MoMA demonstrates superior performance in fast, personalized image generation, supporting recontextualization and texture editing. The model is tuning-free, open-vocabulary, and can be integrated with community models fine-tuned from the same base model, extending its applications.