3 Jan 2024 | Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
**aMUSEd: An Open-Source, Lightweight Masked Image Model (MIM)**
**Authors:** Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
**Affiliations:** Hugging Face, Stability AI
**Abstract:**
This paper introduces aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. aMUSEd, with only 10% of MUSE's parameters, focuses on fast image generation. The authors argue that MIM is underexplored compared to latent diffusion models, which are currently the prevailing approach for text-to-image generation. MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with just one image. The paper aims to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code and checkpoints for models producing images at 256x256 and 512x512 resolutions.
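Since the abstract points to released checkpoints at both resolutions, here is a minimal usage sketch. It assumes the checkpoints are hosted on the Hugging Face Hub under the `amused/amused-256` and `amused/amused-512` repository names and that a recent `diffusers` release exposes an `AmusedPipeline`; adjust the names if the published artifacts differ.

```python
# Minimal sketch: generating an image from one of the released aMUSEd checkpoints.
# Assumes diffusers ships AmusedPipeline and that the Hub repo name below is correct.
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained("amused/amused-256", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "a photo of a red panda eating bamboo",
    num_inference_steps=12,  # MIM decodes in far fewer steps than typical diffusion samplers
).images[0]
image.save("panda_256.png")
```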
**Introduction:**
The paper discusses the advancements in diffusion-based text-to-image generative models and highlights the benefits of MIM over diffusion models, including reduced inference steps and better interpretability. aMUSEd is designed to be efficient and lightweight, using a CLIP-L/14 text encoder, SDXL-style micro-conditioning, and a U-ViT backbone. The design focuses on reduced complexity and computational requirements to facilitate broader use and experimentation.
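To make the SDXL-style micro-conditioning concrete, the sketch below shows the common pattern: each scalar condition (for example original image height/width and crop offsets) is mapped to a sinusoidal embedding, concatenated, projected, and added to the pooled text conditioning. This is an illustrative reconstruction, not the authors' code; the exact set of conditioning scalars and the projection layout are assumptions.

```python
# Illustrative SDXL-style micro-conditioning (not the released implementation).
import math
import torch

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Transformer-style sinusoidal embedding of a batch of scalars, shape (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

def micro_conditioning(pooled_text: torch.Tensor, scalars: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """pooled_text: (B, D); scalars: (B, K), e.g. [orig_h, orig_w, crop_top, crop_left];
    proj maps K*256 -> D so the micro-conditions can be added to the text conditioning."""
    embs = torch.cat([sinusoidal_embedding(scalars[:, i]) for i in range(scalars.shape[1])], dim=-1)
    return pooled_text + proj(embs)
```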
**Related Work:**
The paper reviews token-based image generation and few-step diffusion models, highlighting the advantages and limitations of each approach. It also discusses the interpretability of text-to-image models and the use of quantization techniques.
**Method:**
The method section details the training and inference pipelines of aMUSEd, including the use of VQ-GAN, text conditioning, and U-ViT. The paper explains the masking schedule and micro-conditioning techniques used.
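The inference pipeline can be illustrated with a short decoding loop in the spirit of MaskGIT/MUSE: the transformer predicts logits for every masked VQ token, the most confident predictions are kept, and the remainder are re-masked, with the still-masked fraction following a cosine schedule. This is a hedged sketch, not the released sampler; `model` is a stand-in for the conditioned U-ViT.

```python
# Sketch of MIM iterative decoding with a cosine masking schedule (illustrative only).
import math
import torch

def mim_decode(model, num_tokens: int, mask_id: int, steps: int = 12, cond=None):
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, cond)                    # (1, num_tokens, vocab_size)
        confidence, prediction = logits.softmax(-1).max(-1)
        still_masked = tokens == mask_id
        tokens = torch.where(still_masked, prediction, tokens)   # fill in masked slots
        # Fraction of tokens to leave masked after this step (cosine schedule).
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        num_to_mask = int(num_tokens * mask_ratio)
        if num_to_mask == 0:
            break
        # Re-mask the least confident of the newly filled positions; already-kept tokens stay fixed.
        confidence = confidence.masked_fill(~still_masked, float("inf"))
        remask = confidence.topk(num_to_mask, largest=False).indices
        tokens[0, remask[0]] = mask_id
    return tokens  # VQ indices, to be decoded to pixels by the VQ-GAN decoder
```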
**Experimental Setup:**
The pre-training and fine-tuning processes are described, including data preparation, training details, and masking rate sampling. The paper also discusses the use of 8-bit quantization and the application of StyleDrop for fine-tuning.
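A hedged sketch of the training-side masking-rate sampling: one common cosine-schedule choice is to draw a rate r = cos(pi/2 * u) with u uniform on (0, 1) per example and mask that fraction of VQ tokens, training the model to recover the originals at masked positions. The exact sampling distribution used by the authors may differ.

```python
# Illustrative cosine-schedule masking-rate sampling for MIM training.
import math
import torch

def mask_tokens(tokens: torch.Tensor, mask_id: int):
    """tokens: (B, N) integer VQ indices. Returns the masked sequence and the boolean mask."""
    batch, num_tokens = tokens.shape
    u = torch.rand(batch)
    rate = torch.cos(math.pi / 2 * u)                    # per-example masking rate in (0, 1]
    num_masked = (rate * num_tokens).long().clamp(min=1)
    scores = torch.rand(batch, num_tokens)
    # Mask the positions with the lowest random scores, exactly num_masked per example.
    threshold = scores.sort(dim=1).values.gather(1, (num_masked - 1).unsqueeze(1))
    mask = scores <= threshold
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens), mask
```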
**Results:**
The paper presents results on inference speed, model quality, and task transfer, including zero-shot image variation and inpainting. The authors show that aMUSEd's inference speed is competitive with distilled diffusion-based models and that it handles these additional tasks without task-specific training.
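A hedged usage sketch for the task-transfer results. It assumes `diffusers` exposes img2img and inpainting variants named `AmusedImg2ImgPipeline` and `AmusedInpaintPipeline` and that the Hub repo name below matches the released 512x512 checkpoint; the image URLs are placeholders.

```python
# Sketch: zero-shot image variation and inpainting with the aMUSEd pipelines (names assumed).
import torch
from diffusers import AmusedImg2ImgPipeline, AmusedInpaintPipeline
from diffusers.utils import load_image

img2img = AmusedImg2ImgPipeline.from_pretrained("amused/amused-512", torch_dtype=torch.float16).to("cuda")
init = load_image("https://example.com/input.png").resize((512, 512))        # placeholder URL
variation = img2img("a watercolor painting of the same scene", image=init, strength=0.7).images[0]

inpaint = AmusedInpaintPipeline.from_pretrained("amused/amused-512", torch_dtype=torch.float16).to("cuda")
mask = load_image("https://example.com/mask.png").resize((512, 512))         # white = region to regenerate
filled = inpaint("a small wooden cabin", image=init, mask_image=mask).images[0]
```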
**Ethics and Safety:**
The paper addresses ethical considerations, such as filtering out images with high watermark probabilities or NSFW content.
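As an illustration of that filtering step, the sketch below drops rows of a LAION-style metadata table whose predicted watermark or NSFW probability exceeds a threshold. The column names (`pwatermark`, `punsafe`) and thresholds are assumptions for illustration, not values taken from the paper.

```python
# Illustrative data filtering by watermark / NSFW probability (column names assumed).
import pandas as pd

def filter_metadata(df: pd.DataFrame,
                    max_watermark_prob: float = 0.5,
                    max_nsfw_prob: float = 0.5) -> pd.DataFrame:
    """Drop rows whose predicted watermark or NSFW probability is too high."""
    keep = (df["pwatermark"] < max_watermark_prob) & (df["punsafe"] < max_nsfw_prob)
    return df[keep]
```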
**Conclusion:**
The authors conclude that aMUSEd is a lightweight and efficient alternative to diffusion models, demonstrating competitive performance in zero-shot image generation and fine-tuning capabilities. They hope that open-sourcing the model will promote further research in masked image modeling.