3 Jan 2024 | Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
**aMUSEd: An Open-Source, Lightweight Masked Image Model (MIM)**
**Authors:** Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
**Affiliations:** Hugging Face, Stability AI
**Abstract:**
This paper introduces aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. aMUSEd, with only 10% of MUSE's parameters, focuses on fast image generation. The authors argue that MIM is underexplored compared to latent diffusion models, which are currently the prevailing approach for text-to-image generation. MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with just one image. The paper aims to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code and checkpoints for models producing images at 256x256 and 512x512 resolutions.
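Since the abstract points to released checkpoints at both resolutions, here is a minimal usage sketch. It assumes the checkpoints are hosted on the Hugging Face Hub under the `amused/amused-256` and `amused/amused-512` repository names and that a recent `diffusers` release exposes an `AmusedPipeline`; adjust the names if the published artifacts differ.

```python
# Minimal sketch: generating an image from one of the released aMUSEd checkpoints.
# Assumes diffusers ships AmusedPipeline and that the Hub repo name below is correct.
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained("amused/amused-256", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "a photo of a red panda eating bamboo",
    num_inference_steps=12,  # MIM decodes in far fewer steps than typical diffusion samplers
).images[0]
image.save("panda_256.png")
```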
**Introduction:**
The paper discusses the advancements in diffusion-based text-to-image generative models and highlights the benefits of MIM over diffusion models, including reduced inference steps and better interpretability. aMUSEd is designed to be efficient and lightweight, using a CLIP-L/14 text encoder, SDXL-style micro-conditioning, and a U-ViT backbone. The design focuses on reduced complexity and computational requirements to facilitate broader use and experimentation.
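To make the SDXL-style micro-conditioning concrete, the sketch below shows the common pattern: each scalar condition (for example original image height/width and crop offsets) is mapped to a sinusoidal embedding, concatenated, projected, and added to the pooled text conditioning. This is an illustrative reconstruction, not the authors' code; the exact set of conditioning scalars and the projection layout are assumptions.

```python
# Illustrative SDXL-style micro-conditioning (not the released implementation).
import math
import torch

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Transformer-style sinusoidal embedding of a batch of scalars, shape (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

def micro_conditioning(pooled_text: torch.Tensor, scalars: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """pooled_text: (B, D); scalars: (B, K), e.g. [orig_h, orig_w, crop_top, crop_left];
    proj maps K*256 -> D so the micro-conditions can be added to the text conditioning."""
    embs = torch.cat([sinusoidal_embedding(scalars[:, i]) for i in range(scalars.shape[1])], dim=-1)
    return pooled_text + proj(embs)
```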
**Related Work:**
The paper reviews token-based image generation and few-step diffusion models, highlighting the advantages and limitations of each approach. It also discusses the interpretability of text-to-image models and the use of quantization techniques.
**Method:**
The method section details the training and inference pipelines of aMUSEd, including the use of VQ-GAN, text conditioning, and U-ViT. The paper explains the masking schedule and micro-conditioning techniques used.
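The inference pipeline can be illustrated with a short decoding loop in the spirit of MaskGIT/MUSE: the transformer predicts logits for every masked VQ token, the most confident predictions are kept, and the remainder are re-masked, with the still-masked fraction following a cosine schedule. This is a hedged sketch, not the released sampler; `model` is a stand-in for the conditioned U-ViT.

```python
# Sketch of MIM iterative decoding with a cosine masking schedule (illustrative only).
import math
import torch

def mim_decode(model, num_tokens: int, mask_id: int, steps: int = 12, cond=None):
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, cond)                    # (1, num_tokens, vocab_size)
        confidence, prediction = logits.softmax(-1).max(-1)
        still_masked = tokens == mask_id
        tokens = torch.where(still_masked, prediction, tokens)   # fill in masked slots
        # Fraction of tokens to leave masked after this step (cosine schedule).
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        num_to_mask = int(num_tokens * mask_ratio)
        if num_to_mask == 0:
            break
        # Re-mask the least confident of the newly filled positions; already-kept tokens stay fixed.
        confidence = confidence.masked_fill(~still_masked, float("inf"))
        remask = confidence.topk(num_to_mask, largest=False).indices
        tokens[0, remask[0]] = mask_id
    return tokens  # VQ indices, to be decoded to pixels by the VQ-GAN decoder
```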
**Experimental Setup:**
The pre-training and fine-tuning processes are described, including data preparation, training details, and masking rate sampling. The paper also discusses the use of 8-bit quantization and the application of StyleDrop for fine-tuning.
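A hedged sketch of the training-side masking-rate sampling: one common cosine-schedule choice is to draw a rate r = cos(pi/2 * u) with u uniform on (0, 1) per example and mask that fraction of VQ tokens, training the model to recover the originals at masked positions. The exact sampling distribution used by the authors may differ.

```python
# Illustrative cosine-schedule masking-rate sampling for MIM training.
import math
import torch

def mask_tokens(tokens: torch.Tensor, mask_id: int):
    """tokens: (B, N) integer VQ indices. Returns the masked sequence and the boolean mask."""
    batch, num_tokens = tokens.shape
    u = torch.rand(batch)
    rate = torch.cos(math.pi / 2 * u)                    # per-example masking rate in (0, 1]
    num_masked = (rate * num_tokens).long().clamp(min=1)
    scores = torch.rand(batch, num_tokens)
    # Mask the positions with the lowest random scores, exactly num_masked per example.
    threshold = scores.sort(dim=1).values.gather(1, (num_masked - 1).unsqueeze(1))
    mask = scores <= threshold
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens), mask
```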
**Results:**
The paper presents results on inference speed, model quality, and task transfer, including zero-shot image variation and inpainting. The authors show that aMUSEd's inference speed is competitive with distilled diffusion-based models and that it handles these additional tasks without task-specific training.
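A hedged usage sketch for the task-transfer results. It assumes `diffusers` exposes img2img and inpainting variants named `AmusedImg2ImgPipeline` and `AmusedInpaintPipeline` and that the Hub repo name below matches the released 512x512 checkpoint; the image URLs are placeholders.

```python
# Sketch: zero-shot image variation and inpainting with the aMUSEd pipelines (names assumed).
import torch
from diffusers import AmusedImg2ImgPipeline, AmusedInpaintPipeline
from diffusers.utils import load_image

img2img = AmusedImg2ImgPipeline.from_pretrained("amused/amused-512", torch_dtype=torch.float16).to("cuda")
init = load_image("https://example.com/input.png").resize((512, 512))        # placeholder URL
variation = img2img("a watercolor painting of the same scene", image=init, strength=0.7).images[0]

inpaint = AmusedInpaintPipeline.from_pretrained("amused/amused-512", torch_dtype=torch.float16).to("cuda")
mask = load_image("https://example.com/mask.png").resize((512, 512))         # white = region to regenerate
filled = inpaint("a small wooden cabin", image=init, mask_image=mask).images[0]
```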
**Ethics and Safety:**
The paper addresses ethical considerations, such as filtering out images with high watermark probabilities or NSFW content.
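As an illustration of that filtering step, the sketch below drops rows of a LAION-style metadata table whose predicted watermark or NSFW probability exceeds a threshold. The column names (`pwatermark`, `punsafe`) and thresholds are assumptions for illustration, not values taken from the paper.

```python
# Illustrative data filtering by watermark / NSFW probability (column names assumed).
import pandas as pd

def filter_metadata(df: pd.DataFrame,
                    max_watermark_prob: float = 0.5,
                    max_nsfw_prob: float = 0.5) -> pd.DataFrame:
    """Drop rows whose predicted watermark or NSFW probability is too high."""
    keep = (df["pwatermark"] < max_watermark_prob) & (df["punsafe"] < max_nsfw_prob)
    return df[keep]
```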
**Conclusion:**
The authors conclude that aMUSEd is a lightweight and efficient alternative to diffusion models, demonstrating competitive performance in zero-shot image generation and fine-tuning capabilities. They hope that open-sourcing the model will promote further research in masked image modeling.