aMUSEd: An Open MUSE Reproduction

3 Jan 2024 | Suraj Patil, William Berman, Robin Rombach, Patrick von Platen
The paper introduces aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. aMUSEd uses roughly 10% of MUSE's parameters and is designed for fast image generation. Like MUSE, it predicts all masked image tokens in parallel over a fixed number of inference steps. The authors argue that MIM is more efficient than latent diffusion, requiring fewer inference steps and being more interpretable, and that it can be fine-tuned to learn an additional style from only a single image.

aMUSEd has 800 million parameters and pairs a CLIP-L/14 text encoder with SDXL-style micro-conditioning and a U-ViT backbone. The U-ViT backbone removes the need for a separate super-resolution model, allowing the model to be trained directly at 512x512 resolution. The design emphasizes reduced complexity and computational requirements to facilitate broader use and experimentation. The model supports 4-bit and 8-bit quantization, zero-shot inpainting, and single-image style transfer with StyleDrop.
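The parallel masked-token prediction described above follows the MaskGIT/MUSE recipe: every image token starts out masked, and each step commits the highest-confidence predictions while re-masking the rest according to a cosine schedule. The following is a minimal sketch of that decoding loop, not aMUSEd's actual implementation; the token count, vocabulary size, mask id, and the `model(tokens, text_emb)` interface are illustrative assumptions.

```python
import math
import torch

def mim_generate(model, text_emb, num_tokens=1024, steps=12, mask_id=8191):
    """MaskGIT/MUSE-style parallel decoding: all image tokens start masked
    and are filled in over a small, fixed number of steps."""
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, text_emb)               # (1, num_tokens, vocab_size)
        confidence, prediction = logits.softmax(-1).max(-1)
        still_masked = tokens == mask_id
        # Already-committed tokens stay fixed; exclude them from this step's pick.
        confidence = confidence.masked_fill(~still_masked, -1.0)
        # Cosine schedule: the fraction of tokens left masked shrinks each step.
        target_masked = int(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        num_to_unmask = int(still_masked.sum()) - target_masked
        if num_to_unmask <= 0:
            continue
        # Commit the highest-confidence predictions; the rest remain masked.
        idx = confidence.topk(num_to_unmask, dim=-1).indices
        tokens[0, idx[0]] = prediction[0, idx[0]]
    return tokens  # discrete VQ codes; the VQ-GAN decoder maps them back to pixels
```

Because the number of forward passes is fixed and small (on the order of a dozen), this sampling procedure is one of the main sources of the inference-speed advantage the paper claims over latent diffusion samplers.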
The authors release all relevant model weights and source code. The paper also discusses related work, including token-based image generation, few-step diffusion models, and the interpretability of text-to-image models. Comparing aMUSEd against diffusion models and other MIMs, the authors find that it is faster and more efficient. The experimental setup includes pre-training on LAION-2B data, fine-tuning on additional datasets, and evaluation on zero-shot FID, CLIP score, and Inception Score benchmarks; the results show competitive image quality and inference speed. The paper further demonstrates task transfer, including image variation, inpainting, and video generation, and addresses ethics and safety, for example by filtering out training images with high watermark or NSFW probabilities. In conclusion, the paper presents aMUSEd as a lightweight, open-source model for text-to-image generation and demonstrates its efficiency and effectiveness relative to other models. The authors hope that open-sourcing all model weights and code will make future research into masked image modeling for text-to-image generation more accessible.
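Since the weights and code are openly released, the model can be tried directly. A minimal usage sketch with the Hugging Face diffusers integration is shown below; the `AmusedPipeline` class, the `amused/amused-512` checkpoint name, and the generation arguments follow the diffusers documentation at the time of writing and should be verified against the current release.

```python
import torch
from diffusers import AmusedPipeline

# Load the released 512x512 aMUSEd checkpoint in half precision.
pipe = AmusedPipeline.from_pretrained("amused/amused-512", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# MIM needs only a handful of decoding steps compared to typical diffusion samplers.
image = pipe(
    "a cat wearing a knitted sweater, studio photo",
    num_inference_steps=12,
    guidance_scale=10.0,
).images[0]
image.save("amused_sample.png")
```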