On the Scalability of Diffusion-based Text-to-Image Generation

3 Apr 2024 | Hao Li¹², Yang Zou¹², Ying Wang¹², Orchid Majumder¹², Yusheng Xie¹², R. Manmatha¹, Ashwin Swaminathan¹², Zhuowen Tu¹, Stefano Ermon¹, Stefano Soatto¹
This paper investigates the scalability of diffusion-based text-to-image (T2I) generation models, focusing on the effects of scaling both the denoising backbone and the training dataset. The study aims to understand which aspects of the model are most effective and efficient to scale, how to scale the dataset properly, and the scaling law relating model size, dataset size, and compute. The authors perform extensive ablations on both the denoising backbone and the training set, training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images.

On the model side, they find that the location and amount of cross-attention distinguishes the performance of existing UNet designs, and that among the designs compared, SDXL's UNet is significantly better than the others in both performance and training efficiency. Comparing against Transformer backbones, they find that increasing the number of transformer blocks (depth) in the UNet is more parameter-efficient for improving text-image alignment than increasing the channel count (width); a rough parameter count illustrating why appears below. Based on these ablations, they identify an efficient UNet variant that is 45% smaller and 28% faster than SDXL's UNet.

On the data side, they show that the quality and diversity of the training set matter more than raw dataset size: increasing caption density and diversity improves both text-image alignment and learning efficiency, and properly scaling the training data with synthetic captions improves image quality and speeds up convergence (a caption-sampling sketch also follows). They also find that while data scaling can significantly improve a small model's performance, a better-designed model has a higher performance upper bound.
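To make the depth-versus-width trade-off concrete, here is a minimal parameter-counting sketch. The context dimension, FFN multiplier, and layer shapes are illustrative assumptions rather than the paper's exact architecture; the point is only that per-block parameters grow roughly quadratically with channel width but linearly with depth.

```python
# Approximate parameter count for a stack of transformer blocks inside a
# UNet stage. Biases, norms, and projection convs are ignored; all
# dimensions below are assumed for illustration, not taken from the paper.

def block_params(d_model: int, d_context: int = 2048, ffn_mult: int = 4) -> int:
    """Parameters of one transformer block with self- and cross-attention."""
    self_attn = 4 * d_model * d_model                             # Wq, Wk, Wv, Wo
    cross_attn = 2 * d_model * d_model + 2 * d_context * d_model  # Wq, Wo + Wk, Wv from text
    ffn = 2 * ffn_mult * d_model * d_model                        # two linear layers
    return self_attn + cross_attn + ffn

def stack_params(depth: int, d_model: int) -> int:
    return depth * block_params(d_model)

base = stack_params(depth=4, d_model=1024)
deeper = stack_params(depth=8, d_model=1024)  # 2x depth
wider = stack_params(depth=4, d_model=2048)   # 2x width

print(f"base:   {base / 1e6:6.1f}M")
print(f"deeper: {deeper / 1e6:6.1f}M  ({deeper / base:.1f}x base)")
print(f"wider:  {wider / 1e6:6.1f}M  ({wider / base:.1f}x base)")
```

Doubling the depth doubles the parameter count, while doubling the width nearly quadruples it, so depth offers a finer-grained way to add capacity, consistent with the paper's finding that depth scaling is the more parameter-efficient knob for alignment.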
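The caption-enhancement result suggests a per-sample caption-sampling scheme along the following lines. This is a hypothetical sketch in the spirit of the paper's synthetic-caption experiments; the 0.9 mixing ratio and the multi-caption setup are assumptions, not the paper's reported configuration.

```python
import random

# Hypothetical caption sampling: each image keeps its original alt-text
# plus several dense synthetic captions from a captioning model, and one
# caption is drawn per training step. p_synthetic = 0.9 is an assumed
# mixing ratio, not a value reported in the paper.
def pick_caption(alt_text: str, synthetic_captions: list[str],
                 p_synthetic: float = 0.9) -> str:
    """Sample one caption for a training example."""
    if synthetic_captions and random.random() < p_synthetic:
        # Dense synthetic captions raise caption density and diversity.
        return random.choice(synthetic_captions)
    # Occasionally fall back to the noisy original alt-text.
    return alt_text

caption = pick_caption(
    alt_text="dog photo",
    synthetic_captions=[
        "A golden retriever puppy lying on green grass in afternoon sun.",
        "Close-up of a young golden retriever resting on a lawn outdoors.",
    ],
)
print(caption)
```

Sampling among several synthetic captions per image, rather than fixing a single one, is one way to supply the caption diversity the paper credits for better alignment and faster convergence.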
Finally, they provide scaling functions that predict text-image alignment performance as a function of model size, compute, and dataset size, exploring the relationship between performance and model FLOPs and between performance and data size, and fitting numerical scaling laws for SDXL variants and SD2. The results show that larger models are more sample-efficient, while smaller models are more compute-efficient. Evaluating models at low resolution further shows that most of the composition capability is developed at low resolution, which makes it possible to assess a model's performance during the early, low-resolution stage of training.

The study concludes that these findings can benefit the community in pursuing more scaling-efficient models. As a closing illustration, the sketch below shows the generic recipe for fitting such a scaling curve.
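The paper's fitted coefficients are not reproduced here; the power-law form and the (compute, score) observations below are made up purely for illustration.

```python
# Generic power-law fit of the kind used for scaling laws
# (alignment score vs. training compute). The data points and the
# functional form are illustrative assumptions, not the paper's
# measurements or coefficients.
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, a, b):
    """score = a * C^b, the simplest scaling-law ansatz."""
    return a * np.power(c, b)

# Hypothetical (compute, alignment score) observations.
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
score = np.array([0.62, 0.66, 0.70, 0.73, 0.76, 0.78])

(a, b), _ = curve_fit(power_law, compute, score, p0=(0.6, 0.1))
print(f"fit: score ~= {a:.3f} * C^{b:.3f}")

# Extrapolate cautiously: bounded alignment metrics must eventually
# saturate, so predictions far beyond the measured range are unreliable.
print(f"predicted score at C=128: {power_law(128.0, a, b):.3f}")
```

In practice one would fit a separate curve per model family (e.g. SDXL variants vs. SD2), as the paper does when reporting its numerical scaling laws.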