This paper investigates the scalability of diffusion-based text-to-image (T2I) models, focusing on both the denoising backbone and the training data. The authors conduct extensive ablation studies to understand how architectural choices and dataset properties affect model performance. They find that the location and amount of cross-attention in UNet designs significantly impact performance, and that increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel counts. They also identify an efficient UNet variant that is 45% smaller and 28% faster than SDXL's UNet while achieving comparable performance. On the data side, they show that the quality and diversity of the training set matter more than its size, and that increasing caption density and diversity improves both text-image alignment and learning efficiency. The paper also provides scaling functions that predict text-image alignment performance from model size, compute, and dataset size. These findings contribute to the development of more efficient and effective T2I models.
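To make the scaling-function idea concrete, here is a minimal sketch of how such a predictor could be fitted. It assumes a simple power-law form, `loss = a * compute**(-b)`, which is a common choice in scaling-law studies; the paper's actual functional form, metric, and coefficients are not reproduced here, and the data below is synthetic.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by linear regression in log-log space."""
    slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(log_a), -slope  # a, b

# Synthetic data for illustration only (not from the paper):
# alignment loss measured at four training-compute budgets.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 50.0 * compute ** (-0.12)

a, b = fit_power_law(compute, loss)

# Extrapolate the fitted curve to a larger, unseen compute budget.
predicted = a * (1e22) ** (-b)
```

Fitting in log-log space turns the power law into a straight line, so a single least-squares fit recovers both coefficients; the fitted curve can then be extrapolated to budgets beyond those trained.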