VersaT2I: Improving Text-to-Image Models with Versatile Reward


27 Mar 2024 | Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang
**Authors:** Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, and Gaoang Wang

**Institutions:** Zhejiang University; University of Washington; Hong Kong University of Science and Technology (GZ); Fudan University

**Abstract:** Recent text-to-image (T2I) models achieve impressive performance when trained on large-scale, high-quality data, yet they still struggle to produce aesthetically pleasing, geometrically accurate, and faithful images. To address these challenges, the authors propose VersaT2I, a versatile training framework that improves T2I models with multiple rewards. The framework decomposes image quality into four aspects: aesthetics, text-image alignment, geometry, and low-level quality. For each aspect, the highest-quality images generated by the model itself are selected as training data for fine-tuning with Low-Rank Adaptation (LoRA). A gating function then combines the aspect-specific LoRA models, avoiding conflicts between different quality metrics. The method is easy to extend and requires no manual annotation, reinforcement learning, or changes to the model architecture. Extensive experiments demonstrate that VersaT2I outperforms baseline methods across various quality criteria.

**Contributions:**
1. **VersaT2I:** A self-training, model-agnostic framework that combines multiple evaluation models without requiring RL-based optimization.
2. **Self-training method:** Uses the model's own generated data for training, with no additional data requirements (see the selection-loop sketch below).
3. **Mixture of LoRA (MoL):** Combines multiple LoRA models, each trained with a different reward model, to enhance overall image quality.

**Methods:**
- **Diffusion models:** State-of-the-art generative models for high-quality image synthesis serve as the base T2I models.
- **LoRA fine-tuning:** A parameter-efficient fine-tuning method applied per quality aspect.
- **Mixture of Experts (MoE):** Inspired by MoE, a gating function combines the aspect-specific LoRA models to enhance overall performance (see the gating sketch below).

**Experiments:**
- **Implementation details:** Uses GPT-4 for prompt generation, evaluates on Stable Diffusion v2.1 and SDXL, and uses specific training parameters.
- **Reward model selection:** Q-Align for aesthetic assessment, a model that detects geometric features for geometry, a VQA model for text-image alignment, and Q-Instruct for low-level quality.
- **Results:** VersaT2I improves SD v2.1 and SDXL on all evaluation benchmarks, demonstrating better performance in aesthetics, text-image alignment, geometry, and low-level quality.

**Qualitative results:** Examples show improved aesthetics, text faithfulness, and geometric structure in the generated images.

**Ablation study:** The gating balancing loss ensures a more uniform distribution of gate weights across the different quality aspects.

**Limitations and Social Impact:**
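The Mixture of LoRA (MoL) combination and the gating balancing loss mentioned in the **Ablation study** can likewise be sketched in PyTorch. The layer below applies several aspect-specific low-rank adapters on top of one frozen linear layer and mixes their outputs with a learned gate; the exact gate placement and the precise form of the balancing loss in VersaT2I may differ, so `gate_balancing_loss` here is an assumed uniform-usage penalty.

```python
# Minimal PyTorch sketch of a Mixture-of-LoRA (MoL) layer with a gate-balancing penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_experts: int = 4, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)          # frozen pretrained projection
        self.base.weight.requires_grad_(False)
        # One low-rank adapter (down, up) per quality aspect (aesthetics, alignment, ...).
        self.lora_down = nn.ModuleList(nn.Linear(in_dim, rank, bias=False) for _ in range(num_experts))
        self.lora_up = nn.ModuleList(nn.Linear(rank, out_dim, bias=False) for _ in range(num_experts))
        self.gate = nn.Linear(in_dim, num_experts)      # gating function over the adapters

    def forward(self, x: torch.Tensor):
        weights = F.softmax(self.gate(x), dim=-1)                    # (..., num_experts)
        deltas = torch.stack(
            [up(down(x)) for down, up in zip(self.lora_down, self.lora_up)], dim=-1
        )                                                            # (..., out_dim, num_experts)
        mixed = (deltas * weights.unsqueeze(-2)).sum(dim=-1)         # gate-weighted sum of LoRA outputs
        return self.base(x) + mixed, weights


def gate_balancing_loss(weights: torch.Tensor) -> torch.Tensor:
    """Penalize gates that collapse onto a single LoRA so that, on average,
    every quality aspect keeps a roughly uniform share (assumed form)."""
    mean_usage = weights.reshape(-1, weights.shape[-1]).mean(dim=0)
    uniform = torch.full_like(mean_usage, 1.0 / weights.shape[-1])
    return ((mean_usage - uniform) ** 2).sum()


if __name__ == "__main__":
    layer = MoLLinear(in_dim=64, out_dim=64)
    x = torch.randn(2, 16, 64)               # (batch, tokens, dim)
    out, gate_w = layer(x)
    loss = out.pow(2).mean() + 0.01 * gate_balancing_loss(gate_w)
    loss.backward()
    print(out.shape, gate_w.shape)
```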