27 Mar 2024 | Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang
VersaT2I is a versatile training framework that enhances text-to-image (T2I) models by combining multiple rewards from different quality aspects. The framework decomposes image quality into four aspects: aesthetics, text-image alignment, geometry, and low-level quality. For each aspect, high-quality images generated by the model are used as training data to fine-tune the T2I model using Low-Rank Adaptation (LoRA). A gating function is introduced to combine these aspects, avoiding conflicts between different quality metrics. The method is easy to extend, does not require manual annotation, reinforcement learning, or changes to the model architecture. Extensive experiments show that VersaT2I outperforms baseline methods across various quality criteria.
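The per-aspect self-training loop can be sketched as best-of-N selection: sample several images per prompt, score them with an aspect-specific reward model, and keep the top-scoring images as LoRA fine-tuning data. The `generate` and `score` callables below are illustrative placeholders, not VersaT2I's actual API.

```python
# Sketch of the single-reward data-selection step. `generate(prompt)` is a
# stand-in for the T2I model's sampler and `score(image)` for one of the four
# aspect-specific reward models (aesthetics, alignment, geometry, low-level).

def select_finetune_data(prompts, generate, score, n_samples=8, top_k=1):
    """For each prompt, sample n_samples candidates from the current model
    and keep the top_k by the aspect reward as (prompt, image) LoRA pairs."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=score, reverse=True)
        dataset.extend((prompt, img) for img in ranked[:top_k])
    return dataset
```

Repeating this selection independently for each quality aspect yields one LoRA training set per aspect, with no manual annotation involved.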
The framework includes a single reward training stage and a multi-reward combination stage. In the single reward training stage, the model is fine-tuned using LoRA with the best-scoring images for each quality aspect. In the multi-reward combination stage, a Mixture of LoRA (MoL) method is introduced, which combines different LoRA models trained for each quality aspect. This method uses a gating function to automatically determine the weight of different LoRAs, preventing conflicts between them. The MoL method improves the overall performance of the model by effectively combining the strengths of different LoRA models.
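The Mixture of LoRA (MoL) combination can be illustrated as a gated sum of low-rank updates on top of a frozen base weight. The single linear gate `W_gate` below is an assumed stand-in for the learned gating function; the paper's exact parameterization may differ.

```python
import numpy as np

# Minimal MoL sketch: each expert i contributes a rank-r LoRA update B_i A_i,
# and a softmax gate over the layer input x weights the experts, so conflicting
# aspect-specific LoRAs are blended rather than naively summed.

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mol_forward(x, W0, loras, W_gate, alpha=1.0):
    """y = W0 x + alpha * sum_i g_i(x) * B_i A_i x, with g = softmax(W_gate x)."""
    gates = softmax(W_gate @ x)               # one weight per LoRA expert
    y = W0 @ x                                # frozen base projection
    for g, (A, B) in zip(gates, loras):
        y = y + alpha * g * (B @ (A @ x))     # gated rank-r LoRA update
    return y
```

Because the gate depends on the input, different prompts can lean on different aspect LoRAs instead of applying a single fixed mixture.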
The framework is evaluated on several quality aspects, including aesthetics, text-image alignment, geometry, and low-level quality. The results show that VersaT2I significantly improves T2I model performance across these aspects, outperforming existing methods while remaining efficient, scalable, and versatile, and requiring no extensive human-labeled datasets.