TextCrafter: Your Text Encoder Can be Image Quality Controller
This paper introduces TextCrafter, a novel framework that enhances the pre-trained text encoder in text-to-image diffusion models to improve image quality and text-image alignment. Existing models often struggle to generate images that match the input prompt, forcing users to rerun generation with reworded prompts until the result is satisfactory. TextCrafter addresses this by fine-tuning the text encoder with reward functions that score image quality and text-image alignment. Because the method needs only text prompts rather than paired text-image data, it is efficient and flexible to train.
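To make the training setup concrete, here is a minimal, self-contained sketch of reward-driven text-encoder fine-tuning. All components (ToyTextEncoder, ToyGenerator, ToyRewardModel) are illustrative stand-ins, not the paper's actual architecture; the sketch only shows the key idea that gradients from an image-level reward flow back into the text encoder while the generator and reward model stay frozen, and that the training data is text-only.

```python
# Toy illustration of reward-based text-encoder fine-tuning.
# ToyTextEncoder / ToyGenerator / ToyRewardModel are hypothetical stand-ins
# for the text encoder, frozen diffusion backbone, and frozen reward model.
import torch
import torch.nn as nn

class ToyTextEncoder(nn.Module):
    """Stand-in for a CLIP-style text encoder (the only trainable part)."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)
    def forward(self, token_ids):
        return self.proj(self.embed(token_ids).mean(dim=1))

class ToyGenerator(nn.Module):
    """Stand-in for the frozen generative backbone (UNet + decoder)."""
    def __init__(self, dim=64, img=8):
        super().__init__()
        self.net = nn.Linear(dim, 3 * img * img)
        self.img = img
    def forward(self, text_emb):
        return self.net(text_emb).view(-1, 3, self.img, self.img)

class ToyRewardModel(nn.Module):
    """Stand-in for a frozen reward model scoring quality/alignment."""
    def __init__(self, img=8):
        super().__init__()
        self.score = nn.Linear(3 * img * img, 1)
    def forward(self, images):
        return self.score(images.flatten(1)).squeeze(-1)

text_encoder = ToyTextEncoder()
generator = ToyGenerator()
reward_model = ToyRewardModel()
for p in generator.parameters():      # backbone stays frozen
    p.requires_grad_(False)
for p in reward_model.parameters():   # reward model stays frozen
    p.requires_grad_(False)

opt = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)
for step in range(100):
    prompts = torch.randint(0, 1000, (4, 16))  # text-only training batch
    images = generator(text_encoder(prompts))
    loss = -reward_model(images).mean()        # maximize the reward
    opt.zero_grad()
    loss.backward()   # gradients reach only the text encoder
    opt.step()
```

In the real system, backpropagating a reward through a multi-step diffusion sampler is considerably more involved than this single forward pass suggests; the sketch abstracts that away to isolate which module is being updated.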
The authors demonstrate that fine-tuning the text encoder leads to significant improvements in both quantitative benchmarks and human assessments. They also show that the technique enables controllable image generation by interpolating between different fine-tuned text encoders. Additionally, TextCrafter is orthogonal to UNet fine-tuning, so the two can be combined to further enhance generative quality.
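The interpolation idea can be sketched as a simple weighted blend of two encoders' weights. The function and argument names below (interpolate_encoders, encoder_a, encoder_b, alpha) are hypothetical, assuming both encoders share the same architecture; the paper's actual interpolation scheme may differ.

```python
# Hypothetical sketch: blend the weights of two fine-tuned text encoders.
# encoder_a / encoder_b must share an architecture; alpha in [0, 1]
# slides the output encoder between the two fine-tuned behaviors.
import copy

def interpolate_encoders(encoder_a, encoder_b, alpha):
    """Return a new encoder whose weights are (1 - alpha) * A + alpha * B."""
    blended = copy.deepcopy(encoder_a)
    state_a, state_b = encoder_a.state_dict(), encoder_b.state_dict()
    blended.load_state_dict({
        k: (1.0 - alpha) * state_a[k] + alpha * state_b[k]
        for k in state_a
    })
    return blended
```

For instance, blending an encoder fine-tuned for aesthetics with one fine-tuned for alignment at alpha=0.5 would trade off between the two objectives at inference time, with no further training.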
Experiments on benchmarks including Parti-Prompts and HPSv2 confirm that TextCrafter outperforms pre-trained models, reinforcement learning-based approaches, and prompt engineering. The method is also shown to be effective in downstream applications such as ControlNet-conditioned generation and image inpainting.
Overall, TextCrafter provides a stable and powerful framework for improving text-to-image generation, offering better image quality, text-image alignment, and controllability.