TextCraftor is a fine-tuning method that improves the text encoder of text-to-image diffusion models. Rather than replacing the CLIP text encoder with a larger model, TextCraftor fine-tunes it in place, yielding significant improvements in image quality and text-image alignment. The fine-tuning is guided by reward functions, such as aesthetic and text-image alignment scores. The approach is orthogonal to UNet fine-tuning, and the two can be combined to further improve generative quality. TextCraftor also enables controllable image generation by interpolating between different text encoders.

Training requires only prompt data and the reward functions; no paired text-image dataset is needed. Evaluated on public benchmarks and through human assessment, TextCraftor outperforms competing methods, and the fine-tuned encoder carries over to downstream tasks such as image inpainting and ControlNet-style generation. Trained on a large-scale GPU cluster, TextCraftor demonstrates strong generalization, is efficient, and applies to various diffusion models, including SDXL. Overall, it offers a flexible, controllable approach to text-to-image generation, with potential future gains from incorporating style information into the reward functions.
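The encoder-interpolation idea can be sketched as a linear blend of two encoders' weights. The following is a minimal toy illustration, not the paper's implementation: the state dicts, key names, and shapes are hypothetical stand-ins for real encoder parameters.

```python
def interpolate_encoders(state_a, state_b, alpha):
    """Linearly interpolate two text-encoder state dicts.

    alpha = 0.0 keeps encoder A's weights, alpha = 1.0 keeps encoder B's;
    intermediate values blend the two behaviors (e.g. an original CLIP
    encoder and a reward-fine-tuned one), which is the mechanism behind
    interpolation-based controllable generation.
    """
    assert state_a.keys() == state_b.keys()
    return {k: [(1 - alpha) * a + alpha * b
                for a, b in zip(state_a[k], state_b[k])]
            for k in state_a}

# Toy state dicts standing in for real encoder weights (hypothetical):
clip_weights = {"proj.weight": [1.0, 2.0]}
tuned_weights = {"proj.weight": [3.0, 6.0]}
blended = interpolate_encoders(clip_weights, tuned_weights, alpha=0.5)
# blended["proj.weight"] == [2.0, 4.0]
```

In practice the same element-wise blend would be applied to every tensor in the two encoders' state dicts before loading the result back into the model.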
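The prompt-only, reward-driven training setup can be illustrated with a toy numerical sketch. Everything below is a simplified assumption, not the actual pipeline: the "encoder" is a single scalar weight, the frozen "generator" and "reward" are trivial functions, and gradients are taken by finite differences rather than backpropagation through a diffusion model.

```python
import random

# Toy stand-ins (all hypothetical):
# - encode: a 1-parameter "text encoder" mapping a prompt feature to an embedding.
# - generate: a frozen "generator" turning an embedding into an image score.
# - reward: a differentiable scorer (stand-in for an aesthetic/alignment model).
def encode(w, prompt_feature):
    return w * prompt_feature

def generate(embedding):
    return embedding + 1.0  # frozen, untouched by training

def reward(image):
    return -(image - 3.0) ** 2  # peaks when the "image" equals 3.0

def finetune(w, prompts, lr=0.05, steps=200, eps=1e-4):
    """Fine-tune only the encoder weight w to maximize the reward.

    Note: training consumes prompts and reward signals alone;
    no paired text-image data appears anywhere in the loop.
    """
    for _ in range(steps):
        p = random.choice(prompts)
        # Finite-difference estimate of d(reward)/dw:
        g = (reward(generate(encode(w + eps, p))) -
             reward(generate(encode(w - eps, p)))) / (2 * eps)
        w += lr * g  # gradient ascent on the reward
    return w

random.seed(0)
w_tuned = finetune(0.1, prompts=[1.0, 2.0])
```

After training, the tuned encoder weight yields a higher reward on the training prompts than the initial one, mirroring how reward fine-tuning improves generations without touching the generator.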