21 Jul 2024 | Ming Li¹, Taojiannan Yang¹, Huafeng Kuang², Jie Wu², Zhaoning Wang¹, Xuefeng Xiao², and Chen Chen¹
ControlNet++ improves conditional controls through efficient consistency feedback. This paper proposes ControlNet++, a novel approach that enhances controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and their conditional controls. The method uses a pre-trained discriminative reward model to extract the corresponding condition from the generated images and optimizes a consistency loss between the input conditional control and the extracted condition. To avoid the high computational and memory costs of traditional sampling methods, an efficient reward strategy is introduced: the input images are disturbed by adding noise, and single-step denoised images are used for reward fine-tuning.

Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls, achieving improvements over ControlNet of 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE for the segmentation mask, line-art edge, and depth conditions, respectively. The code, models, demo, and organized data are open-sourced on the GitHub repository. Evaluated under a range of conditional controls, the method demonstrates superior performance compared to existing approaches. The paper also discusses the effectiveness of the method across different tasks, including segmentation, depth, HED edge, Canny edge, and line-art edge. The results show that ControlNet++ significantly improves controllability without compromising image quality or image-text alignment. The method is also evaluated on human-generated data and shows better performance in terms of controllability and image quality. The paper concludes that ControlNet++ provides new insights into controllable visual generation and offers a more efficient and effective approach to improving conditional controls in text-to-image diffusion models.
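For concreteness, the cycle-consistency objective and the single-step reward strategy described above can be summarized as follows. This is a sketch consistent with the summary, not the paper's exact formulation; the notation (x_0, c_v, c_t, \epsilon_\theta, D, t_thr) is introduced here for illustration.

```latex
% Illustrative notation (not verbatim from the paper): x_0 is a training
% image, c_v its conditional control, c_t the text prompt, \epsilon_\theta
% the conditional denoising network, D the discriminative reward model,
% and \mathcal{L} a task-specific per-pixel loss.
\begin{align}
  x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,
        \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \le t_{\mathrm{thr}} \\
  \hat{x}_0 &= \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,
        \epsilon_\theta(x_t, c_v, c_t, t)}{\sqrt{\bar{\alpha}_t}} \\
  \mathcal{L}_{\mathrm{reward}} &= \mathcal{L}\bigl(c_v,\, D(\hat{x}_0)\bigr)
\end{align}
```

Restricting t to small values keeps the one-step estimate of x_0 close to a fully sampled image, which is what makes the reward fine-tuning cheap compared with running the full sampling chain.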
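A minimal PyTorch-style sketch of one reward fine-tuning step is given below. It assumes a diffusers-like noise scheduler exposing `add_noise` and `alphas_cumprod`; the module names, call signatures, and the L2 consistency loss are illustrative placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reward_finetune_step(unet, reward_model, scheduler, x0, cond_control,
                         text_emb, max_t=200, reward_weight=1.0):
    """One training step of the efficient reward strategy: noise the real
    image, denoise it in a single step, re-extract the condition from the
    prediction, and penalize disagreement with the input control."""
    b = x0.shape[0]
    # Restrict to small timesteps so the one-step x0 estimate stays faithful.
    t = torch.randint(0, max_t, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)  # forward diffusion q(x_t | x_0)

    # Predict the noise conditioned on the control and the text prompt,
    # then form the single-step estimate of x_0.
    eps_pred = unet(x_t, t, cond_control, text_emb)
    alpha_bar = scheduler.alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x0_pred = (x_t - (1.0 - alpha_bar).sqrt() * eps_pred) / alpha_bar.sqrt()

    # Cycle consistency: the frozen reward model re-extracts the condition
    # (depth map, edge map, ...) from the predicted image. An L2 loss is used
    # here; segmentation masks would use per-pixel cross-entropy instead.
    reward_loss = F.mse_loss(reward_model(x0_pred), cond_control)

    # Keep the standard denoising objective so image quality is preserved.
    diffusion_loss = F.mse_loss(eps_pred, noise)
    return diffusion_loss + reward_weight * reward_loss
```

In this sketch the reward model stays frozen and only the generative network receives gradients through `x0_pred`, which is how the consistency feedback improves controllability without retraining the condition extractor.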