26 Nov 2023 | Lvmin Zhang, Anyi Rao, and Maneesh Agrawala
ControlNet is a neural network architecture designed to add spatial conditioning controls to large, pretrained text-to-image diffusion models. It reuses the deep, robust encoding layers of these models as a strong backbone for learning diverse conditional controls such as edges, depth, segmentation maps, and human pose. The architecture connects the frozen original model to a trainable copy through "zero convolutions" (zero-initialized convolution layers), so that no harmful noise is injected at the start of finetuning. Experiments with Stable Diffusion show that ControlNet can effectively steer image generation with single or multiple conditioning inputs, with or without text prompts. Training is robust and scales across datasets of different sizes, achieving results competitive with industrial models trained on large computation clusters. User studies and ablation experiments further validate the effectiveness and robustness of ControlNet.
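To make the zero-convolution idea concrete, here is a minimal PyTorch sketch of how a frozen block, its trainable copy, and two zero convolutions can be wired together. The names (`zero_conv`, `ControlNetBlock`), the use of a plain convolution as the frozen block, and the channel counts are illustrative assumptions, not the authors' implementation.

```python
import copy
import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """A 1x1 convolution initialized to all zeros (a 'zero convolution')."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlNetBlock(nn.Module):
    """Wraps one pretrained block with a trainable copy plus two zero convolutions."""

    def __init__(self, frozen_block: nn.Module, channels: int):
        super().__init__()
        self.frozen = frozen_block
        for p in self.frozen.parameters():
            p.requires_grad_(False)              # keep the pretrained weights locked
        self.trainable = copy.deepcopy(frozen_block)  # trainable copy of the block
        self.zero_in = zero_conv(channels)       # injects the conditioning signal
        self.zero_out = zero_conv(channels)      # gates the copy's output back in

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Both zero convolutions output zeros at initialization, so the block
        # initially behaves exactly like the frozen pretrained block: no harmful
        # noise perturbs the backbone when finetuning begins.
        y = self.frozen(x)
        y_ctrl = self.trainable(x + self.zero_in(cond))
        return y + self.zero_out(y_ctrl)


# Illustrative usage: the conditioning tensor stands in for an encoded edge or depth map.
block = ControlNetBlock(nn.Conv2d(64, 64, kernel_size=3, padding=1), channels=64)
x = torch.randn(1, 64, 32, 32)
cond = torch.randn(1, 64, 32, 32)
out = block(x, cond)   # at initialization, identical to the frozen block's output
```

The key design choice this sketch illustrates is that gradients still flow through the zero convolutions even though their outputs start at zero, so the trainable copy can gradually learn to use the conditioning signal without disturbing the pretrained backbone at the outset.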