20 Jun 2024 | Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, J. Zico Kolter
This paper introduces Easy Consistency Tuning (ECT), a novel, efficient method for training consistency models (CMs). CMs are generative models trained so that every point along a sampling trajectory maps to the same initial point of that trajectory, which enables much faster generation than traditional diffusion models. Training CMs from scratch is resource-intensive, however: the previous state-of-the-art CM on CIFAR-10 required roughly one week of training on 8 GPUs.
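Stated more precisely, the consistency condition described above can be written as follows (our notation, not taken verbatim from the paper): a single model f_θ must return the same output at every point on a trajectory and reduce to the identity at the smallest noise level ε.

```latex
% Consistency condition along one sampling trajectory {x_t}, t in [eps, T]:
f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t')
  \quad \text{for all } t, t' \in [\epsilon, T],
\qquad
f_\theta(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon .
```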
ECT makes CM training far cheaper by starting from a pre-trained diffusion model and progressively tightening the consistency condition over the course of training. By writing the consistency condition of a CM trajectory as a differential equation, ECT starts from its loose, infinitesimal form, which a pre-trained diffusion model already approximately satisfies, and fine-tunes toward the full consistency condition. This cuts training time dramatically while also improving sample quality: ECT reaches a 2-step FID of 2.73 on CIFAR-10 within one hour on a single A100 GPU, matching Consistency Distillation models trained over hundreds of GPU hours.
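To make the training recipe concrete, here is a minimal PyTorch sketch of one consistency-tuning step in this spirit. It is not the authors' implementation; the model signature `model(x, sigma)`, the log-uniform noise sampling, and the linear gap schedule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ect_style_loss(model, x0, step, total_steps,
                   sigma_min=0.002, sigma_max=80.0):
    """One ECT-style consistency tuning step (illustrative sketch only).

    The network model(x, sigma) is asked to make the same prediction at two
    noise levels t > r on the same trajectory (same data x0, same noise).
    The gap between t and r starts near zero, where the objective behaves
    like ordinary diffusion training, and grows over training until r hits
    sigma_min, i.e. the full consistency condition.
    """
    b, device = x0.shape[0], x0.device

    # Sample a noise level t log-uniformly in [sigma_min, sigma_max]
    # (the actual noise schedule in the paper may differ).
    log_min = torch.log(torch.tensor(sigma_min))
    log_max = torch.log(torch.tensor(sigma_max))
    t = torch.exp(torch.rand(b, device=device) * (log_max - log_min) + log_min)

    # Gap schedule: move r from ~t (tiny gap) down to sigma_min over training.
    frac = (step + 1) / total_steps
    r = t - frac * (t - sigma_min)

    # Two points on the same trajectory, built from shared data and noise
    # (x0 is assumed to be a batch of images, shape [B, C, H, W]).
    noise = torch.randn_like(x0)
    x_t = x0 + t.view(-1, 1, 1, 1) * noise
    x_r = x0 + r.view(-1, 1, 1, 1) * noise

    pred_t = model(x_t, t)
    with torch.no_grad():          # target branch: no gradient through it
        target = model(x_r, r)
    return F.mse_loss(pred_t, target)
```

The stop-gradient target at the smaller noise level mirrors standard consistency-training practice: the prediction at the noisier point is pulled toward the prediction closer to the clean data, so the consistency condition propagates along the trajectory.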
The paper also investigates the scaling behavior of CMs under ECT, showing that they follow classic power-law scaling between training compute and sample quality, which suggests further efficiency and performance gains at larger scales. Compared to pre-trained Score SDE/diffusion models, ECT reduces inference cost by a factor of 1000 while maintaining comparable generation quality, producing high-quality samples in just one or two steps. Overall, ECT achieves state-of-the-art few-step generation at minimal tuning cost and benefits predictably from additional training FLOPs.
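"Classic power-law scaling" presumably refers to the usual compute-quality relationship of the form below; the functional form is standard, but the fitted constants are not given in this summary.

```latex
% Power-law scaling of sample quality with training compute C;
% a, b > 0 are fitted constants (not reported in this summary).
\mathrm{FID}(C) \;\approx\; a \cdot C^{-b}
```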