T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback


29 May 2024 | Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang
**Abstract:** Diffusion-based text-to-video (T2V) models have achieved significant success but are hampered by slow sampling speeds. To address this, consistency models have been proposed to enable faster inference, albeit at the cost of sample quality. This work aims to break the quality bottleneck of a video consistency model (VCM) to achieve both fast and high-quality video generation. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. By directly optimizing rewards associated with single-step generations, we bypass the memory constraints of backpropagating gradients through iterative sampling. Notably, the 4-step generations from T2V-Turbo achieve the highest total score on the VBench benchmark, surpassing state-of-the-art (SOTA) models like Gen-2 and Pika. Human evaluations further validate that the 4-step generations from T2V-Turbo are preferred over 50-step DDIM samples from their teacher models, representing over tenfold acceleration with improved video generation quality.

**Contributions:**
- Learn a T2V model with feedback from a mixture of reward models, including a video-text model.
- Establish a new SOTA on VBench with only 4 inference steps, outperforming SOTA models trained with substantial resources.
- 4-step generations from T2V-Turbo are favored over 50-step generations from the teacher models in human evaluations, representing over 10 times inference acceleration with quality improvement.

**Keywords:** Text-to-Video, Video Consistency Model, Consistency Distillation, Reward Feedback, Inference Speed
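To make the core idea concrete, here is a minimal sketch (not the authors' code) of how reward feedback from differentiable reward models could be mixed into a consistency distillation loss on single-step generations. All names (`student`, `teacher_ode_solver`, `decode`, `image_text_reward`, `video_text_reward`, and the weighting coefficients) are hypothetical stand-ins for the components described in the abstract.

```python
import torch

def t2v_turbo_loss(student, teacher_ode_solver, decode,
                   image_text_reward, video_text_reward,
                   latents_t, t, prompt_emb,
                   beta_img=1.0, beta_vid=1.0):
    """Sketch of a mixed CD + reward objective; names are illustrative only."""
    # Consistency distillation target: run the teacher's ODE solver one step
    # back from t, then have the student map both points toward the origin.
    with torch.no_grad():
        latents_prev = teacher_ode_solver(latents_t, t, prompt_emb)
        target = student(latents_prev, t - 1, prompt_emb)
    pred = student(latents_t, t, prompt_emb)
    cd_loss = torch.nn.functional.mse_loss(pred, target)

    # Reward feedback on the single-step generation: decode the student's
    # one-step prediction and score it with the reward models, so gradients
    # never have to flow through an iterative sampling chain.
    video = decode(pred)                       # pixel-space video, e.g. (B, T, C, H, W)
    r_img = image_text_reward(video, prompt_emb).mean()
    r_vid = video_text_reward(video, prompt_emb).mean()

    # Maximize rewards while minimizing the distillation loss.
    return cd_loss - beta_img * r_img - beta_vid * r_vid
```

Under this reading, the memory savings come from the fact that the reward gradient touches only a single student forward pass and the decoder, rather than an unrolled multi-step sampler.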