T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback


29 May 2024 | Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang
This paper introduces T2V-Turbo, a text-to-video (T2V) model that improves video generation quality while maintaining fast inference. T2V-Turbo integrates reward feedback from multiple differentiable reward models (RMs) into the consistency distillation (CD) process of a pre-trained T2V model. By directly optimizing rewards on single-step generations, it sidesteps the memory cost of backpropagating gradients through an iterative sampling process.

The resulting model generates high-quality videos in just 4-8 inference steps, outperforming existing methods on the VBench benchmark, including proprietary systems such as Gen-2 and Pika. Human evaluations further confirm that T2V-Turbo's 4-step generations are preferred over 50-step DDIM samples from its teacher models, a more than tenfold inference acceleration paired with improved video quality.

Experiments on a variety of video generation tasks show that the mixed reward feedback significantly enhances video quality without sacrificing inference speed. The paper also discusses limitations of the approach, notably the reliance on video foundation models as RMs due to the lack of open-sourced video-text RMs. Overall, T2V-Turbo represents a significant advance in efficient T2V synthesis by breaking the quality bottleneck of video consistency models.
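To make the training objective concrete, below is a minimal PyTorch-style sketch of the idea: a standard consistency distillation loss augmented with reward terms evaluated on the student's single-step prediction. All function names, signatures, and the weights beta_img and beta_vid are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def t2v_turbo_loss(student, ema_student, teacher_ode_step, decode,
                   image_rm, video_rm, x_t, t, text_emb,
                   beta_img=1.0, beta_vid=1.0):
    """Sketch of CD training with mixed reward feedback.

    All arguments are hypothetical callables: `student`/`ema_student`
    map (latent, timestep, text) to a one-step clean prediction,
    `teacher_ode_step` runs one step of the teacher's ODE solver,
    `decode` maps latents to pixel frames, and the RMs score
    (frames, text) pairs.
    """
    # --- Consistency distillation term ---
    # Step the teacher's solver to an adjacent point on the same
    # probability-flow trajectory; the student must map both points
    # to the same clean prediction (self-consistency).
    with torch.no_grad():
        x_prev = teacher_ode_step(x_t, t, text_emb)
        target = ema_student(x_prev, t - 1, text_emb)
    pred = student(x_t, t, text_emb)          # single-step generation
    loss_cd = F.mse_loss(pred, target)

    # --- Mixed reward term on the single-step generation ---
    # `pred` already estimates the clean video, so reward gradients
    # flow through ONE network evaluation rather than through a
    # multi-step sampler, which is what keeps memory manageable.
    frames = decode(pred)                     # latents -> pixel frames
    loss_img = -image_rm(frames, text_emb).mean()   # image-text RM on frames
    loss_vid = -video_rm(frames, text_emb).mean()   # video-text RM on clips

    return loss_cd + beta_img * loss_img + beta_vid * loss_vid
```

At inference time the distilled student is simply unrolled for 4-8 consistency steps instead of a 50-step DDIM loop, which is where the reported more-than-tenfold speedup comes from.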