FIFO-Diffusion: Generating Infinite Videos from Text without Training

12 Jun 2024 | Jihwan Kim*, Junoh Kang*, Jinyoung Choi, Bohyung Han
FIFO-Diffusion is a novel inference technique for text-conditional video generation that enables the creation of infinitely long videos without additional training. Built on a pretrained video diffusion model, FIFO-Diffusion performs diagonal denoising, which processes a queue of frames with increasing noise levels in a first-in-first-out manner: at each step, a fully denoised frame is dequeued at the head while a new random-noise frame is enqueued at the tail. Diagonal denoising, however, introduces a training-inference gap, so the paper proposes latent partitioning and lookahead denoising to mitigate this issue and improve video quality. Latent partitioning reduces the noise-level differences among the input latents, while lookahead denoising leverages forward referencing to improve denoising accuracy.

FIFO-Diffusion consumes a constant amount of memory regardless of the target video length and is well suited to parallel inference on multiple GPUs. The method has been shown to generate extremely long videos with high quality and consistent motion, outperforming baselines in motion smoothness, frame quality, and scene diversity. A user study further shows that FIFO-Diffusion is preferred over FreeNoise on all criteria, especially those related to motion. Overall, the results indicate that FIFO-Diffusion generates natural, dynamic videos with consistent scene context and expressive motion.
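The diagonal-denoising queue is simple enough to sketch in code. The snippet below is a minimal illustration of the first-in-first-out update only, not the authors' implementation: the `model` and `scheduler_step` callables, their signatures, and the tensor shapes are assumed placeholders, and latent partitioning and lookahead denoising are omitted.

```python
import torch

def fifo_diffusion(model, scheduler_step, prompt_emb, num_output_frames,
                   window=16, latent_shape=(4, 40, 64)):
    """Minimal sketch of diagonal denoising with a FIFO queue (illustrative only).

    `model(latents, timesteps, prompt_emb)` and `scheduler_step(latent, noise_pred, t)`
    are hypothetical stand-ins for a pretrained video diffusion model and its sampler.
    """
    # Fixed diagonal schedule: the frame at queue position i always sits at
    # noise level timesteps[i]; the head is nearly clean, the tail is pure noise.
    timesteps = torch.arange(1, window + 1)

    # Initialize the queue with `window` noisy latents.
    queue = [torch.randn(latent_shape) for _ in range(window)]

    outputs = []
    for _ in range(num_output_frames):
        latents = torch.stack(queue)                    # (window, C, H, W)
        # One denoising pass over the whole window; each frame moves one noise
        # level down according to its own timestep (diagonal denoising).
        noise_pred = model(latents, timesteps, prompt_emb)
        queue = [scheduler_step(queue[i], noise_pred[i], timesteps[i])
                 for i in range(window)]

        # First-in-first-out update: the head is now fully denoised.
        outputs.append(queue.pop(0))                    # dequeue finished frame
        queue.append(torch.randn(latent_shape))         # enqueue fresh Gaussian noise
    return outputs
```

Because the queue length and the set of noise levels stay fixed, memory use is constant regardless of how many frames are produced, which is the property that makes infinite-length generation possible.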