2024 | Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han
FIFO-Diffusion is a novel inference technique for text-conditional video generation that enables the generation of infinitely long videos without additional training. Built on a pretrained text-conditional video generation model, FIFO-Diffusion uses diagonal denoising, which processes a series of consecutive frames with increasing noise levels in a queue: at each step, the fully denoised frame at the head is dequeued while a new random-noise frame is enqueued at the tail (see the sketch below). Because diagonal denoising introduces a training-inference gap, the paper also proposes latent partitioning and lookahead denoising, which reduce this gap and exploit forward referencing, in which noisier frames are denoised while attending to cleaner frames ahead of them in the queue.

FIFO-Diffusion consumes a constant amount of memory regardless of the target video length, which makes it well suited to parallel inference on multiple GPUs. The method has been demonstrated to generate extremely long videos with high quality and consistent motion.

Applied to existing text-to-video generation baselines, FIFO-Diffusion produces videos with natural motion that do not degrade over time. Compared with other techniques, including FreeNoise and Gen-L-Video, it outperforms them in motion smoothness, frame quality, and scene diversity. A user study shows that participants prefer FIFO-Diffusion over FreeNoise on all criteria, especially those related to motion. In terms of computational cost, FIFO-Diffusion keeps a fixed memory footprint and can reduce wall-clock time through parallelized computation. The paper concludes that FIFO-Diffusion is a promising approach for generating infinitely long videos from text without training.
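A minimal sketch of the diagonal denoising loop described above, assuming a hypothetical denoiser `model(latents, timesteps)` that advances a stack of latent frames by one denoising step given per-frame noise levels (the actual method wraps a pretrained video diffusion model; every name here is illustrative, not the paper's API):

```python
from collections import deque
import torch

def fifo_diffusion(model, frame_shape, num_frames, f, device="cpu"):
    """Sketch of FIFO-Diffusion's diagonal denoising loop.

    model(latents, timesteps): hypothetical denoiser taking an (f, C, H, W)
    stack of latents plus a length-f vector of per-frame noise levels, and
    returning the latents each advanced by one denoising step.
    f: queue length, equal to the number of denoising steps.
    """
    # Initialize the queue with f noise latents. (In the paper the queue is
    # warmed up from a short clip generated by the base model.)
    queue = deque(torch.randn(frame_shape, device=device) for _ in range(f))
    # Increasing noise levels: the head is one step from clean, the tail is pure noise.
    timesteps = torch.arange(1, f + 1, device=device)

    video = []
    for _ in range(num_frames):
        latents = torch.stack(tuple(queue))  # (f, C, H, W)
        latents = model(latents, timesteps)  # one diagonal denoising step
        queue = deque(latents.unbind(0))
        video.append(queue.popleft())        # dequeue the fully denoised head
        queue.append(torch.randn(frame_shape, device=device))  # enqueue fresh noise
    return torch.stack(video)
```

The loop makes the constant-memory property concrete: the queue always holds exactly f latents, no matter how many frames are emitted.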
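Latent partitioning can be sketched in the same vein, assuming the queue is lengthened to n * f latents and split into n consecutive blocks so that each model call sees only f adjacent noise levels; the independent block updates are what make multi-GPU parallelism natural. Again a hedged sketch reusing the hypothetical `model` from above, not the paper's implementation:

```python
import torch

def diagonal_step_partitioned(model, latents, timesteps, n):
    """Sketch of latent partitioning: one diagonal denoising step over a
    queue of n * f latents, split into n blocks of f consecutive frames.

    Each block spans a narrower range of noise levels than the full queue,
    and the n independent model calls can be dispatched to n GPUs in parallel.
    """
    blocks = latents.chunk(n, dim=0)      # n blocks of f frames each
    t_blocks = timesteps.chunk(n, dim=0)  # matching per-frame noise levels
    outputs = [model(b, t) for b, t in zip(blocks, t_blocks)]
    return torch.cat(outputs, dim=0)
```

For example, with n = 4 and f = 16 the queue holds 64 latents and each step issues four 16-frame denoiser calls, one per GPU. Lookahead denoising refines how windows are chosen so each frame can attend to cleaner frames ahead of it; that detail is omitted from this sketch.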