EFFICIENT VIDEO DIFFUSION MODELS VIA CONTENT-FRAME MOTION-LATENT DECOMPOSITION


2024 | Sihyun Yu¹*, Weili Nie², De-An Huang², Boyi Li²,³, Jinwoo Shin¹, Anima Anandkumar⁴
This paper proposes the content-motion latent diffusion model (CMD), an efficient extension of pretrained image diffusion models to video generation. CMD addresses the high memory and computational costs of video diffusion models by decomposing each video into a content frame (analogous to an image) and a low-dimensional motion latent representation: the content frame captures the content shared across the video, while the motion latent captures the underlying motion. CMD generates the content frame by fine-tuning a pretrained image diffusion model and generates the motion latent with a lightweight diffusion model, which yields better video generation quality at substantially lower computational cost. For example, CMD samples a 16-frame video at 512×1024 resolution in 3.1 seconds, 7.7× faster than prior approaches, and achieves an FVD of 238.3 on WebVid-10M, 18.5% better than the previous state of the art of 292.4. CMD is also memory- and compute-efficient, requiring only 5.56 GB of memory and 46.83 TFLOPs to generate a single 16-frame video at 512×1024 resolution, compared to 8.51 GB and 938.9 TFLOPs for the recent ModelScope. CMD's key innovation is a compact latent space that directly leverages a pretrained image model, which previous latent video diffusion models have not done. Extensive experiments and analyses show that CMD outperforms existing video generation methods in quality, efficiency, and scalability, demonstrating that it is a promising approach for efficient large-scale video generation.
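To make the decomposition concrete, below is a minimal sketch of how a video's per-frame latents might be split into an image-shaped content frame and a compact motion latent. The module names, shapes, softmax-weighted temporal average, and linear motion projection are illustrative assumptions for exposition, not the paper's actual autoencoder, which is learned end to end.

```python
# Illustrative sketch of a content/motion split (assumptions, not CMD's exact model).
import torch
import torch.nn as nn

class ContentMotionEncoder(nn.Module):
    """Splits per-frame video latents into a content frame and a motion latent."""
    def __init__(self, channels: int, motion_dim: int):
        super().__init__()
        # Per-frame logits used to form a weighted average over time;
        # the resulting content frame lives in the same space as one image latent.
        self.frame_score = nn.Conv3d(channels, 1, kernel_size=1)
        # Lightweight projection of the residual (motion) into a compact latent.
        self.to_motion = nn.Sequential(
            nn.Flatten(start_dim=2),    # (B, T, C*H*W)
            nn.LazyLinear(motion_dim),  # (B, T, motion_dim)
        )

    def forward(self, video_latents: torch.Tensor):
        # video_latents: (B, C, T, H, W), e.g. per-frame latents from an image autoencoder
        weights = torch.softmax(self.frame_score(video_latents), dim=2)   # softmax over time
        content_frame = (weights * video_latents).sum(dim=2)              # (B, C, H, W)
        residual = video_latents - content_frame.unsqueeze(2)             # motion component
        motion_latent = self.to_motion(residual.transpose(1, 2))          # (B, T, motion_dim)
        return content_frame, motion_latent

# Example: a 16-frame clip of 64x64 latents reduces to one image-shaped content
# frame plus a 16x128 motion latent.
enc = ContentMotionEncoder(channels=4, motion_dim=128)
z = torch.randn(2, 4, 16, 64, 64)
content, motion = enc(z)
print(content.shape, motion.shape)  # torch.Size([2, 4, 64, 64]) torch.Size([2, 16, 128])
```

Because the content frame has the same shape as a single image latent, a pretrained image diffusion model can be fine-tuned to generate it directly, while a much smaller diffusion model suffices for the low-dimensional motion latent.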