EFFICIENT VIDEO DIFFUSION MODELS VIA CONTENT-FRAME MOTION-LATENT DECOMPOSITION


21 Mar 2024 | Sihyun Yu1*, Weili Nie2, De-An Huang2, Boyi Li2,3, Jinwoo Shin1, Anima Anandkumar4
The paper introduces the Content-Motion Latent Diffusion Model (CMD), an efficient extension of pretrained image diffusion models to video generation. CMD addresses the high memory and computational requirements of current video diffusion models by encoding each video as a combination of a single content frame and a low-dimensional motion latent representation. The content frame captures the content shared across frames, while the motion latent encodes the underlying motion in the video. CMD then generates the content frame by fine-tuning a pretrained image diffusion model and generates the motion latent with a lightweight, separately trained diffusion model. This decomposition reduces input dimensionality and computational cost, yielding both better video generation quality and faster sampling. Experiments on benchmarks such as UCF-101 and WebVid-10M show that CMD outperforms existing methods in FVD score and computational efficiency.
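The core idea, splitting a video into a shared content frame and a compact motion latent, can be sketched in a few lines. The example below is a toy illustration, not the paper's method: it uses a uniform temporal average for the content frame (CMD learns the weighting) and a random linear projection as a stand-in for CMD's learned motion encoder; all shapes and the latent dimension `D` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video: T frames of H x W x C pixels in [0, 1).
T, H, W, C = 16, 8, 8, 3
video = rng.random((T, H, W, C))

# Content frame: a weighted average over time. Uniform weights here
# for illustration; CMD learns frame weights with an autoencoder.
weights = np.full(T, 1.0 / T)
content_frame = np.einsum("t,thwc->hwc", weights, video)

# Motion latent: a low-dimensional code for the per-frame residual.
# A random linear projection stands in for CMD's learned motion encoder.
D = 32  # latent dimension (hypothetical)
residual = (video - content_frame).reshape(T, -1)
proj = rng.standard_normal((residual.shape[1], D)) / np.sqrt(residual.shape[1])
motion_latent = residual @ proj  # shape (T, D)

print(content_frame.shape)  # (8, 8, 3) -- image-sized, reusable by an image diffusion model
print(motion_latent.shape)  # (16, 32) -- far smaller than the raw video
```

The payoff is visible in the shapes: the content frame has the dimensionality of a single image, so a pretrained image diffusion model can generate it, while the motion latent is orders of magnitude smaller than the raw video, so a lightweight diffusion model suffices.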