CV-VAE: A Compatible Video VAE for Latent Generative Video Models

23 Oct 2024 | Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan
This paper proposes CV-VAE, a video VAE whose latent space is compatible with existing image and video models such as Stable Diffusion (SD) and Stable Video Diffusion (SVD). The key idea is a latent space regularization method that aligns the latent space of the video VAE with that of a given image VAE, e.g., the image VAE of SD, by formulating a regularization loss using the image VAE. Benefiting from this compatibility, video models can be trained seamlessly from pre-trained text-to-image (T2I) or video models in a truly spatio-temporally compressed latent space, rather than one obtained by simply sampling video frames at equal intervals. To improve training efficiency, a novel architecture for the video VAE is also designed. With CV-VAE, existing video models can generate four times more frames with minimal fine-tuning. Extensive experiments demonstrate the effectiveness of the proposed video VAE.

The paper also discusses the challenges of training video VAEs, in particular the lack of a commonly used continuous video (3D) VAE for latent diffusion-based video models in the research community. Moreover, since current diffusion-based approaches are often built on pre-trained T2I models, directly training a video VAE without considering compatibility with existing T2I models produces a latent space gap between them, and bridging that gap demands huge computational resources for training, even with the T2I models as initialization. CV-VAE addresses this by training the video VAE so that its latent space matches that of the image VAE through the proposed regularization.
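To make the alignment concrete, the sketch below shows one plausible form of the latent regularization: the video is encoded by the video VAE, frames at the matching temporal stride are encoded by the frozen image VAE, and an MSE loss pulls the two sets of latents together. The encoder interfaces, the uniform frame-sampling strategy, and the choice of MSE are assumptions for illustration; the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def latent_regularization_loss(video_vae_encoder, image_vae_encoder, video):
    """Hypothetical latent-space regularization between a video VAE and a
    frozen image VAE. `video` is a pixel tensor of shape (B, C, T, H, W)."""
    # Spatio-temporally compressed latents, e.g. (B, 4, T//4, H//8, W//8).
    z_video = video_vae_encoder(video)

    # Sample frames at the temporal compression stride so the number of
    # sampled frames matches the number of latent frames from the video VAE.
    stride = video.shape[2] // z_video.shape[2]
    frames = video[:, :, ::stride]  # (B, C, T//stride, H, W)

    # Encode each sampled frame with the frozen image VAE (e.g. SD's VAE)
    # and stack the per-frame latents along a temporal axis.
    with torch.no_grad():
        z_image = torch.stack(
            [image_vae_encoder(frames[:, :, i]) for i in range(frames.shape[2])],
            dim=2,
        )  # (B, 4, T//stride, H//8, W//8)

    # Pull the video latents toward the image latents so the two VAEs
    # share a compatible latent space.
    return F.mse_loss(z_video, z_image)
```

Because the resulting latents live in the image VAE's space, a pre-trained latent diffusion model can operate on them directly; assuming 4x temporal compression, the same number of latent frames now decodes to four times as many video frames, which is why existing video models need only minimal fine-tuning.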