10 May 2024 | Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie†, Yujie Zhong, Lin Ma
This paper introduces Matten, a latent diffusion model for video generation built on a Mamba-Attention architecture. Mamba, a state-space model, captures global temporal relationships, while attention mechanisms model local and spatio-temporal details. The design targets efficient handling of long sequences and complex spatio-temporal interactions in videos. In experimental evaluations, Matten is competitive with current Transformer-based and GAN-based models and outperforms them on FVD scores and efficiency. The study also reports a positive correlation between model complexity and the quality of generated videos, which points to the model's scalability. Four model variants are explored, and the results indicate that a balanced combination of Mamba and attention mechanisms is most effective. The paper concludes by discussing the contributions of Matten and its superior performance in video generation tasks.
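The division of labor described above, a state-space recurrence for global temporal context and attention for local detail, can be illustrated with a toy sketch. This is not Matten's implementation: the function names (`ssm_scan`, `local_attention`, `hybrid_block`), the scalar per-frame features, the fixed coefficients `a` and `b`, and the window size are all assumptions made for illustration. In particular, the recurrence here uses constant coefficients, whereas Mamba's selective scan makes them input-dependent.

```python
import math


def ssm_scan(xs, a=0.9, b=0.1):
    """Fixed-coefficient linear recurrence h_t = a * h_{t-1} + b * x_t.

    Runs in time linear in the sequence length, and every output depends
    on all earlier inputs. This is a simplified stand-in for a state-space
    layer, not Mamba's selective scan.
    """
    h = 0.0
    out = []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def local_attention(xs, window=1):
    """Softmax self-attention over scalar features, restricted to
    positions within `window` steps of the query position."""
    n = len(xs)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = [xs[i] * xs[j] for j in range(lo, hi)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * xs[j] for w, j in zip(weights, range(lo, hi))) / total)
    return out


def hybrid_block(xs, window=1):
    """Hypothetical hybrid block: local attention, then a global scan,
    each added back to its input as a residual."""
    mixed = [x + l for x, l in zip(xs, local_attention(xs, window))]
    return [m + g for m, g in zip(mixed, ssm_scan(mixed))]


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 9
    print(ssm_scan(impulse)[-1])         # nonzero: the scan carries frame 0 to the end
    print(local_attention(impulse)[-1])  # zero: the window never reaches frame 0
```

In this sketch, an impulse in the first frame still affects the last output of the scan, while the windowed attention at the last frame sees only its neighbors. That contrast is the intuition behind pairing the two mechanisms.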