Matten: Video Generation with Mamba-Attention


10 May 2024 | Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie†, Yujie Zhong, Lin Ma
This paper introduces Matten, a novel latent diffusion model for video generation built on a Mamba-Attention architecture. Mamba, a state-space model, captures global temporal relationships, while attention mechanisms model local and spatial-temporal details. This design lets the model handle long sequences and complex spatio-temporal interactions in videos efficiently. Experimental evaluations show that Matten is competitive with current Transformer-based and GAN-based models, outperforming them in FVD scores and efficiency. The study also finds a positive correlation between model complexity and the quality of generated videos, highlighting the model's scalability. Four model variants are explored, and the results indicate that a balanced combination of Mamba and attention mechanisms is most effective. The paper concludes by summarizing Matten's contributions and its superior performance in video generation tasks.
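To make the division of labor concrete, below is a minimal, hypothetical sketch of what a Mamba-Attention ("Matten") style block could look like, written from the abstract alone: spatial self-attention handles local detail within each frame, while a state-space scan along the frame axis handles global temporal relationships. The simplified `SimpleSSM` layer stands in for Mamba's selective-scan kernel, and all layer names, shapes, and the interleaving order are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Mamba-Attention style block (not the paper's code).
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Toy diagonal state-space layer; a stand-in for Mamba's selective scan."""

    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.in_proj = nn.Linear(dim, state_dim)
        self.out_proj = nn.Linear(state_dim, dim)
        # Learnable per-state decay, kept in (0, 1) via sigmoid.
        self.log_decay = nn.Parameter(torch.zeros(state_dim))

    def forward(self, x):                      # x: (batch, seq_len, dim)
        u = self.in_proj(x)                    # (batch, seq_len, state_dim)
        a = torch.sigmoid(self.log_decay)      # (state_dim,)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.shape[1]):            # recurrent scan over the sequence
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


class MattenBlock(nn.Module):
    """One block: spatial self-attention per frame + temporal SSM per token."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.temporal_ssm = SimpleSSM(dim)

    def forward(self, x):                      # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        # Local spatial modelling: attention within each frame.
        xs = self.norm1(x).reshape(b * f, n, d)
        attn_out, _ = self.spatial_attn(xs, xs, xs)
        x = x + attn_out.reshape(b, f, n, d)
        # Global temporal modelling: SSM along the frame axis for each token.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * n, f, d)
        x = x + self.temporal_ssm(xt).reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    block = MattenBlock(dim=64)
    latents = torch.randn(2, 8, 16, 64)        # (batch, frames, tokens, dim)
    print(block(latents).shape)                # torch.Size([2, 8, 16, 64])
```

In a latent diffusion setup, blocks like this would be stacked inside the denoiser operating on video latents; the recurrent scan keeps the temporal cost linear in the number of frames, which is the efficiency argument the abstract makes for using Mamba over full temporal attention.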