1 Apr 2024 | Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, and Björn Ommer
ZigMa: A DiT-style Zigzag Mamba Diffusion Model
This paper introduces ZigMa, a DiT-style diffusion model that leverages the long-sequence modeling capability of the Mamba state-space model for visual data generation. The key challenge it addresses is the lack of spatial continuity in existing Mamba-based vision methods, which is critical for effective 2D and 3D modeling. ZigMa incorporates spatial continuity via zigzag scanning of the token sequence, outperforming Mamba-based baselines while offering better speed and memory utilization than transformer-based baselines. The model extends to 3D video data by factorizing the sequence into spatial and temporal components, and it is integrated with the Stochastic Interpolant framework to investigate scalability on large-resolution visual datasets such as FacesHQ (1024×1024) and UCF101. A cross-attention block additionally enables text conditioning on complex visual data. Ablation studies over various scanning schemes confirm the importance of spatial continuity and compare the benefits of different scanning strategies. Evaluations on multiple datasets, including COCO and UCF101, demonstrate the model's versatility in both image and video generation, indicating that ZigMa is a promising approach for scalable diffusion models in visual data generation.
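The core idea behind the name is a continuity-preserving scan order over the flattened patch tokens: a plain raster scan makes a spatial jump at every row boundary, whereas a zigzag (boustrophedon) scan reverses every other row so that consecutive tokens remain spatially adjacent. The sketch below is a minimal illustration of this one variant, not the paper's implementation (the paper alternates among multiple zigzag schemes across layers); the function names `raster_order` and `zigzag_order` are hypothetical.

```python
# Minimal sketch of a continuity-preserving zigzag scan over an h x w grid
# of patch tokens, contrasted with a plain raster scan. Token index of the
# patch at row r, column c is r * w + c.

def raster_order(h: int, w: int) -> list[int]:
    """Row-major scan: discontinuous jump at every row boundary."""
    return [r * w + c for r in range(h) for c in range(w)]

def zigzag_order(h: int, w: int) -> list[int]:
    """Boustrophedon scan: even rows left-to-right, odd rows right-to-left,
    so every consecutive pair of tokens is at grid distance 1."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend(r * w + c for c in cols)
    return order

if __name__ == "__main__":
    # On a 3x4 grid the raster scan jumps from index 3 (row 0, col 3) to
    # index 4 (row 1, col 0), while the zigzag scan steps from 3 to 7,
    # i.e. straight down one row.
    print(raster_order(3, 4))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    print(zigzag_order(3, 4))  # [0, 1, 2, 3, 7, 6, 5, 4, 8, 9, 10, 11]
```

In a Mamba-based backbone, a permutation like this would be applied to the token sequence before each state-space scan (and inverted afterwards), which is what gives the recurrence locally coherent 2D context instead of row-boundary discontinuities.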