1 Apr 2024 | Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer, and Björn Ommer
The paper introduces Zigzag Mamba (ZigMa), a diffusion backbone that leverages the long-sequence modeling capability of the Mamba State-Space Model (SSM) to improve the scalability and efficiency of visual generation. The key contribution is identifying and addressing spatial continuity, a critical issue that is often overlooked in current Mamba-based vision methods. ZigMa proposes a simple, plug-and-play solution that enhances the network's position awareness by heuristically arranging and rearranging Mamba's scan path. This approach outperforms Mamba-based baselines and improves speed and memory utilization over transformer-based baselines. The method is further integrated with the Stochastic Interpolant framework to investigate scalability on large-resolution visual datasets such as FacesHQ 1024×1024, UCF101, MultiModal-CelebA-HQ, and MS COCO 256×256. The paper also generalizes the method to 3D video data by factorizing the spatial and temporal sequences. Experimental results show that ZigMa achieves superior FID scores and visual quality, validating the effectiveness of the proposed approach.
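As a rough illustration of the scan-path idea (a minimal sketch, not the authors' released code), each token ordering is just a permutation of the flattened H×W grid, and each block can scan along a different spatially continuous "zigzag" path before restoring the canonical layout. Names such as `zigzag_path` and the commented-out `MambaBlock` are hypothetical placeholders.

```python
import torch

def zigzag_path(H: int, W: int, flip: bool = False, transpose: bool = False) -> torch.Tensor:
    """Permutation of length H*W visiting the grid in a boustrophedon (zigzag)
    order, so consecutive tokens in the 1D sequence stay spatially adjacent."""
    idx = torch.arange(H * W).reshape(H, W)
    if transpose:                    # column-major variant of the path
        idx = idx.t()
    rows = []
    for r in range(idx.shape[0]):
        row = idx[r]
        if r % 2 == 1:               # reverse every other row -> zigzag
            row = row.flip(0)
        rows.append(row)
    path = torch.cat(rows)
    if flip:                         # start the sweep from the opposite end
        path = path.flip(0)
    return path

H = W = 4
paths = [zigzag_path(H, W, f, t) for f in (False, True) for t in (False, True)]

x = torch.randn(2, H * W, 64)        # (batch, tokens, channels)
for i in range(8):                   # 8 blocks, cycling through the paths
    p = paths[i % len(paths)]
    inv = torch.argsort(p)           # inverse permutation to undo the scan order
    x = x[:, p, :]                   # arrange: order tokens along this zigzag path
    # x = MambaBlock(...)(x)         # placeholder: causal SSM scan over the sequence
    x = x[:, inv, :]                 # rearrange: restore the canonical grid layout
```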
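The space-time factorization mentioned for video can be pictured the same way: spatial scans run within each frame, then a plain 1D scan runs across frames at each spatial location. The sketch below shows only the reshaping, with `spatial_mamba` and `temporal_mamba` as assumed placeholder modules.

```python
import torch

B, T, H, W, C = 2, 8, 16, 16, 64
x = torch.randn(B, T, H * W, C)      # (batch, frames, tokens per frame, channels)

# Spatial scan: fold time into the batch; sequence length is H*W per frame.
xs = x.reshape(B * T, H * W, C)
# xs = spatial_mamba(xs)             # placeholder: zigzag-ordered scan as above
x = xs.reshape(B, T, H * W, C)

# Temporal scan: fold space into the batch; sequence length is T per location.
xt = x.permute(0, 2, 1, 3).reshape(B * H * W, T, C)
# xt = temporal_mamba(xt)            # placeholder: 1D scan along the time axis
x = xt.reshape(B, H * W, T, C).permute(0, 2, 1, 3)
```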