MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

20 Jun 2024 | Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang
MVGamba is a unified 3D generation framework that leverages Gaussian Splatting to produce high-quality 3D content in under a second. It integrates multi-view diffusion models with a scalable multi-view reconstructor that directly predicts 3D Gaussian Splatting (3DGS) parameters. At its core is a multi-view Gaussian reconstructor built on an RNN-like State Space Model (SSM), which enables causal context propagation for cross-view self-refinement while generating long sequences of Gaussians with linear complexity. This design supports efficient, high-quality 3D content generation from a single image, sparse views, or text prompts.

MVGamba outperforms state-of-the-art baselines across all 3D content generation scenarios with roughly 0.1× the model size of competing methods. Its lightweight, efficient SSM-based reconstructor addresses the multi-view inconsistency and blurred textures seen in existing Gaussian reconstruction models. Trained on a large-scale dataset, MVGamba demonstrates superior performance in both qualitative and quantitative experiments, and its efficient training and inference make it suitable for real-world applications. The paper also discusses MVGamba's limitations, including its dependence on the quality of multi-view diffusion models and the need for further work on input-order optimization. Overall, MVGamba offers a general and efficient solution for 3D content generation, covering text-to-3D, image-to-3D, and sparse-view reconstruction tasks.
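To make the linear-complexity causal scan concrete, the sketch below runs a toy diagonal state-space recurrence over a concatenated multi-view token sequence and decodes each token into a fixed-size Gaussian parameter vector. The shapes, the diagonal parameterization, and the 14-dimensional Gaussian layout are illustrative assumptions for this sketch, not MVGamba's released architecture (the paper builds on a Mamba-style selective SSM).

```python
# Illustrative sketch only: a minimal diagonal linear SSM scan over a
# multi-view token sequence, emitting per-token 3D Gaussian parameters.
# All shapes, names, and the 14-dim Gaussian layout (3 position, 3 scale,
# 4 rotation, 1 opacity, 3 color) are assumptions, not MVGamba's code.
import numpy as np

rng = np.random.default_rng(0)

num_views, tokens_per_view, d_model, d_state = 4, 64, 32, 16
gaussian_dim = 14  # assumed: xyz + scale + quaternion + opacity + RGB

# Multi-view image tokens, concatenated causally view by view.
x = rng.standard_normal((num_views * tokens_per_view, d_model))

# Diagonal state matrix with |a| < 1 keeps the recurrence stable.
A = rng.uniform(0.8, 0.99, size=d_state)
B = rng.standard_normal((d_state, d_model)) * 0.1
C = rng.standard_normal((d_model, d_state)) * 0.1
W_out = rng.standard_normal((gaussian_dim, d_model)) * 0.1

def ssm_scan(x):
    """Causal linear-time scan: h_t = A*h_{t-1} + B x_t, y_t = C h_t.

    Tokens from earlier views update the hidden state, so tokens from
    later views are decoded with cross-view context (the "causal
    context propagation" described above) at O(sequence length) cost.
    """
    h = np.zeros(d_state)
    ys = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = A * h + B @ x_t   # element-wise A: diagonal SSM update
        ys[t] = C @ h
    return ys

features = ssm_scan(x)
gaussians = features @ W_out.T  # one Gaussian parameter vector per token
print(gaussians.shape)          # (256, 14)
```

Because the state is updated token by token, the cost grows linearly with sequence length, in contrast to the quadratic attention cost of transformer-based reconstructors; this is what lets an SSM-based reconstructor scale to the long Gaussian sequences the summary describes.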