16 Jul 2024 | Xianglong He1,2*, Junyi Chen1,3*, Sida Peng4, Di Huang1, Yangguang Li5, Xiaoshui Huang1, Chun Yuan2†, Wanli Ouyang1,6, and Tong He1†
The paper introduces GVGEN, a novel diffusion-based framework for generating 3D Gaussian representations from text input. GVGEN addresses the limitations of existing methods, such as the inability to produce diverse samples and prolonged inference times, through two innovative techniques: a Structured Volumetric Representation and a Coarse-to-fine Generation Pipeline. The Structured Volumetric Representation arranges disorganized 3D Gaussian points into a structured form called GaussianVolume, with a Candidate Pool Strategy enhancing detail fidelity. The Coarse-to-fine Generation Pipeline first generates a basic geometric structure and then predicts complete Gaussian attributes, improving both the quality and the efficiency of generation.

GVGEN demonstrates superior performance in both qualitative and quantitative assessments while maintaining a fast generation speed of approximately 7 seconds. The framework is designed to efficiently generate 3D models from text descriptions, making it suitable for applications in computer graphics, video game design, film production, and AR/VR technologies.
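The core structuring idea — arranging unordered 3D Gaussian points into a fixed-resolution volume whose cells store per-Gaussian attributes — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the volume resolution, the attribute layout (offset, scale, rotation, opacity, color), and the nearest-voxel assignment with a positional residual are all assumptions standing in for the actual GaussianVolume fitting and Candidate Pool Strategy.

```python
import numpy as np

# Assumed volume resolution (N x N x N voxels) and attribute layout:
# 3 position-offset + 3 scale + 4 rotation (quaternion) + 1 opacity + 3 RGB.
# Both are hypothetical choices for illustration.
N = 32
ATTR = 3 + 3 + 4 + 1 + 3  # = 14 channels per voxel


def fit_gaussians_to_volume(points: np.ndarray) -> np.ndarray:
    """Arrange unstructured Gaussian centers into a structured volume.

    Each point in [0, 1)^3 is snapped to its nearest voxel; the voxel's
    offset channels store the residual from the voxel center, so fine
    positional detail survives the structuring (a rough stand-in for
    how a structured representation can preserve detail fidelity).
    """
    volume = np.zeros((N, N, N, ATTR), dtype=np.float32)
    idx = np.clip((points * N).astype(int), 0, N - 1)
    centers = (idx + 0.5) / N  # voxel centers in [0, 1)^3
    for (i, j, k), p, c in zip(idx, points, centers):
        volume[i, j, k, :3] = p - c  # positional residual (offset channels)
        volume[i, j, k, 10] = 1.0    # mark the voxel occupied (opacity channel)
    return volume


pts = np.random.default_rng(0).random((1000, 3))
vol = fit_gaussians_to_volume(pts)
```

Because the result is a dense, regularly indexed tensor rather than an unordered point set, it can be consumed directly by a volumetric diffusion model, which is what makes the coarse-to-fine generation pipeline tractable.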