14 Mar 2024 | Cheng Chen, Xiaofeng Yang, Fan Yang, Chengzeng Feng, Zhoujie Fu, Chuan-Sheng Foo, Guosheng Lin, Fayao Liu
Sculpt3D is a framework for multi-view consistent text-to-3D generation that integrates 3D priors from retrieved reference objects without retraining the 2D diffusion model. It ensures high-quality, diverse 3D geometry through sparse ray sampling and keypoint supervision, and it modulates the 2D diffusion model so that appearances are accurate across views without altering the object's style. By leveraging the 3D information of reference objects while preserving the generation quality of the 2D diffusion model, Sculpt3D significantly improves multi-view consistency while retaining fidelity and diversity.

On the geometry side, sparse ray sampling selectively discards points so that supervision falls only on a minimal set of keypoints describing the overall structure, leaving the rest of the shape free for the diffusion prior to sculpt. The retrieved template is itself updated during optimization: points are pruned in areas where the NeRF's output density is low and new points are grown where it is high. A re-retrieval mechanism then uses the generated shape to correct the initial retrieval result.

On the appearance side, Sculpt3D uses the template's appearance information to refine the generated object so that it keeps the correct appearance patterns without changing its style. A unified image adapter first adapts the template to the generated object's style, and the adapted images are then used to align erroneous generated appearances with the correct patterns. Only four sparse template views are required, supervising a 3D space partitioned according to four standard orientations.
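To make the geometry co-supervision concrete, here is a minimal sketch of keypoint supervision. The paper only states that a minimal number of keypoints describing the overall structure is supervised; the selection method below (farthest point sampling) and the `nerf_density_fn` handle to the NeRF's density field are assumptions for illustration.

```python
import torch

def farthest_point_sample(points: torch.Tensor, k: int) -> torch.Tensor:
    # points: (N, 3) template point cloud; greedily pick k well-spread points.
    # FPS is an assumed keypoint selector, not specified by the paper.
    n = points.shape[0]
    chosen = [int(torch.randint(n, (1,)))]
    dists = torch.full((n,), float("inf"))
    for _ in range(k - 1):
        dists = torch.minimum(dists, (points - points[chosen[-1]]).norm(dim=1))
        chosen.append(int(dists.argmax()))
    return points[chosen]

def keypoint_supervision_loss(nerf_density_fn, template_points, k=256):
    # Supervise occupancy only at a sparse set of structural keypoints,
    # leaving the rest of the shape to the 2D diffusion prior.
    keypoints = farthest_point_sample(template_points, k)
    sigma = nerf_density_fn(keypoints)                # (k,) NeRF densities
    occupancy = 1.0 - torch.exp(-torch.relu(sigma))   # density -> [0, 1)
    return (1.0 - occupancy).mean()                   # pull keypoints occupied
```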
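The template update can be sketched in the same spirit: prune template points where the NeRF is nearly empty, grow new ones where it is dense. The thresholds and the uniform candidate-sampling scheme are illustrative assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def update_template_points(points, nerf_density_fn, n_grow=512,
                           prune_thresh=0.01, grow_thresh=5.0):
    # Prune template points where the NeRF has (almost) no density: the
    # diffusion prior has carved that part of the reference shape away.
    sigma = nerf_density_fn(points)
    kept = points[sigma > prune_thresh]

    # Grow new points where the NeRF is dense: sample candidates in the
    # template's bounding box and keep the high-density ones.
    lo, hi = points.min(0).values, points.max(0).values
    candidates = lo + (hi - lo) * torch.rand(8 * n_grow, 3)
    dense = candidates[nerf_density_fn(candidates) > grow_thresh][:n_grow]
    return torch.cat([kept, dense], dim=0)
```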
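The four-view appearance supervision partitions 3D space by orientation. A simple bucketing of the camera azimuth into four 90-degree sectors around the standard orientations (an assumption about how the partition is implemented) looks like this:

```python
# Assign a rendered view to the template view that supervises its sector:
# front (0 deg), left (90 deg), back (180 deg), right (270 deg).
TEMPLATE_VIEWS = ["front", "left", "back", "right"]

def supervising_view(azimuth_deg: float) -> str:
    sector = int(((azimuth_deg + 45.0) % 360.0) // 90.0)
    return TEMPLATE_VIEWS[sector]

assert supervising_view(10.0) == "front"
assert supervising_view(100.0) == "left"
```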
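In the paper, the style-adapted template images modulate the 2D diffusion model's output. As a simpler stand-in (not the paper's mechanism), this sketch aligns a rendered view with the adapted template view of the same sector at a coarse resolution, so only pattern-level appearance is pulled toward the template and the object's style is left intact:

```python
import torch
import torch.nn.functional as F

def appearance_alignment_loss(rendered, adapted_template, weight=0.5):
    # rendered, adapted_template: (3, H, W) images in [0, 1] for the same
    # orientation sector. Downsampling keeps this a soft, pattern-level cue
    # rather than a pixel-perfect match.
    r = F.avg_pool2d(rendered.unsqueeze(0), kernel_size=8)
    t = F.avg_pool2d(adapted_template.unsqueeze(0), kernel_size=8)
    return weight * F.mse_loss(r, t)
```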
The key contributions are: (1) explicitly integrating 3D shape and appearance information for multi-view consistent text-to-3D generation while maintaining the high-quality generation capabilities of the 2D diffusion model; (2) creative point growth and pruning during the co-supervision of 2D diffusion and 3D geometry, which lets the 2D diffusion prior produce shapes that are both accurate and creative; and (3) using the template's appearance patterns to modulate the diffusion model's output and resolve appearance ambiguities.

Sculpt3D is evaluated on the T3Bench benchmark, which contains 100 text prompts covering various types of single objects, against DreamFusion, Latent-NeRF, Magic3D, Fantasia3D, and ProlificDreamer. Quantitative evaluation includes the 3D consistent rate, the proportion of generated objects that are consistent across multiple views, which Sculpt3D improves significantly over the baselines. Overall, Sculpt3D outperforms these baselines in multi-view consistency, fidelity, and diversity, generating objects with accurate geometry from a wide range of text descriptions while preserving the generalizability of the 2D diffusion model.
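The 3D consistent rate reduces to a simple proportion. A minimal sketch, assuming a per-object boolean consistency judgment (how each object is judged, e.g. by human inspection of renders at several viewpoints, is not specified here):

```python
from typing import Sequence

def consistent_rate(judgments: Sequence[bool]) -> float:
    # judgments: one boolean per generated object, True if its renders
    # agree across the evaluated viewpoints.
    return sum(judgments) / len(judgments)

# e.g. 7 of 10 generated objects judged multi-view consistent -> 0.7
print(consistent_rate([True] * 7 + [False] * 3))
```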