VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation

25 Mar 2024 | Yang Chen, Yingwei Pan, Haibo Yang, Ting Yao, and Tao Mei
**Authors:** Yang Chen, Yingwei Pan, Haibo Yang, Ting Yao, and Tao Mei
**Institution:** HiDream.ai Inc., Fudan University

**Abstract:** Recent advances in text-to-3D generation leverage Score Distillation Sampling (SDS) to enable zero-shot learning of implicit 3D models (NeRF) by distilling prior knowledge from 2D diffusion models. However, current SDS-based models struggle with intricate text prompts and often produce distorted 3D models with unrealistic textures or cross-view inconsistencies. This work introduces VP3D, a novel visual prompt-guided text-to-3D diffusion model that explicitly exploits the visual appearance knowledge in 2D visual prompts to boost text-to-3D generation. Instead of supervising SDS with text prompts alone, VP3D first generates a high-quality image from the input text with a 2D diffusion model; this image then acts as a visual prompt that strengthens SDS optimization with explicit visual appearance. In addition, VP3D couples SDS optimization with a differentiable reward function that encourages rendered images of the 3D model to align more closely with the 2D visual prompt and to semantically match the text prompt. Extensive experiments show that VP3D significantly improves the visual fidelity of generated 3D models, with more detailed textures and better cross-view consistency. Notably, VP3D can also be adapted to stylized text-to-3D generation by replacing the self-generated visual prompt with a given reference image, producing 3D assets that semantically align with the text prompt while sharing the geometric and visual style of the reference image.
**Keywords:** Text-to-3D generation, Score Distillation Sampling, Visual Prompt, 2D Diffusion Models, 3D Model Optimization
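
To make the two-part objective described in the abstract concrete, below is a minimal PyTorch-style sketch of one VP3D-like optimization step. This is an illustration of the abstract's description only, not the paper's implementation: the interfaces `nerf`, `diffusion`, and `reward_model` are hypothetical stand-ins, and the timestep range and reward weight `lam_reward` are assumptions.

```python
# Minimal sketch of one VP3D-style optimization step (hypothetical interfaces).
# Control flow follows the abstract: (1) SDS conditioned on both the text
# embedding and the visual-prompt embedding, (2) a differentiable reward
# on the rendered image.
import torch


def vp3d_step(nerf, diffusion, reward_model,
              text_emb, vp_emb, visual_prompt, camera,
              lam_reward=0.1):
    # Render the current 3D model from a sampled camera pose.
    rendered = nerf.render(camera)                    # (B, 3, H, W), values in [0, 1]

    # --- Visual-prompt-guided SDS term ---
    t = torch.randint(20, 981, (rendered.shape[0],), device=rendered.device)
    noise = torch.randn_like(rendered)
    noisy = diffusion.add_noise(rendered, noise, t)   # forward process q(x_t | x_0)
    # Unlike vanilla text-only SDS, the denoiser also sees the visual-prompt
    # embedding, injecting explicit appearance guidance.
    eps_pred = diffusion.predict_noise(noisy, t, text_emb, vp_emb)
    # Standard SDS trick: treat (eps_pred - noise) as the gradient on the
    # render, without backpropagating through the diffusion model itself.
    grad = (eps_pred - noise).detach()
    sds_loss = (grad * rendered).sum()

    # --- Differentiable reward term ---
    # Scores how well the render matches the visual prompt and the text;
    # maximizing the reward is done by minimizing its negative.
    reward = reward_model(rendered, visual_prompt, text_emb)
    return sds_loss - lam_reward * reward.mean()
```

For stylized text-to-3D generation, the same loop would simply receive a user-provided reference image as `visual_prompt` (and its embedding as `vp_emb`) instead of an image generated from the text.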