This survey provides a comprehensive overview of text-to-3D shape generation methods, grouping them into three families by the supervision they require: 3DPT (trained on paired text and 3D data), 3DUT (trained on 3D data without text pairing), and No3D (using no 3D data and relying instead on pretrained text-image models). It discusses the challenges and limitations of each family, as well as promising directions for future research.
Text-to-3D generation has seen significant progress thanks to advances in 3D representations, large-scale pretraining, and differentiable rendering. Challenges remain, however, in generating high-quality 3D shapes without explicit 3D training data and in making the generated outputs naturally editable. Recent work addresses both by learning shape priors from large 3D datasets and combining them with text-to-3D methods that require no 3D data.
The survey summarizes four key components of text-to-3D generation: training data, 3D representation type, generative model, and training setup. Methods are categorized by the type of supervision data they require, and the properties of each family are discussed in turn. The No3D family receives the most attention, as it has not been addressed in detail by prior surveys. Methods in this family typically optimize a differentiable 3D representation under guidance from a pretrained text-image model, using CLIP similarity or diffusion-based score distillation as the loss.
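To make the No3D recipe concrete, below is a minimal PyTorch sketch of diffusion-based score distillation in the spirit of SDS (DreamFusion). Everything here is an illustrative assumption rather than any specific system's API: `render_fn` stands in for a differentiable renderer, `denoise` for a frozen pretrained text-conditioned denoiser, and the usage stub uses toy components. A CLIP-guided variant would simply replace the distillation loss with a negative image-text similarity.

```python
import torch
import torch.nn.functional as F

def sds_loss(render_fn, theta, camera, denoise, text_emb, alphas_cumprod):
    """One score-distillation step: nudge the rendered image toward the
    text-conditioned image distribution of a frozen diffusion model."""
    x = render_fn(theta, camera)                       # differentiable render, (B, C, H, W)
    t = torch.randint(20, 980, (x.shape[0],), device=x.device)  # random timestep
    a_t = alphas_cumprod.to(x.device)[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a_t.sqrt() * x + (1.0 - a_t).sqrt() * eps    # forward-diffuse the render
    with torch.no_grad():
        eps_hat = denoise(x_t, t, text_emb)            # frozen model's noise estimate
    grad = (1.0 - a_t) * (eps_hat - eps)               # SDS gradient w.r.t. the image
    # Reparameterize as an MSE so autograd routes `grad` back through the renderer.
    target = (x - grad).detach()
    return 0.5 * F.mse_loss(x, target, reduction="sum")

# Hypothetical usage with toy stand-ins (not a real pretrained model):
theta = torch.randn(1, 3, 64, 64, requires_grad=True)  # toy "scene parameters"
render = lambda p, cam: torch.tanh(p)                  # toy differentiable "renderer"
denoise = lambda x_t, t, e: torch.zeros_like(x_t)      # stand-in frozen denoiser
schedule = torch.linspace(0.999, 0.01, 1000)           # toy cumulative-alpha schedule
loss = sds_loss(render, theta, None, denoise, torch.zeros(1, 512), schedule)
loss.backward()                                        # gradients flow into theta
```

In practice, `render_fn` would render a NeRF or mesh from a sampled camera and `denoise` would be a large pretrained text-to-image diffusion model; the guidance model stays frozen throughout optimization, which is what lets No3D methods avoid 3D training data entirely.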
The survey also discusses emerging work on generating multi-object 3D scenes and on enabling editing of the output 3D shape in various ways. It presents a brief overview of evaluation methods for text-to-3D shape generation and concludes with a discussion of promising future directions. The survey highlights the importance of 3D representations, deep generative models, and guidance models in text-to-3D generation, and discusses various approaches to generating 3D shapes from text, including GANs, VAEs, and diffusion models.