2 Mar 2024 | Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi
SceneCraft is an LLM agent that converts text descriptions into Blender-executable Python scripts to render complex 3D scenes. It addresses the challenge of spatial planning and arrangement by combining advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph to capture spatial relationships among assets, then generates Python scripts that translate these relationships into numerical constraints. A multimodal LLM analyzes the rendered images and iteratively refines the scene. SceneCraft also features a library-learning mechanism that compiles common script functions into a reusable library, enabling continuous self-improvement without expensive LLM parameter tuning. Evaluation shows that SceneCraft outperforms existing LLM-based agents at rendering complex scenes, as measured by constraint adherence and favorable human assessments. SceneCraft also shows broader application potential, such as reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signals.

SceneCraft's dual-loop optimization pipeline pairs an inner loop for per-scene layout optimization with an outer loop that dynamically expands its spatial skill library. In the inner loop, an LLM-based planner constructs a scene graph outlining spatial constraints for asset arrangement. SceneCraft then writes Python code to transform these relations into numerical constraints, which are fed to a specialized solver to determine the layout parameters of each asset. After the scripts are rendered into images via Blender, a multimodal LLM assesses the alignment between each generated image and the textual description. If a misalignment is detected, the LLM identifies the problematic semantic relations and corresponding constraints, then refines the scripts accordingly.
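To make the relation-to-constraint step concrete, here is a minimal hypothetical sketch (not the paper's actual code): spatial relations from the scene graph become numeric scoring functions over asset positions, and a toy random-search solver stands in for SceneCraft's specialized constraint solver. All function and asset names here are illustrative assumptions.

```python
import random

# Hypothetical sketch: each asset's layout is a 2D position (x, y); spatial
# relations from the scene graph become numeric constraint scores in [0, 1].

def left_of(a, b):
    """Score 1.0 when asset a lies to the left of asset b."""
    return 1.0 if a[0] < b[0] else 0.0

def near(a, b, radius=2.0):
    """Score 1.0 when the two assets are within `radius` of each other."""
    dist = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return 1.0 if dist <= radius else 0.0

def solve_layout(constraints, assets, trials=2000, seed=0):
    """Toy solver: sample random placements and keep the best-scoring one.
    SceneCraft uses a dedicated solver; random search is just for illustration."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(trials):
        layout = {name: (rng.uniform(-5, 5), rng.uniform(-5, 5)) for name in assets}
        score = sum(fn(layout[a], layout[b]) for fn, a, b in constraints)
        if score > best_score:
            best, best_score = layout, score
    return best, best_score

# Usage: "a lamp to the left of, and near, a desk"
constraints = [(left_of, "lamp", "desk"), (near, "lamp", "desk")]
layout, score = solve_layout(constraints, ["lamp", "desk"])
```

The key design point this illustrates is that relations are scored, not hard-coded: the solver can trade constraints off against one another when a description is over-constrained.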
This iterative process of refinement and feedback is crucial for enhancing the scene's fidelity. SceneCraft also learns a spatial skill library through outer-loop learning, integrating common code patterns to streamline the acquisition of new non-parametric skills for self-improvement. SceneCraft is evaluated on synthetic and real-world datasets, showing superior sample efficiency and accuracy in rendering intricate 3D scenes from textual descriptions. It achieves significant improvements in CLIP score and constraint passing score compared to a baseline LLM agent, BlenderGPT. SceneCraft's approach enables it to handle increasingly complex scenes and descriptions without external human expertise or LLM parameter tuning. The paper's contributions include an LLM agent that transforms input text queries into 3D scenes by generating Blender scripts, a spatial skill library learned from synthetic input queries, and experimental results showing SceneCraft's superior performance compared to BlenderGPT.
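The render-review-refine cycle described above can be sketched as a short control loop. This is a hypothetical outline, not SceneCraft's implementation: `render`, `review`, and `refine` are stand-ins for Blender execution, the multimodal LLM critic, and the script-rewriting step, respectively.

```python
# Hypothetical sketch of the inner loop: render the current script, ask a
# reviewer (standing in for a multimodal LLM) for misaligned relations, and
# refine the script until no issues remain or the round budget is exhausted.

def inner_loop(script, render, review, refine, max_rounds=5):
    for _ in range(max_rounds):
        image = render(script)
        issues = review(image, script)  # list of misaligned relations, if any
        if not issues:
            break
        script = refine(script, issues)
    return script
```

Bounding the loop with `max_rounds` matters in practice, since a critic and a refiner can otherwise oscillate indefinitely on ambiguous descriptions.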
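The outer-loop library learning can likewise be sketched at a high level. In this hypothetical illustration (names and the promotion rule are assumptions, not the paper's method), constraint functions that recur across generated scripts are promoted into a persistent skill library so later scenes can reuse them instead of regenerating the code.

```python
from collections import Counter

# Hypothetical sketch of outer-loop library learning: functions that recur
# across many generated scripts are promoted into a reusable skill library.

def update_library(library, generated_functions, min_uses=2):
    """Promote any function name seen at least `min_uses` times.
    In practice the library would store the abstracted function body,
    not just a usage count."""
    counts = Counter(generated_functions)
    for name, uses in counts.items():
        if uses >= min_uses and name not in library:
            library[name] = uses
    return library

# Usage: function names emitted while scripting several scenes.
library = {}
seen = ["left_of", "near", "left_of", "on_top_of", "near", "near"]
update_library(library, seen)
```

Because the library grows non-parametrically, this kind of self-improvement requires no gradient updates to the LLM itself, which is the point the summary makes about avoiding parameter tuning.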