14 Feb 2024 | Yutaro Yamada, Khyathi Chandu, Yuchen Lin, Jack Hessel, Ilker Yildirim, Yejin Choi
L3GO is a language agent that uses chain-of-3D-thoughts to generate unconventional objects. It targets a limitation of existing diffusion models, which struggle to generate objects with precise 3D spatial configurations, such as "a chair with five legs." The agent uses large language models to iteratively build 3D meshes in a simulation environment, guided by feedback from that environment. A new benchmark, Unconventionally Feasible Objects (UFO), was developed to evaluate the performance of text-to-3D models. L3GO outperforms other models, including DALL-E 3 and Stable Diffusion-XL, in both human and automatic evaluations.
The agent constructs objects in a structured way: it breaks each object into parts, specifies the size and spatial relationships of those parts, and iteratively refines the design. It integrates with Blender to build the 3D meshes, and the rendered meshes are turned into 2D images using ControlNet. Experiments on ShapeNet show that L3GO performs better than GPT-4 and other agents in generating 3D objects. On the UFO benchmark, L3GO surpasses state-of-the-art text-to-2D and text-to-3D models. The research highlights the potential of integrating language agents into diffusion model pipelines for generating objects with specific attribute requirements.
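The part-by-part construction loop described above can be illustrated with a small toy sketch. This is not the authors' implementation: it does not use Blender or a language model, and every name in it (`Box`, `touches`, `feedback`, `propose_leg`, `build_chair`) is hypothetical. Parts are plain axis-aligned boxes, the "environment feedback" is assumed to be a simple connectivity check, and `propose_leg` is a scripted stand-in for the LLM call that deliberately makes a mistake on its first attempt so the retry path is exercised.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box: a hypothetical stand-in for one mesh part."""
    name: str
    center: tuple  # (x, y, z)
    size: tuple    # (width, depth, height)

    def bounds(self):
        # Per-axis (low, high) extents of the box.
        return [(c - s / 2, c + s / 2) for c, s in zip(self.center, self.size)]


def touches(a, b, tol=1e-6):
    """True if two boxes overlap or share a boundary on every axis."""
    return all(
        lo_a <= hi_b + tol and lo_b <= hi_a + tol
        for (lo_a, hi_a), (lo_b, hi_b) in zip(a.bounds(), b.bounds())
    )


def feedback(placed, candidate):
    """Toy environment check (an assumption of this sketch): a new part
    must touch at least one part that is already placed."""
    if not placed or any(touches(candidate, p) for p in placed):
        return None
    return f"{candidate.name} is not connected to any existing part"


def propose_leg(i, n_legs, seat, attempt):
    """Scripted stand-in for the LLM proposal step. The first attempt
    leaves a gap under the seat; later attempts close it."""
    angle = 2 * math.pi * i / n_legs
    radius = 0.4 * seat.size[0]
    leg_height = 0.45
    seat_bottom = seat.center[2] - seat.size[2] / 2
    gap = 0.05 if attempt == 0 else 0.0
    z = seat_bottom - gap - leg_height / 2
    center = (radius * math.cos(angle), radius * math.sin(angle), z)
    return Box(f"leg_{i}", center, (0.05, 0.05, leg_height))


def build_chair(n_legs=5, max_attempts=3):
    """Place a seat, then add legs one at a time, retrying on feedback."""
    seat = Box("seat", (0.0, 0.0, 0.475), (0.5, 0.5, 0.05))
    placed, log = [seat], []
    for i in range(n_legs):
        for attempt in range(max_attempts):
            candidate = propose_leg(i, n_legs, seat, attempt)
            message = feedback(placed, candidate)
            if message is None:
                placed.append(candidate)
                break
            log.append(message)  # would be fed back to the proposer
        else:
            raise RuntimeError(f"could not place leg_{i}")
    return placed, log


if __name__ == "__main__":
    parts, messages = build_chair(n_legs=5)
    print(len(parts), "parts placed;", len(messages), "placements rejected")
```

With `n_legs=5`, the sketch places a seat plus five legs, and each leg is rejected once before its corrected placement is accepted. In a real pipeline the scripted proposer would be replaced by a language model that reads the feedback message, and the box list would be replaced by mesh primitives in the simulation environment.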