Disentangled 3D Scene Generation with Layout Learning


26 Feb 2024 | Dave Epstein, Ben Poole, Ben Mildenhall, Alexei A. Efros, Aleksander Holynski
This paper introduces a method for generating disentangled 3D scenes via layout learning. The approach leverages a large pretrained text-to-image model and decomposes generated scenes into individual objects. The key insight is that objects can be identified as parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, the method jointly optimizes multiple NeRFs (Neural Radiance Fields) from scratch, each representing a separate object, together with a set of layouts that composite these objects into scenes; the composited scenes are encouraged to remain in-distribution according to the image generator.

This enables new capabilities in text-to-3D content creation, including object-level scene manipulation. The method is validated on several tasks: building a scene around an existing 3D asset, sampling different arrangements for a given set of assets, and parsing a provided NeRF into its constituent objects. The results show that layout learning yields effective object disentanglement, producing meaningful decompositions of generated 3D scenes without any supervision beyond a text prompt.

Evaluated on a range of text prompts, the approach generates high-quality, disentangled 3D scenes and proves useful for 3D editing tasks such as arranging and modifying scenes. Layout learning achieves competitive performance in object disentanglement and appearance, outperforming baselines. The paper also discusses limitations and ethical implications, highlighting the need for careful consideration of potential applications.
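To make the joint optimization concrete, below is a minimal PyTorch sketch of the general idea: several small per-object radiance fields are trained together with learned layouts (per-object transforms) that place them into a shared scene. This is not the authors' implementation; the class names (TinyNeRF, Layout), the simplified translation-plus-scale layout, and the placeholder loss are illustrative assumptions, and the real method would render the composited scene and supervise it with score distillation from the pretrained text-to-image model (indicated by the commented-out, hypothetical volume_render and sds_loss calls).

```python
# Minimal sketch of joint "objects + layouts" optimization (assumptions noted above).
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """One small MLP radiance field per object (density + RGB)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (density, r, g, b)
        )

    def forward(self, xyz):
        out = self.mlp(xyz)
        density = torch.relu(out[..., :1])
        rgb = torch.sigmoid(out[..., 1:])
        return density, rgb

class Layout(nn.Module):
    """Learned per-object placement (simplified here to translation + log-scale)."""
    def __init__(self, n_objects):
        super().__init__()
        self.translation = nn.Parameter(torch.zeros(n_objects, 3))
        self.log_scale = nn.Parameter(torch.zeros(n_objects, 1))

    def apply(self, k, xyz):
        # Map world-space sample points into object k's canonical frame.
        return (xyz - self.translation[k]) * torch.exp(-self.log_scale[k])

n_objects, n_layouts = 4, 2
objects = nn.ModuleList(TinyNeRF() for _ in range(n_objects))
layouts = nn.ModuleList(Layout(n_objects) for _ in range(n_layouts))
optimizer = torch.optim.Adam(
    list(objects.parameters()) + list(layouts.parameters()), lr=1e-3
)

for step in range(1_000):
    layout = layouts[step % n_layouts]     # alternate among learned arrangements
    xyz = torch.rand(4096, 3) * 2 - 1      # stand-in for ray sample points

    # Composite: every point queries every object in that object's frame;
    # densities add, and colors are density-weighted.
    density, rgb = 0.0, 0.0
    for k, obj in enumerate(objects):
        d_k, c_k = obj(layout.apply(k, xyz))
        density = density + d_k
        rgb = rgb + d_k * c_k
    rgb = rgb / (density + 1e-8)

    # image = volume_render(density, rgb)        # hypothetical compositing renderer
    # loss = sds_loss(image, prompt_embedding)   # hypothetical text-to-image guidance
    loss = (density.mean() - 1.0).pow(2)         # placeholder so the sketch runs

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the image-space guidance only sees the composited renders, the gradient pressure to keep every sampled arrangement "in-distribution" is what pushes each radiance field toward representing a single, self-contained object.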