5 Feb 2024 | RIO AGUINA-KANG, MAXIM GUMIN, DO HEON HAN, STEWART MORRIS, SEUNG JEAN YOO, ADITYA GANESHAN, R. KENNY JONES, QIHONG ANNA WEI, KAILIANG FU, DANIEL RITCHIE
This paper presents an open-universe indoor scene generation system that produces 3D indoor scenes from text prompts. The system uses pre-trained large language models (LLMs) and vision-language models (VLMs) to synthesize programs in a domain-specific layout language; these programs describe the objects in a scene and the spatial relations between them. Executing a program poses a constraint satisfaction problem whose solution determines object positions and orientations. The system then retrieves 3D meshes from a large, unannotated database, using VLMs to match each object's specification. In both closed-universe and open-universe scene generation tasks, the system outperforms generative models trained on 3D scene data as well as a recent LLM-based layout generation method.
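As a concrete illustration, the sketch below encodes such a program declaratively: objects are declared by text description and constrained only through spatial relations, never coordinates. The `Scene`/`Obj` classes and the relation vocabulary (`against_wall`, `facing`, `next_to`) are illustrative assumptions, not the paper's actual DSL.

```python
# A minimal, hypothetical encoding of a scene program: objects are declared
# by text description, and the layout is constrained only through spatial
# relations. These class and relation names are assumptions for
# illustration; the paper's DSL may differ.
from dataclasses import dataclass, field

@dataclass
class Obj:
    description: str  # text spec, later used for VLM-based mesh retrieval

@dataclass
class Scene:
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)  # (name, *objects) tuples

    def obj(self, description):
        o = Obj(description)
        self.objects.append(o)
        return o

    def relate(self, name, *args):
        self.relations.append((name, *args))

# A "program" for a simple bedroom, stated entirely through relations:
scene = Scene()
bed     = scene.obj("queen-size bed with a wooden frame")
dresser = scene.obj("six-drawer oak dresser")
lamp    = scene.obj("small bedside reading lamp")

scene.relate("against_wall", bed)
scene.relate("against_wall", dresser)
scene.relate("facing", dresser, bed)  # dresser front faces the bed
scene.relate("next_to", lamp, bed)    # lamp stands beside the bed
```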
The system is designed to generate a wide variety of indoor scenes: common spaces like bedrooms, specialized spaces like a musician's practice room, and even fantastical scenes like a wizard's lair. It uses a declarative domain-specific language (DSL) that describes scenes through spatial relations rather than explicit coordinates. The system comprises a program synthesizer that generates scene programs from natural language descriptions, a layout optimizer that solves the resulting constraint satisfaction problems, and a module that retrieves and orients 3D meshes from a large, unannotated database.
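One plausible way to realize the layout optimizer, continuing the `Scene` encoding sketched above, is to relax the constraint satisfaction problem into soft penalties and minimize total violation with a general-purpose optimizer. The room size, penalty forms, and target distances below are assumptions, orientations are omitted for brevity, and the paper's actual solver may work differently.

```python
# Sketch of the layout solve, continuing the Scene encoding above: each
# object gets a 2D position, each relation becomes a soft penalty, and a
# general-purpose optimizer drives the total violation toward zero. Room
# size, penalty forms, and target distances are assumptions; orientations
# are omitted for brevity.
import numpy as np
from scipy.optimize import minimize

ROOM = 4.0  # assumed ROOM x ROOM square room with walls at 0 and ROOM

def penalties(flat, relations, index):
    pos = flat.reshape(-1, 2)
    total = 0.0
    for name, *args in relations:
        ids = [index[id(o)] for o in args]
        if name == "against_wall":
            x, y = pos[ids[0]]
            total += min(x, ROOM - x, y, ROOM - y) ** 2  # distance to nearest wall
        elif name in ("next_to", "facing"):
            d = np.linalg.norm(pos[ids[0]] - pos[ids[1]])
            target = 0.8 if name == "next_to" else 2.0  # meters, assumed
            total += (d - target) ** 2
    return total

def solve_layout(scene, seed=0):
    rng = np.random.default_rng(seed)
    index = {id(o): i for i, o in enumerate(scene.objects)}
    x0 = rng.uniform(0.5, ROOM - 0.5, size=2 * len(scene.objects))
    res = minimize(penalties, x0, args=(scene.relations, index),
                   method="Nelder-Mead")  # robust to the non-smooth wall term
    return res.x.reshape(-1, 2)

for o, (x, y) in zip(scene.objects, solve_layout(scene)):
    print(f"{o.description:40s} -> ({x:.2f}, {y:.2f})")
```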
The system's contributions include a DSL for specifying indoor scene layouts, a robust prompting workflow that leverages LLMs to synthesize programs, a pipeline using pretrained VLMs for retrieving and orienting 3D meshes, and protocols for evaluating open-universe indoor scene synthesis systems. The system is evaluated through perceptual studies and ablation studies, showing its effectiveness in generating realistic indoor scenes. The code will be made available as open source upon publication.
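The retrieval step can be sketched with an off-the-shelf vision-language model: embed a rendered thumbnail of each candidate mesh and the object's text description into a shared space, then pick the nearest neighbor. CLIP here is a stand-in for whichever VLM the system actually uses, and the model choice and thumbnail paths are assumptions for illustration.

```python
# Sketch of VLM-based mesh retrieval, assuming each candidate mesh has been
# rendered to a thumbnail offline. CLIP is used here as a stand-in VLM;
# the model choice and file paths are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_thumbnails(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve(description, paths, thumb_feats, k=1):
    inputs = processor(text=[description], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = thumb_feats @ q.squeeze(0)  # cosine similarity per thumbnail
    best = scores.topk(min(k, len(paths))).indices.tolist()
    return [paths[i] for i in best]

# Hypothetical usage against an unannotated mesh database:
# thumbs = ["db/mesh_0001.png", "db/mesh_0002.png", ...]
# feats = embed_thumbnails(thumbs)
# print(retrieve("six-drawer oak dresser", thumbs, feats))
```

Because matching happens in a joint text-image embedding space, the database needs no category labels, which is what makes retrieval from a large, unannotated mesh collection feasible.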