INSTRUCTSCENE is a generative framework for 3D indoor scene synthesis driven by natural language instructions. It integrates a semantic graph prior with a layout decoder to improve the controllability and fidelity of generated scenes. The semantic graph prior jointly learns scene appearance and layout distributions, enabling a range of downstream tasks in a zero-shot manner. To facilitate benchmarking, a high-quality dataset of scene-instruction pairs is curated with the help of large language and multimodal models.

Generation proceeds in two stages: the semantic graph prior, learned through feature quantization and discrete diffusion, samples a semantic graph aligned with the instruction, and the layout decoder then produces precise object layouts conditioned on that graph. Extensive experiments on three room types show that the method outperforms existing approaches in generation controllability and fidelity, and ablation studies confirm the effectiveness of its key design components. Beyond instruction-conditioned synthesis, INSTRUCTSCENE supports zero-shot applications including stylization, re-arrangement, completion, and unconditional generation, providing a user-friendly interface for practical settings such as interior design and immersive metaverse experiences. The work also discusses limitations and directions for future research.
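To make the two-stage pipeline concrete, below is a minimal, illustrative sketch, not the authors' implementation. It reduces the semantic graph to per-object category tokens (graph edges and appearance features are omitted), replaces the full discrete diffusion reverse process with a simplified mask-and-unmask sampler, and uses a random vector as a stand-in for the instruction text embedding. All module and parameter names (`GraphDenoiser`, `LayoutDecoder`, `generate_scene`, `NUM_CATEGORIES`, etc.) are hypothetical.

```python
# Illustrative two-stage sketch in the spirit of INSTRUCTSCENE:
# Stage 1 samples discrete object categories with a toy absorbing-state
# (mask-and-replace) discrete diffusion sampler; Stage 2 decodes continuous
# layout attributes (e.g. position, size, rotation) from the sampled graph.
import torch
import torch.nn as nn

NUM_CATEGORIES = 16   # assumed object-category vocabulary size
MAX_OBJECTS = 8       # assumed maximum number of objects per scene
LAYOUT_DIM = 8        # e.g. 3 (position) + 3 (size) + 2 (sin/cos rotation)


class GraphDenoiser(nn.Module):
    """Predicts clean category logits from noisy categories + instruction."""
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(NUM_CATEGORIES + 1, hidden)  # +1 for [MASK]
        self.cond = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, NUM_CATEGORIES)

    def forward(self, noisy_cats, instruction_emb):
        h = self.embed(noisy_cats) + self.cond(instruction_emb).unsqueeze(1)
        return self.out(h)  # (batch, MAX_OBJECTS, NUM_CATEGORIES)


class LayoutDecoder(nn.Module):
    """Maps sampled object categories to per-object layout attributes."""
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(NUM_CATEGORIES, hidden)
        self.head = nn.Linear(hidden, LAYOUT_DIM)

    def forward(self, cats):
        return self.head(self.embed(cats))  # (batch, MAX_OBJECTS, LAYOUT_DIM)


@torch.no_grad()
def generate_scene(denoiser, decoder, instruction_emb, steps=4):
    """Simplified discrete-diffusion sampling followed by layout decoding."""
    mask_id = NUM_CATEGORIES
    cats = torch.full((1, MAX_OBJECTS), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = denoiser(cats, instruction_emb)
        sampled = torch.distributions.Categorical(logits=logits).sample()
        # Unmask a growing fraction of positions as denoising proceeds.
        unmask = torch.rand(1, MAX_OBJECTS) < (step + 1) / steps
        cats = torch.where((cats == mask_id) & unmask, sampled, cats)
    # Any position still masked takes the final prediction.
    cats = torch.where(cats == mask_id, sampled, cats)
    layout = decoder(cats)
    return cats, layout


if __name__ == "__main__":
    torch.manual_seed(0)
    denoiser, decoder = GraphDenoiser(), LayoutDecoder()
    instruction_emb = torch.randn(1, 64)  # stand-in for a text-encoder output
    cats, layout = generate_scene(denoiser, decoder, instruction_emb)
    print("categories:", cats.tolist())
    print("layout shape:", tuple(layout.shape))
```

Keeping the graph variables discrete is what allows the same prior to be reused for stylization, re-arrangement, and completion: partially known scenes can be encoded as fixed tokens while the sampler fills in the rest, without any task-specific retraining.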