EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion


2 May 2024 | Guangyao Zhai, Evin Pinar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam
**Authors:** Guangyao Zhai, Evin Pinar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam
**Institution:** Technical University of Munich, Ludwig Maximilian University of Munich, Google

**Abstract:** EchoScene is an interactive and controllable generative model that generates 3D indoor scenes from scene graphs. It leverages a dual-branch diffusion model that dynamically adapts to scene graphs, addressing the challenges of varying node numbers, multiple edge combinations, and manipulator-induced node-edge operations. Each node is associated with a denoising process, enabling collaborative information exchange through an information echo scheme. This scheme ensures that denoising processes are aware of global constraints, facilitating the generation of coherent scenes. The model supports scene manipulation during inference by editing the input scene graph and sampling noise in the diffusion model. Extensive experiments validate the approach, demonstrating superior generation fidelity and robustness compared to previous methods.

**Contributions:**
1. **EchoScene:** A scene generation method with a dual-branch diffusion model on dynamic scene graphs, capable of generating layouts and shapes with enhanced controllability.
2. **Information Echo Scheme:** Introduces an information echo scheme within each branch to allow multiple denoising processes to exchange denoising status, enhancing global awareness.
3. **Performance:** Achieves higher generation fidelity and better inter-object consistency, outperforming state-of-the-art methods in various metrics.

**Related Work:**
- **Semantic Scene Graph:** Scene graphs are used for semantic scene understanding, offering structured representations of scenes through nodes and edges.
- **Diffusion Models:** Diffusion models have been applied to generate realistic 3D content, with recent advances focusing on improving flexibility and realism.
- **Controllability in Scene Synthesis:** Methods explore various strategies to create controllable scene outputs, including text descriptions, spatial layouts, and probabilistic grammars.

**Preliminary:**
- **Scene Graph Convolution:** A triplet-GCN is used to process semantic scene graphs, providing latent relation embeddings for each node (see the sketch after this list).
- **Contextual Graph:** CLIP features are used to infer semantic information for each node and triplet in the graph, enhancing contextual understanding.
- **Conditional Diffusion Models:** Diffusion models learn to estimate target distributions through a progressive Markov process, guided by conditionals.
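To make the scene-graph convolution concrete, below is a minimal PyTorch sketch of a triplet-based message-passing layer in the spirit of triplet-GCN. All names and feature sizes (`TripletGCNLayer`, `triplet_mlp`, 64-dimensional features) are illustrative assumptions rather than the authors' implementation; a full model would typically stack several such layers and feed in CLIP-derived node and predicate features.

```python
# Minimal sketch of triplet-based graph convolution (illustrative, not the
# paper's code). Each (subject, predicate, object) triplet is processed by an
# MLP, and the resulting messages are averaged back onto the nodes, yielding
# per-node relation embeddings.
import torch
import torch.nn as nn


class TripletGCNLayer(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.node_dim = node_dim
        self.edge_dim = edge_dim
        # Maps each concatenated (subject, predicate, object) triplet to
        # updated subject-message, predicate, and object-message features.
        hidden = 2 * node_dim + edge_dim
        self.triplet_mlp = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, nodes, edges, edge_index):
        # nodes: (N, node_dim), edges: (E, edge_dim)
        # edge_index: (2, E), rows are (subject index, object index)
        subj, obj = edge_index
        triplets = torch.cat([nodes[subj], edges, nodes[obj]], dim=-1)
        out = self.triplet_mlp(triplets)
        msg_subj, new_edges, msg_obj = torch.split(
            out, [self.node_dim, self.edge_dim, self.node_dim], dim=-1
        )
        # Mean-aggregate the messages each node receives from its triplets.
        agg = torch.zeros_like(nodes)
        count = torch.zeros(nodes.size(0), 1)
        agg.index_add_(0, subj, msg_subj)
        agg.index_add_(0, obj, msg_obj)
        count.index_add_(0, subj, torch.ones(subj.size(0), 1))
        count.index_add_(0, obj, torch.ones(obj.size(0), 1))
        new_nodes = agg / count.clamp(min=1.0)
        return new_nodes, new_edges


# Toy usage: 3 objects, 2 relations such as "left of" and "close by".
layer = TripletGCNLayer(node_dim=64, edge_dim=64)
nodes = torch.randn(3, 64)                    # e.g. CLIP-derived node features
edges = torch.randn(2, 64)                    # e.g. CLIP-derived predicate features
edge_index = torch.tensor([[0, 1], [1, 2]])   # (subject, object) index pairs
nodes, edges = layer(nodes, edges, edge_index)
```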
**Method:**
- **Graph Preprocessing:** Encodes the contextual graph into latent relation embeddings using the triplet-GCN.
- **Information Echo Scheme:** Introduces an information exchange unit to enable dynamic and interactive diffusion processes among graph elements (see the sketch after this list).
- **Layout Branch:** Models layout generation by setting each node with its own denoising process, encouraging interaction through layout echoes.
- **Shape Branch:** Models shape generation by pretraining a VQ-VAE to encode object shapes into latent codes, which are then denoised per node with interaction through shape echoes.
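To illustrate how per-node denoising processes can exchange information at every timestep, here is a minimal, self-contained sketch of an echo-style denoising step. It is a hypothetical simplification, not the authors' code: the names (`EchoDenoisingStep`, `exchange`, `denoiser`) are made up, a single self-attention layer stands in for the paper's graph-based exchange unit, and the update rule in the loop is schematic rather than a proper DDPM/DDIM sampler.

```python
# Sketch of the information echo idea: every node runs its own denoising
# process, and at each timestep the intermediate states are shared through an
# exchange unit so each per-node denoiser sees an "echo" of the global scene
# state before predicting its noise.
import torch
import torch.nn as nn


class EchoDenoisingStep(nn.Module):
    def __init__(self, state_dim: int, cond_dim: int):
        super().__init__()
        # Information exchange unit: mixes all per-node states. A single
        # self-attention layer stands in for a graph-based exchange here.
        self.exchange = nn.MultiheadAttention(state_dim, num_heads=4, batch_first=True)
        # Per-node noise predictor conditioned on the echo and the node's
        # latent relation embedding from the graph encoder.
        self.denoiser = nn.Sequential(
            nn.Linear(2 * state_dim + cond_dim, 256),
            nn.SiLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, x_t, cond):
        # x_t:  (N, state_dim)  noisy per-node states at timestep t
        # cond: (N, cond_dim)   latent relation embeddings per node
        echo, _ = self.exchange(x_t.unsqueeze(0), x_t.unsqueeze(0), x_t.unsqueeze(0))
        echo = echo.squeeze(0)
        eps = self.denoiser(torch.cat([x_t, echo, cond], dim=-1))
        return eps  # predicted noise for every node


# Toy reverse loop over a scene with 4 objects.
step = EchoDenoisingStep(state_dim=8, cond_dim=64)
x = torch.randn(4, 8)        # e.g. noisy per-object layout parameters
cond = torch.randn(4, 64)    # relation embeddings from the graph encoder
for t in range(50, 0, -1):
    eps = step(x, cond)
    x = x - 0.02 * eps       # schematic update; a real sampler follows DDPM/DDIM
```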