**Comp4D: LLM-Guided Compositional 4D Scene Generation**
Dejia Xu, Hanwen Liang, Neel P. Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N. Plataniotis, and Zhangyang Wang
**Abstract:**
Recent advancements in diffusion models for 2D and 3D content creation have sparked interest in generating 4D content. However, the scarcity of 3D scene datasets limits current methods to object-centric generation. To address this, we present Comp4D, a novel framework for compositional 4D scene generation. Unlike traditional methods that generate a single 4D representation of the entire scene, Comp4D constructs each 4D object separately within the scene. Utilizing Large Language Models (LLMs), the framework decomposes an input text prompt into distinct entities and maps out their trajectories. It then constructs the 4D scene by accurately positioning these objects along their designated paths. To refine the scene, our method employs a compositional score distillation technique guided by pre-defined trajectories, utilizing pre-trained diffusion models across text-to-image, text-to-video, and text-to-3D domains. Extensive experiments demonstrate superior visual quality, motion fidelity, and enhanced object interactions compared to prior art.
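As a rough illustration of the decomposition step, the sketch below shows the kind of structured output one might request from an LLM for a hypothetical prompt. The JSON schema, field names, and the example prompt are assumptions made for illustration only, not Comp4D's actual interface.

```python
import json

# Hypothetical structured output an LLM (e.g. GPT-4) might return when asked to
# decompose a prompt such as "a butterfly flies towards a blooming flower".
# The schema (entities / trajectory / params) is an assumption for illustration.
llm_response = """
{
  "entities": [
    {"name": "butterfly", "trajectory": "parabolic",
     "params": {"start": [-1.0, 0.5, 0.0], "end": [0.0, 0.2, 0.0]}},
    {"name": "flower", "trajectory": "static",
     "params": {"start": [0.0, 0.0, 0.0], "end": [0.0, 0.0, 0.0]}}
  ]
}
"""

scene_spec = json.loads(llm_response)
for obj in scene_spec["entities"]:
    # Each entity is generated as a separate 3D asset and later placed along
    # its designated path when composing the 4D scene.
    print(obj["name"], "->", obj["trajectory"], obj["params"])
```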
**Introduction:**
Recent advances in text-to-image diffusion models have revolutionized generative AI, simplifying digital content creation. In 3D, however, the focus remains on static assets due to the scarcity of comprehensive scene-level 3D datasets. While diffusion models have also made rapid progress in video generation, adapting them to 4D content creation remains largely unexplored. Comp4D extends the boundaries to the challenging task of constructing compositional 4D scenes by disentangling the process into scene decomposition and motion generation with object interactions. Our approach offloads the design of global displacement to an LLM, reducing the workload on the deformation modules. Formulating each object as a set of deformable 3D Gaussians enables flexible switching between single-object and multi-object renderings, facilitating stable optimization of object motion even in the presence of occlusions.
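A minimal sketch of this motion decomposition is shown below, assuming NumPy arrays stand in for per-object Gaussian centers and simple lambdas stand in for the LLM-designed trajectory and the learned deformation field; the actual Gaussian-splatting renderer is omitted.

```python
import numpy as np

def displace(centers, trajectory_fn, t):
    """Global displacement: rigidly shift an object's Gaussian centers by the
    offset given by its LLM-designed trajectory at time t."""
    return centers + trajectory_fn(t)

def deform(centers, t, deform_fn):
    """Local deformation: a learned field (here a toy stand-in function)
    predicts small per-Gaussian offsets conditioned on position and time."""
    return centers + deform_fn(centers, t)

# Toy stand-ins; in the actual method the deformation field is a trainable
# network and the trajectory comes from an LLM-filled kinematics template.
butterfly_centers = np.random.randn(1000, 3) * 0.1
flower_centers = np.random.randn(1500, 3) * 0.1

trajectory = lambda t: np.array([-1.0 + t, 0.5 - 0.3 * t, 0.0])           # toy path
local_motion = lambda x, t: 0.01 * np.sin(2 * np.pi * (t + x[:, :1]))     # toy offsets

t = 0.5
butterfly_t = deform(displace(butterfly_centers, trajectory, t), t, local_motion)
flower_t = flower_centers  # static object in this toy example

# Flexible rendering: rasterize one object alone, or concatenate all objects
# for a composed multi-object view (real rendering uses a Gaussian splatter).
composed = np.concatenate([butterfly_t, flower_t], axis=0)
```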
**Key Contributions:**
- We introduce Comp4D, a pioneering framework for compositional 4D scene creation.
- We propose decomposing object motion into global displacement and local deformation components, with LLMs tasked to design global displacement via kinematics templates (a minimal template sketch follows this list).
- Each object is formulated as a set of deformable 3D Gaussians, enabling flexible rendering and stable optimization of object motion.
- Extensive experiments demonstrate superior performance in visual quality, motion realism, and object interaction compared to existing baselines.
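To make the kinematics-template idea more concrete, below is a minimal sketch of what such templates might look like: fixed parametric trajectory functions whose parameters (start point, end point, arc height) the LLM fills in per entity. The template names and parameterization are our assumptions for illustration, not the paper's exact templates.

```python
import numpy as np

def linear_template(start, end, t):
    """Straight-line kinematics template: position at normalized time t in [0, 1]."""
    start, end = np.asarray(start, dtype=float), np.asarray(end, dtype=float)
    return (1.0 - t) * start + t * end

def parabolic_template(start, end, height, t):
    """Arc-shaped template: linear motion plus a vertical parabolic bump,
    e.g. for a hopping or flying object. Parameter names are assumptions."""
    pos = linear_template(start, end, t)
    pos[1] += height * 4.0 * t * (1.0 - t)   # bump peaks at t = 0.5
    return pos

# Values such as start/end points and arc height are what the LLM is asked to
# supply for each entity; the template functions themselves stay fixed.
for t in (0.0, 0.5, 1.0):
    print(t, parabolic_template([-1.0, 0.5, 0.0], [0.0, 0.2, 0.0], 0.3, t))
```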
**Related Work:**
Recent works in 4D content creation focus on object-centric generation due to the constraints of 3D-aware diffusion models. Our work is the first attempt to tackle the challenging task of compositional 4D scene generation by decomposing the scene into object components.
**Method:**