25 Mar 2024 | Dejia Xu, Hanwen Liang, Neel P. Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N. Plataniotis, and Zhangyang Wang
Comp4D is a framework for generating compositional 4D scenes from text input, leveraging large language models (LLMs) to decompose a text prompt into individual entities and their trajectories. Unlike conventional methods that generate a single 4D representation of the entire scene, Comp4D constructs each 4D object separately, enabling more realistic and interactive 4D content. The framework uses pre-trained diffusion models spanning the text-to-image, text-to-video, and text-to-3D domains to refine the scene, guided by the pre-defined trajectories. It introduces a compositional score distillation technique that optimizes object deformation and motion while alternating flexibly between object-centric and trajectory-guided views. Object motion is decomposed into global displacement and local deformation, with the LLM responsible for trajectory design and scale assignment. Extensive experiments show that Comp4D outperforms existing methods in visual quality, motion fidelity, and object interactions. The framework supports long-range motion and multi-concept interactions, and renders high-resolution video at 70 FPS. Its main limitations are the reliance on GPT-4's zero-shot ability and the constraints of current video diffusion models; future work aims to generate longer and more complex motions for practical 4D content creation.
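
To make the LLM-driven decomposition concrete, below is a minimal sketch of the kind of structured plan GPT-4 might be asked to produce, and how a global trajectory could be evaluated from it. The JSON schema, the `bezier` helper, and the example prompt are our own illustration rather than the paper's actual interface; the LLM response is hardcoded so the snippet runs without an API key.

```python
import json
import numpy as np

# Hypothetical GPT-4 response for the prompt "a butterfly flies towards a
# flower". Comp4D queries the LLM for entities, their relative scales, and a
# trajectory; this JSON schema is our illustration, not the paper's format.
llm_response = json.dumps({
    "entities": [
        {"name": "butterfly", "scale": 0.3, "moving": True},
        {"name": "flower",    "scale": 1.0, "moving": False},
    ],
    # Control points of a cubic Bezier curve in scene coordinates.
    "trajectory": [[-2.0, 1.5, 0.0], [-1.0, 2.0, 0.5],
                   [0.5, 1.0, 0.2], [1.0, 0.4, 0.0]],
})

plan = json.loads(llm_response)
P = np.array(plan["trajectory"])  # (4, 3) control points

def bezier(ts):
    """Cubic Bezier evaluated at times ts in [0, 1] -> (T, 3) positions."""
    ts = ts[:, None]
    return ((1 - ts) ** 3 * P[0] + 3 * (1 - ts) ** 2 * ts * P[1]
            + 3 * (1 - ts) * ts ** 2 * P[2] + ts ** 3 * P[3])

# Global displacement of the moving entity across 16 video frames.
frames = bezier(np.linspace(0.0, 1.0, 16))
print(frames.shape)  # (16, 3)
```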
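
The global-plus-local motion decomposition can be sketched just as compactly. The `DeformationField` MLP and `animate` helper below are hypothetical stand-ins for the deformation network Comp4D optimizes; only the overall composition, local deformation in the object's canonical frame followed by scaling and displacement along the LLM trajectory, reflects the method described above.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Tiny MLP mapping (point, time) -> local offset; a stand-in for the
    deformation network that Comp4D optimizes via score distillation."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz, t):
        # xyz: (N, 3) canonical Gaussian centers; t: (1,) time in [0, 1]
        return self.net(torch.cat([xyz, t.expand(xyz.shape[0], 1)], dim=-1))

def animate(xyz0, scale, deform, trajectory, t):
    """Compose the two motion terms: deform points in the object's canonical
    frame, apply the LLM-assigned scale, then shift along the trajectory."""
    local = xyz0 + deform(xyz0, t)        # local deformation
    return scale * local + trajectory(t)  # global displacement

xyz0 = torch.randn(1024, 3) * 0.1  # toy canonical point cloud
deform = DeformationField()
traj = lambda t: torch.cat([2 * t - 1, torch.sin(torch.pi * t), 0 * t])  # (3,)
pts = animate(xyz0, scale=0.3, deform=deform, trajectory=traj, t=torch.tensor([0.5]))
print(pts.shape)  # (1024, 3)
```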
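
Finally, compositional score distillation alternates between object-centric views supervised by an image diffusion prior and trajectory-guided scene views supervised by a video prior. The sketch below shows the standard score-distillation gradient and that alternation; the priors and renders are toy placeholders (the real pipeline uses frozen pre-trained diffusion models and differentiable rendering of the 4D Gaussians), so treat this as structure, not implementation.

```python
import torch

def sds_loss(latents, prior, alphas_cumprod, guidance_w=1.0):
    """Score distillation: noise the render, have the frozen prior predict
    the noise, and return a surrogate loss whose gradient w.r.t. `latents`
    equals guidance_w * (eps_pred - noise)."""
    t = torch.randint(50, 950, (1,))
    a = alphas_cumprod[t].view(-1, *([1] * (latents.dim() - 1)))
    noise = torch.randn_like(latents)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    with torch.no_grad():
        eps_pred = prior(noisy, t)  # frozen diffusion prior
    grad = guidance_w * (eps_pred - noise)
    return (grad * latents).sum()

# Toy eps-predictors standing in for the frozen text-to-image and
# text-to-video diffusion priors used in the real pipeline.
image_prior = lambda x, t: torch.zeros_like(x)
video_prior = lambda x, t: torch.zeros_like(x)
alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)

# Stand-ins for differentiable renders; in practice these are re-rendered
# every step from the optimized Gaussians and deformation field.
obj_view = torch.rand(1, 4, 64, 64, requires_grad=True)     # object-centric frame
traj_views = torch.rand(16, 4, 64, 64, requires_grad=True)  # trajectory-guided clip

for step in range(4):
    if step % 2 == 0:
        # Object-centric pass: the camera orbits one entity, image prior scores it.
        loss = sds_loss(obj_view, image_prior, alphas_cumprod)
    else:
        # Trajectory-guided pass: the camera follows the LLM path, video prior scores it.
        loss = sds_loss(traj_views, video_prior, alphas_cumprod)
    loss.backward()  # gradients reach the renders; in the real system they
                     # flow through rendering into the scene parameters
    obj_view.grad = None
    traj_views.grad = None
```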