EG4D is a novel framework for generating 4D objects without score distillation. Instead, it explicitly generates multi-view videos from a single input image to create high-quality, consistent 4D assets, addressing challenges such as temporal inconsistency, inter-frame geometry and texture diversity, and semantic defects in video generation. The framework combines several collaborative techniques: attention injection for cross-frame consistency, a robust dynamic reconstruction method based on Gaussian Splatting, and a refinement stage with a diffusion prior for semantic restoration.

The pipeline builds on two video diffusion models: Stable Video Diffusion (SVD) generates temporal video frames from the input image, while SV3D generates multi-view images of each frame. 4D Gaussian Splatting then reconstructs a dynamic 3D representation from the synthesized videos, and diffusion refinement restores semantic details. The resulting assets achieve realistic 3D appearance, high image fidelity, and fine temporal consistency.

Qualitative results and user preference studies show that EG4D outperforms baselines in generation quality, and evaluation on three cases demonstrates superior performance over existing methods. Ablation studies further confirm that each component effectively addresses the challenges of reconstructing a 4D representation from synthesized videos. The framework is expected to have broader impact in generating dynamic 3D objects with high quality and consistency.
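The staged pipeline described above can be sketched as a data-flow skeleton. This is a minimal illustration of how the pieces compose, not the authors' implementation: every function below (`svd_generate_frames`, `sv3d_generate_views`, `fit_4d_gaussians`, `diffusion_refine`) is a hypothetical placeholder standing in for the corresponding model, and the array shapes are illustrative.

```python
import numpy as np

# Illustrative sizes: frames, orbital views, image height/width (assumptions).
T, V, H, W = 4, 3, 64, 64

def svd_generate_frames(image, num_frames=T):
    """Placeholder for Stable Video Diffusion: image -> temporal frames."""
    return np.stack([image] * num_frames)           # (T, H, W, 3)

def sv3d_generate_views(frame, num_views=V):
    """Placeholder for SV3D: one frame -> orbital multi-view images.
    In EG4D, attention injection ties these per-frame orbits together
    for inter-frame consistency (not modeled in this stub)."""
    return np.stack([frame] * num_views)            # (V, H, W, 3)

def fit_4d_gaussians(multiview_videos):
    """Placeholder for 4D Gaussian Splatting reconstruction."""
    return {"representation": multiview_videos.mean()}

def diffusion_refine(renders):
    """Placeholder for the diffusion-prior semantic refinement stage."""
    return np.clip(renders, 0.0, 1.0)

image = np.random.rand(H, W, 3)                     # single input image
frames = svd_generate_frames(image)                 # temporal dynamics
grid = np.stack([sv3d_generate_views(f) for f in frames])  # (T, V, H, W, 3)
asset = fit_4d_gaussians(grid)                      # dynamic 3D representation
refined = diffusion_refine(grid)                    # semantic restoration pass
```

The key structural point is that view synthesis happens per frame, producing a time-by-view grid of images that the 4D reconstruction stage consumes jointly.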