L4GM: Large 4D Gaussian Reconstruction Model

14 Jun 2024 | Jiawei Ren1,5, Kevin Xie1,2, Ashkan Mirzaei1,2, Hanxue Liang1,3, Xiaohui Zeng1,2, Karsten Kreis1, Ziwei Liu5, Antonio Torralba3, Sanja Fidler1,2, Seung Wook Kim1, Huan Ling1,2
L4GM is a novel 4D Large Reconstruction Model that generates animated 3D objects from a single-view video input in a single feed-forward pass, taking only a second. The key to its success is a large-scale dataset of multiview videos rendered from curated animated objects in *Objaverse*. This dataset includes 44K diverse objects with 110K animations rendered from 48 viewpoints, resulting in 12M videos with a total of 300M frames. L4GM builds on the pre-trained 3D Large Reconstruction Model (LGM) and extends it to output a per-frame 3D Gaussian Splatting representation from video frames sampled at a low frame rate. Temporal self-attention layers are added to help the model learn consistency across time, and a per-timestep multiview rendering loss is used for training. The representation is then upsampled to a higher frame rate with a separately trained interpolation model. L4GM generalizes well to in-the-wild videos, producing high-quality animated 3D assets. It achieves state-of-the-art quality while being 100 to 1,000 times faster than other approaches. L4GM also enables fast video-to-4D generation in combination with multiview generative models.