HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

19 Mar 2024 | Hongyu Zhou, Jiahao Shao, Lu Xu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger, Yiyi Liao
HUGS proposes a method for holistic urban scene understanding based on 3D Gaussian Splatting. It jointly optimizes geometry, appearance, semantics, and motion using static and dynamic 3D Gaussians, with the poses of moving objects regularized by physical constraints. The model renders novel viewpoints in real time, produces accurate 2D and 3D semantic information, and reconstructs dynamic scenes even from noisy 3D bounding box predictions.

Concretely, HUGS decomposes each scene into a static region and a set of dynamic objects, constrains each object's pose trajectory with a unicycle motion model, and attaches semantic information to the 3D Gaussians so that semantic maps can be rendered alongside RGB. The model is trained jointly with RGB, semantic, and optical flow supervision, yielding accurate renderings of all three modalities while remaining robust to noisy inputs.

Evaluated on the KITTI, KITTI-360, and Virtual KITTI 2 datasets, HUGS achieves state-of-the-art performance in novel view synthesis, semantic view synthesis, and 3D semantic reconstruction, outperforming existing approaches in dynamic scene reconstruction and semantic understanding. It also supports scene editing and decomposition, and runs at roughly 93 fps at inference time on a single NVIDIA RTX 4090.
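The unicycle constraint on moving-object poses can be made concrete with a small sketch. The PyTorch code below illustrates a planar unicycle kinematic model and a regularization term that pulls optimized per-frame vehicle poses toward a physically plausible rollout. The function names, the exact loss form, and the joint optimization of the control variables are assumptions for illustration, not the paper's implementation.

```python
import torch


def unicycle_rollout(x0, y0, theta0, v, omega, dt):
    """Integrate a planar unicycle model.

    State: position (x, y) on the ground plane and heading theta.
    Controls: per-step forward speed v and yaw rate omega (tensors of length T).
    Kinematic update:
        x_{t+1}     = x_t + v_t * cos(theta_t) * dt
        y_{t+1}     = y_t + v_t * sin(theta_t) * dt
        theta_{t+1} = theta_t + omega_t * dt
    Returns tensors of shape (T+1,) for x, y, and theta.
    """
    x0, y0, theta0 = (torch.as_tensor(a, dtype=torch.float32) for a in (x0, y0, theta0))
    xs, ys, thetas = [x0], [y0], [theta0]
    for t in range(v.shape[0]):
        xs.append(xs[-1] + v[t] * torch.cos(thetas[-1]) * dt)
        ys.append(ys[-1] + v[t] * torch.sin(thetas[-1]) * dt)
        thetas.append(thetas[-1] + omega[t] * dt)
    return torch.stack(xs), torch.stack(ys), torch.stack(thetas)


def unicycle_regularization(obj_xy, obj_yaw, x0, y0, theta0, v, omega, dt):
    """Penalize deviation of per-frame object poses from a unicycle rollout.

    obj_xy:  (T+1, 2) optimized object centers (e.g. from noisy 3D boxes).
    obj_yaw: (T+1,)   optimized object headings.
    The rollout parameters (initial state and controls) are optimized jointly
    with the per-frame poses, pulling the trajectory toward smooth,
    physically plausible motion.  (Hypothetical loss form.)
    """
    xs, ys, thetas = unicycle_rollout(x0, y0, theta0, v, omega, dt)
    pos_err = ((obj_xy[:, 0] - xs) ** 2 + (obj_xy[:, 1] - ys) ** 2).mean()
    yaw_err = (1.0 - torch.cos(obj_yaw - thetas)).mean()  # wrap-safe angle error
    return pos_err + yaw_err


# Example: fit a smooth unicycle trajectory to noisy per-frame poses.
T, dt = 10, 0.1
obj_xy = torch.randn(T + 1, 2)               # stand-in for noisy box centers
obj_yaw = torch.randn(T + 1)                 # stand-in for noisy headings
v = torch.zeros(T, requires_grad=True)       # learnable per-step speeds
omega = torch.zeros(T, requires_grad=True)   # learnable per-step yaw rates
loss = unicycle_regularization(obj_xy, obj_yaw, obj_xy[0, 0], obj_xy[0, 1],
                               obj_yaw[0], v, omega, dt)
loss.backward()
```

In a training loop, a term like this would be added with a weight to the photometric, semantic, and optical flow losses during joint optimization, so the per-frame object poses stay consistent with a single smooth motion even when the input 3D bounding boxes are noisy.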