27 Mar 2025 | Weijie Lyu, Xueling Li, Abhijit Kundu, Yi-Hsuan Tsai, Ming-Hsuan Yang
Gaga is a framework designed to reconstruct and segment open-world 3D scenes using inconsistent 2D masks generated by zero-shot class-agnostic segmentation models. Unlike prior methods that rely on video object tracking or contrastive learning, Gaga leverages spatial information and a novel 3D-aware memory bank to associate object masks across diverse camera poses. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, particularly beneficial for sparsely sampled images. This approach ensures precise mask label consistency and enables the process of lifting 2D segmentation to a consistent 3D segmentation. Gaga has been shown to produce accurate 3D object segmentation, achieving high-quality results for downstream applications such as scene manipulation, including changing the color of objects and removing them from scenes. Extensive qualitative and quantitative evaluations on various datasets demonstrate the effectiveness of Gaga, highlighting its potential for real-world applications in 3D scene understanding and manipulation.Gaga is a framework designed to reconstruct and segment open-world 3D scenes using inconsistent 2D masks generated by zero-shot class-agnostic segmentation models. Unlike prior methods that rely on video object tracking or contrastive learning, Gaga leverages spatial information and a novel 3D-aware memory bank to associate object masks across diverse camera poses. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, particularly beneficial for sparsely sampled images. This approach ensures precise mask label consistency and enables the process of lifting 2D segmentation to a consistent 3D segmentation. Gaga has been shown to produce accurate 3D object segmentation, achieving high-quality results for downstream applications such as scene manipulation, including changing the color of objects and removing them from scenes. Extensive qualitative and quantitative evaluations on various datasets demonstrate the effectiveness of Gaga, highlighting its potential for real-world applications in 3D scene understanding and manipulation.