22 Jul 2024 | Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, Chris Sweeney
**EgoLifter: Open-world 3D Segmentation for Egocentric Perception**
This paper introduces *EgoLifter*, a novel system designed to automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. *EgoLifter* is specifically tailored for egocentric data, where scenes contain hundreds of objects captured from natural (non-scanning) motion. The system uses 3D Gaussians as the underlying representation of 3D scenes and objects, leveraging segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible and promptable definitions of object instances without relying on specific object taxonomies.
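The lifting step can be pictured as attaching a learned feature vector to each Gaussian, rendering it alongside color, and supervising the rendered feature map with SAM masks through a contrastive objective. The sketch below is a minimal, hypothetical version of that idea in PyTorch, not the authors' code: `feat_map`, `mask_ids`, and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_lifting_loss(feat_map: torch.Tensor,
                             mask_ids: torch.Tensor,
                             num_pairs: int = 4096,
                             temperature: float = 0.1) -> torch.Tensor:
    """Contrastive lifting sketch (hypothetical): pull rendered features of
    pixels inside the same SAM mask together, push cross-mask pairs apart.

    feat_map: (H, W, D) per-pixel features rendered from the 3D Gaussians.
    mask_ids: (H, W) integer id of the SAM mask covering each pixel.
    """
    H, W, D = feat_map.shape
    flat_feat = F.normalize(feat_map.reshape(-1, D), dim=-1)
    flat_ids = mask_ids.reshape(-1)

    # Sample random pixel pairs across the image.
    idx_a = torch.randint(0, H * W, (num_pairs,), device=feat_map.device)
    idx_b = torch.randint(0, H * W, (num_pairs,), device=feat_map.device)

    # Cosine similarity of each pair, scaled to act as a logit.
    sim = (flat_feat[idx_a] * flat_feat[idx_b]).sum(-1) / temperature
    same = (flat_ids[idx_a] == flat_ids[idx_b]).float()

    # Binary contrastive objective: high similarity for same-mask pairs,
    # low similarity for pairs drawn from different masks.
    return F.binary_cross_entropy_with_logits(sim, same)
```

In training, such a term would simply be added to the photometric loss, e.g. `loss = l_photo + lambda_seg * contrastive_lifting_loss(feat_map, mask_ids)`; because the masks carry no class labels, the learned features stay taxonomy-free and promptable at query time.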
To address the challenge of dynamic objects in egocentric videos, *EgoLifter* incorporates a transient prediction module that filters out dynamic objects during the 3D reconstruction process. This module does not require additional supervision and is optimized alongside 3D Gaussian Splatting using only photometric reconstruction losses. The result is a fully automatic pipeline that reconstructs 3D object instances as collections of 3D Gaussians, composing the entire scene.
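As a rough illustration of how a transient filter can be trained with no extra labels, the sketch below predicts a per-pixel transient probability from the input frame and uses it to downweight the photometric residual. The tiny architecture, the L1 residual, and the `beta` regularizer are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TransientPredictor(nn.Module):
    """Tiny conv net predicting a per-pixel transient probability from the
    input frame (a hypothetical stand-in for the paper's module)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> transient probability in [0, 1]: (B, 1, H, W)
        return torch.sigmoid(self.net(image))

def masked_photometric_loss(rendered: torch.Tensor,
                            target: torch.Tensor,
                            transient_prob: torch.Tensor,
                            beta: float = 0.01) -> torch.Tensor:
    """Downweight the reconstruction error where pixels are predicted
    transient; the beta term penalizes marking everything transient."""
    residual = (rendered - target).abs().mean(1, keepdim=True)  # (B, 1, H, W)
    loss = ((1.0 - transient_prob) * residual).mean()
    return loss + beta * transient_prob.mean()
```

Because pixels on moving objects produce persistent photometric error under any static reconstruction, the predictor learns to flag them, and the downweighted loss lets the static scene dominate the fitted 3D Gaussians.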
The authors created a new benchmark on the Aria Digital Twin dataset to quantitatively demonstrate *EgoLifter*'s state-of-the-art performance in open-world 3D segmentation from natural egocentric input. Experiments on various egocentric activity datasets show the method's promise for 3D egocentric perception at scale.
Key contributions of *EgoLifter* include:
- The first system to enable open-world 3D understanding from natural, dynamic egocentric videos.
- Leveraging recent image foundation models to achieve strong open-world 3D instance segmentation performance without expensive data annotation or extra training.
- Proposing a transient prediction network to filter out transient objects, improving both reconstruction and segmentation performance.
- Establishing the first benchmark for open-world 3D segmentation on dynamic egocentric video data, on which the method demonstrates leading quantitative performance.
The paper also discusses related work, including 3D Gaussian models, open-world 3D segmentation, and 3D reconstruction from egocentric videos. The method is evaluated on several egocentric video datasets, showing strong 3D reconstruction and open-world segmentation results. Qualitative applications, such as 3D object extraction and scene editing, are also showcased.