22 Jul 2024 | Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, and Chris Sweeney
EgoLifter is a system that automatically segments scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. It is designed specifically for egocentric data, where scenes contain hundreds of objects captured under natural motion. EgoLifter uses 3D Gaussians as the underlying scene representation and leverages segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible, promptable definitions of object instances; the pipeline is agnostic to the specific 2D instance segmenter and adopts SAM for its strong instance segmentation performance. To handle dynamic objects in egocentric videos, EgoLifter includes a transient prediction module that filters out dynamic objects during reconstruction, yielding a fully automatic pipeline that reconstructs 3D object instances as collections of 3D Gaussians. A new benchmark on the Aria Digital Twin dataset demonstrates state-of-the-art open-world 3D segmentation from natural egocentric input, and evaluations on several egocentric activity datasets show strong 3D reconstruction and open-world segmentation results, along with downstream applications including detection, segmentation, 3D object extraction, and scene editing. EgoLifter is the first system to enable open-world 3D understanding from natural, dynamic egocentric videos: by lifting the outputs of recent image foundation models into 3D Gaussian Splatting, it achieves strong open-world 3D instance segmentation without expensive data annotation or extra training.
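The transient prediction idea described above can be illustrated with a minimal sketch: a small network predicts a per-pixel probability that a pixel belongs to a transient (moving) object, and that probability down-weights the photometric reconstruction loss so dynamic content does not corrupt the static 3D Gaussian scene. The architecture, function names, and regularization term below are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TransientPredictor(nn.Module):
    """Hypothetical sketch: maps an RGB frame to a per-pixel
    transient probability in [0, 1]."""
    def __init__(self, in_ch=3, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image):      # image: (B, 3, H, W)
        return self.net(image)     # (B, 1, H, W), transient probability

def masked_photometric_loss(rendered, target, transient_prob, reg=0.01):
    # Down-weight pixels likely to be transient, and regularize the
    # predictor so it cannot trivially mark every pixel as transient.
    weight = 1.0 - transient_prob
    l1 = (weight * (rendered - target).abs()).mean()
    return l1 + reg * transient_prob.mean()
```

In this sketch the regularizer keeps the predictor honest: without it, labeling the whole frame transient would zero out the loss.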
The transient prediction network filters transient objects out of the 3D reconstruction, improving both reconstruction quality and segmentation of static objects, and EgoLifter establishes the first benchmark on dynamic egocentric video data, quantitatively demonstrating leading performance on it. Technically, EgoLifter augments 3D Gaussian Splatting with feature rendering: in addition to color, the renderer produces a feature map of arbitrary dimension, enabling high-dimensional features to be encoded in the learned 3D scene and segmentation to be lifted from 2D to 3D. On several large-scale egocentric video datasets, EgoLifter decomposes a 3D scene into a set of 3D object instances, opening promising directions for egocentric video understanding in AR/VR applications.
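The feature-rendering mechanism can be sketched in a few lines: per-Gaussian feature vectors are alpha-composited along each ray exactly like color in 3DGS, and the rendered per-pixel features are supervised so that pixels sharing a SAM mask map to nearby features while pixels from different masks are pushed apart. This is a minimal illustration under assumed interfaces; both function names and the contrastive loss form are hypothetical simplifications of the lifting idea.

```python
import torch

def composite_features(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian features along a
    ray, mirroring how 3DGS composites color.
    features: (N, D) for N depth-sorted Gaussians; alphas: (N,)."""
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)  # (N,)
    weights = alphas * transmittance                            # (N,)
    return (weights[:, None] * features).sum(dim=0)             # (D,)

def pairwise_contrastive_loss(feats, ids, margin=1.0):
    """Hypothetical pixel-pair loss: pull rendered features of pixels
    in the same SAM mask together, push different-mask pairs at least
    `margin` apart. feats: (P, D), ids: (P,) integer mask labels."""
    dists = torch.cdist(feats, feats)          # (P, P) pairwise distances
    same = ids[:, None] == ids[None, :]
    pull = dists[same].pow(2).mean()
    push = (margin - dists[~same]).clamp(min=0).pow(2).mean()
    return pull + push
```

Because the mask labels only supervise which pixels group together, not what the objects are, the learned features stay promptable: any 2D query mask can be lifted to the Gaussians whose composited features match it.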