Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos


March 12, 2024 | Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
This paper introduces Exo2Ego, a generative framework for exocentric-to-egocentric cross-view translation: given a third-person (exocentric) video of an actor, the goal is to generate the corresponding first-person (egocentric) view. The framework decouples translation into two stages: high-level structure transformation, which infers where and how hands and objects appear in the egocentric view, and diffusion-based pixel-level hallucination, which synthesizes the egocentric frame and improves its fidelity by incorporating a hand layout prior.

Exo2Ego is evaluated on a benchmark of synchronized ego-exo tabletop activity video pairs drawn from three public datasets: H2O, Aria Pilot, and Assembly101. Experiments show that it produces photorealistic video with clear hand manipulation detail and outperforms several baselines in both synthesis quality and generalization to new actions. The framework is camera-agnostic, which suits real-world settings where camera parameters are unavailable, and it does not require ground-truth semantic maps, which are typically unavailable at inference time. Its key contributions are the explicit encouragement of cross-view correspondence and the introduction of a hand layout prior that improves the fidelity of generated hands.

Further evaluations cover generalization to new actions, objects, subjects, and scenes, demonstrating that the framework generates realistic egocentric views in each case. The ability to model egocentric viewpoint changes is crucial for applications involving physical activities with significant head and body motion. The paper also discusses limitations: the framework does not generalize perfectly to in-the-wild objects, subjects, and backgrounds, owing to the modest scale of available training data. Future work involves integrating robust object geometric priors to improve performance.
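To make the two-stage design concrete, below is a minimal, untrained PyTorch-style sketch of the inference flow described above: a small layout predictor stands in for the structure-transformation stage, and a toy DDPM-style sampling loop stands in for the diffusion stage, conditioned on the exocentric frame and the predicted hand layout. All module names, network shapes, and the noise schedule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LayoutPredictor(nn.Module):
    """Stage 1 (hypothetical): infer a coarse egocentric hand-layout map from an exocentric frame."""
    def __init__(self, in_ch=3, layout_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, layout_ch, 3, padding=1), nn.Sigmoid(),  # per-pixel hand-presence probability
        )

    def forward(self, exo_frame):
        return self.net(exo_frame)

class ConditionalDenoiser(nn.Module):
    """Stage 2 (hypothetical): predict noise for the ego frame, conditioned on the exo frame and layout."""
    def __init__(self, in_ch=3 + 3 + 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy_ego, exo_frame, layout):
        x = torch.cat([noisy_ego, exo_frame, layout], dim=1)
        return self.net(x)  # predicted noise

@torch.no_grad()
def exo_to_ego(exo_frame, layout_net, denoiser, steps=50):
    """Toy DDPM-style sampler: predict the layout, then denoise random noise into an ego frame."""
    layout = layout_net(exo_frame)
    ego = torch.randn_like(exo_frame)          # start from Gaussian noise
    betas = torch.linspace(1e-4, 2e-2, steps)  # illustrative linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(ego, exo_frame, layout)
        # standard DDPM mean update; the stochastic term is skipped at the final step
        ego = (ego - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            ego = ego + torch.sqrt(betas[t]) * torch.randn_like(ego)
    return ego, layout

if __name__ == "__main__":
    exo = torch.randn(1, 3, 64, 64)  # dummy exocentric frame
    ego, layout = exo_to_ego(exo, LayoutPredictor(), ConditionalDenoiser())
    print(ego.shape, layout.shape)   # torch.Size([1, 3, 64, 64]) torch.Size([1, 1, 64, 64])
```

Conditioning the denoiser on both the exocentric frame and the layout map mirrors the paper's core idea that an explicit hand layout prior steers the diffusion stage toward plausible hand placement and clearer manipulation detail.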