Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

March 12, 2024 | Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
The paper "Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos" by Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman addresses the challenge of generating a first-person (egocentric) view of an actor from a third-person (exocentric) video. The authors propose a generative framework called Exo2Ego, which decouples the translation process into two stages: high-level structure transformation and diffusion-based pixel-level hallucination. The high-level structure transformation stage explicitly encourages cross-view correspondence between exocentric and egocentric views, while the diffusion-based pixel-level hallucination stage incorporates a hand layout prior to enhance the fidelity of the generated egocentric view. To evaluate their approach, the authors curate a comprehensive benchmark dataset consisting of synchronized ego-exo tabletop activity video pairs from three public datasets: H2O, Aria Pilot, and Assembly101. Experimental results demonstrate that Exo2Ego produces photorealistic video results with clear hand manipulation details and outperforms several baselines in terms of synthesis quality and generalization ability to new actions. The paper also discusses the limitations and future work, highlighting the potential for integrating robust object geometric priors to improve 3D-consistency in generated views.The paper "Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos" by Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman addresses the challenge of generating a first-person (egocentric) view of an actor from a third-person (exocentric) video. The authors propose a generative framework called Exo2Ego, which decouples the translation process into two stages: high-level structure transformation and diffusion-based pixel-level hallucination. The high-level structure transformation stage explicitly encourages cross-view correspondence between exocentric and egocentric views, while the diffusion-based pixel-level hallucination stage incorporates a hand layout prior to enhance the fidelity of the generated egocentric view. To evaluate their approach, the authors curate a comprehensive benchmark dataset consisting of synchronized ego-exo tabletop activity video pairs from three public datasets: H2O, Aria Pilot, and Assembly101. Experimental results demonstrate that Exo2Ego produces photorealistic video results with clear hand manipulation details and outperforms several baselines in terms of synthesis quality and generalization ability to new actions. The paper also discusses the limitations and future work, highlighting the potential for integrating robust object geometric priors to improve 3D-consistency in generated views.