EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

1 Jul 2024 | Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao
This paper presents solutions to the EgoVis Challenges at CVPR 2024, covering five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. The authors introduce a novel foundation model called EgoVideo, designed to handle the unique characteristics of egocentric videos. EgoVideo is trained in three stages: data selection, post-training, and task-specific fine-tuning. The model is evaluated on a range of tasks, including natural language queries, step grounding, moment queries, short-term object interaction anticipation, and long-term action anticipation. The results demonstrate EgoVideo's effectiveness at capturing fine-grained, action-specific information and its versatility across different egocentric video analysis scenarios. The codebase and pre-trained models are publicly available at <https://github.com/OpenGVLab/EgoVideo>. The paper also discusses the limitations of the approach, such as the high computational cost and the difficulty of temporal localization and long-term action anticipation.
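The summary above does not detail the post-training stage, but adapting a video foundation model to egocentric clip-narration pairs is commonly done with a CLIP-style video-text contrastive objective. The sketch below is a minimal, self-contained PyTorch illustration of one such symmetric InfoNCE step; the toy encoders, feature dimensions, and the assumption of a contrastive objective are illustrative only and are not EgoVideo's actual implementation (see the released codebase for the real model).

```python
# Minimal sketch (not the authors' code) of a CLIP-style video-text
# contrastive post-training step over matched clip-narration pairs.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVideoEncoder(nn.Module):
    """Stand-in for a pretrained video backbone; returns one embedding per clip."""
    def __init__(self, in_dim=768, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, clip_feats):          # (B, T, in_dim) per-frame features
        pooled = clip_feats.mean(dim=1)     # simple temporal average pooling
        return F.normalize(self.proj(pooled), dim=-1)


class ToyTextEncoder(nn.Module):
    """Stand-in for a text tower; returns one embedding per narration."""
    def __init__(self, in_dim=768, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, text_feats):          # (B, in_dim) pooled token features
        return F.normalize(self.proj(text_feats), dim=-1)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: each clip should match its own narration and vice versa."""
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    video_encoder, text_encoder = ToyVideoEncoder(), ToyTextEncoder()
    clips = torch.randn(8, 16, 768)         # 8 clips, 16 frames of features each
    narrations = torch.randn(8, 768)        # 8 matched narration features
    loss = contrastive_loss(video_encoder(clips), text_encoder(narrations))
    print(f"contrastive loss: {loss.item():.4f}")
```

In practice the toy encoders would be replaced by the actual video and text towers, and the aligned backbone would then be fine-tuned per task (e.g., grounding or anticipation heads) in the third stage.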