EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

1 Jul 2024 | Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao
This report presents the authors' solutions to the EgoVis Challenges at CVPR 2024, covering five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. The authors introduce EgoVideo, a foundation model designed specifically for egocentric video understanding. The model is built on a video-language two-tower architecture and is trained on carefully curated egocentric video data. It is adapted to a range of tasks: Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation in the Ego4D challenge, as well as Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition in the EPIC-Kitchens challenge. EgoVideo performs strongly across these tasks, demonstrating its versatility and effectiveness for egocentric video analysis. The authors also discuss the limitations of their approach, including high computational cost and the difficulty of temporal localization. The results show substantial performance gains across tasks, highlighting the model's potential for advancing research in egocentric video understanding. The codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.
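To make the "video-language two-tower" design concrete, the sketch below shows a minimal dual-encoder in PyTorch: a video tower and a text tower project their inputs into a shared embedding space and are trained with a symmetric contrastive (CLIP-style) objective on paired clips and narrations. This is an illustrative assumption, not the authors' actual EgoVideo implementation; the encoder modules, feature dimensions, and loss details are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoTowerVideoLanguageModel(nn.Module):
    """Minimal dual-encoder sketch: separate video and text towers mapped
    into a shared embedding space for contrastive video-text matching."""

    def __init__(self, video_encoder: nn.Module, text_encoder: nn.Module,
                 video_dim: int = 768, text_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.video_encoder = video_encoder            # e.g., a video transformer backbone (assumed)
        self.text_encoder = text_encoder              # e.g., a text transformer backbone (assumed)
        self.video_proj = nn.Linear(video_dim, embed_dim)  # project video features to shared space
        self.text_proj = nn.Linear(text_dim, embed_dim)    # project text features to shared space
        # Learnable temperature, stored in log space (initialized to log(1/0.07) as in CLIP).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, video: torch.Tensor, text_tokens: torch.Tensor):
        # Encode each modality and L2-normalize so the dot product is cosine similarity.
        v = F.normalize(self.video_proj(self.video_encoder(video)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)

        # Pairwise similarity matrix; the diagonal holds the matched video-text pairs.
        logits = self.logit_scale.exp() * v @ t.t()
        labels = torch.arange(logits.size(0), device=logits.device)

        # Symmetric InfoNCE loss over video-to-text and text-to-video directions.
        loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
        return loss, logits
```

A model pretrained this way can then be adapted per track: the video tower alone can back a recognition or localization head, while the joint embedding space supports retrieval-style tasks such as Natural Language Queries and Multiple Instance Retrieval.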