This paper presents the solutions to the EgoVis Challenges at CVPR 2024, covering five tracks of the Ego4D challenge and three tracks of the EPIC-Kitchens challenge. The authors introduce a novel foundation model called EgoVideo, designed to handle the unique characteristics of egocentric videos. EgoVideo is trained in three stages: data selection, post-training, and task-specific fine-tuning. The model is evaluated on a range of tasks, including natural language queries, step grounding, moment queries, short-term object interaction anticipation, and long-term action anticipation. The results demonstrate EgoVideo's effectiveness at capturing fine-grained, action-specific information and its versatility across diverse egocentric video analysis scenarios. The codebase and pre-trained models are publicly available at <https://github.com/OpenGVLab/EgoVideo>. The paper also discusses the limitations of the approach, such as the high computational resources required and the remaining challenges in temporal localization and long-term action anticipation.
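
The three-stage recipe can be pictured as a simple pipeline. The sketch below is a minimal, hypothetical illustration of that structure only; the function names (`select_clips`, `post_train`, `finetune`) and the `Clip` type are placeholders introduced here for clarity and are not part of the released EgoVideo codebase.

```python
# Hypothetical sketch of the three-stage training recipe summarized above.
# All names here are illustrative placeholders, not the EgoVideo API.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Clip:
    video_path: str
    narration: str
    is_egocentric: bool


def select_clips(corpus: Iterable[Clip]) -> List[Clip]:
    # Stage 1: data selection -- keep narrated egocentric clips for training.
    return [c for c in corpus if c.is_egocentric and c.narration]


def post_train(backbone: dict, clips: List[Clip]) -> dict:
    # Stage 2: post-training -- adapt a general video-language backbone to the
    # selected egocentric data (the actual optimization step is omitted).
    return {**backbone, "adapted_on_clips": len(clips)}


def finetune(backbone: dict, task: str) -> dict:
    # Stage 3: task-specific fine-tuning -- specialize the adapted backbone
    # for one downstream track, e.g. "natural_language_queries".
    return {**backbone, "task": task}


if __name__ == "__main__":
    corpus = [
        Clip("demo_1.mp4", "C opens the fridge", True),
        Clip("demo_2.mp4", "", True),  # dropped: no narration
    ]
    model = finetune(
        post_train({"name": "EgoVideo"}, select_clips(corpus)),
        "natural_language_queries",
    )
    print(model)
```

In practice, each downstream track (e.g. moment queries or long-term action anticipation) would reuse the same post-trained backbone from stage 2 and differ only in the stage-3 fine-tuning head and objective.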