Retrieval-Augmented Egocentric Video Captioning

19 Jun 2024 | Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
This paper introduces EgoInstructor, a retrieval-augmented multimodal captioning model for egocentric video understanding. The model retrieves semantically relevant third-person (exocentric) instructional videos and uses them as references when captioning egocentric videos, addressing two limitations of prior approaches: the limited scale of egocentric video-text datasets and the failure to benefit from the abundance of third-person video.

The cross-view retrieval module is trained on automatically generated pseudo ego-exo video pairs with a novel EgoExoNCE loss, which aligns egocentric and exocentric video features by pulling them toward shared text features describing similar actions. This cross-view alignment is what allows the module to retrieve third-person videos that are genuinely relevant to a first-person clip.

The retrieval module achieves superior performance across seven benchmarks, including EK100 Multi-Instance Retrieval, Ego4D Multiple-Choice Questions, YouCook2 video-text retrieval, and CharadesEgo cross-view video retrieval. For egocentric video captioning, EgoInstructor shows significant improvements by leveraging the retrieved third-person instructional videos as references; extensive experiments and qualitative results confirm that these references lead to more accurate and detailed captions. The model is built on pre-trained vision and language models, with the retrieval module trained on a large-scale corpus of egocentric and exocentric videos.
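As a rough illustration of the retrieval step, the sketch below shows how an egocentric clip embedding could be matched against an index of exocentric clip embeddings by cosine similarity. This is a minimal PyTorch sketch under assumed interfaces (the function name, embedding dimension, and index layout are hypothetical), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_exo_references(ego_feat, exo_index, k=3):
    """Return indices of the k exocentric clips most similar to one ego clip.

    ego_feat:  (D,)   embedding of the egocentric query clip (hypothetical shape)
    exo_index: (N, D) embeddings of N candidate exocentric clips
    """
    ego_feat = F.normalize(ego_feat, dim=-1)
    exo_index = F.normalize(exo_index, dim=-1)
    sims = exo_index @ ego_feat              # (N,) cosine similarities
    return sims.topk(k).indices.tolist()

# Toy usage with random embeddings standing in for a real exo-video index.
exo_index = torch.randn(1000, 256)
ego_feat = torch.randn(256)
print(retrieve_exo_references(ego_feat, exo_index, k=3))
```

The retrieved clips (or their transcripts) would then be passed to the captioning model as references alongside the egocentric input.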
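The EgoExoNCE loss is described here only at a high level, so the following is a hedged approximation: an InfoNCE-style objective in which both egocentric and exocentric video embeddings are contrasted against the shared text embeddings of captions describing the same action. The paper's exact formulation (e.g., how positives are defined across views) may differ; the tensor names and temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.07):
    """Standard InfoNCE: the i-th query should match the i-th key."""
    logits = query @ keys.t() / temperature                    # (B, B) similarities
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)

def ego_exo_nce(ego_feat, exo_feat, text_feat, temperature=0.07):
    """Illustrative cross-view objective (not the paper's exact loss):
    both ego and exo video features are pulled toward the shared text
    features of their captions, placing the two views in a common space."""
    ego_feat = F.normalize(ego_feat, dim=-1)
    exo_feat = F.normalize(exo_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    loss_ego = info_nce(ego_feat, text_feat, temperature)      # ego -> shared text
    loss_exo = info_nce(exo_feat, text_feat, temperature)      # exo -> shared text
    return 0.5 * (loss_ego + loss_exo)

# Toy usage: a batch of 8 pseudo ego-exo pairs with random 256-d embeddings.
ego = torch.randn(8, 256)
exo = torch.randn(8, 256)
txt = torch.randn(8, 256)
print(ego_exo_nce(ego, exo, txt).item())
```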