This paper addresses the challenge of understanding human actions from first-person view videos, a task known as egocentric video captioning. The authors propose a model called Egoinstructor, which retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos. The key contributions include:
1. **Egoinstructor Model**: A retrieval-augmented multimodal captioning model that automatically retrieves third-person instructional videos to assist in generating captions for egocentric videos.
2. **Cross-View Retrieval Module**: An automatic pipeline to discover ego-exo video pairs from large-scale egocentric and exocentric datasets, and a novel EgoExoNCE loss to align egocentric and exocentric video features with shared text features (a sketch of such an objective follows this list).
3. **Performance**: Extensive experiments across seven benchmarks show that the cross-view retrieval module outperforms existing methods.
4. **Improvements in Egocentric Video Captioning**: Egoinstructor significantly improves egocentric video captioning by leveraging third-person videos as references.
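The EgoExoNCE loss is only described at a high level here, so the following is a minimal sketch of what such an objective could look like, assuming L2-normalized clip and text embeddings and a symmetric InfoNCE formulation in which ego and exo features of the same action are both pulled toward the shared text feature. The function name `ego_exo_nce`, the batch-diagonal positive assignment, and the temperature value are illustrative assumptions, not the paper's exact formulation (the paper's positive-pair mining, e.g. grouping clips by shared nouns and verbs, is not reproduced).

```python
import torch
import torch.nn.functional as F

def ego_exo_nce(ego_feats, exo_feats, text_feats, temperature=0.07):
    """Sketch of an EgoExoNCE-style contrastive loss (illustrative only).

    Assumes features of shape (B, D), where ego_feats[i], exo_feats[i],
    and text_feats[i] describe the same action. Both views are pulled
    toward the shared text embedding; all other texts in the batch act
    as negatives.
    """
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    targets = torch.arange(ego.size(0), device=ego.device)

    def nce(video, text):
        # Symmetric video-to-text and text-to-video InfoNCE terms.
        logits = video @ text.t() / temperature
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Average the ego-text and exo-text terms so both views align to text.
    return 0.5 * (nce(ego, txt) + nce(exo, txt))
```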
The paper also discusses related work, including egocentric video understanding, ego-exo video understanding, and retrieval-augmented models. The methodology section details the architecture of Egoinstructor, including the cross-view visual representation alignment and the automatic ego-exo pair generation process. The experimental results show the effectiveness of the proposed approach in both cross-view retrieval and retrieval-augmented captioning tasks.
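As a rough illustration of the retrieval step in this pipeline, the sketch below assumes the ego and exo encoders have already been aligned by the cross-view training, so retrieving third-person references reduces to cosine similarity against a pre-computed gallery of exocentric clip features. The function `retrieve_exo_references`, its arguments, and the top-k selection are hypothetical conveniences; the paper's pipeline may differ in how the gallery is built and how the retrieved references are fed to the captioning model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_exo_references(ego_clip_feat, exo_gallery_feats, k=4):
    """Hypothetical top-k cross-view retrieval step.

    ego_clip_feat: (D,) feature of the egocentric query clip.
    exo_gallery_feats: (N, D) pre-computed features of exocentric
        instructional clips in the retrieval gallery.
    Returns the indices of the k most similar exocentric clips, which a
    retrieval-augmented captioner could then consume as references.
    """
    query = F.normalize(ego_clip_feat, dim=-1)
    gallery = F.normalize(exo_gallery_feats, dim=-1)
    sims = gallery @ query          # cosine similarities, shape (N,)
    return sims.topk(k).indices
```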