This paper addresses the challenge of understanding human actions from first-person view videos, a task known as egocentric video captioning. The authors propose a model called Egoinstructor, which retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos. The key contributions include:
1. **Egoinstructor Model**: A retrieval-augmented multimodal captioning model that automatically retrieves third-person instructional videos to assist in generating captions for egocentric videos.
2. **Cross-View Retrieval Module**: An automatic pipeline to discover ego-exo video pairs from large-scale egocentric and exocentric datasets, and a novel EgoExoNCE loss to align egocentric and exocentric video features with shared text features (a sketch of such an objective follows this list).
3. **Performance**: Extensive experiments across seven benchmarks show that the cross-view retrieval module outperforms existing methods.
4. **Improvements in Egocentric Video Captioning**: Egoinstructor significantly improves egocentric video captioning by leveraging third-person videos as references.
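The EgoExoNCE loss is only described at a high level here, so the following is a minimal sketch of what such an objective could look like, assuming L2-normalized clip and text embeddings and a symmetric InfoNCE formulation in which ego and exo features of the same action are both pulled toward the shared text feature. The function name `ego_exo_nce`, the batch-diagonal positive assignment, and the temperature value are illustrative assumptions, not the paper's exact formulation (the paper's positive-pair mining, e.g. grouping clips by shared nouns and verbs, is not reproduced).

```python
import torch
import torch.nn.functional as F

def ego_exo_nce(ego_feats, exo_feats, text_feats, temperature=0.07):
    """Sketch of an EgoExoNCE-style contrastive loss (illustrative only).

    Assumes features of shape (B, D), where ego_feats[i], exo_feats[i],
    and text_feats[i] describe the same action. Both views are pulled
    toward the shared text embedding; all other texts in the batch act
    as negatives.
    """
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    targets = torch.arange(ego.size(0), device=ego.device)

    def nce(video, text):
        # Symmetric video-to-text and text-to-video InfoNCE terms.
        logits = video @ text.t() / temperature
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Average the ego-text and exo-text terms so both views align to text.
    return 0.5 * (nce(ego, txt) + nce(exo, txt))
```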
The paper also discusses related work, including egocentric video understanding, ego-exo video understanding, and retrieval-augmented models. The methodology section details the architecture of Egoinstructor, including the cross-view visual representation alignment and the automatic ego-exo pair generation process. The experimental results show the effectiveness of the proposed approach in both cross-view retrieval and retrieval-augmented captioning tasks.
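As a rough illustration of the retrieval step in this pipeline, the sketch below assumes the ego and exo encoders have already been aligned by the cross-view training, so retrieving third-person references reduces to cosine similarity against a pre-computed gallery of exocentric clip features. The function `retrieve_exo_references`, its arguments, and the top-k selection are hypothetical conveniences; the paper's pipeline may differ in how the gallery is built and how the retrieved references are fed to the captioning model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_exo_references(ego_clip_feat, exo_gallery_feats, k=4):
    """Hypothetical top-k cross-view retrieval step.

    ego_clip_feat: (D,) feature of the egocentric query clip.
    exo_gallery_feats: (N, D) pre-computed features of exocentric
        instructional clips in the retrieval gallery.
    Returns the indices of the k most similar exocentric clips, which a
    retrieval-augmented captioner could then consume as references.
    """
    query = F.normalize(ego_clip_feat, dim=-1)
    gallery = F.normalize(exo_gallery_feats, dim=-1)
    sims = gallery @ query          # cosine similarities, shape (N,)
    return sims.topk(k).indices
```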