11 Apr 2024 | Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
This paper introduces a novel framework for dense video captioning, inspired by human cognitive processes. The framework, named Cross-Modal Memory-based dense video captioning (CM²), aims to improve event localization and captioning by leveraging external memory and cross-modal retrieval. The model uses a versatile encoder-decoder architecture with visual and textual cross-attention modules to effectively incorporate retrieved text features. The external memory is constructed from sentence-level features extracted from the training data, and segment-level video-to-text retrieval is performed using CLIP. The model's performance is evaluated on the ActivityNet Captions and YouCook2 datasets, showing promising results without extensive pretraining on large video datasets. The study highlights the effectiveness of memory retrieval in dense video captioning and provides insights into the interplay between visual and textual modalities.
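To make the retrieval-and-fusion idea concrete, here is a minimal sketch, not the authors' implementation: it assumes precomputed CLIP-style segment and sentence embeddings, and the names `retrieve_text_features`, `DualCrossAttentionBlock`, `memory_bank`, and `top_k` are illustrative choices rather than names from the paper.

```python
import torch
import torch.nn.functional as F
from torch import nn

def retrieve_text_features(segment_feats, memory_bank, top_k=5):
    """For each video segment feature, return the top-k most similar
    sentence-level text features from an external memory bank.

    segment_feats: (num_segments, dim)  precomputed CLIP-style visual features
    memory_bank:   (memory_size, dim)   sentence features from training captions
    """
    seg = F.normalize(segment_feats, dim=-1)
    mem = F.normalize(memory_bank, dim=-1)
    sims = seg @ mem.t()                  # cosine similarity (num_segments, memory_size)
    _, idx = sims.topk(top_k, dim=-1)     # indices of the best-matching sentences
    return memory_bank[idx]               # (num_segments, top_k, dim)


class DualCrossAttentionBlock(nn.Module):
    """Queries attend separately to visual features and to retrieved text
    features via two cross-attention modules; the outputs are added back
    to the queries (a simplified stand-in for the paper's decoder design)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, visual_feats, retrieved_text_feats):
        v_out, _ = self.visual_attn(queries, visual_feats, visual_feats)
        t_out, _ = self.text_attn(queries, retrieved_text_feats, retrieved_text_feats)
        return queries + v_out + t_out


# Toy usage with random tensors standing in for real CLIP embeddings.
dim, num_segments, memory_size = 512, 4, 1000
segment_feats = torch.randn(num_segments, dim)
memory_bank = torch.randn(memory_size, dim)
retrieved = retrieve_text_features(segment_feats, memory_bank, top_k=5)

block = DualCrossAttentionBlock(dim)
queries = torch.randn(1, 10, dim)            # event/caption queries
visual = segment_feats.unsqueeze(0)          # (1, num_segments, dim)
text = retrieved.reshape(1, -1, dim)         # flatten retrieved sentences
fused = block(queries, visual, text)         # (1, 10, dim)
```

The intent of the sketch is only to show the two stages the summary describes: a segment-level video-to-text lookup against a memory built from training captions, followed by a decoder block that fuses visual and retrieved textual context through separate cross-attention paths.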