Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

11 Apr 2024 | Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
This paper introduces Cross-Modal Memory-based Dense Video Captioning (CM²), a framework that uses cross-modal memory retrieval to improve both event localization and caption generation in untrimmed videos. Inspired by the way humans recall relevant memories when describing a scene, CM² retrieves semantic information from an external memory bank of text features and uses it as an additional cue for describing events.

Architecturally, CM² combines a weight-shared versatile encoder-decoder with visual and textual cross-attention modules that integrate the retrieved text features with visual features. The encoder processes both modalities, and the decoder refines a set of event queries through cross-attention; these queries are then used to predict event boundaries and to generate the corresponding captions.

The framework is evaluated on two benchmark datasets, ActivityNet Captions and YouCook2, where it achieves competitive performance without extensive pretraining on large-scale video datasets. Retrieving relevant text features from the external memory bank noticeably improves both event localization and caption quality. Ablation studies confirm that the individual components, including the weight-shared versatile encoder and the textual and visual cross-attention modules, each contribute to performance on both subtasks. Overall, CM² demonstrates the potential of memory-augmented models for dense video captioning, offering an efficient and effective way to produce accurate, natural-language descriptions of events in untrimmed videos.
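To make the retrieval-and-fusion idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a memory bank of precomputed sentence features, cosine-similarity top-k retrieval with visual features as queries, and a single cross-attention layer for fusion. All names, shapes, and hyperparameters here are illustrative assumptions.

```python
# Illustrative sketch only: cross-modal retrieval from an external text memory
# followed by visual-to-text cross-attention fusion. Shapes, cosine-similarity
# retrieval, and top-k selection are assumptions, not the paper's exact design.
import torch
import torch.nn.functional as F
from torch import nn


class CrossModalMemoryRetrieval(nn.Module):
    def __init__(self, dim: int = 512, top_k: int = 8, num_heads: int = 8):
        super().__init__()
        self.top_k = top_k
        # Cross-attention: visual tokens attend to the retrieved text features.
        self.text_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def retrieve(self, visual_feats: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
        """Retrieve top-k text features per visual token via cosine similarity.

        visual_feats: (B, T, D) frame-level visual features
        memory_bank:  (M, D) text features stored in the external memory
        returns:      (B, T*k, D) retrieved text features
        """
        v = F.normalize(visual_feats, dim=-1)          # (B, T, D)
        m = F.normalize(memory_bank, dim=-1)           # (M, D)
        sim = v @ m.t()                                # (B, T, M) similarity scores
        topk = sim.topk(self.top_k, dim=-1).indices    # (B, T, k) memory indices
        retrieved = memory_bank[topk]                  # (B, T, k, D) gathered features
        b, t, k, d = retrieved.shape
        return retrieved.reshape(b, t * k, d)

    def forward(self, visual_feats: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
        text_feats = self.retrieve(visual_feats, memory_bank)
        # Visual tokens as queries, retrieved text features as keys/values.
        fused, _ = self.text_cross_attn(visual_feats, text_feats, text_feats)
        return visual_feats + fused                    # residual fusion


if __name__ == "__main__":
    module = CrossModalMemoryRetrieval(dim=512, top_k=4)
    video = torch.randn(2, 100, 512)      # 2 videos, 100 frame features each
    memory = torch.randn(1000, 512)       # 1000 sentence features in the bank
    out = module(video, memory)
    print(out.shape)                      # torch.Size([2, 100, 512])
```

In the full model, such fused features would feed the encoder-decoder, whose event queries predict boundaries and captions; that part is omitted here for brevity.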