Describing Videos by Exploiting Temporal Structure


1 Oct 2015 | Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville
This paper addresses the challenging task of generating natural language descriptions for videos, emphasizing the importance of capturing both local and global temporal structure. The authors propose a novel approach that integrates a spatio-temporal 3-D convolutional neural network (3-D CNN) to model short-term temporal dynamics and a temporal attention mechanism to exploit the global temporal structure. The 3-D CNN, trained on video action recognition tasks, captures fine-grained motion information, while the temporal attention mechanism allows the model to selectively focus on relevant temporal segments. The proposed method outperforms state-of-the-art models on the Youtube2Text dataset under the BLEU and METEOR metrics, and combining the two components yields complementary gains. The paper also presents results on a larger and more challenging dataset, the DVS dataset, showing improved performance with the proposed method.
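To make the temporal attention idea concrete, below is a minimal PyTorch sketch of soft attention over per-segment video features: at each decoding step, segment features are scored against the decoder's previous hidden state, normalized with a softmax, and combined into a context vector. The module name, feature dimensions, and scoring form (a single-layer additive score) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Soft temporal attention over per-segment video features (illustrative sketch)."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)     # projects segment features
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance score per segment

    def forward(self, feats: torch.Tensor, dec_state: torch.Tensor):
        # feats: (batch, T, feat_dim), one feature vector per temporal segment
        # dec_state: (batch, hidden_dim), decoder hidden state from the previous step
        energy = torch.tanh(self.feat_proj(feats) + self.state_proj(dec_state).unsqueeze(1))
        scores = self.score(energy).squeeze(-1)             # (batch, T)
        weights = F.softmax(scores, dim=-1)                 # attention distribution over segments
        context = torch.bmm(weights.unsqueeze(1), feats).squeeze(1)  # weighted sum: (batch, feat_dim)
        return context, weights

# Usage: one decoding step attending over 26 temporal segments of 1024-d features
# (the segment count and feature size here are arbitrary example values).
attn = TemporalAttention(feat_dim=1024, hidden_dim=512)
feats = torch.randn(2, 26, 1024)
state = torch.randn(2, 512)
ctx, w = attn(feats, state)  # ctx: (2, 1024); w sums to 1 across the 26 segments
```

In the paper's framework, the resulting context vector would feed the caption decoder at each word-generation step, letting the model emphasize different portions of the video as the sentence unfolds.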