30 Apr 2015 | Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko
This paper addresses the challenge of translating videos into natural language descriptions using deep recurrent neural networks. The authors propose a unified deep neural network with both convolutional and recurrent structures to directly convert video pixels into sentences. By leveraging knowledge from large image datasets (1.2M+ images with category labels and 100K+ images with captions), the method can generate sentence descriptions for open-domain videos with large vocabularies. The approach is evaluated using metrics such as BLEU and METEOR scores, as well as human evaluation, demonstrating significant improvements over existing methods. The key contributions include the first end-to-end deep model for video-to-text generation, the effective use of transfer learning from image classification and caption data, and a detailed evaluation on the YouTube corpus. The results show that the proposed model outperforms previous approaches in terms of SVO accuracy, sentence generation quality, and human relevance scores.
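To make the evaluation metrics mentioned above concrete, here is a minimal sketch of BLEU-1 (clipped unigram precision with a brevity penalty) in plain Python. This is only a toy illustration of the family of metrics reported in the paper; the actual evaluation uses multi-reference, higher-order BLEU and METEOR, and the example sentences below are made up.

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """Toy BLEU-1: clipped unigram precision times a brevity penalty."""
    cand = candidate.split()
    cand_counts = Counter(cand)
    # Clip each candidate word's count by its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty against the closest-length reference.
    ref_len = min((len(r.split()) for r in references),
                  key=lambda rl: (abs(rl - len(cand)), rl))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision

# Hypothetical candidate and reference captions for a video clip.
score = bleu1("a man is playing a guitar",
              ["a man is playing the guitar", "someone plays a guitar"])
# 5 of 6 candidate unigrams are matched after clipping, so score = 5/6.
```

Scores closer to 1.0 indicate higher n-gram overlap with the human-written reference captions; the paper complements such automatic scores with human relevance judgments, since overlap metrics only loosely track fluency and correctness.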