Sequence to Sequence – Video to Text

19 Oct 2015 | Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
This paper introduces S2VT, a sequence-to-sequence model for generating natural-language descriptions of video. The model is a stacked long short-term memory (LSTM) network that first reads the video frames one at a time and then emits the caption one word at a time, so a single recurrent architecture learns both the temporal structure of the video and a language model for sentence generation. It is trained end-to-end on video-sentence pairs, learning to associate a sequence of frames with a sequence of words that describes the event in the clip, and it handles variable-length input and output sequences without requiring an explicit attention mechanism.

The model uses CNN features extracted from raw RGB frames and further improves performance by also incorporating optical-flow features that capture motion. It is implemented with the Caffe deep learning framework, and the code is available on GitHub.
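A minimal PyTorch sketch of this stacked-LSTM encode-then-decode scheme is given below (the paper's own implementation uses Caffe). It assumes pre-extracted per-frame CNN features; the class name, layer sizes, and vocabulary size are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class S2VT(nn.Module):
        """Hypothetical sketch of the S2VT stacked-LSTM captioner."""
        def __init__(self, feat_dim=4096, hidden_dim=500, vocab_size=10000, embed_dim=500):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, embed_dim)   # project frame features
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
            self.lstm1 = nn.LSTM(embed_dim, hidden_dim, batch_first=True)               # top LSTM: frames, then zeros
            self.lstm2 = nn.LSTM(embed_dim + hidden_dim, hidden_dim, batch_first=True)  # bottom LSTM: words
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, frames, captions):
            # Encoding stage: the top LSTM reads frame features while the
            # bottom LSTM sees zero-padded word inputs.
            B, T, _ = frames.shape
            f = self.feat_proj(frames)                                   # (B, T, embed_dim)
            h1_enc, state1 = self.lstm1(f)
            pad_words = torch.zeros(B, T, self.embed.embedding_dim, device=frames.device)
            _, state2 = self.lstm2(torch.cat([pad_words, h1_enc], dim=2))

            # Decoding stage: the top LSTM receives zero-padded frame inputs
            # and the bottom LSTM conditions on the previous words (teacher forcing).
            L = captions.size(1)
            pad_feats = torch.zeros(B, L, f.size(2), device=frames.device)
            h1_dec, _ = self.lstm1(pad_feats, state1)
            w = self.embed(captions)                                     # (B, L, embed_dim)
            h2_dec, _ = self.lstm2(torch.cat([w, h1_dec], dim=2), state2)
            return self.out(h2_dec)                                      # word logits at each step

Training would minimize a cross-entropy loss over these logits against the ground-truth caption, as in standard sequence-to-sequence learning.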
S2VT is evaluated on three benchmarks: the Microsoft Video Description (MSVD) corpus, the MPII Movie Description (MPII-MD) dataset, and the Montreal Video Annotation Dataset (M-VAD). Trained on these large collections of clip-sentence pairs, it achieves state-of-the-art METEOR scores, outperforming prior approaches and producing accurate, natural descriptions that are relevant to the events depicted across a wide range of video content.
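To illustrate how the RGB and optical-flow streams mentioned above could be combined at decoding time, here is a hedged late-fusion sketch that reuses the hypothetical S2VT class from the previous snippet. Averaging the two networks' per-step word distributions with a weight alpha is one plausible scheme; the 0.5 weight and greedy decoding are assumptions for illustration, not necessarily the paper's exact procedure.

    import torch

    @torch.no_grad()
    def fused_greedy_decode(rgb_model, flow_model, rgb_feats, flow_feats,
                            bos_id, eos_id, max_len=20, alpha=0.5):
        """Greedy caption decoding from a weighted average of two streams."""
        caption = [bos_id]
        for _ in range(max_len):
            prev = torch.tensor([caption])                              # (1, t) partial caption so far
            p_rgb = rgb_model(rgb_feats, prev).softmax(-1)[0, -1]       # next-word distribution (RGB stream)
            p_flow = flow_model(flow_feats, prev).softmax(-1)[0, -1]    # next-word distribution (flow stream)
            p = alpha * p_rgb + (1 - alpha) * p_flow                    # late fusion of the two streams
            next_id = int(p.argmax())
            if next_id == eos_id:
                break
            caption.append(next_id)
        return caption[1:]                                              # drop the BOS token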