4 Jan 2016 | Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov
This paper presents an approach to unsupervised learning of video representations using Long Short-Term Memory (LSTM) networks. An encoder LSTM maps an input video sequence into a fixed-length representation, which one or more decoder LSTMs then use either to reconstruct the input sequence or to predict future frames. The model is trained on two types of input sequences: image patches and high-level video percepts extracted with a pretrained convolutional network.

Evaluation is both qualitative and quantitative, the latter by fine-tuning the learned representations for supervised human action recognition on the UCF-101 and HMDB-51 datasets. The learned representations improve classification accuracy, especially when labelled training examples are scarce. The model also handles out-of-domain data and longer time scales, extrapolating learned video representations into the future and the past. The paper compares several model variants: autoencoders, future predictors, and composite models that combine both. The composite model performs best, combining the strengths of the autoencoder and the future predictor. The authors conclude that unsupervised learning with LSTMs can learn good video representations that improve performance on supervised tasks.
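To make the encoder/decoder structure concrete, here is a minimal PyTorch sketch of the composite variant: one encoder LSTM whose final state seeds two decoders, a reconstruction branch and a future-prediction branch. The layer sizes, the flattened-frame inputs, and the use of unconditioned (zero-input) decoders are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    """Sketch of the composite model: an encoder LSTM produces a fixed-length
    state that conditions both an input-reconstruction decoder and a
    future-prediction decoder."""

    def __init__(self, frame_dim=1024, hidden_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.pred_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_out = nn.Linear(hidden_dim, frame_dim)
        self.pred_out = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, n_future):
        # frames: (batch, time, frame_dim) -- flattened patches or percepts
        batch, T, D = frames.shape
        _, state = self.encoder(frames)  # fixed-length representation (h, c)

        # Unconditioned decoders: feed zeros at every step and rely only on
        # the copied encoder state. (A conditioned decoder would feed back
        # its own previous output instead.)
        zeros_in = frames.new_zeros(batch, T, D)
        recon_h, _ = self.recon_decoder(zeros_in, state)
        reconstruction = self.recon_out(recon_h)
        # Reconstruct the input in reverse order, as described in the paper.
        reconstruction = torch.flip(reconstruction, dims=[1])

        zeros_fut = frames.new_zeros(batch, n_future, D)
        pred_h, _ = self.pred_decoder(zeros_fut, state)
        future = self.pred_out(pred_h)
        return reconstruction, future

# Usage sketch: squared-error loss on both branches (hypothetical data shapes).
model = CompositeLSTM()
x = torch.randn(8, 10, 1024)              # 10 input frames per clip
target_future = torch.randn(8, 10, 1024)  # 10 future frames to predict
recon, pred = model(x, n_future=10)
loss = ((recon - x) ** 2).mean() + ((pred - target_future) ** 2).mean()
loss.backward()
```

The composite objective simply sums the reconstruction and prediction losses, which is what lets the model retain input detail (autoencoder strength) while still being forced to extract motion and dynamics (future-predictor strength).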