4 Jan 2016 | Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov
This paper presents an approach to unsupervised learning of video representations using Long Short-Term Memory (LSTM) networks. An encoder LSTM maps an input video sequence into a fixed-length representation, which one or more decoder LSTMs then use either to reconstruct the input sequence or to predict future frames. The model is trained on two types of input sequences: image patches and high-level video percepts extracted with a pretrained convolutional network.

Evaluation is both qualitative and quantitative, the latter by fine-tuning the learned representations for supervised human action recognition on the UCF-101 and HMDB-51 datasets. The learned representations improve classification accuracy, especially when labelled training examples are scarce. The model also handles out-of-domain data and longer time scales, extrapolating learned video representations into the future and the past. The paper compares several model variants: autoencoders, future predictors, and composite models that combine both. The composite model performs best, combining the strengths of the autoencoder and the future predictor. The authors conclude that unsupervised learning with LSTMs can learn good video representations that improve performance on supervised tasks.
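To make the encoder/decoder structure concrete, here is a minimal PyTorch sketch of the composite variant: one encoder LSTM whose final state seeds two decoders, a reconstruction branch and a future-prediction branch. The layer sizes, the flattened-frame inputs, and the use of unconditioned (zero-input) decoders are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    """Sketch of the composite model: an encoder LSTM produces a fixed-length
    state that conditions both an input-reconstruction decoder and a
    future-prediction decoder."""

    def __init__(self, frame_dim=1024, hidden_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.pred_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_out = nn.Linear(hidden_dim, frame_dim)
        self.pred_out = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, n_future):
        # frames: (batch, time, frame_dim) -- flattened patches or percepts
        batch, T, D = frames.shape
        _, state = self.encoder(frames)  # fixed-length representation (h, c)

        # Unconditioned decoders: feed zeros at every step and rely only on
        # the copied encoder state. (A conditioned decoder would feed back
        # its own previous output instead.)
        zeros_in = frames.new_zeros(batch, T, D)
        recon_h, _ = self.recon_decoder(zeros_in, state)
        reconstruction = self.recon_out(recon_h)
        # Reconstruct the input in reverse order, as described in the paper.
        reconstruction = torch.flip(reconstruction, dims=[1])

        zeros_fut = frames.new_zeros(batch, n_future, D)
        pred_h, _ = self.pred_decoder(zeros_fut, state)
        future = self.pred_out(pred_h)
        return reconstruction, future

# Usage sketch: squared-error loss on both branches (hypothetical data shapes).
model = CompositeLSTM()
x = torch.randn(8, 10, 1024)              # 10 input frames per clip
target_future = torch.randn(8, 10, 1024)  # 10 future frames to predict
recon, pred = model(x, n_future=10)
loss = ((recon - x) ** 2).mean() + ((pred - target_future) ** 2).mean()
loss.backward()
```

The composite objective simply sums the reconstruction and prediction losses, which is what lets the model retain input detail (autoencoder strength) while still being forced to extract motion and dynamics (future-predictor strength).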