13 Apr 2015 | Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici
This paper presents two deep neural network architectures for video classification that combine image information across longer time spans than previously attempted. The first method explores several convolutional temporal feature pooling architectures, while the second uses a recurrent neural network with Long Short-Term Memory (LSTM) cells to model the video as an ordered sequence of frames. The best networks achieve significant performance improvements on the Sports-1M and UCF-101 datasets, outperforming previous state-of-the-art methods. The authors also confirm the importance of motion information by incorporating optical flow, which further improves the LSTM models. The paper discusses the design choices behind these architectures and their evaluation, highlighting the benefits of learning over longer video sequences and the effectiveness of LSTMs at capturing temporal relationships.
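To make the contrast between the two architectures concrete, here is a minimal PyTorch-style sketch of (a) temporal feature pooling over per-frame CNN features and (b) an LSTM over the ordered frame sequence. This is an illustrative approximation under assumed shapes and a toy frame CNN, not the authors' original implementation (which used much deeper image networks such as GoogLeNet trained at Sports-1M scale).

```python
# Illustrative sketch only: a tiny frame CNN stands in for the deep image
# network used in the paper, and class/parameter names are hypothetical.
import torch
import torch.nn as nn


class FrameCNN(nn.Module):
    """Tiny stand-in for the per-frame image CNN."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                  # x: (B*T, 3, H, W)
        return self.fc(self.features(x).flatten(1))


class TemporalPoolingClassifier(nn.Module):
    """Approach (a): max-pool per-frame features across time, then classify."""

    def __init__(self, num_classes, feat_dim=256):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):               # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        pooled = feats.max(dim=1).values    # temporal max pooling
        return self.classifier(pooled)


class LSTMClassifier(nn.Module):
    """Approach (b): treat the video as an ordered frame sequence with an LSTM."""

    def __init__(self, num_classes, feat_dim=256, hidden=128):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips):               # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        outputs, _ = self.lstm(feats)
        return self.classifier(outputs[:, -1])  # predict from the last time step


if __name__ == "__main__":
    clips = torch.randn(2, 16, 3, 64, 64)    # 2 videos, 16 frames each
    print(TemporalPoolingClassifier(num_classes=101)(clips).shape)  # (2, 101)
    print(LSTMClassifier(num_classes=101)(clips).shape)             # (2, 101)
```

The pooling model discards frame ordering entirely, whereas the LSTM propagates state across frames; an optical-flow stream, as in the paper, would be a second such network fed flow images, with its predictions fused with the appearance stream.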