Two-Stream Convolutional Networks for Action Recognition in Videos

12 Nov 2014 | Karen Simonyan, Andrew Zisserman
This paper presents a two-stream Convolutional Network (ConvNet) architecture for action recognition in videos. The architecture combines two streams that capture complementary information: a spatial stream, which recognizes actions from the static appearance of individual video frames, and a temporal stream, which recognizes actions from the motion between frames. The spatial stream is pre-trained on the ImageNet dataset, allowing it to leverage large-scale still-image data.

The temporal stream takes multi-frame dense optical flow as input: the horizontal and vertical flow fields of several consecutive frame pairs are stacked into a single input volume. The flow itself is computed with a method based on constancy and smoothness assumptions.
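As a rough illustration of this temporal-stream input, the sketch below stacks the horizontal and vertical flow components of L consecutive frame pairs into a 2L-channel volume. It is a minimal sketch, not the paper's pipeline: OpenCV's Farnebäck flow is used only as a stand-in for the energy-based flow method the paper relies on, and the per-channel mean subtraction is a crude proxy for the paper's mean-flow subtraction.

```python
import cv2
import numpy as np

def stacked_optical_flow(frames, L=10):
    """Build a temporal-stream input volume of 2*L channels from L+1
    consecutive grayscale frames (H x W uint8 arrays): for each frame pair,
    the horizontal and vertical displacement fields are appended as channels.
    Farneback flow is an illustrative stand-in, not the paper's flow method."""
    assert len(frames) == L + 1
    channels = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        channels.append(flow[..., 0])  # horizontal displacement field
        channels.append(flow[..., 1])  # vertical displacement field
    volume = np.stack(channels, axis=0).astype(np.float32)  # shape (2L, H, W)
    # Per-channel mean subtraction, a simple proxy for camera-motion compensation.
    volume -= volume.mean(axis=(1, 2), keepdims=True)
    return volume
```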
The two streams are combined by late fusion of their class scores, with the spatial stream contributing static appearance information and the temporal stream contributing motion information. Because the available video training sets are relatively small, the temporal ConvNet is additionally trained with a multi-task learning framework that combines the UCF-101 and HMDB-51 training data, increasing the effective amount of training data and improving performance on both datasets.

Evaluated on the UCF-101 and HMDB-51 action recognition benchmarks, the model achieves high accuracy, outperforms previous attempts to use deep networks for video classification, and is competitive with the state of the art based on hand-crafted features. The results show that the two-stream architecture is effective for action recognition in videos.
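The late fusion described above can be pictured as two independent ConvNets whose softmax scores are averaged (the paper also reports fusion with a linear SVM trained on stacked scores). The following is a minimal PyTorch sketch, not the paper's implementation: the torchvision ResNet-18 backbones stand in for the ConvNet architecture actually used, and the class count and input sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_stream(in_channels, num_classes):
    """One stream: a stock ResNet-18 adapted to the given input depth
    (a stand-in for the ConvNet architecture used in the paper)."""
    net = resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

class TwoStreamNet(nn.Module):
    """Spatial stream on one RGB frame, temporal stream on a 2L-channel
    optical-flow stack; late fusion by averaging the softmax class scores."""
    def __init__(self, num_classes=101, L=10):
        super().__init__()
        self.spatial = make_stream(3, num_classes)        # static appearance
        self.temporal = make_stream(2 * L, num_classes)   # motion

    def forward(self, rgb_frame, flow_stack):
        p_spatial = torch.softmax(self.spatial(rgb_frame), dim=1)
        p_temporal = torch.softmax(self.temporal(flow_stack), dim=1)
        return (p_spatial + p_temporal) / 2               # averaged class scores

# Example: one 224x224 RGB frame and a 20-channel flow stack (L = 10).
model = TwoStreamNet(num_classes=101, L=10)
scores = model(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
```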
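The multi-task training of the temporal ConvNet can likewise be sketched as one shared trunk with a separate classification head per dataset, where each training video contributes a loss only through the head of its own dataset. This is a sketch under that assumption; the dummy trunk, feature dimension, and class counts below are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk with one classification head per dataset
    (e.g. 101 classes for UCF-101, 51 for HMDB-51); each sample is
    classified, and back-propagated, only through its own dataset's head."""
    def __init__(self, trunk, feat_dim, num_classes=(101, 51)):
        super().__init__()
        self.trunk = trunk
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in num_classes])

    def forward(self, x, dataset_id):
        return self.heads[dataset_id](self.trunk(x))

def multitask_step(model, x, labels, dataset_id):
    # Cross-entropy through the matching head; over training, gradients
    # from both datasets accumulate in the shared trunk.
    return nn.functional.cross_entropy(model(x, dataset_id), labels)

# Illustrative dummy trunk: pool a 20-channel flow stack and map it to 256-d features.
trunk = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.LazyLinear(256), nn.ReLU())
model = MultiTaskNet(trunk, feat_dim=256)
loss = multitask_step(model, torch.randn(4, 20, 224, 224),
                      torch.randint(0, 101, (4,)), dataset_id=0)
```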