Two-Stream Convolutional Networks for Action Recognition in Videos

12 Nov 2014 | Karen Simonyan, Andrew Zisserman
This paper presents a two-stream Convolutional Network (ConvNet) architecture for action recognition in videos. The architecture combines two streams that capture complementary information: a spatial stream, which recognizes actions from the static appearance of individual video frames, and a temporal stream, which recognizes actions from the motion between frames. The spatial stream is pre-trained on the ImageNet dataset, allowing it to leverage large-scale still-image data.

The temporal stream takes multi-frame dense optical flow as input: the horizontal and vertical flow fields of several consecutive frame pairs are stacked into a single input volume. The flow itself is computed with a method based on constancy and smoothness assumptions.
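As a rough illustration of this temporal-stream input, the sketch below stacks the horizontal and vertical flow components of L consecutive frame pairs into a 2L-channel volume. It is a minimal sketch, not the paper's pipeline: OpenCV's Farnebäck flow is used only as a stand-in for the energy-based flow method the paper relies on, and the per-channel mean subtraction is a crude proxy for the paper's mean-flow subtraction.

```python
import cv2
import numpy as np

def stacked_optical_flow(frames, L=10):
    """Build a temporal-stream input volume of 2*L channels from L+1
    consecutive grayscale frames (H x W uint8 arrays): for each frame pair,
    the horizontal and vertical displacement fields are appended as channels.
    Farneback flow is an illustrative stand-in, not the paper's flow method."""
    assert len(frames) == L + 1
    channels = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        channels.append(flow[..., 0])  # horizontal displacement field
        channels.append(flow[..., 1])  # vertical displacement field
    volume = np.stack(channels, axis=0).astype(np.float32)  # shape (2L, H, W)
    # Per-channel mean subtraction, a simple proxy for camera-motion compensation.
    volume -= volume.mean(axis=(1, 2), keepdims=True)
    return volume
```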
The two streams are combined by late fusion of their class scores, with the spatial stream contributing static appearance information and the temporal stream contributing motion information. Because the available video training sets are relatively small, the temporal ConvNet is additionally trained with a multi-task learning framework that combines the UCF-101 and HMDB-51 training data, increasing the effective amount of training data and improving performance on both datasets.

Evaluated on the UCF-101 and HMDB-51 action recognition benchmarks, the model achieves high accuracy, outperforms previous attempts to use deep networks for video classification, and is competitive with the state of the art based on hand-crafted features. The results show that the two-stream architecture is effective for action recognition in videos.
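The late fusion described above can be pictured as two independent ConvNets whose softmax scores are averaged (the paper also reports fusion with a linear SVM trained on stacked scores). The following is a minimal PyTorch sketch, not the paper's implementation: the torchvision ResNet-18 backbones stand in for the ConvNet architecture actually used, and the class count and input sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_stream(in_channels, num_classes):
    """One stream: a stock ResNet-18 adapted to the given input depth
    (a stand-in for the ConvNet architecture used in the paper)."""
    net = resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

class TwoStreamNet(nn.Module):
    """Spatial stream on one RGB frame, temporal stream on a 2L-channel
    optical-flow stack; late fusion by averaging the softmax class scores."""
    def __init__(self, num_classes=101, L=10):
        super().__init__()
        self.spatial = make_stream(3, num_classes)        # static appearance
        self.temporal = make_stream(2 * L, num_classes)   # motion

    def forward(self, rgb_frame, flow_stack):
        p_spatial = torch.softmax(self.spatial(rgb_frame), dim=1)
        p_temporal = torch.softmax(self.temporal(flow_stack), dim=1)
        return (p_spatial + p_temporal) / 2               # averaged class scores

# Example: one 224x224 RGB frame and a 20-channel flow stack (L = 10).
model = TwoStreamNet(num_classes=101, L=10)
scores = model(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
```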
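The multi-task training of the temporal ConvNet can likewise be sketched as one shared trunk with a separate classification head per dataset, where each training video contributes a loss only through the head of its own dataset. This is a sketch under that assumption; the dummy trunk, feature dimension, and class counts below are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk with one classification head per dataset
    (e.g. 101 classes for UCF-101, 51 for HMDB-51); each sample is
    classified, and back-propagated, only through its own dataset's head."""
    def __init__(self, trunk, feat_dim, num_classes=(101, 51)):
        super().__init__()
        self.trunk = trunk
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in num_classes])

    def forward(self, x, dataset_id):
        return self.heads[dataset_id](self.trunk(x))

def multitask_step(model, x, labels, dataset_id):
    # Cross-entropy through the matching head; over training, gradients
    # from both datasets accumulate in the shared trunk.
    return nn.functional.cross_entropy(model(x, dataset_id), labels)

# Illustrative dummy trunk: pool a 20-channel flow stack and map it to 256-d features.
trunk = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.LazyLinear(256), nn.ReLU())
model = MultiTaskNet(trunk, feat_dim=256)
loss = multitask_step(model, torch.randn(4, 20, 224, 224),
                      torch.randint(0, 101, (4,)), dataset_id=0)
```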