Convolutional Two-Stream Network Fusion for Video Action Recognition


26 Sep 2016 | Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
This paper proposes a new ConvNet architecture for spatiotemporal fusion of video snippets and achieves state-of-the-art results on standard benchmarks. The authors investigate various ways to fuse the spatial and temporal information carried by two ConvNet towers and find that fusing at a convolutional layer rather than at the softmax layer saves a substantial number of parameters without loss of performance. They further find that fusing spatially at the last convolutional layer works better than fusing earlier, and that additionally fusing at the class prediction layer boosts accuracy. Pooling abstract convolutional features over spatiotemporal neighbourhoods improves performance further.

The proposed architecture builds upon the two-stream ConvNet approach, which uses separate networks for spatial (RGB) and temporal (optical flow) information. The authors introduce a fusion method that combines spatial and temporal cues at multiple levels of feature abstraction, integrating them both spatially and over time. They investigate three aspects of fusion: (i) how to fuse the two networks while respecting spatial registration, (ii) where to fuse the two networks, and (iii) how to fuse the networks temporally.

The authors evaluate several fusion strategies: sum, max, concatenation, convolution, and bilinear fusion. They find that convolutive fusion performs best, with a slight advantage over bilinear fusion, and that fusing at the ReLU5 layer gives better performance than fusing at the fully connected layers.
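The fusion functions compared above can be written down compactly. The following NumPy sketch illustrates sum, max, concatenation, convolutive, and bilinear fusion of two hypothetical feature maps; the array shapes, variable names, and random filter bank are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the fusion functions, assuming two feature maps of shape
# (H, W, D) from the spatial and temporal streams at the same conv layer.
import numpy as np

H, W, D = 7, 7, 512                       # assumed last-conv-layer size (VGG-style)
rng = np.random.default_rng(0)
x_s = rng.standard_normal((H, W, D))      # spatial-stream (RGB) features
x_t = rng.standard_normal((H, W, D))      # temporal-stream (optical-flow) features

# Sum fusion: add responses at the same pixel position and channel.
y_sum = x_s + x_t                                       # (H, W, D)

# Max fusion: keep the stronger of the two responses.
y_max = np.maximum(x_s, x_t)                            # (H, W, D)

# Concatenation fusion: stack the two maps along the channel axis.
y_cat = np.concatenate([x_s, x_t], axis=-1)             # (H, W, 2D)

# Convolutive fusion: concatenate, then apply a bank of 1x1 filters that
# learn weighted combinations of both streams' channels (weights are
# random here; they would be learned in practice).
F = 0.01 * rng.standard_normal((2 * D, D))
y_conv = np.einsum('hwc,cd->hwd', y_cat, F)             # (H, W, D)

# Bilinear fusion: outer product of the two feature vectors at every
# location, summed over all locations -> a D*D-dimensional descriptor.
y_bil = np.einsum('hwc,hwd->cd', x_s, x_t).reshape(-1)  # (D*D,)
```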
The proposed architecture uses 3D convolution and 3D pooling to fuse the spatial and temporal streams over space and time, and it achieves state-of-the-art results on the UCF101 and HMDB51 datasets, outperforming previous methods on both. Using deeper networks improves performance further, but at the cost of increased computation time, and combining the ConvNet predictions with FV-encoded IDT features gives an additional boost. The authors conclude that their approach is effective at learning correspondences between highly abstract ConvNet features both spatially and temporally, and note that current datasets are either too small or too noisy, which may limit the generalizability of the results.
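To make the 3D fusion described above concrete, the following PyTorch sketch stacks the two streams' conv-layer features over time, fuses them with a 3D convolution, and max-pools over a spatiotemporal neighbourhood. The tensor shapes, kernel sizes, and layer names are assumptions chosen for illustration, not the authors' released implementation.

```python
# Schematic 3D conv + 3D pooling fusion of stacked two-stream features.
import torch
import torch.nn as nn

T, H, W, D = 5, 7, 7, 512                  # assumed: frames per snippet, spatial size, channels
x_spatial  = torch.randn(1, D, T, H, W)    # conv features of the RGB stream, stacked over T frames
x_temporal = torch.randn(1, D, T, H, W)    # conv features of the flow stream, stacked over T frames

# Concatenate along channels so that corresponding pixels of the two
# streams are placed side by side.
x = torch.cat([x_spatial, x_temporal], dim=1)            # (1, 2D, T, H, W)

# A 3D convolution learns correspondences between the streams across
# both space and neighbouring frames.
fuse = nn.Conv3d(in_channels=2 * D, out_channels=D, kernel_size=3, padding=1)
y = torch.relu(fuse(x))                                  # (1, D, T, H, W)

# 3D max pooling aggregates the fused features over a spatiotemporal
# neighbourhood before the classification layers.
pool = nn.MaxPool3d(kernel_size=2, stride=2)
y = pool(y)                                              # (1, D, T//2, H//2, W//2)
```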