Convolutional Two-Stream Network Fusion for Video Action Recognition

26 Sep 2016 | Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
This paper addresses the challenge of human action recognition in videos using Convolutional Neural Networks (ConvNets). The authors explore various methods for fusing spatial and temporal information from ConvNet towers to enhance performance. Key findings include:

1. **Spatial and Temporal Fusion at Convolutional Layers**: Fusing the ConvNet towers at a convolutional layer, rather than at the softmax layer, can improve performance without increasing the parameter count.
2. **Spatial Fusion Location**: Fusion at the last convolutional layer is more effective than at earlier layers, and additionally fusing at the class prediction layer can further boost accuracy.
3. **Pooling of Abstract Features**: Pooling abstract convolutional features over spatiotemporal neighborhoods further enhances performance.

Based on these findings, the authors propose a new ConvNet architecture for spatiotemporal fusion of video snippets. The architecture is evaluated on standard benchmarks (UCF101 and HMDB51) and achieves state-of-the-art results, outperforming previous approaches, including the original two-stream architecture, by leveraging both spatial and temporal cues more effectively. The code and models are available for public use.
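The central idea, fusing the two streams at a convolutional layer and pooling the result over a spatiotemporal neighborhood, can be illustrated with a minimal PyTorch sketch. The module below is an illustration under stated assumptions, not the authors' released code: it stacks hypothetical spatial- and temporal-stream feature maps along the channel axis, mixes them with a learned 3D convolution, and applies 3D max pooling. The channel count, kernel sizes, and the class name `ConvFusion3D` are illustrative choices.

```python
import torch
import torch.nn as nn

class ConvFusion3D(nn.Module):
    """Sketch of conv-layer fusion of a spatial and a temporal stream.

    Feature maps from the two streams are stacked along the channel
    dimension, mixed with a learned 3D convolution, and pooled over a
    spatiotemporal neighborhood, loosely following the paper's
    conv-fusion idea. Shapes and hyperparameters are assumptions.
    """

    def __init__(self, channels=512):
        super().__init__()
        # 3D conv mixes the stacked spatial+temporal channels over
        # space and time (kernel T x H x W = 3 x 3 x 3).
        self.fuse = nn.Conv3d(2 * channels, channels,
                              kernel_size=3, padding=1)
        # 3D max pooling over a spatiotemporal neighborhood.
        self.pool = nn.MaxPool3d(kernel_size=(2, 2, 2))

    def forward(self, spatial_feat, temporal_feat):
        # Both inputs: (batch, channels, T, H, W) conv feature maps.
        x = torch.cat([spatial_feat, temporal_feat], dim=1)
        x = self.fuse(x)
        return self.pool(x)

if __name__ == "__main__":
    # Toy example: last-conv-layer-sized maps from a 4-frame snippet.
    spatial = torch.randn(1, 512, 4, 14, 14)
    temporal = torch.randn(1, 512, 4, 14, 14)
    fused = ConvFusion3D()(spatial, temporal)
    print(fused.shape)  # torch.Size([1, 512, 2, 7, 7])
```

The point of applying a learned convolution after stacking the channels, rather than simply averaging the two streams, is that the network can learn correspondences between spatial and motion features at the same image position; the 3D pooling then aggregates this evidence over space and time.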