Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

27 Jul 2018 | Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy
The paper "Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification" by Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy explores the challenges and improvements in video classification using convolutional neural networks (CNNs). The authors address three main issues: spatial feature representation, temporal information representation, and model/computation complexity. They focus on 3D CNNs, which have shown promise in spatial and temporal representation learning but are computationally expensive. The paper introduces several network designs to balance speed and accuracy, including replacing some 3D convolutions with 2D convolutions, using separable spatial/temporal convolutions, and incorporating spatio-temporal feature gating. These modifications result in efficient and accurate video classification systems that outperform previous methods on various benchmarks, such as Kinetics, Something-something, UCF101, and HMDB, as well as action detection tasks on JHMDB and UCF101-24. The authors also demonstrate the generalization of their models to different input modalities and datasets, showing consistent performance improvements.
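A rough sense of why separable spatial/temporal convolutions reduce cost can be had by counting parameters. The sketch below compares a full k×k×k 3D convolution against a factorization into a 1×k×k spatial convolution followed by a k×1×1 temporal convolution. This is a simplified illustration (square kernels, no biases, channel counts chosen arbitrarily), not the paper's exact layer configuration.

```python
# Parameter-count comparison: full 3D convolution vs. a separable
# spatial + temporal factorization. Simplified sketch: no biases,
# and the illustrative channel counts below are not from the paper.

def full_3d_params(c_in, c_out, k):
    # One k x k x k kernel over c_in input channels per output channel.
    return c_in * c_out * k ** 3

def separable_params(c_in, c_out, k):
    # A 1 x k x k spatial conv (c_in -> c_out), then a
    # k x 1 x 1 temporal conv (c_out -> c_out).
    return c_in * c_out * k ** 2 + c_out * c_out * k

if __name__ == "__main__":
    c_in, c_out, k = 192, 256, 3
    full = full_3d_params(c_in, c_out, k)
    sep = separable_params(c_in, c_out, k)
    print(f"full 3D:   {full:,} parameters")
    print(f"separable: {sep:,} parameters ({sep / full:.0%} of full)")
```

For these illustrative sizes the separable form needs roughly half the parameters of the full 3D convolution, and the gap widens as the temporal kernel grows, since the spatial and temporal extents no longer multiply together.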