27 Jul 2018 | Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy
This paper addresses the challenge of improving video classification performance while balancing speed and accuracy. The authors propose a novel approach that combines several key design choices to achieve this balance. They start with the I3D model, a 3D convolutional neural network that has shown promise in video classification. However, I3D is computationally expensive and prone to overfitting. To address these issues, the authors explore various network architectures, including Bottom-Heavy I3D and Top-Heavy I3D, where 3D convolutions are applied at different layers of the network. They find that Top-Heavy I3D models are faster and often more accurate, as they apply 3D convolutions to higher-level feature maps, which are smaller and more semantically rich.
The authors also introduce a separable 3D convolution (S3D) that decomposes 3D convolutions into spatial and temporal components, reducing computational cost while maintaining accuracy. They further enhance the model with a spatio-temporal feature gating mechanism, which allows the model to focus on important features and ignore irrelevant ones, improving accuracy without increasing computational cost significantly.
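The cost saving from separating a 3D convolution can be seen with simple multiply-accumulate arithmetic. The sketch below compares a full 3×3×3 convolution against an S3D-style factorization into a 1×3×3 spatial convolution followed by a 3×1×1 temporal one; the feature-map and channel sizes are illustrative assumptions, not values from the paper.

```python
def conv3d_cost(t, h, w, cin, cout, kt, kh, kw):
    """Multiply-accumulate count for a (kt, kh, kw) 3D convolution
    over a (t, h, w) feature map with stride 1 and 'same' padding."""
    return t * h * w * cin * cout * kt * kh * kw

# Hypothetical mid-network feature-map size, for illustration only.
T, H, W, C = 16, 28, 28, 192

full = conv3d_cost(T, H, W, C, C, 3, 3, 3)
# S3D-style factorization: spatial (1x3x3) then temporal (3x1x1).
separable = (conv3d_cost(T, H, W, C, C, 1, 3, 3)
             + conv3d_cost(T, H, W, C, C, 3, 1, 1))

print(full / separable)  # 2.25x fewer multiply-accumulates
```

The ratio is independent of the feature-map size here because the kernel volumes alone determine it: 27 multiply-accumulates per output for the full kernel versus 9 + 3 for the factorized pair.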
The proposed S3D-G model outperforms previous methods on several video classification benchmarks, including Kinetics, Something-something, UCF101, and HMDB. It also performs well on action detection tasks, such as JHMDB and UCF101-24. The model is efficient, with a significant reduction in computational cost compared to I3D, while maintaining high accuracy. The authors also show that the model generalizes well to different datasets and tasks, demonstrating its effectiveness in various video understanding applications. The key contributions of the paper include the development of a top-heavy model design, temporally separable convolution, and spatio-temporal feature gating, which together enable a more efficient and accurate video classification system.
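The spatio-temporal feature gating contribution can be sketched as a per-channel self-gate: features are pooled over space and time, passed through a learned transform and a sigmoid, and the resulting weights rescale each channel. This is a hedged illustration under assumed shapes and variable names, not the authors' implementation.

```python
import numpy as np

def feature_gating(x, w, b):
    """Self-gating sketch: pool features over space-time, map the pooled
    vector through a learned transform, squash to (0, 1) with a sigmoid,
    and rescale each channel of the input accordingly.
    x: (C, T, H, W) feature map; w: (C, C) weights; b: (C,) bias.
    Shapes and names are illustrative assumptions."""
    pooled = x.mean(axis=(1, 2, 3))                    # (C,) global average pool
    gates = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))    # sigmoid gate in (0, 1)
    return gates[:, None, None, None] * x              # per-channel rescaling

rng = np.random.default_rng(0)
C, T, H, W = 8, 4, 7, 7
x = rng.standard_normal((C, T, H, W))
w = rng.standard_normal((C, C)) * 0.1
b = np.zeros(C)
y = feature_gating(x, w, b)
```

Because the gate only rescales channels, the output shape matches the input, which is why the mechanism adds accuracy at little extra computational cost.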