X3D: Expanding Architectures for Efficient Video Recognition

9 Apr 2020 | Christoph Feichtenhofer
X3D is a family of efficient video networks that progressively expands a tiny 2D image classification architecture along multiple network axes, including space, time, width, and depth. Inspired by feature selection methods in machine learning, X3D uses a stepwise expansion approach that expands one axis at a time to achieve a good accuracy-to-complexity trade-off. X3D achieves state-of-the-art performance while requiring 4.8× fewer multiply-adds and 5.5× fewer parameters for similar accuracy compared to previous work, reaching competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code is available at https://github.com/facebookresearch/SlowFast.

X3D expands a tiny base 2D image architecture into a spatiotemporal one along six axes: temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth. The base architecture builds on the MobileNet concept of channel-wise separable convolutions (see the sketch below) but is made tiny, with over 10× fewer multiply-add operations than mobile image models. Expansion progressively increases computation by expanding one axis at a time, training and validating the resulting architecture, and selecting the axis that achieves the best computation/accuracy trade-off. This process is repeated until the architecture reaches a desired computational budget, resembling coordinate descent in the hyper-parameter space defined by those axes (see the expansion sketch further below).

X3D is compared to previous work on Kinetics-400, Kinetics-600, Charades, and AVA. For systematic studies, models are grouped into small, medium, and large complexity levels. The expansion produces a sequence of spatiotemporal architectures covering a wide range of computation/accuracy trade-offs, which can be used under different, application-dependent computational budgets. The expansion procedure itself is simple and cheap: a low-compute model is found after training only 30 tiny models that collectively require over 25× fewer multiply-add operations than one large state-of-the-art network.

Conceptually, X3D's most surprising finding is that very thin video architectures, created by expanding spatiotemporal resolution, perform well while remaining extremely light in terms of network width and parameters. X3D networks have lower width than image-based video models, making them similar to the high-resolution Fast pathway of SlowFast networks. These advances are expected to facilitate future research and applications.

Efficient 2D networks have been extensively developed for image classification, with MobileNetV1&2 and ShuffleNet exploring channel-wise separable convolutions and expanded bottlenecks. Several methods for neural architecture search have also been proposed.
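To make the channel-wise separable building block concrete, here is a minimal PyTorch sketch: a depthwise 3×3×3 convolution that applies one filter per channel, followed by a pointwise 1×1×1 convolution that mixes channels. The class name and channel sizes are illustrative and not taken from the X3D repository; the actual X3D block additionally uses normalization, non-linearities, and channel attention, which are omitted here.

```python
# Minimal sketch of a channel-wise separable 3D convolution in PyTorch.
# Class name and channel sizes are illustrative, not from the X3D codebase.
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depthwise: one 3x3x3 filter per input channel (groups == channels),
        # so its cost scales with the channel count rather than its square.
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1,
                                   groups=in_ch, bias=False)
        # Pointwise: a 1x1x1 convolution that mixes information across channels.
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A video clip tensor has shape (batch, channels, time, height, width).
x = torch.randn(1, 24, 4, 56, 56)
y = SeparableConv3d(24, 48)(x)
print(y.shape)  # torch.Size([1, 48, 4, 56, 56])
```

Compared with a dense 3×3×3 convolution, whose multiply-adds per output location scale with in_ch × out_ch × 27, the separable version costs roughly in_ch × 27 + in_ch × out_ch, which is why the base model can stay so light.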
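The expansion procedure reads naturally as greedy coordinate descent. The Python sketch below illustrates the loop under stated assumptions: train_and_score is a toy stand-in with made-up per-axis returns (a real run would train each tiny candidate on video data and measure validation accuracy and multiply-adds), and the per-step expansion factor and compute budget are illustrative values, not the paper's.

```python
# Sketch of X3D's stepwise expansion as greedy coordinate descent over six axes.
import math

AXES = {"duration": 1.0, "frame_rate": 1.0, "resolution": 1.0,
        "width": 1.0, "bottleneck": 1.0, "depth": 1.0}

def train_and_score(config):
    """Toy stand-in for training + validation. The per-axis weights and the
    concave (sqrt-log) returns are made up for illustration; a real run
    trains the candidate network and evaluates it on a validation set."""
    weights = {"duration": 0.9, "frame_rate": 0.7, "resolution": 1.0,
               "width": 0.5, "bottleneck": 0.8, "depth": 0.6}
    acc = sum(w * math.sqrt(math.log(config[a])) for a, w in weights.items())
    madds = 1e7 * math.prod(config.values())  # compute grows multiplicatively
    return acc, madds

def expand(config, step=2.0, budget=5e9):
    """Expand one axis at a time until the compute budget is reached."""
    config = dict(config)
    while True:
        candidates = []
        for axis in config:
            cand = dict(config)
            cand[axis] *= step  # each candidate expands exactly one axis
            acc, madds = train_and_score(cand)
            candidates.append((acc, madds, cand))
        # Every candidate sits at roughly the same compute level, so picking
        # the highest validation accuracy approximates the best
        # computation/accuracy trade-off at this step.
        acc, madds, config = max(candidates, key=lambda c: c[0])
        if madds >= budget:
            return config

print(expand(AXES))  # prints the expanded per-axis factors
```

In the paper, each candidate is a real network trained and validated on the target dataset, and expansion stops once the desired multiply-add budget is reached; the greedy one-axis-at-a-time structure is the same.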