X3D: Expanding Architectures for Efficient Video Recognition

9 Apr 2020 | Christoph Feichtenhofer
X3D is a family of efficient video networks that progressively expands a tiny 2D image classification architecture along multiple network axes, including space, time, width, and depth. Inspired by feature selection methods in machine learning, X3D uses a stepwise expansion approach that expands one axis at a time to achieve a good accuracy-to-complexity trade-off. X3D achieves state-of-the-art performance while requiring 4.8× fewer multiply-adds and 5.5× fewer parameters for similar accuracy compared to previous work, reaching competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code is available at https://github.com/facebookresearch/SlowFast.

X3D expands a tiny base 2D image architecture into a spatiotemporal one along six axes: temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth. The base architecture builds on the MobileNet concept of channel-wise separable convolutions (see the sketch below) but is made tiny, with over 10× fewer multiply-add operations than mobile image models. Expansion progressively increases computation by expanding one axis at a time, training and validating the resulting architecture, and selecting the axis that achieves the best computation/accuracy trade-off. This process is repeated until the architecture reaches a desired computational budget, resembling coordinate descent in the hyper-parameter space defined by those axes (see the expansion sketch further below).

X3D is compared to previous work on Kinetics-400, Kinetics-600, Charades, and AVA. For systematic studies, models are grouped into small, medium, and large complexity levels. The expansion produces a sequence of spatiotemporal architectures covering a wide range of computation/accuracy trade-offs, which can be used under different, application-dependent computational budgets. The expansion procedure itself is simple and cheap: a low-compute model is found after training only 30 tiny models that collectively require over 25× fewer multiply-add operations than one large state-of-the-art network.

Conceptually, X3D's most surprising finding is that very thin video architectures, created by expanding spatiotemporal resolution, perform well while remaining extremely light in terms of network width and parameters. X3D networks have lower width than image-based video models, making them similar to the high-resolution Fast pathway of SlowFast networks. These advances are expected to facilitate future research and applications.

Efficient 2D networks have been extensively developed for image classification, with MobileNetV1&2 and ShuffleNet exploring channel-wise separable convolutions and expanded bottlenecks. Several methods for neural architecture search have also been proposed.
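To make the channel-wise separable building block concrete, here is a minimal PyTorch sketch: a depthwise 3×3×3 convolution that applies one filter per channel, followed by a pointwise 1×1×1 convolution that mixes channels. The class name and channel sizes are illustrative and not taken from the X3D repository; the actual X3D block additionally uses normalization, non-linearities, and channel attention, which are omitted here.

```python
# Minimal sketch of a channel-wise separable 3D convolution in PyTorch.
# Class name and channel sizes are illustrative, not from the X3D codebase.
import torch
import torch.nn as nn

class SeparableConv3d(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depthwise: one 3x3x3 filter per input channel (groups == channels),
        # so its cost scales with the channel count rather than its square.
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1,
                                   groups=in_ch, bias=False)
        # Pointwise: a 1x1x1 convolution that mixes information across channels.
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A video clip tensor has shape (batch, channels, time, height, width).
x = torch.randn(1, 24, 4, 56, 56)
y = SeparableConv3d(24, 48)(x)
print(y.shape)  # torch.Size([1, 48, 4, 56, 56])
```

Compared with a dense 3×3×3 convolution, whose multiply-adds per output location scale with in_ch × out_ch × 27, the separable version costs roughly in_ch × 27 + in_ch × out_ch, which is why the base model can stay so light.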
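The expansion procedure reads naturally as greedy coordinate descent. The Python sketch below illustrates the loop under stated assumptions: train_and_score is a toy stand-in with made-up per-axis returns (a real run would train each tiny candidate on video data and measure validation accuracy and multiply-adds), and the per-step expansion factor and compute budget are illustrative values, not the paper's.

```python
# Sketch of X3D's stepwise expansion as greedy coordinate descent over six axes.
import math

AXES = {"duration": 1.0, "frame_rate": 1.0, "resolution": 1.0,
        "width": 1.0, "bottleneck": 1.0, "depth": 1.0}

def train_and_score(config):
    """Toy stand-in for training + validation. The per-axis weights and the
    concave (sqrt-log) returns are made up for illustration; a real run
    trains the candidate network and evaluates it on a validation set."""
    weights = {"duration": 0.9, "frame_rate": 0.7, "resolution": 1.0,
               "width": 0.5, "bottleneck": 0.8, "depth": 0.6}
    acc = sum(w * math.sqrt(math.log(config[a])) for a, w in weights.items())
    madds = 1e7 * math.prod(config.values())  # compute grows multiplicatively
    return acc, madds

def expand(config, step=2.0, budget=5e9):
    """Expand one axis at a time until the compute budget is reached."""
    config = dict(config)
    while True:
        candidates = []
        for axis in config:
            cand = dict(config)
            cand[axis] *= step  # each candidate expands exactly one axis
            acc, madds = train_and_score(cand)
            candidates.append((acc, madds, cand))
        # Every candidate sits at roughly the same compute level, so picking
        # the highest validation accuracy approximates the best
        # computation/accuracy trade-off at this step.
        acc, madds, config = max(candidates, key=lambda c: c[0])
        if madds >= budget:
            return config

print(expand(AXES))  # prints the expanded per-axis factors
```

In the paper, each candidate is a real network trained and validated on the target dataset, and expansion stops once the desired multiply-add budget is reached; the greedy one-axis-at-a-time structure is the same.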