Video Swin Transformer

24 Jun 2021 | Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, Han Hu
The paper introduces the Video Swin Transformer, a pure Transformer-based architecture for video recognition. Unlike previous approaches that compute global self-attention, the Video Swin Transformer leverages spatiotemporal locality to improve the speed-accuracy trade-off. The architecture is adapted from the Swin Transformer, which was designed for image recognition, and extends it to handle video data by incorporating spatiotemporal attention. The key innovation is the use of shifted 3D windows to compute local attention, which reduces computation and model size while maintaining performance. The model achieves state-of-the-art accuracy on several video recognition benchmarks, including action recognition on Kinetics-400 and Kinetics-600, and temporal modeling on Something-Something v2. The paper also discusses the initialization of the model using pre-trained image models and provides ablation studies to validate the effectiveness of the proposed design choices. The code and models are made publicly available to facilitate further research.
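To make the shifted 3D window mechanism concrete, the sketch below shows how a video feature map can be partitioned into non-overlapping spatiotemporal windows and cyclically shifted between successive layers so that neighboring windows exchange information. This is a minimal illustration, not the authors' released implementation: the function names, the (B, D, H, W, C) tensor layout, and the 8x7x7 window size are assumptions made here for demonstration.

```python
import torch


def window_partition_3d(x, window_size):
    """Partition a video feature map into non-overlapping 3D windows.

    x: tensor of shape (B, D, H, W, C); window_size: (wd, wh, ww).
    Returns a tensor of shape (num_windows * B, wd * wh * ww, C), so
    self-attention can be computed independently within each window.
    """
    B, D, H, W, C = x.shape
    wd, wh, ww = window_size
    x = x.view(B, D // wd, wd, H // wh, wh, W // ww, ww, C)
    # Bring the three window axes together, then flatten each window into a token sequence.
    windows = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wd * wh * ww, C)
    return windows


def shift_3d(x, shift_size):
    """Cyclically shift the feature map along time, height, and width so that
    the next layer's windows straddle the previous layer's window boundaries."""
    sd, sh, sw = shift_size
    return torch.roll(x, shifts=(-sd, -sh, -sw), dims=(1, 2, 3))


if __name__ == "__main__":
    # Toy example: 2 clips, 16 temporal tokens, a 56x56 spatial grid, 96 channels.
    x = torch.randn(2, 16, 56, 56, 96)
    window_size = (8, 7, 7)   # assumed window size for illustration
    shift_size = (4, 3, 3)    # half-window shift used on alternating layers
    windows = window_partition_3d(x, window_size)
    print(windows.shape)      # -> (256, 392, 96): 256 windows of 8*7*7 tokens each
    shifted_windows = window_partition_3d(shift_3d(x, shift_size), window_size)
    print(shifted_windows.shape)  # same window count; tokens are regrouped across old boundaries
```

Because attention is restricted to each window, the cost grows linearly with the number of windows rather than quadratically with the full spatiotemporal token count, which is the source of the improved speed-accuracy trade-off described above.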