24 Jun 2021 | Ze Liu*1,2, Jia Ning*1,3, Yue Cao1, Yixuan Wei1,4, Zheng Zhang1, Stephen Lin1, Han Hu1†
The Video Swin Transformer is a pure-transformer architecture for video recognition that leverages spatiotemporal locality to achieve a better speed-accuracy trade-off compared to previous methods. It adapts the Swin Transformer, originally designed for image recognition, to handle video data by using a 3D shifted window mechanism for self-attention. This approach allows for efficient computation and better performance on video recognition benchmarks. The model is initialized with pre-trained image models, leading to improved performance with less pre-training data and smaller model size. The Video Swin Transformer achieves state-of-the-art results on action recognition tasks such as Kinetics-400 (84.9% top-1 accuracy) and Kinetics-600 (86.1% top-1 accuracy), as well as temporal modeling on Something-Something v2 (69.6% top-1 accuracy). The model is publicly available for further research and development.
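The core mechanism the abstract names, 3D shifted-window self-attention, can be illustrated with a short sketch. The snippet below is a minimal PyTorch sketch, not the authors' released code: the function names (`window_partition_3d`, `shifted_windows_3d`) and the (2, 7, 7) window size are assumptions made for the example. It shows how a (B, D, H, W, C) video feature map is split into non-overlapping 3D windows for window-local attention, and how alternating layers cyclically shift the features by half a window so that attention connects neighboring windows.

```python
# Illustrative sketch of 3D (shifted) window partitioning; names and the
# (2, 7, 7) window size are assumptions, not the paper's released code.
import torch

def window_partition_3d(x: torch.Tensor, window_size: tuple) -> torch.Tensor:
    """Split (B, D, H, W, C) features into non-overlapping 3D windows.

    Returns (num_windows * B, Wd * Wh * Ww, C), the token layout consumed by
    window-local multi-head self-attention.
    """
    B, D, H, W, C = x.shape
    Wd, Wh, Ww = window_size
    x = x.view(B, D // Wd, Wd, H // Wh, Wh, W // Ww, Ww, C)
    windows = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return windows.view(-1, Wd * Wh * Ww, C)

def shifted_windows_3d(x: torch.Tensor, window_size: tuple) -> torch.Tensor:
    """Cyclically shift by half a window along time/height/width, then partition.

    Applied on alternating layers so that attention crosses the window
    boundaries of the preceding layer.
    """
    sd, sh, sw = (s // 2 for s in window_size)
    x = torch.roll(x, shifts=(-sd, -sh, -sw), dims=(1, 2, 3))
    return window_partition_3d(x, window_size)

# Example: an 8-frame, 56x56 feature map with 96 channels and (2, 7, 7) windows.
feats = torch.randn(1, 8, 56, 56, 96)
regular = window_partition_3d(feats, (2, 7, 7))   # (256, 98, 96)
shifted = shifted_windows_3d(feats, (2, 7, 7))    # (256, 98, 96)
print(regular.shape, shifted.shape)
```

Restricting self-attention to these fixed-size local windows is what yields the speed-accuracy trade-off the abstract claims: cost grows linearly with the number of spatiotemporal tokens rather than quadratically as in global attention.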
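The abstract also highlights initialization from pre-trained image models. One common recipe for adapting 2D weights to a video model is I3D-style inflation, sketched below under the assumption that this is the strategy used; the helper name `inflate_patch_embed` is hypothetical. The 2D patch-embedding kernel is repeated along the new temporal axis and rescaled so the inflated layer initially produces the same activations as the image model on a static clip.

```python
# Hedged sketch of inflating a 2D patch embedding to 3D; an assumed recipe,
# not necessarily the paper's exact initialization procedure.
import torch
import torch.nn as nn

def inflate_patch_embed(w2d: torch.Tensor, temporal_size: int) -> torch.Tensor:
    """Inflate a (C_out, C_in, kH, kW) kernel to (C_out, C_in, kT, kH, kW)."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1)
    return w3d / temporal_size  # rescale so the summed temporal taps preserve magnitude

# Example: turn a Swin-style 4x4 image patch embedding into a 2x4x4 video embedding.
embed2d = nn.Conv2d(3, 96, kernel_size=4, stride=4)
embed3d = nn.Conv3d(3, 96, kernel_size=(2, 4, 4), stride=(2, 4, 4))
with torch.no_grad():
    embed3d.weight.copy_(inflate_patch_embed(embed2d.weight, temporal_size=2))
    embed3d.bias.copy_(embed2d.bias)
```

Starting from image-pretrained weights in this way is what lets the model reach strong accuracy with less video pre-training data and a smaller model, as the abstract notes.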