ViViT: A Video Vision Transformer

1 Nov 2021 | Anurag Arnab* Mostafa Dehghani* Georg Heigold Chen Sun Mario Lučić† Cordelia Schmid†
ViViT is a pure-transformer model for video classification, inspired by the success of transformers in image classification. The model extracts spatio-temporal tokens from the input video and encodes them with a series of transformer layers. Because video produces long token sequences, the model comes in several variants that factorize it along the spatial and temporal dimensions, including a factorized encoder and factorized self-attention, to improve efficiency. To train effectively on comparatively small video datasets, ViViT is regularized during training and initialized from pre-trained image models, whose weights and positional embeddings are adapted to video input. Empirically, ViViT outperforms prior methods based on deep 3D convolutional networks and achieves state-of-the-art results on Kinetics 400 and 600, Epic Kitchens, Something-Something v2, and Moments in Time.
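
To make the efficiency argument concrete, the sketch below counts how many spatio-temporal tokens a clip yields and how many query-key pairs one attention layer must score, with and without factorization. It is a back-of-the-envelope illustration, not the authors' implementation: the function names, the 32 x 224 x 224 clip, and the 2 x 16 x 16 token size are assumptions chosen for the example, classification tokens are ignored, and the factorized count follows the "spatial attention, then temporal attention" pattern.

```python
def tubelet_grid(frames, height, width, t, h, w):
    """Grid of non-overlapping t x h x w spatio-temporal tokens in one clip.

    Returns (n_t, n_h, n_w): token counts along time, height and width.
    Clip and token sizes are illustrative assumptions.
    """
    return frames // t, height // h, width // w


def joint_attention_pairs(n_t, n_h, n_w):
    """Query-key pairs when every token attends to every other token."""
    n = n_t * n_h * n_w
    return n * n


def factorized_attention_pairs(n_t, n_h, n_w):
    """Query-key pairs when attention is split into two steps.

    Spatial step: within each of the n_t temporal indices, the n_s tokens
    attend to each other. Temporal step: at each of the n_s spatial
    positions, the n_t tokens attend to each other.
    """
    n_s = n_h * n_w
    spatial = n_t * n_s * n_s
    temporal = n_s * n_t * n_t
    return spatial + temporal


# Hypothetical clip: 32 frames of 224 x 224 pixels, tokens of 2 x 16 x 16.
n_t, n_h, n_w = tubelet_grid(32, 224, 224, t=2, h=16, w=16)
joint = joint_attention_pairs(n_t, n_h, n_w)
factorized = factorized_attention_pairs(n_t, n_h, n_w)

print("tokens per clip:", n_t * n_h * n_w)
print("joint attention pairs:", joint)
print("factorized attention pairs:", factorized)
print("reduction factor:", round(joint / factorized, 1))
```

Under these assumed sizes the clip yields 3,136 tokens, and splitting attention into spatial and temporal steps scores roughly 15 times fewer query-key pairs per layer than attending jointly over all tokens. The count covers only the attention score matrix; the projections and MLP blocks scale linearly with the number of tokens in both cases.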