ViViT: A Video Vision Transformer


1 Nov 2021 | Anurag Arnab*, Mostafa Dehghani*, Georg Heigold, Chen Sun, Mario Lučić†, Cordelia Schmid†
**Abstract:** We present pure-transformer based models for video classification, inspired by the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. To handle the long token sequences that arise in video, we propose several efficient variants of our model that factorize the spatial and temporal dimensions of the input. Although transformer-based models are known to be effective only when large training datasets are available, we show how to effectively regularize the model during training and leverage pre-trained image models to train on comparatively small datasets. We conduct thorough ablation studies and achieve state-of-the-art results on multiple video classification benchmarks, outperforming prior methods based on deep 3D convolutional networks.
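To make the tokenization step concrete, below is a minimal PyTorch-style sketch of mapping a video into a sequence of spatio-temporal tokens via non-overlapping 3D "tubelet" patches. The class name `TubeletEmbedding`, the tubelet size, and the embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class TubeletEmbedding(nn.Module):
    """Sketch: project non-overlapping t x h x w tubelets into token embeddings.

    A 3D convolution whose kernel size and stride both equal the tubelet size
    extracts one token per tubelet (assumed configuration, for illustration).
    """

    def __init__(self, in_channels=3, embed_dim=768, tubelet_size=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=tubelet_size, stride=tubelet_size,
        )

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        x = self.proj(video)              # (B, D, nt, nh, nw)
        x = x.flatten(2).transpose(1, 2)  # (B, nt*nh*nw, D): temporal-major token sequence
        return x


# Usage: a 32-frame 224x224 clip yields 16 * 14 * 14 = 3136 tokens of width 768.
tokens = TubeletEmbedding()(torch.randn(1, 3, 32, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 768])
```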
**Introduction:** Deep convolutional neural networks have driven advances in image classification, while attention-based architectures originally developed for natural language processing have recently shown promise in computer vision. Inspired by the success of the Vision Transformer (ViT) in image classification, we develop pure-transformer models for video classification. We propose several variants of our model, including more efficient ones that factorize the spatial and temporal dimensions of the input video (a sketch of such a factorized encoder follows the evaluation summary below). We also show how additional regularization and pre-trained image models can be used to improve performance on smaller datasets. Our models achieve state-of-the-art results on multiple popular video classification benchmarks.

**Related Work:** Video understanding has mirrored advances in image recognition, evolving from hand-crafted features to 2D convolutional networks and then to 3D CNNs. Recent work has explored attention- and transformer-based approaches, but these have primarily been used as layers within CNNs or for image classification. Our work extends these ideas to video classification, addressing the challenge of handling long sequences of spatio-temporal tokens.

**Empirical Evaluation:** We describe our experimental setup and implementation details, including network architecture, training, and datasets. We conduct ablation studies to evaluate the impact of different components of our model, such as input encoding methods, model variants, and regularization techniques. Our models achieve state-of-the-art results on multiple datasets, including Kinetics 400 and 600, Epic Kitchens, Something-Something v2, and Moments in Time.
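As referenced in the introduction, one way to factorize space and time is to first encode the tokens of each temporal index with a spatial transformer and then fuse the resulting per-frame representations with a temporal transformer. The sketch below illustrates that idea with standard PyTorch transformer layers; the name `FactorizedEncoder`, the layer counts, and mean-pooling in place of class tokens are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class FactorizedEncoder(nn.Module):
    """Sketch: spatial transformer per temporal index, then a temporal transformer."""

    def __init__(self, embed_dim=768, num_heads=12,
                 spatial_layers=12, temporal_layers=4, num_classes=400):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers=spatial_layers)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=temporal_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens, num_temporal):
        # tokens: (batch, num_temporal * tokens_per_frame, embed_dim),
        # ordered temporal-major as produced by the tubelet sketch above.
        b, n, d = tokens.shape
        per_frame = tokens.reshape(b * num_temporal, n // num_temporal, d)
        frame_repr = self.spatial(per_frame).mean(dim=1)    # pool spatial tokens
        frame_repr = frame_repr.reshape(b, num_temporal, d)
        clip_repr = self.temporal(frame_repr).mean(dim=1)   # pool over time
        return self.head(clip_repr)


# Usage with the tokens from the tubelet sketch (16 temporal indices):
logits = FactorizedEncoder()(tokens, num_temporal=16)
print(logits.shape)  # torch.Size([1, 400])
```

Because each frame's tokens never attend to other frames in the spatial stage, the attention cost scales with the number of tokens per frame plus the number of temporal indices, rather than with their product, which is what makes this variant cheaper than full joint spatio-temporal attention.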