Is Space-Time Attention All You Need for Video Understanding?


9 Jun 2021 | Gedas Bertasius, Heng Wang, Lorenzo Torresani
This paper introduces TimeSformer, a video classification model that replaces convolutional operations with self-attention for spatiotemporal feature learning. Built on the Vision Transformer (ViT) architecture, TimeSformer decomposes each frame into patches that are linearly mapped into token embeddings, then applies self-attention to capture long-range dependencies across both space and time. This allows the model to process longer video clips than traditional 3D convolutional networks.

The paper compares several self-attention schemes, including space-only, joint space-time, and divided space-time attention, and finds that "divided attention", in which temporal attention and spatial attention are applied separately within each block, gives the best accuracy. With this design, TimeSformer achieves state-of-the-art performance on action recognition benchmarks such as Kinetics-400 and Kinetics-600, while being faster to train and more efficient at inference than 3D convolutional networks, and capable of handling videos over one minute long.

Evaluated across several action recognition datasets, TimeSformer outperforms existing methods in both accuracy and efficiency. It is also shown to be effective for long-term video modeling on the HowTo100M dataset, recognizing activities over extended temporal spans. These results position TimeSformer as a promising alternative to traditional convolutional video models, particularly for tasks that require long-term video understanding.
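To make the "divided attention" idea concrete, below is a minimal PyTorch sketch of one Transformer block that applies temporal attention (each patch attends to the same spatial location across frames) followed by spatial attention (each patch attends to all patches in its own frame). The class name, dimensions, and hyperparameters are illustrative assumptions, not the authors' exact implementation, and details such as classification-token handling and dropout are omitted for brevity.

```python
# Minimal sketch of a divided space-time attention block, assuming PyTorch.
# Names and shapes are illustrative; not the official TimeSformer code.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """One block: temporal attention, then spatial attention, then an MLP."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)  # temporal
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)  # spatial
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x, T, N):
        # x: (B, T*N, dim) patch tokens, ordered frame by frame:
        # T frames, N patches per frame.
        B, _, D = x.shape

        # Temporal attention: group tokens by spatial location, attend across frames.
        xt = self.norm_t(x).reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        at, _ = self.attn_t(xt, xt, xt)
        at = at.reshape(B, N, T, D).permute(0, 2, 1, 3).reshape(B, T * N, D)
        x = x + at

        # Spatial attention: group tokens by frame, attend within each frame.
        xs = self.norm_s(x).reshape(B * T, N, D)
        attended_s, _ = self.attn_s(xs, xs, xs)
        x = x + attended_s.reshape(B, T * N, D)

        # Standard Transformer MLP with residual connection.
        return x + self.mlp(self.norm_m(x))


# Example usage with assumed settings: 8 frames of 224x224 pixels split into
# 16x16 patches gives N = 14 * 14 = 196 patches per frame.
block = DividedSpaceTimeBlock()
tokens = torch.randn(2, 8 * 196, 768)
out = block(tokens, T=8, N=196)  # same shape as the input: (2, 1568, 768)
```

Because temporal and spatial attention each operate over a shorter sequence (T or N tokens) rather than the full T*N joint sequence, this factorization is what lets the model scale to longer clips more cheaply than joint space-time attention.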