9 Jun 2021 | Gedas Bertasius, Heng Wang, Lorenzo Torresani
The paper introduces TimeSformer, a convolution-free approach to video classification that leverages self-attention over space and time. By adapting the Transformer architecture to video, TimeSformer enables spatiotemporal feature learning directly from a sequence of frame-level patches. The study compares different self-attention schemes and finds that "divided attention," which separately applies temporal and spatial attention within each block, yields the best performance in video classification. Despite its novel design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including Kinetics-400 and Kinetics-600. Compared to 3D convolutional networks, TimeSformer is faster to train, achieves higher test efficiency, and can handle longer video clips (over one minute). The code and models are available on GitHub.
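To make the "divided attention" idea concrete, below is a minimal PyTorch sketch of one Transformer block that applies temporal self-attention (each patch attends across frames at the same spatial location) followed by spatial self-attention (each patch attends across locations within its frame), each with its own residual connection. This is an illustrative simplification, not the authors' implementation: it omits the classification token and the learned projections TimeSformer uses around the temporal attention, and the class name, dimensions, and frame/patch counts are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Illustrative divided space-time attention block:
    temporal attention, then spatial attention, then an MLP,
    each preceded by LayerNorm and wrapped in a residual connection."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, T, S):
        # x: (B, T*S, dim) -- patch tokens for T frames with S patches per frame
        B, N, D = x.shape

        # Temporal attention: tokens at the same spatial location attend across frames.
        xt = x.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        yt = self.norm_t(xt)
        yt, _ = self.attn_t(yt, yt, yt)
        xt = xt + yt
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3).reshape(B, N, D)

        # Spatial attention: tokens within the same frame attend to one another.
        xs = x.reshape(B * T, S, D)
        ys = self.norm_s(xs)
        ys, _ = self.attn_s(ys, ys, ys)
        xs = xs + ys
        x = xs.reshape(B, T, S, D).reshape(B, N, D)

        # Feed-forward network with residual connection.
        return x + self.mlp(self.norm_mlp(x))


# Example: 8 frames, 14x14 = 196 patches per frame, embedding dim 768 (assumed values).
tokens = torch.randn(2, 8 * 196, 768)
block = DividedSpaceTimeBlock()
out = block(tokens, T=8, S=196)
print(out.shape)  # torch.Size([2, 1568, 768])
```

The key property this sketch captures is the cost saving reported in the paper: instead of one joint attention over all T*S tokens, each token attends to only T frames in the temporal step and S locations in the spatial step, which is what lets the model scale to longer clips and higher resolutions.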