12 Apr 2018 | Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri
This paper presents an empirical study of spatiotemporal convolutions for action recognition in video. The authors propose a new spatiotemporal convolutional block called R(2+1)D, which factorizes a full 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. Networks built from this block achieve results comparable or superior to the state of the art on several benchmarks, including Sports-1M, Kinetics, UCF101, and HMDB51. Compared to a full 3D convolution with a matched parameter count, the (2+1)D decomposition doubles the number of nonlinearities in the network, increasing its representational capacity, and it also makes optimization easier, yielding lower training error. The authors compare several spatiotemporal designs, including pure 2D convolutions, mixed 3D-2D convolutions, and full 3D convolutions, and find that R(2+1)D outperforms them in accuracy at comparable computational cost. The paper also examines the effect of clip length and shows that training on longer clips leads to better performance. The authors conclude that R(2+1)D is a competitive architecture for action recognition and that further research is needed to explore architectures better suited to this task.
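To make the decomposition concrete, here is a minimal PyTorch sketch of one (2+1)D block (not the authors' reference code): a 1×d×d spatial convolution, a ReLU (the extra nonlinearity), then a t×1×1 temporal convolution. The class name `R2Plus1dConv` is illustrative, and the intermediate width `mid` is chosen, as described in the paper, so the block's parameter count roughly matches a full t×d×d 3D kernel.

```python
import torch
import torch.nn as nn


class R2Plus1dConv(nn.Module):
    """Sketch of a (2+1)D block: spatial conv -> ReLU -> temporal conv."""

    def __init__(self, in_ch, out_ch, t=3, d=3):
        super().__init__()
        # Choose the intermediate width so parameters ~ match a full
        # t*d*d 3D convolution with in_ch inputs and out_ch outputs.
        mid = (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)
        self.spatial = nn.Conv3d(in_ch, mid, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid)
        self.relu = nn.ReLU(inplace=True)  # the added nonlinearity
        self.temporal = nn.Conv3d(mid, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):
        # x has shape (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))


# Usage: an 8-frame, 112x112 RGB clip.
clip = torch.randn(1, 3, 8, 112, 112)
block = R2Plus1dConv(3, 64)
print(block(clip).shape)  # torch.Size([1, 64, 8, 112, 112])
```

Stacking such blocks in a residual network gives the R(2+1)D architecture; the hyperparameters here (t=3, d=3, batch norm placement) are plausible defaults rather than the paper's exact configuration.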