12 Apr 2018 | Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri
This paper explores the effectiveness of different spatiotemporal convolutions for video analysis, particularly in the context of action recognition. The authors motivate their study by noting that 2D CNNs applied to individual frames of videos have shown strong performance in action recognition, while the potential of 3D CNNs remains underexplored. They demonstrate that 3D ResNets outperform 2D ResNets on large-scale action recognition benchmarks such as Sports-1M and Kinetics. Building on this finding, the authors introduce two new spatiotemporal convolutional blocks: mixed convolution (MC) and (2+1)D convolution. MC combines 3D convolutions in early layers with 2D convolutions in top layers, while (2+1)D explicitly factorizes each 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. The (2+1)D decomposition is shown to ease optimization and improve accuracy, leading to the R(2+1)D architecture, which achieves state-of-the-art performance on multiple benchmarks and outperforms both 3D and 2D ResNets. The paper also discusses the benefits of training on longer clips and the trade-offs between computational complexity and accuracy. Overall, the study highlights the importance of temporal reasoning in action recognition and provides a new framework for designing efficient and accurate spatiotemporal convolutional networks.
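To make the comparison with full 3D convolution fair, the paper sizes the hidden channel count M of each (2+1)D block so that the factored block has roughly the same number of parameters as the t×d×d 3D convolution it replaces. A minimal sketch of that channel-matching rule (function names are illustrative, not from the paper's code):

```python
import math

def midplanes(n_in, n_out, t=3, d=3):
    """Hidden channel count M for a (2+1)D block, chosen so its parameter
    count approximately matches a full t x d x d 3D convolution."""
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

def params_3d(n_in, n_out, t=3, d=3):
    # One t x d x d 3D convolution (biases ignored for simplicity).
    return t * d * d * n_in * n_out

def params_2plus1d(n_in, n_out, t=3, d=3):
    # A 1 x d x d spatial conv into M channels, then a t x 1 x 1 temporal conv.
    m = midplanes(n_in, n_out, t, d)
    return d * d * n_in * m + t * m * n_out

for n in (64, 128, 256):
    print(n, params_3d(n, n), params_2plus1d(n, n))
```

With equal input and output channels the match is exact (e.g. 64 → 64 channels gives M = 144 and 110,592 parameters either way), so any accuracy gain of R(2+1)D over R3D comes from the factorization itself, not from extra capacity.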