Gül Varol, Ivan Laptev, and Cordelia Schmid, Fellow, IEEE
This paper introduces and evaluates long-term temporal convolutions (LTC) for action recognition, demonstrating their effectiveness in capturing the spatio-temporal structure of human actions. The authors use neural networks with LTC to learn video representations over extended temporal extents, improving action recognition accuracy compared to methods that model actions over short clips. They also study the influence of different low-level representations, such as raw video pixel values and optical flow vector fields, and show that high-quality optical flow estimation is important for learning accurate action models. The proposed method achieves state-of-the-art results on two challenging benchmarks: UCF101 (92.7%) and HMDB51 (67.2%). The main contributions are the demonstration of the advantages of long-term temporal convolutions and of the importance of high-quality optical flow estimation for learning efficient video representations for human action recognition.
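To make the idea of long-term temporal convolutions concrete, the following is a minimal PyTorch sketch of a 3D convolutional network applied to a long (e.g., 60-frame) optical-flow clip. The layer widths, kernel sizes, and the 60-frame, 58x58 input resolution are illustrative assumptions rather than the paper's exact architecture; the point is that the 3D kernels convolve over the full temporal extent of a long clip instead of a handful of frames.

```python
# Minimal sketch of an LTC-style 3D CNN (assumed architecture, for illustration).
import torch
import torch.nn as nn

class LTCNet(nn.Module):
    def __init__(self, in_channels=2, num_classes=101):
        # in_channels=2 for optical flow (x/y components); use 3 for raw RGB.
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),  # 3x3x3 space-time kernel
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=2, stride=2),  # halve the temporal and spatial dims
            )

        # Five conv blocks progressively pool a 60x58x58 clip down to a 1x1x1 feature map.
        self.features = nn.Sequential(
            block(in_channels, 64),
            block(64, 128),
            block(128, 256),
            block(256, 256),
            block(256, 256),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(2048),  # infers the flattened feature size on first forward pass
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(2048, 2048),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(2048, num_classes),
        )

    def forward(self, x):
        # x: (batch, channels, T, H, W), e.g. one 60-frame flow clip: (B, 2, 60, 58, 58)
        return self.classifier(self.features(x))

# Usage: classify a single 60-frame optical-flow clip over 101 action classes.
clip = torch.randn(1, 2, 60, 58, 58)
logits = LTCNet()(clip)
print(logits.shape)  # torch.Size([1, 101])
```

Because every Conv3d kernel spans time as well as space, stacking these blocks over a 60-frame input lets the receptive field cover the entire action, which is the core contrast with short-clip (e.g., 16-frame) models.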