2 Jun 2017 | Gül Varol, Ivan Laptev, and Cordelia Schmid, Fellow, IEEE
This paper introduces long-term temporal convolutions (LTC) for action recognition in videos. The authors propose a neural network architecture with LTC to learn video representations that capture the full temporal extent of human actions. They demonstrate that LTC-CNN models with increased temporal extent improve action recognition accuracy. The study also investigates the impact of different low-level representations, such as raw pixel values and optical flow vector fields, showing that high-quality optical flow estimation is crucial for learning accurate action models. The authors report state-of-the-art results on two benchmarks: UCF101 (92.7%) and HMDB51 (67.2%).
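As a concrete illustration of the optical-flow input modality discussed above, the snippet below computes dense flow between consecutive frames and stacks the x/y vector fields into a clip tensor suitable for a 3D network. This is a minimal sketch only: OpenCV's Farnebäck estimator is used as a readily available stand-in (the paper's point is precisely that higher-quality flow estimators help), and the clipping threshold and tensor layout are assumptions, not the authors' preprocessing.

```python
import cv2
import numpy as np

def flow_clip(frames: list[np.ndarray]) -> np.ndarray:
    """Stack dense optical-flow fields from consecutive grayscale frames.

    Farnebäck flow is an illustrative stand-in for a higher-quality
    estimator. Returns an array of shape (2, len(frames) - 1, H, W)
    holding the x and y flow components as two input channels.
    """
    fields = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Dense flow between a pair of single-channel uint8 frames, shape (H, W, 2).
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        fields.append(flow)
    clip = np.stack(fields, axis=0)                        # (T-1, H, W, 2)
    clip = np.clip(clip, -20.0, 20.0)                      # bound large displacements (assumed threshold)
    return clip.transpose(3, 0, 1, 2)                      # (2, T-1, H, W)
```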
The paper discusses the challenges of action recognition, noting that typical human actions last several seconds and span many video frames, whereas conventional architectures operate on much shorter clips and therefore fail to capture this long-term structure, leading to suboptimal performance. The authors propose a network architecture with LTC to address this issue, enabling the model to learn representations over extended periods of time. They also explore the impact of varying temporal and spatial resolutions, as well as different input modalities (RGB and optical flow); a sketch of such a network follows.
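The sketch below shows, in PyTorch, how a stack of small 3x3x3 space-time convolutions applied to a long clip (e.g. 60 frames at reduced spatial resolution) yields a long temporal receptive field. Layer counts, channel widths, pooling schedule, and input sizes here are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LTCSketch(nn.Module):
    """Illustrative 3D CNN over long clips (hypothetical layer sizes)."""
    def __init__(self, num_classes: int = 101, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            # Small 3x3x3 space-time kernels; the long temporal extent comes
            # from feeding many frames rather than from larger kernels.
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # keep full temporal length early
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),               # collapse remaining space-time extent
            nn.Flatten(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width), e.g. (2, 3, 60, 58, 58)
        return self.classifier(self.features(x))

# Example: two 60-frame RGB clips at 58x58 spatial resolution.
clip = torch.randn(2, 3, 60, 58, 58)
logits = LTCSketch()(clip)
print(logits.shape)  # torch.Size([2, 101])
```

Using many frames at a smaller spatial resolution keeps the input volume, and hence the compute and memory cost, comparable to a short high-resolution clip while extending the temporal context the network can model.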
The authors compare their method with state-of-the-art approaches and show that LTC outperforms 2D convolutions as well as 3D networks with shorter temporal extents. They also demonstrate that combining LTC networks trained on optical flow and on RGB inputs leads to significant further improvements in action recognition. The study highlights the importance of high-quality optical flow estimation and the benefits of using long-term temporal convolutions for capturing complex motion patterns in video.
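A simple way to combine the two input modalities, in the spirit of the score-level fusion commonly used in two-stream setups, is to average per-class probabilities from separately trained RGB and flow networks. This is only a sketch of that idea; the weighting below is a hypothetical hyper-parameter, not the paper's reported fusion scheme or setting.

```python
import torch
import torch.nn.functional as F

def fuse_scores(rgb_logits: torch.Tensor,
                flow_logits: torch.Tensor,
                flow_weight: float = 0.5) -> torch.Tensor:
    """Late fusion of class scores from RGB and optical-flow networks.

    Both tensors have shape (batch, num_classes); flow_weight is a
    hypothetical mixing coefficient chosen for illustration.
    """
    rgb_probs = F.softmax(rgb_logits, dim=1)
    flow_probs = F.softmax(flow_logits, dim=1)
    return (1.0 - flow_weight) * rgb_probs + flow_weight * flow_probs

# Example with random scores for a 101-class problem (e.g. UCF101).
rgb = torch.randn(4, 101)
flow = torch.randn(4, 101)
pred = fuse_scores(rgb, flow).argmax(dim=1)
```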
The paper also discusses the computational efficiency of the proposed method, showing that it is feasible to train and test on large-scale datasets. The authors conclude that LTC provides a significant improvement in action recognition performance, particularly for long actions that span many video frames. The results show that LTC networks achieve state-of-the-art performance on two widely used benchmarks for human action recognition.