Evaluation of local spatio-temporal features for action recognition

2009 | Heng Wang, Muhammad Muneeb Ullah, Alexander Kläser, Ivan Laptev, Cordelia Schmid
This paper evaluates and compares local spatio-temporal features for action recognition in a common experimental setup. Four feature detectors and six local feature descriptors are tested on a total of 25 action classes across three datasets of varying difficulty, using a standard bag-of-features SVM approach for recognition.

The four detectors are Harris3D, Cuboid, Hessian, and dense sampling; the six descriptors are Cuboid, HOG/HOF, HOG3D, HOF, ESURF, and HOG. Performance is measured on three datasets: KTH, UCF Sports, and Hollywood2. Most methods rank consistently across the datasets, and the paper discusses their advantages and limitations.

Dense (regular) sampling of space-time features consistently outperforms all tested interest point detectors in realistic settings, most clearly on Hollywood2, but performs worse on the simple KTH dataset. The HOG/HOF descriptor performs best on Hollywood2, the most challenging dataset, while HOG3D performs best on UCF Sports when combined with dense sampling. The paper also investigates the influence of spatial video resolution and shot boundaries, and finds that features at shot boundaries do not significantly affect the evaluation results. In the comparison of computational complexity, Cuboid is the slowest and Hessian the most efficient.

These results underline the importance of using realistic experimental video data and point to the limitations of current interest point detectors.
The combination of gradient-based and optical flow-based descriptors seems to be a good choice for action recognition.
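
To make the two central ideas of the summary concrete, the following is a minimal sketch of dense sampling and of the bag-of-features representation. It is not the authors' implementation: the function names, the patch size, the 50% overlap, and the toy vocabulary are all assumptions chosen for illustration, and the descriptors are plain lists of numbers standing in for HOG/HOF-style vectors. In a full pipeline the vocabulary would typically come from clustering training descriptors, and the resulting histograms would be passed to an SVM.

```python
def dense_grid(width, height, frames, patch=(18, 18, 10), overlap=0.5):
    """Corners (x, y, t) of regularly spaced space-time patches.

    The patch size and overlap are illustrative assumptions,
    not values taken from the summary above.
    """
    pw, ph, pt = patch
    # Step between neighbouring patches along each axis (at least 1).
    sx = max(1, int(pw * (1 - overlap)))
    sy = max(1, int(ph * (1 - overlap)))
    st = max(1, int(pt * (1 - overlap)))
    return [(x, y, t)
            for t in range(0, frames - pt + 1, st)
            for y in range(0, height - ph + 1, sy)
            for x in range(0, width - pw + 1, sx)]


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (squared Euclidean)."""
    def sq_dist(k):
        return sum((d - v) ** 2 for d, v in zip(descriptor, vocabulary[k]))
    return min(range(len(vocabulary)), key=sq_dist)


def bof_histogram(descriptors, vocabulary):
    """L1-normalised histogram of visual-word occurrences for one video."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


if __name__ == "__main__":
    # Hypothetical toy data: a 2-word vocabulary and four 2-D descriptors.
    vocabulary = [[0.0, 0.0], [1.0, 1.0]]
    descriptors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
    print(len(dense_grid(160, 120, 50)), "patches sampled")
    print(bof_histogram(descriptors, vocabulary))
```

The sketch also shows why dense sampling costs more than an interest point detector: the number of patches grows with the full spatial and temporal extent of the video rather than with the number of salient locations, so many more descriptors have to be computed and quantised.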