Evaluation of Local Spatio-temporal Features for Action Recognition

This paper evaluates and compares local spatio-temporal features for action recognition in a common experimental setup. The authors consider four feature detectors (Harris3D, Cuboid, Hessian, and dense sampling) and six local feature descriptors (HOG/HOF, HOG, HOF, HOG3D, ESURF, and Cuboid). They use a standard bag-of-features SVM approach to evaluate performance on three datasets of varying difficulty: KTH, UCF Sports, and Hollywood2.

The main findings include:

1. **Regular Sampling vs. Interest Point Detectors**: Regular (dense) sampling consistently outperforms all tested space-time interest point detectors for human actions in realistic settings.
2. **Consistent Ranking Across Datasets**: Most methods rank consistently across the datasets, which suggests that the comparison generalizes.
3. **Performance on Datasets**:
   - **KTH Dataset**: The Harris3D detector combined with the HOF or HOG/HOF descriptor achieves the best results.
   - **UCF Sports Dataset**: Dense sampling performs best, capturing different types of motion as well as background context.
   - **Hollywood2 Dataset**: Dense sampling also performs well, with the HOG/HOF and HOF descriptors providing the best results.
4. **Influence of Shot Boundaries**: Removing features that fall on shot boundaries does not significantly affect performance.
5. **Subsampling**: Using full spatial resolution significantly improves performance compared with half resolution.
6. **Computational Complexity**: The Cuboid detector extracts the densest features but is the slowest, while Hessian extracts the sparsest features and is the most efficient.

The paper concludes that dense sampling is superior in realistic settings but produces a very large number of features, which may be challenging to handle. The best choice of detector and descriptor depends on the dataset and on the type of action being recognized.
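The summary does not spell out the bag-of-features step, so the following is only a minimal sketch of how such a representation is commonly built, not the authors' implementation. It assumes a visual vocabulary has already been learned (for example with k-means), assigns each local descriptor to its nearest visual word, and compares the resulting histograms with a chi-square kernel, a common choice for histogram features. The toy `vocabulary`, the descriptor values, and the `scale` parameter are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to `descriptor` (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - w) ** 2 for d, w in zip(descriptor, vocabulary[k])),
    )


def bag_of_features(descriptors, vocabulary):
    """L1-normalised histogram of visual-word occurrences for one video."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in range(len(vocabulary))]


def chi2_distance(h1, h2):
    """Chi-square distance between two histograms; empty bins are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def chi2_kernel(h1, h2, scale=1.0):
    """Kernel value exp(-distance / scale); `scale` is a hypothetical tuning parameter."""
    return math.exp(-chi2_distance(h1, h2) / scale)


# Toy data (hypothetical): three 2-D visual words and two "videos".
vocabulary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
video_a = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]]
video_b = [[0.1, 0.9], [0.0, 1.0]]

hist_a = bag_of_features(video_a, vocabulary)
hist_b = bag_of_features(video_b, vocabulary)
similarity = chi2_kernel(hist_a, hist_b)
```

In a full pipeline, the kernel values between all pairs of training videos would form a matrix that is passed to an SVM trained on a precomputed kernel. With dense sampling, the number of descriptors per video grows quickly, so the nearest-word search above would normally be vectorised or approximated.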

2009 | Heng Wang, Muhammad Muneeb Ullah, Alexander Kläser, Ivan Laptev, Cordelia Schmid