Sep 2008 | Alexander Kläser, Marcin Marszałek, Cordelia Schmid
This paper presents a novel spatio-temporal descriptor based on 3D gradients for video action recognition. The descriptor is built on histograms of oriented 3D spatio-temporal gradients. The key contributions are: (i) an efficient algorithm for computing 3D gradients at arbitrary spatial and temporal scales using integral videos; (ii) a generic 3D orientation quantization based on regular polyhedra; (iii) an in-depth evaluation and optimization of all descriptor parameters for action recognition; and (iv) application of the descriptor to three action datasets (KTH, Weizmann, Hollywood), where it matches or outperforms the state of the art.
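The integral-video idea behind contribution (i) can be sketched as follows. A minimal NumPy illustration (function names are hypothetical, not from the paper's code): after one pass of cumulative sums over the t, y and x axes, the sum of values inside any spatio-temporal cuboid costs eight lookups, independent of the cuboid's size, which is what makes gradient computation at arbitrary scales cheap.

```python
import numpy as np

def integral_video(v):
    """Integral video: cumulative sums over t, y and x.
    A leading zero border makes cuboid lookups uniform."""
    iv = v.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def cuboid_sum(iv, t0, t1, y0, y1, x0, x1):
    """Sum of v[t0:t1, y0:y1, x0:x1] in O(1) via 3D inclusion-exclusion."""
    return (iv[t1, y1, x1]
            - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0]
            - iv[t0, y0, x0])
```

A mean gradient over any cuboid then follows by dividing such sums of per-pixel derivatives by the cuboid volume.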
The descriptor builds on the success of HOG descriptors for static images, generalizing them to 3D. It computes 3D gradients at arbitrary spatial and temporal scales using integral videos. Orientation quantization is performed with regular polyhedra, such as the icosahedron, which is more robust and efficient than traditional polar-coordinate binning. The descriptor is evaluated on three action datasets: KTH, Weizmann, and Hollywood. It achieves high accuracy on all three, outperforming existing methods on two and matching them on the third.
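The polyhedron-based quantization can be sketched like this. The 20 face centers of a regular icosahedron (equivalently, the vertices of a dodecahedron) serve as histogram bin directions; a 3D gradient votes for the bin whose direction it is closest to, weighted by its magnitude. This is a simplified winner-takes-all sketch with hypothetical function names; the paper's full scheme distributes votes over neighboring bins via thresholded projections.

```python
import numpy as np

PHI = (1 + 5 ** 0.5) / 2  # golden ratio

def icosahedron_face_centers():
    """The 20 face centers of a regular icosahedron (= dodecahedron
    vertices), normalized to unit length."""
    pts = [(sx, sy, sz) for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    a, b = 1 / PHI, PHI
    for s1 in (-1, 1):
        for s2 in (-1, 1):
            pts += [(0, s1 * a, s2 * b), (s1 * a, s2 * b, 0), (s1 * b, 0, s2 * a)]
    p = np.array(pts, dtype=float)
    return p / np.linalg.norm(p, axis=1, keepdims=True)

def quantize_gradient(g, centers):
    """Winner-takes-all vote: the bin whose face center is closest in
    direction receives the full gradient magnitude."""
    h = np.zeros(len(centers))
    mag = np.linalg.norm(g)
    if mag > 0:
        h[np.argmax(centers @ (g / mag))] = mag
    return h
```

Because the bin directions come from a regular polyhedron, they cover the sphere uniformly, avoiding the pole singularities of polar-coordinate binning.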
The descriptor is constructed by dividing the video volume into subblocks, computing the mean gradient of each subblock, quantizing its orientation, and accumulating the quantized gradients into histograms. The histograms are concatenated into a feature vector, which is normalized and used for classification. All descriptor parameters are optimized for action recognition; the best results are achieved with an icosahedron and full orientation quantization.
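The assembly steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: names are hypothetical, and for self-containedness it quantizes into six axis-aligned bins (±x, ±y, ±t) rather than the paper's polyhedron-based bins.

```python
import numpy as np

# Simplified stand-in for polyhedron bins: 6 axis directions.
AXES = np.vstack([np.eye(3), -np.eye(3)])

def subblock_histogram(grad, sub=2, bins=AXES):
    """Split a cell's gradient field (T, Y, X, 3) into sub^3 subblocks,
    quantize each subblock's mean gradient, and sum the magnitude-weighted
    votes into one histogram."""
    T, Y, X, _ = grad.shape
    h = np.zeros(len(bins))
    for t in range(sub):
        for y in range(sub):
            for x in range(sub):
                block = grad[t*T//sub:(t+1)*T//sub,
                             y*Y//sub:(y+1)*Y//sub,
                             x*X//sub:(x+1)*X//sub]
                g = block.reshape(-1, 3).mean(axis=0)  # mean gradient
                mag = np.linalg.norm(g)
                if mag > 0:
                    h[np.argmax(bins @ (g / mag))] += mag
    return h

def descriptor(cell_histograms, eps=1e-6):
    """Concatenate per-cell histograms and L2-normalize the feature vector."""
    v = np.concatenate(cell_histograms)
    return v / (np.linalg.norm(v) + eps)
```

Averaging gradients over subblocks before quantization is what gives the descriptor its robustness to noise and small deformations.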
The descriptor is applied to the KTH, Weizmann, and Hollywood datasets, achieving state-of-the-art performance on all three. It outperforms existing methods on two of the datasets and matches on the third. The results show that the proposed descriptor is effective for action recognition in videos, with high accuracy and robustness to changes in illumination and small deformations. The method is efficient and memory-friendly, making it suitable for real-world applications. Future work includes learning descriptor parameters on a per-class basis to further improve performance.