2010 | Florent Perronnin, Jorge Sánchez, and Thomas Mensink
The paper "Improving the Fisher Kernel for Large-Scale Image Classification" by Florent Perronnin, Jorge Sánchez, and Thomas Mensink addresses image classification with large annotated datasets. The authors propose three modifications to the Fisher Kernel (FK) framework that improve its accuracy: L2 normalization, power normalization, and spatial pyramids. L2 normalization removes the dependence of the Fisher vector on the mixing coefficient, that is, the proportion of image-specific (as opposed to background) content in an image. Power normalization counters the sparsity of Fisher vectors by applying a signed power function, sign(z)|z|^α with 0 ≤ α ≤ 1, to each dimension. Spatial pyramids add coarse spatial information by extracting a Fisher vector from each sub-region of an image.
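The two normalizations and the pyramid concatenation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, alpha=0.5 (a signed square root) is used as the default, and normalizing each region's vector before concatenation is an assumption made here for simplicity. Computing the raw Fisher vectors from a Gaussian mixture model is out of scope and the inputs are taken as given.

```python
import numpy as np


def power_normalize(v, alpha=0.5):
    """Signed power function sign(z) * |z|**alpha, applied to each dimension."""
    v = np.asarray(v, dtype=float)
    return np.sign(v) * np.abs(v) ** alpha


def l2_normalize(v, eps=1e-12):
    """Scale the vector to unit Euclidean norm (eps guards against all-zero input)."""
    v = np.asarray(v, dtype=float)
    return v / max(np.linalg.norm(v), eps)


def normalize_fisher_vector(v, alpha=0.5):
    """Power normalization followed by L2 normalization."""
    return l2_normalize(power_normalize(v, alpha))


def spatial_pyramid(region_vectors, alpha=0.5):
    """Concatenate per-region vectors after normalizing each one.

    Assumption for this sketch: each region (whole image, sub-regions)
    already has its own raw Fisher vector, and each is normalized
    independently before concatenation.
    """
    return np.concatenate(
        [normalize_fisher_vector(r, alpha) for r in region_vectors]
    )


if __name__ == "__main__":
    raw = np.array([4.0, -9.0, 0.0, 0.25])  # toy stand-in for a Fisher vector
    print(power_normalize(raw))             # small entries grow relative to large ones
    print(np.linalg.norm(normalize_fisher_vector(raw)))
    print(spatial_pyramid([raw, raw / 2.0]).shape)
```

Because every region vector ends up with unit norm, a dot product between two such representations is a sum of per-region cosine similarities, which suits the linear classifiers the paper relies on.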
The improvements are evaluated on two datasets: PASCAL VOC 2007 and CalTech 256. On PASCAL VOC 2007, the Average Precision (AP) rises from 47.9% to 58.3%, and on CalTech 256 the system reports state-of-the-art results using only SIFT descriptors and linear classifiers. The authors also compare two large-scale sources of labeled training images: ImageNet and Flickr groups. They find that classifiers trained on Flickr groups perform surprisingly well and complement classifiers trained on more carefully annotated datasets such as VOC 2007.
The paper concludes that the improved Fisher kernel has the potential to become a new standard representation in image classification, especially for large-scale applications. The results also highlight the complementary nature of different training resources, suggesting that combining them can lead to better classification performance.