Improving the Fisher Kernel for Large-Scale Image Classification

2010 | Florent Perronnin, Jorge Sánchez, and Thomas Mensink
This paper presents improvements to the Fisher Kernel (FK) for large-scale image classification. The FK is a generic framework that combines the benefits of generative and discriminative approaches. In image classification, the FK extends the popular bag-of-visual-words (BOV) approach by going beyond count statistics. In practice, however, this richer representation had not shown superiority over BOV. The authors propose several well-motivated modifications to the original FK framework that significantly improve classification accuracy. On PASCAL VOC 2007, the average precision (AP) rises from 47.9% to 58.3%, and on CalTech 256 the authors report state-of-the-art performance. These results are obtained using only SIFT descriptors and computationally inexpensive linear classifiers. The authors also compare two abundant sources of labeled images for learning classifiers, ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images, classifiers learned on Flickr groups perform surprisingly well and can complement classifiers learned on more carefully annotated datasets.
The paper introduces three improvements to the Fisher vector: L2 normalization, power normalization, and spatial pyramids. L2 normalization removes the Fisher vector's dependence on the mixing coefficient, that is, on the proportion of image-specific (as opposed to background) information in the image. Power normalization applies f(z) = sign(z)|z|^α to each dimension. It counteracts the sparsity that Fisher vectors develop as the number of Gaussians grows, which otherwise makes the dot product a poor similarity measure. Spatial pyramids account for the spatial layout of the image by computing a Fisher vector for each image region and concatenating the results. Together, these improvements significantly enhance classification accuracy.
The authors evaluate their improvements on two challenging datasets: PASCAL VOC 2007 and CalTech 256. On PASCAL VOC 2007, the proposed improved Fisher kernel (IFK) achieves an AP of 58.3%, the best result reported to date using SIFT descriptors. On CalTech 256, the IFK outperforms several state-of-the-art methods. The authors also perform large-scale experiments using ImageNet and Flickr groups.
They show that classifiers trained on Flickr groups can achieve high accuracy on 12 out of 20 categories. The results demonstrate that Flickr groups are a valuable resource for training image classifiers, even though they were not intended for this purpose. The authors conclude that the proposed improved Fisher vector has the potential to become a new standard representation in image classification.
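To make the three improvements described above more concrete, here is a minimal pure-Python sketch of an improved Fisher vector, written under stated assumptions rather than taken from the authors' code. It computes the standard Fisher vector of a diagonal-covariance GMM (gradients with respect to the means and standard deviations, which is where the FK goes beyond BOV counts), applies power normalization with α = 0.5 followed by L2 normalization, and concatenates the results over spatial-pyramid regions. The toy GMM parameters, all helper names (`fisher_vector`, `pyramid_regions`, `improved_fisher_vector`, etc.), the region layouts (whole image, 2x2 grid, three horizontal strips) and the choice to normalize each region separately before concatenating are illustrative assumptions. A real system would fit the GMM on SIFT descriptors and use vectorized code.

```python
import math

# Hypothetical toy setup: a diagonal-covariance GMM given as parallel lists of
# mixture weights, mean vectors and per-dimension standard deviations.


def gaussian_log_density(x, mu, sigma):
    """Log-density of a diagonal-covariance Gaussian at descriptor x."""
    return sum(-0.5 * math.log(2 * math.pi * s * s) - 0.5 * ((xd - m) / s) ** 2
               for xd, m, s in zip(x, mu, sigma))


def posteriors(x, weights, means, sigmas):
    """Soft assignments gamma_i(x) of descriptor x to each Gaussian i."""
    logs = [math.log(w) + gaussian_log_density(x, m, s)
            for w, m, s in zip(weights, means, sigmas)]
    top = max(logs)  # subtract the max before exponentiating, for stability
    exps = [math.exp(lg - top) for lg in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_vector(descriptors, weights, means, sigmas):
    """Gradients w.r.t. means and std devs, normalized by T and the mixture weights."""
    dim = len(means[0])
    T = len(descriptors)
    if T == 0:  # an empty region contributes an all-zero vector
        return [0.0] * (2 * len(weights) * dim)
    g_mu = [[0.0] * dim for _ in weights]
    g_sigma = [[0.0] * dim for _ in weights]
    for x in descriptors:
        gamma = posteriors(x, weights, means, sigmas)
        for i, (m, s) in enumerate(zip(means, sigmas)):
            for d in range(dim):
                z = (x[d] - m[d]) / s[d]
                g_mu[i][d] += gamma[i] * z
                g_sigma[i][d] += gamma[i] * (z * z - 1.0)
    fv = []
    for i, w in enumerate(weights):
        fv += [g / (T * math.sqrt(w)) for g in g_mu[i]]
        fv += [g / (T * math.sqrt(2.0 * w)) for g in g_sigma[i]]
    return fv


def power_normalize(v, alpha=0.5):
    """Element-wise f(z) = sign(z) * |z|**alpha."""
    return [math.copysign(abs(z) ** alpha, z) for z in v]


def l2_normalize(v):
    """Scale v to unit Euclidean norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(z * z for z in v))
    return [z / norm for z in v] if norm > 0 else list(v)


def pyramid_regions(located, width, height, layouts=((1, 1), (2, 2), (3, 1))):
    """Group (x, y, descriptor) triples into the cells of each (rows, cols) grid."""
    regions = []
    for rows, cols in layouts:
        cells = [[] for _ in range(rows * cols)]
        for x, y, desc in located:
            r = min(int(y * rows / height), rows - 1)
            c = min(int(x * cols / width), cols - 1)
            cells[r * cols + c].append(desc)
        regions += cells
    return regions


def improved_fisher_vector(regions, weights, means, sigmas, alpha=0.5):
    """Power- then L2-normalize each region's FV and concatenate (per-region order assumed)."""
    out = []
    for region in regions:
        fv = fisher_vector(region, weights, means, sigmas)
        out += l2_normalize(power_normalize(fv, alpha))
    return out


# Toy usage: 2 Gaussians in 2-D and three located descriptors in a 100x100 image.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [1.0, 1.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
located = [(10, 10, [0.1, -0.2]), (90, 20, [0.9, 1.1]), (50, 80, [0.4, 0.6])]
ifv = improved_fisher_vector(pyramid_regions(located, 100, 100), weights, means, sigmas)
print(len(ifv))  # 8 regions x 2 Gaussians x 2 dims x (mean, sigma) = 64
```

With two Gaussians in two dimensions, each region yields 2 × 2 × 2 = 8 values, so the eight pyramid regions give a 64-dimensional vector. Cells that receive no descriptors contribute zeros, and every non-empty region contributes a unit-norm block, so the linear classifier sees regions on a comparable scale.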