Understanding Fine-Tuning CNN Image Retrieval with No Human Annotation

This paper addresses the challenge of fine-tuning Convolutional Neural Networks (CNNs) for image retrieval without requiring human-annotated data. The authors propose a fully automated method that leverages 3D models reconstructed from unordered images using Structure-from-Motion (SfM) techniques. These 3D models guide the selection of training data, ensuring that both hard-positive and hard-negative examples are included. The proposed method also introduces a novel Generalized-Mean (GeM) pooling layer, which generalizes max and average pooling and significantly improves retrieval performance. Additionally, the paper introduces an $\alpha$-weighted query expansion technique that enhances the robustness of the retrieval results.
The proposed method achieves state-of-the-art performance on standard benchmarks such as Oxford Buildings, Paris, and Holidays datasets. The key contributions include the exploitation of 3D model information for training, the introduction of the GeM pooling layer, and the $\alpha$-weighted query expansion technique. The experiments demonstrate the effectiveness of these methods in enhancing the discriminative power and efficiency of CNN-based image retrieval systems.
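
To make the two descriptor-level ideas in the summary concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that GeM pooling computes, per feature channel, the p-th root of the mean of the p-th powers of the non-negative activations, so that p = 1 gives average pooling and large p approaches max pooling. It further assumes that $\alpha$-weighted query expansion sums the query descriptor with its top-ranked database descriptors, each weighted by its similarity to the query raised to the power $\alpha$, and then L2-normalizes the result. The function names, the `eps` clamp, the clamping of negative similarities to zero, and the default values of `p`, `n_qe`, and `alpha` are illustrative assumptions.

```python
import math


def gem_pool(feature_maps, p=3.0, eps=1e-6):
    """Generalized-mean pooling over each channel (hypothetical helper).

    feature_maps: list of K channels, each a flat list of activations.
    Returns one pooled value per channel.
    """
    pooled = []
    for channel in feature_maps:
        # Assumption: clamp to a small positive value so the p-th root
        # stays well defined (activations are expected to be non-negative).
        clamped = [max(x, eps) for x in channel]
        mean_pow = sum(x ** p for x in clamped) / len(clamped)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alpha_query_expansion(query, database, n_qe=2, alpha=3.0):
    """Alpha-weighted query expansion sketch (hypothetical helper).

    query: L2-normalized descriptor; database: list of L2-normalized descriptors.
    The n_qe most similar descriptors are added to the query, each weighted
    by (similarity ** alpha); the query itself keeps weight 1.
    """
    sims = [sum(a * b for a, b in zip(query, d)) for d in database]
    top = sorted(range(len(database)), key=lambda i: sims[i], reverse=True)[:n_qe]
    expanded = list(query)
    for i in top:
        # Assumption: negative similarities contribute nothing.
        weight = max(sims[i], 0.0) ** alpha
        expanded = [e + weight * x for e, x in zip(expanded, database[i])]
    return l2_normalize(expanded)


if __name__ == "__main__":
    maps = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
    print(gem_pool(maps, p=1.0))    # behaves like average pooling
    print(gem_pool(maps, p=100.0))  # approaches max pooling

    q = [1.0, 0.0]
    db = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
    print(alpha_query_expansion(q, db, n_qe=2, alpha=3.0))
```

In this sketch, setting `alpha=0` weights all selected neighbours equally, which reduces to plain average query expansion; larger values of `alpha` reduce the influence of weakly similar neighbours.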

Fine-tuning CNN Image Retrieval with No Human Annotation

10 Jul 2018 | Filip Radenović, Giorgos Tolias, Ondřej Chum