Fine-tuning CNN Image Retrieval with No Human Annotation


10 Jul 2018 | Filip Radenović, Giorgos Tolias, Ondřej Chum
This paper proposes a fully automated method for fine-tuning Convolutional Neural Networks (CNNs) for image retrieval without human annotation, addressing the challenge of unsupervised fine-tuning for this task. The method leverages 3D models reconstructed by structure-from-motion (SfM) pipelines to select training data, enabling the automatic mining of hard-positive and hard-negative examples. These examples are derived from the geometry and camera positions in the 3D models, improving performance on particular-object retrieval. The paper introduces a trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling, and a discriminative whitening method learned from the same training data, which complements fine-tuning and further boosts performance. Applied to the VGG network, the method achieves state-of-the-art performance on the Oxford Buildings, Paris, and Holidays datasets.

The key contributions are: (1) using SfM information to mine both hard-positive and hard-negative examples for CNN training, which improves the learned image representation; (2) demonstrating that traditional whitening of short representations can be unstable, and proposing a discriminative whitening learned from the same training data; (3) introducing a trainable pooling layer (GeM) that generalizes popular pooling schemes, significantly improving retrieval performance while preserving descriptor dimensionality; (4) proposing an α-weighted query expansion (αQE) method that is more robust than standard average query expansion; (5) achieving new state-of-the-art results on the Oxford Buildings, Paris, and Holidays datasets by retraining commonly used CNN architectures.

The paper also reviews related work on training-data collection, pooling approaches for constructing global image descriptors, and descriptor whitening.
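GeM pooling computes, per feature channel, the generalized mean f_k = (mean over the channel's activations of x^p)^(1/p): p = 1 recovers average pooling, and p → ∞ approaches max pooling. A minimal NumPy sketch (not the authors' implementation; in the paper the exponent p is learned by backpropagation, whereas here it is a fixed hyperparameter):

```python
import numpy as np

def gem_pool(features, p=3.0, eps=1e-6):
    """GeM pooling over a (C, H, W) feature map: per channel k,
    f_k = (mean(x^p))^(1/p).  p=1 gives average pooling; large p
    approaches max pooling.  Activations are assumed non-negative
    (post-ReLU); eps guards against zeros."""
    x = np.clip(features, eps, None)
    flat = x.reshape(x.shape[0], -1)          # (C, H*W)
    return (flat ** p).mean(axis=1) ** (1.0 / p)
```

The output is one value per channel, so the global descriptor has the same dimensionality as with max or average pooling.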
It presents a detailed architecture, learning procedure, and search process for the proposed method. The method is evaluated on various datasets, showing significant improvements in retrieval performance over existing approaches and achieving state-of-the-art results on standard benchmarks.
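The discriminative whitening is learned from the matching pairs mined from the 3D models rather than estimated by PCA on unlabeled data. A simplified sketch, assuming matching descriptors are given as aligned row matrices; the full method also rotates by the eigendecomposition of the non-matching (between-class) covariance, which this sketch omits:

```python
import numpy as np

def inv_sqrt(M, eps=1e-8):
    """Inverse matrix square root of a symmetric PSD matrix
    via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.clip(w, eps, None)
    return V @ np.diag(w ** -0.5) @ V.T

def learn_whitening(desc_a, desc_b):
    """Learn a whitening projection from matching descriptor pairs.

    Rows desc_a[i] and desc_b[i] describe the same object (a positive
    pair).  The projection whitens the within-pair difference
    covariance, so matching descriptors are pulled together in all
    directions equally."""
    d = desc_a - desc_b
    C_s = d.T @ d / len(d)       # intra-class (within-pair) covariance
    return inv_sqrt(C_s)
```

Descriptors are then projected with the learned matrix and L2-normalized before comparison.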
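Standard average query expansion (QE) averages the query descriptor with the top-ranked results using uniform weights; αQE instead weights each top-ranked descriptor by its similarity to the query raised to the power α, so marginal matches contribute less (α = 0 recovers average QE). A hedged sketch over L2-normalized descriptors, with the query contributing weight 1:

```python
import numpy as np

def alpha_qe(query, db, top_k=10, alpha=3.0):
    """alpha-weighted query expansion over L2-normalized descriptors.

    Each of the top_k database descriptors is weighted by
    sim(query, d)^alpha before averaging with the query; alpha=0
    reduces to standard average query expansion."""
    sims = db @ query                          # cosine similarities
    idx = np.argsort(-sims)[:top_k]
    w = np.clip(sims[idx], 0.0, None) ** alpha # negative sims get weight 0
    expanded = query + (w[:, None] * db[idx]).sum(axis=0)
    return expanded / np.linalg.norm(expanded)
```

The expanded descriptor is issued as a second query against the database.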