Understanding NetVLAD: CNN Architecture for Weakly Supervised Place Recognition

NetVLAD is a CNN architecture for weakly supervised place recognition. Its main contribution is the NetVLAD layer, a generalized VLAD layer inspired by the Vector of Locally Aggregated Descriptors (VLAD) image representation. The layer is a pooling mechanism with learnable parameters; it can be plugged into any CNN architecture and trained via backpropagation. The architecture is trained end-to-end with a weakly supervised ranking loss on images from Google Street View Time Machine, and the same loss enables end-to-end learning for other ranking tasks where large amounts of weakly labelled data are available.

The architecture is evaluated on two publicly available place recognition benchmarks, Pittsburgh (Pitts250k) and Tokyo 24/7. The trained representations significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on both, with a significant improvement over the state of the art on the challenging Tokyo 24/7 dataset. They also improve over state-of-the-art compact image representations on the Oxford and Paris image retrieval benchmarks. Both the NetVLAD layer and the weakly supervised ranking loss are generic CNN building blocks applicable beyond place recognition.
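
The pooling performed by the NetVLAD layer can be sketched in a few lines. What follows is a minimal pure-Python illustration of the forward pass as formulated in the NetVLAD paper: each local descriptor is softly assigned to K cluster centres, the weighted residuals are accumulated per cluster, and the result is intra-normalized and then L2-normalized as a whole. It is not the authors' implementation. The function name `netvlad_pool`, the parameter names, and the toy inputs are assumptions made for illustration; a real layer would pool CNN feature maps and learn `weights`, `biases`, and `centers` by backpropagation.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def netvlad_pool(descriptors, weights, biases, centers):
    """Pool N local D-dim descriptors into one K*D-dim vector.

    descriptors: N lists of length D (local features of one image)
    weights:     K lists of length D (soft-assignment weights w_k)
    biases:      K floats            (soft-assignment biases b_k)
    centers:     K lists of length D (cluster centres c_k)
    """
    num_clusters, dim = len(centers), len(centers[0])
    vlad = [[0.0] * dim for _ in range(num_clusters)]
    for x in descriptors:
        # Soft assignment: softmax over w_k . x + b_k for all clusters k.
        scores = [sum(wj * xj for wj, xj in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        assign = softmax(scores)
        # Accumulate the residual (x - c_k), weighted by the assignment.
        for k in range(num_clusters):
            for j in range(dim):
                vlad[k][j] += assign[k] * (x[j] - centers[k][j])
    # Intra-normalization: L2-normalize each cluster's residual sum.
    vlad = [l2_normalize(row) for row in vlad]
    # Flatten to a single vector and L2-normalize it.
    return l2_normalize([v for row in vlad for v in row])


# Toy example (hypothetical values): 3 descriptors, K = 2 clusters, D = 2.
descriptors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
centers = [[1.0, 0.0], [0.0, 1.0]]
# VLAD-style initialization with alpha = 1 (an assumed value):
# w_k = 2 * alpha * c_k and b_k = -alpha * ||c_k||^2.
weights = [[2.0 * c for c in row] for row in centers]
biases = [-sum(c * c for c in row) for row in centers]

v = netvlad_pool(descriptors, weights, biases, centers)
print(len(v), v)
```

Because the soft assignment is a softmax rather than a hard nearest-centre choice, every step is differentiable with respect to the weights, biases, and centres, which is what makes the layer trainable by backpropagation.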

NetVLAD: CNN architecture for weakly supervised place recognition

2 May 2016 | Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic