12 May 2024 | Amar Ali-bey*, Brahim Chaib-draa, Philippe Giguère
BoQ: A Place is Worth a Bag of Learnable Queries
This paper introduces Bag-of-Queries (BoQ), a new technique for visual place recognition that learns a set of global queries designed to capture universal place-specific attributes. Unlike existing methods that rely on self-attention and generate queries directly from the input, BoQ employs distinct learnable global queries that probe the input features via cross-attention, ensuring consistent information aggregation. BoQ provides an interpretable attention mechanism and integrates with both CNN and Vision Transformer backbones. Extensive experiments on 14 large-scale benchmarks show that it consistently outperforms state-of-the-art techniques including NetVLAD, MixVPR, and EigenPlaces, and that it surpasses two-stage retrieval methods such as Patch-NetVLAD, TransVPR, and R2Former while being orders of magnitude faster and more efficient. The code and model weights are publicly available at https://github.com/amaralibey/Bag-of-Queries.

BoQ is a global (single-stage) retrieval technique: it does not employ reranking, which makes it particularly suitable for applications with limited computational resources. The method is end-to-end trainable and works with both conventional CNN and ViT backbones. Its cross-attention mechanism dynamically weights the importance of each input feature and aggregates the features into the output of each BoQ block, and the final global descriptor is L2-normalized to optimize it for similarity search.
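To make the aggregation step concrete, below is a minimal PyTorch-style sketch of a BoQ-like block: a fixed set of learnable queries cross-attends to the backbone features, the block outputs are concatenated, and the result is L2-normalized. Dimensions, layer names, and the exact block composition (`BoQBlock`, `BoQHead`, the use of LayerNorm) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BoQBlock(nn.Module):
    """Illustrative BoQ-style block: learnable global queries probe the
    input feature map via cross-attention (not the authors' exact code)."""
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        # One shared set of learnable queries, independent of the input image.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: backbone features flattened to (batch, num_tokens, dim).
        q = self.queries.expand(feats.size(0), -1, -1)
        # Cross-attention: the queries decide how much each feature contributes.
        out, attn = self.cross_attn(q, feats, feats)
        return self.norm(out), attn  # attn can be visualized for interpretability

class BoQHead(nn.Module):
    """Stacks several BoQ-style blocks and returns an L2-normalized descriptor."""
    def __init__(self, dim=256, num_queries=32, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            BoQBlock(dim, num_queries) for _ in range(num_blocks))

    def forward(self, feats):
        outputs = [blk(feats)[0].flatten(1) for blk in self.blocks]
        desc = torch.cat(outputs, dim=1)       # (batch, num_blocks * num_queries * dim)
        return F.normalize(desc, p=2, dim=-1)  # ready for cosine-similarity search

With a ResNet or ViT backbone producing a feature map, one would flatten the spatial positions into tokens and feed them to BoQHead; retrieval then reduces to a nearest-neighbour search over the normalized descriptors.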
Performance is measured with recall@k across diverse environments, covering viewpoint changes, seasonal and illumination changes, historical locations, and extreme weather. Compared against Conv-AP, CosPlace, MixVPR, and other recent techniques, BoQ achieves state-of-the-art results on most benchmarks and remains robust to extreme weather and illumination changes; its single-stage retrieval also surpasses existing two-stage techniques. Ablation studies show that the number of learnable queries and the number of BoQ blocks significantly affect performance, and that BoQ remains competitive even with a lightweight ResNet-34 backbone. Attention visualizations illustrate how the learned queries aggregate the input features. Overall, the results establish BoQ as a promising approach for visual place recognition, combining high accuracy with high efficiency.
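For reference, the recall@k protocol mentioned above can be sketched as follows. The helper is a generic illustration, not code from the paper; in practice, benchmarks define ground-truth matches by a geographic distance threshold (commonly 25 m), which is assumed to have been computed already here.

import numpy as np

def recall_at_k(query_desc, db_desc, gt_matches, ks=(1, 5, 10)):
    """Illustrative recall@k for place recognition.
    query_desc: (Q, D) L2-normalized query descriptors.
    db_desc:    (N, D) L2-normalized database descriptors.
    gt_matches: list of sets; gt_matches[i] holds the database indices that
                count as correct matches for query i.
    """
    sims = query_desc @ db_desc.T            # cosine similarity (descriptors are normalized)
    ranked = np.argsort(-sims, axis=1)       # database indices, best match first
    recalls = {}
    for k in ks:
        hits = sum(bool(set(ranked[i, :k]) & gt_matches[i])
                   for i in range(len(gt_matches)))
        recalls[k] = hits / len(gt_matches)  # fraction of queries with a correct match in top k
    return recalls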