15 Apr 2024 | Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann
This paper introduces a novel approach to pose regression called map-relative pose regression (marepo), which addresses the limitations of existing absolute pose regression (APR) methods by incorporating scene-specific geometric priors. Traditional APR methods require extensive training data and time to achieve accurate pose estimation, often leading to performance degradation in new environments. Marepo combines a scene-specific metric representation with a scene-agnostic pose regression network, enabling accurate and efficient pose estimation across diverse scenes.
The method uses a transformer-based architecture with dynamic positional encoding to capture 3D geometric information and improve pose prediction accuracy. The network is trained on a large corpus of data and can be fine-tuned in minutes for new scenes, achieving state-of-the-art performance on two public datasets: 7-Scenes (indoor) and Wayspots (outdoor). Experiments show that marepo outperforms existing APR methods in both accuracy and efficiency and generalizes to new scenes without extensive retraining. The method is also robust to variations in camera parameters and tolerates noisy scene coordinates, making it a versatile solution for visual relocalization. Ablation studies and supplementary experiments validate the proposed approach.
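To make the idea of a positional encoding that carries 3D geometry more concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes each transformer input token is built from a predicted 3D scene coordinate and its pixel location, that both are encoded with sine and cosine features at octave-spaced frequencies, and that the pixel is first centred and divided by the focal length so the 2D part does not depend on the camera intrinsics. The names `fourier_features` and `encode_token`, the frequency count, and the sample values are all hypothetical.

```python
import math


def fourier_features(values, num_freqs=4):
    """Encode each scalar as sin/cos pairs at octave-spaced frequencies."""
    feats = []
    for v in values:
        for k in range(num_freqs):
            angle = (2.0 ** k) * v
            feats.extend([math.sin(angle), math.cos(angle)])
    return feats


def encode_token(scene_xyz, pixel_uv, focal, principal_point, num_freqs=4):
    """Build one illustrative input token from a 3D point and its pixel.

    Assumption (not from the paper): centring the pixel on the principal
    point and dividing by the focal length yields a camera-independent
    2D direction, so the same view at another resolution encodes identically.
    """
    u, v = pixel_uv
    cx, cy = principal_point
    ray = [(u - cx) / focal, (v - cy) / focal]
    return fourier_features(scene_xyz, num_freqs) + fourier_features(ray, num_freqs)


# Hypothetical sample: one scene coordinate (metres) seen at one pixel.
token = encode_token([0.8, -1.2, 2.5], (400.0, 180.0), 525.0, (320.0, 240.0))

# The same view rendered at twice the resolution (all intrinsics doubled).
token_2x = encode_token([0.8, -1.2, 2.5], (800.0, 360.0), 1050.0, (640.0, 480.0))

print(len(token))
```

In this sketch the two tokens match even though the pixel coordinates and intrinsics differ, which is one simple way an encoding can be made insensitive to camera parameters. A learned regressor would consume many such tokens per image; that part is omitted here.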