XFeat: Accelerated Features for Lightweight Image Matching

2024 | Guilherme Potje¹, Felipe Cadar¹,², André Araujo³, Renato Martins²,⁴, Erickson R. Nascimento¹,⁵
XFeat (Accelerated Features) is a lightweight, accurate convolutional neural network (CNN) architecture for efficient image matching, addressing the need for fast, robust algorithms on resource-limited devices. XFeat offers a choice between sparse and semi-dense matching, each suited to different downstream applications such as visual navigation and augmented reality, and it is the first method to offer semi-dense matching efficiently, leveraging a novel match refinement module that relies only on coarse local descriptors. XFeat is versatile and hardware-independent, surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy, as demonstrated in pose estimation and visual localization. It runs in real time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at www.verlab.dcc.ufmg.br/descritors/xfeat_cvpr24.

The contributions are threefold. First, XFeat introduces a novel lightweight CNN architecture that can be deployed on resource-constrained platforms and in downstream tasks requiring high throughput or computational efficiency, without time-consuming hardware-specific optimizations; it can readily replace lightweight handcrafted solutions, expensive deep models, and existing lightweight deep models in downstream tasks such as visual localization and camera pose estimation. Second, XFeat introduces a minimalist, learnable keypoint detection branch that is fast and well suited to small extractor backbones, shown to be effective in visual localization, camera pose estimation, and homography registration. Third, XFeat proposes a novel match refinement module that obtains pixel-level offsets from coarse semi-dense matches without requiring high-resolution features beyond the local descriptors themselves, greatly reducing compute while achieving high accuracy and matching density.
XFeat's backbone is designed to be hardware-agnostic, ensuring broad applicability across platforms. It is a featherweight network that minimizes early-layer depth and reconfigures the channel distribution, significantly improving the accuracy-compute trade-off. The backbone comprises six blocks that halve the spatial resolution while increasing channel depth in the sequence {4, 8, 24, 64, 64, 128}, plus a fusion block that merges multi-resolution features. The descriptor head extracts a dense feature map obtained by merging multiscale features from the encoder, while a dedicated parallel keypoint head detects keypoints from low-level image structures.

The dense matching module is a lightweight module for dense feature matching that differs from other detector-free methods in two ways: it controls the memory and compute footprint by selecting the top-K image regions according to their reliability score and caching them for future matching, and it uses a simple, lightweight multi-layer perceptron (MLP) to perform coarse-to-fine matching without high-resolution feature maps. XFeat is trained in a supervised manner with pixel-level ground-truth correspondences, and the learning of local descriptors is supervised with the negative log-likelihood (NLL) loss.
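The backbone layout above can be traced with a small shape-bookkeeping sketch. This is a simplified sketch, not XFeat's actual implementation: it assumes every block halves the spatial resolution, whereas the real network's strides may differ.

```python
def backbone_shapes(h, w, channels=(4, 8, 24, 64, 64, 128)):
    """Trace (C, H, W) through a toy XFeat-style backbone whose six
    blocks follow the listed channel schedule, assuming each block
    halves the spatial resolution (a simplification)."""
    shapes = []
    for c in channels:
        h, w = h // 2, w // 2
        shapes.append((c, h, w))
    return shapes

# For a 640x480 input, print the per-block output shapes.
for c, h, w in backbone_shapes(480, 640):
    print(f"{c:>3} x {h:>3} x {w:>3}")
```

Under this assumption, the first block produces a 4-channel map at 240x320, and the sixth a 128-channel map at roughly 1/64 of the input resolution, which is why the fusion block is needed to merge features from multiple resolutions.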
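The coarse-to-fine refinement idea can be sketched as an MLP that maps a pair of coarse descriptors to a distribution over sub-cell pixel offsets, taking the argmax as the pixel-level correction. Everything concrete here is a hypothetical placeholder, not XFeat's actual module: the descriptor dimension, the single linear layer standing in for the MLP, and the toy weight values.

```python
import math

def refine_offset(desc_a, desc_b, weights, cell=8):
    """Toy match refinement: a single linear layer (standing in for
    an MLP) takes the concatenated coarse descriptors of a matched
    pair and scores each of the cell*cell candidate offsets; the
    argmax gives a (dy, dx) pixel-level offset inside the cell."""
    x = desc_a + desc_b  # concatenate the two coarse descriptors
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in weights]
    best = max(range(len(logits)), key=lambda i: logits[i])
    return divmod(best, cell)  # (dy, dx) within the coarse cell

# Hypothetical 2-D coarse descriptors and a deterministic toy
# 64x4 weight matrix (64 offsets for an 8x8 cell).
desc_a, desc_b = [0.5, -1.0], [1.0, 0.25]
weights = [[math.sin(i + j) for j in range(4)] for i in range(64)]
dy, dx = refine_offset(desc_a, desc_b, weights)
```

The key property this illustrates is that the refinement needs no high-resolution feature map: the offset is predicted from the coarse descriptors alone, which is what keeps the semi-dense mode cheap.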
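The NLL supervision of local descriptors can be illustrated with a minimal, one-directional sketch: softmax each row of a descriptor similarity matrix and average the negative log-probability of the ground-truth match for each annotated correspondence. This is a simplified stand-in, not XFeat's exact loss formulation.

```python
import math

def nll_match_loss(sim, matches):
    """Negative log-likelihood over a similarity matrix `sim`
    (rows: descriptors in image A, columns: descriptors in image B).
    For each ground-truth correspondence (i, j), softmax row i and
    accumulate -log of the probability assigned to column j."""
    loss = 0.0
    for i, j in matches:
        row = sim[i]
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        loss += -(row[j] - log_z)
    return loss / len(matches)

# A confident, correct similarity matrix yields a near-zero loss;
# a uniform one yields log(N) per match.
print(nll_match_loss([[10.0, 0.0], [0.0, 10.0]], [(0, 0), (1, 1)]))
```

Minimizing this loss pushes each descriptor's similarity to its true correspondence above its similarity to every other candidate, which is exactly the discriminative behavior matching requires.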