29 Jul 2016 | Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, Pascal Fua
LIFT: Learned Invariant Feature Transform is a novel deep network architecture that integrates feature detection, orientation estimation, and feature description into a single differentiable pipeline. Unlike previous methods that handle these tasks separately, LIFT learns to perform all three in unison while remaining end-to-end differentiable. The architecture consists of three components, each based on a Convolutional Neural Network (CNN): a Detector that identifies feature points, an Orientation Estimator that assigns each point an orientation, and a Descriptor that computes a robust feature vector for the resulting patch. To keep the whole chain differentiable, the pipeline uses Spatial Transformers to crop and rectify image patches, and replaces the non-differentiable non-maximum suppression of traditional detectors with a softargmax function.
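The softargmax is what lets keypoint locations be extracted from the Detector's score map without breaking gradient flow: instead of a hard argmax, it takes the expected coordinate under a temperature-scaled softmax of the scores. Below is a minimal PyTorch sketch of the idea; the function name, tensor shapes, and the temperature value beta are illustrative assumptions, not the paper's code.

```python
import torch

def softargmax2d(score_map: torch.Tensor, beta: float = 100.0) -> torch.Tensor:
    """Differentiable stand-in for argmax over a 2-D detector score map.

    score_map: (B, H, W) detector scores; returns (B, 2) coordinates (x, y).
    Larger beta sharpens the softmax, approaching a hard argmax.
    """
    b, h, w = score_map.shape
    # Temperature-scaled softmax turns the score map into a probability mass.
    probs = torch.softmax(beta * score_map.reshape(b, -1), dim=-1).reshape(b, h, w)
    ys = torch.arange(h, dtype=score_map.dtype, device=score_map.device)
    xs = torch.arange(w, dtype=score_map.dtype, device=score_map.device)
    # Expected coordinates under that distribution; gradients flow through.
    y = (probs.sum(dim=2) * ys).sum(dim=1)  # marginal over columns, weighted rows
    x = (probs.sum(dim=1) * xs).sum(dim=1)  # marginal over rows, weighted columns
    return torch.stack([x, y], dim=1)
```

The returned coordinates can then drive a Spatial Transformer-style crop, so detection errors backpropagate through the Orientation Estimator and Descriptor losses.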
The system is trained using a Siamese architecture with four branches, each running the full pipeline on one of four patch types: two patches showing different views of the same physical point, one patch containing a different feature point, and one patch containing no distinctive feature point at all. The training data is generated from the Piccadilly and Roman Forum photo-tourism image sets, reconstructed with Structure-from-Motion (SfM), so that reliable feature points and their correspondences across views are known. The network is trained in a problem-specific manner: the Descriptor first, then the Orientation Estimator given the learned Descriptor, and finally the Detector given both. Training each later component against already-learned downstream ones proved more effective than optimizing the whole pipeline from scratch, while still letting the components adapt to one another.
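For the Descriptor stage, the branches yield descriptor pairs that either correspond or do not, and training minimizes a pairwise hinge-style loss over Euclidean distances: matching pairs are pulled together, non-matching pairs pushed beyond a margin. The sketch below is a simplified version of such a loss; the margin value and batch layout are assumptions for illustration.

```python
import torch

def descriptor_pair_loss(desc_a: torch.Tensor,
                         desc_b: torch.Tensor,
                         is_match: torch.Tensor,
                         margin: float = 4.0) -> torch.Tensor:
    """Pairwise hinge loss over descriptor distances.

    desc_a, desc_b: (B, D) descriptors from two Siamese branches.
    is_match:       (B,) floats, 1.0 for corresponding patches, 0.0 otherwise.
    """
    dist = torch.norm(desc_a - desc_b, dim=1)
    # Pull matching pairs together; push non-matching pairs apart until
    # their distance exceeds the margin, after which they contribute zero.
    loss = is_match * dist + (1.0 - is_match) * torch.clamp(margin - dist, min=0.0)
    return loss.mean()
```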
The LIFT pipeline outperforms state-of-the-art methods on benchmark datasets such as Strecha, DTU, and Webcam, achieving superior repeatability, nearest-neighbor mean average precision (NN mAP), and matching score. The results show that integrating the components into a unified pipeline is crucial for optimal performance: a component that looks unremarkable when evaluated in isolation can still be the right choice once the full pipeline is scored together. The system's ability to learn invariant features while remaining end-to-end differentiable makes it a powerful approach to local feature detection and description.
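Of the reported metrics, repeatability is the most self-contained: the fraction of keypoints detected in one image that reappear, under the known scene geometry, near a detection in the other. The NumPy sketch below illustrates the idea for a planar scene with a ground-truth homography; the 3-pixel threshold is an assumption, and the actual benchmark protocol additionally accounts for region overlap and scale.

```python
import numpy as np

def repeatability(kp1: np.ndarray, kp2: np.ndarray,
                  H: np.ndarray, thresh: float = 3.0) -> float:
    """Fraction of image-1 keypoints whose ground-truth projection into
    image 2 lands within `thresh` pixels of some image-2 detection.

    kp1: (N, 2) and kp2: (M, 2) arrays of (x, y); H: 3x3 homography 1 -> 2.
    """
    homog = np.hstack([kp1, np.ones((len(kp1), 1))])  # to homogeneous coords
    proj = homog @ H.T
    proj = proj[:, :2] / proj[:, 2:3]                 # back to pixel coords
    # Pairwise distances between projected and detected keypoints: (N, M).
    d = np.linalg.norm(proj[:, None, :] - kp2[None, :, :], axis=2)
    return float((d.min(axis=1) < thresh).mean())
```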