[slides] GIM: Learning Generalizable Image Matcher From Internet Videos

The paper introduces GIM (Generalizable Image Matcher), a self-training framework that uses internet videos to learn a single, generalizable image matching model. It targets a limitation of existing learning-based methods, which typically need separate models for different scene types and generalize poorly to unseen scenarios. Trained on diverse internet videos, GIM improves the zero-shot performance of three state-of-the-art image matching architectures (SuperGlue, LoFTR, DKM) and improves their generalization on downstream tasks such as 3D reconstruction and visual localization.

Given an architecture, GIM first trains it on standard domain-specific datasets, then combines the trained model with complementary matching methods to generate dense labels on nearby video frames. These labels are filtered and then propagated to distant frames, and the final model is trained on the propagated data with strong augmentations. Because it does not depend on full 3D reconstruction, GIM is more efficient and less likely to fail than standard SfM- and MVS-based frameworks. The paper also introduces ZEB (Zero-shot Evaluation Benchmark), a benchmark for evaluating the cross-domain generalization of image matching models. Experiments show that GIM consistently improves zero-shot performance and generalizes to extreme cross-domain data, such as Bird Eye View (BEV) images of projected 3D point clouds.
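The step of propagating labels from nearby frames to distant frames can be illustrated with a small sketch. This is not the authors' implementation. It assumes matches are stored as plain coordinate pairs, that two matches can be chained when their points in the shared middle frame lie within a pixel tolerance, and that propagation stops once too few matches survive. The names `propagate_matches`, `propagate_along_video`, `tol`, and `min_matches` are hypothetical and chosen only for illustration.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]
Match = Tuple[Point, Point]  # (point in the first frame, point in the second frame)


def propagate_matches(matches_ab: List[Match],
                      matches_bc: List[Match],
                      tol: float = 1.0) -> List[Match]:
    """Chain A->B and B->C matches into A->C matches.

    Each A->B match is merged with the nearest B->C match whose point in
    frame B lies within `tol` pixels (an assumed threshold). A->B matches
    with no such partner are dropped.
    """
    matches_ac: List[Match] = []
    for pt_a, pt_b in matches_ab:
        best_c = None
        best_dist = tol
        for pt_b2, pt_c in matches_bc:
            dist = hypot(pt_b[0] - pt_b2[0], pt_b[1] - pt_b2[1])
            if dist <= best_dist:
                best_c, best_dist = pt_c, dist
        if best_c is not None:
            matches_ac.append((pt_a, best_c))
    return matches_ac


def propagate_along_video(pairwise: List[List[Match]],
                          tol: float = 1.0,
                          min_matches: int = 1) -> List[Match]:
    """Propagate matches from the first frame across consecutive frame pairs.

    `pairwise[i]` holds matches between frame i and frame i + 1. Chaining
    stops as soon as fewer than `min_matches` (a hypothetical cutoff)
    survive, and the last sufficiently large match set is returned.
    """
    current = pairwise[0]
    for next_pair in pairwise[1:]:
        chained = propagate_matches(current, next_pair, tol)
        if len(chained) < min_matches:
            break
        current = chained
    return current
```

Calling `propagate_along_video` with one match list per consecutive frame pair returns matches between the first frame and the farthest frame that still retains at least `min_matches` chained matches. The nested loop is quadratic in the number of matches; for dense labels a spatial index or a grid lookup would be a natural replacement.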

GIM: LEARNING GENERALIZABLE IMAGE MATCHER FROM INTERNET VIDEOS

16 Feb 2024 | Xuelun Shen1†, Zhipeng Cai2†, Wei Yin3†, Matthias Müller2, Zijun Li1, Kaixuan Wang3, Xiaozhi Chen3, Cheng Wang1
