GIM: LEARNING GENERALIZABLE IMAGE MATCHER FROM INTERNET VIDEOS

2024 | Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, Cheng Wang
This paper proposes GIM, a self-training framework for learning a generalizable image matching model from internet videos. GIM learns a single image matcher that generalizes across domains and scales well with the amount of data. The framework first trains a model on standard domain-specific datasets, then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting and then enhanced by propagating them to distant frames. The final model is trained on the propagated data with strong augmentations. GIM is more efficient and less likely to fail than standard SfM-and-MVS-based frameworks.

The paper also proposes ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods.

Experiments demonstrate the effectiveness and generality of GIM. Applying GIM consistently improves the zero-shot performance of three state-of-the-art image matching architectures as the number of downloaded videos increases, and it enables generalization to extreme cross-domain data such as Bird's Eye View (BEV) images of projected 3D point clouds. The paper also evaluates GIM on downstream tasks such as visual localization, homography estimation, and 3D reconstruction. A single GIM model achieves across-the-board performance improvements on these tasks, even compared to in-domain baselines on their specific domains. The source code, a demo, and the benchmark are available at https://xuelunshen.com/gim/.
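The "robust fitting" filtering step can be pictured with a short sketch: candidate correspondences produced by the complementary matchers are kept only if they are inliers to a single epipolar geometry. The snippet below is a minimal illustration, not the paper's exact implementation; the function name, the MAGSAC++ estimator, and the threshold values are assumptions for illustration.

```python
import cv2
import numpy as np

def filter_matches_robust(pts0: np.ndarray, pts1: np.ndarray,
                          ransac_thresh: float = 1.0):
    """Keep correspondences consistent with one epipolar geometry.

    pts0, pts1: (N, 2) float arrays of matched pixel coordinates in two
    nearby frames. Returns the inlier subsets of both arrays.
    """
    if len(pts0) < 8:  # fundamental matrix estimation needs >= 8 matches
        return pts0[:0], pts1[:0]
    # Robust fit with MAGSAC++ (available in OpenCV >= 4.5).
    F, inlier_mask = cv2.findFundamentalMat(
        pts0, pts1, cv2.USAC_MAGSAC, ransac_thresh, 0.9999)
    if F is None or inlier_mask is None:
        return pts0[:0], pts1[:0]
    keep = inlier_mask.ravel().astype(bool)
    return pts0[keep], pts1[keep]
```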
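Likewise, propagation to distant frames can be understood as chaining correspondences through intermediate frames: if a point in frame A matches a point in frame B, and (nearly) the same frame-B point matches a point in frame C, then A and C are matched. The sketch below composes two match sets this way under that assumption; the helper name and the pixel tolerance are illustrative, and the paper's actual propagation may differ in detail.

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_matches(pts_a, pts_b1, pts_b2, pts_c, tol: float = 1.0):
    """Compose A->B matches (pts_a, pts_b1) with B->C matches (pts_b2, pts_c).

    Returns A->C correspondences wherever the two sets of frame-B points
    agree to within `tol` pixels.
    """
    # Nearest-neighbor lookup of each A->B endpoint among the B->C startpoints.
    tree = cKDTree(pts_b2)
    dists, idx = tree.query(pts_b1, distance_upper_bound=tol)
    keep = np.isfinite(dists)  # finite distance => a neighbor within tol
    return pts_a[keep], pts_c[idx[keep]]
```

Chaining A->B->C in this fashion yields wider-baseline training pairs than any single nearby-frame match set provides, which is what makes the propagated labels more valuable than the raw ones.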