Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation


28 Mar 2024 | Zhongliang Zhou, Jielu Zhang, Zihan Guan, Mengxuan Hu, Ni Lao, Lan Mu, Sheng Li, Gengchen Mai
Image geolocalization, the task of determining the geographic coordinates at which an image was taken, remains a challenging problem in computer vision and information retrieval. Traditional methods rely on either classification, which divides the Earth's surface into grid cells and assigns images to them, or retrieval, which identifies locations by matching images against a database of image-location pairs. However, classification-based approaches are limited by cell size and cannot yield precise predictions, while retrieval-based systems suffer from poor search quality and inadequate coverage of the globe across scales and aggregation levels.

To address these limitations, the authors propose Img2Loc, a system that reframes image geolocalization as a text-generation task, using large multi-modality models (LMMs) such as GPT-4V or LLaVA with retrieval-augmented generation. Img2Loc first builds an image-based coordinate database using CLIP representations. It then combines the retrieved coordinates with the query image itself to form elaborate prompts tailored to LMMs. On benchmark datasets such as Im2GPS3k and YFCC4k, Img2Loc surpasses previous state-of-the-art models without any model training.

The study makes several contributions to the field of image geolocalization. It is the first successful demonstration of multi-modality foundation models on geolocalization tasks. The approach is training-free, avoiding specialized model architectures and training paradigms and substantially reducing computational overhead. Through a refined sampling process, the method identifies reference points closely associated with the query image while minimizing the likelihood of generating grossly inaccurate coordinates.
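The retrieval step described above can be sketched in a few lines. This is a minimal stand-in, not the paper's implementation: the random vectors below play the role of precomputed CLIP image embeddings, and plain NumPy inner products over L2-normalized vectors substitute for a FAISS index (a FAISS `IndexFlatIP` over normalized vectors computes the same cosine similarity). The coordinates are invented for illustration.

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize database embeddings so that an inner product equals
    cosine similarity (the metric FAISS IndexFlatIP would compute here)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def query_neighbors(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return indices of the k most similar and k least similar database
    images for one query embedding, as used to build the LMM prompt."""
    q = query / np.linalg.norm(query)
    sims = index @ q
    order = np.argsort(-sims)      # indices sorted by descending similarity
    return order[:k], order[-k:]   # nearest neighbors, farthest entries

# Toy database: five fake "CLIP" embeddings with made-up geo-coordinates.
rng = np.random.default_rng(0)
db = rng.normal(size=(5, 8))
coords = [(48.85, 2.35), (40.71, -74.0), (35.68, 139.7),
          (51.5, -0.13), (-33.87, 151.2)]

index = build_index(db)
near, far = query_neighbors(index, db[0], k=2)
print([coords[i] for i in near])  # db[0] is its own top match
```

Retrieving the *least* similar entries alongside the nearest ones is what lets the prompt also tell the model where the image is unlikely to be.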
It also achieves outstanding performance on challenging benchmark datasets compared with other state-of-the-art approaches.

The method first constructs an image-location database, using the CLIP model for feature encoding and FAISS for efficient nearest-neighbor search. It then generates locations with augmented prompts that incorporate information from both similar and dissimilar locations. Users can input any image; the system processes it through the query-and-retrieval module, feeds the results into a multi-modality model, and displays the predicted location on an interactive map.

Experiments on benchmark datasets show that Img2Loc outperforms previous classification and retrieval methods across all granularity levels. On Im2GPS3k, it achieves significant improvements over the prior top-performing method, GeoCLIP, without training on geo-tagged data. On YFCC4k, it surpasses the previous best model, GeoGuessNet, by margins of up to 8.18% at the respective distance thresholds.

In conclusion, Img2Loc is a cutting-edge system that leverages multi-modality foundation models and advanced image-based retrieval techniques for image geolocalization.
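The prompt-augmentation step can be sketched as follows. The template wording and the `build_prompt` helper are hypothetical (the paper's exact prompt is not reproduced here); the idea it illustrates is the one described above: nearest-neighbor coordinates serve as positive anchors, while the coordinates of the most dissimilar images are listed as locations to rule out, and the assembled text is sent to an LMM together with the query image.

```python
def build_prompt(near_coords, far_coords):
    """Assemble a hypothetical retrieval-augmented prompt for an LMM
    (e.g. GPT-4V): similar images' coordinates guide the prediction,
    dissimilar images' coordinates mark regions to avoid."""
    near_txt = "; ".join(f"({lat:.4f}, {lon:.4f})" for lat, lon in near_coords)
    far_txt = "; ".join(f"({lat:.4f}, {lon:.4f})" for lat, lon in far_coords)
    return (
        "Suppose you are an expert in image geolocalization. "
        f"Coordinates of visually similar images: {near_txt}. "
        f"The image is unlikely to be near: {far_txt}. "
        "Answer with a single (latitude, longitude) pair for the attached image."
    )

# Example with one positive anchor (Paris) and one negative anchor (Sydney).
prompt = build_prompt([(48.8566, 2.3522)], [(-33.8688, 151.2093)])
print(prompt)
```

Because the final answer is generated as free-form text rather than picked from a fixed grid, the prediction is not bounded by a cell size, which is the training-free precision advantage the authors claim over classification approaches.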
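The "granularity levels" used to score these benchmarks are distance thresholds: a prediction counts as correct if it falls within a given great-circle distance of the ground truth. The sketch below shows this standard Im2GPS-style metric with the haversine formula; the sample predictions and the 25 km threshold are illustrative, not results from the paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at(preds, truths, threshold_km):
    """Fraction of predictions within threshold_km of the ground truth --
    reported at each granularity level (commonly 1, 25, 200, 750, 2500 km)."""
    hits = sum(haversine_km(*p, *t) <= threshold_km
               for p, t in zip(preds, truths))
    return hits / len(preds)

# Toy example: one prediction lands ~0.4 km from the truth, one is off by continents.
preds  = [(48.86, 2.35), (40.0, -74.0)]
truths = [(48.8566, 2.3522), (35.68, 139.69)]
print(accuracy_at(preds, truths, 25))  # -> 0.5
```

Reporting the same predictions at several thresholds is what allows a single system to be compared at street, city, region, country, and continent scales.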