NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Abstract
Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
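The "average distance error" reported above can be illustrated with a minimal sketch. It assumes the error is the great-circle (haversine) distance between predicted and true coordinates, averaged over images. The abstract does not specify the exact metric, and all function names here are hypothetical, not taken from the Navig codebase.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_error_km(predictions, targets):
    """Average haversine error over paired lists of (lat, lon) tuples."""
    dists = [haversine_km(*p, *t) for p, t in zip(predictions, targets)]
    return sum(dists) / len(dists)


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of new_error relative to baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

Under this reading, a 14% reduction means `relative_reduction(baseline, navig)` is about 0.14, where both arguments are mean errors computed on the same test images.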
Community
Navig is a novel framework that reasons and searches with tools to locate an image.
1. Navig learns from GeoGuessr experts: We introduce the first reasoning dataset for Image Geo-localization, which uses image details to infer the location step-by-step. This data is collected from expert players on YouTube.
2. Navig searches on maps: Navig identifies and searches text on images, such as road signs or store names, improving accuracy in pinpointing fine-grained locations.
3. Performance of Navig: By incorporating language-based reasoning, Navig reduces the average distance error by 14% compared to previous state-of-the-art models.
For more details, check out our dataset and code on GitHub: https://github.com/SparrowZheyuan18/Navig/. Feel free to reach out if you have any questions.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework (2025)
- PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction (2025)
- MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models (2024)
- CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation (2025)
- Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet? (2025)
- VLMs as GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks (2025)
- GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder (2025)