Recognition: 2 theorem links
Skill-Conditioned Visual Geolocation for Vision-Language Models
Pith reviewed 2026-05-14 21:58 UTC · model grok-4.3
The pith
GeoSkill equips vision-language models with an evolving Skill-Graph that improves image geolocation accuracy and reasoning faithfulness without any parameter updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoSkill maintains a Skill-Graph of atomic natural-language skills that conditions direct reasoning by the inference model; an Autonomous Evolution loop then runs a larger model on web-scale image-coordinate pairs, verifies the resulting trajectories, and uses both successes and failures to synthesize additional skills or prune geographic biases, expanding the graph and correcting errors without any parameter updates.
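The claimed loop (rollout, verify, synthesize or prune, no parameter updates) can be sketched minimally. Everything here is an assumption for illustration: the paper's actual skill representation, rollout format, and verifier are not specified in the abstract; skills are modeled as plain strings and the verifier is an injected callback, which is exactly the component the review treats as load-bearing.

```python
# Illustrative sketch of one Autonomous Evolution cycle, assuming a skill
# set of plain strings and rollouts as (skill, prediction, ground_truth)
# triples. `verify` is an external callback standing in for whatever
# judgment the larger model performs; its reliability is the crux.
def evolve_once(graph, rollouts, verify):
    """Classify each rollout as success or failure with the verifier, then
    synthesize skills from successes and prune skills implicated in
    failures. No model parameters are touched; only the skill set changes."""
    for skill_used, prediction, ground_truth in rollouts:
        if verify(prediction, ground_truth):
            graph.add(skill_used)       # success: skill enters the graph
        else:
            graph.discard(skill_used)   # failure: prune as a likely bias
    return graph
```

In the paper the verifier role is played by the larger model judging its own rollouts, which is the step the circularity check later in this review questions.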
What carries the argument
The evolving Skill-Graph, a dynamic collection of atomic natural-language skills that guides and is refined by reasoning trajectories.
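Conditioning inference on such a graph can be pictured as retrieval plus prompt construction. The sketch below is hypothetical: the skill texts, keyword-overlap retrieval, and prompt layout are stand-ins, since the paper does not describe its retrieval or skill format in the material reviewed here.

```python
# Hypothetical skill store: short natural-language geolocation rules.
SKILLS = {
    "license-plates": "Yellow rear plates suggest the Netherlands or Luxembourg.",
    "driving-side": "Left-hand traffic narrows candidates to former British territories.",
    "vegetation": "Eucalyptus-lined roads are common in Australia and Iberia.",
}

def condition_prompt(observations, skills=SKILLS, top_k=2):
    """Score each skill by word overlap with the image observations and
    prepend the best matches to the inference model's prompt. Word overlap
    is a toy substitute for whatever retrieval the real system uses."""
    obs_words = set(" ".join(observations).lower().split())
    ranked = sorted(
        skills.items(),
        key=lambda kv: len(obs_words & set(kv[1].lower().split())),
        reverse=True,
    )
    chosen = [text for _, text in ranked[:top_k]]
    return ("Relevant skills:\n- " + "\n- ".join(chosen)
            + "\nObservations: " + "; ".join(observations))
```

The point of the design is that every reasoning step can cite a named skill, which is what the review credits for the interpretability gain.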
If this is right
- Geolocation accuracy rises on the GeoRC benchmark while reasoning faithfulness improves.
- Performance generalizes better across diverse external datasets than prior implicit-memory approaches.
- Novel, verifiable skills emerge automatically from analysis of successful and failed trajectories.
- Geographic biases are corrected through iterative pruning without retraining the underlying model.
- The system supports continuous self-evolution through repeated rollout-and-refine cycles.
Where Pith is reading between the lines
- The same rollout-verification loop could be applied to other structured reasoning domains such as medical diagnosis or legal analysis.
- If verification quality scales with model size, larger models could bootstrap progressively richer skill graphs for smaller inference models.
- Interpretability of VLM outputs increases because every reasoning step is explicitly tied to a named skill from the graph.
- Long-term maintenance of the graph could become a lightweight data-curation task rather than a full retraining process.
Load-bearing premise
Rollouts generated by a larger model on web-sourced image-coordinate pairs can be verified reliably enough to synthesize new skills and prune biases without introducing fresh hallucinations.
What would settle it
Running the evolved Skill-Graph on a held-out collection of images with known ground-truth coordinates and finding no gain in accuracy or an increase in erroneous reasoning steps compared with the initial graph would falsify the central claim.
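That falsification test is mechanically simple to state: score both the initial and evolved graphs' predictions on held-out coordinates and compare accuracy within a distance radius. The 25 km radius and the comparison protocol below are illustrative choices, not values from the paper.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in kilometres between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def accuracy_at(preds, truths, radius_km=25.0):
    """Fraction of predictions within radius_km of the ground truth."""
    hits = sum(haversine_km(p, t) <= radius_km for p, t in zip(preds, truths))
    return hits / len(truths)

def evolution_helped(initial_preds, evolved_preds, truths, radius_km=25.0):
    """The central claim survives only if the evolved graph's predictions
    beat the initial graph's on the same held-out set."""
    return (accuracy_at(evolved_preds, truths, radius_km)
            > accuracy_at(initial_preds, truths, radius_km))
```

A flat or negative result from `evolution_helped`, or a rise in erroneous reasoning steps, would be the falsifying outcome described above.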
Original abstract
Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GeoSkill, a training-free framework for vision-language models in visual geolocation. It initializes a Skill-Graph from refined human expert trajectories, guides inference with the current graph, and uses an Autonomous Evolution mechanism in which a larger model performs multiple reasoning rollouts on web-sourced image-coordinate pairs. Successful and failed trajectories are analyzed to iteratively synthesize new skills and prune biases, expanding the graph without parameter updates. The authors claim that this yields promising geolocation accuracy and reasoning faithfulness on the GeoRC benchmark, superior generalization on external datasets, and the emergence of novel verifiable skills.
Significance. If the autonomous evolution mechanism can be shown to produce skills that are independently verifiable and that genuinely correct geographic biases rather than reinforcing internal model patterns, the work would provide a concrete demonstration of training-free self-improvement for VLMs on tasks that require structured external knowledge. The Skill-Graph approach offers a potential alternative to purely parametric memory and could generalize to other reasoning domains.
major comments (3)
- [Autonomous Evolution mechanism] As described in the abstract and method, verification of rollouts is performed solely by a larger model on web-sourced image-coordinate pairs, with no independent ground-truth oracle, external map cross-validation, or human-verified subset specified. This is load-bearing for the central claims of 'verifiable skills' and 'correcting geographic biases', because shared hallucination patterns between the inference and evolution models could produce circular reinforcement rather than genuine correction.
- [Experiments and results] In the abstract and evaluation sections, the manuscript asserts 'promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC' and 'superior generalization across diverse external datasets', yet the provided description contains no quantitative metrics, ablation tables, error breakdowns, or baseline comparisons. Without these, the empirical support for the headline claims cannot be assessed.
- [Skill-Graph initialization and evolution] In the method section, the process of distilling expert trajectories into atomic skills and then synthesizing and pruning skills via rollouts lacks any description of conflict resolution, redundancy detection, or consistency checks within the graph. This omission directly affects the reliability of the guided inference step the framework relies upon.
minor comments (2)
- [Abstract] The abstract introduces the GeoRC benchmark and external datasets without a citation or brief description of their construction and size.
- [Method] Notation for the Skill-Graph (nodes, edges, skill representation) is used throughout but never given a formal definition or pseudocode, making it difficult to reproduce the initialization and update rules.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity, empirical rigor, and methodological transparency that we will address in the revision. Below we respond point-by-point to the major comments.
Point-by-point responses
-
Referee: [Autonomous Evolution mechanism] As described in the abstract and method, verification of rollouts is performed solely by a larger model on web-sourced image-coordinate pairs, with no independent ground-truth oracle, external map cross-validation, or human-verified subset specified. This is load-bearing for the central claims of 'verifiable skills' and 'correcting geographic biases', because shared hallucination patterns between the inference and evolution models could produce circular reinforcement rather than genuine correction.
Authors: We appreciate this observation on the verification process. The current description relies on the larger model to analyze rollouts on web-sourced pairs, described as 'verified real-world reasoning', and we acknowledge the potential for circular reinforcement if hallucination patterns are shared. In the revised manuscript we will add a dedicated subsection on verification, including a human-verified subset of 200 rollouts (with reported agreement statistics) and explicit prompting instructions that require the larger model to reference external geographic facts where possible. revision: partial
-
Referee: [Experiments and results] In the abstract and evaluation sections, the manuscript asserts 'promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC' and 'superior generalization across diverse external datasets', yet the provided description contains no quantitative metrics, ablation tables, error breakdowns, or baseline comparisons. Without these, the empirical support for the headline claims cannot be assessed.
Authors: We agree that the current manuscript text lacks the specific quantitative results, tables, and comparisons needed to fully support the claims. The revised version will expand the evaluation section with concrete accuracy numbers on GeoRC, baseline comparisons, ablation studies isolating the Skill-Graph and evolution components, error breakdowns by geographic region, and generalization results on the external datasets. Corresponding tables and figures will be added. revision: yes
-
Referee: [Skill-Graph initialization and evolution] In the method section, the process of distilling expert trajectories into atomic skills and then synthesizing and pruning skills via rollouts lacks any description of conflict resolution, redundancy detection, or consistency checks within the graph. This omission directly affects the reliability of the guided inference step the framework relies upon.
Authors: Thank you for pointing out this gap in the method description. We will revise the Skill-Graph section to explicitly describe: (1) conflict resolution via confidence-weighted voting from successful rollouts, (2) redundancy detection using embedding similarity thresholds with a merge policy, and (3) consistency checks performed by re-evaluating a held-out expert trajectory set after each evolution cycle. These additions will clarify how the graph remains reliable for inference. revision: yes
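The redundancy-detection step the authors promise can be sketched concretely. The version below is an assumption-laden stand-in: a bag-of-words Jaccard score substitutes for the embedding similarity the rebuttal mentions, and the threshold and keep-the-higher-confidence merge policy are illustrative, not values from the manuscript.

```python
# Sketch of redundancy detection over a skill list. Each skill is a
# (text, confidence) pair; near-duplicates of an already-kept, higher-
# confidence skill are dropped. Jaccard word overlap is a toy stand-in
# for embedding cosine similarity.
def jaccard(a, b):
    """Word-level Jaccard similarity between two skill texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def dedupe_skills(skills, threshold=0.6):
    """Greedy merge policy: walk skills in descending confidence and keep
    one only if it is not a near-duplicate of anything already kept."""
    kept = []
    for text, conf in sorted(skills, key=lambda s: -s[1]):
        if all(jaccard(text, k) < threshold for k, _ in kept):
            kept.append((text, conf))
    return kept
```

A consistency check of the kind proposed in (3) would then re-run a held-out expert trajectory set against the deduplicated graph after each evolution cycle.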
Circularity Check
Autonomous evolution's skill synthesis reduces to larger-model self-verification of its own rollouts on web pairs, with no independent oracle.
specific steps
- self-definitional: [Abstract (Autonomous Evolution mechanism)]
"an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates."
The phrase 'verified real-world reasoning' is supplied by the larger model itself judging its own rollouts; therefore the success/failure classification that drives skill synthesis is defined in terms of the model's internal outputs rather than an independent external criterion, making the emergence of 'novel, verifiable skills' equivalent to re-labeling the model's existing behavior.
full rationale
The paper's central claim of autonomous growth and emergence of novel verifiable skills rests on the evolution mechanism. The description states that a larger model performs rollouts on web-sourced image-coordinate pairs and then analyzes successful/failed trajectories to synthesize and prune skills in the Skill-Graph. Because verification of 'real-world reasoning' is performed internally by the same class of VLM (the larger model), the success labels, bias corrections, and new skills are generated from the model's own outputs rather than from external ground truth, cross-validation, or human-verified subsets. This makes the reported generalization gains and skill emergence dependent on the internal consistency of the model's judgments, satisfying the self-definitional pattern.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: VLMs rely on implicit parametric memory that produces outdated knowledge and hallucinations.
- domain assumption: Image-coordinate pairs from web-scale data can serve as reliable ground truth for verifying reasoning trajectories.
invented entities (2)
- Skill-Graph (no independent evidence)
- Autonomous Evolution mechanism (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs... iteratively synthesizes and prunes skills"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "Skill-Graph... atomic skills... Jcost not referenced"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Gabriele Berton, Carlo Masone, and Barbara Caputo. 2022. Rethinking visual geolocalization for large-scale applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4878–4888.
- [2] Gabriele Berton, Carlo Masone, Valerio Paolicelli, and Barbara Caputo. 2021. Viewpoint invariant dense matching for visual geolocalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12169–12178.
- [3] Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aritra Dutta, and Mubarak Shah. 2025. Gaea: A geolocation aware conversational model. arXiv e-prints (2025), arXiv–2503.
- [4]
- [5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
- [6]
- [7] Lukas Haas, Michal Skreta, Silas Alberti, and Chelsea Finn. 2024. Pigeon: Predicting image geolocations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12893–12902.
- [8] Xiao Han, Chen Zhu, Hengshu Zhu, and Xiangyu Zhao. 2025. Swarm intelligence in geo-localization: A multi-agent large vision-language model collaborative framework. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 814–825.
- [9]
- [10] Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, and Yu Liu.
- [11]
- [12]
- [13]
- [14]
- [15]
- [16]
- [17]
- [18] Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. 2025. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460 (2025).
- [19] Eric Muller-Budack, Kader Pustu-Iren, and Ralph Ewerth. 2018. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the European Conference on Computer Vision (ECCV). 563–579.
- [20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
- [21] Shraman Pramanick, Ewa M Nowara, Joshua Gleason, Carlos D Castillo, and Rama Chellappa. 2022. Where in the world is this image? Transformer-based geo-localization in the wild. In European Conference on Computer Vision. Springer, 196–215.
- [22]
- [23] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.
- [24] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv:1908.10084 [cs.CL]. https://arxiv.org/abs/1908.10084
- [25] Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Vol. 4. Now Publishers Inc.
- [26] Paul Hongsuck Seo, Tobias Weyand, Jack Sim, and Bohyung Han. 2018. Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. In Proceedings of the European Conference on Computer Vision (ECCV). 536–551.
- [27]
- [28] Mohit Talreja, Joshua Diao, Jim Thannikary James, Radu Casapu, Tejas Santanam, Ethan Mendes, Alan Ritter, Wei Xu, and James Hays. 2026. GeoRC: A benchmark for geolocation reasoning chains. arXiv preprint arXiv:2601.21278 (2026).
- [29] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024).
- [30] Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. 2023. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Advances in Neural Information Processing Systems 36 (2023), 8690–8701.
- [31] Nam Vo, Nathan Jacobs, and James Hays. 2017. Revisiting im2gps in the deep learning era. In Proceedings of the IEEE International Conference on Computer Vision. 2621–2630.
- [32] Chun Wang, Xiaojun Ye, Xiaoran Pan, Zihao Pan, Haofan Wang, and Yiren Song.
- [33]
- [34] Yikun Wang, Zuyan Liu, Ziyi Wang, Han Hu, Pengfei Liu, and Yongming Rao.
- [35] GeoVista: Web-augmented agentic visual reasoning for geolocalization. arXiv preprint arXiv:2511.15705 (2025).
- [36] Zhangyu Wang, Jielu Zhang, Zhongliang Zhou, Qian Cao, Nemin Wu, Zeping Liu, Lan Mu, Yang Song, Yiqun Xie, Ni Lao, et al. 2025. LocDiffusion: Identifying locations on Earth by diffusing in the Hilbert space. (2025).
- [37] Tobias Weyand, Ilya Kostrikov, and James Philbin. 2016. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision. Springer, 37–55.
- [38] Daniel Wilson, Xiaohan Zhang, Waqas Sultani, and Safwan Wshah. 2024. Image and object geo-localization. International Journal of Computer Vision 132, 4 (2024), 1350–1392.
- [39] Renjun Xu and Yang Yan. 2026. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430 (2026).
- [40] Shixiong Xu, Chenghao Zhang, Lubin Fan, Gaofeng Meng, Shiming Xiang, and Jieping Ye. 2024. Addressclip: Empowering vision-language models for city-wide image address localization. In European Conference on Computer Vision. Springer, 76–92.
- [41]
- [42] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. 2025. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025).
- [43] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. 2025. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837 (2025).
- [44] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. 2025. Group sequence policy optimization. arXiv preprint arXiv:2507.18071 (2025).
- [45]
- [46] Xin Zheng, Jialong Han, and Aixin Sun. 2018. A survey of location prediction on Twitter. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1652–1671.
- [47] Fan Zhou, Xiuxiu Qi, Kunpeng Zhang, Goce Trajcevski, and Ting Zhong. 2022. Metageo: A general framework for social user geolocation identification with few-shot learning. IEEE Transactions on Neural Networks and Learning Systems 34, 11 (2022), 8950–8964.
- [48] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).
discussion (0)