REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

Dacheng Yin; Fan Zhang; Fengyun Rao; Furong Jia; Jing Lyu; Kang Rong; Yong Li

arXiv: 2605.26861 · v1 · pith:45F4C4KY · submitted 2026-05-26 · 💻 cs.CV

Yong Li, Furong Jia, Dacheng Yin, Kang Rong, Fengyun Rao, Jing Lyu, Fan Zhang

Pith reviewed 2026-06-29 18:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords: image geo-localization · agentic reasoning · reinforcement learning · evidence verification · visual grounding · retrieval augmentation · multi-turn reasoning

The pith

REVERSE trains a 4B model to perform multi-turn evidence search and verification for image geo-localization, outperforming larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that image geo-localization improves when models learn to iteratively inspect regions, query external evidence, and filter results rather than predicting locations directly or using limited retrieval. It constructs annotated trajectories for three decisions—where to look, what to query, and what to trust—and uses process rewards during reinforcement learning with a stable search cache. If this holds, smaller models can match or exceed the performance of much larger ones on standard benchmarks like Im2GPS3k and YFCC4k by acquiring and verifying evidence like human experts. This matters because current methods miss the iterative human workflow, leading to brittle results on ambiguous images.

Core claim

REVERSE reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning in image geo-localization. It teaches three intermediate decisions through tool-grounded trajectories annotated with region selections, search observations, and geo-informative evidence labels, along with process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache stabilizes retrieval observations for dense supervision. A 4B model trained this way outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k.
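One way to picture the annotated trajectories is as a nested record: each turn holds a region selection and the search observations it produced, and each observation carries a geo-informative label. The sketch below is only an illustration under that reading. The class names, field names, and the example values are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical schema: names are illustrative, not the REVERSE data format.
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class SearchObservation:
    query: str             # what the agent asked the search tool
    snippet: str           # text the tool returned
    geo_informative: bool  # annotation: does this result help locate the image?


@dataclass
class Turn:
    region: Optional[BBox]  # where to look; None means the full image
    observations: List[SearchObservation] = field(default_factory=list)


@dataclass
class Trajectory:
    image_id: str
    turns: List[Turn]
    final_lat: float
    final_lon: float

    def trusted_evidence(self) -> List[SearchObservation]:
        """Return only the observations annotated as geo-informative."""
        return [o for t in self.turns for o in t.observations if o.geo_informative]


# Invented example; the coordinates are an approximate point in lower Manhattan.
traj = Trajectory(
    image_id="demo-001",
    turns=[
        Turn(region=None, observations=[
            SearchObservation("large red steel sculpture in a plaza",
                              "Joie de Vivre, Zuccotti Park, New York City", True),
        ]),
        Turn(region=(120.0, 40.0, 360.0, 300.0), observations=[
            SearchObservation("glass tower facade", "Generic high-rise listing", False),
        ]),
    ],
    final_lat=40.709,
    final_lon=-74.011,
)
print([o.snippet for o in traj.trusted_evidence()])
```

Keeping the label on each observation, rather than on the whole turn, is what would let an evidence-discrimination signal be computed per search result.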

What carries the argument

The REVERSE framework, which uses tool-grounded trajectories and process rewards to supervise the decisions of where to look, what to query, and what evidence to trust in an agentic reasoning loop.
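The text names three process rewards (visual grounding, query utility, evidence discrimination) but gives no formulas for them here. As a rough sketch of how such signals could be combined, the code below assumes box IoU for grounding, a binary hit for query utility, label agreement for discrimination, and equal weights. All of these choices, and every function name, are assumptions made for illustration rather than the paper's definitions.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def discrimination_score(trusted: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of results whose trust decision matches the annotated label."""
    if not labels:
        return 0.0
    return sum(p == l for p, l in zip(trusted, labels)) / len(labels)


def process_reward(
    pred_box: BBox,
    gold_box: BBox,
    query_hit: bool,
    trusted: Sequence[bool],
    labels: Sequence[bool],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # assumed equal weights
) -> float:
    """Weighted sum of assumed where-to-look, what-to-query, what-to-trust terms."""
    w_look, w_query, w_trust = weights
    return (
        w_look * iou(pred_box, gold_box)
        + w_query * (1.0 if query_hit else 0.0)
        + w_trust * discrimination_score(trusted, labels)
    )
```

In a full training reward this term would sit alongside the geographical accuracy and format compliance components that the Figure 3 caption mentions; how those are weighted is not stated in this text.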

If this is right

Models can handle ambiguous images by revising judgments as new clues appear.
Retrieval-augmented methods gain dense supervision on intermediate search decisions rather than only final accuracy.
Smaller models achieve competitive geo-localization without scaling model size.
Offline caches make reinforcement learning practical by reusing stable retrieval observations.
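The last point can be made concrete with a small sketch of an offline search cache: results are recorded once from a live tool, then replayed by key so every rollout sees the same observation for the same query. The class, the key normalization, and the empty-list behavior on a miss are all assumptions for illustration; the actual cache design is not described in this text.

```python
import hashlib
import json
from typing import Callable, Dict, List, Optional


def cache_key(tool: str, query: str) -> str:
    """Deterministic key; case and whitespace variants of a query collide."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{tool}\x00{normalized}".encode("utf-8")).hexdigest()


class OfflineSearchCache:
    """Hypothetical cache: populated once with a live tool, then read-only."""

    def __init__(self, live_search: Optional[Callable[[str], List[str]]] = None):
        self._store: Dict[str, List[str]] = {}
        self._live_search = live_search  # only needed while building the cache

    def populate(self, tool: str, query: str) -> List[str]:
        """Call the live tool at most once per normalized (tool, query) pair."""
        if self._live_search is None:
            raise RuntimeError("no live search function configured")
        key = cache_key(tool, query)
        if key not in self._store:
            self._store[key] = self._live_search(query)
        return self._store[key]

    def lookup(self, tool: str, query: str) -> List[str]:
        """API-free lookup for RL rollouts; a miss yields an empty result list."""
        return list(self._store.get(cache_key(tool, query), []))

    def dumps(self) -> str:
        """Serialize the cache so it can be shipped with the training data."""
        return json.dumps(self._store, sort_keys=True)

    @classmethod
    def loads(cls, payload: str) -> "OfflineSearchCache":
        cache = cls()
        cache._store = json.loads(payload)
        return cache
```

Because `lookup` never touches the network, the reward for a given query does not drift between training steps, which is the stability property the summary attributes to the cache.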

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar reinforcement on search trajectories could extend to other agentic tasks like visual question answering or document analysis.
The approach might reduce reliance on ever-larger models if the supervision generalizes across domains.
Testing on images from underrepresented regions could reveal if the trajectories capture global geo-informative cues.

Load-bearing premise

The constructed tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels supply effective and unbiased supervision for the three intermediate decisions.

What would settle it

If a model trained on these trajectories shows no improvement over baselines on held-out images that require search strategies absent from the annotations, the supervision would be shown to be insufficient.

Figures

Figures reproduced from arXiv: 2605.26861 by Dacheng Yin, Fan Zhang, Fengyun Rao, Furong Jia, Jing Lyu, Kang Rong, Yong Li.

**Figure 2.** Data generation pipeline. Kimi-K2.6 generates multi-turn geo-localization trajectories over MP-16 Pro images using live search APIs. Trajectories undergo quality filtering, geo-informative label annotation, and bounding box re-annotation to correct full-image crops. The resulting dense annotations populate an offline cache that supports API-free RL training.

**Figure 3.** REVERSE training pipeline. The model first learns direct image geo-localization through SFT, then acquires multi-turn tool-use behavior from cold-start trajectories, and finally undergoes agentic RL with offline search observations. The reward combines geographical accuracy, format compliance, and process-level signals for where to look, how to query, and what to trust. Stage 1: SFT. We fine-tune Qwen3-VL-…

**Figure 4.** Comparison of two agent trajectories on the same geo-localization case. The Qwen3-VL-4B-Instruct trajectory first retrieves strong evidence for Joie de Vivre at Zuccotti Park, New York City, but rejects it based on a Chicago architectural prior and follows a cropped-search distractor to Chicago. REVERSE’s trajectory keeps the full-image evidence, identifies the sculpture as Mark di Suvero’s Joie de Vivre …

**Figure 5.** Tool calls per sample across training stages on Im2GPS3k. Base: untuned Qwen3-VL-4B. Cold Start: Stage 2. REVERSE: full curriculum RL. RL shapes tool-use patterns, not just frequency.

**Figure 6.** SFT scaling curves on Im2GPS3k (2,997 images). Accuracy at five distance thresholds as a function of training samples seen. Solid lines: Qwen3-VL-4B; dashed lines: Qwen3-VL-8B.

G Limitations and Broader Impacts. REVERSE trades live retrieval for reproducibility: RL training uses a static offline cache, which enables high-throughput training and controlled evaluation but means the model learns search behavio…
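The Figure 6 caption reports accuracy at five distance thresholds without listing them. A minimal sketch of that kind of metric follows, using great-circle (haversine) distance between predicted and true coordinates. The threshold values of 1, 25, 200, 750, and 2500 km are an assumption based on common practice in geo-localization benchmarks, not something this text states, and the function names are illustrative.

```python
import math
from typing import Dict, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0088               # mean Earth radius
THRESHOLDS_KM = (1, 25, 200, 750, 2500)   # assumed; not given in the text

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def accuracy_at_thresholds(
    preds: Sequence[LatLon],
    golds: Sequence[LatLon],
    thresholds: Sequence[float] = THRESHOLDS_KM,
) -> Dict[float, float]:
    """Fraction of predictions within each distance threshold of the truth."""
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal in length")
    dists = [haversine_km(p[0], p[1], g[0], g[1]) for p, g in zip(preds, golds)]
    return {t: sum(d <= t for d in dists) / len(dists) for t in thresholds}


# Two predictions for a photo taken in New York: one exact, one placed in Chicago.
new_york, chicago = (40.7128, -74.0060), (41.8781, -87.6298)
print(accuracy_at_thresholds([new_york, chicago], [new_york, new_york]))
```

The Chicago guess is roughly 1,100 to 1,200 km off, so it counts as correct only at the loosest threshold, which mirrors the New York versus Chicago confusion described for Figure 4.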

read the original abstract

Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at https://github.com/yonglleee/REVERSE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

REVERSE adds RL process rewards plus an offline cache to train a 4B model on iterative geo-localization decisions, but the abstract supplies no evidence that the trajectories give unbiased supervision.

REVERSE trains a vision-language model to handle image geo-localization through multiple turns of region selection, search queries, and evidence filtering. It does this with three process rewards and an offline retrieval cache that keeps observations stable during RL. The reported outcome is that a 4B model beats retrieval baselines and matches larger models on Im2GPS3k and YFCC4k.

The concrete addition is the explicit supervision on the three intermediate choices via annotated trajectories, plus the cache trick that makes dense rewards feasible. That combination is not in the prior retrieval-augmented work they cite, and the cache is a practical step that avoids the usual instability when retrieval results change between training steps.

The main uncertainty is whether the trajectories actually supply clean supervision. The method stands or falls on how the region annotations, query labels, and evidence judgments were created. If those labels embed the same heuristics or model biases that the agent is supposed to learn past, the rewards will simply reinforce the construction process rather than produce generalizable behavior. The abstract gives no description of the annotation pipeline or any checks for bias, and no ablations appear. That leaves the performance numbers hard to interpret.

The work is aimed at people already working on agentic visual search or geo-localization pipelines. A reader who wants to replicate the cache and reward structure could extract useful implementation details from the released code. It does not reorganize the broader field.

Send it to review if the full paper shows the trajectory construction details and ablations that test whether the supervision is robust. Otherwise the central claim stays unverified.

Referee Report

1 major / 0 minor

Summary. The paper introduces REVERSE, a reinforcement learning framework for agentic image geo-localization. It constructs tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels to provide supervision for three intermediate decisions (where to look, what to query, what evidence to trust). Process rewards for visual grounding, query utility, and evidence discrimination are combined with an offline search cache to enable stable multi-turn reasoning. A 4B model is reported to outperform strong retrieval-augmented baselines and rival larger models on Im2GPS3k and YFCC4k.

Significance. If the gains prove robust, the work shows that dense process-level supervision on intermediate agentic decisions can allow smaller models to compete with much larger ones in tool-augmented visual reasoning. The public code release is a clear strength that supports reproducibility.

major comments (1)

[Abstract] The central claim that the 4B model outperforms retrieval baselines on Im2GPS3k and YFCC4k rests on the constructed trajectories supplying effective, unbiased supervision for the three decisions. No detail is given on how the offline cache and annotation process avoid embedding systematic biases in region choice, query formulation, or evidence filtering; if such biases exist, the process rewards would reinforce them rather than produce generalizable agentic behavior.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim that the 4B model outperforms retrieval baselines on Im2GPS3k and YFCC4k rests on the constructed trajectories supplying effective, unbiased supervision for the three decisions. No detail is given on how the offline cache and annotation process avoid embedding systematic biases in region choice, query formulation, or evidence filtering; if such biases exist, the process rewards would reinforce them rather than produce generalizable agentic behavior.

Authors: We agree that the abstract (and the current level of detail in the methods) does not sufficiently address potential biases in trajectory construction. The manuscript describes the use of annotated region selections, search observations, and geo-informative evidence labels together with an offline cache for stability, but provides no explicit analysis of how the annotation pipeline or cache construction avoids systematic biases in the three decisions. We will revise the abstract to note this limitation and add a dedicated subsection detailing the annotation protocol (including annotator diversity, validation against held-out geo-tags, and randomization steps) plus cache construction (e.g., uniform sampling across locations). This will allow readers to evaluate the risk that process rewards reinforce dataset-specific artifacts rather than generalizable reasoning.

Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on external benchmarks

The paper constructs custom tool-grounded trajectories and process rewards to train an agentic model for geo-localization decisions. Its central performance claims (4B model outperforming baselines on Im2GPS3k and YFCC4k) are evaluated against external, standard benchmarks rather than quantities defined inside the training loop. No equations, fitted parameters renamed as predictions, or self-citation chains are visible in the provided text that would reduce the reported results to the inputs by construction. The derivation chain remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5783 in / 1091 out tokens · 34265 ms · 2026-06-29T18:16:11.488482+00:00 · methodology


[1] [1]

On the opportunities and challenges of foundation models for GeoAI.ACM Transactions on Spatial Algorithms and Systems, 10(2):1–46, 2024

Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, Chris Cundy, Ziyuan Li, Rui Zhu, and Ni Lao. On the opportunities and challenges of foundation models for GeoAI.ACM Transactions on Spatial Algorithms and Systems, 10(2):1–46, 2024

2024

[2] [2]

OpenStreetView-5M: The many roads to global visual geolocation

Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis, Constantin Aronssohn, Nacim Bouia, Stephanie Fu, Romain Loiseau, Van Nguyen Nguyen, Charles Raude, Elliot Vincent, Lintao Xu, Hongyu Zhou, and Loic Landrieu. OpenStreetView-5M: The many roads to global visual geolocation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

2024

[3] [3]

PlaNet - photo geolocation with convolu- tional neural networks

Tobias Weyand, Ilya Kostrikov, and James Philbin. PlaNet - photo geolocation with convolu- tional neural networks. InEuropean Conference on Computer Vision (ECCV), 2016

2016

[4] [4]

CPlaNet: Enhancing image geolocalization by combinatorial partitioning of maps

Paul Hongsuck Seo, Tobias Weyand, Jack Sim, and Bohyung Han. CPlaNet: Enhancing image geolocalization by combinatorial partitioning of maps. InEuropean Conference on Computer Vision (ECCV), 2018

2018

[5] [5]

Geolocation estimation of photos using a hierarchical model and scene classification

Eric Müller-Budack, Kader Pustu-Iren, and Ralph Ewerth. Geolocation estimation of photos using a hierarchical model and scene classification. InEuropean Conference on Computer Vision (ECCV), 2018

2018

[6] [6]

Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah

V . Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[7] [7]

PIGEON: Predicting image geolocations

Lukas Haas, Michal Skreta, Silas Alberti, and Chelsea Finn. PIGEON: Predicting image geolocations. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[8] [8]

Img2Loc: Revisiting image geolocalization using multi-modality foundation models and image-based retrieval-augmented generation

Zhongliang Zhou, Jielu Zhang, Zihan Guan, Jiayu Hu, Shuwei Lao, Kaiye Mu, Yunqi Li, and Gengchen Mai. Img2Loc: Revisiting image geolocalization using multi-modality foundation models and image-based retrieval-augmented generation. InProceedings of the 47th Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024

2024

[9] [9]

G3: An effective and adaptive framework for worldwide geolocalization using large multi-modality models

Pengyue Jia, Yiding Liu, Xiaopeng Li, Yuhao Wang, Yantong Du, Xiao Han, Xuetao Wei, Shuaiqiang Wang, Dawei Yin, and Xiangyu Zhao. G3: An effective and adaptive framework for worldwide geolocalization using large multi-modality models. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[10] [10]

Vision-language reasoning for geolocalization: A reinforcement learning approach

Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, and Jun Wang. Vision-language reasoning for geolocalization: A reinforcement learning approach. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

2026

[11] [11]

Geoagent: Learning to geolocate everywhere with reinforced geographic characteristics.arXiv preprint arXiv:2602.12617, 2026

Modi Jin, Yiming Zhang, Boyuan Sun, Dingwen Zhang, MingMing Cheng, and Qibin Hou. Geoagent: Learning to geolocate everywhere with reinforced geographic characteristics.arXiv preprint arXiv:2602.12617, 2026

work page arXiv 2026

[12] Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, and Yu Liu. Spotagent: Grounding visual geo-localization in large vision-language models through agentic reasoning. arXiv preprint arXiv:2602.09463, 2026.

[13] Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, and Xiangxiang Chu. Thinking with map: Reinforced parallel map-augmented agent for geolocalization. arXiv preprint arXiv:2601.05432, 2026.

[14] Xiao Han, Chen Zhu, Xiangyu Zhao, and Hengshu Zhu. Swarm intelligence in geo-localization: A multi-agent large vision-language model collaborative framework. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 814–825, 2025.

[15] Yikun Wang, Zuyan Liu, Ziyi Wang, Han Hu, Pengfei Liu, and Yongming Rao. GeoVista: Web-augmented agentic visual reasoning for geolocalization. arXiv preprint arXiv:2511.15705, 2025.

[16] Nam Vo, Nathan Jacobs, and James Hays. Revisiting IM2GPS in the deep learning era. In IEEE International Conference on Computer Vision (ICCV), 2017.

[17] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.

[18] Shraman Pramanick, Ewa M. Nowara, Joshua Gleason, Carlos D. Castillo, and Rama Chellappa. Where in the world is this image? Transformer-based geo-localization in the wild. In European Conference on Computer Vision (ECCV), 2022.

[19] Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, and Mubarak Shah. Where we are and what we’re looking at: Query based worldwide image geo-localization using hierarchies and scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23182–23190, 2023.

[20] Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, and Sharon Li. GeoRanker: Distance-aware ranking for worldwide image geolocalization. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

[21] Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, and Jiaheng Wei. Recognition through reasoning: Reinforcing image geo-localization with large vision-language models. arXiv preprint arXiv:2506.14674, 2025.

[22] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

[23] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

[24] Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.

[25] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.

[26] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.

[27] Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271, 2025.

[28] Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Xu Tang, Yao Hu, Shaohui Lin, Philip Torr, Feng Zhao, and Wanli Ouyang. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060, 2026.

[29] Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Xu Tang, Yao Hu, Philip Torr, Wanli Ouyang, and Shaosheng Cao. Vision-deepresearch benchmark: Rethinking visual and textual search for multimodal large language models. arXiv preprint arXiv:2602.02185, 2026.

[30] Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Haohan Wang, Junjie Hu, and Yong Jae Lee. Visualtoolagent (vista): A reinforcement learning framework for visual tool selection. arXiv preprint arXiv:2505.20289, 2025.

[31] Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, and Davide Modolo. Visual reasoning through tool-supervised reinforcement learning. arXiv preprint arXiv:2604.19945, 2026.

[32] Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via RL. arXiv e-prints, pages arXiv–2505, 2025.

[33] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL Technical Report. arXiv preprint, 2025.

[34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[35] Narges Ghasemi, Amir Ziashahabi, Salman Avestimehr, and Cyrus Shahabi. GeoToken: Hierarchical geolocalization of images via next token prediction. In IEEE International Conference on Data Mining (ICDM), 2025.

[36] Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, and Mubarak Shah. Where we are and what we’re looking at: Query based worldwide image geo-localization using hierarchies and scenes. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[37] Weimin Shi, Xiang Li, Kaige Li, Junhao Fang, Qiang Zhou, Qichuan Geng, and Zhong Zhou. GeoBayes: Probabilistic image geo-localization inference via sequential Bayesian updating. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.

[38] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.

[39] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2023.

A Agent Implementation Details

A.1 Prompt

To ensure consistency between t...