arxiv: 2602.18600 · v2 · submitted 2026-02-20 · 💻 cs.LG

Recognition: 1 theorem link

· Lean Theorem

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Ziqiao Shang , Lingyue Ge , Yang Chen , Shi-Yu Tian , Zhenyu Huang , Wenbo Fu , Yu-Feng Li , Lan-Zhe Guo

Authors on Pith no claims yet

Pith reviewed 2026-05-15 20:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords multimodal large language modelsroute planningbenchmarkmulti-criteria reasoningmap imagestabular dataevaluationmetromap

0 comments

The pith

Multimodal large language models face substantial challenges in multi-criteria route planning that combines map images with tabular data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MapTab as a benchmark that tests whether MLLMs can plan routes by grounding visual information from metro maps or attraction maps together with attributes such as travel time, price, comfort, and reliability. The benchmark includes hundreds of thousands of queries across 160 cities and 168 attractions, forcing models to balance four competing criteria at once. Evaluations of fifteen different MLLMs show consistent difficulties in this integrated reasoning. A notable result is that combining vision and language often performs worse than language alone when the model struggles to extract clear details from the map images. The authors position the benchmark as a realistic testbed for measuring progress in multimodal reasoning.

Core claim

MapTab is a multimodal benchmark for evaluating MLLMs on holistic multi-criteria reasoning in route planning tasks. It requires models to perceive and ground visual cues from map images alongside route attributes from structured tabular data. The benchmark covers metro networks in 160 cities and 168 tourist attractions, with all queries incorporating the four criteria of Time, Price, Comfort, and Reliability. Extensive evaluations across 15 MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning, and that multimodal collaboration often underperforms unimodal approaches under conditions of limited visual perception.

What carries the argument

The MapTab benchmark, which pairs map images with structured tabular route attributes to generate queries that demand simultaneous perception of visuals and balancing of four criteria.

If this is right

MLLMs require stronger mechanisms for integrating visual map details with tabular criteria to succeed at route planning.
Multimodal fusion can degrade performance relative to single-modality baselines when visual extraction is unreliable.
Benchmarks that force explicit trade-offs among time, price, comfort, and reliability expose specific weaknesses in current models.
Progress toward more capable systems will need targeted improvements in visual grounding before such tasks become reliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures might benefit from separate visual encoders trained specifically on map imagery rather than general scenes.
Similar evaluation setups could be extended to dynamic routing problems that include real-time updates.
The observed multimodal underperformance suggests training data may under-represent map-based decision scenarios.
Applications in navigation or logistics could adopt unimodal text pipelines until visual perception improves.

Load-bearing premise

The constructed queries and four criteria form a faithful proxy for real-world multi-criteria route planning.

What would settle it

A new MLLM that achieves high accuracy across the MapTab queries while showing consistent gains from multimodal inputs even when visual perception is restricted would falsify the reported substantial challenges.

Figures

Figures reproduced from arXiv: 2602.18600 by Lan-Zhe Guo, Lingyue Ge, Shi-Yu Tian, Wenbo Fu, Yang Chen, Yu-Feng Li, Zhenyu Huang, Ziqiao Shang.

**Figure 2.** Figure 2: Schematic overview of the MapTab construction pipeline, comprising 5 main steps: [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Impact of Image Resolution on RP and QA Tasks in Metromap and Travelmap Sce [PITH_FULL_IMAGE:figures/full_fig_p049_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of model accuracy across Map Difficulty and Query Difficulty under different [PITH_FULL_IMAGE:figures/full_fig_p050_4.png] view at source ↗

**Figure 5.** Figure 5: Model accuracy matrix under different combinations of Map Difficulty and Query Difficulty. [PITH_FULL_IMAGE:figures/full_fig_p050_5.png] view at source ↗

**Figure 6.** Figure 6: All images have the exact same height. Widths are adjusted automatically. [PITH_FULL_IMAGE:figures/full_fig_p052_6.png] view at source ↗

**Figure 7.** Figure 7: Error Case 1: Wrong source and destination [PITH_FULL_IMAGE:figures/full_fig_p054_7.png] view at source ↗

**Figure 8.** Figure 8: Error Case 2: Missed Transfer Label [PITH_FULL_IMAGE:figures/full_fig_p055_8.png] view at source ↗

**Figure 9.** Figure 9: Error Case 3: Wrong source and destination [PITH_FULL_IMAGE:figures/full_fig_p055_9.png] view at source ↗

**Figure 10.** Figure 10: Error Case 4: Illegal Line Jumping 56 [PITH_FULL_IMAGE:figures/full_fig_p056_10.png] view at source ↗

**Figure 11.** Figure 11: Error Case 5: Overthinking 57 [PITH_FULL_IMAGE:figures/full_fig_p057_11.png] view at source ↗

**Figure 12.** Figure 12: Error Case 6: Failed multi-criteria Reasoning [PITH_FULL_IMAGE:figures/full_fig_p058_12.png] view at source ↗

read the original abstract

Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs. Our code is available at https://github.com/Ziqiao-Shang/MapTab.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MapTab is a scaled benchmark for MLLM route planning that mixes map images with tabular data under four criteria, but the multimodal underperformance claim needs verification that inputs are matched across conditions.

read the letter

The main thing here is a new benchmark called MapTab that tests MLLMs on planning routes from map images plus tabular attributes under four criteria, and the evaluations show current models still fall short, with multimodal sometimes lagging unimodal when vision is limited. The paper builds two scenarios, one for metro networks in 160 cities across 52 countries and one for 168 tourist attractions from 19 countries. That produces 328 images and 196,800 route planning queries, all using Time, Price, Comfort, and Reliability. They run 15 models and release the code, which makes the work easy to check and extend. The scale and the concrete task of grounding visuals while applying multiple constraints give a practical way to measure progress on grounded reasoning. The evaluations provide a broad baseline that highlights where these models need work. The soft spot is the key comparison under limited visual perception. The result that multimodal collaboration often underperforms unimodal approaches depends on the inputs carrying equivalent information in both conditions. If degrading the map image makes perception harder while the tabular text stays clean, the gap could reflect input difficulty rather than reasoning or fusion limits. The abstract skips the query generation method and verification steps, though the code should let readers inspect those details. This paper is for researchers who build or test MLLMs on decision tasks that mix vision and structured data. Anyone working on benchmarks for navigation or constrained planning will find the dataset and the reported gaps useful. The central point about model challenges holds as a starting point even if the experimental contrast needs tightening. I recommend sending it for peer review so referees can examine the input construction and confirm the results.

Referee Report

2 major / 2 minor

Summary. The paper introduces MapTab, a multimodal benchmark for evaluating MLLMs on multi-criteria route planning in heterogeneous graphs. It comprises 328 map images across Metromap (metro networks in 160 cities) and Travelmap (168 tourist attractions), generating 196,800 route planning queries and 3,936 QA queries based on four criteria (Time, Price, Comfort, Reliability). Evaluations of 15 MLLMs show substantial challenges in multi-criteria multimodal reasoning, with the key finding that multimodal collaboration often underperforms unimodal approaches under limited visual perception.

Significance. If the benchmark and comparisons hold, MapTab supplies a challenging, realistic testbed for assessing MLLM reasoning over visual maps and structured tabular attributes, addressing a gap in existing multimodal benchmarks. The public code release supports reproducibility and further use of the dataset.

major comments (2)

[Experimental evaluation / query construction] The headline claim that multimodal collaboration often underperforms unimodal approaches under limited visual perception (abstract) rests on an underspecified experimental contrast. It is unclear how 'limited visual perception' is operationalized (e.g., image degradation method) and whether the unimodal tabular inputs are informationally matched to the multimodal condition; any mismatch confounds perception difficulty with fusion/reasoning difficulty and directly undermines the reported comparison.
[Benchmark construction and evaluation protocol] Details on query generation for the 196,800 route planning queries, answer verification procedure, and statistical significance testing of the unimodal-vs-multimodal results are absent from the provided description. These omissions make it impossible to assess whether the performance gaps reflect genuine reasoning limits or artifacts of query construction and evaluation.

minor comments (2)

Clarify the exact prompting regime and input formatting used for each condition (unimodal text-only vs. multimodal image+text) to allow readers to replicate the fairness of the comparison.
The claim that the four criteria form a 'faithful proxy' for real-world planning would benefit from explicit discussion of how the 160 cities and 168 attractions were selected and whether results generalize beyond these instances.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their constructive and detailed feedback, which has identified important areas for clarification in our experimental design and benchmark construction. We address each major comment point by point below. We will revise the manuscript to incorporate additional details on the experimental contrasts and query generation procedures to enhance reproducibility and address the concerns raised.

read point-by-point responses

Referee: The headline claim that multimodal collaboration often underperforms unimodal approaches under limited visual perception (abstract) rests on an underspecified experimental contrast. It is unclear how 'limited visual perception' is operationalized (e.g., image degradation method) and whether the unimodal tabular inputs are informationally matched to the multimodal condition; any mismatch confounds perception difficulty with fusion/reasoning difficulty and directly undermines the reported comparison.

Authors: We thank the referee for this observation. The manuscript's description of the 'limited visual perception' condition was indeed too brief. In the experiments, this was operationalized via controlled image degradation (downsampling combined with partial occlusions) applied only to the visual map inputs in the multimodal setting, while the unimodal tabular condition received the identical route attributes (Time, Price, Comfort, Reliability) extracted from the same underlying graphs and presented in text form. This ensures informational equivalence and isolates the impact of multimodal fusion under perceptual constraints. We will add a dedicated paragraph in the revised Section 4 (Experimental Setup) that explicitly describes the degradation parameters, provides example degraded images, and confirms the matching of tabular data across conditions to prevent any confounding. revision: yes
Referee: Details on query generation for the 196,800 route planning queries, answer verification procedure, and statistical significance testing of the unimodal-vs-multimodal results are absent from the provided description. These omissions make it impossible to assess whether the performance gaps reflect genuine reasoning limits or artifacts of query construction and evaluation.

Authors: We agree these methodological details are essential for evaluating the validity of the results. The queries were generated by enumerating all feasible origin-destination pairs per map and all combinations of the four criteria with discretized priority weights. Verification used ground-truth optimal routes computed via a multi-objective graph algorithm on the source data. Statistical tests consisted of paired t-tests with correction for the unimodal-multimodal comparisons. While these steps are referenced in Sections 3 and 4, the descriptions were insufficiently detailed. In the revision we will expand Section 3.2 with pseudocode for query generation, add a full description of the verification procedure and solver in Section 3.3, and include a new subsection in Section 4 reporting the exact statistical methods, test statistics, and p-values. revision: yes

Circularity Check

0 steps flagged

No circularity: independent benchmark construction and empirical evaluation

full rationale

The paper introduces MapTab as a new multimodal benchmark with 328 images, 196800 queries, and 3936 QA items across Metromap and Travelmap scenarios, using four criteria (Time, Price, Comfort, Reliability). All central claims rest on direct empirical results from evaluating 15 MLLMs under specified conditions, without any equations, fitted parameters, predictions, or derivations that reduce to quantities defined by the authors' own prior work. No self-citation chains are invoked to justify uniqueness or force results. The benchmark construction is presented as an original contribution rather than a renaming or self-referential fit. This is a standard non-circular empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new mathematical entities, fitted parameters, or ad-hoc axioms beyond the standard assumption that MLLMs can ingest image and tabular inputs and that the chosen criteria reflect realistic trade-offs.

axioms (1)

domain assumption MLLMs can process and ground information from both map images and structured tabular data
The entire evaluation rests on this capability being present to varying degrees in the tested models.

pith-pipeline@v0.9.0 · 5566 in / 1277 out tokens · 90396 ms · 2026-05-15T20:10:15.420352+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data... four key criteria: Time, Price, Comfort, and Reliability.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

Reference graph

Works this paper leans on

211 extracted references · 211 canonical work pages · cited by 1 Pith paper · 20 internal anchors

[1]

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...

work page 2024
[2]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026

work page arXiv 2026
[5]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page 2025
[6]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

doubao-seed-1.6-thinking

ByteDance. doubao-seed-1.6-thinking. https://www.volcengine.com/docs/82379/ 1593702?utm_source=chatgpt.com&lang=zh, 2025. 10

work page 2025
[8]

Seed1.6: Tech introduction

ByteDance Seed Team. Seed1.6: Tech introduction. https://seed.bytedance.com/en/ seed1_6, June 2025. Model ID: doubao-seed-1-6-251015. Accessed: 2025-12-25

work page 2025
[9]

Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, et al. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025

work page arXiv 2025
[10]

Representation granularity enables time-efficient autonomous exploration in large, complex worlds.Science Robotics, 8(80):eadf0970, 2023

Chao Cao, Hongbiao Zhu, Zhongqiang Ren, Howie Choset, and Ji Zhang. Representation granularity enables time-efficient autonomous exploration in large, complex worlds.Science Robotics, 8(80):eadf0970, 2023

work page 2023
[11]

Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding

Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 21819–21830, 2024

work page 2024
[12]

Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

work page arXiv 2025
[13]

Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion.PLoS One, 20(5):e0321257, 2025

Yan Chen. Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion.PLoS One, 20(5):e0321257, 2025

work page 2025
[14]

Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025

Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025

work page arXiv 2025
[15]

Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

work page arXiv 2025
[16]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv:2505.11831, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

A survey on multimodal large language models for autonomous driving

Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 958–979, 2024

work page 2024
[18]

Mape- val: A map-based evaluation of geo-spatial reasoning in foundation models.arXiv preprint arXiv:2501.00316, 2024

Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. Mape- val: A map-based evaluation of geo-spatial reasoning in foundation models.arXiv preprint arXiv:2501.00316, 2024

work page arXiv 2024
[19]

Travellm: Could you plan my new public transit route in face of a network disruption?arXiv preprint arXiv:2407.14926, 2024

Bowen Fang, Zixiao Yang, and Xuan Di. Travellm: Could you plan my new public transit route in face of a network disruption?arXiv preprint arXiv:2407.14926, 2024

work page arXiv 2024
[20]

Citybench: Evaluating the capabilities of large language models for urban tasks

Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language models for urban tasks. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5413–5424, 2025

work page 2025
[21]

Citybench: Evaluating the capabilities of large language model as world model.arXiv e-prints, pages arXiv–2406, 2024

Jie Feng, Jun Zhang, Junbo Yan, Xin Zhang, Tianjian Ouyang, Tianhui Liu, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language model as world model.arXiv e-prints, pages arXiv–2406, 2024

work page 2024
[22]

Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025

work page arXiv 2025
[23]

Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage rein- forcement learning.arXiv preprint arXiv:2510.02240, 2025

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage rein- forcement learning.arXiv preprint arXiv:2510.02240, 2025. 11

work page arXiv 2025
[24]

Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

work page arXiv 2025
[25]

Drive like a human: Rethinking autonomous driving with large language models

Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024

work page 2024
[26]

Gemini 3 flash: Frontier intelligence built for speed

Google. Gemini 3 flash: Frontier intelligence built for speed. https://blog. google/products-and-platforms/products/gemini/gemini-3-flash/ , December

work page
[27]

Accessed: 2025-12-25

Model ID: gemini-3-flash-preview. Accessed: 2025-12-25

work page 2025
[28]

Reasoning-aligned perception decoupling for scalable multi-modal reasoning.arXiv preprint arXiv:2506.04559, 2025

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T Kwok, and Yu Zhang. Reasoning-aligned perception decoupling for scalable multi-modal reasoning.arXiv preprint arXiv:2506.04559, 2025

work page arXiv 2025
[29]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Enhanced natural language annotation and query for semantic mapping in visual slam using large language models.Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025

Lingfeng Guo, Zihan Li, and Shengjie Min. Enhanced natural language annotation and query for semantic mapping in visual slam using large language models.Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025

work page 2025
[31]

R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation.arXiv preprint arXiv:2505.02018, 2025

Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, et al. R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation.arXiv preprint arXiv:2505.02018, 2025

work page arXiv 2025
[32]

Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.arXiv preprint arXiv:2506.15677, 2025

Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, and Kai-Wei Chang. Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.arXiv preprint arXiv:2506.15677, 2025

work page arXiv 2025
[33]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

work page arXiv 2025
[34]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

work page 2024
[35]

Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, and Tongliang Liu. Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025

work page arXiv 2025
[36]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks

Varvara Krechetova and Denis Kochedykov. Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks. InProceedings of the 1st ACM SIGSPATIAL International Workshop on Generative and Agentic AI for Multi-Modality Space-Time Intelligence, pages 27–35, 2025

work page 2025
[38]

Mmcode: Benchmarking multimodal large language models for code generation with visually rich programming problems.arXiv preprint arXiv:2404.09486, 2024

Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. Mmcode: Benchmarking multimodal large language models for code generation with visually rich programming problems.arXiv preprint arXiv:2404.09486, 2024

work page arXiv 2024
[39]

Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark

Ming Li, Jike Zhong, Tianle Chen, Yuxiang Lai, and Konstantinos Psounis. Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13337–13349, 2025. 12

work page 2025
[40]

Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025

Zekun Li, Malcolm Grossman, Mihir Kulkarni, Muhao Chen, Yao-Yi Chiang, et al. Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025

work page arXiv 2025
[41]

Improved baselines with visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

work page 2023
[42]

Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025

work page arXiv 2025
[43]

Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving.arXiv preprint arXiv:2512.09864, 2025

Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, and Ying-Cong Chen. Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving.arXiv preprint arXiv:2512.09864, 2025

work page arXiv 2025
[44]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Inter- gps: Interpretable geometry problem solving with formal language and symbolic reasoning

Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning.arXiv preprint arXiv:2105.04165, 2021

work page arXiv 2021
[46]

Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2. 5 technical report.arXiv preprint arXiv:2508.11737, 2025

work page internal anchor Pith review arXiv 2025
[47]

Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought.arXiv preprint arXiv:2506.04277, 2025

Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought.arXiv preprint arXiv:2506.04277, 2025

work page arXiv 2025
[48]

Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains.arXiv preprint arXiv:2603.00873, 2026

Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, and Jingrui He. Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains.arXiv preprint arXiv:2603.00873, 2026

work page arXiv 2026
[49]

OpenAI o1.https://openai.com/o1/, 2024

OpenAI. OpenAI o1.https://openai.com/o1/, 2024

work page 2024
[50]

Gpt-4.1 model card

OpenAI. Gpt-4.1 model card. https://platform.openai.com/docs/models/gpt-4.1, April 2025. Released on April 14, 2025

work page 2025
[51]

OpenAI o3 and o4-mini System Card

OpenAI. OpenAI o3 and o4-mini System Card. https://cdn.openai.com/pdf/ 2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf , 2025

work page 2025
[52]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Frieda: Benchmarking multi-step cartographic reasoning in vision-language models.arXiv preprint arXiv:2512.08016, 2025

Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, et al. Frieda: Benchmarking multi-step cartographic reasoning in vision-language models.arXiv preprint arXiv:2512.08016, 2025

work page arXiv 2025
[54]

Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities.arXiv preprint arXiv:2510.08759, 2025

Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, et al. Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities.arXiv preprint arXiv:2510.08759, 2025

work page arXiv 2025
[55]

Runqi Qiao, Qiuna Tan, Guanting Dong, MinhuiWu MinhuiWu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...

work page 2025
[56]

Navbench: Probing multimodal large language models for embodied navigation

Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025. 13

work page arXiv 2025
[57]

Urbandrivepathway: A decision-making framework for navigating urban autonomous vehicles in complex traffic systems

Jarabala Ranga, A ARUL PRASATH, Neeraj Kumar, R Naveenkumar, Parashuram S Vadar, and AS Syed Fiaz. Urbandrivepathway: A decision-making framework for navigating urban autonomous vehicles in complex traffic systems. In2025 8th International Conference on Trends in Electronics and Informatics (ICOEI), pages 1575–1582. IEEE, 2025

work page 2025
[58]

Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models.arXiv preprint arXiv:2503.23064, 2025

Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models.arXiv preprint arXiv:2503.23064, 2025

work page arXiv 2025
[59]

Bridging text and vision: A multi-view text-vision registration approach for cross-modal place recognition.arXiv preprint arXiv:2502.14195, 2025

Tianyi Shang, Zhenyu Li, Pengjie Xu, Jinwei Qiao, Gang Chen, Zihan Ruan, and Weijun Hu. Bridging text and vision: A multi-view text-vision registration approach for cross-modal place recognition.arXiv preprint arXiv:2502.14195, 2025

work page arXiv 2025
[60]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

A survey on the applications of frontier ai, foundation models, and large language models to intelligent transportation systems

Mohamed R Shoaib, Heba M Emara, and Jun Zhao. A survey on the applications of frontier ai, foundation models, and large language models to intelligent transportation systems. In2023 International Conference on Computer and Applications (ICCA), pages 1–7. IEEE, 2023

work page 2023
[62]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[63]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

work page 2024
[64]

Codedance: A dynamic tool-integrated mllm for executable visual reasoning.arXiv preprint arXiv:2512.17312, 2025

Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, and Yunqing Zhao. Codedance: A dynamic tool-integrated mllm for executable visual reasoning.arXiv preprint arXiv:2512.17312, 2025

work page arXiv 2025
[65]

Visualpuz- zles: Decoupling multimodal reasoning evaluation from domain knowledge.arXiv preprint arXiv:2504.10342, 2025

Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, and Xiang Yue. Visualpuz- zles: Decoupling multimodal reasoning evaluation from domain knowledge.arXiv preprint arXiv:2504.10342, 2025

work page arXiv 2025
[66]

Mapiq: Evaluating multimodal large language models for map question answering.arXiv preprint arXiv:2507.11625, 2025

Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, and Ross Maciejewski. Mapiq: Evaluating multimodal large language models for map question answering.arXiv preprint arXiv:2507.11625, 2025

work page arXiv 2025
[67]

Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

work page arXiv 2025
[68]

Stardojo: Benchmarking open-ended behaviors of agentic multimodal llms in production-living simulations with stardew valley.arXiv preprint arXiv:2507.07445, 2025

Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, and Bo An. Stardojo: Benchmarking open-ended behaviors of agentic multimodal llms in production-living simulations with stardew valley.arXiv preprint arXiv:2507.07445, 2025

work page arXiv 2025
[69]

Lumine: An open recipe for building generalist agents in 3d open worlds.arXiv preprint arXiv:2511.08892, 2025

Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, et al. Lumine: An open recipe for building generalist agents in 3d open worlds.arXiv preprint arXiv:2511.08892, 2025

work page arXiv 2025
[70]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[71]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025. 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

Cartomapqa: A fundamental benchmark dataset evaluating vision-language models on cartographic map understanding

Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, et al. Cartomapqa: A fundamental benchmark dataset evaluating vision-language models on cartographic map understanding. InProceedings of the 33rd ACM International Conference on Advances in Geographic I...

work page 2025
[73]

A comprehensive review of path planning algorithms for autonomous navigation.Results in Engineering, page 107750, 2025

Sangeeth Venu and Muralimohan Gurusamy. A comprehensive review of path planning algorithms for autonomous navigation.Results in Engineering, page 107750, 2025

work page 2025
[74]

Measuring multimodal mathematical reasoning with math-vision dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

work page 2024
[75]

Multi-level symmetric semantic alignment network for image–text matching.Neurocomputing, 599:128082, 2024

Wenzhuang Wang, Xiaoguang Di, Maozhen Liu, and Feng Gao. Multi-level symmetric semantic alignment network for image–text matching.Neurocomputing, 599:128082, 2024

work page 2024
[76]

Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning.arXiv preprint arXiv:2507.06448, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[77]

Game-tars: Pretrained foundation models for scalable generalist multimodal game agents.arXiv preprint arXiv:2510.23691, 2025

Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, et al. Game-tars: Pretrained foundation models for scalable generalist multimodal game agents.arXiv preprint arXiv:2510.23691, 2025

work page arXiv 2025
[78]

Dilu: A knowledge-driven approach to autonomous driving with large language models.arXiv preprint arXiv:2309.16292, 2023

Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models.arXiv preprint arXiv:2309.16292, 2023

work page arXiv 2023
[79]

A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai.arXiv preprint arXiv:2505.01458, 2025

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, and Jianwei Zhang. A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai.arXiv preprint arXiv:2505.01458, 2025

work page arXiv 2025
[80]

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spa- tialscore: Towards unified evaluation for multimodal spatial understanding.arXiv preprint arXiv:2505.17012, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.