Recognition: 1 theorem link
· Lean TheoremMapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
Pith reviewed 2026-05-15 20:10 UTC · model grok-4.3
The pith
Multimodal large language models face substantial challenges in multi-criteria route planning that combines map images with tabular data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MapTab is a multimodal benchmark for evaluating MLLMs on holistic multi-criteria reasoning in route planning tasks. It requires models to perceive and ground visual cues from map images alongside route attributes from structured tabular data. The benchmark covers metro networks in 160 cities and 168 tourist attractions, with all queries incorporating the four criteria of Time, Price, Comfort, and Reliability. Extensive evaluations across 15 MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning, and that multimodal collaboration often underperforms unimodal approaches under conditions of limited visual perception.
What carries the argument
The MapTab benchmark, which pairs map images with structured tabular route attributes to generate queries that demand simultaneous perception of visuals and balancing of four criteria.
If this is right
- MLLMs require stronger mechanisms for integrating visual map details with tabular criteria to succeed at route planning.
- Multimodal fusion can degrade performance relative to single-modality baselines when visual extraction is unreliable.
- Benchmarks that force explicit trade-offs among time, price, comfort, and reliability expose specific weaknesses in current models.
- Progress toward more capable systems will need targeted improvements in visual grounding before such tasks become reliable.
Where Pith is reading between the lines
- Architectures might benefit from separate visual encoders trained specifically on map imagery rather than general scenes.
- Similar evaluation setups could be extended to dynamic routing problems that include real-time updates.
- The observed multimodal underperformance suggests training data may under-represent map-based decision scenarios.
- Applications in navigation or logistics could adopt unimodal text pipelines until visual perception improves.
Load-bearing premise
The constructed queries and four criteria form a faithful proxy for real-world multi-criteria route planning.
What would settle it
A new MLLM that achieves high accuracy across the MapTab queries while showing consistent gains from multimodal inputs even when visual perception is restricted would falsify the reported substantial challenges.
Figures
read the original abstract
Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their reasoning capabilities under multi-criteria constraints. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate holistic multi-criteria reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key criteria: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in multi-criteria multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs. Our code is available at https://github.com/Ziqiao-Shang/MapTab.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MapTab, a multimodal benchmark for evaluating MLLMs on multi-criteria route planning in heterogeneous graphs. It comprises 328 map images across Metromap (metro networks in 160 cities) and Travelmap (168 tourist attractions), generating 196,800 route planning queries and 3,936 QA queries based on four criteria (Time, Price, Comfort, Reliability). Evaluations of 15 MLLMs show substantial challenges in multi-criteria multimodal reasoning, with the key finding that multimodal collaboration often underperforms unimodal approaches under limited visual perception.
Significance. If the benchmark and comparisons hold, MapTab supplies a challenging, realistic testbed for assessing MLLM reasoning over visual maps and structured tabular attributes, addressing a gap in existing multimodal benchmarks. The public code release supports reproducibility and further use of the dataset.
major comments (2)
- [Experimental evaluation / query construction] The headline claim that multimodal collaboration often underperforms unimodal approaches under limited visual perception (abstract) rests on an underspecified experimental contrast. It is unclear how 'limited visual perception' is operationalized (e.g., image degradation method) and whether the unimodal tabular inputs are informationally matched to the multimodal condition; any mismatch confounds perception difficulty with fusion/reasoning difficulty and directly undermines the reported comparison.
- [Benchmark construction and evaluation protocol] Details on query generation for the 196,800 route planning queries, answer verification procedure, and statistical significance testing of the unimodal-vs-multimodal results are absent from the provided description. These omissions make it impossible to assess whether the performance gaps reflect genuine reasoning limits or artifacts of query construction and evaluation.
minor comments (2)
- Clarify the exact prompting regime and input formatting used for each condition (unimodal text-only vs. multimodal image+text) to allow readers to replicate the fairness of the comparison.
- The claim that the four criteria form a 'faithful proxy' for real-world planning would benefit from explicit discussion of how the 160 cities and 168 attractions were selected and whether results generalize beyond these instances.
Simulated Author's Rebuttal
We sincerely thank the referee for their constructive and detailed feedback, which has identified important areas for clarification in our experimental design and benchmark construction. We address each major comment point by point below. We will revise the manuscript to incorporate additional details on the experimental contrasts and query generation procedures to enhance reproducibility and address the concerns raised.
read point-by-point responses
-
Referee: The headline claim that multimodal collaboration often underperforms unimodal approaches under limited visual perception (abstract) rests on an underspecified experimental contrast. It is unclear how 'limited visual perception' is operationalized (e.g., image degradation method) and whether the unimodal tabular inputs are informationally matched to the multimodal condition; any mismatch confounds perception difficulty with fusion/reasoning difficulty and directly undermines the reported comparison.
Authors: We thank the referee for this observation. The manuscript's description of the 'limited visual perception' condition was indeed too brief. In the experiments, this was operationalized via controlled image degradation (downsampling combined with partial occlusions) applied only to the visual map inputs in the multimodal setting, while the unimodal tabular condition received the identical route attributes (Time, Price, Comfort, Reliability) extracted from the same underlying graphs and presented in text form. This ensures informational equivalence and isolates the impact of multimodal fusion under perceptual constraints. We will add a dedicated paragraph in the revised Section 4 (Experimental Setup) that explicitly describes the degradation parameters, provides example degraded images, and confirms the matching of tabular data across conditions to prevent any confounding. revision: yes
-
Referee: Details on query generation for the 196,800 route planning queries, answer verification procedure, and statistical significance testing of the unimodal-vs-multimodal results are absent from the provided description. These omissions make it impossible to assess whether the performance gaps reflect genuine reasoning limits or artifacts of query construction and evaluation.
Authors: We agree these methodological details are essential for evaluating the validity of the results. The queries were generated by enumerating all feasible origin-destination pairs per map and all combinations of the four criteria with discretized priority weights. Verification used ground-truth optimal routes computed via a multi-objective graph algorithm on the source data. Statistical tests consisted of paired t-tests with correction for the unimodal-multimodal comparisons. While these steps are referenced in Sections 3 and 4, the descriptions were insufficiently detailed. In the revision we will expand Section 3.2 with pseudocode for query generation, add a full description of the verification procedure and solver in Section 3.3, and include a new subsection in Section 4 reporting the exact statistical methods, test statistics, and p-values. revision: yes
Circularity Check
No circularity: independent benchmark construction and empirical evaluation
full rationale
The paper introduces MapTab as a new multimodal benchmark with 328 images, 196800 queries, and 3936 QA items across Metromap and Travelmap scenarios, using four criteria (Time, Price, Comfort, Reliability). All central claims rest on direct empirical results from evaluating 15 MLLMs under specified conditions, without any equations, fitted parameters, predictions, or derivations that reduce to quantities defined by the authors' own prior work. No self-citation chains are invoked to justify uniqueness or force results. The benchmark construction is presented as an original contribution rather than a renaming or self-referential fit. This is a standard non-circular empirical benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption MLLMs can process and ground information from both map images and structured tabular data
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data... four key criteria: Time, Price, Comfort, and Reliability.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.
Reference graph
Works this paper leans on
-
[1]
Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...
work page 2024
-
[2]
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, and Mattia Rigotti. Sparc: Separating perception and reasoning circuits for test-time scaling of vlms.arXiv preprint arXiv:2602.06566, 2026
-
[5]
Qwen3-vl technical report, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
work page 2025
-
[6]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
ByteDance. doubao-seed-1.6-thinking. https://www.volcengine.com/docs/82379/ 1593702?utm_source=chatgpt.com&lang=zh, 2025. 10
work page 2025
-
[8]
ByteDance Seed Team. Seed1.6: Tech introduction. https://seed.bytedance.com/en/ seed1_6, June 2025. Model ID: doubao-seed-1-6-251015. Accessed: 2025-12-25
work page 2025
-
[9]
Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025
Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, et al. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142, 2025
-
[10]
Chao Cao, Hongbiao Zhu, Zhongqiang Ren, Howie Choset, and Ji Zhang. Representation granularity enables time-efficient autonomous exploration in large, complex worlds.Science Robotics, 8(80):eadf0970, 2023
work page 2023
-
[11]
Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding
Xu Cao, Tong Zhou, Yunsheng Ma, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M Rehg, et al. Maplm: A real-world large-scale vision-language benchmark for map and traffic scene understanding. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 21819–21830, 2024
work page 2024
-
[12]
Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025
-
[13]
Yan Chen. Path planning algorithm for logistics autonomous vehicles at cainiao stations based on multi-sensor data fusion.PLoS One, 20(5):e0321257, 2025
work page 2025
-
[14]
Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025
Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compres- sion.arXiv preprint arXiv:2510.17800, 2025
-
[15]
Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025
-
[16]
ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv:2505.11831, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
A survey on multimodal large language models for autonomous driving
Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 958–979, 2024
work page 2024
-
[18]
Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. Mape- val: A map-based evaluation of geo-spatial reasoning in foundation models.arXiv preprint arXiv:2501.00316, 2024
-
[19]
Bowen Fang, Zixiao Yang, and Xuan Di. Travellm: Could you plan my new public transit route in face of a network disruption?arXiv preprint arXiv:2407.14926, 2024
-
[20]
Citybench: Evaluating the capabilities of large language models for urban tasks
Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language models for urban tasks. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5413–5424, 2025
work page 2025
-
[21]
Jie Feng, Jun Zhang, Junbo Yan, Xin Zhang, Tianjian Ouyang, Tianhui Liu, Yuwei Du, Siqi Guo, and Yong Li. Citybench: Evaluating the capabilities of large language model as world model.arXiv e-prints, pages arXiv–2406, 2024
work page 2024
-
[22]
Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025
Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey.arXiv preprint arXiv:2504.10903, 2025
-
[23]
Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, and Huan Wang. Rewardmap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage rein- forcement learning.arXiv preprint arXiv:2510.02240, 2025. 11
-
[24]
Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025
-
[25]
Drive like a human: Rethinking autonomous driving with large language models
Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024
work page 2024
-
[26]
Gemini 3 flash: Frontier intelligence built for speed
Google. Gemini 3 flash: Frontier intelligence built for speed. https://blog. google/products-and-platforms/products/gemini/gemini-3-flash/ , December
- [27]
-
[28]
Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T Kwok, and Yu Zhang. Reasoning-aligned perception decoupling for scalable multi-modal reasoning.arXiv preprint arXiv:2506.04559, 2025
-
[29]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Lingfeng Guo, Zihan Li, and Shengjie Min. Enhanced natural language annotation and query for semantic mapping in visual slam using large language models.Journal of Sustainability, Policy, and Practice, 1(3):131–143, 2025
work page 2025
-
[31]
Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, et al. R-bench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation.arXiv preprint arXiv:2505.02018, 2025
-
[32]
Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, and Kai-Wei Chang. Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.arXiv preprint arXiv:2506.15677, 2025
-
[33]
Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025
-
[34]
Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024
work page 2024
-
[35]
Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, and Tongliang Liu. Mllm-for3d: Adapting multimodal large language model for 3d reasoning segmentation.arXiv preprint arXiv:2503.18135, 2025
-
[36]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks
Varvara Krechetova and Denis Kochedykov. Geobenchx: Benchmarking llms in agent solving multistep geospatial tasks. InProceedings of the 1st ACM SIGSPATIAL International Workshop on Generative and Agentic AI for Multi-Modality Space-Time Intelligence, pages 27–35, 2025
work page 2025
-
[38]
Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. Mmcode: Benchmarking multimodal large language models for code generation with visually rich programming problems.arXiv preprint arXiv:2404.09486, 2024
-
[39]
Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark
Ming Li, Jike Zhong, Tianle Chen, Yuxiang Lai, and Konstantinos Psounis. Eee-bench: A comprehensive multimodal electrical and electronics engineering benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13337–13349, 2025. 12
work page 2025
-
[40]
Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025
Zekun Li, Malcolm Grossman, Mihir Kulkarni, Muhao Chen, Yao-Yi Chiang, et al. Mapqa: Open-domain geospatial question answering on map data.arXiv preprint arXiv:2503.07871, 2025
-
[41]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023
work page 2023
-
[42]
Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025
-
[43]
Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, and Ying-Cong Chen. Uniugp: Unifying understanding, generation, and planing for end-to-end autonomous driving.arXiv preprint arXiv:2512.09864, 2025
-
[44]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Inter- gps: Interpretable geometry problem solving with formal language and symbolic reasoning
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning.arXiv preprint arXiv:2105.04165, 2021
-
[46]
Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2. 5 technical report.arXiv preprint arXiv:2508.11737, 2025
work page internal anchor Pith review arXiv 2025
-
[47]
Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. Rsvp: Reasoning segmentation via visual prompting and multi-modal chain-of-thought.arXiv preprint arXiv:2506.04277, 2025
-
[48]
Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, and Jingrui He. Mc-search: Evaluating and enhancing multimodal agentic search with structured long reasoning chains.arXiv preprint arXiv:2603.00873, 2026
-
[49]
OpenAI o1.https://openai.com/o1/, 2024
OpenAI. OpenAI o1.https://openai.com/o1/, 2024
work page 2024
-
[50]
OpenAI. Gpt-4.1 model card. https://platform.openai.com/docs/models/gpt-4.1, April 2025. Released on April 14, 2025
work page 2025
-
[51]
OpenAI o3 and o4-mini System Card
OpenAI. OpenAI o3 and o4-mini System Card. https://cdn.openai.com/pdf/ 2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf , 2025
work page 2025
-
[52]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, et al. Frieda: Benchmarking multi-step cartographic reasoning in vision-language models.arXiv preprint arXiv:2512.08016, 2025
-
[54]
Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, et al. Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities.arXiv preprint arXiv:2510.08759, 2025
-
[55]
Runqi Qiao, Qiuna Tan, Guanting Dong, MinhuiWu MinhuiWu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...
work page 2025
-
[56]
Navbench: Probing multimodal large language models for embodied navigation
Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025. 13
-
[57]
Jarabala Ranga, A ARUL PRASATH, Neeraj Kumar, R Naveenkumar, Parashuram S Vadar, and AS Syed Fiaz. Urbandrivepathway: A decision-making framework for navigating urban autonomous vehicles in complex traffic systems. In2025 8th International Conference on Trends in Electronics and Informatics (ICOEI), pages 1575–1582. IEEE, 2025
work page 2025
-
[58]
Yufan Ren, Konstantinos Tertikas, Shalini Maiti, Junlin Han, Tong Zhang, Sabine Süsstrunk, and Filippos Kokkinos. Vgrp-bench: Visual grid reasoning puzzle benchmark for large vision-language models.arXiv preprint arXiv:2503.23064, 2025
-
[59]
Tianyi Shang, Zhenyu Li, Pengjie Xu, Jinwei Qiao, Gang Chen, Zihan Ruan, and Weijun Hu. Bridging text and vision: A multi-view text-vision registration approach for cross-modal place recognition.arXiv preprint arXiv:2502.14195, 2025
-
[60]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
Mohamed R Shoaib, Heba M Emara, and Jun Zhao. A survey on the applications of frontier ai, foundation models, and large language models to intelligent transportation systems. In2023 International Conference on Computer and Applications (ICCA), pages 1–7. IEEE, 2023
work page 2023
-
[62]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[63]
Drivelm: Driving with graph visual question answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024
work page 2024
-
[64]
Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, and Yunqing Zhao. Codedance: A dynamic tool-integrated mllm for executable visual reasoning.arXiv preprint arXiv:2512.17312, 2025
-
[65]
Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, and Xiang Yue. Visualpuz- zles: Decoupling multimodal reasoning evaluation from domain knowledge.arXiv preprint arXiv:2504.10342, 2025
-
[66]
Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, and Ross Maciejewski. Mapiq: Evaluating multimodal large language models for map question answering.arXiv preprint arXiv:2507.11625, 2025
-
[67]
Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025
Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025
-
[68]
Weihao Tan, Changjiu Jiang, Yu Duan, Mingcong Lei, Jiageng Li, Yitian Hong, Xinrun Wang, and Bo An. Stardojo: Benchmarking open-ended behaviors of agentic multimodal llms in production-living simulations with stardew valley.arXiv preprint arXiv:2507.07445, 2025
-
[69]
Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, et al. Lumine: An open recipe for building generalist agents in 3d open worlds.arXiv preprint arXiv:2511.08892, 2025
-
[70]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[71]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025. 14
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[72]
Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi, Tomoki Oya, Ryoichi Kojima, Masato Taya, Chihiro Ono, Atsunori Minamikawa, et al. Cartomapqa: A fundamental benchmark dataset evaluating vision-language models on cartographic map understanding. InProceedings of the 33rd ACM International Conference on Advances in Geographic I...
work page 2025
-
[73]
Sangeeth Venu and Muralimohan Gurusamy. A comprehensive review of path planning algorithms for autonomous navigation.Results in Engineering, page 107750, 2025
work page 2025
-
[74]
Measuring multimodal mathematical reasoning with math-vision dataset
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024
work page 2024
-
[75]
Wenzhuang Wang, Xiaoguang Di, Maozhen Liu, and Feng Gao. Multi-level symmetric semantic alignment network for image–text matching.Neurocomputing, 599:128082, 2024
work page 2024
-
[76]
Perception-Aware Policy Optimization for Multimodal Reasoning
Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning.arXiv preprint arXiv:2507.06448, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[77]
Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, et al. Game-tars: Pretrained foundation models for scalable generalist multimodal game agents.arXiv preprint arXiv:2510.23691, 2025
-
[78]
Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models.arXiv preprint arXiv:2309.16292, 2023
-
[79]
Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, and Jianwei Zhang. A survey of robotic navigation and manipulation with physics simulators in the era of embodied ai.arXiv preprint arXiv:2505.01458, 2025
-
[80]
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spa- tialscore: Towards unified evaluation for multimodal spatial understanding.arXiv preprint arXiv:2505.17012, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.