pith. machine review for the scientific record.

arxiv: 2604.09712 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:22 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: spatial reasoning · multimodal large language models · tool augmentation · LAST-Box · vision tools · progressive training · geometric layouts

The pith

LAST framework turns vision tool outputs into hints that boost MLLM spatial reasoning by around 20 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LAST as a unified framework that lets multimodal large language models draw on specialized vision tools to handle complex spatial layouts more accurately. It creates LAST-Box, an interactive sandbox that converts calls to heterogeneous tools into atomic instructions and reusable spatial skills, then returns annotated images and text that the models can read directly. A three-stage training process first teaches the models to interpret tool outputs, then builds skill in invoking tools, and finally refines adaptive use. This approach targets the core problem that pure data scaling fails to instill reliable geometric priors, leading to hallucinations on spatial tasks. If the method works as described, models become better at perceiving and reasoning about physical arrangements without requiring entirely new large-scale training data.
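
To make the sandbox idea concrete, here is a minimal Python sketch of the pattern the paper describes: heterogeneous vision tools sit behind a uniform registry of atomic instructions, and each call returns a multimodal hint (an annotated-image reference plus a textual description) that can be dropped into the model's prompt. Every name, adapter, and output below is an illustrative assumption, not the paper's actual interface.

from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Hint:
    annotated_image: str  # path to an image with overlays (boxes, depth colormap, ...)
    text: str             # textual summary of the tool output

class ToolSandbox:
    """Registry that maps atomic instruction names to tool adapters."""
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Hint]] = {}

    def register(self, name: str, adapter: Callable[..., Hint]) -> None:
        self._tools[name] = adapter

    def call(self, name: str, **kwargs: Any) -> Hint:
        # A real sandbox would also validate arguments and cache reusable "skills".
        return self._tools[name](**kwargs)

def detect_objects(image: str, query: str) -> Hint:
    # Placeholder adapter: a real one would run an open-set detector and draw boxes.
    return Hint(annotated_image=image + ".boxes.png",
                text=f"Two instances of '{query}' found; the leftmost box is largest.")

def estimate_depth(image: str) -> Hint:
    # Placeholder adapter: a real one would run a monocular depth estimator.
    return Hint(annotated_image=image + ".depth.png",
                text="Median depth of the left region is smaller than the right region.")

sandbox = ToolSandbox()
sandbox.register("detect", detect_objects)
sandbox.register("depth", estimate_depth)

hint = sandbox.call("detect", image="scene.jpg", query="chair")
prompt = (f"[image: {hint.annotated_image}]\n"
          f"Tool hint: {hint.text}\n"
          "Question: Which chair is nearer to the camera?")
print(prompt)

The point of the abstraction is the last three lines: from the LLM's side, every tool looks the same, and the hint is just more context in the prompt.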

Core claim

LAST-Box abstracts diverse vision tool calls into atomic instructions and reusable spatial skills that return multimodal hints directly usable by LLMs. A three-stage progressive training strategy then moves models from basic understanding of those hints to proficient and adaptive tool invocation. On four datasets, the resulting LAST-7B model records approximately 20 percent gains over its backbone and exceeds the performance of several strong proprietary closed-source LLMs on complex spatial reasoning.

What carries the argument

LAST-Box, an extensible interactive sandbox that converts heterogeneous tool invocations into atomic instructions and reusable spatial skills while returning multimodal hints for direct LLM consumption.

If this is right

  • LAST-7B records around 20 percent performance improvement over its backbone model on spatial reasoning benchmarks.
  • The three-stage training enables models to progress from interpreting tool outputs to adaptive and proficient tool use.
  • Multimodal hints from abstracted tools allow smaller open models to outperform certain closed-source LLMs on complex geometric tasks.
  • The framework provides an alternative to data scaling when internalizing structured geometric priors and spatial constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hint-abstraction pattern could be applied to other multimodal domains such as temporal or causal reasoning.
  • Reusable spatial skills created inside LAST-Box might be shared across different models and tasks as modular components.
  • In practice this method could reduce the amount of task-specific fine-tuning data needed for reliable physical-world interaction.
  • Integrating additional tool types beyond vision, such as simulation engines, would be a direct next extension.

Load-bearing premise

The multimodal hints produced by LAST-Box can be fed directly to LLMs and used for high-level spatial reasoning without adding new hallucinations or losing critical information.

What would settle it

A test that runs LAST-7B on the same four datasets but supplies it with no hints or with deliberately noisy hints from LAST-Box and checks whether the reported 20 percent gain vanishes.
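
A minimal harness for that test might look like the sketch below: score the same items with full, absent, and deliberately corrupted hints, and check that the gain survives only in the first condition. The accuracy helper and the toy stand-in model are hypothetical; a real run would plug in the LAST-7B checkpoint and the four benchmark datasets.

import random
from typing import Callable, Optional

def corrupt(hint: str, rng: random.Random) -> str:
    # Crude noise model: shuffle the words so spatial relations get scrambled.
    words = hint.split()
    rng.shuffle(words)
    return " ".join(words)

def accuracy(dataset, ask: Callable[[str, Optional[str]], str],
             hint_mode: str, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for item in dataset:  # each item: {"question", "hint", "answer"}
        if hint_mode == "full":
            hint = item["hint"]
        elif hint_mode == "noisy":
            hint = corrupt(item["hint"], rng)
        else:  # "none"
            hint = None
        if ask(item["question"], hint).strip() == item["answer"]:
            correct += 1
    return correct / len(dataset)

# Toy stand-in model so the harness runs end to end; a real test would call LAST-7B.
def toy_model(question: str, hint: Optional[str]) -> str:
    return "left" if hint and hint.startswith("the left") else "unknown"

data = [{"question": "Which chair is nearer?", "hint": "the left chair is nearer", "answer": "left"}]
for mode in ("full", "noisy", "none"):
    print(mode, accuracy(data, toy_model, mode))

If the hints carry the claimed load, accuracy in the "full" condition should clearly exceed both the "noisy" and "none" conditions; if all three land close together, the gain is coming from somewhere else.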

Figures

Figures reproduced from arXiv: 2604.09712 by Kun-Yang Yu, Lan-Zhe Guo, Ming Yang, Shi-Yu Tian, Yang Chen, Yu-Feng Li, Zhi Zhou, Ziqiao Shang.

Figure 1. Visualization of preliminary experimental results for problem analysis.
Figure 2. Overview of the proposed LAST-Box.
Figure 3. Illustration of the progressive training strategy.
Figure 4. Representative experimental examples of LAST-7B from CVBench, EmbSpatial, and MSMU.
Figure 5. Comparison of tool invocation behavior between …
Figure 6. The system prompt designed to guide the model in selecting and executing visual tools.
read the original abstract

Spatial reasoning is a cornerstone capability for intelligent systems to perceive and interact with the physical world. However, multimodal large language models (MLLMs) frequently suffer from hallucinations and imprecision when parsing complex geometric layouts. As data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models presents a compelling alternative. Despite its promise, applying this paradigm to spatial reasoning is hindered by two key challenges: The difficulty of invoking heterogeneous, parameter-rich tools, as well as the challenge of understanding and effectively leveraging their diverse low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features an extensible interactive sandbox, termed LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints (e.g., annotated images and textual descriptions) that can be directly consumed by LLMs. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves around 20% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LAST, a framework for tool-augmented spatial reasoning in multimodal LLMs. It introduces LAST-Box, an extensible sandbox that abstracts heterogeneous vision-tool calls (e.g., segmentation, depth) into atomic instructions and returns multimodal hints (annotated images plus textual descriptions) directly consumable by the LLM. A three-stage progressive training strategy is used to move the model from understanding tool outputs to adaptive invocation. On four datasets, LAST-7B is reported to deliver ~20% gains over its backbone and to outperform several closed-source models on complex spatial tasks.

Significance. If the reported gains prove robust, the work offers a concrete, extensible route for grounding MLLMs in mature vision tools without requiring the LLM itself to internalize low-level geometric priors. The abstraction of tool outputs into reusable multimodal hints and the staged training curriculum are practical contributions that could generalize beyond the evaluated tasks. The manuscript does not mention open-sourced code or parameter-free derivations, so reproducibility will depend on the experimental details supplied in revision.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the central performance claim (~20% gains and outperformance of closed models) is stated without any mention of the precise baselines, number of runs, error bars, dataset splits, or ablation controls. Because the entire significance rests on these empirical numbers, the absence of this information makes the claim impossible to evaluate at present.
  2. [Method (LAST-Box)] Method section (LAST-Box description): no quantitative metric is supplied that measures information preservation or hallucination rate when low-level tool outputs (masks, depth maps) are converted into the multimodal hints. The weakest assumption—that these hints can be reliably consumed without introducing new geometric errors or hallucinations—is therefore untested, yet it is load-bearing for the claim that tool augmentation improves spatial reasoning.
  3. [Method (three-stage training)] Training-strategy subsection: the three-stage progressive curriculum is presented as essential, but no ablation removing individual stages is reported. Without such controls it is unclear whether the observed gains are attributable to the staged training, to the hints themselves, or to other factors.
minor comments (2)
  1. [Method] Notation for the atomic instructions and reusable spatial skills inside LAST-Box should be defined once in a table or figure caption rather than scattered across prose.
  2. [Figures] Figure captions for the annotated-image examples should explicitly state which vision tool produced each annotation so readers can trace the hint-generation pipeline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We provide detailed responses to each major comment below and commit to revising the paper to address the raised issues.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central performance claim (~20% gains and outperformance of closed models) is stated without any mention of the precise baselines, number of runs, error bars, dataset splits, or ablation controls. Because the entire significance rests on these empirical numbers, the absence of this information makes the claim impossible to evaluate at present.

    Authors: We agree that the current presentation lacks sufficient detail for rigorous evaluation of the performance claims. In the revised version, we will clearly specify the baselines used in comparisons, indicate the number of experimental runs performed (with error bars or standard deviations if multiple runs were conducted), detail the dataset splits, and ensure ablation studies are comprehensively described. We will also update the abstract to better contextualize these results. revision: yes

  2. Referee: [Method (LAST-Box)] Method section (LAST-Box description): no quantitative metric is supplied that measures information preservation or hallucination rate when low-level tool outputs (masks, depth maps) are converted into the multimodal hints. The weakest assumption—that these hints can be reliably consumed without introducing new geometric errors or hallucinations—is therefore untested, yet it is load-bearing for the claim that tool augmentation improves spatial reasoning.

    Authors: We acknowledge this limitation in the current manuscript. Although the overall performance gains on spatial reasoning tasks provide indirect evidence of the hints' utility, we will add a quantitative evaluation of the hint generation process in the revision. This may include metrics such as the fidelity of mask annotations or depth information preservation, and an assessment of potential hallucinations in the textual descriptions accompanying the hints; see the illustrative fidelity sketch after this list. revision: yes

  3. Referee: [Method (three-stage training)] Training-strategy subsection: the three-stage progressive curriculum is presented as essential, but no ablation removing individual stages is reported. Without such controls it is unclear whether the observed gains are attributable to the staged training, to the hints themselves, or to other factors.

    Authors: This is an important point for validating the training strategy. We will conduct and report additional ablation experiments in the revised manuscript, where we train models omitting one or more stages of the curriculum and compare their performance to the full three-stage approach. This will help attribute the gains specifically to the progressive training; see the ablation-grid sketch after this list. revision: yes
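
One way the fidelity check promised in the second response could be made concrete is sketched below. This is an editorial illustration, not the authors' metric: boxes reported in a hint are matched against ground-truth boxes by IoU, and hint objects with no adequate match are counted as hallucinations.

def iou(a, b):
    # Boxes as (x1, y1, x2, y2) in pixels.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def hint_fidelity(hint_boxes, gt_boxes, thr=0.5):
    """Return (mean best IoU over hint boxes, hallucination rate)."""
    if not hint_boxes:
        return 0.0, 0.0
    best = [max((iou(h, g) for g in gt_boxes), default=0.0) for h in hint_boxes]
    halluc = sum(1 for b in best if b < thr) / len(hint_boxes)
    return sum(best) / len(best), halluc

# One hint box matches ground truth well, one matches nothing (a hallucination).
print(hint_fidelity([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 10, 10)]))

An analogous check could be run on the textual half of each hint, for example by verifying that every object or relation the text mentions corresponds to an annotated region.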
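
The stage ablation committed to in the third response could be organized as a simple configuration grid, sketched here with hypothetical stage labels: one run per curriculum variant, each dropping exactly one stage, so any gap relative to the full curriculum can be attributed to the missing stage.

from itertools import combinations

STAGES = ("understand_hints", "invoke_tools", "adaptive_use")  # hypothetical labels

def ablation_configs():
    # The full curriculum plus every variant that omits exactly one stage.
    yield STAGES
    for keep in combinations(STAGES, len(STAGES) - 1):
        yield keep

for cfg in ablation_configs():
    run_name = "last7b_" + "+".join(cfg)
    print(run_name)  # each name maps to one training run and one evaluation pass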

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

full rationale

The paper describes a tool-augmented framework (LAST with LAST-Box sandbox and three-stage training) for enhancing MLLM spatial reasoning via multimodal hints, followed by empirical evaluation on four datasets. No equations, parameter fittings, self-citations, or derivations are present that reduce any claim to its own inputs by construction. Performance gains are reported from direct experiments rather than statistical forcing or renamed patterns, so the central results rest on external benchmarks rather than on the framework's own constructs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on two new invented components (LAST-Box and the training strategy) whose effectiveness is asserted via experiments but lacks independent external validation.

axioms (1)
  • domain assumption Specialized vision models produce accurate low-level outputs (segmentation masks, depth maps) that can be turned into useful high-level hints for LLMs.
    Invoked to justify why tool integration solves hallucinations in spatial reasoning.
invented entities (2)
  • LAST-Box no independent evidence
    purpose: Extensible interactive sandbox that abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills.
    New component introduced to solve the challenge of invoking and consuming tool outputs.
  • Three-stage progressive training strategy no independent evidence
    purpose: Guides models from understanding tool outputs to proficient and adaptive tool invocation.
    New training procedure proposed to make the framework work.

pith-pipeline@v0.9.0 · 5550 in / 1397 out tokens · 47303 ms · 2026-05-10T19:22:20.955770+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

Reference graph

Works this paper leans on

53 extracted references · 27 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025).

  2. [2]

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14455–14465.

  3. [3]

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. 2024. End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).

  4. [4]

    Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, and Jieping Ye. 2025. SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models. arXiv preprint arXiv:2509.17664 (2025).

  5. [5]

    Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, and Jonathan Tremblay. 2025. SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL. arXiv preprint arXiv:2512.04069 (2025).

  6. [6]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24185–24198.

  7. [7]

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. 2024. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems 37 (2024), 135062–135093.

  8. [8]

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. 2024. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 346–355.

  9. [9]

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision. Springer, 148–166.

  10. [10]

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided language models. In Proceedings of the 40th International Conference on Machine Learning. 10764–10799.

  11. [11]

    Tanmay Gupta and Aniruddha Kembhavi. 2023. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14953–14962.

  12. [12]

    Yi Han, Cheng Chi, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Lu Sheng, and Shanghang Zhang. 2025. TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics. arXiv preprint arXiv:2510.07181 (2025).

  13. [13]

    Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould

  14. [14]

    VLN BERT: A recurrent vision-and-language BERT for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1643–1653.

  15. [15]

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. 2024. Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems 37 (2024), 139348–139379.

  16. [16]

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. 2023. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4015–4026.

  17. [17]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024).

  18. [18]

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. 2025. ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models. arXiv preprint arXiv:2505.21500 (2025).

  19. [19]

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. 2025. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531 (2025).

  20. [20]

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. 2025. STI-Bench: Are MLLMs ready for precise spatial-temporal world understanding? arXiv preprint arXiv:2503.23765 (2025).

  21. [21]

    Zenan Li, Zhi Zhou, Yuan Yao, Yu-Feng Li, Chun Cao, Fan Yang, Xian Zhang, and Xiaoxing Ma. 2024. Neuro-Symbolic Data Generation for Math Reasoning. arXiv preprint arXiv:2412.04857 (2024).

  22. [22]

    Fangyu Liu, Guy Emerson, and Nigel Collier. 2023. Visual spatial reasoning. Transactions of the Association for Computational Linguistics 11 (2023), 635–651.

  23. [23]

    Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, and Huaxiu Yao. 2025. Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning. arXiv preprint arXiv:2511.19900 (2025).

  24. [24]

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. 2024. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision. Springer, 38–55.

  25. [25]

    Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. 2025. Aligning cyber space with physical world: A comprehensive survey on embodied AI. IEEE/ASME Transactions on Mechatronics (2025).

  26. [26]

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al

  27. [27]

    WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021).

  28. [28]

    Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. 2023. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295 (2023).

  29. [29]

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Xuanhe Zhou, Yufei Huang, Chaojun Xiao, et al. 2024. Tool learning with foundation models. Comput. Surveys 57, 4 (2024), 1–40.

  30. [30]

    Ziqiao Shang, Lingyue Ge, Yang Chen, Shi-Yu Tian, Zhenyu Huang, Wenbo Fu, Yu-Feng Li, and Lan-Zhe Guo. 2026. MapTab: Can MLLMs Master Constrained Route Planning? arXiv preprint arXiv:2602.18600 (2026).

  31. [31]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024).

  32. [32]

    Shiyu Tian, Hongxin Wei, Yiqun Wang, and Lei Feng. 2024. CroSel: Cross selection of confident pseudo labels for partial-label learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19479–19488.

  33. [33]

    Shi-Yu Tian, Zhi Zhou, Wei Dong, Kun-Yang Yu, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, and Yu-Feng Li. 2025. TabularMath: Understanding Math Reasoning over Tables with Large Language Models. arXiv preprint arXiv:2505.19563 (2025).

  34. [34]

    Shi-Yu Tian, Zhi Zhou, Kun-Yang Yu, Ming Yang, Lin-Han Jia, Lan-Zhe Guo, and Yu-Feng Li. 2025. VCSearch: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 12721–12742.

  35. [35]

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. 2024. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems 37 (2024), 87310–87356.

  36. [36]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5294–5306.

  37. [37]

    Rong Wang, Kun Sun, and Jonas Kuhn. 2024. DSPy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs. arXiv preprint arXiv:2411.18564 (2024).

  38. [38]

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. 2025. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965 (2025).

  39. [39]

    Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. 2024. PointLLM: Empowering large language models to understand point clouds. In European Conference on Computer Vision. Springer, 131–147.

  40. [40]

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. 2025. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference. 10632–10643.

  41. [41]

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. 2024. Depth Anything V2. arXiv:2406.09414 [cs.CV]

  42. [42]

    Ming Yang, Zhi Zhou, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, and Yu-Feng Li. 2026. NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing. arXiv preprint arXiv:2603.16307 (2026).

  43. [43]

    Zhun Yang, Adam Ishay, and Joohyung Lee. 2023. Coupling large language models with logic programming for robust and general reasoning from text. arXiv preprint arXiv:2307.07696 (2023).

  44. [44]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations.

  45. [45]

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. 2024. RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics. arXiv preprint arXiv:2406.10721 (2024).

  46. [46]

    Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, and Guilin Liu. 2025. Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning. arXiv preprint arXiv:2505.00024 (2025).

  47. [47]

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. 2024. LLaVA-NeXT: A Strong Zero-shot Video Understanding Model.

  48. [48]

    Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. 2025. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630 (2025).

  49. [49]

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. 2025. SWIFT: A scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 29733–29735.

  50. [50]

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. 2025. DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning. arXiv preprint arXiv:2505.14362 (2025).

  51. [51]

    Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. 2025. Reinforced visual perception with tools. arXiv preprint arXiv:2509.01656 (2025).

  52. [52]

    Zhi Zhou, Kun-Yang Yu, Shi-Yu Tian, Xiao-Wen Yang, Jiang-Xin Shi, Pengxiao Song, Yi-Xuan Jin, Lan-Zhe Guo, and Yu-Feng Li. 2025. LawGPT: Knowledge-guided data generation and its application to legal LLM. arXiv preprint arXiv:2502.06572 (2025).

  53. [53]

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025).