pith. sign in

arxiv: 2606.11770 · v1 · pith:2NBL3WW5new · submitted 2026-06-10 · 💻 cs.AI

SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

Pith reviewed 2026-06-27 09:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords spatial reasoningmultimodal large language modelsreinforcement learningstate transitionsvisualization of thoughtgroup relative policy optimizationmulti-hop inferenceintermediate state verification
0
0 comments X

The pith

SVoT makes intermediate states and transitions explicit and verifiable so multimodal models can perform reliable multi-hop spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that current multimodal models fail at spatial reasoning because they leave intermediate states unverified and treat transitions as hidden steps. It proposes SVoT, a reinforcement learning method that forces the model to output interleaved text and images representing each state, then scores those outputs with transition-aware rewards. The authors create five new domains, including Pacman and Gather, that require multi-object interactions and numerical updates instead of the single-variable changes used in prior benchmarks. If the approach works, models should reach state-of-the-art accuracy and show large gains on out-of-distribution tests because verification replaces implicit guessing.

Core claim

SVoT integrates transition reasoning chains into generation, trains via Group Relative Policy Optimization with rewards that verify action preconditions and effects, and reaches state-of-the-art performance across the new domains with up to 65 percent absolute accuracy gain on out-of-distribution sets.

What carries the argument

SVoT framework that generates interleaved verifiable textual and visual states and applies transition-aware supervision through GRPO reward design.

If this is right

  • Models explicitly check preconditions and effects of each action instead of relying on implicit transitions.
  • Quantitative verification of generated states becomes possible in environments with multiple objects and numerical reasoning.
  • Different fine-grained reward designs can be compared for their effect on verification quality.
  • Performance gains appear most clearly on out-of-distribution test sets that require multi-hop reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same explicit-state generation pattern could be applied to other multi-hop tasks such as planning or causal inference.
  • Forcing visual outputs at each step may reduce hallucinations in any visual reasoning pipeline that currently skips state checks.
  • If the reward structure generalizes, similar RL supervision could enforce consistency in non-spatial reasoning chains.

Load-bearing premise

The new domains and reward signals truly capture multi-hop state transitions without the single-variable simplifications the authors criticize in existing benchmarks.

What would settle it

SVoT shows no accuracy improvement over baselines or produces states that cannot be quantitatively verified when tested on the Pacman and Gather out-of-distribution sets.

Figures

Figures reproduced from arXiv: 2606.11770 by Chao Lei, Krista A. Ehinger, Markus Hiller, Nir Lipovetzky, Xunye Tian, Yanbei Jiang, Zhijian Zhou.

Figure 1
Figure 1. Figure 1: Illustration of the five domains used in SVoT. Coordinates are (row, column), starting from [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Examples of the CoT (transition reasoning chain) in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The architectures of MVoT and SVoT. To unify reasoning and generation for se￾quential state tracking, SVoT augments the task specification X with an initial state de￾scription s0, resulting in X = ⟨d, s0, v0, A⟩. SVoT represents each intermediate state as zi = ⟨ai , si⟩. In addition, SVoT produces a corresponding visualization vi for each zi . Crucially, the MLLM is instructed to generate a transition reas… view at source ↗
Figure 4
Figure 4. Figure 4: GRPO training in SVoT with PRM. Left: sampling CoT with intermediate state and CoT [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example visualizations generated in Table [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Reward curves of SVoTp and SVoTo on the SOKOBAN and GATHER validation sets for single-step prediction (876 and 742 samples, respec￾tively). Shaded regions indicate standard deviation. Reward Analysis [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization examples generated in Table 1 for M [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes State-aware Visualization-of-Thought (SVoT), an RL framework that uses Group Relative Policy Optimization (GRPO) to train MLLMs to generate interleaved, verifiable textual and visual intermediate states for multi-hop spatial reasoning. It introduces five domains (extending classical environments and adding Pacman and Gather) to enable quantitative verification of state transitions, which existing benchmarks simplify. The central claim is that transition-aware supervision yields SOTA results with up to 65% absolute accuracy gain on OOD test sets.

Significance. If the empirical results hold under rigorous evaluation, the work could advance reliable spatial reasoning in MLLMs by explicitly modeling and rewarding state transitions rather than treating them as implicit. The creation of new domains that support verifiable multi-object and numerical reasoning is a constructive contribution to benchmark design. Credit is due for grounding the approach in explicit reward components for transition verification.

major comments (1)
  1. [Abstract] Abstract: the central claim of 'up to a 65% absolute accuracy gain on out-of-distribution test sets' and 'state-of-the-art performance' is presented without any description of baselines, number of trials, error bars, statistical tests, or how the OOD splits and verification procedures are implemented. This information is load-bearing for assessing whether the data supports the claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for greater transparency in the abstract regarding our empirical claims. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'up to a 65% absolute accuracy gain on out-of-distribution test sets' and 'state-of-the-art performance' is presented without any description of baselines, number of trials, error bars, statistical tests, or how the OOD splits and verification procedures are implemented. This information is load-bearing for assessing whether the data supports the claim.

    Authors: We agree that the abstract, constrained by length, omits these details and that this weakens the standalone presentation of the central claims. The full manuscript describes the baselines in Section 4.2, the OOD split construction and verification procedures in Sections 3.3 and 4.1, and reports results across multiple random seeds with standard deviations in Table 2 and the appendix; statistical significance is assessed via paired t-tests where appropriate. To address the concern directly, we will revise the abstract to include a concise statement of the primary baselines, the number of evaluation domains, and the OOD protocol. The reported 65% figure represents the largest absolute gain observed across the five domains; per-domain breakdowns and variance are provided in the experimental results. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and summary describe an empirical RL framework (SVoT with GRPO) that introduces new domains (Pacman, Gather) for multi-hop spatial reasoning evaluation and reports accuracy gains on those domains. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations are present. The central claims are performance outcomes on custom benchmarks rather than first-principles results that reduce to inputs by construction. The paper is self-contained against external benchmarks with no detectable circular steps.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields limited visibility into parameters and assumptions; no specific fitted values or invented entities are named.

free parameters (1)
  • GRPO reward weights and components
    The paper states it evaluates different fine-grained rewards but supplies no numerical values or fitting procedure in the abstract.
axioms (1)
  • domain assumption Reinforcement learning with designed rewards can enforce verifiable state transitions in MLLMs
    The framework rests on the premise that reward signals derived from action preconditions and effects will produce reliable multi-hop reasoning.

pith-pipeline@v0.9.1-grok · 5772 in / 1271 out tokens · 22990 ms · 2026-06-27T09:57:33.817252+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 12 linked inside Pith

  1. [1]

    Vsp: Diagnosing the dual challenges of perception and reasoning in spatial planning tasks for mllms

    Qiucheng Wu, Handong Zhao, Michael Saxon, Trung Bui, William Yang Wang, Yang Zhang, and Shiyu Chang. Vsp: Diagnosing the dual challenges of perception and reasoning in spatial planning tasks for mllms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2270–2280, 2025

  2. [2]

    Why do mllms struggle with spatial understanding? a systematic analysis from data to architecture.arXiv preprint arXiv:2509.02359, 2025

    Wanyue Zhang, Yibin Huang, Yangbin Xu, JingJing Huang, Helu Zhi, Shuo Ren, Wang Xu, and Jiajun Zhang. Why do mllms struggle with spatial understanding? a systematic analysis from data to architecture.arXiv preprint arXiv:2509.02359, 2025

  3. [3]

    Navbench: Probing multimodal large language models for embodied navigation

    Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  4. [4]

    Flame: Learning to navigate with multimodal llm in urban environments

    Yunzhe Xu, Yiyuan Pan, Zhe Liu, and Hesheng Wang. Flame: Learning to navigate with multimodal llm in urban environments. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9005–9013, 2025

  5. [5]

    Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning.arXiv preprint arXiv:2311.17842, 2023

    Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning.arXiv preprint arXiv:2311.17842, 2023

  6. [6]

    Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

  7. [7]

    Dolphins: Multimodal language model for driving

    Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. InEuropean Conference on Computer Vision, pages 403–420. Springer, 2024

  8. [8]

    Mind’s eye of LLMs: Visualization-of-thought elicits spatial reasoning in large language models

    Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of LLMs: Visualization-of-thought elicits spatial reasoning in large language models. InProceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  9. [9]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  10. [10]

    Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  11. [11]

    Is a picture worth a thousand words? delving into spatial reasoning for vision language models

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Sharon Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models. 37:75392–75421, 2024

  12. [12]

    TopViewRS: Vision-language models as top-view spatial reasoners

    Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vuli ´c. TopViewRS: Vision-language models as top-view spatial reasoners. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1786–1807, 2024

  13. [13]

    Imagine while reasoning in space: Multimodal visualization-of-thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli ´c, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. In Proceedings of the Forty-second International Conference on Machine Learning, 2025

  14. [14]

    A config- urable library for generating and manipulating maze datasets.arXiv preprint arXiv:2309.10498, 2023

    Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, et al. A config- urable library for generating and manipulating maze datasets.arXiv preprint arXiv:2309.10498, 2023. 10

  15. [15]

    Openai gym.arXiv preprint arXiv:1606.01540, 2016

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym.arXiv preprint arXiv:1606.01540, 2016

  16. [16]

    Tiling with polyominoes.Journal of Combinatorial Theory, 1(2):280–296, 1966

    Solomon W Golomb. Tiling with polyominoes.Journal of Combinatorial Theory, 1(2):280–296, 1966

  17. [17]

    Multi-step prediction of occupancy grid maps with recurrent neural networks

    Nima Mohajerin and Mohsen Rohani. Multi-step prediction of occupancy grid maps with recurrent neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10600–10608, 2019

  18. [18]

    Hybrid map-based path planning for robot navigation in unstructured environments

    Jiayang Liu, Xieyuanli Chen, Junhao Xiao, Sichao Lin, Zhiqiang Zheng, and Huimin Lu. Hybrid map-based path planning for robot navigation in unstructured environments. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2216–2223. IEEE, 2023

  19. [19]

    Multi-agent pathfinding: Definitions, variants, and benchmarks

    Roni Stern, Nathan Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne Walker, Jiaoyang Li, Dor Atzmon, Liron Cohen, TK Kumar, et al. Multi-agent pathfinding: Definitions, variants, and benchmarks. InProceedings of the international symposium on combinatorial search, volume 10, pages 151–158, 2019

  20. [20]

    Research challenges and opportunities in multi-agent path finding and multi-agent pickup and delivery problems

    Oren Salzman and Roni Stern. Research challenges and opportunities in multi-agent path finding and multi-agent pickup and delivery problems. InProceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 1711–1715, 2020

  21. [21]

    Mental rotation of three-dimensional objects.Science, 171(3972):701–703, 1971

    Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects.Science, 171(3972):701–703, 1971

  22. [22]

    Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

    Edward C Tolman. Cognitive maps in rats and men.Psychological review, 55(4):189, 1948

  23. [23]

    Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation.arXiv preprint arXiv:2407.06135, 2024

    Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation.arXiv preprint arXiv:2407.06135, 2024

  24. [24]

    Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  25. [25]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  26. [26]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In Proceedings of the Twelfth International Conference on Learning Representations, 2024

  27. [27]

    Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:2211.14275, 2022

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:2211.14275, 2022

  28. [28]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations, 2022

  29. [29]

    Trl: Transformer reinforce- ment learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforce- ment learning, 2020

  30. [30]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the Seventh International Conference on Learning Representations, 2019

  31. [31]

    Hello gpt-4o.OpenAI, 2024

    OpenAI. Hello gpt-4o.OpenAI, 2024. URL https://www.openai.com/index/ hello-gpt-4o. Accessed: 2026-04-18. 11

  32. [32]

    CounterCurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples

    Jianrui Zhang, Mu Cai, Tengyang Xie, and Yong Jae Lee. CounterCurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples. InFindings of the Association for Computational Linguistics: ACL 2024, pages 15481–15495, Bangkok, Thailand, 2024. Association for Computational Linguistics

  33. [33]

    Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024

  34. [34]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In Proceedings of Thirty-Eighth Advances in Neural Information Processing Systems, volume 37, 2024

  35. [35]

    Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. InProceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  36. [36]

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3. 5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025

  37. [37]

    Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

  38. [38]

    Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  39. [39]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  40. [40]

    Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  41. [41]

    Propa: Toward process-level optimization in visual reasoning via reinforcement learning.arXiv preprint arXiv:2511.10279, 2025

    Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, and Jey Han Lau. Propa: Toward process-level optimization in visual reasoning via reinforcement learning.arXiv preprint arXiv:2511.10279, 2025

  42. [42]

    Open- vlthinker: An early exploration to complex vision-language reasoning via iterative self- improvement.arXiv preprint arXiv:2503.17352, 2025

    Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Open- vlthinker: An early exploration to complex vision-language reasoning via iterative self- improvement.arXiv preprint arXiv:2503.17352, 2025

  43. [43]

    Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025

    Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli´c. Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025

  44. [44]

    R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1859–1869, 2025

  45. [45]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290, 2025

  46. [46]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. InProceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  47. [47]

    Pddlgym: Gym environments from pddl problems.arXiv preprint arXiv:2002.06432, 2020

    Tom Silver and Rohan Chitnis. Pddlgym: Gym environments from pddl problems.arXiv preprint arXiv:2002.06432, 2020. 12

  48. [48]

    Springer, 2019

    Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, Christian Muise, Ronald Brachman, Francesca Rossi, and Peter Stone.An introduction to the planning domain definition lan- guage, volume 13. Springer, 2019. 13 A Appendix A.1 Related Work Static Spatial Reasoning. Despite significant advancements in Large Language Models (LLMs), Vision-Language Models (VLMs...

  49. [49]

    slippery

    introduces a relative-depth injection module and region proposals to ground objects in 3D space, while Spatial-MLLM [35] leverages a dual-encoder architecture to recover implicit structural priors from 2D inputs without external depth data. However, these approaches primarily target instantaneous spatial understanding, localizing objects and inferring rel...