
arxiv: 2606.02459 · v1 · pith:67FHEKCXnew · submitted 2026-06-01 · 💻 cs.CV

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

Pith reviewed 2026-06-28 15:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial reasoning · vision-language models · cognitive map · reinforcement learning · agentic models · MindCube benchmark · dense rewards

The pith

Vision-language models using a dynamic cognitive map and spatial assertion codes reach 80.5 percent accuracy on the MindCube benchmark by verifying intermediate reasoning steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an agentic pipeline that lets vision-language models actively explore scenes rather than observe them passively. It introduces a dynamic cognitive map that stores object positions and orientations as persistent memory across observations. Spatial Assertion Codes, expressed as Python expressions, describe spatial relations and allow the model's intermediate steps to be checked against the map. These checks generate dense reward signals that guide both supervised fine-tuning and reinforcement learning. The resulting system reports 80.5 percent overall accuracy on MindCube and a 29.5-point gain on the Rotation subset over the best prior method.

Core claim

A dynamic cognitive map that parameterizes scene layout by object positions and orientations, together with Spatial Assertion Codes expressed as Python expressions, verifies intermediate spatial reasoning steps and supplies dense reward signals during supervised and reinforcement finetuning, enabling state-of-the-art performance on spatial reasoning tasks.

What carries the argument

The dynamic cognitive map paired with Spatial Assertion Codes (SAC), which together maintain persistent scene memory and generate verifiable assertions for dense rewards.
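
To make the mechanism concrete, here is a minimal sketch of how a cognitive map and an assertion check could fit together. It is not the authors' code: the `CognitiveMap` class, the `in_front_of` and `left_of` helpers, the 2D coordinate convention, and the use of a restricted `eval` are all assumptions made for illustration. The example mirrors the Figure 5 case study, in which the assertions state that v2 is in front of v1 and also to its left.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entity:
    x: float        # position on the ground plane (hypothetical units)
    y: float
    heading: float  # facing direction in radians, 0 = +x axis, counter-clockwise


@dataclass
class CognitiveMap:
    """Hypothetical persistent memory: entity name -> position and orientation."""
    entities: dict = field(default_factory=dict)

    def update(self, name, x, y, heading):
        # New observations add or overwrite entries; the map is never reset.
        self.entities[name] = Entity(x, y, heading)

    def _offset_in_frame(self, target, ref):
        """Offset of `target` expressed in `ref`'s frame as (forward, left)."""
        t, r = self.entities[target], self.entities[ref]
        dx, dy = t.x - r.x, t.y - r.y
        forward = dx * math.cos(r.heading) + dy * math.sin(r.heading)
        left = -dx * math.sin(r.heading) + dy * math.cos(r.heading)
        return forward, left

    def in_front_of(self, target, ref):
        return self._offset_in_frame(target, ref)[0] > 0

    def left_of(self, target, ref):
        return self._offset_in_frame(target, ref)[1] > 0


def check_assertion(expr, cmap):
    """Evaluate one assertion string against the map, with builtins disabled."""
    namespace = {"in_front_of": cmap.in_front_of, "left_of": cmap.left_of}
    return bool(eval(expr, {"__builtins__": {}}, namespace))


cmap = CognitiveMap()
cmap.update("v1", 0.0, 0.0, math.pi / 2)   # v1 at the origin, facing +y
cmap.update("v2", -1.0, 1.0, math.pi / 2)  # v2 ahead of v1 and to its left

# Assertions in the spirit of the Figure 5 case study (syntax is assumed).
steps = ['in_front_of("v2", "v1")', 'left_of("v2", "v1")']
results = [check_assertion(s, cmap) for s in steps]
```

Each entry in `results` is a per-step verdict, which is the kind of signal a dense reward could be built from. Disabling builtins in `eval` is only a gesture at sandboxing; a real pipeline would need a stricter evaluator.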

If this is right

  • The model outperforms the previous best method by 29.5 accuracy points on the Rotation subset.
  • Dense rewards from step verification improve results on complex multi-step spatial tasks where sparse rewards previously limited progress.
  • Treating VLMs as active agents rather than passive observers enables better handling of real-world spatial queries that require exploration.
  • The combination of persistent memory and programmatic assertions scales to new observations without resetting the scene representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verification mechanism could be tested on embodied navigation benchmarks that require physical movement rather than static image reasoning.
  • If SAC expressions generalize beyond spatial relations, similar assertion-based rewards might apply to other VLM reasoning domains such as temporal or causal inference.
  • Open-sourcing the code allows direct measurement of how much the cognitive map contributes when the model faces novel object configurations not seen in MindCube.

Load-bearing premise

The dynamic cognitive map and SAC together produce reliable verification of intermediate steps that actually supply effective dense reward signals during finetuning.

What would settle it

Removing either the dynamic cognitive map or the SAC component and observing no drop in accuracy on the Rotation subset of MindCube would falsify the claim that these elements drive the reported performance gains.

Figures

Figures reproduced from arXiv: 2606.02459 by Mengshi Qi, Wei Deng, Xianlin Zhang.

Figure 1: Illustration of the active exploring like a pigeon. (Left) The pigeon can build a cognitive map from observations in mind. (Right) The cognitive map guides the pigeon to navigation.
Figure 2: (Top) Question Q, view transformation relationships E are given. (A) There are multiple view images V = {v_n}_{n=1}^{N} provided to be perceived by the VLM. (B) Our VLM outputs SAC alongside the natural language reasoning when performing spatial reasoning. (C) We propose the dynamic cognitive map that stores observations, recalls memory by the VLM, and computes dense rewards collaborating with SAC. (Bottom) Our…
Figure 3: Two-stage training process of our model. (Top) Supervised Finetuning. We synthesize dataset with aspects of related view retrieval, dynamic cognitive map updating, and spatial reasoning with SAC for supervised finetuning. (Bottom) Reinforcement Finetuning. We define a reward function that measures the retrieval relatedness, cognitive map correctness, and spatial reasoning correctness for reinforcement fine…
Figure 4: Pass@k accuracy curves demonstrating the contributions of SFT and RFT stages (left) and comparison of the adaptive and greedy retrieval strategies (right).
Figure 5: A case study visualizing our method's active exploration process. Our model retrieves two views and builds a dynamic cognitive map that parameterizes the layout of the scene. During reasoning, the model outputs SAC that states v2 is in front of v1, meanwhile to the left of v1. Therefore, the correct answer "B. Diagonally forward and left" is inferred.
Figure 6: Fine-grained reward analysis (left): removing Y from retrieval supervision in R_retrieval (Retrieval w/o Y), using only the outcome term 1_correct as reward (0/1), dropping the gating factor 1_correct in Equation (7) (Ungated), and our full reward (Full). Performance gains breakdown (right): passive RL-only (P+RL), active RL-only (A+RL), passive with SFT followed by RL (P+SFT+RL), and the full active pipelin…
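
Figure 4 reports pass@k curves. The captions here do not say how pass@k was computed, so the following is only a sketch of one common choice: the unbiased estimator from Chen et al. (2021), which appears in the reference list. Whether the paper uses this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled answers, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct answer.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only, not results from the paper.
curve = [pass_at_k(8, 2, k) for k in (1, 2, 4, 8)]
```

With 8 samples of which 2 are correct, the estimate is 0.25 at k = 1 and reaches 1.0 at k = 8, so the curve is non-decreasing in k by construction.
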
Original abstract

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by pigeons' building and exploiting cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new dynamic cognitive map parameterizing scene layout as object positions and orientations, serving as persistent memory for new observations. Second, we propose a novel Spatial Assertion Codes (SAC), Python expressions programmatically describing spatial relationships. By collaborating with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
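
The abstract gives the Rotation gain both in absolute points and as a relative improvement, but states neither Rotation accuracy directly. Under the assumption that the relative figure is the absolute gain divided by the baseline accuracy, the two numbers pin down both values. The short calculation below is a derived estimate, not a figure reported in the paper.

```python
gain_points = 29.5      # absolute gain on Rotation, from the abstract
relative_gain = 0.532   # 53.2% relative improvement, from the abstract

# Assumption: relative_gain = gain_points / baseline_accuracy.
baseline = gain_points / relative_gain
proposed = baseline + gain_points

print(f"implied best prior Rotation accuracy: {baseline:.1f}%")
print(f"implied Rotation accuracy of this method: {proposed:.1f}%")
```

The implied values are roughly 55.5% for the best prior method and 85.0% for the proposed one, up to rounding in the two reported figures.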

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an agentic pipeline for spatial reasoning in VLMs, inspired by pigeons' cognitive maps. It introduces a dynamic cognitive map that represents scene layout (object positions and orientations) as persistent memory, and Spatial Assertion Codes (SAC), Python expressions that describe spatial relationships. Working together, the two verify intermediate reasoning steps and generate dense rewards during supervised and reinforcement finetuning. On the MindCube benchmark, the approach reports state-of-the-art overall accuracy of 80.5%, with a 29.5-point (53.2% relative) gain on the challenging Rotation subset over the best prior method. Code and data are open-sourced.

Significance. If the reported gains are shown to stem from the SAC-derived dense rewards and dynamic cognitive map enabling verifiable intermediate steps, the work would be significant for advancing RL-based finetuning of VLMs on complex spatial tasks, addressing the limitations of sparse rewards and passive observation. The open-sourcing strengthens reproducibility.

major comments (2)
  1. [Experiments] Experiments section: The central claim requires that the dynamic cognitive map + SAC pipeline supplies verifiable intermediate steps yielding effective dense rewards, producing the 80.5% overall / 29.5-point Rotation gain. No component ablations, reward histograms, step-verification accuracy metrics, or error bars are reported to show that removing SAC collapses performance or that SAC expressions succeed at scale; the margin could arise from base-model differences, data curation, or prompt engineering instead.
  2. [Method] Method section: The description of how SAC Python expressions programmatically describe spatial relationships and collaborate with the dynamic cognitive map for verification is high-level only. Without concrete examples of SAC expressions, verification logic, or how they integrate into the RL reward function, it is impossible to assess whether they actually provide the claimed dense signals or are load-bearing for the finetuning pipeline.
minor comments (1)
  1. [Abstract] The abstract and method overview would benefit from a brief table or figure summarizing the MindCube dataset splits and baseline methods for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional evidence and detail would strengthen the paper. We address each major comment below and commit to revisions that directly respond to the concerns raised.

  1. Referee: [Experiments] Experiments section: The central claim requires that the dynamic cognitive map + SAC pipeline supplies verifiable intermediate steps yielding effective dense rewards, producing the 80.5% overall / 29.5-point Rotation gain. No component ablations, reward histograms, step-verification accuracy metrics, or error bars are reported to show that removing SAC collapses performance or that SAC expressions succeed at scale; the margin could arise from base-model differences, data curation, or prompt engineering instead.

    Authors: We agree that the manuscript would benefit from explicit ablations and supporting metrics to isolate the contributions of the dynamic cognitive map and SAC. In the revised version we will add component ablations (full pipeline vs. without SAC, without dynamic map, and without both), reward histograms comparing dense vs. sparse settings, step-verification accuracy on held-out examples, and error bars from multiple random seeds. These additions will directly test whether performance collapses without the proposed mechanisms. revision: yes

  2. Referee: [Method] Method section: The description of how SAC Python expressions programmatically describe spatial relationships and collaborate with the dynamic cognitive map for verification is high-level only. Without concrete examples of SAC expressions, verification logic, or how they integrate into the RL reward function, it is impossible to assess whether they actually provide the claimed dense signals or are load-bearing for the finetuning pipeline.

    Authors: We acknowledge the Method section remains at a high level. The revised manuscript will include (1) multiple concrete SAC Python expression examples for common spatial relations (e.g., relative position, orientation, containment), (2) pseudocode for the verification procedure that queries the dynamic cognitive map, and (3) the exact mapping from verification outcomes (true/false/unknown) to the per-step dense reward term used in both supervised and RL stages. revision: yes
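
Item (3) of the second response could take a form like the sketch below. It is not the paper's Equation (7), which is not reproduced on this page: the per-outcome scores, the averaging over steps, and the `step_weight` parameter are all hypothetical. The only details drawn from the source are the true/false/unknown outcomes named in the response and the Figure 6 caption's description of a full reward gated by final-answer correctness, an ungated variant, and a 0/1 outcome-only variant.

```python
from enum import Enum


class Outcome(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"  # e.g. the assertion names an entity not yet in the map


# Hypothetical per-step scores; the paper's actual values are not given here.
STEP_SCORE = {Outcome.TRUE: 1.0, Outcome.FALSE: 0.0, Outcome.UNKNOWN: 0.0}


def reward(outcomes, answer_correct, mode="gated", step_weight=0.5):
    """Combine a sparse outcome term with a dense per-step term.

    mode="0/1"     : outcome term only (sparse reward)
    mode="ungated" : dense term is added regardless of the final answer
    mode="gated"   : dense term counts only when the final answer is correct
    """
    if mode not in {"gated", "ungated", "0/1"}:
        raise ValueError(f"unknown mode: {mode}")
    outcome_term = 1.0 if answer_correct else 0.0
    if mode == "0/1" or not outcomes:
        return outcome_term
    dense = sum(STEP_SCORE[o] for o in outcomes) / len(outcomes)
    if mode == "gated":
        dense *= outcome_term
    return outcome_term + step_weight * dense


# Illustrative rollout: two verified steps, one failed step.
steps = [Outcome.TRUE, Outcome.TRUE, Outcome.FALSE]
```

Under this sketch, gating prevents a rollout from collecting step reward for verifiable assertions that lead to a wrong final answer, which is one plausible reason to include such a factor. The paper's own motivation is not stated here.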

Circularity Check

0 steps flagged

No circularity; novel components introduced without reducing claims to fitted inputs or self-citations


The paper presents a dynamic cognitive map and Spatial Assertion Codes (SAC) as newly introduced mechanisms that collaborate to verify intermediate spatial reasoning steps and supply dense rewards during supervised and reinforcement finetuning of VLMs. No equations, fitted parameters, or self-citations appear in the provided text that would make the claimed 80.5% accuracy or 29.5-point Rotation gain equivalent to the inputs by construction. The derivation relies on standard RL/VLM finetuning pipelines augmented by these independent components, with benchmark results treated as empirical outcomes rather than tautological predictions. This is the most common honest finding for papers that add new architectural elements without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review; full details on parameters, assumptions, and evidence are unavailable. The two main additions are the newly introduced entities listed below.

axioms (1)
  • [standard math] Standard VLM finetuning and RL assumptions hold without modification.
    The abstract invokes supervised and reinforcement finetuning without stating deviations from common practice.
invented entities (2)
  • dynamic cognitive map [no independent evidence]
    purpose: Persistent memory storing scene layout as object positions and orientations
    Introduced as a new parameterization serving as memory for observations.
  • Spatial Assertion Codes (SAC) [no independent evidence]
    purpose: Python expressions that programmatically describe spatial relationships for step verification and dense rewards
    Proposed as a novel mechanism collaborating with the cognitive map.

pith-pipeline@v0.9.1-grok · 5754 in / 1331 out tokens · 31316 ms · 2026-06-28T15:14:44.050012+00:00 · methodology


Reference graph

Works this paper leans on

103 extracted references · 9 canonical work pages · 7 internal anchors

  1. [1]

    Claude 3.5 sonnet

    Anthropic. Claude 3.5 sonnet. Blog, October 2024. Accessed: November 22, 2024

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., et al. Qwen2.5-vl technical report. arXiv preprint: 2502.13923, 2025

  3. [3]

    Bingman, V., Jechura, T., and Kahn, M. C. Behavioral and neural mechanisms of homing and migration in birds. Animal Spatial Cognition: Comparative, Neural, and Computational Approaches, [On-line]. Available: pigeon.psy.tufts.edu/asc/Bingman/Default.htm, 2006

  4. [4]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Chen, B., Xu, Z., Kirmani, S., et al. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 14455--14465, June 2024a

  5. [5]

    Evaluating Large Language Models Trained on Code

    Chen, M., Tworek, J., Jun, H., et al. Evaluating large language models trained on code. arXiv preprint: 2107.03374, 2021

  6. [6]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Chen, Z., Wu, J., Wang, W., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In IEEE Conf. Comput. Vis. Pattern Recog., June 2024b

  7. [7]

    Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?

    Chen, Z., Lu, R., Zhao, A., et al. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In Adv. Neural Inform. Process. Syst., volume 38, pp. 57654--57689, 2025

  8. [8]

    Think with 3d: Geometric imagination grounded spatial reasoning from limited views

    Chen, Z., Zhang, M., Yu, X., et al. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. In IEEE Conf. Comput. Vis. Pattern Recog., June 2026

  9. [9]

    Global-local tree search in vlms for 3d indoor scene generation

    Deng, W., Qi, M., and Ma, H. Global-local tree search in vlms for 3d indoor scene generation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 8975--8984, June 2025

  10. [10]

    Scene-llm: Extending language model for 3d visual reasoning

    Fu, R., Liu, J., Chen, X., et al. Scene-llm: Extending language model for 3d visual reasoning. In IEEE Win. Conf. on App. of Comput. Vis., 2025

  11. [11]

    A survey on llm-as-a-judge

    Gu, J., Jiang, X., Shi, Z., et al. A survey on llm-as-a-judge. The Innovation, pp. 101253, 2026. ISSN 2666-6758

  12. [12]

    Pigeon homing along highways and exits

    Lipp, H.-P., Vyssotski, A. L., Wolfer, D. P., et al. Pigeon homing along highways and exits. Current Biology, 14(14): 1239--1249, 2004. ISSN 0960-9822

  13. [13]

    Sgformer: Semantic graph transformer for point cloud-based 3d scene graph generation

    Lv, C., Qi, M., Li, X., et al. Sgformer: Semantic graph transformer for point cloud-based 3d scene graph generation. AAAI, 38(5): 4035--4043, 2024

  14. [14]

    T2sg: Traffic topology scene graph for topology reasoning in autonomous driving

    Lv, C., Qi, M., Liu, L., et al. T2sg: Traffic topology scene graph for topology reasoning in autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 17197--17206, June 2025

  15. [16]

    GPT-4o System Card

    OpenAI. Gpt-4o system card. arXiv preprint: 2410.21276, 2024

  16. [17]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Ouyang, K., Liu, Y., Wu, H., et al. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint: 2504.01805, 2025

  17. [18]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., et al. Training language models to follow instructions with human feedback. In Adv. Neural Inform. Process. Syst., volume 35, 2022

  18. [19]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration

    O’Neill, A., Rehman, A., Maddukuri, A., et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In IEEE Int. Conf. on Robot. and Auto., pp. 6892--6903, 2024

  19. [20]

    Skywork r1v: Pioneering multimodal reasoning with chain-of-thought

    Peng, Y., Wang, P., Wang, X., et al. Skywork r1v: Pioneering multimodal reasoning with chain-of-thought. arXiv preprint: 2504.05599, 2025

  20. [21]

    Action quality assessment via hierarchical pose-guided multi-stage contrastive regression

    Qi, M., Ye, H., Peng, J., et al. Action quality assessment via hierarchical pose-guided multi-stage contrastive regression. IEEE Trans. Image Process., 34: 6461--6474, 2025

  21. [22]

    Robust disentangled counterfactual learning for physical audiovisual commonsense reasoning

    Qi, M., Lv, C., and Ma, H. Robust disentangled counterfactual learning for physical audiovisual commonsense reasoning. IEEE Trans. Pattern Anal. and Mach. Intell., 48(3): 2514--2527, 2026a

  22. [23]

    Dc-sam: In-context segment anything in images and videos via dual consistency

    Qi, M., Zhu, P., Li, X., et al. Dc-sam: In-context segment anything in images and videos via dual consistency. IEEE Trans. Pattern Anal. and Mach. Intell., 48(4): 4642--4656, 2026b

  23. [24]

    Direct preference optimization: Your language model is secretly a reward model

    Rafailov, R., Sharma, A., Mitchell, E., et al. Direct preference optimization: Your language model is secretly a reward model. In Adv. Neural Inform. Process. Syst., 2023

  24. [25]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., et al. Proximal policy optimization algorithms. arXiv preprint: 1707.06347, 2017

  25. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint: 2402.03300, 2024

  26. [27]

    Hybridflow: A flexible and efficient rlhf framework

    Sheng, G., Zhang, C., Ye, Z., et al. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, pp. 1279--1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961

  27. [28]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Snell, C. V., Lee, J., Xu, K., et al. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In Int. Conf. Learn. Represent., 2025

  28. [29]

    Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

    Tan, H., Ji, Y., Hao, X., et al. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In Belgrave, D., Zhang, C., Lin, H., et al. (eds.), Adv. Neural Inform. Process. Syst., volume 38, pp. 5772--5822. Curran Associates, Inc., 2025

  29. [30]

    Vggt: Visual geometry grounded transformer

    Wang, J., Chen, M., Karaev, N., et al. Vggt: Visual geometry grounded transformer. In IEEE Conf. Comput. Vis. Pattern Recog., June 2025

  30. [31]

    Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai

    Wang, T., Mao, X., Zhu, C., et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 19757--19767, June 2024

  31. [32]

    Chain-of-thought prompting elicits reasoning in large language models

    Wei, J., Wang, X., Schuurmans, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In Adv. Neural Inform. Process. Syst., volume 35, pp. 24824--24837, 2022

  32. [33]

    Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence

    Wu, D., Liu, F., Hung, Y.-H., et al. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. In Belgrave, D., Zhang, C., Lin, H., et al. (eds.), Adv. Neural Inform. Process. Syst., volume 38, pp. 13569--13597. Curran Associates, Inc., 2025a

  33. [34]

    Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

    Wu, J., Guan, J., Feng, K., et al. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. In Adv. Neural Inform. Process. Syst., volume 38, pp. 143297--143330, 2025b

  34. [35]

    Gibson env: Real-world perception for embodied agents

    Xia, F., Zamir, A. R., He, Z., et al. Gibson env: Real-world perception for embodied agents. In IEEE Conf. Comput. Vis. Pattern Recog., June 2018

  35. [36]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Yang, J., Yang, S., Gupta, A. W., et al. Thinking in space: How multimodal large language models see, remember, and recall spaces. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 10632--10643, 2025a

  36. [37]

    Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents

    Yang, R., Chen, H., Zhang, J., et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In Int. Conf. Mach. Learn., volume 267, pp. 70576--70631, 13--19 Jul 2025b

  37. [38]

    R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization

    Yang, Y., He, X., Pan, H., et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. In Int. Conf. Comput. Vis., pp. 2376--2385, October 2025c

  38. [39]

    Tree of thoughts: Deliberate problem solving with large language models

    Yao, S., Yu, D., Zhao, J., et al. Tree of thoughts: Deliberate problem solving with large language models. In Adv. Neural Inform. Process. Syst., volume 36, pp. 11809--11822, 2023

  39. [40]

    Spatial Mental Modeling from Limited Views

    Yin, B., Wang, Q., Zhang, P., et al. Spatial Mental Modeling from Limited Views. In Structural Priors for Vision Workshop at ICCV '25, 2025

  40. [41]

    Thinking in 360°: Humanoid visual search in the wild

    Yu, H., Han, Y., Zhang, X., et al. Thinking in 360°: Humanoid visual search in the wild. In IEEE Conf. Comput. Vis. Pattern Recog., June 2026

  41. [42]

    Multimodal chain-of-thought reasoning in language models

    Zhang, Z., Zhang, A., Li, M., et al. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856

  42. [44]

    LlamaFactory: Unified efficient fine-tuning of 100+ language models

    Zheng, Y., Zhang, R., Zhang, J., et al. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Annual Meeting of the Ass. for Comput. Ling., pp. 400--410, 2024

  43. [45]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities

    Zhu, C., Wang, T., Zhang, W., et al. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. In Int. Conf. Comput. Vis., pp. 4295--4305, October 2025

  44. [46]

    Yin, Baiqiao and Wang, Qineng and Zhang, Pingyue and Zhang, Jianshu and Wang, Kangrui and Wang, Zihan and Zhang, Jieyu and Chandrasegaran, Keshigeyan and Liu, Han and Krishna, Ranjay and Xie, Saining and Li, Manling and Wu, Jiajun and Fei-Fei, Li , booktitle =. Spatial. 2025 , organization =

  45. [47]

    2025 , pages =

    Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui , title =. 2025 , pages =

  46. [48]

    2025 , volume=

    Fu, Rao and Liu, Jingyu and Chen, Xilun and Nie, Yixin and Xiong, Wenhan , booktitle=WACV, title=. 2025 , volume=

  47. [49]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence , volume =

    Wu, Diankun and Liu, Fangfu and Hung, Yi-Hsin and Duan, Yueqi , booktitle = NIPS, editor =. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence , volume =

  48. [50]

    Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David , title =

  49. [51]

    2025 , journal=

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning , author=. 2025 , journal=

  50. [52]

    Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models , volume =

    Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Chen, Xiansheng and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang , booktitle = NIPS, editor =. Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models , volume =

  51. [53]

    2024 , journal=

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , journal=

  52. [54]

    Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...

  53. [55]

    2017 , journal=

    Proximal Policy Optimization Algorithms , author=. 2017 , journal=

  54. [56]

    Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea , booktitle = NIPS, title =

  55. [57]

    2025 , journal=

    Qwen2.5-VL Technical Report , author=. 2025 , journal=

  56. [58]

    Bo Li and Yuanhan Zhang and Dong Guo and Renrui Zhang and Feng Li and Hao Zhang and Kaichen Zhang and Peiyuan Zhang and Yanwei Li and Ziwei Liu and Chunyuan Li , journal=TMLR, issn=

  57. [59]

    Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun MA and Ziwei Liu and Chunyuan Li , journal=TMLR, issn=

  58. [60]

    Long Context Transfer from Language to Vision , author=

  59. [61]

    Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou , booktitle=ICLR, year=. m

  60. [62]

    Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng , title =

  61. [63]

    Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models , year=

    Building and better understanding vision-language models: insights and future directions , author=. Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models , year=

  62. [64]

    2024 , journal=

    DeepSeek-VL: Towards Real-World Vision-Language Understanding , author=. 2024 , journal=

  63. [65]

    2025 , journal=

    Gemma 3 Technical Report , author=. 2025 , journal=

  64. [66]

    Mantis: Interleaved Multi-Image Instruction Tuning , author=

  65. [67]

    2024 , journal=

    GPT-4o System Card , author=. 2024 , journal=

  66. [68]

    2024 , month =

    Anthropic , title =. 2024 , month =

  67. [69]

    Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and Xue, Xinda and Su, Qinghang and Lyu, Huaihai and Zheng, Xiaolong and Liu, Jiaming and Wang, Zhongyuan and Zhang, Shanghang , title =

  68. [70]

    2024 , pages =

    Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brain and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei , title =. 2024 , pages =

  69. [71]

    2026 , month =

    Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views , author=. 2026 , month =

  70. [72]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model , volume =

    Hu, Jingcheng and Zhang, Yinmin and Han, Qi and Jiang, Daxin and Zhang, Xiangyu and Shum, Heung-Yeung , booktitle = NIPS, editor =. Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model , volume =

  71. [73]

    Charlie Victor Snell and Jaehoon Lee and Kelvin Xu and Aviral Kumar , booktitle=ICLR, year=. Scaling

  72. [74]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? , volume =

    Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao , booktitle = NIPS, pages =. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? , volume =

  73. [75]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing , volume =

    Wu, Junfei and Guan, Jian and Feng, Kaituo and Liu, Qiang and Wu, Shu and Wang, Liang and Wu, Wei and Tan, Tieniu , booktitle = NIPS, pages =. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing , volume =

  74. [76]

    2004 , doi =

    Dora Biro and Jessica Meade and Tim Guilford , title =. 2004 , doi =

  75. [77]

    and Wolfer, David P

    Lipp, Hans-Peter and Vyssotski, Alexei L. and Wolfer, David P. and Renaudineau, Sophie and Savini, Maria and Tr. Pigeon Homing along Highways and Exits , journal=. 2004 , volume=

  76. [78]

    2013 , month=

    The internal compass of the pigeon , journal=. 2013 , month=

  77. [79]

    2024 , pages =

    Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao , title =. 2024 , pages =

  78. [80]

    and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio , title =

    Xia, Fei and Zamir, Amir R. and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio , title =

  79. [81]

    2025 , pages =

    Yan, Tianyi and Wu, Dongming and Han, Wencheng and Jiang, Junpeng and Zhou, Xia and Zhan, Kun and Xu, Cheng-zhong and Shen, Jianbing , title =. 2025 , pages =

  80. [82]

    arXiv preprint:2601.05172 , year=

    CoV: Chain-of-View Prompting for Spatial Reasoning , author=. arXiv preprint:2601.05172 , year=

Showing first 80 references.