Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

Mengshi Qi; Wei Deng; Xianlin Zhang

arxiv: 2606.02459 · v1 · pith:67FHEKCXnew · submitted 2026-06-01 · 💻 cs.CV

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

Wei Deng , Xianlin Zhang , Mengshi Qi This is my paper

Pith reviewed 2026-06-28 15:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords spatial reasoningvision-language modelscognitive mapreinforcement learningagentic modelsMindCube benchmarkdense rewards

0 comments

The pith

Vision-language models using a dynamic cognitive map and spatial assertion codes reach 80.5 percent accuracy on the MindCube benchmark by verifying intermediate reasoning steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an agentic pipeline that lets vision-language models actively explore scenes rather than observe them passively. It introduces a dynamic cognitive map that stores object positions and orientations as persistent memory across observations. Spatial Assertion Codes, expressed as Python statements, describe spatial relations and allow the model to check its own intermediate steps. These checks generate dense reward signals that guide both supervised fine-tuning and reinforcement learning. The resulting system records 80.5 percent overall accuracy on MindCube and a 29.5-point gain on the rotation subset compared with prior methods.

Core claim

A dynamic cognitive map that parameterizes scene layout by object positions and orientations, together with Spatial Assertion Codes expressed as Python expressions, verifies intermediate spatial reasoning steps and supplies dense reward signals during supervised and reinforcement finetuning, enabling state-of-the-art performance on spatial reasoning tasks.

What carries the argument

The dynamic cognitive map paired with Spatial Assertion Codes (SAC), which together maintain persistent scene memory and generate verifiable assertions for dense rewards.

If this is right

The model outperforms the previous best method by 29.5 accuracy points on the Rotation subset.
Dense rewards from step verification improve results on complex multi-step spatial tasks where sparse rewards previously limited progress.
Treating VLMs as active agents rather than passive observers enables better handling of real-world spatial queries that require exploration.
The combination of persistent memory and programmatic assertions scales to new observations without resetting the scene representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same verification mechanism could be tested on embodied navigation benchmarks that require physical movement rather than static image reasoning.
If SAC expressions generalize beyond spatial relations, similar assertion-based rewards might apply to other VLM reasoning domains such as temporal or causal inference.
Open-sourcing the code allows direct measurement of how much the cognitive map contributes when the model faces novel object configurations not seen in MindCube.

Load-bearing premise

The dynamic cognitive map and SAC together produce reliable verification of intermediate steps that actually supply effective dense reward signals during finetuning.

What would settle it

Removing either the dynamic cognitive map or the SAC component and observing no drop in accuracy on the Rotation subset of MindCube would falsify the claim that these elements drive the reported performance gains.

Figures

Figures reproduced from arXiv: 2606.02459 by Mengshi Qi, Wei Deng, Xianlin Zhang.

**Figure 1.** Figure 1: Illustration of the active exploring like a pigeon. (Left) The pigeon can build a cognitive map from observations in mind. (Right) The cognitive map guides the pigeon to navigation. (VLMs) (Bai et al., 2025; OpenAI, 2024; Chen et al., 2024b) have shown impressive performance in visual understanding and reasoning (Zhang et al., 2024; Peng et al., 2025; Yang et al., 2025c). Despite these advancements, enabli… view at source ↗

**Figure 2.** Figure 2: (Top) Question Q, view transformation relationships E are given. (A) There are multiple view images V = {vn} N n=1 provided to be perceived by the VLM. (B) Our VLM outputs SAC alongside the natural language reasoning when performing spatial reasoning. (C) We propose the dynamic cognitive map that stores observations, recalls memory by the VLM, and computes dense rewards collaborating with SAC. (Bottom) Our… view at source ↗

**Figure 3.** Figure 3: Two-stage training process of our model. (Top) Supervised Finetuning. We synthesize dataset with aspects of related view retrieval, dynamic cognitive map updating, and spatial reasoning with SAC for supervised finetuning. (Bottom) Reinforcement Finetuning. We define a reward function that measures the retrieval relatedness, cognitive map correctness, and spatial reasoning correctness for reinforcement fine… view at source ↗

**Figure 4.** Figure 4: Pass@k accuracy curves demonstrating the contributions of SFT and RFT stages (left) and comparison of the adaptive and greedy retrieval strategies (right) [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: A case study visualizing our method’s active exploration process. Our model retrieves two views and builds a dynamic cognitive map that parameterizes the layout of the scene. During reasoning, the model outputs SAC that states v2 is in front of v1, meanwhile to the left of v1. Therefore, the correct answer “B. Diagonally forward and left” is inferred. yields a substantial gain, propelling the pass@1 accura… view at source ↗

**Figure 6.** Figure 6: Fine-grained reward analysis (left): removing Y from retrieval supervision in Rretrieval (Retrieval w/o Y), using only the outcome term 1correct as reward (0/1), dropping the gating factor 1correct in Equation (7) (Ungated), and our full reward (Full). Performance gains breakdown (right): passive RL-only (P+RL), active RL-only (A+RL), passive with SFT followed by RL (P+SFT+RL), and the full active pipelin… view at source ↗

read the original abstract

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by pigeons' building and exploiting cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new \emph{dynamic cognitive map} parameterizing scene layout as object positions and orientations, serving as persistent memory for new observations. Second, we propose a novel \emph{Spatial Assertion Codes (SAC)}, Python expressions programmatically describing spatial relationships. By collaborating with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with \emph{80.5\%} overall accuracy, outperforming the best current method by \emph{29.5} accuracy points (a relative improvement of \emph{53.2\%}) on the challenging \textsc{Rotation} subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a dynamic cognitive map and programmatic Spatial Assertion Codes to supply dense rewards during VLM finetuning, reporting a 29.5-point gain on MindCube Rotation, but the gains need component ablations to attribute them to the new pieces.

read the letter

The paper's main contribution is a pipeline that keeps a dynamic cognitive map of object positions and orientations as persistent memory, then uses Spatial Assertion Codes—Python expressions that encode spatial relations—to verify intermediate steps and produce dense rewards for both supervised and RL finetuning.

This is a straightforward way to address sparse rewards in spatial reasoning tasks. The reported results on MindCube are the strongest part: 80.5% overall accuracy and a 29.5-point lift on the Rotation subset, with code and data released.

The soft spot is the lack of clear evidence that the SAC and map are what drive the improvement. Without ablations that turn those components off, or reward histograms showing the assertions actually fire correctly at scale, the margin could come from data choices, base model, or prompt details instead. The stress-test concern lands here.

The work is aimed at researchers building agentic VLMs for navigation or robotics. It engages honestly with the sparse-reward problem and offers a reproducible starting point.

Send it to peer review. The technical proposal is distinct enough and the benchmark numbers large enough that referees can usefully pressure the attribution and ask for the missing controls.

Referee Report

2 major / 1 minor

Summary. The paper proposes an agentic pipeline for spatial reasoning in VLMs, inspired by pigeons' cognitive maps. It introduces a dynamic cognitive map to represent scene layouts (object positions and orientations) as persistent memory and Spatial Assertion Codes (SAC) as Python expressions for spatial relationships. These enable verification of intermediate reasoning steps to generate dense rewards during supervised and reinforcement finetuning (SAC + dynamic map collaboration). On the MindCube benchmark, the approach achieves SOTA performance of 80.5% overall accuracy, with a 29.5-point (53.2% relative) gain on the challenging Rotation subset over prior methods. Code and data are open-sourced.

Significance. If the reported gains are shown to stem from the SAC-derived dense rewards and dynamic cognitive map enabling verifiable intermediate steps, the work would be significant for advancing RL-based finetuning of VLMs on complex spatial tasks, addressing the limitations of sparse rewards and passive observation. The open-sourcing strengthens reproducibility.

major comments (2)

[Experiments] Experiments section: The central claim requires that the dynamic cognitive map + SAC pipeline supplies verifiable intermediate steps yielding effective dense rewards, producing the 80.5% overall / 29.5-point Rotation gain. No component ablations, reward histograms, step-verification accuracy metrics, or error bars are reported to show that removing SAC collapses performance or that SAC expressions succeed at scale; the margin could arise from base-model differences, data curation, or prompt engineering instead.
[Method] Method section: The description of how SAC Python expressions programmatically describe spatial relationships and collaborate with the dynamic cognitive map for verification is high-level only. Without concrete examples of SAC expressions, verification logic, or how they integrate into the RL reward function, it is impossible to assess whether they actually provide the claimed dense signals or are load-bearing for the finetuning pipeline.

minor comments (1)

[Abstract] The abstract and method overview would benefit from a brief table or figure summarizing the MindCube dataset splits and baseline methods for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional evidence and detail would strengthen the paper. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses

Referee: [Experiments] Experiments section: The central claim requires that the dynamic cognitive map + SAC pipeline supplies verifiable intermediate steps yielding effective dense rewards, producing the 80.5% overall / 29.5-point Rotation gain. No component ablations, reward histograms, step-verification accuracy metrics, or error bars are reported to show that removing SAC collapses performance or that SAC expressions succeed at scale; the margin could arise from base-model differences, data curation, or prompt engineering instead.

Authors: We agree that the manuscript would benefit from explicit ablations and supporting metrics to isolate the contributions of the dynamic cognitive map and SAC. In the revised version we will add component ablations (full pipeline vs. without SAC, without dynamic map, and without both), reward histograms comparing dense vs. sparse settings, step-verification accuracy on held-out examples, and error bars from multiple random seeds. These additions will directly test whether performance collapses without the proposed mechanisms. revision: yes
Referee: [Method] Method section: The description of how SAC Python expressions programmatically describe spatial relationships and collaborate with the dynamic cognitive map for verification is high-level only. Without concrete examples of SAC expressions, verification logic, or how they integrate into the RL reward function, it is impossible to assess whether they actually provide the claimed dense signals or are load-bearing for the finetuning pipeline.

Authors: We acknowledge the Method section remains at a high level. The revised manuscript will include (1) multiple concrete SAC Python expression examples for common spatial relations (e.g., relative position, orientation, containment), (2) pseudocode for the verification procedure that queries the dynamic cognitive map, and (3) the exact mapping from verification outcomes (true/false/unknown) to the per-step dense reward term used in both supervised and RL stages. revision: yes

Circularity Check

0 steps flagged

No circularity; novel components introduced without reducing claims to fitted inputs or self-citations

full rationale

The paper presents a dynamic cognitive map and Spatial Assertion Codes (SAC) as newly introduced mechanisms that collaborate to verify intermediate spatial reasoning steps and supply dense rewards during supervised and reinforcement finetuning of VLMs. No equations, fitted parameters, or self-citations appear in the provided text that would make the claimed 80.5% accuracy or 29.5-point Rotation gain equivalent to the inputs by construction. The derivation relies on standard RL/VLM finetuning pipelines augmented by these independent components, with benchmark results treated as empirical outcomes rather than tautological predictions. This is the most common honest finding for papers that add new architectural elements without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review; full details on parameters, assumptions, and evidence are unavailable. The two main additions are the newly introduced entities listed below.

axioms (1)

standard math Standard VLM finetuning and RL assumptions hold without modification.
The abstract invokes supervised and reinforcement finetuning without stating deviations from common practice.

invented entities (2)

dynamic cognitive map no independent evidence
purpose: Persistent memory storing scene layout as object positions and orientations
Introduced as a new parameterization serving as memory for observations.
Spatial Assertion Codes (SAC) no independent evidence
purpose: Python expressions that programmatically describe spatial relationships for step verification and dense rewards
Proposed as a novel mechanism collaborating with the cognitive map.

pith-pipeline@v0.9.1-grok · 5754 in / 1331 out tokens · 31316 ms · 2026-06-28T15:14:44.050012+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

103 extracted references · 9 canonical work pages · 7 internal anchors

[1]

Claude 3.5 sonnet

Anthropic. Claude 3.5 sonnet. Blog, 10 2024. Accessed: November 22, 2024

2024
[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., et al. Qwen2.5-vl technical report. arXiv preprint: 2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Bingman, V., Jechura, T., and Kahn, M. C. Behavioral and neural mechanisms of homing and migration in birds. Animal Spatial Cognition: Comparative, Neural, and Computational Approaches,[On-line]. Available: pigeon.psy.tufts.edu/asc/Bingman/Default.htm, 2006

2006
[4]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Chen, B., Xu, Z., Kirmani, S., et al. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 14455--14465, June 2024 a

2024
[5]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., et al. Evaluating large language models trained on code. arXiv preprint: 2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Chen, Z., Wu, J., Wang, W., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In IEEE Conf. Comput. Vis. Pattern Recog., June 2024 b

2024
[7]

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In Adv

Chen, Z., Lu, R., Zhao, A., et al. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In Adv. Neural Inform. Process. Syst., volume 38, pp.\ 57654--57689, 2025

2025
[8]

Think with 3d: Geometric imagination grounded spatial reasoning from limited views

Chen, Z., Zhang, M., Yu, X., et al. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. In IEEE Conf. Comput. Vis. Pattern Recog., June 2026

2026
[9]

Global-local tree search in vlms for 3d indoor scene generation

Deng, W., Qi, M., and Ma, H. Global-local tree search in vlms for 3d indoor scene generation. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 8975--8984, June 2025

2025
[10]

Scene-llm: Extending language model for 3d visual reasoning

Fu, R., Liu, J., Chen, X., et al. Scene-llm: Extending language model for 3d visual reasoning. In IEEE Win. Conf. on App. of Comput. Vis., 2025

2025
[11]

A survey on llm-as-a-judge

Gu, J., Jiang, X., Shi, Z., et al. A survey on llm-as-a-judge. The Innovation, pp.\ 101253, 2026. ISSN 2666-6758

2026
[12]

L., Wolfer, D

Lipp, H.-P., Vyssotski, A. L., Wolfer, D. P., et al. Pigeon homing along highways and exits. Current Biology, 14 0 (14): 0 1239--1249, 2004. ISSN 0960-9822

2004
[13]

Sgformer: Semantic graph transformer for point cloud-based 3d scene graph generation

Lv, C., Qi, M., Li, X., et al. Sgformer: Semantic graph transformer for point cloud-based 3d scene graph generation. AAAI, 38 0 (5): 0 4035--4043, 2024

2024
[14]

T2sg: Traffic topology scene graph for topology reasoning in autonomous driving

Lv, C., Qi, M., Liu, L., et al. T2sg: Traffic topology scene graph for topology reasoning in autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 17197--17206, June 2025

2025
[16]

GPT-4o System Card

OpenAI. Gpt-4o system card. arXiv preprint: 2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Ouyang, K., Liu, Y., Wu, H., et al. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint: 2504.01805, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., et al. Training language models to follow instructions with human feedback. In Adv. Neural Inform. Process. Syst., volume 35, 2022

2022
[19]

Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0

O’Neill, A., Rehman, A., Maddukuri, A., et al. Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0. In IEEE Int. Conf. on Robot. and Auto., pp.\ 6892--6903, 2024

2024
[20]

Skywork r1v: Pioneering multimodal reasoning with chain-of-thought

Peng, Y., Wang, P., Wang, X., et al. Skywork r1v: Pioneering multimodal reasoning with chain-of-thought. arXiv preprint: 2504.05599, 2025

work page arXiv 2025
[21]

Action quality assessment via hierarchical pose-guided multi-stage contrastive regression

Qi, M., Ye, H., Peng, J., et al. Action quality assessment via hierarchical pose-guided multi-stage contrastive regression. IEEE Trans. Image Process., 34: 0 6461--6474, 2025

2025
[22]

Robust disentangled counterfactual learning for physical audiovisual commonsense reasoning

Qi, M., Lv, C., and Ma, H. Robust disentangled counterfactual learning for physical audiovisual commonsense reasoning. IEEE Trans. Pattern Anal. and Mach. Intell., 48 0 (3): 0 2514--2527, 2026 a

2026
[23]

Dc-sam: In-context segment anything in images and videos via dual consistency

Qi, M., Zhu, P., Li, X., et al. Dc-sam: In-context segment anything in images and videos via dual consistency. IEEE Trans. Pattern Anal. and Mach. Intell., 48 0 (4): 0 4642--4656, 2026 b

2026
[24]

Direct preference optimization: Your language model is secretly a reward model

Rafailov, R., Sharma, A., Mitchell, E., et al. Direct preference optimization: Your language model is secretly a reward model. In Adv. Neural Inform. Process. Syst., 2023

2023
[25]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., et al. Proximal policy optimization algorithms. arXiv preprint: 1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint: 2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Hybridflow: A flexible and efficient rlhf framework

Sheng, G., Zhang, C., Ye, Z., et al. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, pp.\ 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961

2025
[28]

V., Lee, J., Xu, K., et al

Snell, C. V., Lee, J., Xu, K., et al. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In Int. Conf. Learn. Represent., 2025

2025
[29]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

Tan, H., Ji, Y., Hao, X., et al. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In Belgrave, D., Zhang, C., Lin, H., et al. (eds.), Adv. Neural Inform. Process. Syst., volume 38, pp.\ 5772--5822. Curran Associates, Inc., 2025

2025
[30]

Vggt: Visual geometry grounded transformer

Wang, J., Chen, M., Karaev, N., et al. Vggt: Visual geometry grounded transformer. In IEEE Conf. Comput. Vis. Pattern Recog., June 2025

2025
[31]

Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai

Wang, T., Mao, X., Zhu, C., et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 19757--19767, June 2024

2024
[32]

Chain-of-thought prompting elicits reasoning in large language models

Wei, J., Wang, X., Schuurmans, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In Adv. Neural Inform. Process. Syst., volume 35, pp.\ 24824--24837, 2022

2022
[33]

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence

Wu, D., Liu, F., Hung, Y.-H., et al. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. In Belgrave, D., Zhang, C., Lin, H., et al. (eds.), Adv. Neural Inform. Process. Syst., volume 38, pp.\ 13569--13597. Curran Associates, Inc., 2025 a

2025
[34]

Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

Wu, J., Guan, J., Feng, K., et al. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. In Adv. Neural Inform. Process. Syst., volume 38, pp.\ 143297--143330, 2025 b

2025
[35]

R., He, Z., et al

Xia, F., Zamir, A. R., He, Z., et al. Gibson env: Real-world perception for embodied agents. In IEEE Conf. Comput. Vis. Pattern Recog., June 2018

2018
[36]

W., et al

Yang, J., Yang, S., Gupta, A. W., et al. Thinking in space: How multimodal large language models see, remember, and recall spaces. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 10632--10643, 2025 a

2025
[37]

Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents

Yang, R., Chen, H., Zhang, J., et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In Int. Conf. Mach. Learn., volume 267, pp.\ 70576--70631, 13--19 Jul 2025 b

2025
[38]

R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization

Yang, Y., He, X., Pan, H., et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. In Int. Conf. Comput. Vis., pp.\ 2376--2385, October 2025 c

2025
[39]

Tree of thoughts: Deliberate problem solving with large language models

Yao, S., Yu, D., Zhao, J., et al. Tree of thoughts: Deliberate problem solving with large language models. In Adv. Neural Inform. Process. Syst., volume 36, pp.\ 11809--11822, 2023

2023
[40]

Spatial Mental Modeling from Limited Views

Yin, B., Wang, Q., Zhang, P., et al. Spatial Mental Modeling from Limited Views . In Structural Priors for Vision Workshop at ICCV '25 , 2025

2025
[41]

Thinking in 360°: Humanoid visual search in the wild

Yu, H., Han, Y., Zhang, X., et al. Thinking in 360°: Humanoid visual search in the wild. In IEEE Conf. Comput. Vis. Pattern Recog., June 2026

2026
[42]

Multimodal chain-of-thought reasoning in language models

Zhang, Z., Zhang, A., Li, M., et al. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856

2024
[44]

L lama F actory: Unified efficient fine-tuning of 100+ language models

Zheng, Y., Zhang, R., Zhang, J., et al. L lama F actory: Unified efficient fine-tuning of 100+ language models. In Annual Meeting of the Ass. for Comput. Ling., pp.\ 400--410, 2024

2024
[45]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities

Zhu, C., Wang, T., Zhang, W., et al. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. In Int. Conf. Comput. Vis., pp.\ 4295--4305, October 2025

2025
[46]

Yin, Baiqiao and Wang, Qineng and Zhang, Pingyue and Zhang, Jianshu and Wang, Kangrui and Wang, Zihan and Zhang, Jieyu and Chandrasegaran, Keshigeyan and Liu, Han and Krishna, Ranjay and Xie, Saining and Li, Manling and Wu, Jiajun and Fei-Fei, Li , booktitle =. Spatial. 2025 , organization =

2025
[47]

2025 , pages =

Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui , title =. 2025 , pages =

2025
[48]

2025 , volume=

Fu, Rao and Liu, Jingyu and Chen, Xilun and Nie, Yixin and Xiong, Wenhan , booktitle=WACV, title=. 2025 , volume=

2025
[49]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence , volume =

Wu, Diankun and Liu, Fangfu and Hung, Yi-Hsin and Duan, Yueqi , booktitle = NIPS, editor =. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence , volume =
[50]

Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David , title =
[51]

2025 , journal=

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning , author=. 2025 , journal=

2025
[52]

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models , volume =

Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Chen, Xiansheng and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang , booktitle = NIPS, editor =. Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models , volume =
[53]

2024 , journal=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , journal=

2024
[54]

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...
[55]

2017 , journal=

Proximal Policy Optimization Algorithms , author=. 2017 , journal=

2017
[56]

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea , booktitle = NIPS, title =
[57]

2025 , journal=

Qwen2.5-VL Technical Report , author=. 2025 , journal=

2025
[58]

Bo Li and Yuanhan Zhang and Dong Guo and Renrui Zhang and Feng Li and Hao Zhang and Kaichen Zhang and Peiyuan Zhang and Yanwei Li and Ziwei Liu and Chunyuan Li , journal=TMLR, issn=
[59]

Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun MA and Ziwei Liu and Chunyuan Li , journal=TMLR, issn=
[60]

Long Context Transfer from Language to Vision , author=
[61]

Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou , booktitle=ICLR, year=. m
[62]

Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng , title =
[63]

Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models , year=

Building and better understanding vision-language models: insights and future directions , author=. Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models , year=
[64]

2024 , journal=

DeepSeek-VL: Towards Real-World Vision-Language Understanding , author=. 2024 , journal=

2024
[65]

2025 , journal=

Gemma 3 Technical Report , author=. 2025 , journal=

2025
[66]

Mantis: Interleaved Multi-Image Instruction Tuning , author=
[67]

2024 , journal=

GPT-4o System Card , author=. 2024 , journal=

2024
[68]

2024 , month =

Anthropic , title =. 2024 , month =

2024
[69]

Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and Xue, Xinda and Su, Qinghang and Lyu, Huaihai and Zheng, Xiaolong and Liu, Jiaming and Wang, Zhongyuan and Zhang, Shanghang , title =
[70]

2024 , pages =

Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brain and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei , title =. 2024 , pages =

2024
[71]

2026 , month =

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views , author=. 2026 , month =

2026
[72]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model , volume =

Hu, Jingcheng and Zhang, Yinmin and Han, Qi and Jiang, Daxin and Zhang, Xiangyu and Shum, Heung-Yeung , booktitle = NIPS, editor =. Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model , volume =
[73]

Charlie Victor Snell and Jaehoon Lee and Kelvin Xu and Aviral Kumar , booktitle=ICLR, year=. Scaling
[74]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? , volume =

Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao , booktitle = NIPS, pages =. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? , volume =
[75]

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing , volume =

Wu, Junfei and Guan, Jian and Feng, Kaituo and Liu, Qiang and Wu, Shu and Wang, Liang and Wu, Wei and Tan, Tieniu , booktitle = NIPS, pages =. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing , volume =
[76]

2004 , doi =

Dora Biro and Jessica Meade and Tim Guilford , title =. 2004 , doi =

2004
[77]

and Wolfer, David P

Lipp, Hans-Peter and Vyssotski, Alexei L. and Wolfer, David P. and Renaudineau, Sophie and Savini, Maria and Tr. Pigeon Homing along Highways and Exits , journal=. 2004 , volume=

2004
[78]

2013 , month=

The internal compass of the pigeon , journal=. 2013 , month=

2013
[79]

2024 , pages =

Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao , title =. 2024 , pages =

2024
[80]

and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio , title =

Xia, Fei and Zamir, Amir R. and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio , title =
[81]

2025 , pages =

Yan, Tianyi and Wu, Dongming and Han, Wencheng and Jiang, Junpeng and Zhou, Xia and Zhan, Kun and Xu, Cheng-zhong and Shen, Jianbing , title =. 2025 , pages =

2025
[82]

arXiv preprint:2601.05172 , year=

CoV: Chain-of-View Prompting for Spatial Reasoning , author=. arXiv preprint:2601.05172 , year=

work page arXiv

Showing first 80 references.

[1] [1]

Claude 3.5 sonnet

Anthropic. Claude 3.5 sonnet. Blog, 10 2024. Accessed: November 22, 2024

2024

[2] [2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., et al. Qwen2.5-vl technical report. arXiv preprint: 2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Bingman, V., Jechura, T., and Kahn, M. C. Behavioral and neural mechanisms of homing and migration in birds. Animal Spatial Cognition: Comparative, Neural, and Computational Approaches,[On-line]. Available: pigeon.psy.tufts.edu/asc/Bingman/Default.htm, 2006

2006

[4] [4]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Chen, B., Xu, Z., Kirmani, S., et al. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 14455--14465, June 2024 a

2024

[5] [5]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., et al. Evaluating large language models trained on code. arXiv preprint: 2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Chen, Z., Wu, J., Wang, W., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In IEEE Conf. Comput. Vis. Pattern Recog., June 2024 b

2024

[7] [7]

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In Adv

Chen, Z., Lu, R., Zhao, A., et al. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In Adv. Neural Inform. Process. Syst., volume 38, pp.\ 57654--57689, 2025

2025

[8] [8]

Think with 3d: Geometric imagination grounded spatial reasoning from limited views

Chen, Z., Zhang, M., Yu, X., et al. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. In IEEE Conf. Comput. Vis. Pattern Recog., June 2026

2026

[9] [9]

Global-local tree search in vlms for 3d indoor scene generation

Deng, W., Qi, M., and Ma, H. Global-local tree search in vlms for 3d indoor scene generation. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 8975--8984, June 2025

2025

[10] [10]

Scene-llm: Extending language model for 3d visual reasoning

Fu, R., Liu, J., Chen, X., et al. Scene-llm: Extending language model for 3d visual reasoning. In IEEE Win. Conf. on App. of Comput. Vis., 2025

2025

[11] [11]

A survey on llm-as-a-judge

Gu, J., Jiang, X., Shi, Z., et al. A survey on llm-as-a-judge. The Innovation, pp.\ 101253, 2026. ISSN 2666-6758

2026

[12] [12]

L., Wolfer, D

Lipp, H.-P., Vyssotski, A. L., Wolfer, D. P., et al. Pigeon homing along highways and exits. Current Biology, 14 0 (14): 0 1239--1249, 2004. ISSN 0960-9822

2004

[13] [13]

Sgformer: Semantic graph transformer for point cloud-based 3d scene graph generation

Lv, C., Qi, M., Li, X., et al. Sgformer: Semantic graph transformer for point cloud-based 3d scene graph generation. AAAI, 38 0 (5): 0 4035--4043, 2024

2024

[14] [14]

T2sg: Traffic topology scene graph for topology reasoning in autonomous driving

Lv, C., Qi, M., Liu, L., et al. T2sg: Traffic topology scene graph for topology reasoning in autonomous driving. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 17197--17206, June 2025

2025

[15] [16]

GPT-4o System Card

OpenAI. Gpt-4o system card. arXiv preprint: 2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [17]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Ouyang, K., Liu, Y., Wu, H., et al. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint: 2504.01805, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [18]

Training language models to follow instructions with human feedback

Ouyang, L., Wu, J., Jiang, X., et al. Training language models to follow instructions with human feedback. In Adv. Neural Inform. Process. Syst., volume 35, 2022

2022

[18] [19]

Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0

O’Neill, A., Rehman, A., Maddukuri, A., et al. Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0. In IEEE Int. Conf. on Robot. and Auto., pp.\ 6892--6903, 2024

2024

[19] [20]

Skywork r1v: Pioneering multimodal reasoning with chain-of-thought

Peng, Y., Wang, P., Wang, X., et al. Skywork r1v: Pioneering multimodal reasoning with chain-of-thought. arXiv preprint: 2504.05599, 2025

work page arXiv 2025

[20] [21]

Action quality assessment via hierarchical pose-guided multi-stage contrastive regression

Qi, M., Ye, H., Peng, J., et al. Action quality assessment via hierarchical pose-guided multi-stage contrastive regression. IEEE Trans. Image Process., 34: 0 6461--6474, 2025

2025

[21] [22]

Robust disentangled counterfactual learning for physical audiovisual commonsense reasoning

Qi, M., Lv, C., and Ma, H. Robust disentangled counterfactual learning for physical audiovisual commonsense reasoning. IEEE Trans. Pattern Anal. and Mach. Intell., 48 0 (3): 0 2514--2527, 2026 a

2026

[22] [23]

Dc-sam: In-context segment anything in images and videos via dual consistency

Qi, M., Zhu, P., Li, X., et al. Dc-sam: In-context segment anything in images and videos via dual consistency. IEEE Trans. Pattern Anal. and Mach. Intell., 48 0 (4): 0 4642--4656, 2026 b

2026

[23] [24]

Direct preference optimization: Your language model is secretly a reward model

Rafailov, R., Sharma, A., Mitchell, E., et al. Direct preference optimization: Your language model is secretly a reward model. In Adv. Neural Inform. Process. Syst., 2023

2023

[24] [25]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., et al. Proximal policy optimization algorithms. arXiv preprint: 1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint: 2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [27]

Hybridflow: A flexible and efficient rlhf framework

Sheng, G., Zhang, C., Ye, Z., et al. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, pp.\ 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961

2025

[27] [28]

V., Lee, J., Xu, K., et al

Snell, C. V., Lee, J., Xu, K., et al. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In Int. Conf. Learn. Represent., 2025

2025

[28] [29]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

Tan, H., Ji, Y., Hao, X., et al. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. In Belgrave, D., Zhang, C., Lin, H., et al. (eds.), Adv. Neural Inform. Process. Syst., volume 38, pp.\ 5772--5822. Curran Associates, Inc., 2025

2025

[29] [30]

Vggt: Visual geometry grounded transformer

Wang, J., Chen, M., Karaev, N., et al. Vggt: Visual geometry grounded transformer. In IEEE Conf. Comput. Vis. Pattern Recog., June 2025

2025

[30] [31]

Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai

Wang, T., Mao, X., Zhu, C., et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 19757--19767, June 2024

2024

[31] [32]

Chain-of-thought prompting elicits reasoning in large language models

Wei, J., Wang, X., Schuurmans, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In Adv. Neural Inform. Process. Syst., volume 35, pp.\ 24824--24837, 2022

2022

[32] [33]

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence

Wu, D., Liu, F., Hung, Y.-H., et al. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. In Belgrave, D., Zhang, C., Lin, H., et al. (eds.), Adv. Neural Inform. Process. Syst., volume 38, pp.\ 13569--13597. Curran Associates, Inc., 2025 a

2025

[33] [34]

Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

Wu, J., Guan, J., Feng, K., et al. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. In Adv. Neural Inform. Process. Syst., volume 38, pp.\ 143297--143330, 2025 b

2025

[34] [35]

R., He, Z., et al

Xia, F., Zamir, A. R., He, Z., et al. Gibson env: Real-world perception for embodied agents. In IEEE Conf. Comput. Vis. Pattern Recog., June 2018

2018

[35] [36]

W., et al

Yang, J., Yang, S., Gupta, A. W., et al. Thinking in space: How multimodal large language models see, remember, and recall spaces. In IEEE Conf. Comput. Vis. Pattern Recog., pp.\ 10632--10643, 2025 a

2025

[36] [37]

Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents

Yang, R., Chen, H., Zhang, J., et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. In Int. Conf. Mach. Learn., volume 267, pp.\ 70576--70631, 13--19 Jul 2025 b

2025

[37] [38]

R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization

Yang, Y., He, X., Pan, H., et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. In Int. Conf. Comput. Vis., pp.\ 2376--2385, October 2025 c

2025

[38] [39]

Tree of thoughts: Deliberate problem solving with large language models

Yao, S., Yu, D., Zhao, J., et al. Tree of thoughts: Deliberate problem solving with large language models. In Adv. Neural Inform. Process. Syst., volume 36, pp.\ 11809--11822, 2023

2023

[39] [40]

Spatial Mental Modeling from Limited Views

Yin, B., Wang, Q., Zhang, P., et al. Spatial Mental Modeling from Limited Views . In Structural Priors for Vision Workshop at ICCV '25 , 2025

2025

[40] [41]

Thinking in 360°: Humanoid visual search in the wild

Yu, H., Han, Y., Zhang, X., et al. Thinking in 360°: Humanoid visual search in the wild. In IEEE Conf. Comput. Vis. Pattern Recog., June 2026

2026

[41] [42]

Multimodal chain-of-thought reasoning in language models

Zhang, Z., Zhang, A., Li, M., et al. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856

2024

[42] [44]

L lama F actory: Unified efficient fine-tuning of 100+ language models

Zheng, Y., Zhang, R., Zhang, J., et al. L lama F actory: Unified efficient fine-tuning of 100+ language models. In Annual Meeting of the Ass. for Comput. Ling., pp.\ 400--410, 2024

2024

[43] [45]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities

Zhu, C., Wang, T., Zhang, W., et al. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. In Int. Conf. Comput. Vis., pp.\ 4295--4305, October 2025

2025

[44] [46]

Yin, Baiqiao and Wang, Qineng and Zhang, Pingyue and Zhang, Jianshu and Wang, Kangrui and Wang, Zihan and Zhang, Jieyu and Chandrasegaran, Keshigeyan and Liu, Han and Krishna, Ranjay and Xie, Saining and Li, Manling and Wu, Jiajun and Fei-Fei, Li , booktitle =. Spatial. 2025 , organization =

2025

[45] [47]

2025 , pages =

Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui , title =. 2025 , pages =

2025

[46] [48]

2025 , volume=

Fu, Rao and Liu, Jingyu and Chen, Xilun and Nie, Yixin and Xiong, Wenhan , booktitle=WACV, title=. 2025 , volume=

2025

[47] [49]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence , volume =

Wu, Diankun and Liu, Fangfu and Hung, Yi-Hsin and Duan, Yueqi , booktitle = NIPS, editor =. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence , volume =

[48] [50]

Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David , title =

[49] [51]

2025 , journal=

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning , author=. 2025 , journal=

2025

[50] [52]

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models , volume =

Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Chen, Xiansheng and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang , booktitle = NIPS, editor =. Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models , volume =

[51] [53]

2024 , journal=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , journal=

2024

[52] [54]

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...

[53] [55]

2017 , journal=

Proximal Policy Optimization Algorithms , author=. 2017 , journal=

2017

[54] [56]

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea , booktitle = NIPS, title =

[55] [57]

2025 , journal=

Qwen2.5-VL Technical Report , author=. 2025 , journal=

2025

[56] [58]

Bo Li and Yuanhan Zhang and Dong Guo and Renrui Zhang and Feng Li and Hao Zhang and Kaichen Zhang and Peiyuan Zhang and Yanwei Li and Ziwei Liu and Chunyuan Li , journal=TMLR, issn=

[57] [59]

Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun MA and Ziwei Liu and Chunyuan Li , journal=TMLR, issn=

[58] [60]

Long Context Transfer from Language to Vision , author=

[59] [61]

Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou , booktitle=ICLR, year=. m

[60] [62]

Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng , title =

[61] [63]

Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models , year=

Building and better understanding vision-language models: insights and future directions , author=. Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models , year=

[62] [64]

2024 , journal=

DeepSeek-VL: Towards Real-World Vision-Language Understanding , author=. 2024 , journal=

2024

[63] [65]

2025 , journal=

Gemma 3 Technical Report , author=. 2025 , journal=

2025

[64] [66]

Mantis: Interleaved Multi-Image Instruction Tuning , author=

[65] [67]

2024 , journal=

GPT-4o System Card , author=. 2024 , journal=

2024

[66] [68]

2024 , month =

Anthropic , title =. 2024 , month =

2024

[67] [69]

Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and Xue, Xinda and Su, Qinghang and Lyu, Huaihai and Zheng, Xiaolong and Liu, Jiaming and Wang, Zhongyuan and Zhang, Shanghang , title =

[68] [70]

2024 , pages =

Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brain and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei , title =. 2024 , pages =

2024

[69] [71]

2026 , month =

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views , author=. 2026 , month =

2026

[70] [72]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model , volume =

Hu, Jingcheng and Zhang, Yinmin and Han, Qi and Jiang, Daxin and Zhang, Xiangyu and Shum, Heung-Yeung , booktitle = NIPS, editor =. Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model , volume =

[71] [73]

Charlie Victor Snell and Jaehoon Lee and Kelvin Xu and Aviral Kumar , booktitle=ICLR, year=. Scaling

[72] [74]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? , volume =

Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao , booktitle = NIPS, pages =. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? , volume =

[73] [75]

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing , volume =

Wu, Junfei and Guan, Jian and Feng, Kaituo and Liu, Qiang and Wu, Shu and Wang, Liang and Wu, Wei and Tan, Tieniu , booktitle = NIPS, pages =. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing , volume =

[74] [76]

2004 , doi =

Dora Biro and Jessica Meade and Tim Guilford , title =. 2004 , doi =

2004

[75] [77]

and Wolfer, David P

Lipp, Hans-Peter and Vyssotski, Alexei L. and Wolfer, David P. and Renaudineau, Sophie and Savini, Maria and Tr. Pigeon Homing along Highways and Exits , journal=. 2004 , volume=

2004

[76] [78]

2013 , month=

The internal compass of the pigeon , journal=. 2013 , month=

2013

[77] [79]

2024 , pages =

Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao , title =. 2024 , pages =

2024

[78] [80]

and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio , title =

Xia, Fei and Zamir, Amir R. and He, Zhiyang and Sax, Alexander and Malik, Jitendra and Savarese, Silvio , title =

[79] [81]

2025 , pages =

Yan, Tianyi and Wu, Dongming and Han, Wencheng and Jiang, Junpeng and Zhou, Xia and Zhan, Kun and Xu, Cheng-zhong and Shen, Jianbing , title =. 2025 , pages =

2025

[80] [82]

arXiv preprint:2601.05172 , year=

CoV: Chain-of-View Prompting for Spatial Reasoning , author=. arXiv preprint:2601.05172 , year=

work page arXiv