pith. sign in

arxiv: 2605.18162 · v1 · pith:S2SNMD3Znew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Pith reviewed 2026-05-20 11:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords spatial reasoningvision-language modelsgeometric consistencyself-evolving trainingduality operationsGRPOgeneralizationrobustness
0
0 comments X

The pith

SAGE enforces logical consistency across geometric transformations to strengthen spatial reasoning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often give correct answers to spatial questions but fail when the input is transformed in predictable ways, showing their reasoning is not robust. The paper introduces SAGE, a framework that uses duality operations on geometry and language to generate paired inputs with known answer mappings. It then adds a consistency reward during GRPO training and uses a dynamic pool to keep challenging the model with operations it has not yet mastered. This leads to measurable gains on benchmarks and better handling of new examples. A reader would care because it provides a lightweight way to improve reliability without needing vast new datasets.

Core claim

By treating logical consistency under geometric and linguistic duality as an auxiliary reward signal inside GRPO training and maintaining a dynamic operation pool that retires mastered transformations while introducing harder ones, the model learns to produce answers that remain coherent across original and transformed inputs, yielding consistent gains over baselines on spatial and video reasoning tasks together with improved generalization.

What carries the argument

The duality consistency reward within GRPO training, driven by a dynamic operation pool that continuously probes for logical inconsistencies using geometric and linguistic transformations.

If this is right

  • VLMs exhibit improved accuracy on spatial reasoning benchmarks after applying SAGE.
  • The method enhances generalization to data not seen during training.
  • SAGE can be used as a lightweight post-training stage on any existing vision-language model.
  • Training becomes more data-efficient than previous GRPO approaches by focusing on informative inconsistencies.
  • The dynamic pool ensures ongoing focus on challenging operations rather than static ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This consistency enforcement could be adapted to improve other forms of reasoning such as temporal or causal inference in multimodal models.
  • Models trained this way might show greater robustness when deployed in real-world environments with natural variations in viewpoint or description.
  • Combining SAGE with other alignment techniques could produce even stronger spatial capabilities.
  • Future tests might apply the same duality idea to non-spatial domains to check if the principle is general.

Load-bearing premise

That training for consistency under the probed geometric and linguistic transformations produces deep spatial understanding rather than pattern-matching to the specific operations used.

What would settle it

A test set of novel geometric transformations outside the dynamic operation pool on which the SAGE-trained model shows no improvement or even reduced performance compared to the baseline.

Figures

Figures reproduced from arXiv: 2605.18162 by Ding Wang, Junming Liu, Maonan Wang, Piotr Koniusz, Yifei Sun, Yirong Chen, Yuqi Li.

Figure 1
Figure 1. Figure 1: Analysis of spatial consistency in VLMs under geometric transformations. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the SAGE framework. Stage 1 probes [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training dynamics of SAGE. (a) Accuracy reward remains stable throughout training. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Operation lifecycle during training. (a) Active operation states across checkpoints: geo [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Representative examples from four spatial reasoning benchmarks. Each example shows the [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visual duality operation failure cases. Each row shows one operation type: horizontal flip, [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Linguistic duality operation failure cases. [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
read the original abstract

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SAGE (Spatial Alignment via Geometric Evolution), a self-evolving framework to improve spatial reasoning in Vision-Language Models. It adds duality consistency (across geometric and linguistic transformations) as an auxiliary reward inside GRPO training and maintains a dynamic operation pool that surfaces inconsistencies and retires mastered operations. The method is presented as model-agnostic and data-efficient; experiments on video and spatial-reasoning benchmarks are claimed to show gains over strong baselines together with better generalization to unseen data.

Significance. If the empirical claims hold, the work supplies a lightweight post-training recipe that could raise the logical robustness of existing VLMs on spatial tasks without large-scale data collection. The dynamic-pool mechanism for focusing training effort is a concrete engineering contribution that merits attention if it demonstrably avoids superficial consistency.

major comments (3)
  1. [§4] §4 (Duality Consistency Reward): the reward is defined solely from the model’s own outputs on transformed inputs. The manuscript must specify how the expected answer mappings for each duality operation are independently verified as ground truth rather than inferred from the same model outputs; otherwise the consistency signal risks circularity.
  2. [§5.2–5.3] §5.2–5.3 (Dynamic Operation Pool and Generalization Experiments): the claim that the pool produces robust generalization beyond the probed transformations requires explicit controls—e.g., held-out transformation families never seen during pool updates and quantitative comparison against a static-pool ablation. Current evidence appears insufficient to rule out memorization of the operation set.
  3. [§6] §6 (Experimental Results): the abstract and results section assert “consistent improvements” and “enhanced generalization” yet supply no numerical deltas, baseline identifiers, statistical tests, or ablation tables for the dynamic-pool component. These omissions prevent assessment of effect size and reliability.
minor comments (2)
  1. [Abstract] Abstract: replace the qualitative phrase “consistent improvements over strong baselines” with at least one concrete metric and benchmark name.
  2. [§3] Notation: define the geometric and linguistic duality operators with a short illustrative example (e.g., rotation + linguistic negation) before the formal reward equation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity, rigor, and evidence that we will address in the revised manuscript. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [§4] §4 (Duality Consistency Reward): the reward is defined solely from the model’s own outputs on transformed inputs. The manuscript must specify how the expected answer mappings for each duality operation are independently verified as ground truth rather than inferred from the same model outputs; otherwise the consistency signal risks circularity.

    Authors: We agree that explicit clarification is needed to rule out any perception of circularity. The expected answer mappings in SAGE are derived from predefined geometric and linguistic transformation rules (e.g., explicit spatial relation updates under rotation or translation) that are constructed independently of model outputs and follow deterministic logic. These rules serve as the ground truth for the duality consistency reward. To eliminate ambiguity, we will revise §4 to include a dedicated subsection detailing the construction and independent verification of these mappings, with concrete examples for each operation type. revision: yes

  2. Referee: [§5.2–5.3] §5.2–5.3 (Dynamic Operation Pool and Generalization Experiments): the claim that the pool produces robust generalization beyond the probed transformations requires explicit controls—e.g., held-out transformation families never seen during pool updates and quantitative comparison against a static-pool ablation. Current evidence appears insufficient to rule out memorization of the operation set.

    Authors: We acknowledge that the current generalization claims would benefit from stronger controls. While the dynamic pool mechanism is intended to prioritize inconsistent operations and retire mastered ones, the manuscript does not yet present held-out transformation families or a direct static-pool ablation. We will add these experiments: (1) a set of held-out transformation families excluded from all pool updates and training, and (2) a quantitative comparison against a static-pool baseline. The new results and analysis will be incorporated into §§5.2–5.3 to better substantiate the generalization claims. revision: yes

  3. Referee: [§6] §6 (Experimental Results): the abstract and results section assert “consistent improvements” and “enhanced generalization” yet supply no numerical deltas, baseline identifiers, statistical tests, or ablation tables for the dynamic-pool component. These omissions prevent assessment of effect size and reliability.

    Authors: We agree that the results section requires more precise reporting to allow proper evaluation. We will revise §6 (and update the abstract accordingly) to report explicit numerical deltas for all improvements, clearly name all baselines, include statistical significance tests, and add a dedicated ablation table isolating the contribution of the dynamic operation pool. These changes will make effect sizes and reliability transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The SAGE method defines duality consistency as an auxiliary reward signal within GRPO training, where geometric and linguistic transformations are applied to inputs and consistency is encouraged across original/transformed pairs. The dynamic operation pool selects based on detected inconsistencies from model outputs, but this constitutes a standard self-supervised RL loop with externally specified transformation rules rather than a reduction of the central claim to its own fitted outputs by construction. No equations or steps in the abstract or description equate a 'prediction' directly to an input parameter or rename a fit as a first-principles result. Evaluation relies on external video and spatial reasoning benchmarks, making the framework self-contained against independent test distributions. No self-citation chains or uniqueness theorems are invoked as load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; limited visibility into specific free parameters or axioms used in the full method.

axioms (1)
  • domain assumption Enforcing duality consistency between geometric transformations and linguistic answers improves robust spatial reasoning in VLMs
    This premise underpins the auxiliary reward and self-evolving training described in the abstract.

pith-pipeline@v0.9.0 · 5709 in / 947 out tokens · 54529 ms · 2026-05-20T11:29:50.326629+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 12 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Has- son, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´...

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631

  3. [3]

    Qwen2.5-vl technical report,

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,

  4. [4]

    URLhttps://arxiv.org/abs/2502.13923

  5. [5]

    a is b” fail to learn “b is a

    Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GPKTIktA0k

  6. [6]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In 2https://www.anthropic.com/news/claude-opus-4-6 10 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024

  7. [7]

    Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas

    Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=k7vcuqLK4X

  8. [8]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

  9. [9]

    Spatialrgpt: Grounded spatial reasoning in vision- language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision- language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sys- tems, volume 37, pages 135062–135093. Curran...

  10. [10]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceed...

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

  12. [12]

    Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, and Antoni B. Chan. Viss-r1: Self-supervised reinforcement video reasoning, 2025. URL https://arxiv.org/abs/2511. 13054

  13. [13]

    Video-r1: Reinforcing video reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in MLLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  14. [14]

    URLhttps://openreview.net/forum?id=a2JTVVvcEl

  15. [15]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InP...

  16. [16]

    Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,

  17. [17]

    URLhttps://arxiv.org/abs/2501.13826

  18. [18]

    Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, XinQiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InThe Fourteenth International Conference on Learning Representations,

  19. [19]

    URLhttps://openreview.net/forum?id=6nZKT2rL0H. 11

  20. [20]

    Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

    Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, and Tanuja Ganu. Faithful grpo: Improving visual spatial reasoning in multimodal language models via constrained policy optimization, 2026. URLhttps://arxiv.org/abs/2604.08476

  21. [21]

    SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

    Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Spatialevo: Self-evolving spatial intelligence via deterministic geometric environments, 2026. URL https: //a...

  22. [22]

    Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.ArXiv, abs/2505.21500, 2025

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial- bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. URL https://arxiv.org/abs/2505.21500

  23. [23]

    Spatialladder: Progressive training for spatial reasoning in vision-language models

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=KtrFXlvgrK

  24. [24]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings...

  25. [25]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, June 2024

  26. [26]

    OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding

    Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum? id=vAkVKIOtcN

  27. [27]

    Transactions of the Association for Computational Linguistics , volume =

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 06 2023. ISSN 2307-387X. doi: 10.1162/tacl_a_00566. URLhttps://doi.org/10.1162/tacl_a_00566

  28. [28]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288...

  29. [29]

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, Bangkok, Thailand, August 2024. Association for Computatio...

  30. [30]

    Video-ChatGPT: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, Bangkok, Thailan...

  31. [31]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning, 2025. URL https://arxiv. org/abs/2504.01805

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  33. [33]

    Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse

    Zhenyu Pan and Han Liu. Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=EdQzLC0Zra

  34. [34]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

  35. [35]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran Associ...

  36. [36]

    Vision- language models are zero-shot reward models for reinforcement learning

    Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision- language models are zero-shot reward models for reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=N0I2RtD8je

  37. [37]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

  38. [38]

    Ntu rgb+d: A large scale dataset for 3d human activity analysis

    Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  39. [39]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

  40. [40]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26160–26169, June 2025

  41. [41]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card, 2025. URLhttps://arxiv.org/abs/2601.03267

  42. [42]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc., 2...

  43. [43]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/ 2403.05530

  44. [44]

    More thought, less accuracy? on the dual nature of rea- soning in vision-language models

    Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Henry Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of rea- soning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=XpL5eqjCjF

  45. [45]

    Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fer- gus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. ...

  46. [46]

    VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4oYxzssbVg

  47. [47]

    SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization

    Peiyao Wang and Haibin Ling. SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization. InThe First Workshop on Efficient Spatial Reasoning, 2026. URLhttps://openreview.net/forum?id=o2E8oa2frj

  48. [48]

    VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum? id=3pORFyKzh1

  49. [49]

    Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B

    Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. Video-RTS: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in N...

  50. [50]

    URLhttps://aclanthology.org/2025.emnlp-main.1428/

  51. [51]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Assoc...

  52. [52]

    Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id= RnXS7aK4rK

  53. [54]

    URLhttps://openreview.net/forum?id=yyWeSAsOhs

  54. [55]

    Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  55. [56]

    URLhttps://openreview.net/forum?id=yyWeSAsOhs. 14

  56. [57]

    FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation

    Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao, Shurong Zheng, Yaowei Wang, Ming Tang, and Jinqiao Wang. FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. URLhttps://openreview.net/forum?id=FACJ0478oQ

  57. [58]

    Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10632–10643, June 2025

  58. [59]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, June 2025

  59. [60]

    Visionthink: Smart and efficient vision language model via reinforcement learning

    Senqiao Yang, Junyi Li, Xin Lai, Jinming Wu, Wei Li, Zejun MA, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision language model via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=R6m6bNnmWm

  60. [61]

    MMSI- bench: A benchmark for multi-image spatial intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI- bench: A benchmark for multi-image spatial intelligence. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=gHRoX4vXm3

  61. [62]

    Spatial mental modeling from limited views

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial mental modeling from limited views. InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025. URL https:...

  62. [63]

    Fine-tuning large vision-language models as decision-making agents via reinforcement learning

    Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informati...

  63. [64]

    Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Yansong Feng and Els Lefever, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, December 2023. Association for Computational Linguistic...

  64. [65]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d

    Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

  65. [66]

    URLhttps://openreview.net/forum?id=GzgPleFl8f

  66. [67]

    IEEE Transactions on Pattern Analysis and Machine Intelligence46(8), 5625–5644 (2024) https://doi.org/10.1109/TPAMI.2024.3369699

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024. doi: 10.1109/TPAMI.2024.3369699

  67. [68]

    CoFFT: Chain of foresight-focus thought for visual language models

    Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, and Mike Zheng Shou. CoFFT: Chain of foresight-focus thought for visual language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  68. [69]

    URLhttps://openreview.net/forum?id=AQnjBIFCBQ. 15

  69. [70]

    Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning

    Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, and Wenwu Zhu. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. InProceedings of the 33rd ACM International Conference on Multimedia, MM ’25, page 11071–11080, New York, ...

  70. [71]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, Chengye Wang, Ziyao Shangguan, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. Mmvu: Measuring expert-level multi-discipline video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (...

  71. [72]

    Towards learning a generalist model for embodied navigation

    Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. Towards learning a generalist model for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13624–13634, June 2024

  72. [73]

    International Journal of Computer Vision , author =

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, 2022. doi: 10.1007/s11263-022-01653-1. URLhttps://doi.org/10.1007/s11263-022-01653-1

  73. [74]

    Program-of-thought reveals LLM abstraction ceilings

    Mike Zhou, Fenil Bardoliya, Vivek Gupta, and Dan Roth. Program-of-thought reveals LLM abstraction ceilings. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Findings of the Association for Computational Linguistics: EACL 2026, pages 4911–4919, Rabat, Morocco, March 2026. Association for Computational Linguistics. ISBN 979-8-89176-386-

  74. [75]

    URL https://aclanthology.org/2026

    doi: 10.18653/v1/2026.findings-eacl.257. URL https://aclanthology.org/2026. findings-eacl.257/

  75. [76]

    Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16):13782–13790, Mar

  76. [77]

    What is the action performed by the person in the video?

    doi: 10.1609/aaai.v40i16.38386. URL https://ojs.aaai.org/index.php/AAAI/ article/view/38386. 16 A Pseudocode In this section, we present the pseudocode of SAGE, as shown in Algorithm 1. Algorithm 1Self-Evolving Consistency Training 1: Input:VLM Mθ, reference Mθref, dataset D, pool P, eval interval E, max active K, mastery thresholdτ 2:Initialize active se...