Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Ding Wang; Junming Liu; Maonan Wang; Piotr Koniusz; Yifei Sun; Yirong Chen; Yuqi Li

arxiv: 2605.18162 · v1 · pith:S2SNMD3Znew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

Junming Liu , Yuqi Li , Yifei Sun , Maonan Wang , Piotr Koniusz , Yirong Chen , Ding Wang This is my paper

Pith reviewed 2026-05-20 11:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords spatial reasoningvision-language modelsgeometric consistencyself-evolving trainingduality operationsGRPOgeneralizationrobustness

0 comments

The pith

SAGE enforces logical consistency across geometric transformations to strengthen spatial reasoning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often give correct answers to spatial questions but fail when the input is transformed in predictable ways, showing their reasoning is not robust. The paper introduces SAGE, a framework that uses duality operations on geometry and language to generate paired inputs with known answer mappings. It then adds a consistency reward during GRPO training and uses a dynamic pool to keep challenging the model with operations it has not yet mastered. This leads to measurable gains on benchmarks and better handling of new examples. A reader would care because it provides a lightweight way to improve reliability without needing vast new datasets.

Core claim

By treating logical consistency under geometric and linguistic duality as an auxiliary reward signal inside GRPO training and maintaining a dynamic operation pool that retires mastered transformations while introducing harder ones, the model learns to produce answers that remain coherent across original and transformed inputs, yielding consistent gains over baselines on spatial and video reasoning tasks together with improved generalization.

What carries the argument

The duality consistency reward within GRPO training, driven by a dynamic operation pool that continuously probes for logical inconsistencies using geometric and linguistic transformations.

If this is right

VLMs exhibit improved accuracy on spatial reasoning benchmarks after applying SAGE.
The method enhances generalization to data not seen during training.
SAGE can be used as a lightweight post-training stage on any existing vision-language model.
Training becomes more data-efficient than previous GRPO approaches by focusing on informative inconsistencies.
The dynamic pool ensures ongoing focus on challenging operations rather than static ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This consistency enforcement could be adapted to improve other forms of reasoning such as temporal or causal inference in multimodal models.
Models trained this way might show greater robustness when deployed in real-world environments with natural variations in viewpoint or description.
Combining SAGE with other alignment techniques could produce even stronger spatial capabilities.
Future tests might apply the same duality idea to non-spatial domains to check if the principle is general.

Load-bearing premise

That training for consistency under the probed geometric and linguistic transformations produces deep spatial understanding rather than pattern-matching to the specific operations used.

What would settle it

A test set of novel geometric transformations outside the dynamic operation pool on which the SAGE-trained model shows no improvement or even reduced performance compared to the baseline.

Figures

Figures reproduced from arXiv: 2605.18162 by Ding Wang, Junming Liu, Maonan Wang, Piotr Koniusz, Yifei Sun, Yirong Chen, Yuqi Li.

**Figure 2.** Figure 2: Overview of the SAGE framework. Stage 1 probes [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Training dynamics of SAGE. (a) Accuracy reward remains stable throughout training. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Operation lifecycle during training. (a) Active operation states across checkpoints: geo [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Representative examples from four spatial reasoning benchmarks. Each example shows the [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Visual duality operation failure cases. Each row shows one operation type: horizontal flip, [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Linguistic duality operation failure cases. [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

read the original abstract

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SAGE (Spatial Alignment via Geometric Evolution), a self-evolving framework to improve spatial reasoning in Vision-Language Models. It adds duality consistency (across geometric and linguistic transformations) as an auxiliary reward inside GRPO training and maintains a dynamic operation pool that surfaces inconsistencies and retires mastered operations. The method is presented as model-agnostic and data-efficient; experiments on video and spatial-reasoning benchmarks are claimed to show gains over strong baselines together with better generalization to unseen data.

Significance. If the empirical claims hold, the work supplies a lightweight post-training recipe that could raise the logical robustness of existing VLMs on spatial tasks without large-scale data collection. The dynamic-pool mechanism for focusing training effort is a concrete engineering contribution that merits attention if it demonstrably avoids superficial consistency.

major comments (3)

[§4] §4 (Duality Consistency Reward): the reward is defined solely from the model’s own outputs on transformed inputs. The manuscript must specify how the expected answer mappings for each duality operation are independently verified as ground truth rather than inferred from the same model outputs; otherwise the consistency signal risks circularity.
[§5.2–5.3] §5.2–5.3 (Dynamic Operation Pool and Generalization Experiments): the claim that the pool produces robust generalization beyond the probed transformations requires explicit controls—e.g., held-out transformation families never seen during pool updates and quantitative comparison against a static-pool ablation. Current evidence appears insufficient to rule out memorization of the operation set.
[§6] §6 (Experimental Results): the abstract and results section assert “consistent improvements” and “enhanced generalization” yet supply no numerical deltas, baseline identifiers, statistical tests, or ablation tables for the dynamic-pool component. These omissions prevent assessment of effect size and reliability.

minor comments (2)

[Abstract] Abstract: replace the qualitative phrase “consistent improvements over strong baselines” with at least one concrete metric and benchmark name.
[§3] Notation: define the geometric and linguistic duality operators with a short illustrative example (e.g., rotation + linguistic negation) before the formal reward equation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity, rigor, and evidence that we will address in the revised manuscript. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [§4] §4 (Duality Consistency Reward): the reward is defined solely from the model’s own outputs on transformed inputs. The manuscript must specify how the expected answer mappings for each duality operation are independently verified as ground truth rather than inferred from the same model outputs; otherwise the consistency signal risks circularity.

Authors: We agree that explicit clarification is needed to rule out any perception of circularity. The expected answer mappings in SAGE are derived from predefined geometric and linguistic transformation rules (e.g., explicit spatial relation updates under rotation or translation) that are constructed independently of model outputs and follow deterministic logic. These rules serve as the ground truth for the duality consistency reward. To eliminate ambiguity, we will revise §4 to include a dedicated subsection detailing the construction and independent verification of these mappings, with concrete examples for each operation type. revision: yes
Referee: [§5.2–5.3] §5.2–5.3 (Dynamic Operation Pool and Generalization Experiments): the claim that the pool produces robust generalization beyond the probed transformations requires explicit controls—e.g., held-out transformation families never seen during pool updates and quantitative comparison against a static-pool ablation. Current evidence appears insufficient to rule out memorization of the operation set.

Authors: We acknowledge that the current generalization claims would benefit from stronger controls. While the dynamic pool mechanism is intended to prioritize inconsistent operations and retire mastered ones, the manuscript does not yet present held-out transformation families or a direct static-pool ablation. We will add these experiments: (1) a set of held-out transformation families excluded from all pool updates and training, and (2) a quantitative comparison against a static-pool baseline. The new results and analysis will be incorporated into §§5.2–5.3 to better substantiate the generalization claims. revision: yes
Referee: [§6] §6 (Experimental Results): the abstract and results section assert “consistent improvements” and “enhanced generalization” yet supply no numerical deltas, baseline identifiers, statistical tests, or ablation tables for the dynamic-pool component. These omissions prevent assessment of effect size and reliability.

Authors: We agree that the results section requires more precise reporting to allow proper evaluation. We will revise §6 (and update the abstract accordingly) to report explicit numerical deltas for all improvements, clearly name all baselines, include statistical significance tests, and add a dedicated ablation table isolating the contribution of the dynamic operation pool. These changes will make effect sizes and reliability transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The SAGE method defines duality consistency as an auxiliary reward signal within GRPO training, where geometric and linguistic transformations are applied to inputs and consistency is encouraged across original/transformed pairs. The dynamic operation pool selects based on detected inconsistencies from model outputs, but this constitutes a standard self-supervised RL loop with externally specified transformation rules rather than a reduction of the central claim to its own fitted outputs by construction. No equations or steps in the abstract or description equate a 'prediction' directly to an input parameter or rename a fit as a first-principles result. Evaluation relies on external video and spatial reasoning benchmarks, making the framework self-contained against independent test distributions. No self-citation chains or uniqueness theorems are invoked as load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; limited visibility into specific free parameters or axioms used in the full method.

axioms (1)

domain assumption Enforcing duality consistency between geometric transformations and linguistic answers improves robust spatial reasoning in VLMs
This premise underpins the auxiliary reward and self-evolving training described in the abstract.

pith-pipeline@v0.9.0 · 5709 in / 947 out tokens · 54529 ms · 2026-05-20T11:29:50.326629+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 12 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Has- son, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´...

work page 2022
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen2.5-vl technical report,

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,

work page
[4]

URLhttps://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv
[5]

a is b” fail to learn “b is a

Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GPKTIktA0k

work page 2024
[6]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In 2https://www.anthropic.com/news/claude-opus-4-6 10 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024

work page 2024
[7]

Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=k7vcuqLK4X

work page 2025
[8]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Spatialrgpt: Grounded spatial reasoning in vision- language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision- language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sys- tems, volume 37, pages 135062–135093. Curran...

work page 2024
[10]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceed...

work page 2017
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, and Antoni B. Chan. Viss-r1: Self-supervised reinforcement video reasoning, 2025. URL https://arxiv.org/abs/2511. 13054

work page 2025
[13]

Video-r1: Reinforcing video reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in MLLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page
[14]

URLhttps://openreview.net/forum?id=a2JTVVvcEl

work page
[15]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InP...

work page 2025
[16]

Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,

work page
[17]

URLhttps://arxiv.org/abs/2501.13826

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, XinQiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InThe Fourteenth International Conference on Learning Representations,

work page
[19]

URLhttps://openreview.net/forum?id=6nZKT2rL0H. 11

work page
[20]

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, and Tanuja Ganu. Faithful grpo: Improving visual spatial reasoning in multimodal language models via constrained policy optimization, 2026. URLhttps://arxiv.org/abs/2604.08476

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Spatialevo: Self-evolving spatial intelligence via deterministic geometric environments, 2026. URL https: //a...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.ArXiv, abs/2505.21500, 2025

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial- bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. URL https://arxiv.org/abs/2505.21500

work page arXiv 2025
[23]

Spatialladder: Progressive training for spatial reasoning in vision-language models

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=KtrFXlvgrK

work page 2026
[24]

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings...

work page 2022
[25]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, June 2024

work page 2024
[26]

OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding

Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum? id=vAkVKIOtcN

work page 2026
[27]

Transactions of the Association for Computational Linguistics , volume =

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 06 2023. ISSN 2307-387X. doi: 10.1162/tacl_a_00566. URLhttps://doi.org/10.1162/tacl_a_00566

work page doi:10.1162/tacl_a_00566 2023
[28]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288...

work page 2023
[29]

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, Bangkok, Thailand, August 2024. Association for Computatio...

work page doi:10.18653/v1/2024.findings-acl.517 2024
[30]

Video-ChatGPT: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, Bangkok, Thailan...

work page 2024
[31]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning, 2025. URL https://arxiv. org/abs/2504.01805

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022
[33]

Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse

Zhenyu Pan and Han Liu. Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=EdQzLC0Zra

work page 2026
[34]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

work page 2021
[35]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran Associ...

work page 2023
[36]

Vision- language models are zero-shot reward models for reinforcement learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision- language models are zero-shot reward models for reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=N0I2RtD8je

work page 2024
[37]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

Ntu rgb+d: A large scale dataset for 3d human activity analysis

Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

work page 2016
[39]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26160–26169, June 2025

work page 2025
[41]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card, 2025. URLhttps://arxiv.org/abs/2601.03267

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc., 2...

work page 2020
[43]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/ 2403.05530

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

More thought, less accuracy? on the dual nature of rea- soning in vision-language models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Henry Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of rea- soning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=XpL5eqjCjF

work page 2026
[45]

Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fer- gus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. ...

work page 2024
[46]

VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4oYxzssbVg

work page 2025
[47]

SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization

Peiyao Wang and Haibin Ling. SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization. InThe First Workshop on Efficient Spatial Reasoning, 2026. URLhttps://openreview.net/forum?id=o2E8oa2frj

work page 2026
[48]

VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum? id=3pORFyKzh1

work page 2025
[49]

Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B

Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. Video-RTS: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in N...

work page doi:10.18653/v1/2025.emnlp-main 2025
[50]

URLhttps://aclanthology.org/2025.emnlp-main.1428/

work page 2025
[51]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Assoc...

work page 2022
[52]

Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id= RnXS7aK4rK

work page 2025
[54]

URLhttps://openreview.net/forum?id=yyWeSAsOhs

work page
[55]

Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page
[56]

URLhttps://openreview.net/forum?id=yyWeSAsOhs. 14

work page
[57]

FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation

Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao, Shurong Zheng, Yaowei Wang, Ming Tang, and Jinqiao Wang. FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. URLhttps://openreview.net/forum?id=FACJ0478oQ

work page 2025
[58]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10632–10643, June 2025

work page 2025
[59]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, June 2025

work page 2025
[60]

Visionthink: Smart and efficient vision language model via reinforcement learning

Senqiao Yang, Junyi Li, Xin Lai, Jinming Wu, Wei Li, Zejun MA, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision language model via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=R6m6bNnmWm

work page 2025
[61]

MMSI- bench: A benchmark for multi-image spatial intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI- bench: A benchmark for multi-image spatial intelligence. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=gHRoX4vXm3

work page 2026
[62]

Spatial mental modeling from limited views

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial mental modeling from limited views. InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025. URL https:...

work page 2025
[63]

Fine-tuning large vision-language models as decision-making agents via reinforcement learning

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informati...

work page doi:10.52202/079017-3522 2024
[64]

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Yansong Feng and Els Lefever, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, December 2023. Association for Computational Linguistic...

work page doi:10.18653/v1/2023.emnlp-demo.49 2023
[65]

From flatland to space: Teaching vision-language models to perceive and reason in 3d

Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

work page
[66]

URLhttps://openreview.net/forum?id=GzgPleFl8f

work page
[67]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(8), 5625–5644 (2024) https://doi.org/10.1109/TPAMI.2024.3369699

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024. doi: 10.1109/TPAMI.2024.3369699

work page doi:10.1109/tpami.2024.3369699 2024
[68]

CoFFT: Chain of foresight-focus thought for visual language models

Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, and Mike Zheng Shou. CoFFT: Chain of foresight-focus thought for visual language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page
[69]

URLhttps://openreview.net/forum?id=AQnjBIFCBQ. 15

work page
[70]

Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning

Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, and Wenwu Zhu. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. InProceedings of the 33rd ACM International Conference on Multimedia, MM ’25, page 11071–11080, New York, ...

work page doi:10.1145/3746027.3755703 2025
[71]

Mmvu: Measuring expert-level multi-discipline video understanding

Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, Chengye Wang, Ziyao Shangguan, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. Mmvu: Measuring expert-level multi-discipline video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (...

work page 2025
[72]

Towards learning a generalist model for embodied navigation

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. Towards learning a generalist model for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13624–13634, June 2024

work page 2024
[73]

International Journal of Computer Vision , author =

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, 2022. doi: 10.1007/s11263-022-01653-1. URLhttps://doi.org/10.1007/s11263-022-01653-1

work page doi:10.1007/s11263-022-01653-1 2022
[74]

Program-of-thought reveals LLM abstraction ceilings

Mike Zhou, Fenil Bardoliya, Vivek Gupta, and Dan Roth. Program-of-thought reveals LLM abstraction ceilings. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Findings of the Association for Computational Linguistics: EACL 2026, pages 4911–4919, Rabat, Morocco, March 2026. Association for Computational Linguistics. ISBN 979-8-89176-386-

work page 2026
[75]

URL https://aclanthology.org/2026

doi: 10.18653/v1/2026.findings-eacl.257. URL https://aclanthology.org/2026. findings-eacl.257/

work page doi:10.18653/v1/2026.findings-eacl.257 2026
[76]

Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16):13782–13790, Mar

work page
[77]

What is the action performed by the person in the video?

doi: 10.1609/aaai.v40i16.38386. URL https://ojs.aaai.org/index.php/AAAI/ article/view/38386. 16 A Pseudocode In this section, we present the pseudocode of SAGE, as shown in Algorithm 1. Algorithm 1Self-Evolving Consistency Training 1: Input:VLM Mθ, reference Mθref, dataset D, pool P, eval interval E, max active K, mastery thresholdτ 2:Initialize active se...

work page doi:10.1609/aaai.v40i16.38386

[1] [1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Has- son, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´...

work page 2022

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen2.5-vl technical report,

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,

work page

[4] [4]

URLhttps://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

a is b” fail to learn “b is a

Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GPKTIktA0k

work page 2024

[6] [6]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In 2https://www.anthropic.com/news/claude-opus-4-6 10 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024

work page 2024

[7] [7]

Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for VLMs? an attention mechanism perspective on focus areas. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=k7vcuqLK4X

work page 2025

[8] [8]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Spatialrgpt: Grounded spatial reasoning in vision- language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision- language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sys- tems, volume 37, pages 135062–135093. Curran...

work page 2024

[10] [10]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceed...

work page 2017

[11] [11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, and Antoni B. Chan. Viss-r1: Self-supervised reinforcement video reasoning, 2025. URL https://arxiv.org/abs/2511. 13054

work page 2025

[13] [13]

Video-r1: Reinforcing video reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in MLLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page

[14] [14]

URLhttps://openreview.net/forum?id=a2JTVVvcEl

work page

[15] [15]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InP...

work page 2025

[16] [16]

Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos,

work page

[17] [17]

URLhttps://arxiv.org/abs/2501.13826

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, XinQiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. InThe Fourteenth International Conference on Learning Representations,

work page

[19] [19]

URLhttps://openreview.net/forum?id=6nZKT2rL0H. 11

work page

[20] [20]

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, and Tanuja Ganu. Faithful grpo: Improving visual spatial reasoning in multimodal language models via constrained policy optimization, 2026. URLhttps://arxiv.org/abs/2604.08476

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. Spatialevo: Self-evolving spatial intelligence via deterministic geometric environments, 2026. URL https: //a...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [22]

Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models.ArXiv, abs/2505.21500, 2025

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial- bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. URL https://arxiv.org/abs/2505.21500

work page arXiv 2025

[23] [23]

Spatialladder: Progressive training for spatial reasoning in vision-language models

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=KtrFXlvgrK

work page 2026

[24] [24]

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings...

work page 2022

[25] [25]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, June 2024

work page 2024

[26] [26]

OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding

Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. OST-bench: Evaluating the capabilities of MLLMs in online spatio-temporal scene understanding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum? id=vAkVKIOtcN

work page 2026

[27] [27]

Transactions of the Association for Computational Linguistics , volume =

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 06 2023. ISSN 2307-387X. doi: 10.1162/tacl_a_00566. URLhttps://doi.org/10.1162/tacl_a_00566

work page doi:10.1162/tacl_a_00566 2023

[28] [28]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288...

work page 2023

[29] [29]

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. TempCompass: Do video LLMs really understand videos? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, Bangkok, Thailand, August 2024. Association for Computatio...

work page doi:10.18653/v1/2024.findings-acl.517 2024

[30] [30]

Video-ChatGPT: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, Bangkok, Thailan...

work page 2024

[31] [31]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning, 2025. URL https://arxiv. org/abs/2504.01805

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022

[33] [33]

Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse

Zhenyu Pan and Han Liu. Metaspatial: Reinforcing 3d spatial reasoning in VLMs for the metaverse. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=EdQzLC0Zra

work page 2026

[34] [34]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machin...

work page 2021

[35] [35]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran Associ...

work page 2023

[36] [36]

Vision- language models are zero-shot reward models for reinforcement learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision- language models are zero-shot reward models for reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=N0I2RtD8je

work page 2024

[37] [37]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017

[38] [38]

Ntu rgb+d: A large scale dataset for 3d human activity analysis

Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+d: A large scale dataset for 3d human activity analysis. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

work page 2016

[39] [39]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26160–26169, June 2025

work page 2025

[41] [41]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card, 2025. URLhttps://arxiv.org/abs/2601.03267

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc., 2...

work page 2020

[43] [43]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/ 2403.05530

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

More thought, less accuracy? on the dual nature of rea- soning in vision-language models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Henry Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of rea- soning in vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=XpL5eqjCjF

work page 2026

[45] [45]

Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fer- gus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. ...

work page 2024

[46] [46]

VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. VL- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4oYxzssbVg

work page 2025

[47] [47]

SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization

Peiyao Wang and Haibin Ling. SVQA-r1: Reinforcing spatial reasoning in MLLMs via view- consistent reward optimization. InThe First Workshop on Efficient Spatial Reasoning, 2026. URLhttps://openreview.net/forum?id=o2E8oa2frj

work page 2026

[48] [48]

VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. VideoRFT: Incentivizing video rea- soning capability in MLLMs via reinforced fine-tuning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum? id=3pORFyKzh1

work page 2025

[49] [49]

Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B

Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. Video-RTS: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in N...

work page doi:10.18653/v1/2025.emnlp-main 2025

[50] [50]

URLhttps://aclanthology.org/2025.emnlp-main.1428/

work page 2025

[51] [51]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Assoc...

work page 2022

[52] [52]

Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id= RnXS7aK4rK

work page 2025

[53] [54]

URLhttps://openreview.net/forum?id=yyWeSAsOhs

work page

[54] [55]

Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page

[55] [56]

URLhttps://openreview.net/forum?id=yyWeSAsOhs. 14

work page

[56] [57]

FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation

Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao, Shurong Zheng, Yaowei Wang, Ming Tang, and Jinqiao Wang. FOCUS: Unified vision-language modeling for interactive editing driven by referential segmentation. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. URLhttps://openreview.net/forum?id=FACJ0478oQ

work page 2025

[57] [58]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10632–10643, June 2025

work page 2025

[58] [59]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, June 2025

work page 2025

[59] [60]

Visionthink: Smart and efficient vision language model via reinforcement learning

Senqiao Yang, Junyi Li, Xin Lai, Jinming Wu, Wei Li, Zejun MA, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision language model via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=R6m6bNnmWm

work page 2025

[60] [61]

MMSI- bench: A benchmark for multi-image spatial intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI- bench: A benchmark for multi-image spatial intelligence. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=gHRoX4vXm3

work page 2026

[61] [62]

Spatial mental modeling from limited views

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial mental modeling from limited views. InNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025. URL https:...

work page 2025

[62] [63]

Fine-tuning large vision-language models as decision-making agents via reinforcement learning

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Informati...

work page doi:10.52202/079017-3522 2024

[63] [64]

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Yansong Feng and Els Lefever, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, December 2023. Association for Computational Linguistic...

work page doi:10.18653/v1/2023.emnlp-demo.49 2023

[64] [65]

From flatland to space: Teaching vision-language models to perceive and reason in 3d

Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

work page

[65] [66]

URLhttps://openreview.net/forum?id=GzgPleFl8f

work page

[66] [67]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(8), 5625–5644 (2024) https://doi.org/10.1109/TPAMI.2024.3369699

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8): 5625–5644, 2024. doi: 10.1109/TPAMI.2024.3369699

work page doi:10.1109/tpami.2024.3369699 2024

[67] [68]

CoFFT: Chain of foresight-focus thought for visual language models

Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, and Mike Zheng Shou. CoFFT: Chain of foresight-focus thought for visual language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page

[68] [69]

URLhttps://openreview.net/forum?id=AQnjBIFCBQ. 15

work page

[69] [70]

Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning

Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, and Wenwu Zhu. Embodied-r: Collaborative framework for activating embodied spatial reasoning in foundation models via reinforcement learning. InProceedings of the 33rd ACM International Conference on Multimedia, MM ’25, page 11071–11080, New York, ...

work page doi:10.1145/3746027.3755703 2025

[70] [71]

Mmvu: Measuring expert-level multi-discipline video understanding

Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, Chengye Wang, Ziyao Shangguan, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. Mmvu: Measuring expert-level multi-discipline video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (...

work page 2025

[71] [72]

Towards learning a generalist model for embodied navigation

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. Towards learning a generalist model for embodied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13624–13634, June 2024

work page 2024

[72] [73]

International Journal of Computer Vision , author =

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision- language models.International Journal of Computer Vision, 130(9):2337–2348, 2022. doi: 10.1007/s11263-022-01653-1. URLhttps://doi.org/10.1007/s11263-022-01653-1

work page doi:10.1007/s11263-022-01653-1 2022

[73] [74]

Program-of-thought reveals LLM abstraction ceilings

Mike Zhou, Fenil Bardoliya, Vivek Gupta, and Dan Roth. Program-of-thought reveals LLM abstraction ceilings. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Findings of the Association for Computational Linguistics: EACL 2026, pages 4911–4919, Rabat, Morocco, March 2026. Association for Computational Linguistics. ISBN 979-8-89176-386-

work page 2026

[74] [75]

URL https://aclanthology.org/2026

doi: 10.18653/v1/2026.findings-eacl.257. URL https://aclanthology.org/2026. findings-eacl.257/

work page doi:10.18653/v1/2026.findings-eacl.257 2026

[75] [76]

Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16):13782–13790, Mar

work page

[76] [77]

What is the action performed by the person in the video?

doi: 10.1609/aaai.v40i16.38386. URL https://ojs.aaai.org/index.php/AAAI/ article/view/38386. 16 A Pseudocode In this section, we present the pseudocode of SAGE, as shown in Algorithm 1. Algorithm 1Self-Evolving Consistency Training 1: Input:VLM Mθ, reference Mθref, dataset D, pool P, eval interval E, max active K, mastery thresholdτ 2:Initialize active se...

work page doi:10.1609/aaai.v40i16.38386