pith. machine review for the scientific record.

arxiv: 2604.02870 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 Lean theorem links

Token Warping Helps MLLMs Look from Nearby Viewpoints

Chanho Park, Juil Koo, Mingue Park, Minhyuk Sung, Phillip Y. Lee, Seungwoo Yoo

Pith reviewed 2026-05-13 21:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords: token warping · MLLMs · viewpoint transformation · ViewBench · backward warping · mental imagery · ViT

The pith

Warping tokens rather than pixels enables multimodal large language models to reason reliably from nearby viewpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether warping image tokens inside ViT-based MLLMs can let them understand how a scene looks from a nearby viewpoint. Pixel warping often fails because small depth mistakes create large geometric distortions, but token warping avoids this by operating on part-level features instead. Backward token warping proves more stable than forward warping: it builds a grid in the target view and pulls matching tokens from the source view, keeping semantic structure intact. Tests on the new ViewBench benchmark show this token method beats pixel warping, spatially fine-tuned MLLMs, and generative warping baselines.

Core claim

Backward token warping defines a dense grid on the target viewpoint and retrieves the corresponding source-view token for each grid point, supplying the MLLM with a viewpoint-shifted token map that preserves semantic coherence without the distortions introduced by pixel-level operations.

What carries the argument

Backward token warping on ViT image tokens, which maps each point in a target-view grid back to its matching token in the source view to create a coherent warped representation.
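
As a concrete illustration, the sketch below implements this retrieval loop under simplifying assumptions that the summary above does not spell out: both views share the intrinsics K, the two images have the same resolution, a depth map for the target view is available (for instance by forward-warping the source depth), and each target grid point fetches its nearest source token. The function name and signature are illustrative, not the paper's code.

```python
import numpy as np

def backward_token_warp(src_tokens, tgt_depth, K, T_tgt_to_src, patch=14):
    """Hedged sketch of backward token warping on a ViT token grid.

    src_tokens   : (Ht, Wt, C) image tokens from the source view.
    tgt_depth    : (H, W) depth map for the target view (assumed available).
    K            : (3, 3) shared camera intrinsics.
    T_tgt_to_src : (4, 4) rigid transform from target-camera to source-camera coordinates.
    Returns a (Ht, Wt, C) token map laid out on the target-view grid.
    """
    Ht, Wt, C = src_tokens.shape
    H, W = tgt_depth.shape
    K_inv = np.linalg.inv(K)
    warped = np.zeros_like(src_tokens)

    for i in range(Ht):                                  # rows of the target grid
        for j in range(Wt):                              # cols of the target grid
            u = (j + 0.5) * patch                        # pixel centre of this target patch
            v = (i + 0.5) * patch
            d = tgt_depth[min(int(v), H - 1), min(int(u), W - 1)]
            p_tgt = d * (K_inv @ np.array([u, v, 1.0]))  # unproject to 3D in the target frame
            p_src = (T_tgt_to_src @ np.append(p_tgt, 1.0))[:3]
            if p_src[2] <= 0:                            # point falls behind the source camera
                continue
            uv = K @ (p_src / p_src[2])                  # project into the source image
            si, sj = int(uv[1] // patch), int(uv[0] // patch)
            if 0 <= si < Ht and 0 <= sj < Wt:            # nearest-token fetch from the source grid
                warped[i, j] = src_tokens[si, sj]
    return warped
```

Pixel-wise backward warping would instead fetch individual pixels at the mapped coordinates and re-patchify the warped image, which is where the local distortions described in Figure 4 enter.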

If this is right

  • MLLMs can handle small viewpoint changes without retraining or pixel-level editing.
  • Backward token warping maintains semantic coherence better than pixel warping or generative alternatives.
  • The same token-warping step works across different ViT-based MLLM architectures.
  • ViewBench provides a concrete testbed for measuring viewpoint robustness in multimodal models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • ViT tokens may already encode enough part structure to support simple mental-rotation operations inside language models.
  • Chaining multiple small warps could extend the method to moderately larger viewpoint differences.
  • Robotics and augmented-reality systems that need quick viewpoint-invariant descriptions could adopt token warping as a lightweight preprocessing step.

Load-bearing premise

Image tokens produced by the vision transformer correspond to part-level structural representations that can be warped to simulate viewpoint changes.

What would settle it

A direct comparison on ViewBench where token warping produces no gain or lower accuracy than pixel warping once viewpoint shifts introduce noticeable occlusions or depth errors.

Figures

Figures reproduced from arXiv: 2604.02870 by Chanho Park, Juil Koo, Mingue Park, Minhyuk Sung, Phillip Y. Lee, Seungwoo Yoo.

Figure 1
Figure 1: Viewpoint Change via Token Warping. We explore token warping as a means of enabling viewpoint changes for MLLMs and find that backward token warping can reliably transfer source image content to novel viewpoints without synthesizing new pixels.
Figure 2
Figure 2: Image Tokenization in MLLMs (Sec. 3.1). MLLMs process images by dividing them into fixed-size patches, embedding each patch, and passing them through a vision encoder (e.g., ViT) to obtain image tokens.
Figure 3
Figure 3: Limitations of Pixel-Wise Warping. Pixel-wise warping to a target viewpoint often introduces local distortions and semantic degradation. In both forward (top) and backward (bottom) warping, the book from the source view appears significantly distorted after transformation (in the red box).
Figure 4
Figure 4: Pixel-Wise vs. Token Warping. Comparison of inverse warping strategies (Sec. 3.3). (A) Pixel-wise warping retrieves pixels for each target coordinate, but patchifying the warped image introduces local distortions, resulting in degraded MLLM understanding. (B) Token warping directly retrieves intact tokens (or patches) from the source view, preserving semantics and improving viewpoint-aware perception.
Figure 5
Figure 5: Fetching Position Noise Sensitivity (Sec. 3.2). Through a toy experiment on CV-Bench-2D [93], where we emulate local positional perturbations and degradation introduced by warping, we find that token representations in MLLMs are highly robust to noise in the image positions from which tokens are fetched. This suggests that tokens are well suited for representing viewpoint changes.
Figure 6
Figure 6: ViewBench. Example source-target image pairs with corresponding questions and answers from our ViewBench benchmark. The tasks evaluate MLLM's ability to infer spatial relationships from nearby viewpoints (Text, Shape), while also measuring robustness to view changes by asking to describe object properties visible in the warped target view (Object).
Figure 7
Figure 7: Token Fetching Strategies. (A) Nearest fetching selects the closest existing token from the source image grid. (B) Adaptive fetching dynamically crops a patch centered at the mapped coordinate to derive a token precisely centered at the target location. (A minimal illustrative sketch of both fetching strategies follows the figure list.)
Figure 8
Figure 8: Warping Visualizations. We compare the warped results of pixel-wise warping, token warping, and the generative NVS output [85]. The rightmost image shows the ground-truth target viewpoint. For token warping, we visualize the RGB image patches corresponding to each token for illustration only. Above each row, we provide the question Q from ViewBench, and below each image we show the response from Qwen2.5-VL…
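
Figure 7's two fetching strategies can be sketched in a few lines. This is an illustration under the same assumptions as the warping sketch above; `embed_fn` stands in for a ViT patch-embedding call and is an assumption, not an API from the paper.

```python
import numpy as np

def nearest_fetch(src_tokens, u, v, patch=14):
    """(A) Nearest fetching: snap the mapped source-image coordinate (u, v)
    to the closest existing token on the source token grid."""
    Ht, Wt, _ = src_tokens.shape
    j = int(np.clip(u // patch, 0, Wt - 1))
    i = int(np.clip(v // patch, 0, Ht - 1))
    return src_tokens[i, j]

def adaptive_fetch(src_image, embed_fn, u, v, patch=14):
    """(B) Adaptive fetching: crop a patch centred exactly at the mapped
    coordinate and re-embed it, so the token aligns with the target location.
    `embed_fn` is a placeholder for the vision encoder's patch embedding."""
    H, W, _ = src_image.shape
    half = patch // 2
    cu = int(np.clip(u, half, W - half))
    cv = int(np.clip(v, half, H - half))
    crop = src_image[cv - half:cv + half, cu - half:cu + half]
    return embed_fn(crop)
```
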
Original abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that backward token warping on ViT image tokens allows MLLMs to better reason about scenes from nearby viewpoints compared to pixel warping and other baselines. It introduces the ViewBench benchmark and reports consistent outperformance, attributing this to the tokens serving as part-level structural representations inspired by mental imagery theories.

Significance. Should the findings be substantiated with detailed metrics and controls, this could represent a meaningful advance in making MLLMs more robust to viewpoint variations through efficient token manipulation rather than retraining or pixel-level operations.

major comments (2)
  1. [Abstract and Experiments] The abstract states that token-level warping is 'consistently outperforming all baselines' on ViewBench, but it provides no specific accuracy numbers, standard deviations, or details of the benchmark construction (e.g., number of scenes, range of viewpoint shifts), all of which are load-bearing for assessing the central empirical claim.
  2. [Theoretical Motivation] The link to mental imagery theories posits part-level representations in tokens, but the manuscript does not include any analysis or ablation showing that the warped tokens maintain part consistency across views; this leaves the interpretation open to alternative explanations such as embedding-space robustness.
minor comments (1)
  1. [Notation] Clarify the exact definition of forward vs. backward warping in the methods section, as the distinction is central but described only at a high level in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions have been made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The abstract states that token-level warping is 'consistently outperforming all baselines' on ViewBench, but it provides no specific accuracy numbers, standard deviations, or details of the benchmark construction (e.g., number of scenes, range of viewpoint shifts), all of which are load-bearing for assessing the central empirical claim.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative details. The experimental section already reports accuracy numbers, standard deviations, and full benchmark construction details (including scene count and viewpoint shift ranges). We have revised the abstract to summarize the key performance metrics and benchmark scale for improved readability while preserving its concise nature. revision: yes

  2. Referee: [Theoretical Motivation] The link to mental imagery theories posits part-level representations in tokens, but the manuscript does not include any analysis or ablation showing that the warped tokens maintain part consistency across views; this leaves the interpretation open to alternative explanations such as embedding-space robustness.

    Authors: This observation is fair and highlights an opportunity to better support the theoretical framing. While the empirical gains in semantic coherence are shown through end-task performance, we have added a targeted ablation in the revised manuscript that quantifies part-level token consistency across warped views (via cross-view feature alignment metrics), helping differentiate the part-structure hypothesis from general embedding robustness. revision: yes
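
Because the rebuttal is simulated, the 'cross-view feature alignment metric' it mentions is unspecified; one plausible instantiation, purely for illustration, is the mean cosine similarity between the warped source-view tokens and the tokens obtained by encoding the ground-truth target view at the same grid positions:

```python
import numpy as np

def cross_view_token_consistency(warped_tokens, target_tokens, eps=1e-8):
    """Illustrative alignment score: mean cosine similarity between warped
    source-view tokens and ground-truth target-view tokens, position by position."""
    a = warped_tokens.reshape(-1, warped_tokens.shape[-1]).astype(np.float64)
    b = target_tokens.reshape(-1, target_tokens.shape[-1]).astype(np.float64)
    a /= np.linalg.norm(a, axis=1, keepdims=True) + eps
    b /= np.linalg.norm(b, axis=1, keepdims=True) + eps
    return float(np.mean(np.sum(a * b, axis=1)))
```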

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with independent baselines

Full rationale

The paper presents a method (backward token warping on ViT tokens) and evaluates it empirically on the newly proposed ViewBench benchmark against multiple external baselines (pixel-wise warping, spatially fine-tuned MLLMs, generative warping). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on experimental outperformance rather than any reduction to prior self-defined quantities. The mental-imagery reference is an external citation, not a load-bearing self-citation. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, fitted parameters, or new postulated entities; the contribution is an empirical method comparison.

pith-pipeline@v0.9.0 · 5482 in / 1032 out tokens · 46727 ms · 2026-05-13T21:04:11.527120+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · 8 internal anchors

  1. [1]

    Spaceqwen2.5-vl-3b-instruct

    Remyx AI. Spaceqwen2.5-vl-3b-instruct. https://huggingface.co/remyxai/SpaceQwen2.5-VL-3B-Instruct, 2025. 15

  2. [2]

    Spacethinker-qwen2.5vl-3b

    Remyx AI. Spacethinker-qwen2.5vl-3b. https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B, 2025. 15

  3. [3]

    Vqasynth

    Remyx AI. Vqasynth. https://github.com/remyxai/VQASynth, 2025. 15

  4. [4]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025. 15, 16

  5. [5]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In CVPR, 2022. 3

  6. [6]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4, 6, 7, 8, 15, 16, 17, 22

  7. [7]

    Positional encoding field

    Yunpeng Bai, Haoxiang Li, and Qixing Huang. Positional encoding field. InICLR, 2026. 3

  8. [8]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. InNeurIPS, 2021. 18

  9. [9]

    Perception tokens enhance visual reasoning in multimodal language models

    Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. InCVPR, 2025. 3

  10. [10]

    Depth pro: Sharp monocular metric depth in less than a second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In ICLR, 2025. 1, 15

  11. [11]

    Spatialbot: Precise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. InICRA, 2025. 2

  12. [12]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024. 2, 15, 16

  13. [13]

    Subobject-level image tokenization

    Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung. Subobject-level image tokenization. In ICML, 2025. 3

  14. [14]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

    Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InCVPR, 2024. 2

  15. [15]

    Think with 3d: Geometric imagination grounded spatial reasoning from limited views

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632, 2025. 1, 3

  16. [16]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InNeurIPS, 2024. 2

  17. [17]

    3d aware region prompted vision language model

    An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, et al. 3d aware region prompted vision language model. InICLR, 2026. 2

  18. [18]

    Accelerating Vision Transformers with Adaptive Patch Sizes

    Rohan Choudhury, JungEun Kim, Jinhyung Park, Eunho Yang, László A Jeni, and Kris M Kitani. Accelerating vision transformers with adaptive patch sizes. arXiv preprint arXiv:2510.18091, 2025. 3

  19. [19]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.

  20. [20]

    Mm-spatial: Exploring 3d spatial understanding in multimodal llms

    Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, et al. Mm-spatial: Exploring 3d spatial understanding in multimodal llms. In ICCV, 2025. 2

  21. [21]

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. InNeurIPS, 2022. Outstanding Paper Award. 16, 17

  22. [22]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In CVPR, 2025. 2

  23. [23]

    3d-llava: Towards generalist 3d lmms with omni superpoint transformer

    Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, and Ian Reid. 3d-llava: Towards generalist 3d lmms with omni superpoint transformer. In CVPR, 2025. 2

  24. [24]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2021. 2, 3

  25. [25]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. In ICML, 2023. 2

  26. [26]

    VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025. 1, 2, 3, 6, 8, 15, 16

  27. [27]

    From llm reasoning to autonomous ai agents: A comprehensive review

    Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From llm reasoning to autonomous ai agents: A comprehensive review. arXiv preprint arXiv:2504.19678, 2025.

  28. [28]

    Principles of mental imagery, 1989

    RA Finke. Principles of mental imagery, 1989. 1

  29. [29]

    Scene-llm: Extending language model for 3d visual reasoning

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual reasoning. InWACV, 2025. 2

  30. [30]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In ECCV, 2024. 2, 19

  31. [31]

    Tokenflow: Consistent diffusion features for consistent video editing

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023. 3

  32. [32]

    Seeing through their eyes: Evaluating visual perspective taking in vision language models

    Gracjan Góral, Alicja Ziarko, Michal Nauman, and Maciej Wołczyk. Seeing through their eyes: Evaluating visual perspective taking in vision language models. arXiv preprint arXiv:2409.12969, 2024. 3

  33. [33]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 15

  34. [34]

    Some demonstrations of the effects of structural descriptions in mental imagery

    Geoffrey Hinton. Some demonstrations of the effects of structural descriptions in mental imagery. Cognitive Science, 3(3):231–250, 1979. 1, 2, 3, 4, 8

  35. [35]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In NeurIPS, 2023.

  36. [36]

    3dllm-mem: Long-term spatial-temporal memory for embodied 3d large language model

    Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Yonatan Bitton, Idan Szpektor, and Kai-Wei Chang. 3dllm-mem: Long-term spatial-temporal memory for embodied 3d large language model. InNeurIPS, 2025. 2

  37. [37]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InICML, 2024. 2

  38. [38]

    3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding.arXiv preprint arXiv:2507.23478, 2025

    Ting Huang, Zeyu Zhang, and Hao Tang. 3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding.arXiv preprint arXiv:2507.23478, 2025. 2

  39. [39]

    Mllms need 3d-aware representation supervision for scene understanding

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. InNeurIPS, 2025. 2

  40. [40]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In CVPR, 2025.

  41. [41]

    Region- aware pretraining for open-vocabulary object detection with vision transformers

    Dahun Kim, Anelia Angelova, and Weicheng Kuo. Region- aware pretraining for open-vocabulary object detection with vision transformers. InCVPR, 2023. 3

  42. [42]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, 2023. 3

  43. [43]

    Videohandles: Editing 3d object compositions in videos using video generative priors

    Juil Koo, Paul Guerrero, Chun-Hao P Huang, Duygu Ceylan, and Minhyuk Sung. Videohandles: Editing 3d object compositions in videos using video generative priors. In CVPR, 2025.

  44. [44]

    Visual images preserve metric spatial information: Evidence from studies of image scanning

    S. M. Kosslyn, T. M. Ball, and B. J. Reiser. Visual images preserve metric spatial information: Evidence from studies of image scanning. Journal of Experimental Psychology: Human Perception and Performance, 1978. 1

  45. [45]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024. 3

  46. [46]

    Molmoact: Action reasoning models that can reason in space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025. 1, 3, 9

  47. [47]

    Groundit: Grounding diffusion transformers via noisy patch transplantation

    Phillip Y. Lee, Taehoon Yoon, and Minhyuk Sung. Groundit: Grounding diffusion transformers via noisy patch transplantation. In NeurIPS, 2024. 3

  48. [48]

    Perspective-aware reasoning in vision-language models via mental imagery simulation

    Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, and Minhyuk Sung. Perspective-aware reasoning in vision-language models via mental imagery simulation. In ICCV, 2025. 1, 3

  49. [49]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. TMLR, 2025. 2

  50. [50]

    Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500, 2025. 3

  51. [51]

    Spatialladder: Progressive training for spatial reasoning in vision-language models

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. In ICLR, 2025. 2, 15, 16

  52. [52]

    See&trek: Training-free spatial prompting for multimodal large lan- guage model

    Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, and Hui Xiong. See&trek: Training-free spatial prompting for multimodal large lan- guage model. InNeurIPS, 2025. 2

  53. [53]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InCVPR, 2023. 3

  54. [54]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024. 19

  55. [55]

    The 3d-pc: a benchmark for visual perspective taking in humans and machines

    Drew Linsley, Peisen Zhou, Alekh Karkada Ashok, Akash Nagaraj, Gaurav Gaonkar, Francis E Lewis, Zygmunt Pizlo, and Thomas Serre. The 3d-pc: a benchmark for visual perspective taking in humans and machines. InICLR, 2025. 3

  56. [56]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 3

  57. [57]

    Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning

    Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, et al. Spatialcot: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv preprint arXiv:2501.10074, 2025. 2

  58. [58]

    Visual embodied brain: Let multimodal large language models see, think, and control in spaces

    Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Hao- nan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces. arXiv preprint arXiv:2506.00123, 2025. 15, 16

  59. [59]

    Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors

    Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reasoning capabilities of vision-language models through prompting and interacting 3d priors. InNeurIPS, 2024. 2

  60. [60]

    3dsrbench: A comprehensive 3d spatial reasoning benchmark

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InICCV,

  61. [61]

    Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning

    Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning. In NeurIPS, 2025. 1, 2, 6, 8, 16

  62. [62]

    Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

    Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In CVPR, 2025. 2

  63. [63]

    Sqa3d: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. InICLR, 2022. 3

  64. [64]

    Mind meets space: Rethinking agentic spatial intelligence from a neuroscience-inspired perspective.arXiv preprint arXiv:2509.09154, 2025

    Bui Duc Manh, Soumyaratna Debnath, Zetong Zhang, Shri- ram Damodaran, Arvind Kumar, Yueyi Zhang, Lu Mi, Erik Cambria, and Lin Wang. Mind meets space: Rethinking agentic spatial intelligence from a neuroscience-inspired perspective.arXiv preprint arXiv:2509.09154, 2025. 3

  65. [65]

    Visual agentic ai for spatial reasoning with a dynamic api

    Damiano Marsili, Rohun Agrawal, Yisong Yue, and Georgia Gkioxari. Visual agentic ai for spatial reasoning with a dynamic api. InCVPR, 2025. 2

  66. [66]

    Simple open-vocabulary object detection

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In ECCV, 2022. 3

  67. [67]

    A framework for representing knowledge, 1974

    Marvin Minsky et al. A framework for representing knowledge, 1974. 2, 3, 4, 8

  68. [68]

    Embodiedgpt: Vision-language pre-training via embodied chain of thought

    Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. InNeurIPS, 2023. 2

  69. [69]

    Mental imagery.The Stanford Encyclopedia of Philosophy, 2021

    Bence Nanay. Mental imagery.The Stanford Encyclopedia of Philosophy, 2021. 1

  70. [70]

    Embodied arena: A comprehensive, unified, and evolving evaluation platform for embodied ai

    Fei Ni, Min Zhang, Pengyi Li, Yifu Yuan, Lingfeng Zhang, Yuecheng Liu, Peilong Han, Longxin Kou, Shaojin Ma, Jinbin Qiao, et al. Embodied arena: A comprehensive, unified, and evolving evaluation platform for embodied ai. arXiv preprint arXiv:2509.15273, 2025. 3

  71. [71]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2023. 3

  72. [72]

    Spacer: Reinforcing mllms in video spatial reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025. 2, 15, 16

  73. [73]

    Imagery and Verbal Processes (1st ed.)

    A. Paivio. Imagery and Verbal Processes (1st ed.). Psychology Press, 1979. 1

  74. [74]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. 3

  75. [75]

    What the mind's eye tells the mind's brain: A critique of mental imagery

    Zenon W Pylyshyn. What the mind's eye tells the mind's brain: A critique of mental imagery. Psychological Bulletin, 1973.

  76. [76]

    Vln-r1: Vision-language navigation via reinforcement fine-tuning

    Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning. arXiv preprint arXiv:2506.17221, 2025.

  77. [77]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. InICLR, 2026. 2

  78. [78]

    Tokenflow: Unified image tokenizer for multimodal understanding and generation

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. InCVPR, 2025. 3

  79. [79]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In ACCV, 2024. 2

  80. [80]

    Does spatial cognition emerge in frontier models?

    Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models? In ICLR, 2025. 1, 2

Showing first 80 references.