Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Chun-Hsiao Yeh; Fanyi Xiao; Joseph Tighe; Manchen Wang; Shengyi Qian; Yi Ma

arxiv: 2605.30231 · v1 · pith:OT5ND4W7new · submitted 2026-05-28 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 07:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: vision-language models · 3D spatial reasoning · point correspondence · depth consistency · geometric priors · view invariance

The pith

VLMs acquire reliable 3D spatial reasoning by training on video point correspondences and depth consistency instead of 3D VQA data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard vision-language models have very low internal accuracy at matching corresponding points across views, often below 5 percent. It introduces a training method that adds a small correspondence head with deep supervision across transformer layers and optimizes it on ground-truth geometry extracted from large-scale video scenes. A contrastive loss enforces 2D view-invariance while depth consistency resolves 3D ambiguities. The resulting models reach over 70 percent peak correspondence accuracy and over 85 percent temporal robustness, which produces large gains on spatial reasoning benchmarks. This route avoids the dataset biases that arise when models are fine-tuned directly on 3D visual question-answering tasks.
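The correspondence accuracy quoted above can be read as a nearest-neighbour matching score. The following is a minimal sketch under that assumption, not the paper's evaluation code: the function names, the use of cosine similarity, and the toy features are all hypothetical, and the source does not state how matches are scored.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def correspondence_accuracy(feats_a, feats_b, gt_match):
    """Fraction of points in view A whose most similar point in view B
    is the ground-truth match; gt_match[i] indexes into feats_b."""
    correct = 0
    for i, query in enumerate(feats_a):
        sims = [cosine(query, cand) for cand in feats_b]
        predicted = max(range(len(sims)), key=sims.__getitem__)
        correct += int(predicted == gt_match[i])
    return correct / len(feats_a)

# Toy features (assumed, 2-D): the third point is deliberately mismatched.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.2]]
gt = [0, 1, 2]
accuracy = correspondence_accuracy(view_a, view_b, gt)  # two of three points match
```

Running the same function on features taken from each transformer layer would give a layer-wise accuracy curve of the kind the summary refers to.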

Core claim

Training a correspondence head with contrastive loss on ground-truth point matches and depth-consistency supervision across all LLM layers produces internal representations whose point-matching behavior improves from under 5 percent to over 70 percent accuracy; these representations then support substantially stronger performance on downstream 3D spatial benchmarks without any exposure to 3D VQA data during training.

What carries the argument

GASP's correspondence head, a lightweight module applied as deep supervision to every transformer layer and trained with a dual objective of contrastive point-correspondence loss plus depth consistency loss.
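One plausible way to write the dual objective with deep supervision is sketched below. The notation is an assumption made for illustration: the source does not give the exact loss form, the weighting between the two terms, or how layers are aggregated. Here L is the number of transformer layers, N the number of ground-truth point pairs, z and \tilde z the correspondence-head outputs for the two views, \tau a temperature, and \lambda a hypothetical weight on the depth term, whose form is left unspecified.

```latex
\mathcal{L}_{\text{total}}
  = \sum_{\ell=1}^{L}\Big(\mathcal{L}^{(\ell)}_{\text{corr}}
      + \lambda\,\mathcal{L}^{(\ell)}_{\text{depth}}\Big),
\qquad
\mathcal{L}^{(\ell)}_{\text{corr}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp\!\big(\mathrm{sim}(z^{(\ell)}_{i}, \tilde z^{(\ell)}_{i})/\tau\big)}
             {\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(z^{(\ell)}_{i}, \tilde z^{(\ell)}_{j})/\tau\big)}
```

The contrastive term shown is the standard InfoNCE form, used here as a stand-in; the paper may use a different contrastive formulation.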

If this is right

Internal point correspondence accuracy rises from below 5 percent to above 70 percent peak layer-wise and remains above 85 percent under temporal shifts.
Performance on All-Angles Bench increases by 18.2 percentage points and on VSI-Bench by 29.0 percentage points.
These gains occur without any 3D VQA training data, reducing the risk of overfitting to dataset-specific question biases.
The same internal improvements appear across multiple layers rather than only at the final layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be applied to video-only models to improve temporal 3D consistency without task-specific labels.
If the geometric priors transfer across different video domains, the approach might lower the annotation cost for building spatial reasoning systems.
Similar deep-supervision heads could be tested on other low-level geometric signals such as surface normals or optical flow.

Load-bearing premise

Ground-truth point correspondences and depth values extracted from large-scale video scenes are accurate enough and free of systematic bias to produce internal representations that generalize to new spatial tasks.

What would settle it

An ablation that keeps the same architecture and training budget but removes the geometric supervision losses, then measures whether the reported gains on All-Angles Bench and VSI-Bench disappear while internal correspondence accuracy stays low.

Original abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
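To make the contrastive part of the objective concrete, here is a minimal sketch of an InfoNCE-style loss on matched points, averaged over layers as a stand-in for deep supervision. Everything in it is an assumption: the function names, the temperature value, the choice to average over layers, and the toy features are hypothetical, and the correspondence head itself is not modelled (the inputs are taken to be its outputs).

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def info_nce(feats_a, feats_b, temperature=0.1):
    """InfoNCE loss where feats_a[i] and feats_b[i] form a ground-truth pair
    and every other feats_b[j] serves as a negative."""
    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    total = 0.0
    for i, query in enumerate(a):
        logits = [sum(x * y for x, y in zip(query, cand)) / temperature for cand in b]
        peak = max(logits)  # subtract the max before exponentiating, for stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

def deep_supervised_loss(layers_a, layers_b, temperature=0.1):
    """Average the contrastive loss over per-layer features (assumed aggregation)."""
    losses = [info_nce(fa, fb, temperature) for fa, fb in zip(layers_a, layers_b)]
    return sum(losses) / len(losses)

# Toy check with two "layers": aligned pairs give a low loss, swapped pairs a high one.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_good = deep_supervised_loss([aligned, aligned], [aligned, aligned])
loss_bad = deep_supervised_loss([aligned, aligned], [swapped, swapped])
```

The depth consistency term is omitted because the abstract does not say how it is computed.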

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows VLMs can gain spatial reasoning from video-derived geometric priors via a deep-supervised correspondence head, but the evidence stays at the level of internal accuracy jumps without controls or full verification.

The main point is that GASP trains a lightweight correspondence head across transformer layers using contrastive point matching and depth consistency losses drawn from ground-truth geometry in video scenes. This produces large internal gains in correspondence accuracy and carries over to gains of +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without any 3D VQA data.

The new piece is the specific combination of deep supervision on the correspondence head plus the dual objective that enforces view-invariance and resolves 3D ambiguities. The diagnostic that off-the-shelf VLMs sit below 5% on correspondence matching is a useful baseline, and the jump to over 70% peak accuracy and over 85% temporal robustness is a concrete before-and-after result. The claim that fundamental geometric priors can substitute for curated VQA supervision is worth testing.

The soft spots are the missing pieces that matter for judging the result. No error bars, no ablation on the two loss terms, no dataset sizes or extraction details for the video ground truth, and no check on whether the video scenes share statistics with the evaluation benchmarks. If the ground-truth correspondences contain systematic noise or bias, the reported lifts could partly reflect dataset overlap rather than emergent geometric understanding. The abstract-only view leaves open whether post-hoc choices drove the numbers.

This work is aimed at groups trying to improve spatial reliability in VLMs for robotics or scene tasks without building larger VQA corpora. A reader who wants to experiment with geometric deep supervision would get value from the framework description. It deserves peer review because the core idea is testable and the internal-to-downstream link is worth checking with proper controls, even if heavy revision is likely.

Referee Report

2 major / 1 minor

Summary. The paper proposes GASP, a framework to inject 3D spatial priors into VLMs via a correspondence head with deep supervision across transformer layers. It trains on a dual objective (contrastive loss on ground-truth point correspondences from video scenes for view-invariance plus depth consistency) without any 3D VQA data, reports lifting internal correspondence accuracy from <5% to >70% with >85% temporal robustness, and claims resulting gains of +18.2% on All-Angles Bench and +29.0% on VSI-Bench.

Significance. If the central results hold after addressing validation gaps, the work would be significant for demonstrating that geometric priors extracted from video can improve VLM spatial reasoning in a manner that avoids overfitting to 3D VQA datasets and does not require specialized 3D encoders. The diagnostic analysis of baseline VLM correspondence failures provides a useful internal metric, and the separation between training scenes and evaluation benchmarks supports the generalizability argument.

major comments (2)

[Abstract] Abstract and training objective description: The central claim that improvements arise from fundamental geometric priors (rather than dataset-specific fitting) depends on the accuracy and lack of bias in the ground-truth point correspondences and depths extracted from large-scale video scenes. No independent validation, error metrics, or comparison against synthetic ground truth is provided for the extraction pipeline itself; if extraction errors correlate with scene statistics shared with the evaluation benchmarks, the reported lifts could reflect data artifacts rather than emergent understanding.
[Results] Results and analysis sections: The reported internal accuracy improvements (baseline <5% to >70% peak layer-wise) and downstream gains (+18.2%, +29.0%) are presented without error bars, ablation studies isolating the contrastive versus depth-consistency terms, dataset sizes for the video training scenes, or statistical tests. These omissions make it impossible to assess whether the dual objective is load-bearing or if the gains are robust.

minor comments (1)

[Abstract] The abstract references All-Angles Bench and VSI-Bench without defining the tasks or citing their sources; adding these details would improve clarity for readers unfamiliar with the benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. The feedback identifies key areas where additional validation and experimental details would strengthen the manuscript. We address each major comment below and will incorporate the suggested improvements in the revised version.

Referee: [Abstract] Abstract and training objective description: The central claim that improvements arise from fundamental geometric priors (rather than dataset-specific fitting) depends on the accuracy and lack of bias in the ground-truth point correspondences and depths extracted from large-scale video scenes. No independent validation, error metrics, or comparison against synthetic ground truth is provided for the extraction pipeline itself; if extraction errors correlate with scene statistics shared with the evaluation benchmarks, the reported lifts could reflect data artifacts rather than emergent understanding.

Authors: We agree that validating the ground-truth extraction pipeline is important for supporting the central claim. The correspondences and depths are obtained via standard video-based methods, but we acknowledge the absence of explicit error analysis in the current manuscript. In the revision, we will add a dedicated validation subsection that reports error metrics (e.g., correspondence precision and depth RMSE) by comparing the extracted geometry against synthetic ground-truth scenes generated with a graphics engine. We will also check for correlation between extraction errors and statistics of the evaluation benchmarks to rule out data artifacts.
revision: yes
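The two error metrics named in this response could be computed roughly as follows. This is a sketch under stated assumptions, not the authors' validation code: the function names, the 3-pixel threshold, and the sample values are hypothetical and chosen only to show the arithmetic.

```python
import math

def depth_rmse(extracted, reference):
    """Root-mean-square error between two equal-length lists of depths."""
    squared = [(e - r) ** 2 for e, r in zip(extracted, reference)]
    return math.sqrt(sum(squared) / len(squared))

def correspondence_precision(extracted_xy, reference_xy, max_pixel_error=3.0):
    """Share of extracted matches within max_pixel_error pixels of the reference."""
    hits = sum(
        1
        for (ex, ey), (rx, ry) in zip(extracted_xy, reference_xy)
        if math.hypot(ex - rx, ey - ry) <= max_pixel_error
    )
    return hits / len(extracted_xy)

# Invented numbers standing in for pipeline output vs. synthetic ground truth.
extracted_depth = [2.0, 4.0, 6.0]
synthetic_depth = [2.0, 5.0, 4.0]
rmse = depth_rmse(extracted_depth, synthetic_depth)  # sqrt((0 + 1 + 4) / 3)

extracted_pts = [(10.0, 10.0), (50.0, 40.0), (80.0, 20.0)]
synthetic_pts = [(11.0, 10.0), (50.0, 48.0), (80.0, 22.0)]
precision = correspondence_precision(extracted_pts, synthetic_pts)  # errors: 1, 8, 2 px
```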
Referee: [Results] Results and analysis sections: The reported internal accuracy improvements (baseline <5% to >70% peak layer-wise) and downstream gains (+18.2%, +29.0%) are presented without error bars, ablation studies isolating the contrastive versus depth-consistency terms, dataset sizes for the video training scenes, or statistical tests. These omissions make it impossible to assess whether the dual objective is load-bearing or if the gains are robust.

Authors: We concur that these details are necessary to evaluate robustness. The revised manuscript will include: (1) error bars computed across multiple random seeds for all reported metrics; (2) ablation experiments that isolate the contrastive loss from the depth-consistency term; (3) the exact number of video scenes and frames used for training; and (4) statistical significance tests (paired t-tests with p-values) comparing GASP against baselines. These additions will clarify the contribution of each component of the dual objective.
revision: yes
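Items (1) and (4) amount to simple per-seed statistics. The sketch below shows one way to compute them with the Python standard library; the function names are hypothetical and the scores are invented for illustration, not results from the paper. Turning the t statistic into a p-value needs the Student t distribution, which the standard library does not provide, so that step is left out.

```python
import math
import statistics

def mean_and_std(scores):
    """Mean and sample standard deviation across seeds (the error bar)."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    t = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical benchmark scores from four seeds, one list per system.
method = [61.0, 63.0, 62.0, 64.0]
baseline = [50.0, 51.0, 52.0, 51.0]
method_mean, method_std = mean_and_std(method)
t_stat, dof = paired_t_statistic(method, baseline)
```

A paired test is appropriate only when each seed's method run and baseline run share the same data split and initialization conditions.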

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external GT geometry and held-out benchmarks.

The paper's core derivation trains a correspondence head and depth consistency objective on ground-truth point matches and depths extracted from large-scale video scenes, then measures downstream gains on separate spatial benchmarks (All-Angles Bench, VSI-Bench) that receive no 3D VQA supervision. The reported internal correspondence lift (below 5% → over 70%) is a direct consequence of the contrastive training signal but is presented only as a diagnostic; the load-bearing claim is generalization to unseen spatial tasks. No equation, self-citation chain, or fitted parameter is shown to redefine the benchmark improvements as quantities already present in the training inputs. The extraction pipeline for GT is treated as an external data source rather than a self-referential construct.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the availability of accurate ground-truth geometry in video data and on the premise that the proposed supervision signals produce generalizable spatial representations.

axioms (1)

Domain assumption: Large-scale video scenes supply accurate ground-truth point correspondences and depth values suitable for supervision.
This premise underpins both the contrastive loss and the depth consistency objective described in the abstract.

pith-pipeline@v0.9.1-grok · 5842 in / 1366 out tokens · 34212 ms · 2026-06-29T07:45:50.883217+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 32 canonical work pages · 17 internal anchors

[1]

Claude, 2024

Anthropic. Claude, 2024. 1

2024
[2]

Vivit: A video vision transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021. 17

2021
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 1, 2, 6, 7, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Is space-time attention all you need for video understanding? InICML, page 4, 2021

Gedas Bertasius, Heng Wang, and Lorenzo Torre- sani. Is space-time attention all you need for video understanding? InICML, page 4, 2021. 17

2021
[5]

train on the test set

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fer- gus, and Saining Xie. Benchmark designers should" train on the test set" to expose exploitable non-visual shortcuts.arXiv preprint arXiv:2511.04655, 2025. 15

work page arXiv 2025
[6]

Spa- tialvlm: Endowing vision-language models with spa- tial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spa- tialvlm: Endowing vision-language models with spa- tial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 3, 7

2024
[7]

Ll3da: Visual interactive instruc- tion tuning for omni-3d understanding reasoning and planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruc- tion tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 26428–26438, 2024. 1, 3

2024
[8]

Internvl: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 1, 7

2024
[9]

Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632,

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xu- fang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632,

work page arXiv
[10]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021. 17

2021
[11]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wen- long Huang, et al. Palm-e: An embodied multimodal language model. InInternational Conference on Ma- chine Learning (ICML), 2023. 1

2023
[12]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision- language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Video- mme: The first-ever comprehensive evaluation bench- mark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video- mme: The first-ever comprehensive evaluation bench- mark of multi-modal llms in video analysis. InPro- ceedings of the Computer Vision and Pattern Recog- nition Conference, pages 24108–24118, 2025. 7

2025
[14]

Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 3

work page arXiv 2024
[15]

Blink: Multimodal large language models can see but not perceive

XingyuFu, YushiHu, BangzhengLi, YuFeng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei- Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024. 2, 7

2024
[16]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schel- ten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023. 3

2023
[18]

Chat- scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Informa- tion Processing Systems, 37:113991–114017, 2024

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat- scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Informa- tion Processing Systems, 37:113991–114017, 2024

2024
[19]

An Embodied Generalist Agent in 3D World

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Language models as zero-shot plan- ners: Extracting actionable knowledge for embod- ied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot plan- ners: Extracting actionable knowledge for embod- ied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022. 1 10

2022
[21]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Rad- ford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Track4gen: Teach- ing video diffusion models to track points improves video generation

Hyeonho Jeong, Chun-Hao P Huang, Jong Chul Ye, Niloy J Mitra, and Duygu Ceylan. Track4gen: Teach- ing video diffusion models to track points improves video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7276–7287, 2025. 3

2025
[23]

Sceneverse: Scaling 3d vision-language learn- ing for grounded scene understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learn- ing for grounded scene understanding. InEuro- pean Conference on Computer Vision, pages 289–310. Springer, 2024. 3

2024
[24]

Ego- humans: An ego-centric 3d multi-human benchmark

Rawal Khirodkar, Aayush Bansal, Lingni Ma, Richard Newcombe, Minh Vo, and Kris Kitani. Ego- humans: An ego-centric 3d multi-human benchmark. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 19807–19819,
[25]

Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020. 4

2020
[26]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024. 2

2024
[28]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Zi- wei Liu, and Chunyuan Li. Llava-onevision: Easy vi- sual task transfer.arXiv preprint arXiv:2408.03326,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765,

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765,

work page arXiv
[30]

Dl3dv-10k: A large- scale scene dataset for deep learning-based 3d vi- sion

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large- scale scene dataset for deep learning-based 3d vi- sion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 2, 4, 6, 14

2024
[31]

Moka: Open-vocabulary robotic manipula- tion through mark-based visual prompting

Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. Moka: Open-vocabulary robotic manipula- tion through mark-based visual prompting. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024. 1

2024
[32]

Improved baselines with visual instruction tun- ing

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tun- ing. InProceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 26296– 26306, 2024. 7

2024
[33]

TempCompass: Do Video LLMs Really Understand Videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Viewpoint rosetta stone: Unlocking un- paired ego-exo videos for view-invariant represen- tation learning

Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. Viewpoint rosetta stone: Unlocking un- paired ego-exo videos for view-invariant represen- tation learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15802–15812, 2025. 2

2025
[35]

Emergent temporal correspondences from video diffu- sion transformers.arXiv preprint arXiv:2506.17220,

Jisu Nam, Soowon Son, Dahyun Chung, Jiyoung Kim, Siyoon Jin, Junhwa Hur, and Seungryong Kim. Emergent temporal correspondences from video diffu- sion transformers.arXiv preprint arXiv:2506.17220,

work page arXiv
[36]

Llarva: Vision-action in- struction tuning enhances robot learning.arXiv preprint arXiv:2406.11815, 2024

Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, and Roei Herzig. Llarva: Vision-action in- struction tuning enhances robot learning.arXiv preprint arXiv:2406.11815, 2024. 1

work page arXiv 2024
[37]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025. 1, 3, 15

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Bootstrap your own views: Masked ego-exo modeling for fine-grained view-invariant video representations

Jungin Park, Jiyoung Lee, and Kwanghoon Sohn. Bootstrap your own views: Masked ego-exo modeling for fine-grained view-invariant video representations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13661–13670, 2025. 2

2025
[39]

Gpt4scene: Understand 3d scenes from videos with vision-language models

ZhangyangQi, ZhixiongZhang, YeFang, JiaqiWang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025. 1, 3

work page arXiv 2025
[40]

Enhancing video-llm reasoning via agent-of-thoughts distillation

Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. Enhancing video-llm reasoning via agent-of-thoughts distillation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8523– 8533, 2025. 7

2025
[41]

One step at a time: Long-horizon vision-and-language navigation with milestones

Chan Hee Song, Jihyung Kil, Tai-Yu Pan, Brian M Sadler, Wei-Lun Chao, and Yu Su. One step at a time: Long-horizon vision-and-language navigation with milestones. InProceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pages 15482–15491, 2022. 1 11

2022
[42]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568: 127063, 2024

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568: 127063, 2024. 17

2024
[43]

Embodied bert: A transformer model for embodied, language- guided visual task completion.arXiv preprint arXiv:2108.04927, 2021

Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, and Gaurav Sukhatme. Embodied bert: A transformer model for embodied, language- guided visual task completion.arXiv preprint arXiv:2108.04927, 2021. 1

work page arXiv 2021
[44]

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, et al. Spacevista: All-scale visual spatial reasoning from mm to km.arXiv preprint arXiv:2510.09606, 2025. 1, 3, 16

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 1, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491,

work page internal anchor Pith review Pith/arXiv arXiv
[47]

RAFT: Recurrent all- pairs field transforms for optical flow

Zachary Teed and Jia Deng. RAFT: Recurrent all- pairs field transforms for optical flow. InEur. Conf. Comput. Vis., 2020. 5

2020
[48]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Ad- vances in Neural Information Processing Systems, 37:87310–87356, 2024

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Midde- pogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Ad- vances in Neural Information Processing Systems, 37:87310–87356, 2024. 7

2024
[49]

Vggt: Visual geometry grounded trans- former

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded trans- former. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306,
[50]

Dust3r: Geo- metric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geo- metric 3d vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 4

2024
[51]

Video- str: Reinforcing mllms in video spatio-temporal reasoning with relation graph.arXiv preprint arXiv:2510.10976, 2025

Wentao Wang, Heqing Zou, Tianze Luo, Rui Huang, Yutian Zhao, Zhuochen Wang, Hansheng Zhang, Chengwei Qin, Yan Wang, Lin Zhao, et al. Video- str: Reinforcing mllms in video spatio-temporal reasoning with relation graph.arXiv preprint arXiv:2510.10976, 2025. 1, 3

work page arXiv 2025
[52]

Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes. arXiv preprint arXiv:2308.08769, 2023. 1, 3
[53]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025. 1, 3
[54]

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025. 1, 3, 15
[55]

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021. 7
[56]

Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Vlm-grounder: A vlm agent for zero-shot 3d visual grounding. arXiv preprint arXiv:2410.13860, 2024. 3
[57]

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial understanding with multi-modal large language models. arXiv preprint arXiv:2505.17015,
[58]

Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F Fouhey, and Joyce Chai. Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 7694–7701. IEEE, 2024. 3
[59]

Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F Fouhey, and Joyce Chai. 3d-grand: A million-scale dataset for 3d-llms with better grounding and less hallucination. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29501–29512, 2025. 3
[60]

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1, 2, 3, 7, 15
[61]

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025. 1, 16
[62]

Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, and Yi Ma. Seeing from another perspective: Evaluating multi-view understanding in mllms. arXiv preprint arXiv:2504.15280, 2025. 2, 7
[63]

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 1
[64]

Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976, 2025.
[65]

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 6, 7, 14
[66]

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 6
[67]

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors. arXiv preprint arXiv:2505.24625, 2025. 1, 3, 7
[68]

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125, 2024. 3
[69]

Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, and Huaizu Jiang. Struct2d: A perception-guided framework for spatial reasoning in mllms. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 3

Appendix Overview

In this supplementary material, we provide details on our geometric training data collection in Section A. Next, we provide full implementation details, including the correspondence head architecture (Hc) and all training hyperparameters, in Section B. Following this, we detail the evaluation protocol used to measure correspo...

