pith. machine review for the scientific record.

arxiv: 2605.08064 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D representations · Vision-language models · Semantic clustering · Proxy representations · Spatial intelligence · Visual question answering · Video processing · 3D visual grounding

The pith

Proxy3D creates compact 3D proxies from video frames via semantic clustering to let VLMs handle spatial tasks efficiently with shorter sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix how vision-language models process 3D scenes by moving away from long pixel sequences that either break spatial consistency or serialize data inefficiently. It extracts semantic and geometric features from ordinary video frames, then clusters them into a smaller set of 3D proxies that still carry the necessary scene structure. These proxies are aligned to the language model through a custom dataset and staged training. The result is competitive or leading performance on 3D visual question answering, grounding, and spatial reasoning benchmarks while using fewer vision tokens than standard approaches.
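To make the sequence-length argument concrete, here is a minimal sketch of the token arithmetic; the frame count, patch grid, and proxy budget are hypothetical placeholders, not numbers reported in the paper.

```python
# Hypothetical token budget; all numbers are illustrative, not from the paper.
frames = 32               # sampled video frames
patches_per_frame = 196   # e.g. a 14 x 14 patch grid from a 2D visual encoder
num_proxies = 256         # compact proxy set after semantic-aware clustering

pixel_aligned_tokens = frames * patches_per_frame   # 6272 vision tokens
proxy_tokens = num_proxies                           # 256 vision tokens
print(f"~{pixel_aligned_tokens / proxy_tokens:.1f}x fewer vision tokens")  # ~24.5x
```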

Core claim

Given only video frames, semantic and geometric encoders extract scene features that undergo semantic-aware clustering to produce a compact set of 3D proxies; these proxies are then aligned with a vision-language model through the SpaceSpan dataset and multi-stage training, yielding competitive or state-of-the-art results on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks even when vision sequences are kept short.

What carries the argument

Semantic-aware clustering of features extracted from video frames, yielding a compact set of 3D proxies that serve as the vision input to the VLM.
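A minimal sketch of one way such a clustering step could produce proxies, assuming k-means over per-patch semantic features and per-cluster averaging of predicted 3D positions; the paper's actual encoders, clustering objective, and feature fusion are not reproduced here, so the shapes and helper below are assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: k-means stands in for "semantic-aware clustering".
import numpy as np
from sklearn.cluster import KMeans

def build_proxies(sem_feats, xyz, k=256):
    """sem_feats: (T, D) per-patch semantic features pooled over all frames.
    xyz: (T, 3) predicted 3D coordinates for the same patches.
    Returns (k, D + 3) proxies: mean semantic feature plus mean 3D position."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sem_feats)
    proxies = np.zeros((k, sem_feats.shape[1] + 3), dtype=np.float32)
    for c in range(k):
        members = labels == c
        proxies[c, :-3] = sem_feats[members].mean(axis=0)  # semantic identity
        proxies[c, -3:] = xyz[members].mean(axis=0)        # geometric position
    return proxies

# Toy usage: 32 frames x 196 patches -> 6272 patch tokens compressed to 256 proxies.
feats = np.random.randn(6272, 768).astype(np.float32)
coords = np.random.rand(6272, 3).astype(np.float32)
proxy_tokens = build_proxies(feats, coords)   # shape (256, 771)
```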

If this is right

  • VLMs can maintain spatial consistency across 3D scenes while processing shorter vision sequences than pixel-aligned or full-geometry methods.
  • Video-only inputs become sufficient for high-performing 3D visual question answering and grounding without explicit 3D reconstruction.
  • Multi-stage alignment on the SpaceSpan dataset transfers the proxy representations effectively into existing VLM architectures.
  • General spatial intelligence benchmarks improve because the proxies encode both semantic identity and geometric relations in one compact form.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering step might allow VLMs to ingest live video streams for real-time spatial tasks with lower latency.
  • Proxy representations could be combined with token-reduction techniques already used in 2D VLMs to push efficiency further.
  • Testing the proxies on longer or more dynamic video sequences would reveal whether fine-grained motion details survive the clustering step.
  • The approach opens a route to 3D-aware VLMs that avoid the heavy compute of full point-cloud or mesh inputs.

Load-bearing premise

Semantic-aware clustering of features extracted from video frames produces 3D proxies that retain enough spatial and semantic information for effective VLM alignment without critical loss of detail or consistency.

What would settle it

A controlled benchmark run where the same 3D VQA and grounding tasks are evaluated once with the clustered proxies and once with the original un-clustered feature sequences; a clear drop below competitive levels for the proxy version would falsify the central claim.
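A sketch of the shape such a head-to-head run could take, with the VLM scorer and the two token extractors injected as callables; every name below is a placeholder standing in for the paper's pipeline, not an actual API.

```python
# Hypothetical A/B harness for the falsification test above; all callables are placeholders.

def compare_representations(score_fn, extractors, scenes):
    """score_fn(tokens, scene) -> accuracy in [0, 1] on that scene's questions.
    extractors: {name: frames -> list of vision tokens}.
    Returns {name: (mean accuracy, mean vision-sequence length)}."""
    results = {}
    for name, extract in extractors.items():
        accs, lengths = [], []
        for scene in scenes:
            tokens = extract(scene["frames"])
            accs.append(score_fn(tokens, scene))
            lengths.append(len(tokens))
        results[name] = (sum(accs) / len(accs), sum(lengths) / len(lengths))
    return results

# Dummy run that only demonstrates the comparison's shape, not real numbers.
scenes = [{"frames": list(range(32))} for _ in range(4)]
extractors = {
    "raw_features": lambda frames: [0] * (len(frames) * 196),  # pixel-aligned tokens
    "clustered_proxies": lambda frames: [0] * 256,             # compact proxies
}
print(compare_representations(lambda tokens, scene: 0.0, extractors, scenes))
```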

Figures

Figures reproduced from arXiv: 2605.08064 by Denis Gudovskiy, Haowen Sun, Jerry Jiang, Kurt Keutzer, Tomoyuki Okuno, Wenzhao Zheng, Yohei Nakata.

Figure 1. Overview of Proxy3D: our 3D proxy representations are extracted from a set of pretrained encoders, their sequence length is compressed by the semantic-aware clustering, followed by the multi-stage alignment with a language model using our SpaceSpan dataset.
Figure 2. Proxy3D architecture. A geometry predictor and a semantic encoder output latent features of the vision modality. Then, our proxy 3D representations are clustered to reduce complexity. Lastly, multi-stage training aligns proxy features with the language model.
Figure 3. Proxy3D multi-stage training. Each stage in our progressive iterative training aims to develop a certain spatial intelligence skill, from the easiest one to more complex ones: we begin with simplified image-text alignment and progress to actual images with spatial reasoning.
Figure 4. Coordinate alignment stage helps an MLLM to precisely …
Figure 5. Proxy3D performance on VSI-Bench [46]. Left is on Scannet++ [48], right is on ARKitScenes [3]. Proxy3D generalizes well on unseen scenes, and is capable of solving difficult questions.
Figure 6. Comparison of VSI-Bench tasks and splits.
Figure 7. Ablation study on VSI-Bench's Scannet [13] split. Proxy3D outperforms the base Qwen2-VL-7B and GPT4Scene by a large margin in object counting, size and distance estimation. Coordinate alignment (CA) and longer sequences further increase metrics.
Original abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Proxy3D, a method that takes video frames as input, extracts features via separate semantic and geometric encoders, applies semantic-aware clustering to produce compact 3D proxy representations, curates the SpaceSpan dataset, and performs multi-stage training to align the proxies with a VLM. It claims that these shorter proxy sequences enable competitive or state-of-the-art results on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.

Significance. If the performance claims hold under rigorous evaluation, the approach could offer a practical middle ground between inefficient full 3D reconstructions and spatially inconsistent 2D pipelines, improving scalability for VLMs that require spatial reasoning.

major comments (2)
  1. [Method] Method section (clustering step): semantic-aware clustering is performed on independently encoded per-frame features without explicit incorporation of multi-view geometric constraints such as epipolar consistency, depth lifting, or 3D point regularization. This risks producing semantically coherent but spatially inconsistent proxies, directly threatening the central claim that the proxies retain sufficient 3D structure for spatial tasks.
  2. [Experiments] Experiments section: the headline claim of competitive or SOTA performance on 3D VQA, grounding, and spatial benchmarks is asserted without reported quantitative results, baselines, error bars, ablation studies on clustering parameters, or comparisons to full 3D or pixel-based alternatives in the evaluated sections; this prevents assessment of whether the efficiency gain comes at the cost of spatial accuracy.
minor comments (2)
  1. [Introduction] Introduction: the distinction between correspondence-based and representation-based models would benefit from one or two concrete citations to prior work to ground the motivation.
  2. [Dataset] Dataset section: provide details on how SpaceSpan was curated and its size/statistics to allow reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, outlining our planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method] Method section (clustering step): semantic-aware clustering is performed on independently encoded per-frame features without explicit incorporation of multi-view geometric constraints such as epipolar consistency, depth lifting, or 3D point regularization. This risks producing semantically coherent but spatially inconsistent proxies, directly threatening the central claim that the proxies retain sufficient 3D structure for spatial tasks.

    Authors: We appreciate the referee highlighting this aspect of the clustering procedure. The geometric encoder is specifically designed to extract 3D-aware features from the input video frames, and the resulting proxies are explicitly located in 3D space. Nevertheless, we acknowledge that the current description does not detail explicit multi-view constraints during clustering. In the revised manuscript we will expand the Method section with a clearer explanation of how the geometric encoder contributes to spatial consistency, add a dedicated paragraph on the 3D positioning of proxies, and include an ablation study that quantifies multi-view consistency (e.g., reprojection error and depth alignment metrics) before and after clustering; a sketch of one such metric appears after these responses. These additions will directly support the claim that the proxies preserve sufficient 3D structure for downstream spatial tasks. revision: partial

  2. Referee: [Experiments] Experiments section: the headline claim of competitive or SOTA performance on 3D VQA, grounding, and spatial benchmarks is asserted without reported quantitative results, baselines, error bars, ablation studies on clustering parameters, or comparisons to full 3D or pixel-based alternatives in the evaluated sections; this prevents assessment of whether the efficiency gain comes at the cost of spatial accuracy.

    Authors: We apologize for any lack of clarity in the presentation of results. The manuscript already contains quantitative evaluations on 3D VQA, visual grounding, and spatial-intelligence benchmarks together with baseline comparisons. To address the referee’s concerns comprehensively, we will reorganize and expand the Experiments section to place all headline results in the main body with full tables, report standard deviations across multiple runs, add ablations on clustering hyperparameters (number of clusters, semantic-to-geometric feature ratio), and include direct comparisons against both full 3D reconstruction pipelines and standard 2D pixel-based VLM inputs. Sequence-length statistics will also be reported to quantify the efficiency advantage alongside accuracy metrics. revision: yes
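The multi-view consistency ablation promised in the first response could be scored, for instance, with a plain reprojection error between proxy centers and their per-frame observations; the pinhole model below and the availability of per-frame camera intrinsics and poses are assumptions, not details stated in the rebuttal.

```python
# Illustrative consistency metric; assumes known camera intrinsics K and pose (R, t).
import numpy as np

def reprojection_error(proxy_xyz, observed_uv, K, R, t):
    """proxy_xyz: (M, 3) proxy centers in world coordinates.
    observed_uv: (M, 2) where those proxies are seen in one frame, in pixels.
    K: (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera transform.
    Returns the mean pixel distance between projected and observed locations."""
    cam = proxy_xyz @ R.T + t            # world -> camera coordinates
    proj = cam @ K.T                     # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3]      # perspective divide
    return float(np.linalg.norm(uv - observed_uv, axis=1).mean())
```

Averaging this error over all frames, before and after clustering, would give the before/after comparison the response commits to.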

Circularity Check

0 steps flagged

Empirical pipeline with no load-bearing circular reductions

full rationale

The described method consists of independent per-frame encoding via semantic and geometric encoders, followed by semantic-aware clustering to form proxies, curation of a new SpaceSpan dataset, and multi-stage training for VLM alignment. No equations, fitted parameters renamed as predictions, self-definitional constructs, or uniqueness theorems imported via self-citation are present in the provided text. Performance claims rest on experimental results against external benchmarks rather than any constructed equivalence to inputs, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented physical entities; the proxies are a representational construct rather than a new postulated object with independent evidence.

pith-pipeline@v0.9.0 · 5513 in / 1090 out tokens · 45523 ms · 2026-05-11T01:52:20.796104+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 6 internal anchors

  [1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D question answering for spatial scene understanding. In CVPR, 2022.
  [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv…
  [3] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In NeurIPS, 2021.
  [4] Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal LLMs on spatial intelligence. arXiv:2508.13142, 2025.
  [5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024.
  [6] Dave Zhenyu Chen, Angel X. Chang, and Matthias Nießner. ScanRefer: 3D object localization in RGB-D scans using natural language. In ECCV, 2020.
  [7] Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, and Yufeng Yue. PointGPT: Auto-regressively generative pre-training from point clouds. In NeurIPS, 2023.
  [8] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning, and planning. In CVPR, 2024.
  [9] Zhenyu Chen, Ali Gholami, Matthias Niessner, and Angel X. Chang. Scan2Cap: Context-aware dense captioning in RGB-D scans. In CVPR, 2021.
  [10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.
  [11] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024.
  [12] An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, and Sifei Liu. 3D aware region prompted vision language model. In ICLR, 2026.
  [13] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
  [14] Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction. In CVPR, 2026.
  [15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, 2016.
  [16] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3D world. In ICML, 2024.
  [17] Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, and Siyuan Huang. LEO-VL: Towards 3D vision-language generalists via data scaling with efficient representation. arXiv:2506.09935, 2025.
  [18] Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. MLLMs need 3D-aware representation supervision for scene understanding. In NeurIPS, 2025.
  [19] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv:2410.21276, 2024.
  [20] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action Genome: Actions as compositions of spatio-temporal scene graphs. In CVPR, 2020.
  [21] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research.
  [22] Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, and Samy Bengio. Learnable Fourier features for multi-dimensional spatial positional encoding. In NeurIPS, 2021.
  [23] Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. In CVPR, 2025.
  [24] Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, and Yuzheng Zhuang. SpatialCoT: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv:2501.10…
  [25] Zhijian Liu, Xinyu Yang, Haotian Tang, Shang Yang, and Song Han. FlatFormer: Flattened window attention for efficient point cloud transformer. In CVPR, 2023.
  [26] Ruiyuan Lyu, Jingli Lin, Tai Wang, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, and Jiangmiao Pang. MMScan: A multi-modal 3D scene dataset with hierarchical grounded language annotations. In NeurIPS, 2024.
  [27] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated question answering in 3D scenes. In NeurIPS, 2023.
  [28] Varun K. Nagaraja, Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
  [29] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv:2504.01805, 2025.
  [30] Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. ShapeLLM: Universal 3D object understanding for embodied interaction. In ECCV, 2024.
  [31] Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D scenes from videos with vision-language models. In ICLR, 2026.
  [32] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In ICLR, 2025.
  [33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  [34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
  [35] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
  [36] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.
  [37] Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. SplatTalk: 3D VQA with Gaussian splatting. In ICCV, 2025.
  [38] Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3D: Reconstructive visual instruction tuning with 3D-awareness. In ICCV.
  [39] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.
  [40] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191, 2024.
  [41] Peng-Shuai Wang. OctFormer: Octree-based transformers for 3D point clouds. ACM Trans. Graph., 2023.
  [42] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In NeurIPS, 2025.
  [43] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, faster, stronger. In CVPR, 2024.
  [44] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering large language models to understand point clouds. In ECCV, 2024.
  [45] Jintang Xue, Ganning Zhao, Jie-En Yao, Hong-En Chen, Yue Hu, Meida Chen, Suya You, and C.-C. Jay Kuo. Descrip3D: Enhancing large language model-based 3D scene understanding with object-level text descriptions. In WACV, 2026.
  [46] Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In CVPR, 2025.
  [47] Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H. S. Torr. Language-aware vision transformer for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2024.
  [48] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In CVPR, 2023.
  [49] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In ECCV, 2016.
  [50] Tatiana Zemskova and Dmitry Yudin. 3DGraphLLM: Combining semantic graphs and large language models for 3D scene understanding. In ICCV, 2025.
  [51] Jiawei Zhang, Chejian Xu, and Bo Li. ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles. In CVPR, 2024.
  [52] Yiming Zhang, ZeMing Gong, and Angel X. Chang. Multi3DRefer: Grounding text description to multiple 3D objects. In ICCV, 2023.
  [53] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025.
  [54] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3D LLM: Learning position-aware video representation for 3D scene understanding. In CVPR, 2025.
  [55] Rong Zhou and Eric A. Hansen. Breadth-first heuristic search. Artificial Intelligence, 2006.
  [56] Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness. In ICCV, 2025.
  [57] Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, and Qing Li. Unifying 3D vision-language understanding via promptable queries. In ECCV, 2024.