pith. machine review for the scientific record.

arxiv: 2605.08064 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D representations · Vision-language models · Semantic clustering · Proxy representations · Spatial intelligence · Visual question answering · Video processing · 3D visual grounding

The pith

Proxy3D creates compact 3D proxies from video frames via semantic clustering to let VLMs handle spatial tasks efficiently with shorter sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix how vision-language models process 3D scenes by moving away from long pixel sequences that either break spatial consistency or serialize data inefficiently. It extracts semantic and geometric features from ordinary video frames, then clusters them into a smaller set of 3D proxies that still carry the necessary scene structure. These proxies are aligned to the language model through a custom dataset and staged training. The result is competitive or leading performance on 3D visual question answering, grounding, and spatial reasoning benchmarks while using fewer vision tokens than standard approaches.
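To make the sequence-length argument concrete, here is a minimal sketch of the token arithmetic; the frame count, patch grid, and proxy budget are hypothetical placeholders, not numbers reported in the paper.

```python
# Hypothetical token budget; all numbers are illustrative, not from the paper.
frames = 32               # sampled video frames
patches_per_frame = 196   # e.g. a 14 x 14 patch grid from a 2D visual encoder
num_proxies = 256         # compact proxy set after semantic-aware clustering

pixel_aligned_tokens = frames * patches_per_frame   # 6272 vision tokens
proxy_tokens = num_proxies                           # 256 vision tokens
print(f"~{pixel_aligned_tokens / proxy_tokens:.1f}x fewer vision tokens")  # ~24.5x
```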

Core claim

Given only video frames, semantic and geometric encoders extract scene features that undergo semantic-aware clustering to produce a compact set of 3D proxies; these proxies are then aligned with a vision-language model through the SpaceSpan dataset and multi-stage training, yielding competitive or state-of-the-art results on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks even when vision sequences are kept short.

What carries the argument

Semantic-aware clustering of features extracted from video frames, yielding a compact set of 3D proxies that serve as the vision input to the VLM.
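A minimal sketch of one way such a clustering step could produce proxies, assuming k-means over per-patch semantic features and per-cluster averaging of predicted 3D positions; the paper's actual encoders, clustering objective, and feature fusion are not reproduced here, so the shapes and helper below are assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: k-means stands in for "semantic-aware clustering".
import numpy as np
from sklearn.cluster import KMeans

def build_proxies(sem_feats, xyz, k=256):
    """sem_feats: (T, D) per-patch semantic features pooled over all frames.
    xyz: (T, 3) predicted 3D coordinates for the same patches.
    Returns (k, D + 3) proxies: mean semantic feature plus mean 3D position."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sem_feats)
    proxies = np.zeros((k, sem_feats.shape[1] + 3), dtype=np.float32)
    for c in range(k):
        members = labels == c
        proxies[c, :-3] = sem_feats[members].mean(axis=0)  # semantic identity
        proxies[c, -3:] = xyz[members].mean(axis=0)        # geometric position
    return proxies

# Toy usage: 32 frames x 196 patches -> 6272 patch tokens compressed to 256 proxies.
feats = np.random.randn(6272, 768).astype(np.float32)
coords = np.random.rand(6272, 3).astype(np.float32)
proxy_tokens = build_proxies(feats, coords)   # shape (256, 771)
```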

If this is right

  • VLMs can maintain spatial consistency across 3D scenes while processing shorter vision sequences than pixel-aligned or full-geometry methods.
  • Video-only inputs become sufficient for high-performing 3D visual question answering and grounding without explicit 3D reconstruction.
  • Multi-stage alignment on the SpaceSpan dataset transfers the proxy representations effectively into existing VLM architectures.
  • General spatial intelligence benchmarks improve because the proxies encode both semantic identity and geometric relations in one compact form.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering step might allow VLMs to ingest live video streams for real-time spatial tasks with lower latency.
  • Proxy representations could be combined with token-reduction techniques already used in 2D VLMs to push efficiency further.
  • Testing the proxies on longer or more dynamic video sequences would reveal whether fine-grained motion details survive the clustering step.
  • The approach opens a route to 3D-aware VLMs that avoid the heavy compute of full point-cloud or mesh inputs.

Load-bearing premise

Semantic-aware clustering of features extracted from video frames produces 3D proxies that retain enough spatial and semantic information for effective VLM alignment without critical loss of detail or consistency.

What would settle it

A controlled benchmark run where the same 3D VQA and grounding tasks are evaluated once with the clustered proxies and once with the original un-clustered feature sequences; a clear drop below competitive levels for the proxy version would falsify the central claim.
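A sketch of the shape such a head-to-head run could take, with the VLM scorer and the two token extractors injected as callables; every name below is a placeholder standing in for the paper's pipeline, not an actual API.

```python
# Hypothetical A/B harness for the falsification test above; all callables are placeholders.

def compare_representations(score_fn, extractors, scenes):
    """score_fn(tokens, scene) -> accuracy in [0, 1] on that scene's questions.
    extractors: {name: frames -> list of vision tokens}.
    Returns {name: (mean accuracy, mean vision-sequence length)}."""
    results = {}
    for name, extract in extractors.items():
        accs, lengths = [], []
        for scene in scenes:
            tokens = extract(scene["frames"])
            accs.append(score_fn(tokens, scene))
            lengths.append(len(tokens))
        results[name] = (sum(accs) / len(accs), sum(lengths) / len(lengths))
    return results

# Dummy run that only demonstrates the comparison's shape, not real numbers.
scenes = [{"frames": list(range(32))} for _ in range(4)]
extractors = {
    "raw_features": lambda frames: [0] * (len(frames) * 196),  # pixel-aligned tokens
    "clustered_proxies": lambda frames: [0] * 256,             # compact proxies
}
print(compare_representations(lambda tokens, scene: 0.0, extractors, scenes))
```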

Figures

Figures reproduced from arXiv: 2605.08064 by Denis Gudovskiy, Haowen Sun, Jerry Jiang, Kurt Keutzer, Tomoyuki Okuno, Wenzhao Zheng, Yohei Nakata.

Figure 1. Overview of Proxy3D: our 3D proxy representations are extracted from a set of pretrained encoders, their sequence length is compressed by the semantic-aware clustering, followed by the multi-stage alignment with a language model using our SpaceSpan dataset.
Figure 2. Proxy3D architecture. A geometry predictor and a semantic encoder output latent features of the vision modality. Then, our proxy 3D representations are clustered to reduce complexity. Lastly, multi-stage training aligns proxy features with the language model.
Figure 3. Proxy3D multi-stage training. Each stage in our progressive iterative training aims to develop a certain spatial intelligence skill, from the easiest one to more complex ones: we begin with simplified image-text alignment and progress to actual images with spatial reasoning.
Figure 4. Coordinate alignment stage helps an MLLM to precisely …
Figure 5. Proxy3D performance on VSI-Bench [46]. Left is on Scannet++ [48], right is on ARKitScenes [3]. Proxy3D generalizes well on unseen scenes, and is capable of solving difficult questions.
Figure 6. Comparison of VSI-Bench tasks and splits.
Figure 7. Ablation study on VSI-Bench's Scannet [13] split. Proxy3D outperforms the base Qwen2-VL-7B and GPT4Scene by a large margin in object counting, size and distance estimation. Coordinate alignment (CA) and longer sequences further increase metrics.
Original abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Proxy3D, a method that takes video frames as input, extracts features via separate semantic and geometric encoders, applies semantic-aware clustering to produce compact 3D proxy representations, curates the SpaceSpan dataset, and performs multi-stage training to align the proxies with a VLM. It claims that these shorter proxy sequences enable competitive or state-of-the-art results on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.

Significance. If the performance claims hold under rigorous evaluation, the approach could offer a practical middle ground between inefficient full 3D reconstructions and spatially inconsistent 2D pipelines, improving scalability for VLMs that require spatial reasoning.

major comments (2)
  1. [Method] Method section (clustering step): semantic-aware clustering is performed on independently encoded per-frame features without explicit incorporation of multi-view geometric constraints such as epipolar consistency, depth lifting, or 3D point regularization. This risks producing semantically coherent but spatially inconsistent proxies, directly threatening the central claim that the proxies retain sufficient 3D structure for spatial tasks.
  2. [Experiments] Experiments section: the headline claim of competitive or SOTA performance on 3D VQA, grounding, and spatial benchmarks is asserted without reported quantitative results, baselines, error bars, ablation studies on clustering parameters, or comparisons to full 3D or pixel-based alternatives in the evaluated sections; this prevents assessment of whether the efficiency gain comes at the cost of spatial accuracy.
minor comments (2)
  1. [Introduction] Introduction: the distinction between correspondence-based and representation-based models would benefit from one or two concrete citations to prior work to ground the motivation.
  2. [Dataset] Dataset section: provide details on how SpaceSpan was curated and its size/statistics to allow reproducibility assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, outlining our planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method] Method section (clustering step): semantic-aware clustering is performed on independently encoded per-frame features without explicit incorporation of multi-view geometric constraints such as epipolar consistency, depth lifting, or 3D point regularization. This risks producing semantically coherent but spatially inconsistent proxies, directly threatening the central claim that the proxies retain sufficient 3D structure for spatial tasks.

    Authors: We appreciate the referee highlighting this aspect of the clustering procedure. The geometric encoder is specifically designed to extract 3D-aware features from the input video frames, and the resulting proxies are explicitly located in 3D space. Nevertheless, we acknowledge that the current description does not detail explicit multi-view constraints during clustering. In the revised manuscript we will expand the Method section with a clearer explanation of how the geometric encoder contributes to spatial consistency, add a dedicated paragraph on the 3D positioning of proxies, and include an ablation study that quantifies multi-view consistency (e.g., reprojection error and depth alignment metrics) before and after clustering; a sketch of one such metric appears after these responses. These additions will directly support the claim that the proxies preserve sufficient 3D structure for downstream spatial tasks. revision: partial

  2. Referee: [Experiments] Experiments section: the headline claim of competitive or SOTA performance on 3D VQA, grounding, and spatial benchmarks is asserted without reported quantitative results, baselines, error bars, ablation studies on clustering parameters, or comparisons to full 3D or pixel-based alternatives in the evaluated sections; this prevents assessment of whether the efficiency gain comes at the cost of spatial accuracy.

    Authors: We apologize for any lack of clarity in the presentation of results. The manuscript already contains quantitative evaluations on 3D VQA, visual grounding, and spatial-intelligence benchmarks together with baseline comparisons. To address the referee’s concerns comprehensively, we will reorganize and expand the Experiments section to place all headline results in the main body with full tables, report standard deviations across multiple runs, add ablations on clustering hyperparameters (number of clusters, semantic-to-geometric feature ratio), and include direct comparisons against both full 3D reconstruction pipelines and standard 2D pixel-based VLM inputs. Sequence-length statistics will also be reported to quantify the efficiency advantage alongside accuracy metrics. revision: yes
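The multi-view consistency ablation promised in the first response could be scored, for instance, with a plain reprojection error between proxy centers and their per-frame observations; the pinhole model below and the availability of per-frame camera intrinsics and poses are assumptions, not details stated in the rebuttal.

```python
# Illustrative consistency metric; assumes known camera intrinsics K and pose (R, t).
import numpy as np

def reprojection_error(proxy_xyz, observed_uv, K, R, t):
    """proxy_xyz: (M, 3) proxy centers in world coordinates.
    observed_uv: (M, 2) where those proxies are seen in one frame, in pixels.
    K: (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera transform.
    Returns the mean pixel distance between projected and observed locations."""
    cam = proxy_xyz @ R.T + t            # world -> camera coordinates
    proj = cam @ K.T                     # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3]      # perspective divide
    return float(np.linalg.norm(uv - observed_uv, axis=1).mean())
```

Averaging this error over all frames, before and after clustering, would give the before/after comparison the response commits to.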

Circularity Check

0 steps flagged

Empirical pipeline with no load-bearing circular reductions

full rationale

The described method consists of independent per-frame encoding via semantic and geometric encoders, followed by semantic-aware clustering to form proxies, curation of a new SpaceSpan dataset, and multi-stage training for VLM alignment. No equations, fitted parameters renamed as predictions, self-definitional constructs, or uniqueness theorems imported via self-citation are present in the provided text. Performance claims rest on experimental results against external benchmarks rather than any constructed equivalence to inputs, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented physical entities; the proxies are a representational construct rather than a new postulated object with independent evidence.

pith-pipeline@v0.9.0 · 5513 in / 1090 out tokens · 45523 ms · 2026-05-11T01:52:20.796104+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 6 internal anchors

  [1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D question answering for spatial scene understanding. In CVPR, 2022.
  [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv…
  [3] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In NeurIPS, 2021.
  [4] Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal LLMs on spatial intelligence. arXiv:2508.13142, 2025.
  [5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024.
  [6] Dave Zhenyu Chen, Angel X. Chang, and Matthias Nießner. ScanRefer: 3D object localization in RGB-D scans using natural language. In ECCV, 2020.
  [7] Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, and Yufeng Yue. PointGPT: Auto-regressively generative pre-training from point clouds. In NeurIPS, 2023.
  [8] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning, and planning. In CVPR, 2024.
  [9] Zhenyu Chen, Ali Gholami, Matthias Niessner, and Angel X. Chang. Scan2Cap: Context-aware dense captioning in RGB-D scans. In CVPR, 2021.
  [10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.
  [11] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024.
  [12] An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, and Sifei Liu. 3D aware region prompted vision language model. In ICLR, 2026.
  [13] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
  [14] Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction. In CVPR, 2026.
  [15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, 2016.
  [16] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3D world. In ICML, 2024.
  [17] Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, and Siyuan Huang. LEO-VL: Towards 3D vision-language generalists via data scaling with efficient representation. arXiv:2506.09935, 2025.
  [18] Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. MLLMs need 3D-aware representation supervision for scene understanding. In NeurIPS, 2025.
  [19] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv:2410.21276, 2024.
  [20] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action Genome: Actions as compositions of spatio-temporal scene graphs. In CVPR, 2020.
  [21] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research.
  [22] Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, and Samy Bengio. Learnable Fourier features for multi-dimensional spatial positional encoding. In NeurIPS, 2021.
  [23] Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. In CVPR, 2025.
  [24] Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, and Yuzheng Zhuang. SpatialCoT: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv:2501.10…
  [25] Zhijian Liu, Xinyu Yang, Haotian Tang, Shang Yang, and Song Han. FlatFormer: Flattened window attention for efficient point cloud transformer. In CVPR, 2023.
  [26] Ruiyuan Lyu, Jingli Lin, Tai Wang, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, and Jiangmiao Pang. MMScan: A multi-modal 3D scene dataset with hierarchical grounded language annotations. In NeurIPS, 2024.
  [27] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated question answering in 3D scenes. In NeurIPS, 2023.
  [28] Varun K. Nagaraja, Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
  [29] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv:2504.01805, 2025.
  [30] Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. ShapeLLM: Universal 3D object understanding for embodied interaction. In ECCV, 2024.
  [31] Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D scenes from videos with vision-language models. In ICLR, 2026.
  [32] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In ICLR, 2025.
  [33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  [34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
  [35] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
  [36] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.
  [37] Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. SplatTalk: 3D VQA with Gaussian splatting. In ICCV, 2025.
  [38] Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3D: Reconstructive visual instruction tuning with 3D-awareness. In ICCV.
  [39] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.
  [40] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191, 2024.
  [41] Peng-Shuai Wang. OctFormer: Octree-based transformers for 3D point clouds. ACM Trans. Graph., 2023.
  [42] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In NeurIPS, 2025.
  [43] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, faster, stronger. In CVPR, 2024.
  [44] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering large language models to understand point clouds. In ECCV, 2024.
  [45] Jintang Xue, Ganning Zhao, Jie-En Yao, Hong-En Chen, Yue Hu, Meida Chen, Suya You, and C.-C. Jay Kuo. Descrip3D: Enhancing large language model-based 3D scene understanding with object-level text descriptions. In WACV, 2026.
  [46] Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In CVPR, 2025.
  [47] Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H. S. Torr. Language-aware vision transformer for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2024.
  [48] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In CVPR, 2023.
  [49] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In ECCV, 2016.
  [50] Tatiana Zemskova and Dmitry Yudin. 3DGraphLLM: Combining semantic graphs and large language models for 3D scene understanding. In ICCV, 2025.
  [51] Jiawei Zhang, Chejian Xu, and Bo Li. ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles. In CVPR, 2024.
  [52] Yiming Zhang, ZeMing Gong, and Angel X. Chang. Multi3DRefer: Grounding text description to multiple 3D objects. In ICCV, 2023.
  [53] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025.
  [54] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3D LLM: Learning position-aware video representation for 3D scene understanding. In CVPR, 2025.
  [55] Rong Zhou and Eric A. Hansen. Breadth-first heuristic search. Artificial Intelligence, 2006.
  [56] Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness. In ICCV, 2025.
  [57] Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, and Qing Li. Unifying 3D vision-language understanding via promptable queries. In ECCV, 2024.