Recognition: 2 theorem links
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3
The pith
Proxy3D compresses video frames into a compact set of 3D proxies via semantic-aware clustering, letting VLMs handle spatial tasks efficiently with much shorter vision sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given only video frames, semantic and geometric encoders extract scene features that undergo semantic-aware clustering to produce a compact set of 3D proxies; these proxies are then aligned with a vision-language model through the SpaceSpan dataset and multi-stage training, yielding competitive or state-of-the-art results on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks even when vision sequences are kept short.
What carries the argument
Semantic-aware clustering of features from video frames that yields a compact set of 3D proxies serving as the vision input to the VLM.
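The clustering step the claim rests on can be illustrated with a minimal sketch. The shapes, names, and mean-pooling rule here are assumptions for illustration; the paper's actual encoders and cluster criterion are not specified in this review. Per-frame features carrying a semantic label and a lifted 3D position are grouped by label, and each group is pooled into a single proxy.

```python
import numpy as np

def build_proxies(features, positions, labels):
    """Pool per-point features into one proxy per semantic label.

    features:  (N, D) array of encoder features
    positions: (N, 3) array of lifted 3D positions
    labels:    (N,) array of integer semantic labels
    Returns a dict mapping label -> (mean feature, mean 3D position).
    """
    proxies = {}
    for lab in np.unique(labels):
        mask = labels == lab
        proxies[int(lab)] = (
            features[mask].mean(axis=0),   # semantic summary of the group
            positions[mask].mean(axis=0),  # 3D centroid of the group
        )
    return proxies

# Toy example: 6 points, 4-dim features, 2 semantic labels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
pos = rng.normal(size=(6, 3))
labs = np.array([0, 0, 1, 1, 1, 0])
proxies = build_proxies(feats, pos, labs)
print(len(proxies))  # 2 proxies instead of 6 per-point tokens
```

With N points collapsed into one token per semantic group, the vision sequence handed to the VLM shrinks from N to the number of distinct labels, which is the efficiency argument in miniature.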
If this is right
- VLMs can maintain spatial consistency across 3D scenes while processing shorter vision sequences than pixel-aligned or full-geometry methods.
- Video-only inputs become sufficient for high-performing 3D visual question answering and grounding without explicit 3D reconstruction.
- Multi-stage alignment on the SpaceSpan dataset transfers the proxy representations effectively into existing VLM architectures.
- General spatial intelligence benchmarks improve because the proxies encode both semantic identity and geometric relations in one compact form.
Where Pith is reading between the lines
- The same clustering step might allow VLMs to ingest live video streams for real-time spatial tasks with lower latency.
- Proxy representations could be combined with token-reduction techniques already used in 2D VLMs to push efficiency further.
- Testing the proxies on longer or more dynamic video sequences would reveal whether fine-grained motion details survive the clustering step.
- The approach opens a route to 3D-aware VLMs that avoid the heavy compute of full point-cloud or mesh inputs.
Load-bearing premise
Semantic-aware clustering of features extracted from video frames produces 3D proxies that retain enough spatial and semantic information for effective VLM alignment without critical loss of detail or consistency.
What would settle it
A controlled benchmark run where the same 3D VQA and grounding tasks are evaluated once with the clustered proxies and once with the original un-clustered feature sequences; a clear drop below competitive levels for the proxy version would falsify the central claim.
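The settling experiment above can be sketched as a paired evaluation loop. Here `evaluate`, the task names, and the two-point margin are hypothetical placeholders, not the paper's protocol:

```python
def paired_ablation(evaluate, tasks, margin=0.02):
    """Compare proxy vs. un-clustered vision inputs on the same tasks.

    evaluate(task, mode) -> accuracy in [0, 1], where mode is
    "proxies" or "raw_features". Returns per-task accuracy deltas
    and whether the proxy version stays within `margin` of raw.
    """
    deltas = {}
    for task in tasks:
        proxy_acc = evaluate(task, "proxies")
        raw_acc = evaluate(task, "raw_features")
        deltas[task] = proxy_acc - raw_acc
    # Falsification criterion: any drop beyond the margin fails.
    competitive = all(d >= -margin for d in deltas.values())
    return deltas, competitive

# Stub scores standing in for real 3D VQA / grounding runs.
scores = {("vqa", "proxies"): 0.61, ("vqa", "raw_features"): 0.60,
          ("grounding", "proxies"): 0.48, ("grounding", "raw_features"): 0.52}
deltas, ok = paired_ablation(lambda t, m: scores[(t, m)],
                             ["vqa", "grounding"], margin=0.02)
print(deltas, ok)
```

In this stubbed run the grounding drop exceeds the margin, so the harness would flag the proxy representation as failing the test.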
Original abstract
Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Proxy3D, a method that takes video frames as input, extracts features via separate semantic and geometric encoders, applies semantic-aware clustering to produce compact 3D proxy representations, curates the SpaceSpan dataset, and performs multi-stage training to align the proxies with a VLM. It claims that these shorter proxy sequences enable competitive or state-of-the-art results on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.
Significance. If the performance claims hold under rigorous evaluation, the approach could offer a practical middle ground between inefficient full 3D reconstructions and spatially inconsistent 2D pipelines, improving scalability for VLMs that require spatial reasoning.
Major comments (2)
- [Method] Method section (clustering step): semantic-aware clustering is performed on independently encoded per-frame features without explicit incorporation of multi-view geometric constraints such as epipolar consistency, depth lifting, or 3D point regularization. This risks producing semantically coherent but spatially inconsistent proxies, directly threatening the central claim that the proxies retain sufficient 3D structure for spatial tasks.
- [Experiments] Experiments section: the headline claim of competitive or SOTA performance on 3D VQA, grounding, and spatial benchmarks is asserted without reported quantitative results, baselines, error bars, ablation studies on clustering parameters, or comparisons to full 3D or pixel-based alternatives in the evaluated sections; this prevents assessment of whether the efficiency gain comes at the cost of spatial accuracy.
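The missing ablation this comment asks for amounts to a grid sweep over the clustering hyperparameters. A minimal sketch with a stub scoring function follows; the parameter names and the score surface are invented for illustration:

```python
from itertools import product

def sweep(evaluate, n_clusters_opts, ratio_opts):
    """Evaluate every (number of clusters, semantic-to-geometric
    feature ratio) combination and return the best setting."""
    results = {(k, r): evaluate(k, r)
               for k, r in product(n_clusters_opts, ratio_opts)}
    best = max(results, key=results.get)
    return results, best

# Stub score standing in for a real benchmark run.
results, best = sweep(lambda k, r: 0.5 + 0.01 * k - 0.1 * abs(r - 0.5),
                      [4, 8, 16], [0.25, 0.5, 0.75])
print(best)  # → (16, 0.5)
```

Reporting the full `results` grid, rather than only the best cell, is what would let readers judge the accuracy-versus-sequence-length trade-off.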
Minor comments (2)
- [Introduction] Introduction: the distinction between correspondence-based and representation-based models would benefit from one or two concrete citations to prior work to ground the motivation.
- [Dataset] Dataset section: provide details on how SpaceSpan was curated and its size/statistics to allow reproducibility assessment.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, outlining our planned revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Method] Method section (clustering step): semantic-aware clustering is performed on independently encoded per-frame features without explicit incorporation of multi-view geometric constraints such as epipolar consistency, depth lifting, or 3D point regularization. This risks producing semantically coherent but spatially inconsistent proxies, directly threatening the central claim that the proxies retain sufficient 3D structure for spatial tasks.
Authors: We appreciate the referee highlighting this aspect of the clustering procedure. The geometric encoder is specifically designed to extract 3D-aware features from the input video frames, and the resulting proxies are explicitly located in 3D space. Nevertheless, we acknowledge that the current description does not detail explicit multi-view constraints during clustering. In the revised manuscript we will expand the Method section with a clearer explanation of how the geometric encoder contributes to spatial consistency, add a dedicated paragraph on the 3D positioning of proxies, and include an ablation study that quantifies multi-view consistency (e.g., reprojection error and depth alignment metrics) before and after clustering. These additions will directly support the claim that the proxies preserve sufficient 3D structure for downstream spatial tasks. revision: partial
-
Referee: [Experiments] Experiments section: the headline claim of competitive or SOTA performance on 3D VQA, grounding, and spatial benchmarks is asserted without reported quantitative results, baselines, error bars, ablation studies on clustering parameters, or comparisons to full 3D or pixel-based alternatives in the evaluated sections; this prevents assessment of whether the efficiency gain comes at the cost of spatial accuracy.
Authors: We apologize for any lack of clarity in the presentation of results. The manuscript already contains quantitative evaluations on 3D VQA, visual grounding, and spatial-intelligence benchmarks together with baseline comparisons. To address the referee’s concerns comprehensively, we will reorganize and expand the Experiments section to place all headline results in the main body with full tables, report standard deviations across multiple runs, add ablations on clustering hyperparameters (number of clusters, semantic-to-geometric feature ratio), and include direct comparisons against both full 3D reconstruction pipelines and standard 2D pixel-based VLM inputs. Sequence-length statistics will also be reported to quantify the efficiency advantage alongside accuracy metrics. revision: yes
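One concrete form of the multi-view consistency metric promised in the first response is the reprojection error of proxy centroids. This sketch uses a hypothetical pinhole camera; the intrinsics, pose, and observed pixel are made-up numbers, not values from the paper:

```python
import numpy as np

def reprojection_error(point_3d, K, R, t, observed_px):
    """Project a 3D proxy centroid into a view and compare it with
    the observed pixel location; returns the error in pixels."""
    cam = R @ point_3d + t   # world -> camera coordinates
    uvw = K @ cam            # camera -> homogeneous pixel coordinates
    px = uvw[:2] / uvw[2]    # perspective divide
    return float(np.linalg.norm(px - observed_px))

# Hypothetical camera and observation.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pt = np.array([0.1, -0.2, 2.0])   # proxy centroid in front of the camera
obs = np.array([345.0, 190.0])
err = reprojection_error(pt, K, R, t, obs)
print(round(err, 2))  # → 0.0
```

Averaging this error over all views that see a proxy gives a per-proxy consistency score, which is the kind of quantity the promised ablation could tabulate before and after clustering.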
Circularity Check
Empirical pipeline with no load-bearing circular reductions
Full rationale
The described method consists of independent per-frame encoding via semantic and geometric encoders, followed by semantic-aware clustering to form proxies, curation of a new SpaceSpan dataset, and multi-stage training for VLM alignment. No equations, fitted parameters renamed as predictions, self-definitional constructs, or uniqueness theorems imported via self-citation are present in the provided text. Performance claims rest on experimental results against external benchmarks rather than any constructed equivalence to inputs, satisfying the criteria for a self-contained empirical contribution.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
"we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space"
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
"To reduce the sequence length for computational efficiency, we propose to group the former triplets based on their semantic labels"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D question answering for spatial scene understanding. In CVPR, 2022.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv, 2025.
- [3] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In NeurIPS, 2021.
- [4] Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Holistic evaluation of multimodal LLMs on spatial intelligence. arXiv:2508.13142, 2025.
- [5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024.
- [6] Dave Zhenyu Chen, Angel X. Chang, and Matthias Nießner. ScanRefer: 3D object localization in RGB-D scans using natural language. In ECCV, 2020.
- [7] Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, and Yufeng Yue. PointGPT: Auto-regressively generative pre-training from point clouds. In NeurIPS, 2023.
- [8] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning and planning. In CVPR, 2024.
- [9] Zhenyu Chen, Ali Gholami, Matthias Niessner, and Angel X. Chang. Scan2Cap: Context-aware dense captioning in RGB-D scans. In CVPR, 2021.
- [10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.
- [11] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024.
- [12] An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, and Sifei Liu. 3D aware region prompted vision language model. In ICLR, 2026.
- [13] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
- [14] Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction. In CVPR, 2026.
- [15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, 2016.
- [16] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3D world. In ICML, 2024.
- [17] Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, and Siyuan Huang. LEO-VL: Towards 3D vision-language generalists via data scaling with efficient representation. arXiv:2506.09935, 2025.
- [18] Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. MLLMs need 3D-aware representation supervision for scene understanding. In NeurIPS, 2025.
- [19] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv:2410.21276, 2024.
- [20] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. In CVPR, 2020.
- [21] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research.
- [22] Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, and Samy Bengio. Learnable Fourier features for multi-dimensional spatial positional encoding. In NeurIPS, 2021.
- [23] Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. In CVPR, 2025.
- [24] Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, and Yuzheng Zhuang. SpatialCoT: Advancing spatial reasoning through coordinate alignment and chain-of-thought for embodied task planning. arXiv:2501.10...
- [25] Zhijian Liu, Xinyu Yang, Haotian Tang, Shang Yang, and Song Han. FlatFormer: Flattened window attention for efficient point cloud transformer. In CVPR, 2023.
- [26] Ruiyuan Lyu, Jingli Lin, Tai Wang, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, and Jiangmiao Pang. MMScan: A multi-modal 3D scene dataset with hierarchical grounded language annotations. In NeurIPS, 2024.
- [27] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated question answering in 3D scenes. In NeurIPS, 2023.
- [28] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
- [29] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv:2504.01805, 2025.
- [30] Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. ShapeLLM: Universal 3D object understanding for embodied interaction. In ECCV, 2024.
- [31] Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D scenes from videos with vision-language models. In ICLR, 2026.
- [32] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In ICLR, 2025.
- [33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- [34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
- [35] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
- [36] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.
- [37] Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. SplatTalk: 3D VQA with Gaussian splatting. In ICCV, 2025.
- [38] Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3D: Reconstructive visual instruction tuning with 3D-awareness. In ICCV, 2025.
- [39] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.
- [40] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191, 2024.
- [41] Peng-Shuai Wang. OctFormer: Octree-based transformers for 3D point clouds. ACM Trans. Graph., 2023.
- [42] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. In NeurIPS, 2025.
- [43] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, faster, stronger. In CVPR, 2024.
- [44] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering large language models to understand point clouds. In ECCV, 2024.
- [45] Jintang Xue, Ganning Zhao, Jie-En Yao, Hong-En Chen, Yue Hu, Meida Chen, Suya You, and C.-C. Jay Kuo. Descrip3D: Enhancing large language model-based 3D scene understanding with object-level text descriptions. In WACV, 2026.
- [46] Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In CVPR, 2025.
- [47] Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Language-aware vision transformer for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2024.
- [48] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In CVPR, 2023.
- [49] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In ECCV, 2016.
- [50] Tatiana Zemskova and Dmitry Yudin. 3DGraphLLM: Combining semantic graphs and large language models for 3D scene understanding. In ICCV, 2025.
- [51] Jiawei Zhang, Chejian Xu, and Bo Li. ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles. In CVPR, 2024.
- [52] Yiming Zhang, ZeMing Gong, and Angel X Chang. Multi3DRefer: Grounding text description to multiple 3D objects. In ICCV, 2023.
- [53] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025.
- [54] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3D LLM: Learning position-aware video representation for 3D scene understanding. In CVPR, 2025.
- [55] Rong Zhou and Eric A Hansen. Breadth-first heuristic search. Artificial Intelligence, 2006.
- [56] Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness. In ICCV, 2025.
- [57] Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, and Qing Li. Unifying 3D vision-language understanding via promptable queries. In ECCV, 2024.