SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

Jiajie Xu; Jiawei Li; Long Chen; Weijie Shi; Xiaofang Zhou; Ziyi Liu

arxiv: 2605.28490 · v1 · pith:MOMYEIAFnew · submitted 2026-05-27 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 13:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords 3D object grounding · spatial reasoning · 3D-LLM · fine-grained grounding · latent reasoning steps · unified multimodal models · referential grounding

The pith

Latent spatial reasoning steps allow 3D-LLMs to refine object rankings step by step for fine-grained queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that unified 3D-LLMs can handle detailed 3D object grounding better by generating a sequence of latent spatial reasoning steps from the input query instead of making a single selection. These steps are then fed to a geometry-aware scorer that applies them in order, using masking to refine candidate rankings progressively from fixed object proposals. The approach trains on standard benchmark targets plus extra cue supervision but runs at inference with only the query and proposals. This structure is meant to preserve the model's ability to do dialog, QA, and captioning while improving performance when multiple same-class objects must be distinguished by relations and context. If the claim holds, breaking spatial decisions into ordered latent steps would make grounding more reliable in scenes with ambiguous references.

Core claim

Given fixed Mask3D object proposals, the LLM produces a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision together with auxiliary referential-cue supervision during training, yet inference requires only the input query and the Mask3D proposals.

What carries the argument

The sequence of latent spatial reasoning steps generated by the LLM, processed sequentially by the geometry-aware scorer to produce successive ranking refinements.
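
The refinement loop described above can be pictured with a small sketch. This is not the paper's scorer: the additive per-step logit updates, the function names, and the toy numbers are all assumptions made for illustration. It only shows the control flow the text describes, where steps are applied in order and steps beyond the active length are masked so they leave the candidate scores untouched.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine_rankings(init_logits, step_updates, num_active_steps):
    """Apply per-step updates in order and record a score trace.

    init_logits: one starting logit per object proposal.
    step_updates: K lists of additive logit updates, one list per step.
        (Hypothetical stand-in for whatever the geometry-aware scorer
        computes from a latent step; the paper's form is not given here.)
    num_active_steps: step length L <= K; steps at index >= L are masked.
    Returns a list of K probability distributions, one per step slot.
    """
    logits = list(init_logits)
    trace = []
    for k, update in enumerate(step_updates):
        if k < num_active_steps:
            # Active step: refine every candidate's score.
            logits = [s + u for s, u in zip(logits, update)]
        # Masked step: logits carry over unchanged.
        trace.append(softmax(logits))
    return trace


# Toy example: 3 proposals, K = 4 step slots, only the first 3 are active.
updates = [
    [1.0, 1.0, -1.0],   # step 1: rule out proposal 2
    [0.5, -1.0, 0.0],   # step 2: penalize proposal 1
    [1.0, 0.0, 0.0],    # step 3: favor proposal 0
    [-5.0, 5.0, 5.0],   # step 4: masked, so this update is ignored
]
trace = refine_rankings([0.0, 0.0, 0.0], updates, num_active_steps=3)
best = max(range(3), key=lambda i: trace[-1][i])
print("final top-1 proposal:", best)
```

In this toy setting the fourth slot reproduces the step-3 distribution exactly, which is the behavior the masking is meant to guarantee.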

If this is right

SSR3D-LLM records the highest scores among unified 3D-LLM methods on ReferIt3D, ScanRefer, and Multi3DRef.
It delivers large gains over single-pointer grounding on tasks that require ruling out multiple same-class candidates.
It improves results over earlier unified 3D-LLMs while leaving the standard language-task route unchanged.
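
The "unchanged language route" point rests on a simple dispatch: the paper's overview figure says instructions carrying a `<geom>` marker go to the grounding path, and everything else takes the default dialog/QA/caption path. The sketch below illustrates that dispatch only. The function name, the route labels, and the choice to strip the marker before grounding are assumptions, not details confirmed by the source.

```python
GEOM_TOKEN = "<geom>"  # marker named in the paper's overview figure


def route_instruction(instruction: str):
    """Pick a route for an instruction (hypothetical helper).

    Returns a (route, text) pair. Instructions containing the marker go
    to the grounding route with the marker removed; all others are passed
    through unchanged to the default language route.
    """
    if GEOM_TOKEN in instruction:
        # Drop the marker and normalize whitespace left behind.
        cleaned = " ".join(instruction.replace(GEOM_TOKEN, " ").split())
        return "grounding", cleaned
    return "language", instruction


print(route_instruction("<geom> the chair next to the door"))
print(route_instruction("Describe the room."))
```

Keeping the branch this thin is what lets grounding changes stay isolated from the language tasks: nothing on the default path sees the new machinery.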

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Exposing the latent steps could make the model's spatial decisions easier to inspect or debug after training.
The same step-wise refinement pattern might transfer to other sequential spatial tasks such as 3D navigation or scene editing.
Gains could increase further if the method were paired with learned or adaptive object proposals instead of fixed ones.

Load-bearing premise

That the fixed Mask3D object proposals supply enough information, and that the learned latent steps produce genuine step-by-step refinement at inference time rather than the model simply memorizing training patterns.

What would settle it

The claim would be undermined if performance on fine-grained grounding tasks fell to the level of the single-pointer baseline when evaluated on scenes with new object arrangements, or when object proposals came from a different detector than the one used in training.

Figures

Figures reproduced from arXiv: 2605.28490 by Jiajie Xu, Jiawei Li, Long Chen, Weijie Shi, Xiaofang Zhou, Ziyi Liu.

**Figure 1.** QPG vs. S3G. Fine-grained 3D grounding often requires ruling out candidates through context objects and spatial relations. QPG makes one pointer-style selection, while S3G writes latent spatial reasoning steps and refines candidate rankings step by step.

**Figure 2.** SSR3D-LLM overview. One 3D-LLM backbone handles language tasks and grounding. Without <geom>, it uses the default dialog/QA/caption route; with <geom>, it routes the instruction and Mask3D proposal representations to S3G for target selection and step-wise score traces.

**Figure 3.** S3G mechanism. (a) The LLM writes a latent-step workspace in one forward pass by reading hidden states at reserved step markers and memory tokens. (b) A geometry-aware scorer reads these latent steps to refine Mask3D proposal rankings step by step, then a pred-class filter uses the predicted target class for final reranking. (c) Step-length masking keeps inactive steps from changing candidate states, suppo…

**Figure 4.** Qualitative grounding case. A real ScanRefer example shows the query, step-wise candidate scores, proposal boxes, and baseline vs. SSR3D-LLM predictions. The masked fourth slot preserves the step-3 ranking; cue labels are visualization-only annotations for the figure.

**Figure 5.** Additional qualitative visualization I. Query: this is a short wooden table; it is against a wall. Panel labels: Latent Step 1, Latent Step 2, Latent Step 3, Final S3G, Baseline Grounding.

**Figure 6.** Additional qualitative visualization II. Query: this is an white and black monitor; it is behind an all black keyboard on a tan desk; it is close to an off white and black monitor of similar size. Panel labels: Latent Step 1, Latent Step 2, Latent Step 3, Final S3G, Baseline Grounding.

**Figure 7.** Additional qualitative visualization III. Query: there is a black office chair; placed in the side of the wall. Panel labels: Latent Step 1, Latent Step 2, Latent Step 3, Final S3G, Baseline Grounding.

**Figure 8.** Additional qualitative visualization IV.

**Figure 9.** Auxiliary representation visualization for QPG. The QPG representation separates object-category semantics more clearly than spatial-relation semantics. We use this as a qualitative view of the single-step bottleneck; Section 5.2 evaluates its effect on object selection.

**Figure 10.** Offline referential-cue annotation prompt. The Qwen/vLLM annotator receives the target phrase and query, returns JSON anchors plus a target-last referential order, and the preprocessing script converts that order into reserved step tokens. These generated cues supervise latent slots during training; inference uses only the query and Mask3D proposal representation, not generated or human-written chains.

**Figure 11.** Step-wise score heatmaps (examples). For each query, we track the ground-truth (GT) object and the union of candidates that enter the per-step top-k list, and visualize their step-wise probabilities over effective steps (k ≤ L) as a heatmap (color indicates log10 probability). Rows show candidate instance ids (and labels when available); “Other” aggregates the remaining probability mass outside the tracke…

**Figure 12.** Step-length masking analysis. From exported step-wise traces, we directly compare the transition magnitude within valid steps (Δ_valid) and after termination (Δ_after). The near-zero Δ_after confirms step-length masking makes padded steps behaviorally inert.

**Figure 13.** GT rank flow from step-wise traces. Smoothed density of GT rank over normalized step progress k/L_used. Columns are normalized within each progress step; curves summarize the median, mean, and 25–75% interquartile range. This aggregate diagnostic shows whether recursive updates tend to move the ground-truth object toward the top of the candidate ranking.

**Figure 14.** Macro trends from full evaluation (Nr3D, Sr3D). We aggregate the full evaluation set using the saved per-candidate probability distributions (no re-inference). Rows correspond to oracle step length L (from referential-cue annotations, clipped to K = 4); columns bin the candidate-set size (#objects). We show two methods side-by-side: SSR3D-LLM (ours) and a paired baseline w/o step-length mask (fixed-K upda…

**Figure 15.** High-confidence coverage from full evaluation. Same binning as …

**Figure 16.** High-confidence conditional error from full evaluation. Same data as the bottom row of …

**Figure 17.** Error-mode breakdown from full evaluation. Top: top-1 probability histograms for correct vs. wrong predictions. Bottom: high-confidence conditional error by oracle step length L (clipped to K = 4) with sample counts.

**Figure 18.** QPG diagnostic macro trends. Same binning as …

**Figure 19.** QPG diagnostic error-mode breakdown. Same layout as …

**Figure 20.** QPG diagnostic high-confidence coverage. Same as …

**Figure 21.** QPG diagnostic high-confidence conditional error. Same as …
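
The two trace diagnostics named in the captions for Figures 12 and 13 (transition magnitude inside valid steps versus after termination, and the ground-truth rank per step) can be computed from saved per-step probabilities roughly as follows. The paper's exact definitions are not reproduced on this page, so this sketch assumes an L1 distance between consecutive step distributions and a 1-based rank with ties broken in the ground truth's favor; both choices and all names are illustrative.

```python
def gt_rank(probs, gt_index):
    """1-based rank of the ground-truth candidate within one step."""
    return 1 + sum(1 for p in probs if p > probs[gt_index])


def transition_magnitudes(trace, num_active_steps):
    """Mean L1 change between consecutive steps, split by validity.

    trace: list of per-step probability lists (one list per step slot).
    num_active_steps: step length L; the transition into slot k
        (0-based) counts as valid when k < L, otherwise as after
        termination.
    Returns (mean_valid, mean_after); an empty group yields 0.0.
    """
    valid, after = [], []
    for k in range(1, len(trace)):
        delta = sum(abs(a - b) for a, b in zip(trace[k], trace[k - 1]))
        (valid if k < num_active_steps else after).append(delta)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(valid), mean(after)


# Toy trace: 3 candidates, 4 step slots, the last slot is padded.
trace = [
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
]
print([gt_rank(step, gt_index=0) for step in trace])
print(transition_magnitudes(trace, num_active_steps=3))
```

A padded slot that truly copies its predecessor gives an after-termination magnitude of exactly zero, which is the pattern the Figure 12 caption reports as near-zero.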

Original abstract

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.
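
The auxiliary referential-cue supervision mentioned in the abstract is produced offline: per the Figure 10 caption, an annotator returns JSON anchors plus a target-last referential order, and a preprocessing script turns that order into reserved step tokens. The sketch below shows one way such a conversion could look. The JSON field names, the `<step_i>` token spelling, and the rule of keeping the last K cues when the order is too long are all assumptions; only the target-last ordering and the K = 4 clip come from the page.

```python
import json

K_MAX = 4  # step-slot budget; the figure captions mention clipping to K = 4


def cues_to_step_slots(annotation_json, k_max=K_MAX):
    """Map an annotator record to (slots, step_length).

    annotation_json: JSON string with an "order" list whose last entry
        is the target (hypothetical schema).
    Returns K slots as (token, cue) pairs, where unused slots carry None,
    plus the number of active steps used for step-length masking.
    """
    record = json.loads(annotation_json)
    # Keep the tail so the target, which comes last, is never clipped.
    order = record["order"][-k_max:]
    slots = [(f"<step_{i + 1}>", cue) for i, cue in enumerate(order)]
    step_length = len(slots)
    # Pad remaining slots; these are the ones masking should neutralize.
    slots += [(f"<step_{i + 1}>", None) for i in range(step_length, k_max)]
    return slots, step_length


example = json.dumps({
    "target": "monitor",
    "anchors": ["desk", "keyboard"],
    "order": ["desk", "keyboard", "monitor"],
})
slots, length = cues_to_step_slots(example)
print(slots, length)
```

None of this runs at inference: the cues only shape training targets for the latent slots, which matches the abstract's statement that inference uses just the query and the Mask3D proposals.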

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SSR3D-LLM adds ordered latent steps and auxiliary cues to 3D grounding, but the abstract leaves open whether the structure itself produces the reported gains.

The main takeaway is that SSR3D-LLM replaces single-pointer grounding with an LLM-generated sequence of latent spatial reasoning steps plus memory tokens; a geometry-aware scorer then reads those steps in order, applying step-length masking to refine rankings on fixed Mask3D proposals.

The paper does a clean job of keeping the rest of the unified 3D-LLM pipeline intact while targeting the known weakness on fine-grained relational queries. It reports the strongest numbers among the unified baselines on ReferIt3D, ScanRefer, and Multi3DRef, with larger lifts on the harder cases, and the auxiliary referential-cue supervision during training is a straightforward addition that could plausibly help the model learn useful intermediate structure.

The soft spots are exactly where the stress-test note points. The abstract supplies no ablations that isolate the ordered scoring from the extra supervision or the scorer architecture, no error bars, and no evidence on whether the latent steps are actually consulted at inference or simply improve training. Because the proposals stay fixed and identical to the QPG baseline, any performance difference could come from richer training signals rather than genuine step-by-step refinement that generalizes. Without those checks the central claim remains plausible but unproven.

This is for people already working on unified 3D-LLMs and fine-grained grounding; a reader outside that subfield will not get much. The idea is concrete enough and the positioning against prior unified models is clear enough that it deserves a serious referee to examine the full methods and experiments.

Referee Report

2 major / 1 minor

Summary. The paper introduces SSR3D-LLM, a unified 3D-LLM for 3D object grounding that has the LLM generate a sequence of latent spatial reasoning steps and memory tokens from the natural language query; a geometry-aware scorer then reads these steps sequentially (with step-length masking) to iteratively refine rankings over fixed Mask3D object proposals. Training uses both standard target supervision and auxiliary referential-cue supervision, while inference uses only the query and the same proposals. The work claims the strongest results among unified 3D-LLM baselines on ReferIt3D, ScanRefer, and Multi3DRef, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and no degradation on the default language-task route.

Significance. If the gains are robust and attributable to the latent-step mechanism rather than auxiliary supervision or scorer differences, the approach would usefully extend unified 3D-LLMs to handle relational and fine-grained queries that single-pointer methods struggle with. The design choice to keep the language-task route unchanged is a practical strength. The manuscript supplies no machine-checked proofs, reproducible code, or parameter-free derivations; its value rests entirely on the empirical results.

major comments (2)

[Abstract] Performance numbers are stated on three benchmarks with no error bars, ablation details, training curves, or statistical tests. Without these, it is impossible to determine whether the reported improvements over QPG and prior unified 3D-LLMs are statistically supported or sensitive to post-hoc choices; this directly undermines the central empirical claim.
[Abstract] (training/inference description) The model is trained with both target supervision and auxiliary referential-cue supervision, yet inference uses only the query. No ablation is described that isolates the contribution of the sequential latent-step reading (versus a single-step or non-masked scorer) while holding the extra supervision fixed. This leaves open the possibility that gains arise from the auxiliary signal or scorer architecture rather than from structured spatial reasoning that generalizes at inference, which is load-bearing for the paper's main thesis.
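
One concrete form the missing statistical check could take is a paired bootstrap over per-query correctness for two methods evaluated on the same queries. The sketch below is a generic percentile bootstrap, not a procedure taken from the paper. The function name, the resample count, and the 95% level are illustrative assumptions.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, num_resamples=2000,
                        alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) on shared queries.

    correct_a, correct_b: equal-length lists of 0/1 flags, aligned so
        that index i refers to the same query for both methods.
    Returns (low, high) bounds of the (1 - alpha) interval.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and aligned")
    n = len(correct_a)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(num_resamples):
        # Resample query indices with replacement, shared by both methods.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    low = diffs[int((alpha / 2) * num_resamples)]
    high = diffs[int((1 - alpha / 2) * num_resamples) - 1]
    return low, high


# Toy flags for ten shared queries (made-up values, not paper results).
method_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
method_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(paired_bootstrap_ci(method_a, method_b))
```

An interval that excludes zero would support the claim that the gain is not a resampling artifact. Pairing matters here because both methods see the same proposals and queries, so per-query difficulty cancels out.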

minor comments (1)

[Abstract] The phrase 'unified 3D-LLMs' is used without a brief definition or pointer to the specific prior works being compared.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support and targeted ablations. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract] Performance numbers are stated on three benchmarks with no error bars, ablation details, training curves, or statistical tests. Without these, it is impossible to determine whether the reported improvements over QPG and prior unified 3D-LLMs are statistically supported or sensitive to post-hoc choices; this directly undermines the central empirical claim.

Authors: We agree that the abstract's presentation of results without error bars or statistical tests limits the ability to assess robustness. Although space constraints prevent including all details in the abstract itself, we will revise the main results tables to report error bars from multiple random seeds, add statistical significance tests comparing against baselines, and ensure ablation details and training curves are explicitly referenced in the main text with pointers to the supplementary material.

revision: yes

Referee: [Abstract] (training/inference description) The model is trained with both target supervision and auxiliary referential-cue supervision, yet inference uses only the query. No ablation is described that isolates the contribution of the sequential latent-step reading (versus a single-step or non-masked scorer) while holding the extra supervision fixed. This leaves open the possibility that gains arise from the auxiliary signal or scorer architecture rather than from structured spatial reasoning that generalizes at inference, which is load-bearing for the paper's main thesis.

Authors: The referee is correct that the current manuscript lacks an ablation that holds auxiliary supervision fixed while varying only the sequential latent-step mechanism versus single-step or non-masked alternatives. We will add this controlled ablation in the revised version to isolate the contribution of the structured reasoning at inference time and better substantiate the central claim.

revision: yes

Circularity Check

0 steps flagged

No circularity detected in empirical model proposal

The paper describes an empirical architecture change for unified 3D-LLMs: fixed Mask3D proposals are fed to an LLM that generates latent steps, which a geometry-aware scorer then processes with step-length masking. Training uses standard target supervision plus auxiliary referential-cue supervision; inference uses only the query and proposals. Results are reported on external benchmarks (ReferIt3D, ScanRefer, Multi3DRef). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claim to its inputs by construction appear in the text. The work is self-contained as a model evaluated on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level model components; the central claim rests on the unstated assumption that Mask3D proposals plus learned latent steps suffice for generalization.

pith-pipeline@v0.9.1-grok · 5792 in / 1290 out tokens · 38666 ms · 2026-06-29T13:51:03.560038+00:00 · methodology

Reference graph

Works this paper leans on

40 extracted references · 16 canonical work pages

[1]

Cot3dref: Chain-of-thoughts data-efficient 3d visual grounding.arXiv preprint arXiv:2310.06214, 2023

Eslam Abdelrahman, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, and Mohamed El- hoseiny. Cot3dref: Chain-of-thoughts data-efficient 3d visual grounding.arXiv preprint arXiv:2310.06214, 2023

work page arXiv 2023
[2]

Scanents3d: Exploiting phrase-to-3d-object correspondences for improved visio-linguistic models in 3d scenes

Ahmed Abdelreheem, Kyle Olszewski, Hsin-Ying Lee, Peter Wonka, and Panos Achlioptas. Scanents3d: Exploiting phrase-to-3d-object correspondences for improved visio-linguistic models in 3d scenes. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3524–3534, 2024

2024
[3]

Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes.16th European Conference on Computer Vision (ECCV), 2020

Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes.16th European Conference on Computer Vision (ECCV), 2020

2020
[4]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022

2022
[5]

Mikasa: Multi-key- anchor & scene-aware transformer for 3d visual grounding

Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, and Didier Stricker. Mikasa: Multi-key- anchor & scene-aware transformer for 3d visual grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2024

2024
[6]

Dave Chen, Angel Chang, and Matthias Nießner.ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language, pages 202–221. 11 2020. ISBN 978-3-030-58564-8. doi: 10.1007/978-3-030-58565-5_13

work page doi:10.1007/978-3-030-58565-5_13 2020
[7]

Language conditioned spatial relation reasoning for 3d object grounding.Advances in neural information processing systems, 35:20522–20535, 2022

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Language conditioned spatial relation reasoning for 3d object grounding.Advances in neural information processing systems, 35:20522–20535, 2022

2022
[8]

Grounded 3d-llm with referent tokens.arXiv preprint arXiv:2405.10370, 2024

Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Runsen Xu, Ruiyuan Lyu, Dahua Lin, and Jiangmiao Pang. Grounded 3d-llm with referent tokens.arXiv preprint arXiv:2405.10370, 2024

work page arXiv 2024
[9]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

2017
[10]

A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

2022
[11]

Transcrib3d: 3d referring ex- pression resolution through large language models

Jiading Fang, Xiangshan Tan, Shengjie Lin, Igor Vasiljevic, Vitor Guizilini, Hongyuan Mei, Rares Ambrus, Gregory Shakhnarovich, and Matthew R Walter. Transcrib3d: 3d referring ex- pression resolution through large language models. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9737–9744. IEEE, 2024

2024
[12]

Scene-llm: Extending language model for 3d visual reasoning

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual reasoning. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2195–2206. IEEE, 2025

2025
[13]

Comparison of distrib uted task allocation algorithms considering non -ideal communication f actors for multi -UAV collaborative visit missions,

Liang Geng and Jianqin Yin. Viewinfer3d: 3d visual grounding based on embodied viewpoint inference.IEEE Robotics and Automation Letters, 9:7469–7476, 2024. doi: 10.1109/LRA. 2024.3426286

work page doi:10.1109/lra 2024
[14]

Viewrefer: Grasp the multi-view knowledge for 3d visual grounding with gpt and prototype guidance.arXiv preprint arXiv:2303.16894, 2023

Ziyu Guo, Yiwen Tang, Renrui Zhang, Dong Wang, Zhigang Wang, Bin Zhao, and Xuelong Li. Viewrefer: Grasp the multi-view knowledge for 3d visual grounding with gpt and prototype guidance.arXiv preprint arXiv:2303.16894, 2023

work page arXiv 2023
[15]

TransRefer3D: Entity-and-relation aware transformer for fine-grained 3d visual grounding

Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, and Si Liu. TransRefer3D: Entity-and-relation aware transformer for fine-grained 3d visual grounding. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2344–2352, 2021. doi: 10.1145/3474085.3475397. URLhttps://arxiv.org/abs/2108.02388. 10

work page doi:10.1145/3474085.3475397 2021
[16]

3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494, 2023

2023
[17]

Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Advances in Neural Information Processing Systems, 37:113991–114017, 2024

2024
[18]

Multi-view transformer for 3d visual grounding

Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi-view transformer for 3d visual grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022

2022
[19]

Sceneverse: Scaling 3d vision-language learning for grounded scene understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. InEuropean Conference on Computer Vision (ECCV), 2024

2024
[20]

Spazer: Spatial-semantic progressive reasoning agent for zero-shot 3d visual grounding, 2025

Zhao Jin, Rong-Cheng Tu, Jingyi Liao, Wenhao Sun, Xiao Luo, Shunyu Liu, and Dacheng Tao. Spazer: Spatial-semantic progressive reasoning agent for zero-shot 3d visual grounding, 2025. URLhttps://arxiv.org/abs/2506.21924

work page arXiv 2025
[21]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024

2024
[22]

Seeground: See and ground for zero-shot open-vocabulary 3d visual grounding

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, and Junwei Liang. Seeground: See and ground for zero-shot open-vocabulary 3d visual grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[23]

A survey on text-guided 3d visual grounding: Elements, recent advances, and future directions.arXiv preprint arXiv:2406.05785, 2024

Daizong Liu, Yang Liu, Wencan Huang, and Wei Hu. A survey on text-guided 3d visual grounding: Elements, recent advances, and future directions.arXiv preprint arXiv:2406.05785, 2024

work page arXiv 2024
[24]

Better, stronger, faster: Tackling the trilemma in mllm-based segmentation with simultaneous textual mask prediction, 2025

Jiazhen Liu, Mingkuan Feng, and Long Chen. Better, stronger, faster: Tackling the trilemma in mllm-based segmentation with simultaneous textual mask prediction, 2025. URL https: //arxiv.org/abs/2512.00395

work page arXiv 2025
[25]

Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025

work page arXiv 2025
[26]

3d-sps: Single-stage 3d visual grounding via referred point progressive selection

Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, and Si Liu. 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18212–18221, 2022

2022
[27]

Language-to-space programming for training-free 3d visual grounding

Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, and Jiangmiao Pang. Language-to-space programming for training-free 3d visual grounding. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3844–3864, 2025

2025
[28]

Gpt4scene: Understand 3d scenes from videos with vision-language models,

Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models, 2025. URL https://arxiv. org/abs/2501.01428

work page arXiv 2025
[29]

Languagerefer: Spatial-language model for 3d visual grounding.arXiv preprint arXiv:2107.03438, 2021

Junha Roh, Karthik Desingh, Ali Farhadi, and Dieter Fox. Languagerefer: Spatial-language model for 3d visual grounding.arXiv preprint arXiv:2107.03438, 2021

work page arXiv 2021
[30]

Dense multimodal alignment for open-vocabulary 3d scene understanding

Li Ruihuang, Zhang Zhengqiang, He Chenhang, Ma Zhiyuan, Patel Vishal M., and Zhang Lei. Dense multimodal alignment for open-vocabulary 3d scene understanding. InECCV, 2024

2024
[31]

Mask3D: Mask Transformer for 3D Semantic Instance Segmentation

Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3D: Mask Transformer for 3D Semantic Instance Segmentation. 2023. 11

2023
[32]

Data-efficient 3d visual grounding via order-aware referring

Tung-Yu Wu, Sheng-Yu Huang, and Yu-Chiang Frank Wang. Data-efficient 3d visual grounding via order-aware referring. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3107–3117. IEEE, 2025

2025
[33]

Vlm- grounder: A vlm agent for zero-shot 3d visual grounding

Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Vlm- grounder: A vlm agent for zero-shot 3d visual grounding. InCoRL, 2024

2024
[34]

Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent.arXiv preprint arXiv:2309.12311, 2023

Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F Fouhey, and Joyce Chai. Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent.arXiv preprint arXiv:2309.12311, 2023

work page arXiv 2023
[35]

3d-grand: A million-scale dataset for 3d-llms with better grounding and less hallucination.arXiv preprint arXiv:2406.05132, 2024

Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F Fouhey, and Joyce Chai. 3d-grand: A million-scale dataset for 3d-llms with better grounding and less hallucination.arXiv preprint arXiv:2406.05132, 2024

work page arXiv 2024
[36]

A comprehensive survey of 3d dense captioning: Localizing and describing objects in 3d scenes

Ting Yu, Xiaojun Lin, Shuhui Wang, Weiguo Sheng, Qingming Huang, and Jun Yu. A comprehensive survey of 3d dense captioning: Localizing and describing objects in 3d scenes. IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1322–1338, 2023

2023
[37]

Visual programming for zero-shot open-vocabulary 3d visual grounding

Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, and Zhen Li. Visual programming for zero-shot open-vocabulary 3d visual grounding. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20623–20633, 2024. doi: 10.1109/CVPR52733.2024.01949

work page doi:10.1109/cvpr52733.2024.01949 2024
[38]

Sort3d: Spatial object-centric reasoning toolbox for zero-shot 3d grounding using large language models

Nader Zantout, Haochen Zhang, Pujith Kachana, Jinkai Qiu, Guofei Chen, Ji Zhang, and Wenshan Wang. Sort3d: Spatial object-centric reasoning toolbox for zero-shot 3d grounding using large language models, 2025. URL https://arxiv.org/abs/2504.18684

work page arXiv 2025
[39]

Multi3drefer: Grounding text description to multiple 3d objects

Yiming Zhang, ZeMing Gong, and Angel X Chang. Multi3drefer: Grounding text description to multiple 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15225–15236, October 2023

2023
[40]

3dvg-transformer: Relation modeling for visual grounding on point clouds

Lichen Zhao, Daigang Cai, Lu Sheng, and Dong Xu. 3dvg-transformer: Relation modeling for visual grounding on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2928–2937, October 2021.

2021

Algorithm 1 ReferBlock computation at step $k$.
Require: Candidate tokens $X^{(k-1)} \in \mathbb{R}^{N \times D}$, memory $E \in \mathbb{R}^{M \times D}$, step cue $s_k \in \mathbb{R}^{D}$ ...

