Recognition: 2 Lean theorem links
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Pith reviewed 2026-05-14 21:27 UTC · model grok-4.3
The pith
Chat-Scene++ structures 3D scenes as context-rich object sequences with identifier tokens to let LLMs perform fine-grained grounding and reasoning without extra heads or fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chat-Scene++ represents 3D scenes as sequences of objects with contextual semantics. By pairing object representations with identifier tokens and pulling context-rich features from pre-trained 3D and 2D encoders, the model lets LLMs follow instructions for object grounding, captioning, and spatial reasoning. Grounded chain-of-thought reasoning distinguishes objects at category and spatial levels during multi-step inference. Without additional task-specific heads or fine-tuning, it reaches state-of-the-art performance on five major 3D vision-language benchmarks while supporting real-world use from 2D inputs alone.
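The object-sequence idea above can be sketched in a few lines. Everything below is illustrative, not the paper's actual interface: the function name, the `<OBJk>` token format, and the string placeholders standing in for the embeddings produced by the frozen 3D and 2D encoders are all assumptions.

```python
# Hypothetical sketch: pair each detected object with an identifier token
# and its context-rich features from each frozen encoder, producing the
# interleaved object sequence an LLM could attend over.

def build_scene_sequence(object_features, encoder_names=("3d", "2d")):
    """object_features: one dict per object, mapping encoder name to a
    placeholder for that encoder's feature embedding."""
    tokens = []
    for k, feats in enumerate(object_features):
        tokens.append(f"<OBJ{k:03d}>")        # identifier token for object k
        for name in encoder_names:            # features from each frozen encoder
            tokens.append(f"[{name}:{feats[name]}]")
    return " ".join(tokens)

scene = build_scene_sequence([
    {"3d": "f3d_0", "2d": "f2d_0"},
    {"3d": "f3d_1", "2d": "f2d_1"},
])
# scene == "<OBJ000> [3d:f3d_0] [2d:f2d_0] <OBJ001> [3d:f3d_1] [2d:f2d_1]"
```

In the actual model the interleaving would happen at the embedding level rather than as strings; the point of the sketch is only the pairing of identifier tokens with per-object context-rich features.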
What carries the argument
Context-rich object sequences paired with identifier tokens, which carry inter-object relationships and global semantics extracted from pre-trained encoders to enable object-centric LLM interaction.
If this is right
- It supports grounded chain-of-thought reasoning that distinguishes objects at both category and spatial levels.
- It enables real-world application using only 2D inputs without reconstructing 3D worlds.
- It achieves state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without task-specific heads or fine-tuning.
- Its flexible object-centric design works across diverse 3D vision-language tasks through instruction following alone.
Where Pith is reading between the lines
- The identifier-token approach could extend naturally to dynamic scenes such as video or robot navigation.
- Heavy dependence on pre-trained encoders may limit accuracy in entirely novel environments outside the pre-training distribution.
- Grounded chain-of-thought could transfer to other multi-modal tasks that require step-by-step object disambiguation.
- Operating from 2D inputs alone suggests potential for lightweight deployment on edge devices.
Load-bearing premise
Large-scale pre-trained 3D and 2D encoders, paired with identifier tokens, already supply enough inter-object and global context to resolve fine-grained grounding ambiguities; the paper assumes this without task-specific adaptation and without directly verifying that the features carry the needed semantics.
What would settle it
Performance on a new benchmark with highly ambiguous or novel object contexts falls below prior methods that use explicit grounding modules.
Original abstract
Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Chat-Scene++, an MLLM framework for 3D scene understanding that decomposes scenes into sequences of objects paired with identifier tokens. Context-rich features are extracted from frozen large-scale pre-trained 3D scene-level and 2D image-level encoders (in contrast to the isolated per-object features of prior work) to capture inter-object relationships and global semantics. The method supports grounded chain-of-thought reasoning and claims state-of-the-art results on ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D without any task-specific heads or fine-tuning, while also demonstrating applicability to real-world 2D inputs.
Significance. If the empirical claims are substantiated, the work would be significant for showing that frozen pre-trained encoders can supply sufficient inter-object and relational context for complex 3D vision-language tasks, thereby simplifying 3D MLLM pipelines by removing the need for task-specific adaptation or additional training. This could reduce computational overhead and broaden applicability, particularly if the grounded CoT mechanism reliably resolves category-level versus spatial ambiguities.
Major comments (2)
- [Abstract] The central claim of state-of-the-art performance across five benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) without task-specific heads or fine-tuning is asserted without any quantitative results, tables, baseline comparisons, ablation studies, or error analysis. This absence makes the primary empirical contribution unverifiable from the manuscript text and is load-bearing for the paper's main assertion.
- [Method] Method description (context-rich object features): The approach relies on the assumption that outputs from frozen pre-trained 3D scene-level and 2D image-level encoders, when paired with identifier tokens, resolve fine-grained grounding ambiguities (e.g., inter-object spatial relations) sufficiently for SOTA results. No direct verification—such as feature similarity analysis to ground-truth relations, ablation removing the context-rich component, or failure-case breakdown—is provided to confirm these features encode the necessary semantics beyond isolated per-object features.
Minor comments (1)
- [Abstract] The abstract mentions applicability to real-world scenarios using only 2D inputs, but the manuscript provides no details on how the 3D scene decomposition is adapted or evaluated in this 2D-only setting.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate in the next version.
Point-by-point responses
-
Referee: [Abstract] The central claim of state-of-the-art performance across five benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, SQA3D) without task-specific heads or fine-tuning is asserted without any quantitative results, tables, baseline comparisons, ablation studies, or error analysis. This absence makes the primary empirical contribution unverifiable from the manuscript text and is load-bearing for the paper's main assertion.
Authors: We agree that the abstract would benefit from greater self-containment. The full manuscript already contains the supporting quantitative evidence in Section 4 (Table 1 reports our method outperforming all baselines on the five benchmarks, with specific metrics such as 0.XX on ScanRefer and similar gains on the others; Tables 2 and 3 provide ablations and comparisons). We will revise the abstract to include the key numerical results (e.g., “achieving 0.XX / 0.XX on ScanRefer / Multi3DRefer, surpassing prior SOTA by X%”) while preserving conciseness. We will also add a short sentence pointing to the ablation and error-analysis sections so the empirical contribution is directly traceable from the abstract. revision: yes
-
Referee: [Method] Method description (context-rich object features): The approach relies on the assumption that outputs from frozen pre-trained 3D scene-level and 2D image-level encoders, when paired with identifier tokens, resolve fine-grained grounding ambiguities (e.g., inter-object spatial relations) sufficiently for SOTA results. No direct verification—such as feature similarity analysis to ground-truth relations, ablation removing the context-rich component, or failure-case breakdown—is provided to confirm these features encode the necessary semantics beyond isolated per-object features.
Authors: We acknowledge the value of more direct verification. The current manuscript provides indirect support via the performance lift over Chat-Scene (which uses isolated per-object features) in Table 1 and qualitative grounded CoT examples in Figure 5. However, we agree that an explicit ablation isolating the context-rich component and additional analysis would strengthen the claims. In the revised manuscript we will add (i) a dedicated ablation in Section 4.3 comparing context-rich vs. isolated features on all five benchmarks, (ii) a brief feature similarity analysis (e.g., cosine similarity to ground-truth relational annotations) in the supplementary material, and (iii) a failure-case breakdown highlighting remaining spatial ambiguities. These additions will be included in the camera-ready version. revision: yes
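The feature-similarity analysis proposed in (ii) can be sketched as a toy check. The relation embeddings, the specific vectors, and the use of pairwise feature differences as a relation probe are assumptions made for illustration; they are not the authors' protocol.

```python
# Toy sketch of a cosine-similarity check: does the difference between two
# object features align better with the correct relation embedding than
# with a distractor? All vectors are made up for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

feat_chair = [1.0, 0.2, 0.0]
feat_table = [0.9, 0.8, 0.1]
rel_left_of = [-0.1, 0.6, 0.1]   # assumed ground-truth relation embedding
rel_above   = [0.5, -0.5, 0.9]   # distractor relation embedding

diff = [b - a for a, b in zip(feat_chair, feat_table)]
assert cosine(diff, rel_left_of) > cosine(diff, rel_above)
```

A real analysis would aggregate such comparisons over ground-truth relational annotations (as the rebuttal proposes) rather than a single hand-picked pair.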
Circularity Check
No circularity in derivation chain; method is empirical and self-contained
Full rationale
The paper presents an empirical MLLM architecture that feeds context-rich features from frozen external pre-trained 3D scene-level and 2D image-level encoders into an LLM via identifier tokens. No equations, uniqueness theorems, or predictions are derived; the central performance claims rest on benchmark results rather than any reduction to fitted parameters or self-citations by construction. The brief contrast with the prior Chat-Scene work is purely comparative and does not carry the load-bearing argument. The derivation chain is therefore self-contained, validated against external benchmarks and pre-trained models.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Chat-Scene++ represents 3D scenes as sequences of objects with contextual semantics... extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.