pith. machine review for the scientific record. sign in

arxiv: 2605.09218 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI· cs.LG· cs.RO

Recognition: 2 theorem links

· Lean Theorem

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

Sagar Bharadwaj , Ziyong Ma , Anurag Ghosh , Srinivasan Seshan , Anthony Rowe

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:05 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LGcs.RO
keywords 3D scene understandingcompositional reasoningzero-shot learningmultimodal language modelsspatial toolsagentic reasoningtraining-free methodseditable scene memory
0
0 comments X

The pith

A training-free framework lets off-the-shelf language models reason about complex 3D scenes by editing memories and inventing spatial tools at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that broad 3D scene understanding, including free space, object relations, hypothetical insertions, and geometric queries, does not need large-scale 3D-language training. Instead, scenes can be stored as editable visual-textual memories that an existing multimodal model accesses through composable spatial tools. The model can also generate new tools on the fly to handle open-ended or multi-hop questions about layouts and objects not yet in the scene. External information can be added to the memory without any retraining. Experiments show this matches fine-tuned 3D methods on ScanQA and outperforms them on a new compositional benchmark where fixed tools alone are insufficient.

Core claim

Flame3D stores 3D scenes as editable visual-textual memories and exposes them to an off-the-shelf MLLM through both fixed and agent-synthesized spatial tools, enabling zero-shot compositional reasoning over free space, grounding, hypothetical objects, and geometric relationships without 3D-specific training.

What carries the argument

Editable visual-textual 3D memory exposed through fixed and self-synthesized spatial tools that the agent composes at inference time.

If this is right

  • Competitive results on ScanQA are achieved without any 3D-language fine-tuning.
  • Synthesized tools at inference time are required for success on multi-hop spatial reasoning in Compose3D.
  • External data or human corrections can be inserted into the memory without retraining the model.
  • Open-ended reasoning about empty space and absent objects becomes possible through on-the-fly program synthesis.
  • Progress in 3D understanding can shift toward richer scene memories and compositional abstractions rather than larger training sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-plus-tool design could support dynamic scenes if the memory is updated in real time.
  • Robotic planning systems might reuse the identical editable memory for both perception and action sequencing.
  • Limits in current MLLM spatial reasoning may be addressed more effectively by better tool libraries than by additional training data.
  • The approach invites tests on whether tool synthesis scales to longer reasoning chains or to scenes with many interacting objects.

Load-bearing premise

An off-the-shelf multimodal language model can reliably interpret and execute both fixed and self-generated spatial tools on the editable 3D memory without substantial hallucinations on complex tasks.

What would settle it

Systematic failure of the MLLM to correctly apply agent-synthesized spatial programs on multi-hop queries from the Compose3D benchmark would show that inference-time tool synthesis does not deliver the claimed generalization.

Figures

Figures reproduced from arXiv: 2605.09218 by Anthony Rowe, Anurag Ghosh, Sagar Bharadwaj, Srinivasan Seshan, Ziyong Ma.

Figure 1
Figure 1. Figure 1: Flame3D answers compositional, multi-hop spatial queries about a 3D scene by com￾posing chain-of-thought inferences over a structured visual-textual scene memory and a systematically designed set of spatial and visual tools along with external knowledge. Qualitative examples are provided our website: open-flame.com/flame3d Abstract 3D scene understanding spans reasoning about free space, object grounding, … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Flame3D. When a natural-language query is received, an off-the-shelf tool￾calling vision–language model breaks it down into a sequence of spatial and external tool-calls to produce a grounded answer. The agent composes these inferences by interacting with the structured scene memory through a collection of spatial tools (search, distance, vicinity search, navigation distance, image retrieval, a… view at source ↗
Figure 3
Figure 3. Figure 3: Meta-abstraction Tool Use. Examples of the Execute program abstraction. Flame3D can synthesize tailored Python code at inference time with attributes from external sources, draw arbitrary bounding boxes (top), or compute precise geometric relationships such as projections (bottom). safety regulations), machine profiles (e.g., equipment specifications), and pricing catalogs. Because external attributes are … view at source ↗
Figure 4
Figure 4. Figure 4: Example of grounded output with component identifiers. Iterative tool use. At each step, the MLLM produces either a tool call or a final answer. Tool calls retrieve information from the scene memory (e.g., searching for relevant components, computing distances, inspecting im￾ages, etc.) or from external sources. The result of each tool call is appended to the agent’s context, enabling multi-hop chains. For… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative examples illustrating Flame3D answering complex 3D queries by chaining spatial tools, visual inspection, and external knowledge retrieval. 4 Experiments 4.1 Datasets ScanQA [19] is a 3D question answering benchmark built on top of real-world indoor RGB-D scans from ScanNet [49]. Given a reconstructed 3D scene and a natural-language question, the task requires the model to produce an answer grou… view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of dif [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Impact of spatial reasoning tools on model performance. The results illustrate the incremental performance gains achieved by integrating spatial tools. Genuine 3D understanding may not require training on 3D–language data [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Our method often generates contextually valid answers but are penalized by ambiguity in ScanQA ground-truth. Fixed toolsets are insufficient for open-ended 3D queries [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison across different foundation models. Modern, larger models consistently demonstrate superior spatial reasoning and object grounding capabilities. is that this execute-only baseline takes us a long way. Because the meta-tool is highly expressive, it effectively subsumes many geometric tools and establishes a strong foundation for reasoning. However, it does not take us all the way. It cannot, for … view at source ↗
Figure 10
Figure 10. Figure 10: Full tool suite pro￾vides consistent performance gains over relying solely on the meta tool. Scaling foundation models improves spatial reasoning. To un￾derstand how the choice of the underlying foundation model impacts spatial reasoning, we evaluate Flame3D across different language models. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Interface of the benchmark collection tool. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: A waterfall chart detailing the runtime breakdown for creating the scene memory. To comprehensively evaluate Flame3D on ScanQA, we compare against three distinct classes of baselines: Expert models [63, 64, 19, 43] that utilize dedicated 3D encoders and are trained with task-specific supervision for 3D question answering. 3D LMMs [1, 25, 65– 68, 20, 44] that have undergone extensive pre￾training or finetu… view at source ↗
read the original abstract

3D scene understanding spans reasoning about free space, object grounding, hypothetical object insertions, complex geometric relationships, and integrating all of these with external tools and data sources. Existing 3D understanding methods typically rely on large-scale 3D-language training or focus on object grounding and simple spatial relationships. We argue that the broad generalization that motivates 3D-language training can be achieved at inference time, without 3D-specific training. We propose Flame3D, a training-free framework that represents scenes as editable visual-textual 3D memories and exposes them to an off-the-shelf MLLM through composable spatial tools. Flame3D also lets the agent synthesize custom spatial programs at inference time, enabling open-ended reasoning over layouts, empty space, and objects not yet present in the scene. External data and corrections can be added to the memory without retraining. In addition to showing competitive performance to finetuned 3D-LMM methods on ScanQA, we study multi-hop 3D reasoning capabilities of Flame3D by evaluating it on a curated compositional spatial-reasoning benchmark, Compose3D. We find that fixed tools fall short and that the agent's ability to synthesize spatial operations at inference time is essential. These results invite the question: should future progress in 3D scene understanding focus on richer scene memories and expressive compositional abstractions?

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Flame3D, a training-free, zero-shot framework for compositional 3D scene reasoning. Scenes are encoded as editable visual-textual 3D memories that an off-the-shelf MLLM accesses via composable spatial tools; the agent can synthesize custom spatial programs at inference time to support open-ended queries over layouts, free space, and hypothetical insertions. The work reports competitive performance against finetuned 3D-LMMs on ScanQA and introduces the Compose3D benchmark to demonstrate that tool synthesis is essential because fixed tools underperform on multi-hop spatial tasks.

Significance. If the results hold, the work shows that broad 3D generalization can be realized at inference time without 3D-specific training by combining editable memories with agentic tool synthesis. This offers a potential alternative to large-scale 3D-language pretraining. Credit is given for the explicitly training-free design, use of off-the-shelf MLLMs, introduction of the Compose3D compositional benchmark, and the explicit empirical contrast between fixed and synthesizable tools.

major comments (2)
  1. [Compose3D evaluation] The central claim that inference-time tool synthesis is essential rests on the statement that 'fixed tools fall short' on Compose3D, yet the manuscript supplies no quantitative breakdown of synthesis success rate, per-step tool-execution accuracy, hallucination frequency on geometric or empty-space relations, or recovery behavior across multi-hop chains. Without these metrics it is impossible to attribute observed performance to the proposed mechanism rather than benchmark leniency or short task depth.
  2. [Framework and method description] The framework's reliability hinges on the off-the-shelf MLLM correctly parsing the editable 3D memory, invoking or inventing spatial tools, and maintaining state without substantial hallucinations. The manuscript provides no implementation details on memory construction, update protocol, prompting strategy for synthesis, or verification of tool outputs, leaving the weakest assumption (MLLM reliability on self-synthesized tools) untested.
minor comments (2)
  1. [Abstract] The abstract claims 'competitive performance' on ScanQA but does not name the exact metrics, baselines, or numerical margins; these should be stated explicitly even in the abstract.
  2. [Method] Notation for the editable visual-textual memory and the distinction between fixed versus synthesizable tools is introduced without a compact summary table or diagram, making the compositional abstraction harder to follow on first reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the training-free nature of Flame3D, the introduction of Compose3D, and the empirical contrast between fixed and synthesizable tools. We address each major comment below and will incorporate the suggested clarifications and analyses in the revised manuscript.

read point-by-point responses
  1. Referee: [Compose3D evaluation] The central claim that inference-time tool synthesis is essential rests on the statement that 'fixed tools fall short' on Compose3D, yet the manuscript supplies no quantitative breakdown of synthesis success rate, per-step tool-execution accuracy, hallucination frequency on geometric or empty-space relations, or recovery behavior across multi-hop chains. Without these metrics it is impossible to attribute observed performance to the proposed mechanism rather than benchmark leniency or short task depth.

    Authors: We agree that a more granular breakdown would strengthen attribution of the performance gains to tool synthesis. Our current results demonstrate a clear gap between fixed-tool and synthesizable-tool variants on multi-hop Compose3D tasks, but we did not report per-step synthesis success rates, execution accuracy, or hallucination frequencies. In the revision we will add these metrics by post-hoc analysis of the agent's tool-invocation traces, including synthesis success rate, per-step geometric/empty-space accuracy, hallucination counts, and recovery behavior across chain lengths. revision: yes

  2. Referee: [Framework and method description] The framework's reliability hinges on the off-the-shelf MLLM correctly parsing the editable 3D memory, invoking or inventing spatial tools, and maintaining state without substantial hallucinations. The manuscript provides no implementation details on memory construction, update protocol, prompting strategy for synthesis, or verification of tool outputs, leaving the weakest assumption (MLLM reliability on self-synthesized tools) untested.

    Authors: We acknowledge that the manuscript describes the framework at a high level but omits low-level implementation specifics. In the revised version we will expand the Methods section with: (i) the exact procedure for constructing and updating the editable visual-textual 3D memory, (ii) the prompting templates and few-shot examples used for tool synthesis, (iii) the protocol for verifying tool outputs before state update, and (iv) a dedicated failure-case analysis that quantifies MLLM reliability on self-synthesized tools using the same Compose3D logs. revision: yes

Circularity Check

0 steps flagged

No circularity; framework is a new inference-time proposal relying on external off-the-shelf MLLM

full rationale

The paper introduces Flame3D as a training-free system that represents 3D scenes via editable visual-textual memories and exposes them to an external MLLM through fixed and synthesizable spatial tools. The central argument—that broad 3D generalization can be achieved at inference time without 3D-specific training—is advanced by describing the architecture and reporting competitive results on ScanQA plus a new Compose3D benchmark. No equations, fitted parameters, or self-referential definitions appear in the provided text; the performance claims rest on the independent capabilities of the chosen MLLM rather than on any loop that reduces the output to the paper's own inputs by construction. Self-citations, if present, are not load-bearing for the core mechanism. This is a standard non-circular proposal of a new agentic framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the untested assumption that general MLLMs can effectively use the new memory format and tool-synthesis mechanism; two new conceptual entities are introduced without independent evidence in the abstract.

axioms (1)
  • domain assumption Off-the-shelf multimodal large language models possess sufficient reasoning capabilities to utilize composable spatial tools and synthesize custom programs for 3D scene understanding when given an appropriate memory representation.
    This assumption directly supports the zero-shot claim and is invoked in the description of how Flame3D exposes scenes to the MLLM.
invented entities (2)
  • Editable visual-textual 3D memories no independent evidence
    purpose: To provide a dynamic, updatable representation of scenes that MLLMs can access and modify for reasoning tasks.
    Core new representation introduced in the framework; no independent evidence supplied in the abstract.
  • Composable spatial tools with inference-time synthesis no independent evidence
    purpose: To enable both fixed operations and custom program creation for open-ended 3D reasoning.
    New abstraction for tool use; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5569 in / 1536 out tokens · 69783 ms · 2026-05-12T02:05:00.348440+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 4 internal anchors

  1. [1]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,NeurIPS, 2023

  2. [2]

    Chat-3d: Data- efficiently tuning large language model for universal dialogue of 3d scenes, 2023

    Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data- efficiently tuning large language model for universal dialogue of 3d scenes, 2023

  3. [3]

    Spatiallm: Training large language models for structured indoor modeling

    Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, and Zihan Zhou. Spatiallm: Training large language models for structured indoor modeling. InAdvances in Neural Information Processing Systems, 2025

  4. [4]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024

  5. [5]

    Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, An- drei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 2017

  6. [6]

    The point, the vision and the text: Does point cloud boost spatial reasoning of large language models?arXiv preprint arXiv:2504.04540, 2025

    Weichen Zhang, Ruiying Peng, Chen Gao, Jianjie Fang, Xin Zeng, Kaiyuan Li, Ziyou Wang, Jinqiang Cui, Xin Wang, Xinlei Chen, et al. The point, the vision and the text: Does point cloud boost spatial reasoning of large language models?arXiv preprint arXiv:2504.04540, 2025

  7. [7]

    Do 3d large language models really understand 3d spatial relationships? InICLR, 2026

    Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu, Angel X Chang, Iro Armeni, Iro Laina, Songyou Peng, and Victor Adrian Prisacariu. Do 3d large language models really understand 3d spatial relationships? InICLR, 2026

  8. [8]

    Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba

    Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B. Tenenbaum, Celso Miguel de Melo, Madhava Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba. Conceptfusion: Open-set multimodal 3d mapping.Robotics: Science and Systems (RSS), 2023

  9. [9]

    Lerf: Language embedded radiance fields

    Justin* Kerr, Chung Min* Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InICCV, 2023

  10. [10]

    Hierarchical Open-V ocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

    Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical Open-V ocabulary 3D Scene Graphs for Language-Grounded Robot Navigation. In RSS, 2024

  11. [11]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, 2021

  12. [12]

    Spatialprompting: Keyframe-driven zero-shot spatial reasoning with off-the-shelf multimodal large language models.arXiv preprint arXiv:2505.04911, 2025

    Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, and Hiroyuki Sakai. Spatialprompting: Keyframe-driven zero-shot spatial reasoning with off-the-shelf multimodal large language models.arXiv preprint arXiv:2505.04911, 2025

  13. [13]

    See&trek: Training-free spatial prompting for multimodal large language model

    Pengteng Li, Pinhao Song, Wuyang Li, Huizai Yao, Weiyu Guo, Yijie Xu, Dugang Liu, and Hui Xiong. See&trek: Training-free spatial prompting for multimodal large language model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=2exr4mlbx1

  14. [14]

    Agent3d-zero: An agent for zero-shot 3d understanding

    Sha Zhang, Di Huang, Jiajun Deng, Shixiang Tang, Wanli Ouyang, Tong He, and Yanyong Zhang. Agent3d-zero: An agent for zero-shot 3d understanding. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,ECCV, pages 186–202, Cham, 2024. Springer Nature Switzerland. ISBN 978-3-031-72655-2

  15. [15]

    Fouhey, and Joyce Chai

    Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, and Joyce Chai. Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent. InICRA, 2024. 10

  16. [16]

    Sort3d: Spatial object-centric reasoning toolbox for zero-shot 3d grounding using large language models, 2025

    Nader Zantout, Haochen Zhang, Pujith Kachana, Jinkai Qiu, Ji Zhang, and Wenshan Wang. Sort3d: Spatial object-centric reasoning toolbox for zero-shot 3d grounding using large language models, 2025. URLhttps://arxiv.org/abs/2504.18684

  17. [17]

    Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning

    Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. InCoRL, 2023

  18. [18]

    Beyond bare queries: Open-vocabulary object grounding with 3d scene graph

    Sergey Linok, Tatiana Zemskova, Svetlana Ladanova, Roman Titkov, Dmitry Yudin, Maxim Monastyrny, and Aleksei Valenkov. Beyond bare queries: Open-vocabulary object grounding with 3d scene graph. InICRA, 2025

  19. [19]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. InCVPR, 2022

  20. [20]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InICML, 2024

  21. [21]

    Think3d: Thinking with space for spatial reasoning

    Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, et al. Think3d: Thinking with space for spatial reasoning. 2026

  22. [22]

    Sqa3d: Situated question answering in 3d scenes

    Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. InICLR, 2023

  23. [23]

    Pointllm: Empowering large language models to understand point clouds

    Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InECCV, 2024

  24. [24]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024

  25. [25]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning

    Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning. InCVPR, 2024

  26. [26]

    Openscene: 3d scene understanding with open vocabularies

    Songyou Peng, Kyle Genova, Chiyu "Max" Jiang, Andrea Tagliasacchi, Marc Pollefeys, and Thomas Funkhouser. Openscene: 3d scene understanding with open vocabularies. InCVPR, 2023

  27. [27]

    Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning

    Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. InICRA, 2024

  28. [28]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InICLR, 2023

  29. [29]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InNeurIPS, 2023

  30. [30]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv: Arxiv-2305.16291, 2023

  31. [31]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. InICRA, 2023

  32. [32]

    Progprompt: Generating situated robot task plans using large language models

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. InICRA, 2023. 11

  33. [33]

    CaP-X: A framework for benchmarking and improving coding agents for robot manipulation

    Max Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang, Fei-Fei Li, Guanya Shi, Jiajun Wu, Shankar Sastry, Yuke Zhu, Ken Goldberg, and Jim Fan. CaP-X: A framework for benchmarking and improving coding agents for robot manipulation. arXiv preprint arXiv:2603.22435, 2025. URLhttps://arxiv.org/abs/2603.22435

  34. [34]

    Joshi, Kyle Jeffrey, Rosario Jauregui Ruano, Jasmine Hsu, Keerthana Gopalakrishnan, Byron David, Andy Zeng, and Chuyuan Kelly Fu

    brian ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar ...

  35. [35]

    Goat: Go to any thing.RSS, 2024

    Matthew Chang, Theophile Gervet, Mukul Khanna, Sriram Yenamandra, Dhruv Shah, So Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, et al. Goat: Go to any thing.RSS, 2024

  36. [36]

    Navigating to objects in the real world.Science Robotics, 2023

    Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, and Devendra Singh Chaplot. Navigating to objects in the real world.Science Robotics, 2023

  37. [37]

    Procthor: Large-scale embodied ai using procedural generation.NeurIPS, 2022

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation.NeurIPS, 2022

  38. [38]

    Inner monologue: Em- bodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and brian ichter. Inner monologue: Em- bodied reasoning through planning with language models. In Karen Liu, Dana Kulic, and Jeff Ichnow...

  39. [39]

    O’Brien, Carrie J

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InIn the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), UIST ’23, New York, NY , USA, 2023. Association for Computing Machinery

  40. [40]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  41. [41]

    PostGIS Project Steering Committee.PostGIS: Spatial and Geographic Objects for PostgreSQL,

  42. [42]

    URLhttps://postgis.net

  43. [43]

    Stephen Robertson and Hugo Zaragoza.The probabilistic relevance framework: BM25 and beyond. 2009

  44. [44]

    3d-vista: Pre-trained transformer for 3d vision and text alignment

    Zhu Ziyu, Ma Xiaojian, Chen Yixin, Deng Zhidong, Huang Siyuan, and Li Qing. 3d-vista: Pre-trained transformer for 3d vision and text alignment. InICCV, 2023

  45. [45]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding

    Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InCVPR, 2025

  46. [46]

    Cvp: Central- peripheral vision-inspired multimodal model for spatial reasoning

    Zeyuan Chen, Xiang Zhang, Haiyang Xu, Jianwen Xie, and Zhuowen Tu. Cvp: Central- peripheral vision-inspired multimodal model for spatial reasoning. InWACV, 2026

  47. [47]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, 2024. 12

  48. [48]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  49. [49]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024. URLhttps://arxiv.org/abs/2410.02713

  50. [50]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017

  51. [51]

    Scannet++: A high- fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high- fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  52. [52]

    Conversational image segmentation: Grounding abstract concepts with scalable supervision

    Aadarsh Sahoo and Georgia Gkioxari. Conversational image segmentation: Grounding abstract concepts with scalable supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  53. [53]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. ACL, 2002

  54. [54]

    Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments

    Alon Lavie and Abhaya Agarwal. Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments. ACL, 2007

  55. [55]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out. ACL, 2004

  56. [56]

    Lawrence Zitnick, and Devi Parikh

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InCVPR, 2015

  57. [57]

    Openflame: Federated visual positioning system to enable large-scale aug- mented reality applications

    Sagar Bharadwaj, Harrison Williams, Luke Wang, Michael Liang, Tao Jin, Srinivasan Seshan, and Anthony Rowe. Openflame: Federated visual positioning system to enable large-scale aug- mented reality applications. In2025 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 706–716, 2025. doi: 10.1109/ISMAR67309.2025.00080

  58. [58]

    OpenFLAME: A Federated Spatial Naming Infrastructure

    Sagar Bharadwaj, Ziyong Ma, Ivan Liang, Michael Farb, Anthony Rowe, and Srinivasan Seshan. OpenFLAME: A Federated Spatial Naming Infrastructure. In Katerina Argyraki and Aurojit Panda, editors,1st New Ideas in Networked Systems (NINeS 2026), volume 139 ofOpen Access Series in Informatics (OASIcs), pages 20:1–20:26, Dagstuhl, Germany, 2026. Schloss Dagstuh...

  59. [59]

    URL https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.NINeS. 2026.20

  60. [60]

    Uniting the world by dividing it: Federated maps to enable spatial applications

    Sagar Bharadwaj, Anthony Rowe, and Srinivasan Seshan. Uniting the world by dividing it: Federated maps to enable spatial applications. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems, pages 74–79, 2025

  61. [61]

    Openclip, July 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/ zenodo.5143773. If you use this software, please cite it as below

  62. [62]

    A statistical method for evaluating systematic relationships

    Robert R Sokal, Charles D Michener, et al. A statistical method for evaluating systematic relationships. 1958

  63. [63]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  64. [64]

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. InProceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, page 226–231. AAAI Press, 1996. 13

  65. [65]

    Chang, and Matthias Nießner

    Dave Zhenyu Chen, Angel X. Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InECCV, 2020

  66. [66]

    Deep modular co-attention networks for visual question answering

    Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. InCVPR, 2019

  67. [67]

    Scene-llm: Extending language model for 3d visual understanding and reasoning,

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

  68. [68]

    Chat-3d v2: Bridging 3d scene and large language models with object identifiers

    Haifeng Huang, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tao Jin, and Zhou Zhao. Chat-3d v2: Bridging 3d scene and large language models with object identifiers. arXiv preprint arXiv:2312.08168, 2023

  69. [69]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, and Zhou Zhao. Chat-scene: Bridging 3d scene and large language models with object identifiers. InNeurIPS, 2024

  70. [70]

    \n\nQuestion:\n

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024. 14 A 3D Instance Segmentation Pipeline This section details the multi-stage pipeline that converts posed RGB frames into persistent 3D in- stance components referenced in ...

  71. [71]

    saw”→“saw

    Lemmatization.Each label is lowercased and lemmatized as a noun using spaCy, preventing verb-form artifacts (e.g., “saw”→“saw” rather than “see”)

  72. [72]

    CLIP embedding.The unique lemmatized labels are embedded with OpenCLIP [ 59] to obtain dense semantic representations

  73. [73]

    Within each cluster, the label whose embedding is closest to the cluster centroid is selected as the canonical representative, and all labels in the cluster are replaced by it

    Agglomerative clustering.We cluster the embeddings using average-linkage agglomerative clustering [60] with a cosine-distance threshold (default 0.05). Within each cluster, the label whose embedding is closest to the cluster centroid is selected as the canonical representative, and all labels in the cluster are replaced by it. After normalization, we perf...

  74. [74]

    Construct amask canvasusing a painter’s algorithm: masks are sorted by bounding-box area (largest first) and painted onto an (H×W) index map, so that smaller masks on top override larger background masks

  75. [75]

    Each feature point’s pixel location is looked up in the index map to determine which mask (if any) it falls inside

    Project all COLMAP 2D feature points (whose 3D correspondences are known) onto the canvas. Each feature point’s pixel location is looked up in the index map to determine which mask (if any) it falls inside

  76. [76]

    Accumulate the 3D point IDs for each (object, sequence, instance) triple across all frames in the subsequence

  77. [77]

    chair” and once under “seat

    Apply a per-mask DBSCAN [62] pass on the accumulated 3D coordinates (defaultε= 0.5 m, min_samples= 5) to discard spatial outlier points caused by noisy 2D projections. 15 The result is a dictionary mapping each instance triple (object slug, sequence index, SAM3 tracking ID) to its set of associated 3D point IDs. A.5 Mask Connectivity Graph The same physic...

  78. [78]

    Entities and Relations.Identifying or referring to an entity through an open-ended descrip- tion, possibly relative to other entities.Example:which object is closest to the red plastic chair?

  79. [79]

    Affordance.Reasoning about potential actions an agent can perform on an entity.Example: where can I sit in this room?

  80. [80]

    Functionality.Reasoning about the function of an entity.Example:find me something that can help with cleaning the room

Showing first 80 references.