ObsGraph: Hierarchical Observation Representation for Embodied Reasoning and Exploration

H. Jin Kim; Jeonghwa Heo; Jeongjun Choi; Taekbeom Lee; Youngseok Jang

arxiv: 2606.24068 · v1 · pith:67TJU3XOnew · submitted 2026-06-23 · 💻 cs.CV · cs.RO

ObsGraph: Hierarchical Observation Representation for Embodied Reasoning and Exploration

Taekbeom Lee , Youngseok Jang , Jeonghwa Heo , Jeongjun Choi , H. Jin Kim This is my paper

Pith reviewed 2026-06-26 01:28 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords hierarchical scene graphembodied reasoningrobot explorationobservation representationvisual retrievaladaptive explorationcoarse-to-fine search

0 comments

The pith

ObsGraph organizes observations into room-view-object layers that link memory retrieval directly to multi-scale exploration decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ObsGraph, an observation-centric hierarchical scene graph that structures visual evidence into room-level semantic anchors, view-level object covisibility, and object-level details. This layering supports coarse-to-fine retrieval under a fixed budget and converts retrieval results into choices among room exploration, view refinement, or frontier expansion. The central idea is that this tight coupling lets agents identify evidence gaps and explore more purposefully in unfamiliar spaces. A sympathetic reader would care because unstructured memory often leads to inefficient or incomplete information gathering during embodied tasks.

Core claim

ObsGraph is an observation-centric hierarchical scene graph that retains visual evidence and organizes it into room-view-object layers: rooms provide coarse semantic anchors, views preserve contextual object covisibility, and objects store fine-grained details. On top of this representation, the method performs coarse-to-fine hierarchical retrieval under a bounded budget and uses retrieval outcomes to structure the exploration candidate space, activating room-level exploration, view refinement, or frontier exploration, thereby tightly coupling representation, retrieval, and adaptive multi-scale exploration.

What carries the argument

The ObsGraph observation-centric hierarchical scene graph with its room-view-object layering that converts retrieval results into room, view, or frontier exploration candidates.

If this is right

Structured scene representation improves success rates on embodied reasoning and exploration benchmarks.
Targeted information gathering driven by identified evidence gaps increases exploration efficiency.
Retrieval outcomes directly determine whether to explore at room level, refine a view, or expand a frontier.
The unified representation-retrieval-exploration loop reduces reliance on exhaustive or random search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layering might allow incremental graph updates when new observations arrive in partially dynamic scenes.
The bounded-budget retrieval could be combined with task-specific priors to allocate search effort more selectively.
Frontier decisions derived from evidence gaps might transfer across related environments without full remapping.

Load-bearing premise

The room-view-object layering and the mechanism that turns retrieval outcomes into exploration candidates will reliably identify and close evidence gaps without missing task-critical information or introducing excessive overhead.

What would settle it

An experiment in which the hierarchy prunes away a task-critical object during retrieval, causing the agent to select the wrong exploration scale and fail the task despite the object being observable within the budget.

Figures

Figures reproduced from arXiv: 2606.24068 by H. Jin Kim, Jeonghwa Heo, Jeongjun Choi, Taekbeom Lee, Youngseok Jang.

**Figure 1.** Figure 1: We propose ObsGraph, a unified framework that tightly integrates representation, retrieval, and exploration. It maintains an observation-centric scene graph that preserves rich visual evidence while structuring it hierarchically (room–view–object) to provide both spatial context and object co-visibility cues. Given a task, relevant information is retrieved across hierarchy levels, and the retrieved conte… view at source ↗

**Figure 2.** Figure 2: The overall pipeline of the proposed unified reasoning framework, ObsGraph. agent’s current knowledge, making it crucial to select an appropriate exploration scope. Nonetheless, prior work has paid limited attention to how exploration scope should be adaptively selected to resolve such insufficiency. Fine-EQA [15] partially addresses this issue via room-level triggering, but it still depends on task-condit… view at source ↗

**Figure 3.** Figure 3: Incremental update of the view layer. Given new observations, a sub-layer containing re-observed objects is first extracted from the current view layer. Newly observed and re-observed objects are then integrated by bottom-up clustering, followed by top-down view assignment, resulting in an updated view layer. k)/ P k′ (P(vj |ri = k ′ )P(ri = k ′ )) where k denotes the room index. For roomview assignments,… view at source ↗

**Figure 4.** Figure 4: Reasoning process for information retrieval and exploration. In the informationretrieval stage, the questions denoted by Q are queried to an LLM; through three sequential queries, the system retrieves task-relevant information and selects views that capture the relevant classes. In the exploration stage, a VLM is queried to choose an exploration option and its target. An exploration option set is activate… view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on selected EM-EQA episodes between 3D-Mem (left) and our method (right). Each row highlights the role of each layer in our observationcentric hierarchy: (a) room-level retrieval narrows candidates and mitigates lexical exact-match bias, (b) view-level memory preserves informative co-visibility patterns, and (c) object-level crops provide fine-grained evidence for object-state reaso… view at source ↗

**Figure 6.** Figure 6: Cases where each view refinement strategy correctly guides the agent [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

read the original abstract

Embodied reasoning and exploration are increasingly considered crucial abilities for robots operating in complex and unfamiliar environments. To accomplish tasks in such settings, an agent must identify and acquire the information necessary for the task through exploration. We propose ObsGraph, an observation-centric hierarchical scene graph that unifies scene representation, retrieval, and exploration. It retains visual evidence and organizes it into room-view-object layers: rooms provide coarse semantic anchors, views preserve contextual object covisibility, and objects store fine-grained details. On top of this representation, we perform coarse-to-fine hierarchical retrieval under a bounded budget, and crucially use retrieval outcomes to structure the exploration candidate space--activating room-level exploration, view refinement, or frontier exploration--thereby tightly coupling representation, retrieval, and adaptive multi-scale exploration. Experiments across embodied reasoning and exploration benchmarks demonstrate improved success and efficiency, highlighting the benefits of structured scene representation and more targeted information gathering driven by identified evidence gaps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ObsGraph ties a room-view-object graph to exploration by feeding retrieval results back into choosing room-level moves, view refinement, or frontiers, but the abstract gives almost no detail on how that mapping actually works.

read the letter

The main takeaway is that this paper builds an observation-centric graph with three explicit layers—rooms as coarse anchors, views that keep covisibility context, and objects for fine details—then routes the output of hierarchical retrieval straight into the exploration decision under a budget. That closed coupling between what you have retrieved and what you decide to look at next is the concrete piece they put forward.

It does organize visual evidence in a way that lets retrieval drive multi-scale exploration rather than treating them as separate modules. The reported gains in success and efficiency on embodied reasoning and exploration benchmarks suggest the structure can help agents target gaps more directly than looser scene-graph or frontier baselines.

The soft spot is exactly the activation logic that turns retrieval outcomes into one of the three exploration types. The abstract states the coupling happens but supplies no decision rules, pseudocode, or ablation that isolates whether this mapping avoids missing task-critical objects or running up extra cost. Without that, the benchmark numbers do not yet confirm the central claim holds for the reason they think.

This is for people working on memory and active exploration in robotics or embodied AI. A reader who needs a practical hierarchical representation to plug into an agent loop would get something usable to test.

It deserves a serious referee because the idea is specific, the experiments span relevant benchmarks, and the weakest assumption is checkable with the right ablations. I would send it to review.

Referee Report

2 major / 0 minor

Summary. The paper proposes ObsGraph, an observation-centric hierarchical scene graph that organizes visual evidence into room-view-object layers (rooms as coarse anchors, views preserving covisibility, objects storing details). It performs bounded-budget coarse-to-fine hierarchical retrieval and uses retrieval outcomes to activate room-level exploration, view refinement, or frontier exploration, thereby coupling representation, retrieval, and adaptive multi-scale exploration. Experiments on embodied reasoning and exploration benchmarks are claimed to show improved success and efficiency.

Significance. If the empirical results and the retrieval-to-exploration mapping hold under scrutiny, the work could advance embodied AI by offering a structured, evidence-gap-driven approach to scene representation and exploration that improves efficiency over unstructured methods. The explicit layering and bounded-budget retrieval are potentially useful ideas for multi-scale information gathering in unknown environments.

major comments (2)

[Abstract] Abstract: the central claim of 'improved success and efficiency' on 'embodied reasoning and exploration benchmarks' is load-bearing, yet the text supplies no benchmark names, baselines, metrics, trial counts, error bars, or dataset details, preventing verification of the reported gains.
[Abstract] Abstract (paragraph on exploration candidate activation): the mechanism that maps retrieval outcomes to room-level/view-refinement/frontier exploration is load-bearing for the 'tightly coupling' claim and the weakest assumption that it reliably closes evidence gaps without omissions or excess overhead, but no decision rules, pseudocode, or ablation isolating this logic are supplied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major point below and will revise the manuscript accordingly to improve clarity and verifiability.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'improved success and efficiency' on 'embodied reasoning and exploration benchmarks' is load-bearing, yet the text supplies no benchmark names, baselines, metrics, trial counts, error bars, or dataset details, preventing verification of the reported gains.

Authors: We agree the abstract should provide more concrete support for the central claim within length constraints. In the revision we will name the specific benchmarks (ALFRED for reasoning, Habitat-based exploration tasks), list the primary metrics (success rate, SPL, exploration efficiency), note the number of trials, and indicate that all reported gains include error bars (mean ± std). Full experimental protocols, baselines, and dataset details remain in Sections 5 and 6; the abstract update will serve as a high-level pointer rather than a substitute. revision: yes
Referee: [Abstract] Abstract (paragraph on exploration candidate activation): the mechanism that maps retrieval outcomes to room-level/view-refinement/frontier exploration is load-bearing for the 'tightly coupling' claim and the weakest assumption that it reliably closes evidence gaps without omissions or excess overhead, but no decision rules, pseudocode, or ablation isolating this logic are supplied.

Authors: The mapping logic is defined in Section 4.3 (Retrieval-to-Exploration Activation) with explicit conditions based on retrieved evidence gaps at each hierarchy level. We acknowledge that the abstract itself contains no pseudocode or isolated ablation. In the revision we will (1) insert a compact decision table or pseudocode snippet in the abstract or as a footnote, (2) add a dedicated ablation (Section 5.4) that isolates the activation component by comparing against a version that performs uniform frontier exploration, and (3) report overhead metrics to address the concern about excess cost. These additions will make the coupling explicit and verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: method definition independent of results

full rationale

The paper proposes ObsGraph as a new hierarchical scene graph (room-view-object layers) that couples representation with retrieval-driven exploration candidates. No equations, fitted parameters, or self-referential definitions appear. Claims rest on benchmark experiments rather than any reduction of outputs to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the abstract or description. This is a standard non-circular proposal of a structured representation with empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no access to methods, equations, or experiments that would reveal free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5704 in / 1007 out tokens · 18490 ms · 2026-06-26T01:28:03.503491+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 6 linked inside Pith

[1]

arXiv preprint arXiv:2204.01691 (2022)

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

Pith/arXiv arXiv 2022
[2]

In: Proceedings of the IEEE international confer- ence on computer vision

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE international confer- ence on computer vision. pp. 2425–2433 (2015)

2015
[3]

arXiv preprint arXiv:2511.21631 (2025)

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

Pith/arXiv arXiv 2025
[4]

arXiv preprint arXiv:2410.23968 (2024)

Booker, M., Byrd, G., Kemp, B., Schmidt, A., Rivera, C.: Embodiedrag: Dy- namic 3d scene graph retrieval for efficient and scalable robot task planning. arXiv preprint arXiv:2410.23968 (2024)

arXiv 2024
[5]

arXiv preprint arXiv:2212.06817 (2022)

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakr- ishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

Pith/arXiv arXiv 2022
[6]

arXiv preprint arXiv:2512.02458 (2025)

Cai, Z., Du, Y., Wang, C., Kong, Y.: Vision to geometry: 3d spatial mem- ory for sequential embodied mllm reasoning and exploration. arXiv preprint arXiv:2512.02458 (2025)

arXiv 2025
[7]

arXiv preprint arXiv:2511.16719 (2025)

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)

Pith/arXiv arXiv 2025
[8]

Davoodi, A.G., Davoudi, S.P.M., Pezeshkpour, P.: Llms are not intelligent thinkers: Introducing mathematical topic tree benchmark for comprehensive evaluation of llms.In:Proceedingsofthe2025ConferenceoftheNationsoftheAmericasChapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 3127–3140 (2025)

2025
[9]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Fan, Y., Ma, X., Su, R., Guo, J., Wu, R., Chen, X., Li, Q.: Embodied videoagent: Persistent memory from egocentric videos and embodied sensors enables dynamic scene understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6342–6352 (2025)

2025
[10]

In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

2024
[11]

In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition

Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: Vln bert: A recur- rent vision-and-language bert for navigation. In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. pp. 1643–1653 (2021)

2021
[12]

arXiv preprint arXiv:2505.22657 (2025)

Hu, W., Hong, Y., Wang, Y., Gao, L., Wei, Z., Yao, X., Peng, N., Bitton, Y., Szpektor, I., Chang, K.W.: 3dllm-mem: Long-term spatial-temporal memory for embodied 3d large language model. arXiv preprint arXiv:2505.22657 (2025)

arXiv 2025
[13]

arXiv preprint arXiv:2201.13360 (2022)

Hughes, N., Chang, Y., Carlone, L.: Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. arXiv preprint arXiv:2201.13360 (2022)

arXiv 2022
[14]

arXiv preprint arXiv:2410.21276 (2024) 16 T

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 16 T. Lee et al

Pith/arXiv arXiv 2024
[15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, K., Liu, Y., Chen, W., Luo, J., Chen, Z., Pan, L., Li, G., Lin, L.: Beyond the destination: A novel benchmark for exploration-aware embodied question an- swering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9091–9101 (2025)

2025
[16]

Psychometrika32(3), 241–254 (1967)

Johnson, S.C.: Hierarchical clustering schemes. Psychometrika32(3), 241–254 (1967)

1967
[17]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language em- bedded radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 19729–19739 (2023)

2023
[18]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Khanna, M., Ramrakhya, R., Chhablani, G., Yenamandra, S., Gervet, T., Chang, M., Kira, Z., Chaplot, D.S., Batra, D., Mottaghi, R.: Goat-bench: A benchmark for multi-modal lifelong navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16373–16383 (2024)

2024
[19]

In: International confer- ence on machine learning

Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International confer- ence on machine learning. pp. 12888–12900. PMLR (2022)

2022
[20]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., et al.: Openeqa: Embodied question answering in the era of foundation models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16488–16498 (2024)

2024
[21]

arXiv preprint arXiv:2410.05229 (2024)

Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., Farajtabar, M.: Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229 (2024)

Pith/arXiv arXiv 2024
[22]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 815– 824 (2023)

2023
[23]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Ramrakhya, R., Batra, D., Wijmans, E., Das, A.: Pirlnav: Pretraining with imita- tion and rl finetuning for objectnav. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17896–17906 (2023)

2023
[24]

arXiv preprint arXiv:2403.15941 (2024)

Ren, A.Z., Clark, J., Dixit, A., Itkina, M., Majumdar, A., Sadigh, D.: Explore until confident: Efficient exploration for embodied question answering. arXiv preprint arXiv:2403.15941 (2024)

arXiv 2024
[25]

The International Journal of Robotics Research40(12-14), 1510–1546 (2021)

Rosinol,A.,Violette,A.,Abate,M.,Hughes,N.,Chang,Y.,Shi,J.,Gupta,A.,Car- lone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research40(12-14), 1510–1546 (2021)

2021
[26]

arXiv preprint arXiv:2412.14480 (2024)

Saxena, S., Buchanan, B., Paxton, C., Liu, P., Chen, B., Vaskevicius, N., Palmieri, L., Francis, J., Kroemer, O.: Grapheqa: Using 3d semantic scene graphs for real- time embodied question answering. arXiv preprint arXiv:2412.14480 (2024)

arXiv 2024
[27]

In: Confer- ence on Robot Learning

Shah, D., Equi, M.R., Osiński, B., Xia, F., Ichter, B., Levine, S.: Navigation with large language models: Semantic guesswork as a heuristic for planning. In: Confer- ence on Robot Learning. pp. 2683–2699. PMLR (2023)

2023
[28]

arXiv preprint arXiv:2306.13631 (2023)

Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2306.13631 (2023)

arXiv 2023
[29]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3961–3970 (2020)

2020
[30]

arXiv preprint arXiv:2511.08935 (2025) ObsGraph 17

Wang, N., Chen, W., Chen, L., Ji, H., Guo, Z., Zhang, X., Sun, H.: Expand your scope: Semantic cognition over potential-based exploration for embodied visual navigation. arXiv preprint arXiv:2511.08935 (2025) ObsGraph 17

arXiv 2025
[31]

Advances in Neural Information Processing Systems35, 7727–7740 (2022)

Wijmans, E., Essa, I., Batra, D.: Ver: Scaling on-policy rl leads to the emergence of navigation in embodied rearrangement. Advances in Neural Information Processing Systems35, 7727–7740 (2022)

2022
[32]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: In- cremental 3d scene graph prediction from rgb-d sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7515– 7525 (2021)

2021
[33]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yang, Y., Yang, H., Zhou, J., Chen, P., Zhang, H., Du, Y., Gan, C.: 3d-mem: 3d scene memory for embodied exploration and reasoning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17294–17303 (2025)

2025
[34]

In: 2024 IEEE International Con- ference on Robotics and Automation (ICRA)

Yokoyama, N., Ha, S., Batra, D., Wang, J., Bucher, B.: Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In: 2024 IEEE International Con- ference on Robotics and Automation (ICRA). pp. 42–48. IEEE (2024)

2024
[35]

arXiv preprint arXiv:2304.05506 (2023)

Yu, B., Kasaei, H., Cao, M.: Frontier semantic exploration for visual target navi- gation. arXiv preprint arXiv:2304.05506 (2023)

arXiv 2023
[36]

In: Proceedings of the IEEE/CVF international conference on computer vision

Zhang, J., Dong, R., Ma, K.: Clip-fo3d: Learning free open-world 3d scene rep- resentations from 2d dense clip. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2048–2059 (2023)

2048
[37]

arXiv preprint arXiv:2409.18125 (2024)

Zhu, C., Wang, T., Zhang, W., Pang, J., Liu, X.: Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125 (2024)

arXiv 2024
[38]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Zhu, Z., Wang, X., Li, Y., Zhang, Z., Ma, X., Chen, Y., Jia, B., Liang, W., Yu, Q., Deng, Z., et al.: Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8120–8132 (2025)

2025
[39]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Ziliotto, F., Campari, T., Serafini, L., Ballan, L.: Tango: training-free embodied ai agents for open-world tasks. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24603–24613 (2025)

2025
[40]

In: Conference on Robot Learning

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)

2023

[1] [1]

arXiv preprint arXiv:2204.01691 (2022)

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022)

Pith/arXiv arXiv 2022

[2] [2]

In: Proceedings of the IEEE international confer- ence on computer vision

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE international confer- ence on computer vision. pp. 2425–2433 (2015)

2015

[3] [3]

arXiv preprint arXiv:2511.21631 (2025)

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

Pith/arXiv arXiv 2025

[4] [4]

arXiv preprint arXiv:2410.23968 (2024)

Booker, M., Byrd, G., Kemp, B., Schmidt, A., Rivera, C.: Embodiedrag: Dy- namic 3d scene graph retrieval for efficient and scalable robot task planning. arXiv preprint arXiv:2410.23968 (2024)

arXiv 2024

[5] [5]

arXiv preprint arXiv:2212.06817 (2022)

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakr- ishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022)

Pith/arXiv arXiv 2022

[6] [6]

arXiv preprint arXiv:2512.02458 (2025)

Cai, Z., Du, Y., Wang, C., Kong, Y.: Vision to geometry: 3d spatial mem- ory for sequential embodied mllm reasoning and exploration. arXiv preprint arXiv:2512.02458 (2025)

arXiv 2025

[7] [7]

arXiv preprint arXiv:2511.16719 (2025)

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)

Pith/arXiv arXiv 2025

[8] [8]

Davoodi, A.G., Davoudi, S.P.M., Pezeshkpour, P.: Llms are not intelligent thinkers: Introducing mathematical topic tree benchmark for comprehensive evaluation of llms.In:Proceedingsofthe2025ConferenceoftheNationsoftheAmericasChapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 3127–3140 (2025)

2025

[9] [9]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Fan, Y., Ma, X., Su, R., Guo, J., Wu, R., Chen, X., Li, Q.: Embodied videoagent: Persistent memory from egocentric videos and embodied sensors enables dynamic scene understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6342–6352 (2025)

2025

[10] [10]

In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open- vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE Inter- national Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

2024

[11] [11]

In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition

Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: Vln bert: A recur- rent vision-and-language bert for navigation. In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. pp. 1643–1653 (2021)

2021

[12] [12]

arXiv preprint arXiv:2505.22657 (2025)

Hu, W., Hong, Y., Wang, Y., Gao, L., Wei, Z., Yao, X., Peng, N., Bitton, Y., Szpektor, I., Chang, K.W.: 3dllm-mem: Long-term spatial-temporal memory for embodied 3d large language model. arXiv preprint arXiv:2505.22657 (2025)

arXiv 2025

[13] [13]

arXiv preprint arXiv:2201.13360 (2022)

Hughes, N., Chang, Y., Carlone, L.: Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. arXiv preprint arXiv:2201.13360 (2022)

arXiv 2022

[14] [14]

arXiv preprint arXiv:2410.21276 (2024) 16 T

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 16 T. Lee et al

Pith/arXiv arXiv 2024

[15] [15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Jiang, K., Liu, Y., Chen, W., Luo, J., Chen, Z., Pan, L., Li, G., Lin, L.: Beyond the destination: A novel benchmark for exploration-aware embodied question an- swering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9091–9101 (2025)

2025

[16] [16]

Psychometrika32(3), 241–254 (1967)

Johnson, S.C.: Hierarchical clustering schemes. Psychometrika32(3), 241–254 (1967)

1967

[17] [17]

In: Proceedings of the IEEE/CVF international conference on computer vision

Kerr, J., Kim, C.M., Goldberg, K., Kanazawa, A., Tancik, M.: Lerf: Language em- bedded radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 19729–19739 (2023)

2023

[18] [18]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Khanna, M., Ramrakhya, R., Chhablani, G., Yenamandra, S., Gervet, T., Chang, M., Kira, Z., Chaplot, D.S., Batra, D., Mottaghi, R.: Goat-bench: A benchmark for multi-modal lifelong navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16373–16383 (2024)

2024

[19] [19]

In: International confer- ence on machine learning

Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International confer- ence on machine learning. pp. 12888–12900. PMLR (2022)

2022

[20] [20]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., et al.: Openeqa: Embodied question answering in the era of foundation models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16488–16498 (2024)

2024

[21] [21]

arXiv preprint arXiv:2410.05229 (2024)

Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., Farajtabar, M.: Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229 (2024)

Pith/arXiv arXiv 2024

[22] [22]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 815– 824 (2023)

2023

[23] [23]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Ramrakhya, R., Batra, D., Wijmans, E., Das, A.: Pirlnav: Pretraining with imita- tion and rl finetuning for objectnav. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17896–17906 (2023)

2023

[24] [24]

arXiv preprint arXiv:2403.15941 (2024)

Ren, A.Z., Clark, J., Dixit, A., Itkina, M., Majumdar, A., Sadigh, D.: Explore until confident: Efficient exploration for embodied question answering. arXiv preprint arXiv:2403.15941 (2024)

arXiv 2024

[25] [25]

The International Journal of Robotics Research40(12-14), 1510–1546 (2021)

Rosinol,A.,Violette,A.,Abate,M.,Hughes,N.,Chang,Y.,Shi,J.,Gupta,A.,Car- lone, L.: Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research40(12-14), 1510–1546 (2021)

2021

[26] [26]

arXiv preprint arXiv:2412.14480 (2024)

Saxena, S., Buchanan, B., Paxton, C., Liu, P., Chen, B., Vaskevicius, N., Palmieri, L., Francis, J., Kroemer, O.: Grapheqa: Using 3d semantic scene graphs for real- time embodied question answering. arXiv preprint arXiv:2412.14480 (2024)

arXiv 2024

[27] [27]

In: Confer- ence on Robot Learning

Shah, D., Equi, M.R., Osiński, B., Xia, F., Ichter, B., Levine, S.: Navigation with large language models: Semantic guesswork as a heuristic for planning. In: Confer- ence on Robot Learning. pp. 2683–2699. PMLR (2023)

2023

[28] [28]

arXiv preprint arXiv:2306.13631 (2023)

Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2306.13631 (2023)

arXiv 2023

[29] [29]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3d semantic scene graphs from 3d indoor reconstructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3961–3970 (2020)

2020

[30] [30]

arXiv preprint arXiv:2511.08935 (2025) ObsGraph 17

Wang, N., Chen, W., Chen, L., Ji, H., Guo, Z., Zhang, X., Sun, H.: Expand your scope: Semantic cognition over potential-based exploration for embodied visual navigation. arXiv preprint arXiv:2511.08935 (2025) ObsGraph 17

arXiv 2025

[31] [31]

Advances in Neural Information Processing Systems35, 7727–7740 (2022)

Wijmans, E., Essa, I., Batra, D.: Ver: Scaling on-policy rl leads to the emergence of navigation in embodied rearrangement. Advances in Neural Information Processing Systems35, 7727–7740 (2022)

2022

[32] [32]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wu, S.C., Wald, J., Tateno, K., Navab, N., Tombari, F.: Scenegraphfusion: In- cremental 3d scene graph prediction from rgb-d sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7515– 7525 (2021)

2021

[33] [33]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Yang, Y., Yang, H., Zhou, J., Chen, P., Zhang, H., Du, Y., Gan, C.: 3d-mem: 3d scene memory for embodied exploration and reasoning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 17294–17303 (2025)

2025

[34] [34]

In: 2024 IEEE International Con- ference on Robotics and Automation (ICRA)

Yokoyama, N., Ha, S., Batra, D., Wang, J., Bucher, B.: Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In: 2024 IEEE International Con- ference on Robotics and Automation (ICRA). pp. 42–48. IEEE (2024)

2024

[35] [35]

arXiv preprint arXiv:2304.05506 (2023)

Yu, B., Kasaei, H., Cao, M.: Frontier semantic exploration for visual target navi- gation. arXiv preprint arXiv:2304.05506 (2023)

arXiv 2023

[36] [36]

In: Proceedings of the IEEE/CVF international conference on computer vision

Zhang, J., Dong, R., Ma, K.: Clip-fo3d: Learning free open-world 3d scene rep- resentations from 2d dense clip. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2048–2059 (2023)

2048

[37] [37]

arXiv preprint arXiv:2409.18125 (2024)

Zhu, C., Wang, T., Zhang, W., Pang, J., Liu, X.: Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125 (2024)

arXiv 2024

[38] [38]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Zhu, Z., Wang, X., Li, Y., Zhang, Z., Ma, X., Chen, Y., Jia, B., Liang, W., Yu, Q., Deng, Z., et al.: Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8120–8132 (2025)

2025

[39] [39]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Ziliotto, F., Campari, T., Serafini, L., Ballan, L.: Tango: training-free embodied ai agents for open-world tasks. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24603–24613 (2025)

2025

[40] [40]

In: Conference on Robot Learning

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)

2023