
arxiv: 2606.01072 · v2 · pith:7EIGLUW6new · submitted 2026-05-31 · 💻 cs.RO · cs.CV

Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs

Pith reviewed 2026-06-28 17:22 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords: scene graphs, imitation learning, robotic manipulation, partial observability, long-term reasoning, mobile manipulation, tabletop manipulation

The pith

Dynamic scene graphs serve as explicit memory so imitation-learned robot policies can track object relations across long sequences and partial views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes maintaining a dynamic scene graph during imitation learning so a robot policy can keep track of object-centric relationships and how they change over time. This structured memory lets the policy reason over information that has accumulated gradually rather than depending only on the current partial observation. The approach targets two common real-world difficulties: large spaces that hide much of the environment from any single viewpoint, and tasks that require completing several subtasks in sequence. Experiments in both simulated mobile manipulation and real tabletop settings show the method raises policy success rates, especially when long-term recall and generalization from incomplete data are required.

Core claim

By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, the method supplies the agent with an explicit structured memory that retains relevant historical context, enabling efficient reasoning over incrementally accrued scene information during task execution.

What carries the argument

Dynamic scene graph serving as explicit structured memory that records object-centric relationships and their temporal changes.
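
As a rough illustration of what "explicit structured memory" could look like in code, the sketch below keeps object nodes and pairwise relations alive after the objects leave the field of view and logs every relation change. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `ObjectNode`, `SceneGraphMemory`, `observe`, and `recall`, and the use of 3D positions and string relation labels, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Hypothetical node: last known position and the step it was last seen."""
    name: str
    position: tuple
    last_seen: int


@dataclass
class SceneGraphMemory:
    """Toy dynamic scene graph: nodes persist after objects leave the view."""
    nodes: dict = field(default_factory=dict)
    # (subject, object) -> relation label, e.g. ("cup", "table") -> "on"
    relations: dict = field(default_factory=dict)
    # Append-only log of relation changes: (step, subject, object, old, new)
    history: list = field(default_factory=list)

    def observe(self, step, detections, relations=()):
        """Merge one partial observation into the graph."""
        for name, position in detections.items():
            self.nodes[name] = ObjectNode(name, position, step)
        for subj, rel, obj in relations:
            old = self.relations.get((subj, obj))
            if old != rel:
                self.history.append((step, subj, obj, old, rel))
                self.relations[(subj, obj)] = rel

    def recall(self, name):
        """Return the remembered node, even if it is not currently visible."""
        return self.nodes.get(name)


memory = SceneGraphMemory()
memory.observe(0, {"cup": (0.4, 0.1, 0.8)}, [("cup", "on", "table")])
memory.observe(5, {"drawer": (1.2, 0.0, 0.5)})  # the cup is out of view here
print(memory.recall("cup").last_seen)  # prints 0
```

The point of the design is that `recall` answers from accumulated history rather than from the current frame, which is the property the pith attributes to the scene graph.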

If this is right

  • Policy success rates rise substantially on mobile manipulation tasks that span large spaces.
  • Real-world tabletop policies generalize better when observations are incomplete.
  • Reasoning over extended time horizons improves because the graph preserves subtask history.
  • Incremental scene information becomes usable without retraining the entire policy from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph structure could be tested as a memory layer inside other robot learning pipelines that currently rely on recurrent networks or attention over raw images.
  • If graph construction remains reliable, the approach may reduce the need for full environment resets between trials in long-horizon experiments.
  • Combining the graph with additional geometric features such as contact points could be examined in follow-up work to handle finer manipulation details.

Load-bearing premise

Scene graphs can be built and kept accurate enough from incomplete sensor data to supply useful historical context.

What would settle it

A test in which the constructed scene graph repeatedly misrepresents object relations or locations from partial observations, causing the learned policy to fail on any task that depends on recalling earlier states.
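
One way such a stress test might be set up is sketched below: remembered object positions are corrupted at a controlled rate, and a stand-in "policy" succeeds only when the recalled position is close to the truth. Everything here is an assumption made for illustration (the function names, the noise model, the tolerance, and the toy success criterion); the paper's actual evaluation protocol is not described in the text above.

```python
import random


def corrupt(position, error_rate, noise, rng):
    """With probability error_rate, shift each axis of a position by +/- noise."""
    if rng.random() >= error_rate:
        return position
    return tuple(c + rng.choice((-noise, noise)) for c in position)


def recall_success_rate(true_positions, error_rate, noise=0.5,
                        tolerance=0.05, trials=200, seed=0):
    """Fraction of trials in which the remembered position is within tolerance."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        name = rng.choice(sorted(true_positions))
        truth = true_positions[name]
        remembered = corrupt(truth, error_rate, noise, rng)
        error = max(abs(a - b) for a, b in zip(truth, remembered))
        successes += error <= tolerance
    return successes / trials


scene = {"cup": (0.4, 0.1, 0.8), "drawer": (1.2, 0.0, 0.5)}
for rate in (0.0, 0.5, 1.0):
    print(rate, recall_success_rate(scene, rate))
```

Sweeping `error_rate` from 0 to 1 gives a curve of recall-dependent success against graph accuracy; a real version would replace the toy criterion with rollouts of the learned policy.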

Figures

Figures reproduced from arXiv: 2606.01072 by Bernadette Bucher, Dinesh Jayaraman, Emmanuel Panov, Jianing Qian, Leonor Fermoselle, Qinhe Peng, Tarik Kelestemur.

Figure 1. We maintain a scene graph where each task-relevant object is represented as a node…
Figure 2. We observe that errors in early subtasks often prop…
Figure 2. Success rates of different methods and their ablations across three simulated tasks are represented by stacked bar plots. The…
Figure 3. Illustrations of the simulated mobile manipulation tasks.
Figure 4. Illustration of the real-world tabletop manipulation tasks.
Figure 5. Partial-Observation Tabletop Setup. We evaluate our method using a 7-DoF Franka Emika Panda arm with only a wrist-mounted ZED Mini RGB-D camera, removing the side cameras. While this setup simplifies data collection and makes policies invariant to many task-irrelevant scene features, it introduces limited and changing viewpoints. Policies must therefore leverage observations accumulated over the trajector…
Figure 6. Success rates of different methods and their ablations on the first three real-world tabletop manipulation tasks are shown as…
Figure 7. GPT-4 responses for task-relevant object name identifi…
Original abstract

Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes using dynamic scene graphs as an explicit structured memory mechanism within imitation learning policies for robots. By capturing object-centric relationships and their evolution over time, the method aims to retain historical context and enable reasoning over incrementally accrued scene information, addressing partial observability in large environments and long time horizons in sequential tasks. Experiments are claimed to demonstrate substantial policy performance improvements in simulated mobile manipulation and real-world tabletop manipulation, particularly for long-term reasoning and generalization under partial observability.

Significance. If the performance claims hold with appropriate evidence, the work could offer a useful structured alternative to implicit memory mechanisms (e.g., RNNs or transformers) for scaling imitation learning to real-world settings with large spatial scales and extended task horizons.

major comments (1)
  1. [Abstract] The abstract claims that the approach 'substantially improves policy performance' in simulated and real experiments but supplies no quantitative results, baselines, or method details. This prevents verification that the data support the central claim of improved reasoning over long horizons and partial observability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to clarify our work. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The abstract claims that the approach 'substantially improves policy performance' in simulated and real experiments but supplies no quantitative results, baselines, or method details. This prevents verification that the data support the central claim of improved reasoning over long horizons and partial observability.

    Authors: We agree that the abstract would be improved by including quantitative highlights to support the performance claims. The full manuscript reports specific success rates, baseline comparisons, and ablation results demonstrating gains in long-horizon tasks under partial observability (see Experiments section). In the revision we will update the abstract to reference key metrics, such as relative improvements over baselines in simulated mobile manipulation and real tabletop tasks, while keeping the abstract concise. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper proposes a methodological approach of using dynamic scene graphs as explicit memory for imitation learning to address partial observability and long time horizons. The provided abstract and text contain no equations, parameter fits, predictions, or self-citations that form a derivation chain. The central claim is a design proposal evaluated via experiments on simulated and real tasks, with no load-bearing step that reduces by construction to its own inputs. This is self-contained against external benchmarks as a standard engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5683 in / 929 out tokens · 34560 ms · 2026-06-28T17:22:23.325980+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 9 linked inside Pith

  1. [1]

    Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3d scene graph: A structure for unified semantics, 3d space, and camera. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5663–5672, 2019

  2. [2]

    Katrina Ashton, Chahyon Ku, Shrey Shah, Wen Jiang, Kostas Daniilidis, and Bernadette Bucher. HELIOS: Hierarchical Exploration for Language-grounded Interaction in Open Scenes. ArXiv, abs/2509.22498, 2025

  3. [3]

    Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, and Pietro Mazzaglia. Focusing on what matters: Object-agent-centric tokenization for vision language action models. ArXiv, abs/2509.23655, 2025

  4. [4]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021

  5. [5]

    Haonan Chang, Kowndinya Boyalakuntla, Shiyang Lu, Siwei Cai, Eric Pu Jing, Shreesh Keskar, Shijie Geng, Adeeb Abbas, Lifeng Zhou, Kostas E. Bekris, and Abdeslam Boularias. Context-aware entity grounding with open-vocabulary 3d scene graphs. ArXiv, abs/2309.15940, 2023

  6. [6]

    Yun Chang, Leonor Fermoselle, Duy Ta, Bernadette Bucher, Luca Carlone, and Jiuguang Wang. ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  7. [7]

    Ye Chen, Peter Beech, Ziwei Yin, Shanshan Jia, Jiayi Zhang, Zhaofei Yu, and Jian K Liu. Decoding dynamic visual scenes across the brain hierarchy. PLOS Computational Biology, 20(8):e1012297, 2024

  8. [8]

    Ho Kei Cheng and Alexander G. Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In European Conference on Computer Vision, 2022

  9. [9]

    Shun-Cheng Wu, Johanna Wald, Keisuke Tateno, Nassir Navab, and Federico Tombari. Scenegraphfusion: Incremental 3d scene graph prediction from rgb-d sequences. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7511–7521, 2021

  10. [10]

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  11. [11]

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. ArXiv, abs/2402.10329, 2024

  12. [12]

    Sudeep Dasari et al. Transformers for one-shot imitation learning. In CoRL, 2020

  13. [13]

    Venkata Naren Devarakonda, Raktim Gautam Goswami, Ali Umut Kaypak, Naman Patel, Rooholla Khorrambakht, Prashanth Krishnamurthy, and Farshad Khorrami. Orionnav: Online planning for robot autonomy with context-aware llm and open-vocabulary semantic scene graphs. ArXiv, abs/2410.06239, 2024

  14. [14]

    Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, and Ian Charest. Visual representations in the human brain are aligned with large language models, 2024

  15. [15]

    Shi Dong, Benjamin Van Roy, and Zhengyuan Zhou. Simple agent, complex environment: Efficient reinforcement learning with agent states. Journal of Machine Learning Research, 23(255):1–54, 2022

  16. [16]

    Haritheja Etukuru, Norihito Naka, Zijin Hu, Seungjae Lee, Julian Mehu, Aaron Edsinger, Chris Paxton, Soumith Chintala, Lerrel Pinto, and Nur Muhammad Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments. 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8275–8283, 2024

  17. [17]

    Pete Florence, Lucas Manuelli, and Russ Tedrake. Implicit behavioral cloning. RSS, 2022

  18. [18]

    Paul Gay, Stuart James, and Alessio Del Bue. Visual graphs from motion (vgfm): Scene understanding with object geometry reasoning. In Asian Conference on Computer Vision, 2018

  19. [19]

    Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manipulation from few demonstrations. arXiv preprint arXiv:2406.08545, 2024

  20. [20]

    Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Ramalingam Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B Tenenbaum, Antonio Torralba, Florian Shkurti, and Liam Paull. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. ArXiv, abs/2309.16650, 2023

  21. [21]

    Zihong He, Weizhe Lin, Hao Zheng, Fan Zhang, Matt W Jones, Laurence Aitchison, Xuhai Xu, Miao Liu, Per Ola Kristensson, and Junxiao Shen. Human-inspired perspectives: A survey on ai long-term memory. arXiv preprint arXiv:2411.00489, 2024

  22. [22]

    Nathan Hughes, Yun Chang, Siyi Hu, Rajat Talak, Rumaisa Abdulhai, Jared Strader, and Luca Carlone. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems. ArXiv, abs/2305.07154, 2023

  23. [23]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  24. [24]

    Eric Jang et al. Bc-z: Zero-shot task generalization with robotic imitation learning. CoRL, 2022

  25. [25]

    Krishna Murthy Jatavallabhula, Ali Kuwajerwala, Qiao Gu, Mohd. Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Varma Keetha, Ayush Kumar Tewari, Joshua B. Tenenbaum, Celso M. de Melo, M. Krishna, Liam Paull, Florian Shkurti, and Antonio Torralba. Conceptfusion: Open-set multimodal 3d mapping. ArXiv, abs/2302.07241, 2023

  26. [26]

    Hanxiao Jiang, Binghao Huang, Ruihai Wu, Zhuoran Li, Shubham Garg, Hooshang Nayyeri, Shenlong Wang, and Yunzhu Li. Roboexp: Action-conditioned scene graph via interactive exploration for robotic manipulation. ArXiv, abs/2402.15487, 2024

  27. [27]

    Nan Jiang. On value functions and the agent–environment boundary. arXiv preprint arXiv:1905.13341, 2019

  28. [28]

    M Jaleed Khan, Filip Ilievski, John G Breslin, and Edward Curry. A survey of neurosymbolic visual reasoning with scene graphs and common sense knowledge. Neurosymbolic Artificial Intelligence, 1:NAI–240719, 2025

  29. [29]

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3992–4003, 2023

  30. [30]

    Abhilasha A Kumar. Semantic memory: A review of methods, models, and current challenges. Psychonomic bulletin & review, 28(1):40–80, 2021

  31. [31]

    Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene graph generation from objects, phrases and region captions. 2017 IEEE International Conference on Computer Vision (ICCV), pages 1270–1279, 2017

  32. [32]

    Xin Lin, Changxing Ding, Jinquan Zeng, and Dacheng Tao. Gps-net: Graph property sensing network for scene graph generation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3743–3752, 2020

  33. [33]

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chun yue Li, Jianwei Yang, Hang Su, Jun-Juan Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. ArXiv, abs/2303.05499, 2023

  34. [34]

    Xiao Liu, Fabian Weigend, Yifan Zhou, and Heni Ben Amor. Enabling stateful behaviors for diffusion-based policy learning. arXiv preprint arXiv:2404.12539, 2024

  35. [35]

    Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen, et al. Reinforcement learning, bit by bit. Foundations and Trends in Machine Learning, 16(6):733–865, 2023

  36. [36]

    Dominic Maggio, Yun Chang, Nathan Hughes, Matthew Trang, Dan Griffith, Carlyn Dougherty, Eric Cristofalo, Lukas Schmid, and Luca Carlone. Clio: Real-time task-driven open-set 3d scene graphs. 2024

  37. [37]

    Ajay Mandlekar et al. What matters in learning from offline human demonstrations for robot manipulation. In CoRL, 2021

  38. [38]

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025

  39. [39]

    Zhe Ni, Xiao-Xin Deng, Cong Tai, Xin-Yue Zhu, Xiang Wu, Y. Liu, and Long Zeng. Grid: Scene-graph-based instruction-driven robotic task planning. 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13765–13772, 2023

  40. [40]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Q. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russ Howes, Po-Yao (Bernie) Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Huijiao Xu, Hervé Jégou,...

  41. [41]

    Jianing Qian, Yunshuang Li, Bernadette Bucher, and Dinesh Jayaraman. Task-oriented hierarchical object decomposition for visuomotor control. In Conference on Robot Learning, 2024

  42. [42]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021

  43. [43]

    Zachary Ravichandran, Lisa Peng, Nathan Hughes, J. Daniel Griffith, and Luca Carlone. Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks. 2022 International Conference on Robotics and Automation (ICRA), pages 9272–9279, 2021

  44. [44]

    Sonia Raychaudhuri, Duy Ta, Katrina Ashton, Angel X Chang, Jiuguang Wang, and Bernadette Bucher. Zero-shot object-centric instruction following: Integrating foundation models with traditional navigation. ArXiv, abs/2411.07848, 2024

  45. [45]

    Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on robot learning, pages 91–100. PMLR, 2022

  46. [46]

    Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Chul Shin, Soroush Nasiriany, Ajay Mandlekar, and Danfei Xu. What matters in learning from large-scale datasets for robot manipulation. arXiv preprint arXiv:2506.13536, 2025

  47. [47]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  48. [48]

    Junyao Shi, Jianing Qian, Yecheng Jason Ma, and Dinesh Jayaraman. Composing pre-trained object-centric representations for robotics from "what" and "where" foundation models. 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 15424–15432, 2024

  49. [49]

    Maximilian Sieb, Xian Zhou, Audrey Huang, Oliver Kroemer, and Katerina Fragkiadaki. Graph-structured visual imitation. In Conference on Robot Learning, 2019

  50. [50]

    Amit Sinha and Aditya Mahajan. Agent-state-based policies in pomdps: Beyond belief-state mdps. arXiv preprint arXiv:??, 2023

  51. [51]

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  52. [52]

    Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning long-context diffusion policies via past-token prediction. arXiv preprint arXiv:2505.09561, 2025

  53. [53]

    Endel Tulving. Episodic and semantic memory. In Organization of Memory, pages 381–403. Academic Press, 1972

  54. [54]

    Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, Bolin Zhang, Wancai Zheng, Xinyi Yu, Hao Chen, and Chunhua Shen. Odyssey: Open-world quadrupeds exploration and manipulation for long-horizon tasks. ArXiv, abs/2508.08240, 2025

  55. [55]

    Danfei Xu, Yuke Zhu, Christopher Bongsoo Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3097–3106, 2017

  56. [56]

    Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, et al. Maniflow: A general robot manipulation policy via consistency flow training. arXiv preprint arXiv:2509.01819, 2025

  57. [57]

    Zhijie Yan, Shufei Li, Zuoxu Wang, Lixiu Wu, Han Wang, Jun-Yan Zhu, Lijiang Chen, and Jihong Liu. Dynamic open-vocabulary 3d scene graphs for long-term language-guided mobile manipulation. IEEE Robotics and Automation Letters, 10:4252–4259, 2024

  58. [58]

    Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph r-cnn for scene graph generation. In European Conference on Computer Vision, 2018

  59. [59]

    Ruochu Yang, Yu Zhou, Fumin Zhang, and Mengxue Hou. Interleaved llm and motion planning for generalized multi-object collection in large scene graphs. ArXiv, abs/2507.15782, 2025

  60. [60]

    Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 42–48, 2023

  61. [61]

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954, 2024

  62. [62]

    Shoulong Zhang, Shuai Li, Aimin Hao, and Hong Qin. Knowledge-inspired 3d scene graph prediction in point cloud. In Neural Information Processing Systems, 2021

  63. [63]

    Tingfan Zhang, Zoe McCarthy, Eric Jang, and Sergey Levine. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In IROS, 2018

  64. [64]

    Yunzhu Zhang et al. Perceiver-actor: A multi-task transformer for robotic manipulation. In CoRL, 2021

  65. [65]

    Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Siyi Liu, Yandong Guo, and Lei Zhang. Recognize anything: A strong image tagging model. ArXiv, abs/2306.03514, 2023

  66. [66]

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  67. [68]

    Yifeng Zhu, Zhenyu Jiang, Peter Stone, and Yuke Zhu. Learning generalizable manipulation policies with object-centric 3d representations. arXiv preprint arXiv:2310.14386, 2023.

    Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs. Supplementary Material

  68. [69]

    Additional Information about Our Task-Driven Scene Graph. Concretely, our scene graph is implemented as a two-level tree in which the root node is represented by the CLS token extracted from the DINO-v2 encoder applied to the current image observation. The second level of the tree consists of the set of task-relevant object nodes described in Section 3.1. A...

  69. [70]

    Technical Details for Collecting Demonstrations in Simulated Mujoco Environment. Low-Level Controller. During demonstration collection, we train a locomotion controller in simulation. Following the MDP formulation of [45], we employ Proximal Policy Optimization (PPO) [47] within IsaacLab [38] to learn a robust quadruped walking policy. The resulting lo...

  70. [71]

    Technical Details for Real World Tabletop Manipulation Experiments. We utilize a 7-DoF Franka robotic arm operating under a continuous joint-control action space at 15 Hz. A ZED Mini camera is mounted on the robot's wrist, and the captured wrist images are encoded using DINO-v2. The resulting CLS token is incorporated as an additional input to the policy...
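
The two-level tree mentioned in the supplementary excerpt above (a root carrying the image-level CLS token, with task-relevant object nodes as its children) can be pictured with the short sketch below. It is only an illustration under stated assumptions: `TreeNode`, `build_two_level_tree`, and `depth` are hypothetical names, the feature vectors are placeholder numbers rather than real DINO-v2 outputs, and how the paper's policy consumes the tree is not shown in the excerpt.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """Hypothetical tree node holding a label and a feature vector."""
    label: str
    feature: list
    children: list = field(default_factory=list)


def build_two_level_tree(cls_token, object_features):
    """Root holds the image-level token; each child is one task-relevant object."""
    root = TreeNode("root", list(cls_token))
    for name, feature in object_features.items():
        root.children.append(TreeNode(name, list(feature)))
    return root


def depth(node):
    """Number of levels in the tree rooted at node."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Placeholder vectors standing in for encoder outputs (assumed, not real features).
tree = build_two_level_tree(
    cls_token=[0.1, 0.2, 0.3],
    object_features={"cup": [0.5, 0.1, 0.0], "drawer": [0.0, 0.9, 0.2]},
)
print(depth(tree), [child.label for child in tree.children])
```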