pith. sign in

arxiv: 2606.31682 · v1 · pith:7P2SVUROnew · submitted 2026-06-30 · 💻 cs.RO

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

Pith reviewed 2026-07-01 05:29 UTC · model grok-4.3

classification 💻 cs.RO
keywords human-robot interactionrobot manipulationdemonstration datasethuman-aware behaviorscollaborative taskscoworker taskssupervisor tasksrobot learning
0
0 comments X

The pith

Training on human-present robot data produces synchronization, yielding, and gesture responses absent from human-absent data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing large-scale robot demonstration datasets are collected without humans present, so policies trained on them complete tasks in isolation but lack behaviors needed when people share the space. HABIT supplies over 10,000 episodes across 60 tasks in human-present settings, grouped into three interaction roles: Collaborator for joint work, Coworker for separate tasks in shared space, and Supervisor for human direction of the robot. Experiments show that policies trained on this data display spatiotemporal synchronization during joint tasks, yielding space to humans, and grounding to gestures, none of which appear in policies trained on robot-only data. The same training also supports faster adaptation when the robot faces new human-robot interaction tasks. The work treats human presence itself as an added dimension of dataset diversity needed for policies that operate safely alongside people.

Core claim

The paper claims that a dataset of robot demonstrations collected with humans present, organized by the three roles of Collaborator, Coworker, and Supervisor and totaling over 10K episodes, produces policies that exhibit human-aware behaviors including spatiotemporal synchronization in joint tasks, yielding in shared-space separate tasks, and gesture grounding in directed tasks; these behaviors do not emerge from training on human-absent data, and the dataset further enables rapid adaptation to new interaction tasks.

What carries the argument

HABIT dataset structured by Collaborator, Coworker, and Supervisor interaction roles, which supplies the human-present demonstrations used to elicit the target behaviors.

If this is right

  • Policies trained on HABIT exhibit spatiotemporal synchronization during joint human-robot tasks.
  • Policies yield space to humans when both pursue separate tasks in the same environment.
  • Policies respond to human gestures as instructions in supervisory settings.
  • Training on HABIT produces faster adaptation to previously unseen human-robot interaction tasks.
  • Robot policies gain the capacity to operate in environments shared with humans by incorporating this form of data diversity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could measure whether the learned behaviors reduce collision rates or interruptions in unstructured home or factory settings.
  • The supervisor role could be extended by pairing the dataset with language models to handle more open-ended instructions.
  • Adding metrics of social comfort or task efficiency alongside the reported behaviors would strengthen evidence of practical value.
  • Other existing robot benchmarks could be re-collected with humans present to isolate the contribution of this diversity axis.

Load-bearing premise

The three interaction roles and the collected episodes sufficiently represent real-world human-robot dynamics to produce generalizable human-aware policies.

What would settle it

A controlled test in which policies trained on HABIT show no measurable increase in synchronization, yielding, or gesture response compared with policies trained on matched human-absent data would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.31682 by Byeongguk Jeon, Hyungmok Son, Jaehwi Song, Kimin Lee, Minjoon Seo, Suchae Jeong, Sungdong Kim.

Figure 1
Figure 1. Figure 1: HABIT comprises 164 hours of human-robot interaction demonstrations across 60 tasks [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Representative examples of task workflows with their subtask sequences. For each row, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) The collection unit includes both the human and the robot agent, along with five RGB [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Dataset statistics. (a) Per-role workflow composition, with single-agent and cross-agent [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Representative evaluation tasks across the three human roles. Collaborator: [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Success rate across six evaluation tasks for (a) [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Role-specific failure cases for the Collaborator (left), Coworker (middle), and Supervisor [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Role-specific failure analysis on one representative task per role. HABIT (Ours) sub [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Sample efficiency of HABIT mid-training vs. direct fine-tuning. HABIT mid-training [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Robot-side workspace detail. The world origin (yellow) lies at the midpoint between the [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The five initial setups for the Table Serving task. The setups vary the cup-bowl arrangement [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: The three initial setups for the Shelf Cleaning task. The setups vary which object (clock or [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: The five initial setups for the Waste Sorting task. The setups vary the arrangement of the [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: The five initial setups for the Box Packaging task. The setups vary the arrangement of the [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Initial setups for the Food Storage task. (a) The pointing positions define the four setups. [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Pointing positions: the donut indexed 1 through 4 from the human operator’s right. [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Initial configuration and task workflow for Shelf Cleaning. [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Initial configuration and task workflow for Table Serving. [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Initial configuration and task workflow for Waste Sorting. [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Initial configuration and task workflow for Box Packaging. [PITH_FULL_IMAGE:figures/full_fig_p022_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Initial configuration and task workflow for Food Storage. [PITH_FULL_IMAGE:figures/full_fig_p022_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Initial configuration and task workflow for Donut Serving. [PITH_FULL_IMAGE:figures/full_fig_p023_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Role-specific failure analysis on the remaining three tasks. Failure types are equal to those [PITH_FULL_IMAGE:figures/full_fig_p026_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Success rate under in-distribution and out-of-distribution conditions for HABIT-trained (a) [PITH_FULL_IMAGE:figures/full_fig_p027_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Role-specific failure analysis under in-distribution and out-of-distribution conditions. [PITH_FULL_IMAGE:figures/full_fig_p028_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Failure mode breakdown for direct fine-tuning and HABIT mid-training followed by [PITH_FULL_IMAGE:figures/full_fig_p029_26.png] view at source ↗
read the original abstract

Large-scale demonstration datasets have been central to recent progress in general-purpose robot policies. However, existing datasets are collected in human-absent settings, and policies trained on such data may perform tasks competently in isolation but fail to exhibit human-aware behaviors. To address this gap, we introduce HABIT, a large-scale robot demonstration dataset for human-present environments. We organize tasks into three roles capturing distinct modes of human-robot interaction: Collaborator, where human and robot jointly accomplish a task; Coworker, where they pursue separate tasks in a shared space; and Supervisor, where the human directs the robot. The dataset comprises over 10K episodes and over 160 hours across 60 tasks. Our experiments show that training on human-present data elicits human-aware behaviors that robot-only data fails to produce: spatiotemporal synchronization in Collaborator tasks, yielding in Coworker tasks, and gesture grounding in Supervisor tasks. Moreover, training on HABIT enables rapid adaptation to new human-robot interaction tasks. By introducing human presence as a new axis of dataset diversity, HABIT extends robot policies to environments shared with humans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces HABIT, a large-scale robot demonstration dataset comprising over 10K episodes and 160 hours across 60 tasks in human-present environments. Tasks are organized into three interaction roles—Collaborator (joint task accomplishment), Coworker (separate tasks in shared space), and Supervisor (human directs the robot). The central claim is that policies trained on this human-present data exhibit human-aware behaviors (spatiotemporal synchronization, yielding, and gesture grounding) absent in robot-only training, and that the dataset enables rapid adaptation to new human-robot interaction tasks.

Significance. If substantiated, HABIT would provide a useful new resource by adding human presence as an axis of diversity to robot manipulation datasets, potentially supporting development of policies that handle shared environments. The scale and role-based organization are strengths for studying distinct interaction modes. The work supplies a concrete dataset contribution at a time when large-scale demonstrations drive progress in robot learning.

major comments (2)
  1. [Abstract] Abstract: the claim that training on human-present data elicits spatiotemporal synchronization, yielding, and gesture grounding (absent from robot-only data) and enables rapid adaptation lacks any reported metrics, baseline comparisons, statistical details, or data collection protocols, which are load-bearing for the central empirical claims.
  2. [Dataset Collection / Experiments] Dataset and Experiments sections: no evidence is supplied that the 60 tasks and collected episodes capture representative human variability in motions, intent signaling, or environmental factors; without this, it is unclear whether observed behaviors arise from human presence per se or from dataset artifacts, directly affecting the generalizability and adaptation assertions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the manuscript and indicate where revisions will strengthen the empirical presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that training on human-present data elicits spatiotemporal synchronization, yielding, and gesture grounding (absent from robot-only data) and enables rapid adaptation lacks any reported metrics, baseline comparisons, statistical details, or data collection protocols, which are load-bearing for the central empirical claims.

    Authors: The abstract provides a concise summary of the central findings, while the Experiments section (Section 5) reports the supporting quantitative results, including task success rates, synchronization metrics (e.g., temporal alignment errors), yielding behaviors, gesture recognition accuracy, and adaptation performance with comparisons to robot-only baselines. Statistical details and data collection protocols are described in Sections 3 and 4. We agree the abstract would be strengthened by including a few key metrics; the revised version will incorporate concise quantitative highlights and explicit references to the relevant sections. revision: yes

  2. Referee: [Dataset Collection / Experiments] Dataset and Experiments sections: no evidence is supplied that the 60 tasks and collected episodes capture representative human variability in motions, intent signaling, or environmental factors; without this, it is unclear whether observed behaviors arise from human presence per se or from dataset artifacts, directly affecting the generalizability and adaptation assertions.

    Authors: Section 3 details the data collection protocol, which involved multiple human participants across varied demographics, motion styles, and environmental conditions to elicit natural variability in the three interaction roles. The Experiments section compares policies trained on HABIT versus robot-only data, showing the emergence of the target behaviors only in the human-present setting. To further substantiate representativeness, the revision will add explicit quantitative analysis of motion variability (e.g., trajectory variance statistics) and intent signaling diversity. revision: partial

Circularity Check

0 steps flagged

Empirical dataset paper with no derivations or self-referential predictions

full rationale

The paper introduces the HABIT dataset and reports empirical results from training policies on it versus robot-only data. No equations, fitted parameters, or predictions are claimed; the central claims are direct observations from experiments (synchronization, yielding, gesture grounding, adaptation). These are externally falsifiable by replication and do not reduce to inputs by construction. No self-citation load-bearing steps or ansatzes are present. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset paper with no free parameters, axioms, or invented entities; contribution rests on empirical collection and role definitions rather than theoretical constructs.

pith-pipeline@v0.9.1-grok · 5745 in / 937 out tokens · 24407 ms · 2026-07-01T05:29:11.797281+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 19 canonical work pages · 16 internal anchors

  1. [1]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  2. [2]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. Inconference on Robot Learning, 2022

  3. [3]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. InInternational Conference on Robotics and Automation, 2024

  4. [4]

    Open X- Embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X- Embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. InInternational Conference on Robotics and Automation, 2024

  5. [5]

    RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

    Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024

  6. [6]

    RoboNet: Large-Scale Multi-Robot Learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

  7. [7]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  8. [8]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

  9. [9]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen- Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, 2023

  10. [10]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InIEEE/CVF conference on computer vision and pattern recognition, 2022

  11. [11]

    Ego-Exo4D: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-Exo4D: Understanding skilled human activity from first-and third-person perspectives. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  12. [12]

    EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    Ryan Punamiya, Simar Kareer, Zeyi Liu, Josh Citron, Ri-Zhao Qiu, Xiongyi Cai, Alexey Gavryushin, Jiaqi Chen, Davide Liconti, Lawrence Y Zhu, et al. Egoverse: An egocentric human dataset for robot learning from around the world.arXiv preprint arXiv:2604.07607, 2026

  13. [13]

    Egoscale: Scaling dexterous manipulation with diverse egocentric human data

    Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

  14. [14]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 11

  15. [15]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  16. [16]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  17. [17]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  18. [18]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, 2023

  19. [19]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos Policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  20. [20]

    Video Generators are Robot Policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

  21. [21]

    mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

  22. [22]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  23. [23]

    Human–robot collaboration: a survey.Interna- tional Journal of Humanoid Robotics, 5(01):47–66, 2008

    Andrea Bauer, Dirk Wollherr, and Martin Buss. Human–robot collaboration: a survey.Interna- tional Journal of Humanoid Robotics, 5(01):47–66, 2008

  24. [24]

    Theory and evaluation of human robot interactions

    Jean Scholtz. Theory and evaluation of human robot interactions. InHawaii International Conference on System Sciences, 2003

  25. [25]

    A taxonomy to structure and analyze human–robot interac- tion.International Journal of Social Robotics, 13(4):833–849, 2021

    Linda Onnasch and Eileen Roesler. A taxonomy to structure and analyze human–robot interac- tion.International Journal of Social Robotics, 13(4):833–849, 2021

  26. [26]

    How to communicate robot motion intent: A scoping review

    Max Pascher, Uwe Gruenefeld, Stefan Schneegass, and Jens Gerken. How to communicate robot motion intent: A scoping review. InConference on Human Factors in Computing Systems, 2023

  27. [27]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  28. [28]

    Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours

    Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. InInternational conference on robotics and automation, 2016

  29. [29]

    Multiple interactions made easy (mime): Large scale demonstrations data for imitation

    Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhinav Gupta. Multiple interactions made easy (mime): Large scale demonstrations data for imitation. InConference on robot learning, 2018

  30. [30]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. InConference on Robot Learning, 2018

  31. [31]

    Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. InConference on robot learning, 2018. 12

  32. [32]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale.arXiv preprint arXiv:2104.08212, 2021

    Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale.arXiv preprint arXiv:2104.08212, 2021

  33. [33]

    BridgeData V2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen- Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. InConference on Robot Learning, 2023

  34. [34]

    RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation

    Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation.arXiv preprint arXiv:2511.17441, 2025

  35. [35]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. InInternational Conference on Robotics and Automation, 2024

  36. [36]

    Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576, 2025

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576, 2025

  37. [37]

    Harmonic: A multimodal dataset of assistive human–robot collaboration.The International Journal of Robotics Research, 41(1):3–11, 2022

    Benjamin A Newman, Reuben M Aronson, Siddhartha S Srinivasa, Kris Kitani, and Henny Admoni. Harmonic: A multimodal dataset of assistive human–robot collaboration.The International Journal of Robotics Research, 41(1):3–11, 2022

  38. [38]

    place the k-th donut from the left on the tray

    Frederik Plahl, Georgios Katranis, Ilshat Mamaev, and Andrey Morozov. Lihra: A lidar-based hri dataset for automated risk monitoring methods. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2025. 13 Appendix: HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation Figure 10: Robot-side workspace detail. T...

  39. [39]

    Hand the Duster to the robot

  40. [40]

    Lift the objects on a randomly selected tier of the Shelf

  41. [41]

    Once the robot finishes cleaning, lift the objects on the remaining tiers of the Shelf

  42. [42]

    Receive the Duster from the robot. Robot:

  43. [43]

    Pick up the Duster from the human

  44. [44]

    Clean the specific tier of the Shelf with the Duster once objects are removed

  45. [45]

    Clean the remaining tier of the Shelf with the Duster once objects are removed

  46. [46]

    Table Serving (Collaborator)

    Hand the Duster back to the human. Table Serving (Collaborator). Human: 19 Figure 17: Initial configuration and task workflow for Shelf Cleaning

  47. [47]

    The human picks up the top napkin from the stack on the human table and walks to the robot table to stand in front of one of the two trays

  48. [48]

    When the robot lifts the bowl and the cup, the human unfolds the napkin and lays the napkin flat on the tray

  49. [49]

    The human returns to the human table to pick up another napkin and walks to the robot table to stand in front of the tray without a napkin

  50. [50]

    When the robot lifts the bowl and the cup, the human unfolds the napkin and lays the napkin flat on the tray. Robot:

  51. [52]

    Place the Picnic Bowl and Reusable plastic cup back onto the Handle tray in front of the human’s position

  52. [53]

    Pick up the Picnic Bowl and Reusable plastic cup from the Handle tray in front of the human’s position and hold them in the air

  53. [54]

    Waste Sorting (Coworker)

    Place the Picnic Bowl and Reusable plastic cup back onto the Handle tray in front of the human’s position. Waste Sorting (Coworker). Human:

  54. [56]

    The human picks up one piece of trash that is not a can and places the trash into the appropriate organizing basket

  55. [57]

    20 Figure 18: Initial configuration and task workflow for Table Serving

    The human picks up one piece of trash that is not a can and places the trash into the appropriate organizing basket. 20 Figure 18: Initial configuration and task workflow for Table Serving

  56. [58]

    The human picks up one piece of trash that is not a can and places the trash into the appropriate organizing basket. Robot:

  57. [59]

    Pick up the can waste from the table and place it in the right Fabric basket

  58. [60]

    Figure 19: Initial configuration and task workflow for Waste Sorting

    Pick up the can waste from the table and place it in the right Fabric basket. Figure 19: Initial configuration and task workflow for Waste Sorting. Box Packaging (Coworker). Human:

  59. [63]

    Pick up an object on the table and put it in the box. 21

  60. [64]

    Pick up an object on the table and put it in the box

  61. [65]

    Close the lid of the box facing the person. Robot:

  62. [67]

    Pick up a Pencil pouch or Stapler and place it inside the Mailer Box closest to the robot

  63. [68]

    Figure 20: Initial configuration and task workflow for Box Packaging

    Close the lid of the Mailer Box closest to the robot. Figure 20: Initial configuration and task workflow for Box Packaging. Food Storage (Supervisor). Human:

  64. [69]

    A person randomly selects and points to an Airtight Container. Robot:

  65. [70]

    Figure 21: Initial configuration and task workflow for Food Storage

    Place the Butter Roll into the Airtight Container indicated by the human. Figure 21: Initial configuration and task workflow for Food Storage. Donut Serving (Supervisor). Human: 22

  66. [71]

    Points to the third donut from the left from the robot’s perspective. Robot:

  67. [72]

    Figure 22: Initial configuration and task workflow for Donut Serving

    Pick up the Paper togo box containing the Donut indicated by the person and place it on the Handle tray. Figure 22: Initial configuration and task workflow for Donut Serving. C Model Training Details This section provides the fine-tuning configurations for the two open-source VLAs evaluated through- out the paper, namely π0.5 and GR00T N1.6. Within each m...