OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation

Boyuan Zheng; Haodong Zhu; Junming Zhao; Ruiqian Nai; Tong Zhang; Yang Gao; Yihang Hu; Yingdong Hu; Zunhao Chen

arxiv: 2606.22174 · v1 · pith:TDTYHG4Dnew · submitted 2026-06-20 · 💻 cs.RO

Pith reviewed 2026-06-26 11:36 UTC · model grok-4.3

classification 💻 cs.RO

keywords humanoid loco-manipulation · whole-body control · vision-language-action · teleoperation interface · heterogeneous co-training · empirical roadmap · policy generalization

The pith

A phased empirical roadmap yields a whole-body humanoid VLA that outperforms prior models while using less than half the demonstration data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to determine the minimal set of design choices needed to train a vision-language-action model that directly maps pixels and language to every degree of freedom on a humanoid robot. It organizes the inquiry as three sequential phases of controlled experiments: selecting a teleoperation interface that exposes the full kinematic chain, pretraining the model on data from static and wheeled platforms, and co-training on additional humanoid demonstrations collected with a simpler interface. If these choices prove sufficient, the resulting policy can handle long-horizon loco-manipulation tasks that require coordinated upper- and lower-body motion across a wide vertical range, without collecting new whole-body demonstrations for every new object or instruction. A reader would care because the approach replaces the common practice of decoupling upper and lower bodies with a single native controller trained on far less target-specific data.

Core claim

The authors show that a joint-based whole-body teleoperation interface, combined with pretraining on static and wheeled dual-arm data and co-training on humanoid-specific demonstrations, produces a policy that coordinates the entire kinematic chain. In a challenging long-horizon task, this policy exceeds the performance of two existing humanoid VLA baselines while requiring less than half the total demonstration time and without any additional whole-body teleoperation on the evaluation objects and instructions.

What carries the argument

The three-phase empirical roadmap of one-variable-at-a-time experiments that isolates the contribution of teleoperation interface, pretraining sources, and heterogeneous co-training to whole-body policy performance.
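A one-variable-at-a-time design can be sketched as a small config generator: start from a baseline and emit variants that each differ from it in exactly one axis. The axis names and option values below are hypothetical, loosely based on the design choices named in the figure captions; they are not the paper's actual configuration schema.

```python
from typing import Dict, List

# Hypothetical baseline; keys and values are illustrative assumptions,
# not the configuration format used by the OpenHLM code release.
BASELINE: Dict[str, str] = {
    "teleop_interface": "joint",
    "pretraining": "pi0.5",
    "action_targets": "absolute",
    "proprioception": "on",
}

# Hypothetical alternatives for each axis.
ALTERNATIVES: Dict[str, List[str]] = {
    "teleop_interface": ["smpl"],
    "pretraining": ["paligemma", "scratch"],
    "action_targets": ["relative"],
    "proprioception": ["off"],
}


def one_at_a_time(baseline: Dict[str, str],
                  alternatives: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return the baseline followed by every config that flips exactly one axis."""
    configs = [dict(baseline)]
    for axis, options in alternatives.items():
        for option in options:
            variant = dict(baseline)
            variant[axis] = option
            configs.append(variant)
    return configs


def changed_axes(config: Dict[str, str], baseline: Dict[str, str]) -> List[str]:
    """List the axes on which a config differs from the baseline."""
    return [axis for axis in baseline if config[axis] != baseline[axis]]


CONFIGS = one_at_a_time(BASELINE, ALTERNATIVES)
for cfg in CONFIGS:
    print(changed_axes(cfg, BASELINE) or ["baseline"], cfg)
```

Because each variant changes a single axis, any difference in task progress relative to the baseline can be attributed to that axis alone, which is the property the roadmap relies on.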

If this is right

Joint-based teleoperation that exposes all degrees of freedom produces higher-quality demonstrations than interfaces that hide part of the kinematic chain.
A model pretrained on static and wheeled platforms transfers directly to a humanoid's full action space without architecture changes.
Co-training with humanoid data collected via a simpler interface extends the policy to new objects and language instructions without new whole-body demonstrations.
The resulting policy achieves higher success rates than prior humanoid VLAs on long-horizon tasks that require coordinated motion across a wide vertical range.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same phased testing sequence could be applied to test whether other robot morphologies benefit from mixed-platform pretraining.
If the co-training benefit holds under stricter distribution-shift controls, it would lower the data-collection barrier for deploying humanoids in new environments.
The roadmap's emphasis on isolating one variable at a time offers a template for future empirical studies of whole-body control that avoid confounding multiple design changes.

Load-bearing premise

That co-training on data collected from static, wheeled, and humanoid platforms will extend policy performance to new objects and instructions without any further whole-body teleoperation on those targets.

What would settle it

A side-by-side comparison in which the co-trained model shows no improvement, or a decline, in success rate on new objects and instructions relative to a model trained only on the original humanoid demonstrations when total data volume and task difficulty are held constant.
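As a rough illustration of such a side-by-side comparison, the sketch below runs a two-sided two-proportion z-test on success counts from two conditions. The trial counts are made up for the example, and the normal approximation is crude for small numbers of rollouts; this is one possible check, not the evaluation protocol used in the paper.

```python
import math
from typing import Tuple


def two_proportion_z(successes_a: int, trials_a: int,
                     successes_b: int, trials_b: int) -> Tuple[float, float, float]:
    """Two-sided two-proportion z-test.

    Returns (difference in success rate, z statistic, p-value).
    """
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0:
        # Both conditions succeeded (or failed) on every trial.
        return p_a - p_b, 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Hypothetical counts: co-trained model vs. humanoid-only model,
# 20 rollouts each on the same held-out objects and instructions.
diff, z, p = two_proportion_z(16, 20, 8, 20)
print(f"difference={diff:.2f}, z={z:.2f}, p={p:.4f}")
```

A difference near zero or below zero under matched data volume and task difficulty would count against the co-training premise.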

Figures

Figures reproduced from arXiv: 2606.22174 by Boyuan Zheng, Haodong Zhu, Junming Zhao, Ruiqian Nai, Tong Zhang, Yang Gao, Yihang Hu, Yingdong Hu, Zunhao Chen.

**Figure 1.** Overview of OpenHLM. A roadmap of controlled experiments in three phases: (I) we compare teleop interfaces for a low-level whole-body controller and adopt a joint-based interface; (II) we adapt a manipulation VLA to the humanoid’s full action space along several design axes; (III) we extend the policy to new objects and instructions by co-training the full loco-manipulation data with stationary teleop or H…

**Figure 2.** The HLM-12 Benchmark. Tasks fall into four capability families targeting different aspects of whole-body behavior, with one representative per family shown. Full task specifications are in Appendix A. §2 Design Goals and Task Suite. Before launching into the roadmap of §3, we first lay out the design goals the system aims to meet, introduce the HLM-12 benchmark, and describe the evaluation protocol. Design g…

**Figure 3.** Joint-based vs. SMPL-based whole-body teleop. Joint-space training data reaches 88% average task progress against 75% for SMPL. Two failure modes account for most of the gap. On Bottle Disposal, the SMPL-trained policy lifts the heel without sufficiently lifting the toes, leaving inadequate clearance to depress the pedal. On Cola Placement, it occasionally walks too close to the table and knocks the can …

**Figure 4.** Future-frame preview latency sweep. Future-frame preview latency: 0.2 s balances locomotion and manipulation. A whole-body controller trained via motion tracking exposes a tunable preview latency Δt, controlling how far into the future it sees the reference motion. Longer preview yields smoother motion but adds delay between the operator’s command and its enactment. We sweep Δt ∈ {0, 0.2, 0.4, 0.6} s o…

**Figure 5.** VLA design ablations on the 4-task subset. Amber: interface ablations (one choice flipped per bar); drops are minor and no single choice is the bottleneck. Rose: pretraining ablations; robot pretraining (π0.5) dominates, with PaliGemma and from-scratch collapsing sharply. Sage: one-step action generation; both underperform the 10-step baseline by ∼20 points despite lower validation action MSE. humanoid-spe…

**Figure 6.** Whole-body teleop data scaling. At this point we have a humanoid-adapted VLA: a π0.5-initialized backbone with weight-surgery action projection, the pretrained bimanual ordering, absolute joint targets, proprioception as input, and multi-step flow matching inference. Carrying it to the 8 training tasks (40 demonstrations each), the system reaches 89% average task progress …

**Figure 7.** Heterogeneous co-training results. Per-task progress on the 4 held-out tasks and aggregate averages over the 8 training tasks (Tasks 1–8 (avg)), the 3 motion-reuse held-out tasks (Tasks 9–11 (avg)), and all 4 held-out tasks (Tasks 9–12 (avg)). Per-task breakdown for all 12 tasks is in Appendix E.1. Stationary co-training delivers both new motions and new semantic understanding. Stationary teleoperation op…

**Figure 8.** Long-horizon language-conditioned task. Task and data. The humanoid performs a long-horizon language-conditioned task spanning its large vertical workspace.

**Figure 9.** Humanoid robot hardware. The Unitree G1 is equipped with wrist-mounted grippers and onboard cameras. Panel labels: UMI grippers with GoPro cameras; HTC VIVE Ultimate trackers.

**Figure 11.** Whole-body teleoperation scene and HMD snapshot of the same frame. Left: the PICO4U kit provides the HMD, two handheld controllers, and two leg trackers for live teleoperation. Right: an in-headset egocentric camera view streamed to the operator during teleoperation. The fisheye head view is placed in the center, with the 2 wrist views placed on its top-left and top-right corners. C Implementation Detail…

**Figure 12.** Per-task breakdown of heterogeneous co-training results. Task progress for all 12 tasks and three aggregates (rightmost). Four conditions: 8-task baseline, stationary co-training, HuMI co-training, and 12-task oracle. Co-training does not regress training tasks. On held-out tasks, both methods reach near-oracle on motion-reuse tasks (9–11); stationary succeeds on the new-motion task (12), HuMI does not. O…

**Figure 13.** Figure 13 examines how HuMI demonstration count affects performance on the three motion-reuse held-out tasks (Tasks 9–11), with the 8-task whole-body teleop set fixed. Plotted values (x-axis: HuMI demos per task; y-axis: task progress): 5 → 42%, 10 → 67%, 20 → 76%, 40 → 84%.
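The Figure 13 values suggest diminishing returns from additional HuMI demonstrations. The short sketch below computes the gain in task progress per doubling of demonstrations, assuming the four x-axis ticks (5, 10, 20, 40) pair with the four reported percentages (42, 67, 76, 84) in plot order; that pairing is inferred from the extracted plot text rather than stated in the caption.

```python
from typing import Dict, List, Tuple

# HuMI demos per task -> task progress (%) on Tasks 9-11.
# Pairing of ticks and values is an assumption based on plot order.
HUMI_SCALING: Dict[int, int] = {5: 42, 10: 67, 20: 76, 40: 84}


def gains_per_step(curve: Dict[int, int]) -> List[Tuple[int, int, int]]:
    """Return (fewer_demos, more_demos, gain_in_points) for consecutive settings."""
    demos = sorted(curve)
    return [(lo, hi, curve[hi] - curve[lo]) for lo, hi in zip(demos, demos[1:])]


GAINS = gains_per_step(HUMI_SCALING)
for lo, hi, gain in GAINS:
    print(f"{lo:>2} -> {hi:>2} demos per task: +{gain} points")
```

Under that pairing, the first doubling accounts for most of the improvement and each later doubling adds under ten points.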

Original abstract

Whole-body humanoid loco-manipulation requires coordinating the robot's entire kinematic chain. However, most existing systems typically decouple the upper and lower bodies into separate controllers, limiting such coordination and yielding behaviors similar to those of a wheeled dual-arm platform. In this paper, we ask what it takes to build a whole-body native vision-language-action (VLA) model that maps language and pixels directly to all of the humanoid's degrees of freedom. We conduct a systematic empirical study organized as a roadmap of one-variable-at-a-time experiments across three phases: whole-body teleoperation, VLA model design, and heterogeneous co-training. Our study yields several intriguing findings: a joint-based whole-body teleoperation interface outperforms alternatives that only partially expose the humanoid's degrees of freedom; a VLA pretrained on static and wheeled dual-arm platforms transfers surprisingly well to a humanoid's full action space; and co-training with HuMI, the humanoid analog of UMI, extends the policy to new objects and instructions without additional whole-body teleoperation on those targets. Following this roadmap yields OpenHLM, an open-source recipe for whole-body humanoid loco-manipulation. In a challenging long-horizon task that spans a wide vertical range of the humanoid, OpenHLM outperforms two state-of-the-art humanoid VLA baselines (GR00T N1.6 and $\Psi_0$) using less than half the total demonstration time. Our code, training data, and model checkpoints are available at https://openhlm-project.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenHLM delivers a usable open recipe for whole-body humanoid VLA with some plausible empirical findings, but the headline efficiency claim from heterogeneous co-training lacks the controls needed to pin it down.

The paper's core contribution is an empirical roadmap that breaks down whole-body humanoid loco-manipulation into teleoperation choices, VLA architecture, and heterogeneous co-training. It reports that a joint-based interface works better than partial-DoF alternatives, that pretraining on static and wheeled platforms transfers to full humanoid action space, and that adding HuMI data lets the policy handle new objects and instructions without extra whole-body humanoid demos. The end result is OpenHLM, which beats GR00T N1.6 and Ψ0 on a long-horizon vertical task while using less than half the demonstration time, and everything is released with code, data, and checkpoints.

Those findings are worth noting because they come from a deliberate one-variable-at-a-time design rather than a single big model drop. The teleoperation result aligns with intuition about needing full kinematic access, and the cross-platform transfer observation is the kind of practical detail that can save other groups time. Open-sourcing the full stack is the clearest positive here; it turns the work into something others can actually run and extend.

The soft spot is the efficiency claim itself. The abstract ties the performance edge to co-training on static, wheeled, and HuMI data extending coverage without new humanoid teleop on the target objects. Yet the provided text gives no ablations that hold humanoid data fixed while removing the heterogeneous sources, no measures of distribution shift between platforms, and no explicit checks that task difficulty (vertical span, horizon, object variety) was matched across conditions. Without those, the gains could trace to the teleop interface, model size, or evaluation choices instead. The stress-test concern lands because the central selling point rests on that unquantified transfer.

This is for groups already building or benchmarking humanoid VLAs who need a concrete starting recipe. It deserves peer review because the open artifacts make the claims checkable and the empirical questions are directly relevant to deployment costs, even if the current writeup leaves the co-training mechanism under-specified.

Referee Report

1 major / 0 minor

Summary. The paper conducts an empirical study organized as a one-variable-at-a-time roadmap across whole-body teleoperation, VLA model design, and heterogeneous co-training phases. It reports that a joint-based teleoperation interface, pretraining on static/wheeled platforms, and co-training with HuMI data together produce OpenHLM, an open-source whole-body VLA that outperforms GR00T N1.6 and Ψ0 on a long-horizon loco-manipulation task spanning wide vertical range while using less than half the total demonstration time. Code, training data, and checkpoints are released.

Significance. If the central efficiency result holds under controlled conditions, the work would supply a practical, reproducible recipe for scaling whole-body humanoid VLA policies by leveraging heterogeneous data sources, thereby lowering the barrier of whole-body teleoperation for new objects and instructions. The explicit release of code, training data, and model checkpoints is a clear strength that supports direct replication and extension by the community.

major comments (1)

[Abstract] Abstract: the central claim that heterogeneous co-training on static, wheeled, and HuMI data extends policy performance to new objects/instructions without any additional whole-body teleoperation demonstrations underpins the reported <half demonstration-time advantage over GR00T N1.6 and Ψ0. No ablation that removes the co-training data while holding humanoid teleoperation fixed, no distribution-shift metrics (e.g., feature-space distance or held-out source-task success gap), and no explicit task-difficulty matching (vertical range, horizon length, object variability) between co-training and evaluation are described. This omission leaves open whether observed gains arise from the co-training mechanism or from model architecture, teleoperation interface, or evaluation conditions.
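To make the referee's "feature-space distance" suggestion concrete, here is a minimal sketch of one crude distribution-shift measure: the Euclidean distance between the mean feature vectors of two datasets. The feature vectors are assumed to come from some fixed visual or proprioceptive encoder, which is hypothetical here; neither the paper nor the report prescribes this particular metric, and richer choices (for example kernel-based or Fréchet-style distances) would also fit the request.

```python
import math
from typing import List, Sequence


def mean_vector(features: Sequence[Sequence[float]]) -> List[float]:
    """Per-dimension mean of a non-empty list of equal-length feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(row[d] for row in features) / n for d in range(dim)]


def mean_embedding_distance(source: Sequence[Sequence[float]],
                            target: Sequence[Sequence[float]]) -> float:
    """Euclidean distance between the mean embeddings of two datasets."""
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))


# Toy 2-D features standing in for encoder outputs (made-up values).
stationary_features = [[0.0, 0.0], [2.0, 0.0]]   # e.g. static or wheeled platform data
humanoid_features = [[4.0, 4.0], [4.0, 4.0]]     # e.g. whole-body humanoid data

shift = mean_embedding_distance(stationary_features, humanoid_features)
print(f"mean-embedding distance: {shift:.3f}")
```

Reporting such a number per data source, alongside held-out task progress, would show whether co-training gains shrink as the source data drifts further from the humanoid evaluation distribution.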

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide a point-by-point response to the major comment below.

Referee: [Abstract] Abstract: the central claim that heterogeneous co-training on static, wheeled, and HuMI data extends policy performance to new objects/instructions without any additional whole-body teleoperation demonstrations underpins the reported <half demonstration-time advantage over GR00T N1.6 and Ψ0. No ablation that removes the co-training data while holding humanoid teleoperation fixed, no distribution-shift metrics (e.g., feature-space distance or held-out source-task success gap), and no explicit task-difficulty matching (vertical range, horizon length, object variability) between co-training and evaluation are described. This omission leaves open whether observed gains arise from the co-training mechanism or from model architecture, teleoperation interface, or evaluation conditions.

Authors: We agree that the manuscript would benefit from a more explicit isolation of the heterogeneous co-training effect. Our empirical study is organized as a one-variable-at-a-time roadmap, with the VLA model design phase incorporating pretraining on static and wheeled platforms, followed by the co-training phase with HuMI data. The performance advantage is demonstrated relative to GR00T N1.6 and Ψ0 baselines. However, we did not include an ablation that trains a model using only the humanoid teleoperation demonstrations without the heterogeneous data sources, nor did we report distribution-shift metrics or detailed task-difficulty matching. We will incorporate these analyses in the revised version to more rigorously support the contribution of the co-training mechanism.

Revision: yes

Circularity Check

0 steps flagged

Empirical study with no derivations or fitted predictions; results are direct experimental outcomes.

The paper conducts a systematic empirical study organized as one-variable-at-a-time experiments across teleoperation, VLA model design, and heterogeneous co-training phases. All reported findings, including the outperformance of OpenHLM over baselines with less demonstration time, are presented as direct results from data collection and training runs. No equations, parameter fits, or derivations appear that could reduce a claimed prediction to its inputs by construction. Self-citations, if present, are not load-bearing for any central result, and the work is self-contained against external benchmarks without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical robotics paper with no mathematical derivations, free parameters, or axioms in the abstract; HuMI is referenced as an analog dataset but no new physical entities are postulated.

pith-pipeline@v0.9.1-grok · 5834 in / 1270 out tokens · 30032 ms · 2026-06-26T11:36:34.382488+00:00 · methodology



2026

[11] [11]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[12] [12]

T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024

arXiv 2024

[13] [13]

Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Pith/arXiv arXiv 2025

[14] [14]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[15] [15]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[16] [16]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots.arXiv preprint arXiv:2402.10329, 2024

Pith/arXiv arXiv 2024

[17] [17]

M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohu- manoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026

Pith/arXiv arXiv 2026

[18] [18]

Decoupled wbc.https://nvlabs.github.io/ GR00T-WholeBodyControl/references/decoupled_wbc.html, 2026

NVIDIA GEAR Team. Decoupled wbc.https://nvlabs.github.io/ GR00T-WholeBodyControl/references/decoupled_wbc.html, 2026. GR00T- WholeBodyControl Documentation. Last updated: 2026-05-07. Accessed: 2026-05-14

2026

[19] [19]

PICO Immersive Pte. Ltd. PICO 4 Ultra.https://www.picoxr.com/global/products/ pico4-ultra, 2024. Product webpage. Accessed: 2026-05-14

2024

[20] [20]

J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025

arXiv 2025

[21] [21]

, title =

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi- person linear model.ACM Transactions on Graphics (TOG), 34(6), Oct. 2015. doi:10.1145/ 2816795.2818013. URLhttps://doi.org/10.1145/2816795.2818013. 13

work page doi:10.1145/2816795.2818013 2015

[22] [22]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[23] [23]

Beyer, A

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024

[24] [24]

Lipman, R

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022

[25] [25]

M. Deng, H. Li, T. Li, Y . Du, and K. He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026

Pith/arXiv arXiv 2026

[26] [26]

VIVE Ultimate Tracker.https://www.vive.com/eu/accessory/ vive-ultimate-tracker/

HTC VIVE. VIVE Ultimate Tracker.https://www.vive.com/eu/accessory/ vive-ultimate-tracker/. Accessed: 2026-05-16

2026

[27] [27]

Cosmos-reason2: Physical ai common sense and embodied reasoning models

NVIDIA. Cosmos-reason2: Physical ai common sense and embodied reasoning models. https://huggingface.co/nvidia/Cosmos-Reason2-8B, 2025. Accessed: 2026-05-17

2025

[28] [28]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[29] [29]

Y . Ze, Z. Chen, J. P. Ara´ujo, Z.-a. Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025

arXiv 2025

[30] [30]

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024

arXiv 2024

[31] [31]

Zhang, B

T. Zhang, B. Zheng, R. Nai, Y . Hu, Y .-J. Wang, G. Chen, F. Lin, J. Li, C. Hong, K. Sreenath, and Y . Gao. Hub: Learning extreme humanoid balance.arXiv preprint arXiv:2505.07294, 2025

arXiv 2025

[32] [32]

Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion.arXiv preprint arXiv:2508.08241, 2025

Pith/arXiv arXiv 2025

[33] [33]

W. Zeng, S. Lu, K. Yin, X. Niu, M. Dai, J. Wang, and J. Pang. Behavior foundation model for humanoid robots.arXiv preprint arXiv:2509.13780, 2025

arXiv 2025

[34] [34]

Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. Gmt: General motion tracking for humanoid whole-body control.arXiv preprint arXiv:2506.14770, 2025

arXiv 2025

[35] [35]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[36] [36]

Intelligence, A

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al.π ∗ 0.6: A VLA That Learns From Experience. arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025

[37] [37]

Intelligence, A

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, et al.π 0.7: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities.arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026

[38] [38]

Hoque, P

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

Pith/arXiv arXiv 2025

[39] [39]

K. Liu, C. Guan, Z. Jia, Z. Wu, X. Liu, T. Wang, S. Liang, P. Chen, P. Zhang, H. Song, et al. Fastumi: A scalable and hardware-independent universal manipulation interface with dataset. arXiv preprint arXiv:2409.19499, 2024. 14

arXiv 2024

[40] [40]

F. Lin, Y . Hu, P. Sheng, C. Wen, J. You, and Y . Gao. Data scaling laws in imitation learning for robotic manipulation. InInternational Conference on Learning Representations, volume 2025, pages 54877–54910, 2025

2025

[41] [41]

Engel, K

J. Engel, K. Somasundaram, M. Goesele, A. Sun, A. Gamino, A. Turner, A. Talattof, A. Yuan, B. Souti, B. Meredith, et al. Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561, 2023

Pith/arXiv arXiv 2023

[42] [42]

Apple vision pro technical specifications.https://www.apple.com/sg/ apple-vision-pro/specs/, 2026

Apple Inc. Apple vision pro technical specifications.https://www.apple.com/sg/ apple-vision-pro/specs/, 2026. Accessed: 2026-05-18

2026

[43] [43]

Meta quest 3.https://www.meta.com/quest/quest-3/, 2026

Meta Platforms, Inc. Meta quest 3.https://www.meta.com/quest/quest-3/, 2026. Ac- cessed: 2026-05-18

2026

[44] [44]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024

2024

[45] [45]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

2022

[46] [46]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Pith/arXiv arXiv 2022

[47] [47]

Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022

Pith/arXiv arXiv 2022

[48] [48]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

Pith/arXiv arXiv 2025

[49] [49]

S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, et al. Latent action pretraining from videos. InInternational Conference on Learning Representations, volume 2025, pages 28213–28239, 2025

2025

[50] [50]

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

Pith/arXiv arXiv 2025

[51] [51]

C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies.arXiv preprint arXiv:2509.17759, 2025

arXiv 2025

[52] [52]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy˜ human policy.arXiv preprint arXiv:2503.13441, 2025

arXiv 2025

[53] [53]

CTAG2F90-D Electric Parallel Gripper

ChangingTek Robotics Technology (Suzhou) Co., Ltd. CTAG2F90-D Electric Parallel Gripper. https://en.changingtek.com/diandong/147. Accessed: 2026-05-29

2026

[54] [54]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 15 Appendices A The HLM-12 Benchmark 17 A.1 The 12 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2 Overall Evaluation Protocol . . . . . . . . . . . . . . . . ...

Pith/arXiv arXiv 2025

[55] [55]

Four conditions: 8-task baseline, stationary co-training, HuMI co-training, and 12-task oracle

Pouring Tasks 1 8 (avg) Tasks 9 11 (avg) Tasks 9 12 (avg) 0 25 50 75 100T ask Progress (%) 100 93 87 87 93 87 80 93 85 100 95 85 87 100100100 92 92 72 88 85 70 95 95 90 75 85 80 80 80 80 67 20 100 80 100 47 93 100100 40 73 73 87 25 80 15 90 89 87 87 87 36 89 84 96 33 87 67 94 8-task baseline Stationary co-training HuMI co-training 12-task oracle Figure 12...