Recognition: unknown
HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation
Pith reviewed 2026-05-10 17:36 UTC · model grok-4.3
The pith
HEX achieves state-of-the-art whole-body manipulation on humanoid robots by aligning states across embodiments and modeling coordination with a mixture-of-experts predictor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HEX is a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. It introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Real-world experiments on humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
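The claim names a flow-matching action head without giving its objective. As a hedge, the standard conditional flow-matching loss such a head typically optimizes, written in our own notation rather than anything taken from the paper, is:

```latex
% a_0 ~ N(0, I) is noise, a_1 the demonstrated action chunk,
% t ~ U[0, 1], and a_t = (1 - t) a_0 + t a_1 their interpolation.
% The head v_theta regresses the constant target velocity (a_1 - a_0),
% conditioned on the fused visual-language/proprioceptive context c:
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t,\,a_0,\,a_1}
    \bigl\| v_\theta(a_t,\, t \mid c) - (a_1 - a_0) \bigr\|_2^2
```

At inference such a head integrates the learned velocity field from t = 0 to t = 1, starting from Gaussian noise, to produce an action chunk.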
What carries the argument
A humanoid-aligned universal state representation that normalizes data from varied robot morphologies into one common format, paired with a Mixture-of-Experts Unified Proprioceptive Predictor that learns whole-body coordination and temporal dynamics from pooled multi-embodiment trajectories.
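The review describes what the representation does but not how. A minimal sketch, assuming a fixed canonical joint layout with per-embodiment validity masks (an illustration of the general idea, not the paper's actual scheme), follows:

```python
# Hypothetical humanoid-aligned state alignment: every embodiment's
# proprioception is scattered into a shared canonical slot layout, and a
# mask records which slots that embodiment actually populates. The slot
# list, names, and shapes are assumptions for illustration only.
import torch

CANONICAL_JOINTS = [
    "torso_yaw", "l_shoulder", "l_elbow", "l_wrist",
    "r_shoulder", "r_elbow", "r_wrist",
    "l_hip", "l_knee", "l_ankle", "r_hip", "r_knee", "r_ankle",
]

def align_state(joint_values: dict[str, float]) -> tuple[torch.Tensor, torch.Tensor]:
    """Map one embodiment's named joint readings into the canonical vector."""
    state = torch.zeros(len(CANONICAL_JOINTS))
    mask = torch.zeros(len(CANONICAL_JOINTS))
    for i, name in enumerate(CANONICAL_JOINTS):
        if name in joint_values:
            state[i] = joint_values[name]
            mask[i] = 1.0
    return state, mask

# Example: a single-arm embodiment fills only the right-arm slots, yet
# its trajectories land in the same format as a full humanoid's.
state, mask = align_state({"r_shoulder": 0.3, "r_elbow": -1.2, "r_wrist": 0.05})
```

Under this reading, the MoE predictor would consume such aligned (state, mask) sequences pooled across embodiments.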
Load-bearing premise
The humanoid-aligned universal state representation combined with the Mixture-of-Experts Unified Proprioceptive Predictor can reliably capture and generalize whole-body coordination and temporal dynamics across heterogeneous embodiments from large-scale multi-embodiment trajectory data.
What would settle it
Running HEX on a new humanoid embodiment whose limb lengths, mass distribution, or joint dynamics differ markedly from the training set and observing loss of whole-body coordination or task failure in long-horizon trials would falsify the generalization claim.
Original abstract
Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.
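The history-token mechanism (summarizing past observations into a few persistent tokens so historical images are never re-encoded) could plausibly be realized as a recurrent cross-attention update. The sketch below is an assumption-laden illustration; the token count, dimensions, and residual update rule are ours, not the paper's:

```python
# Sketch of lightweight history tokens: a small set of summary tokens is
# carried across timesteps and updated by cross-attending to the current
# frame's features, so past frames never need re-encoding at inference.
import torch
import torch.nn as nn

class HistoryTokens(nn.Module):
    def __init__(self, n_tokens: int = 8, dim: int = 256):
        super().__init__()
        self.init_tokens = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, hist: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # hist: (B, n_tokens, dim) running summary; frame_feats: (B, P, dim)
        upd, _ = self.attn(query=hist, key=frame_feats, value=frame_feats)
        return hist + upd  # residual update; per-step cost independent of history length

summarizer = HistoryTokens()
hist = summarizer.init_tokens.expand(2, -1, -1)   # batch of 2 episodes
for _ in range(5):                                # five control steps
    frame = torch.randn(2, 196, 256)              # current frame's patch features
    hist = summarizer(hist, frame)
```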
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HEX, a state-centric framework for coordinated whole-body manipulation on full-sized bipedal humanoid robots. It introduces a humanoid-aligned universal state representation for cross-embodiment scalability, a Mixture-of-Experts Unified Proprioceptive Predictor to model coordination and temporal dynamics from multi-embodiment trajectory data, lightweight history tokens for visual context, a residual-gated fusion mechanism, and a flow-matching action head. The central empirical claim is that HEX achieves state-of-the-art task success rates and generalization on real-world humanoid tasks, especially in fast-reaction and long-horizon scenarios.
Significance. If the quantitative results hold with proper controls, the work could advance cross-embodiment VLA models by tackling whole-body coordination and stability in high-DoF humanoids, a known challenge. The universal state representation and MoE proprioceptive predictor represent a concrete architectural approach to leveraging heterogeneous data without per-embodiment retraining.
Major comments (2)
- [Abstract and Experiments] The manuscript asserts SOTA performance in task success rate and generalization on real robots but supplies no quantitative metrics (e.g., success percentages), baseline comparisons, trial counts, data splits, statistical tests, or failure-mode analysis. This is load-bearing for the central claim that the humanoid-aligned state representation plus MoE predictor produce measurable gains in cross-embodiment whole-body coordination and long-horizon stability. (A sketch of the statistical reporting this calls for follows after this list.)
- [Method] Description of the Mixture-of-Experts Unified Proprioceptive Predictor: No ablation results or analysis are referenced showing that the MoE avoids mode collapse on heterogeneous dynamics or that the universal state representation transfers without embodiment-specific retraining, evidence that would directly support the generalization claims.
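On the statistics point raised in the first comment, one concrete hedge is a Wilson score interval over the trials of each task; this is a standard choice for small-n binomial success rates, not something the manuscript specifies:

```python
# Wilson score interval for a task success rate over n independent
# trials; a standard way to report real-robot results with few trials.
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% (z = 1.96) Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# e.g. 17 successes in 20 trials -> roughly (0.64, 0.95): wide enough
# that SOTA claims need baselines measured under the same protocol.
print(wilson_interval(17, 20))
```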
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the empirical presentation and analysis without altering the core contributions.
Point-by-point responses
- Referee: [Abstract and Experiments] The manuscript asserts SOTA performance in task success rate and generalization on real robots but supplies no quantitative metrics (e.g., success percentages), baseline comparisons, trial counts, data splits, statistical tests, or failure-mode analysis. This is load-bearing for the central claim that the humanoid-aligned state representation plus MoE predictor produce measurable gains in cross-embodiment whole-body coordination and long-horizon stability.
  Authors: We agree that the submitted manuscript presents the SOTA claim in the abstract and experiments section without sufficient quantitative detail. We will revise the abstract to include key success-rate metrics and expand the experiments section with explicit tables reporting success percentages, baseline comparisons, trial counts, data splits, statistical tests, and failure-mode analysis. This will make the evidence for the benefits of the humanoid-aligned state representation and MoE predictor fully transparent and verifiable. Revision: yes.
- Referee: [Method] Description of the Mixture-of-Experts Unified Proprioceptive Predictor: No ablation results or analysis are referenced showing that the MoE avoids mode collapse on heterogeneous dynamics or that the universal state representation transfers without embodiment-specific retraining, evidence that would directly support the generalization claims.
  Authors: We acknowledge that the initial submission did not include explicit ablations or transfer analysis for the MoE and universal state representation. The method section describes their design for handling heterogeneous multi-embodiment data and cross-embodiment scalability, with generalization supported by the overall experimental outcomes. In revision, we will add ablation studies comparing the MoE to non-MoE baselines to demonstrate avoidance of mode collapse, along with analysis and results confirming transfer of the state representation across embodiments without per-embodiment retraining. Revision: yes.
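For readers wondering what the promised MoE ablation would probe: a common guard against expert (mode) collapse is a load-balancing auxiliary loss on the gate, as in the minimal top-1 routing sketch below. Layer sizes, the routing rule, and the loss form are illustrative assumptions, not the paper's design:

```python
# Top-1 mixture-of-experts layer with a Switch-Transformer-style
# load-balancing loss; without the auxiliary term, the gate can route
# every token (every embodiment's dynamics) to a single expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (N, dim) flattened proprioceptive tokens from mixed embodiments
        probs = F.softmax(self.gate(x), dim=-1)   # (N, E) routing weights
        top = probs.argmax(dim=-1)                # hard top-1 expert choice
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top == e
            if sel.any():
                # Scale by the gate probability so the gate gets gradient.
                out[sel] = probs[sel, e].unsqueeze(-1) * expert(x[sel])
        # Balance loss: fraction of tokens routed to each expert times its
        # mean gate probability, summed; minimized by uniform routing.
        frac = F.one_hot(top, probs.size(-1)).float().mean(dim=0)
        balance = probs.size(-1) * (frac * probs.mean(dim=0)).sum()
        return out, balance

layer = MoELayer()
y, aux = layer(torch.randn(32, 256))  # train on task_loss + 0.01 * aux (weight assumed)
```

An ablation would compare this against a dense single-expert predictor and track per-expert routing entropy across embodiments.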
Circularity Check
No circularity in derivation chain; claims rest on empirical evaluation
Full rationale
The paper presents HEX as an architectural framework for humanoid manipulation, introducing components such as a humanoid-aligned universal state representation, Mixture-of-Experts Unified Proprioceptive Predictor, lightweight history tokens, and residual-gated fusion with flow-matching action head. These are motivated by design choices for cross-embodiment coordination and temporal dynamics, trained on multi-embodiment trajectory data. The central claims concern state-of-the-art task success rates and generalization in real-world experiments. No equations, first-principles derivations, or predictions are described that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance assertions are tied to experimental outcomes rather than analytical steps that could be tautological. This is a standard empirical ML robotics paper with no load-bearing circular reductions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Large-scale multi-embodiment trajectory data contains sufficient signal to learn generalizable whole-body coordination and temporal dynamics.
Invented entities (2)
- Humanoid-aligned universal state representation (no independent evidence)
- Mixture-of-Experts Unified Proprioceptive Predictor (no independent evidence)
Forward citations
Cited by 2 Pith papers
- CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
  Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...
- RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
  RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.