SceneBot: Contact-Prompted General Humanoid Whole Body Tracking with Scene-Interaction

C. Karen Liu; Guanya Shi; Jiaman Li; Shibo Zhao; Sirui Chen; Zhen Wu

arxiv: 2606.27581 · v1 · pith:ZK52CHAAnew · submitted 2026-06-25 · 💻 cs.RO · cs.AI

Sirui Chen, Shibo Zhao, Zhen Wu, Jiaman Li, Guanya Shi, C. Karen Liu

Pith reviewed 2026-06-29 01:17 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords humanoid control · reinforcement learning · whole-body tracking · contact-rich tasks · motion retargeting · scene interaction · policy conditioning

The pith

SceneBot unifies free-space and contact-rich humanoid tracking by conditioning one policy on reference motions plus per-link contact labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build a single reinforcement-learning policy that lets humanoid robots perform both free-space locomotion and contact-rich tasks, such as object manipulation or terrain traversal, within one framework. It adds explicit per-link contact labels to the policy input so the controller knows which body parts should touch the environment at each moment. Because annotated interaction data is lacking, the authors reconstruct scenes after the fact from ordinary human motion captures and infer the required contact labels. A sympathetic reader would care because separate policies for different behaviors have so far blocked long-horizon tasks that mix walking, climbing, and carrying. If the approach holds, a humanoid could switch between free motion and contact behaviors without changing controllers or retraining.

Core claim

SceneBot trains one policy on reference motions and per-link contact labels obtained by hindsight scene reconstruction from retargeted human motion. The resulting policy handles free-space locomotion, uneven terrain, and whole-body manipulation, generalizes to motions and scenes outside the training set, and completes long-horizon tasks such as carrying a box upstairs. The work therefore presents contact conditioning as a practical interface for resolving physical ambiguities that pure kinematic tracking cannot address.

What carries the argument

Per-link contact conditioning: explicit labels of the expected interactions are supplied to a single policy so that it can resolve contact ambiguities across locomotion and manipulation.
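
To make the conditioning idea concrete, here is a minimal sketch of how a policy input could combine proprioception, reference-motion targets, and per-link contact labels. The function name `build_observation`, the array sizes, and the choice of binary labels are all assumptions made for illustration; the abstract does not describe the actual observation layout.

```python
import numpy as np

def build_observation(proprio, ref_motion, contact_labels):
    """Concatenate proprioception, reference targets, and per-link contact labels.

    Hypothetical layout: the paper's real observation space is not given here.
    """
    proprio = np.asarray(proprio, dtype=np.float32).ravel()
    ref_motion = np.asarray(ref_motion, dtype=np.float32).ravel()
    contact = np.asarray(contact_labels, dtype=np.float32).ravel()
    # Assumption: one binary flag per link (1 = expected to be in contact).
    if not np.all((contact == 0.0) | (contact == 1.0)):
        raise ValueError("contact labels must be 0 or 1")
    return np.concatenate([proprio, ref_motion, contact])

NUM_LINKS = 4                      # hypothetical: left foot, right foot, left hand, right hand
proprio = np.zeros(6)              # placeholder proprioceptive state
ref = np.ones((2, 3))              # two future reference frames, three values each

# Same reference motion, two different contact prompts.
obs_free = build_observation(proprio, ref, np.zeros(NUM_LINKS))
obs_carry = build_observation(proprio, ref, np.array([1, 1, 1, 1]))
```

The two observations share the same kinematic target and differ only in the trailing contact flags, which is the ambiguity the labels are meant to resolve.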

If this is right

A single policy executes both free-space and contact-rich behaviors without controller switching.
Training on 7.5 hours of reconstructed contact-rich data suffices for generalization to unseen motions and environments.
Contact conditioning provides a reusable interface that extends kinematic tracking to scene-interacting tasks.
Complex long-horizon sequences such as carrying objects upstairs become feasible within one learned controller.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contact-label interface could be tested on other robot morphologies or multi-robot coordination tasks.
If reconstruction errors accumulate on highly deformable objects, the method would require additional sensing or online label correction.
Real-robot transfer would need to verify that simulated contact labels remain valid under actuator noise and model mismatch.

Load-bearing premise

The hindsight scene reconstruction step produces sufficiently accurate per-link contact labels from retargeted human motion without introducing systematic errors that would prevent policy generalization.
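
As a rough illustration of what inferring contact labels from kinematic motion can involve, the sketch below marks a link as in contact when it stays nearly stationary for several consecutive frames. This heuristic, the function name `infer_contact_labels`, and the thresholds are assumptions chosen for the example; the paper's actual reconstruction procedure is not described in the material reviewed here.

```python
import numpy as np

def infer_contact_labels(link_pos, dt, speed_thresh=0.05, min_frames=3):
    """Label a link as in contact while it is nearly stationary.

    link_pos: array of shape (T, L, 3) with world-frame link positions.
    Returns a boolean array of shape (T, L).
    Assumption: stationarity is used as a stand-in for contact.
    """
    link_pos = np.asarray(link_pos, dtype=np.float64)
    vel = np.diff(link_pos, axis=0) / dt            # (T-1, L, 3)
    speed = np.linalg.norm(vel, axis=-1)            # (T-1, L)
    speed = np.vstack([speed[:1], speed])           # repeat first row to get (T, L)
    slow = speed < speed_thresh
    labels = np.zeros_like(slow)
    num_frames, num_links = slow.shape
    for link in range(num_links):
        start = None
        for t in range(num_frames + 1):
            active = t < num_frames and slow[t, link]
            if active and start is None:
                start = t                           # a slow run begins
            elif not active and start is not None:
                if t - start >= min_frames:         # keep only runs that last long enough
                    labels[start:t, link] = True
                start = None
    return labels

# One link that moves, pauses for five frames, then moves again.
x = np.array([0.0, 0.1, 0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.4, 0.5])
positions = np.zeros((10, 1, 3))
positions[:, 0, 0] = x
labels = infer_contact_labels(positions, dt=0.1)
```

Any systematic bias in a rule like this (for example, a threshold that is too loose for slow hand motions) would propagate directly into the labels the policy is conditioned on, which is why the premise is load-bearing.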

What would settle it

Run the trained policy on a new scene where the reconstructed contact labels disagree with the actual geometry and physics; if the policy produces unstable or incorrect contacts while a version trained with ground-truth labels succeeds, the reconstruction assumption fails.

Figures

Figures reproduced from arXiv: 2606.27581 by C. Karen Liu, Guanya Shi, Jiaman Li, Shibo Zhao, Sirui Chen, Zhen Wu.

**Figure 2.** Scene reconstruction: SceneBot uses retargeted human motion to reconstruct scene assets. It first builds the robot-scene interaction graph, then reconstructs plausible terrains and objects. Training: SceneBot trains a motion and contact tracking policy via reinforcement learning using contact-based rewards. Deployment: SceneBot relies on SuperOdometry [35] and an onboard IMU to estimate root position x_root…

**Figure 3.** Scene interaction graph for different in…

**Figure 4.** (a): Our scene reconstruction method can generate complex scenes that match the retar…

**Figure 5.** Composition of training data. Adjacent text (Section 4.2, Tracking Performance for Different Motions): Qualitatively, our method successfully executes behaviors such as stepping onto stairs of varying heights, picking up and carrying boxes, sitting down, and performing agile kicking and running, as demonstrated in the supplementary video. Additionally, our approach manages long-horizon, simultaneous object and terrain interact…

**Figure 6.** Drift in local tracking causes motion-terrain misalignment. Adjacent text: Quantitatively, we evaluate tracking performance using the average global root tracking error, average joint tracking error, and success rates across four task categories in a MuJoCo sim-to-sim environment: free-space, terrain interaction, object interaction, sitting. We compare our method against the state-of-the-art general motion tracking p…

**Figure 7.** Root position error on the terrain task across different training steps. As illustrated in…

**Figure 8.** Left: Omni-retarget creates an object-hand mismatch. Right: Tracking performance comparison between scene asset reconstruction and scene-aware retargeting. Adjacent text: Qualitatively, our pipeline can reconstruct complicated terrain from Lafan obstacle sequences (…

**Figure 9.** Left: State estimation compared against mo…

**Figure 10.** Detailed breakdown of the state estimation results for the task of grasping a box and…
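
The Figure 6 excerpt names three evaluation quantities: average global root tracking error, average joint tracking error, and success rate. A minimal sketch of how such metrics could be computed from a rollout is shown below. The function names, the array shapes, and in particular the success criterion (root deviation staying under a fixed distance) are assumptions, since the excerpt does not define them.

```python
import numpy as np

def tracking_metrics(root_pos, ref_root_pos, joint_pos, ref_joint_pos, fail_dist=0.5):
    """Per-episode tracking metrics.

    root_pos, ref_root_pos: (T, 3) global root positions in metres.
    joint_pos, ref_joint_pos: (T, J) joint angles in radians.
    fail_dist: hypothetical failure threshold on root deviation, in metres.
    """
    root_err = np.linalg.norm(np.asarray(root_pos) - np.asarray(ref_root_pos), axis=-1)
    joint_err = np.abs(np.asarray(joint_pos) - np.asarray(ref_joint_pos)).mean(axis=-1)
    return {
        "root_error_m": float(root_err.mean()),
        "joint_error_rad": float(joint_err.mean()),
        # Assumption: an episode succeeds if the root never drifts past fail_dist.
        "success": bool(root_err.max() < fail_dist),
    }

def success_rate(episode_metrics):
    """Fraction of episodes flagged as successful."""
    return float(np.mean([m["success"] for m in episode_metrics]))

steps, joints = 5, 2
zeros_root = np.zeros((steps, 3))
zeros_joint = np.zeros((steps, joints))

near_ref = zeros_root.copy(); near_ref[:, 0] = 0.1   # 10 cm constant offset
far_ref = zeros_root.copy(); far_ref[:, 0] = 1.0     # 1 m constant offset

ep_good = tracking_metrics(zeros_root, near_ref, zeros_joint, zeros_joint + 0.2)
ep_bad = tracking_metrics(zeros_root, far_ref, zeros_joint, zeros_joint + 0.2)
rate = success_rate([ep_good, ep_bad])
```

Reporting these per task category (free-space, terrain interaction, object interaction, sitting) would mirror the breakdown the excerpt describes.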

Original abstract

Current humanoid reinforcement-learning policies excel at free-space motions but struggle with contact-rich tasks, as pure kinematic tracking cannot resolve the physical ambiguities of interacting with objects and uneven terrain. To address this, we introduce SceneBot, a unified motion-tracking framework capable of handling free-space locomotion, terrain traversal, and whole-body manipulation. SceneBot conditions a single policy on both reference motions and per-link contact labels, explicitly defining expected environmental interactions. To overcome the lack of annotated interaction data, we propose a hindsight scene reconstruction approach that infers scene-interaction graphs from retargeted human motion. Trained on 7.5 hours of this reconstructed, contact-rich data, SceneBot successfully generalizes to unseen motions and environments. Our results demonstrate that SceneBot is the first general framework to seamlessly unify free-space and contact-rich behaviors, executing complex, long-horizon tasks like carrying a box upstairs, and establishing contact conditioning as a powerful interface for humanoid control. All code and data will be open-sourced. More demos and information are available at: https://ericcsr.github.io/scenebot/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Contact prompting via hindsight reconstruction offers a promising interface for humanoid control, but the paper's claims rest on unverified label accuracy.

The punchline is that SceneBot conditions a humanoid policy on per-link contact labels and uses hindsight scene reconstruction to create contact-rich training data from mocap. This addresses a real gap in current tracking methods that handle free-space well but struggle with interactions.

What the paper does well is identify the limitation of pure kinematic tracking and propose contact labels as an explicit interface for expected interactions. Generating data via hindsight reconstruction from retargeted human motion is a practical workaround for the lack of annotated scenes. The plan to open-source everything is helpful.

The soft spots are in the evidence. The abstract claims successful generalization to unseen motions and environments, including complex tasks like carrying a box upstairs, but provides no error metrics, ablation studies, or validation of the contact labels against actual dynamics. The hindsight step could introduce biases from retargeting errors, and without checks on label fidelity, it's unclear if the policy learns true contact physics or reconstruction artifacts. This is the same concern raised above as the load-bearing premise.

This paper is aimed at the humanoid control and robotics RL community. Readers working on contact-rich behaviors would find the conditioning approach worth considering. It deserves a serious referee because the core idea has potential even if the current results are preliminary.

I recommend sending it to peer review rather than desk rejecting it. Reviewers can push for the missing quantitative details on data quality and generalization.

Referee Report

2 major / 1 minor

Summary. The paper introduces SceneBot, a unified RL-based motion-tracking framework for humanoids that conditions a single policy on reference motions plus per-link contact labels. These labels are obtained via a hindsight scene-reconstruction pipeline applied to retargeted human motion; the resulting 7.5-hour dataset is used to train the policy, which is claimed to generalize to unseen motions and environments while seamlessly handling both free-space locomotion and contact-rich whole-body tasks such as carrying a box upstairs.

Significance. If the central claims hold, the work would be significant for humanoid control by demonstrating that explicit contact conditioning can serve as a general interface bridging free-space and interaction behaviors, with the open release of code and data providing a concrete resource for the community.

major comments (2)

[Abstract / Methods (hindsight reconstruction pipeline)] The unification claim and generalization to long-horizon contact-rich tasks rest on the assumption that hindsight scene reconstruction produces sufficiently accurate per-link contact labels. No quantitative validation of label fidelity (e.g., precision/recall of contact timing and location against physics simulation or ground-truth scenes) is reported, so systematic retargeting biases could cause the policy to overfit to reconstruction artifacts rather than true dynamics.
[Abstract / Results] The abstract asserts successful generalization to unseen motions and environments and to complex tasks, yet supplies no quantitative results, ablation studies, tracking-error metrics, or success rates. Without these data the evidence supporting the central claim cannot be evaluated.

minor comments (1)

[Abstract] The manuscript should include a clear description of the policy architecture, observation space, and reward terms in the main text rather than relying solely on the project page.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below, indicating planned revisions where the manuscript can be strengthened.

Referee: [Abstract / Methods (hindsight reconstruction pipeline)] The unification claim and generalization to long-horizon contact-rich tasks rest on the assumption that hindsight scene reconstruction produces sufficiently accurate per-link contact labels. No quantitative validation of label fidelity (e.g., precision/recall of contact timing and location against physics simulation or ground-truth scenes) is reported, so systematic retargeting biases could cause the policy to overfit to reconstruction artifacts rather than true dynamics.

Authors: We agree this is a valid point and that explicit validation of label quality would strengthen the paper. In the revision we will add a dedicated analysis (new subsection in Methods or Experiments) that reports precision, recall, and timing error for contact labels on a held-out set of motions, obtained by comparing the reconstructed labels against forward simulation in the target scenes. This will directly address potential retargeting biases.

Revision: yes

Referee: [Abstract / Results] The abstract asserts successful generalization to unseen motions and environments and to complex tasks, yet supplies no quantitative results, ablation studies, tracking-error metrics, or success rates. Without these data the evidence supporting the central claim cannot be evaluated.

Authors: The full results section already presents quantitative tracking-error curves, per-task success rates (including the box-carrying example), and ablations on contact conditioning versus baselines. To make the strength of evidence immediately visible, we will revise the abstract to include concise numerical highlights drawn from those results (e.g., mean tracking error and success rate ranges).

Revision: partial
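
The label-fidelity analysis discussed in the first exchange (precision, recall, and timing error of contact labels) could be computed along the following lines. This is a sketch under stated assumptions: labels are per-frame binary sequences for a single link, timing error is measured as the gap between each reference contact onset and the nearest predicted onset, and all function names are hypothetical.

```python
import numpy as np

def contact_precision_recall(pred, truth):
    """Frame-level precision and recall of predicted contact labels."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return float(precision), float(recall)

def onset_frames(labels):
    """Frame indices where a 1-D contact sequence switches from off to on."""
    labels = np.asarray(labels, dtype=bool)
    prev = np.concatenate([[False], labels[:-1]])
    return np.flatnonzero(labels & ~prev)

def mean_onset_error(pred, truth, dt):
    """Mean absolute time from each reference onset to the nearest predicted onset."""
    p, t = onset_frames(pred), onset_frames(truth)
    if len(p) == 0 or len(t) == 0:
        return float("nan")
    gaps = np.abs(t[:, None] - p[None, :]).min(axis=1)   # frames
    return float(gaps.mean() * dt)                        # seconds

# Toy single-link example: reference labels versus reconstructed labels.
truth = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
precision, recall = contact_precision_recall(pred, truth)
timing = mean_onset_error(pred, truth, dt=0.02)
```

In this toy case each reconstructed contact starts one frame away from the reference, so the frame-level scores and the onset error capture different aspects of label quality.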

Circularity Check

0 steps flagged

No circularity: derivation is self-contained with external data pipeline

The paper presents a policy trained on contact labels generated via an external hindsight reconstruction process from retargeted human motion data. No equations, fitted parameters, or predictions are shown that reduce to the inputs by construction. The generalization claim is to unseen motions and environments, which are independent of the training set. No self-citation chains or uniqueness theorems are invoked in the provided text to support the central result. The method is a reinforcement-learning training pipeline conditioned on reconstructed labels, with no load-bearing step that equates the output to the input definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The hindsight reconstruction step implicitly assumes accurate contact inference from mocap retargeting, but this is not formalized.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes
cs.RO 2026-06 unverdicted novelty 6.0

Generates 48,000 synthetic VLK trajectories in 3D-reconstructed scenes to train a policy for egocentric perception-based humanoid navigation and object transport, shown on physical Unitree G1 robot.

Reference graph

Works this paper leans on

37 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1] Q. Liao, T. E. Truong, X. Huang, Y. Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241, 2025

[2] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG), 37(4):1–14, 2018

[3] T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbabu, C. Pan, et al. Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv preprint arXiv:2502.01143, 2025

[4] Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025

[5] Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770, 2025

[6] Z. Zhang, J. Guo, C. Chen, J. Wang, C. Lin, Y. Lian, H. Xue, Z. Wang, M. Liu, J. Lyu, et al. Track any motions under any disturbances. arXiv preprint arXiv:2509.13833, 2025

[7] Y. Ze, Z. Chen, J. P. Araújo, Z.-a. Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025

[8] Q. Lu, Y. Feng, B. Shi, M. Piseno, Z. Bao, and C. K. Liu. Gentlehumanoid: Learning upper-body compliance for contact-rich human and object interaction. arXiv preprint arXiv:2511.04679, 2025

[9] S. Chen, Z.-a. Cao, Z. Luo, F. Castañeda, C. Li, T. Wang, Y. Yuan, L. Fan, C. K. Liu, Y. Zhu, et al. Chip: Adaptive compliance for humanoid control through hindsight perturbation. arXiv preprint arXiv:2512.14689, 2025

[10] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019

[11] J. Saito, J. Li, M. de Ruyter, M. Guerrero, E. Lim, E. Hassani, R. B. Ribera, H. Moon, M. Dadela, M. D. Lucca, Q. Wang, X. Li, J. Kautz, S. Yuen, and U. Iqbal. Soma: Unifying parametric human body models. arXiv preprint arXiv:2603.16858, 2026. URL https://arxiv.org/abs/2603.16858

[12] J. P. Araujo, Y. Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025

[13] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025

[14] K. Zakka, Q. Liao, B. Yi, L. L. Lay, K. Sreenath, and P. Abbeel. mjlab: A lightweight framework for gpu-accelerated robot learning. arXiv preprint arXiv:2601.22074, 2026

[15] Y. Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv preprint arXiv:2511.02832, 2025

[16] R. Deits and R. Tedrake. Footstep planning on uneven terrain with mixed-integer convex optimization. In 2014 IEEE-RAS international conference on humanoid robots, pages 279–

[17] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake. Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot. Autonomous robots, 40(3):429–455, 2016

[18] Q. Ben, B. Xu, K. Li, F. Jia, W. Zhang, J. Wang, J. Wang, D. Lin, and J. Pang. Gallant: Voxel grid-based humanoid locomotion and local-navigation across 3d constrained terrains. arXiv preprint arXiv:2511.14625, 2025

[19] Y. Zhang, Y. Seo, J. Chen, Y. Yuan, K. Sreenath, P. Abbeel, C. Sferrazza, K. Liu, R. Duan, and G. Shi. Rpl: Learning robust humanoid perceptive locomotion on challenging terrains. arXiv preprint arXiv:2602.03002, 2026

[20] C. Zhang, V. Klemm, F. Yang, and M. Hutter. Ame-2: Agile and generalized legged locomotion via attention-based neural map encoding. arXiv preprint arXiv:2601.08485, 2026

[21] L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. arXiv preprint arXiv:2509.26633, 2025

[22] Z. Wu, X. Huang, L. Yang, Y. Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, et al. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching. arXiv preprint arXiv:2602.15827, 2026

[23] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021

[24] S. Zhu, Z. Zhuang, M. Zhao, K.-Y. Lee, and H. Zhao. Hiking in the wild: A scalable perceptive parkour framework for humanoids. arXiv preprint arXiv:2601.07718, 2026

[25] M. Xu, Y. Shi, K. Yin, and X. B. Peng. Parc: Physics-based augmentation with reinforcement learning for character controllers. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

[26] Z. Zhang, K. Wen, M. Xu, J. He, C. Li, T. Miki, C. Schwarke, C. Zhang, X. B. Peng, and M. Hutter. Learning whole-body humanoid locomotion via motion generation and motion tracking. arXiv preprint arXiv:2604.17335, 2026

[27] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

[28] S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al. ψ0: An open foundation model towards universal humanoid loco-manipulation. arXiv preprint arXiv:2603.12263, 2026

[29] J. Li, J. Wu, and C. K. Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023

[30] J. Lu, C.-H. P. Huang, U. Bhattacharya, Q. Huang, and Y. Zhou. Humoto: A 4d dataset of mocap human object interactions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10886–10897, 2025

[31] H. Weng, Y. Li, N. Sobanbabu, Z. Wang, Z. Luo, T. He, D. Ramanan, and G. Shi. Hdmi: Learning interactive humanoid whole-body control from human videos. arXiv preprint arXiv:2509.16757, 2025

[32] Z. Wu, J. Li, P. Xu, and C. K. Liu. Human-object interaction from human-level instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11176–11186, 2025

[33] S. Zhao, Y. Ze, Y. Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. Resmimic: From general motion tracking to humanoid whole-body loco-manipulation via residual learning. arXiv preprint arXiv:2510.05070, 2025

[34] C. Zhang, W. Xiao, T. He, and G. Shi. Wococo: Learning whole-body humanoid control with sequential contacts. arXiv preprint arXiv:2406.06005, 2024

[35] S. Zhao, H. Zhang, P. Wang, L. Nogueira, and S. Scherer. Super odometry: Imu-centric lidar-visual-inertial estimator for challenging environments. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8729–8736. IEEE, 2021

[36] Y. Jiang, Y. Ye, D. Gopinath, J. Won, A. W. Winkler, and C. K. Liu. Transformer inertial poser: Real-time human motion reconstruction from sparse imus with simultaneous terrain generation. In SIGGRAPH Asia 2022 Conference Papers, SA ’22, page 1–9. ACM, Nov. 2022. doi:10.1145/3550469.3555428. URL http://dx.doi.org/10.1145/3550469.3555428

[37] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal. Robust motion in-betweening. ACM Transactions on Graphics, 39(4), Aug. 2020. ISSN 1557-7368. doi:10.1145/3386569.3392480. URL http://dx.doi.org/10.1145/3386569.3392480

Appendix A: Scene Reconstruction Algorithm. Algorithm 1: Scene Reconstruction from Human Motion. Require: human kinematic motion M_human. Ensure: R...