GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
Pith reviewed 2026-06-28 05:40 UTC · model grok-4.3
The pith
GRAIL generates over 20,000 virtual humanoid sequences from 3D assets and video priors that train policies achieving 84% real pickup success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation. This privileged setup allows model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction trajectories with reduced depth ambiguity and morphology mismatch. The recovered motions are retargeted to the humanoid, complementary task-general trackers are trained, and the resulting dataset of over 20,000 sequences is used to train egocentric policies that succeed in real-world deployment.
What carries the argument
Privileged 3D configuration setup with known object geometry, camera parameters, metric scale, environment depth, and robot-proportioned character that conditions accurate 4D HOI trajectory reconstruction.
If this is right
- Over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal become available without physical collection.
- An object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal can be trained from the generated data.
- Egocentric visual policies can be trained through a sim-to-real pipeline using only the synthetic sequences.
- Real-world deployment on a Unitree G1 humanoid reaches 84% success on diverse object pick-up and 90% success on stair-climbing.
Where Pith is reading between the lines
- The same virtual composition approach could generate demonstration data for robot morphologies other than humanoids.
- Updating the video foundation model priors over time could improve the quality and diversity of generated motions.
- Extending the pipeline to multi-object or long-horizon interactions would test whether the privileged setup scales beyond single interactions.
- The reduction in depth ambiguity from known 3D parameters might generalize to other reconstruction tasks that currently rely on in-the-wild video.
Load-bearing premise
Fully specified 3D configurations with known geometry, camera parameters, metric scale, environment depth, and robot-proportioned character can be provided or assumed for the generated sequences.
What would settle it
Training egocentric policies only on GRAIL data and deploying them on the Unitree G1 results in real-world success rates substantially below 84% for diverse object pick-up or 90% for stair-climbing.
Figures
read the original abstract
Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84\% real-world success on diverse object pick-up and 90\% success on stair-climbing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents GRAIL, a fully virtual pipeline that composes 3D assets, simulator scenes, and video foundation model priors to synthesize metric 4D human-object interaction trajectories. Starting from privileged 3D configurations (known geometry, camera parameters, metric scale), it reconstructs motions via model-based tracking and interaction-aware optimization, retargets them to a humanoid, and generates over 20,000 sequences spanning pick-up, manipulation, sitting, and terrain traversal. Policies trained solely on this data via sim-to-real achieve 84% real-world success on diverse object pick-up and 90% on stair-climbing when deployed on a Unitree G1.
Significance. If the reconstruction fidelity and transfer results hold, GRAIL would provide a scalable, instrumentation-free method for generating diverse humanoid loco-manipulation data, addressing a key bottleneck in robot learning. The explicit use of privileged 3D setups to condition VFM-based recovery and the end-to-end real-robot validation are notable strengths.
major comments (2)
- [Reconstruction pipeline and Experiments] The headline empirical claims (84% pick-up and 90% stair-climbing success) rest on the assumption that the reconstructed 4D HOI trajectories are sufficiently accurate and artifact-free for sim-to-real transfer. However, the manuscript supplies no quantitative validation of reconstruction quality (e.g., per-joint error, contact accuracy, or physical plausibility metrics) against mocap ground truth, simulator rollouts, or human raters. This validation is load-bearing for the central claim that the privileged 3D setup plus VFM priors yields robot-usable trajectories.
- [Abstract and Results] The abstract and results report concrete success rates but omit key experimental details required to assess support: number of trials per task, definition of success, variance across runs, comparison to any baseline (teleop, mocap, or alternative generation methods), and statistics on the 20k-sequence dataset (object diversity, motion coverage, failure modes).
minor comments (1)
- [Policy training] Notation for the object-aware latent adaptor and scene-aware tracker should be introduced with explicit equations or pseudocode to clarify how they complement the retargeted motions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger validation of the reconstruction pipeline and more complete reporting of experimental details. We address each major comment below and indicate where revisions will be made.
read point-by-point responses
-
Referee: [Reconstruction pipeline and Experiments] The headline empirical claims (84% pick-up and 90% stair-climbing success) rest on the assumption that the reconstructed 4D HOI trajectories are sufficiently accurate and artifact-free for sim-to-real transfer. However, the manuscript supplies no quantitative validation of reconstruction quality (e.g., per-joint error, contact accuracy, or physical plausibility metrics) against mocap ground truth, simulator rollouts, or human raters. This validation is load-bearing for the central claim that the privileged 3D setup plus VFM priors yields robot-usable trajectories.
Authors: We agree that direct quantitative validation of reconstruction fidelity would strengthen the central claim. The current manuscript relies on end-to-end real-robot success rates as indirect evidence of trajectory quality, but this is insufficient on its own. We will add a new subsection reporting physical plausibility metrics (e.g., contact force consistency and penetration rates from simulator rollouts) and human rater agreement scores on a sampled subset of 500 trajectories. However, per-joint error against mocap ground truth cannot be provided because the pipeline is designed to operate without any physical mocap capture or instrumentation. revision: partial
-
Referee: [Abstract and Results] The abstract and results report concrete success rates but omit key experimental details required to assess support: number of trials per task, definition of success, variance across runs, comparison to any baseline (teleop, mocap, or alternative generation methods), and statistics on the 20k-sequence dataset (object diversity, motion coverage, failure modes).
Authors: We acknowledge that the abstract and results sections are missing these details. In the revision we will expand both sections to report: 50 trials per task with success defined as stable grasp/lift or stair traversal without falling; mean and standard deviation across three random seeds; dataset statistics including 50 object categories for pick-up, motion coverage histograms, and failure mode breakdown (e.g., 8% contact loss). Direct comparisons to teleop or mocap are outside the scope of a fully virtual method, but we will add a limited comparison to an alternative video-only generation baseline on a subset of tasks. revision: yes
- Direct per-joint error metrics against mocap ground truth, as the method is intentionally instrumentation-free and no mocap data was collected.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes an empirical generation pipeline that starts from known 3D configurations, applies video foundation model priors, performs reconstruction, retargets motions, and trains policies whose performance is measured by real-world success rates. These outcomes are presented as experimental results of the end-to-end process rather than mathematical derivations or predictions that reduce to the inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described chain that would trigger any of the enumerated circularity patterns. The central claims remain externally falsifiable via the reported deployment metrics.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training
GenHOI reconstructs robot-object scenes, generates task videos from language and first-frame images, extracts contact constraints, optimizes reference trajectories, and executes them via closed-loop control for zero-s...
Reference graph
Works this paper leans on
-
[1]
FirstName Alpher , title =
-
[2]
Journal of Foo , volume = 13, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
-
[3]
Journal of Foo , volume = 14, number = 1, pages =
FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
-
[4]
FirstName Alpher and FirstName Gamow , title =
-
[5]
Computer Vision -- ECCV 2022 , year =
2022
-
[6]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages =
End-to-end recovery of human shape and pose , author =. Proceedings of the IEEE conference on computer vision and pattern recognition , pages =
-
[7]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
-
[8]
Advances in Neural Information Processing Systems , volume =
Neural localizer fields for continuous 3d human pose and shape estimation , author =. Advances in Neural Information Processing Systems , volume =
-
[9]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
Beyond static features for temporally consistent 3d human pose and shape from a video , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
-
[10]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
Humans in 4d: Reconstructing and tracking humans with transformers , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
-
[11]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
Vibe: Video inference for human body pose and shape estimation , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
-
[12]
European Conference on Computer Vision , pages =
Coin: Control-inpainting diffusion prior for human and camera motion estimation , author =. European Conference on Computer Vision , pages =. 2024 , organization =
2024
-
[13]
European Conference on Computer Vision , pages =
TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos , author =. European Conference on Computer Vision , pages =. 2024 , organization =
2024
-
[14]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
Glamr: Global occlusion-aware human mesh recovery with dynamic cameras , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
-
[15]
SIGGRAPH Asia 2024 Conference Papers , pages =
World-grounded human motion recovery via gravity-view coordinates , author =. SIGGRAPH Asia 2024 Conference Papers , pages =
2024
-
[16]
Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye , booktitle =
-
[17]
Proceedings of the IEEE/CVF international conference on computer vision , pages =
AMASS: Archive of motion capture as surface shapes , author =. Proceedings of the IEEE/CVF international conference on computer vision , pages =
-
[18]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
BABEL: Bodies, action and behavior with english labels , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
-
[19]
Proceedings of the AAAI Conference on Artificial Intelligence , volume =
Flame: Free-form language-based motion synthesis & editing , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =
-
[20]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
Edge: Editable dance generation from music , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
-
[21]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
Bailando: 3d dance generation by actor-critic gpt with choreographic memory , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
-
[22]
ACM Trans
Learning physically simulated tennis skills from broadcast videos , author =. ACM Trans. Graph , volume =
-
[23]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
Skillmimic: Learning basketball interaction skills from demonstrations , author =. Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
-
[24]
arXiv preprint arXiv:2504.10414 , year =
HUMOTO: A 4D Dataset of Mocap Human Object Interactions , author =. arXiv preprint arXiv:2504.10414 , year =
-
[25]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
InteractVLM: 3D interaction reasoning from 2D foundational models , author =. Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
-
[26]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
Infinite photorealistic worlds using procedural generation , author =. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages =
-
[27]
2025 , eprint =
FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space , author =. 2025 , eprint =
2025
-
[28]
Proceedings of the IEEE/CVF international conference on computer vision , pages =
Adding conditional control to text-to-image diffusion models , author =. Proceedings of the IEEE/CVF international conference on computer vision , pages =
-
[29]
Expressive Body Capture:
Pavlakos, Georgios and Choutas, Vasileios and Ghorbani, Nima and Bolkart, Timo and Osman, Ahmed AA and Tzionas, Dimitrios and Black, Michael J , booktitle =. Expressive Body Capture:
-
[30]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
DAViD: Modeling dynamic affordance of 3d objects using pre-trained video diffusion models , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
-
[31]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
Genzi: Zero-shot 3d human-scene interaction generation , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
-
[32]
arXiv preprint arXiv:2509.22442 , year =
Learning to Ball: Composing Policies for Long-Horizon Basketball Moves , author =. arXiv preprint arXiv:2509.22442 , year =
-
[33]
Acm transactions on graphics (tog) , volume =
Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning , author =. Acm transactions on graphics (tog) , volume =. 2018 , publisher =
2018
-
[34]
ACM Transactions on Graphics , volume =
Local motion phases for learning multi-contact character movements , author =. ACM Transactions on Graphics , volume =
-
[35]
arXiv preprint arXiv:2508.21043 , year =
Hitter: A humanoid table tennis robot via hierarchical planning and learning , author =. arXiv preprint arXiv:2508.21043 , year =
-
[36]
arXiv preprint arXiv:2510.18002 , year =
Humanoid Goalkeeper: Learning from Position Conditioned Task-Motion Constraints , author =. arXiv preprint arXiv:2510.18002 , year =
-
[37]
ACM Transactions on Graphics (TOG) , volume =
Learning to schedule control fragments for physics-based characters using deep q-learning , author =. ACM Transactions on Graphics (TOG) , volume =. 2017 , publisher =
2017
-
[38]
ACM Transactions on Graphics (ToG) , volume =
Deepphase: Periodic autoencoders for learning motion phase manifolds , author =. ACM Transactions on Graphics (ToG) , volume =. 2022 , publisher =
2022
-
[39]
Science Robotics , volume =
Learning agile soccer skills for a bipedal robot with deep reinforcement learning , author =. Science Robotics , volume =. 2024 , publisher =
2024
-
[40]
SIGGRAPH Asia 2022 Conference Papers , pages =
Scene synthesis from human motion , author =. SIGGRAPH Asia 2022 Conference Papers , pages =
2022
-
[41]
ACM Transactions on Graphics (TOG) , volume =
Object motion guided human motion synthesis , author =. ACM Transactions on Graphics (TOG) , volume =. 2023 , publisher =
2023
-
[42]
3DV , year =
Zerohsi: Zero-shot 4d human-scene interaction by video generation , author =. 3DV , year =
-
[43]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
Human-object interaction from human-level instructions , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
-
[44]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
Scaling up dynamic human-scene interaction modeling , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
-
[45]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
Wilor: End-to-end 3d hand localization and reconstruction in-the-wild , author =. Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
-
[46]
ACM Transactions on Graphics (TOG) , volume =
Embodied hands: modeling and capturing hands and bodies together , author =. ACM Transactions on Graphics (TOG) , volume =. 2017 , publisher =
2017
-
[47]
, author =
Smoothing and differentiation of data by simplified least squares procedures. , author =. Analytical chemistry , volume =. 1964 , publisher =
1964
-
[48]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
Foundationpose: Unified 6d pose estimation and tracking of novel objects , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
-
[49]
Advances in neural information processing systems , volume =
Vitpose: Simple vision transformer baselines for human pose estimation , author =. Advances in neural information processing systems , volume =
-
[50]
The Thirteenth International Conference on Learning Representations , year =
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second , author =. The Thirteenth International Conference on Learning Representations , year =
-
[51]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
Perpetual Humanoid Control for Real-time Simulated Avatars , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =
-
[52]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
Intermimic: Towards universal whole-body control for physics-based human-object interactions , author =. Proceedings of the Computer Vision and Pattern Recognition Conference , pages =
-
[53]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
D-grasp: Physically plausible dynamic grasp synthesis for hand-object interactions , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =
-
[54]
ACM Transactions On Graphics (TOG) , volume =
Deepmimic: Example-guided deep reinforcement learning of physics-based character skills , author =. ACM Transactions On Graphics (TOG) , volume =. 2018 , publisher =
2018
-
[55]
Proximal Policy Optimization Algorithms
Proximal policy optimization algorithms , author =. arXiv preprint arXiv:1707.06347 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[56]
Proceedings of the 9th International Conference on Motion in Games , pages =
XPBD: position-based simulation of compliant constrained dynamics , author =. Proceedings of the 9th International Conference on Motion in Games , pages =
-
[57]
Beyond the Contact: Discovering Comprehensive Affordance for
Kim, Hyeonwoo and Han, Sookwan and Kwon, Patrick and Joo, Hanbyul , booktitle =. Beyond the Contact: Discovering Comprehensive Affordance for. 2024 , organization =
2024
-
[58]
CVPR 2025 Workshop of HuMoGen , year =
HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models , author =. CVPR 2025 Workshop of HuMoGen , year =
2025
-
[59]
CG-HOI: Contact-Guided 3D Human-Object Interaction Generation , author =. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE , year =
-
[60]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
Kulkarni, Nilesh and Rempe, Davis and Genova, Kyle and Kundu, Abhijit and Johnson, Justin and Fouhey, David and Guibas, Leonidas , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =
2024
-
[61]
ECCV , year =
Controllable human-object interaction synthesis , author =. ECCV , year =
-
[62]
CVPR , year =
InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing , author =. CVPR , year =
-
[63]
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =
Xu, Sirui and Li, Zhengyuan and Wang, Yu-Xiong and Gui, Liang-Yan , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2023 , pages =
2023
-
[64]
NeurIPS , year =
InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction , author =. NeurIPS , year =
-
[65]
ECCV , year =
Chore: Contact, human and object reconstruction from a single rgb image , author =. ECCV , year =
-
[66]
3DV , year =
Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors , author =. 3DV , year =
-
[67]
CVPR , year =
Object pop-up: Can we infer 3D objects and their poses from human interactions alone? , author =. CVPR , year =
-
[68]
Zhou, Keyang and Bhatnagar, Bharat Lal and Lenssen, Jan Eric and Pons-Moll, Gerard , booktitle =
-
[69]
Zhang, Hui and Christen, Sammy and Fan, Zicong and Zheng, Luocheng and Hwangbo, Jemin and Song, Jie and Hilliges, Otmar , journal =
-
[70]
ICCV , year =
Petrovich, Mathis and Black, Michael J and Varol, G. ICCV , year =
-
[71]
ECCV , year =
Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts , author =. ECCV , year =
-
[72]
CVPR , year =
Executing your Commands via Motion Diffusion in Latent Space , author =. CVPR , year =
-
[73]
Zhang, Mingyuan and Cai, Zhongang and Pan, Liang and Hong, Fangzhou and Guo, Xinying and Yang, Lei and Liu, Ziwei , journal =
-
[74]
arXiv preprint arXiv:2306.10900 , year =
MotionGPT: Finetuned LLMs are General-Purpose Motion Generators , author =. arXiv preprint arXiv:2306.10900 , year =
-
[75]
CVPR , year =
Generating Human Motion From Textual Descriptions With Discrete Representations , author =. CVPR , year =
-
[76]
ECCV , year =
Motionclip: Exposing human motion generation to clip space , author =. ECCV , year =
-
[77]
3DV , year =
Language2pose: Natural language grounded pose forecasting , author =. 3DV , year =
-
[78]
CVPR , year =
Generating diverse and natural 3d human motions from text , author =. CVPR , year =
-
[79]
Lu, Shunlin and Chen, Ling-Hao and Zeng, Ailing and Lin, Jing and Zhang, Ruimao and Zhang, Lei and Shum, Heung-Yeung , journal =
-
[80]
arXiv preprint arXiv:2302.05905 , year =
Single Motion Diffusion , author =. arXiv preprint arXiv:2302.05905 , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.