OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation
Pith reviewed 2026-06-27 18:22 UTC · model grok-4.3
The pith
Policies for humanoid loco-manipulation trained on simulation data from 3D reconstructions outperform those trained on real teleoperation data under zero-shot deployment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OASIS automatically reconstructs realistic object assets from real-world images using a 3D generative model. Based on these assets, trajectories are first collected through teleoperation in simulation, and then augmented under diverse domain randomizations in a post-processing stage. With the resulting simulation data, a hierarchical visuomotor policy for humanoid loco-manipulation is trained, and under zero-shot deployment this policy achieves higher success rates on most tasks than policies trained on real-robot teleoperation data, owing largely to the broad lighting and environmental variations covered by simulation rendering.
What carries the argument
The OASIS pipeline that reconstructs 3D object assets from images, collects teleoperated trajectories in simulation, augments them with domain randomizations, and trains a hierarchical visuomotor policy.
If this is right
- Simulation data collection removes the need for repeated physical scene resets during data gathering.
- Post-processing domain randomization on simulated trajectories increases robustness to visual changes at deployment time.
- A hierarchical policy structure allows separate handling of locomotion and manipulation within the same visuomotor controller.
- Zero-shot transfer becomes feasible when training data covers lighting and environmental diversity that real demonstrations typically lack.
Where Pith is reading between the lines
- Visual diversity in training data may matter more for these tasks than exact matching of demonstration trajectories.
- The approach could reduce reliance on physical robot time for data collection across a wider range of manipulation scenarios.
- If the 3D asset generation step improves further, similar pipelines might apply to other robot embodiments without new teleoperation hardware.
Load-bearing premise
The 3D generative model produces object assets whose geometry, textures, and dynamics are accurate enough that policies trained on them transfer directly to the physical robot without real-world fine-tuning.
What would settle it
A side-by-side test in which the simulation-trained policy is evaluated on tasks whose real objects have geometry or surface properties outside the distribution produced by the 3D generative model and shows lower success rates than the real-data baseline.
Figures
read the original abstract
Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For humanoid robot loco-manipulation tasks, however, existing data sources force an unsatisfying tradeoff between trajectory quality and scalability. Real-world teleoperation provides the highest-quality trajectories but requires dedicated physical space and time-consuming scene resets. Simulation offers an alternative way out of this dilemma: it can produce clean, embodiment-aligned data at scale without any physical hardware. In this paper, we propose OASIS, a simulation-data-driven framework for humanoid loco-manipulation. OASIS automatically reconstructs realistic object assets from real-world images using a 3D generative model. Based on these assets, trajectories are first collected through teleoperation in simulation, and then augmented under diverse domain randomizations in a post-processing stage. With the resulting simulation data, we further design a hierarchical visuomotor policy for humanoid loco-manipulation. Extensive experiments on the real humanoid robot show that, under zero-shot deployment, the policy trained on our simulation data achieves higher success rates on most tasks than that trained on real-robot teleoperation data, owing largely to the broad lighting and environmental variations covered by our simulation rendering, which real-robot data fails to capture. The project page is available at https://oasis-humanoid.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OASIS, a framework that reconstructs object assets from real-world images via a 3D generative model, collects teleoperated trajectories in simulation, augments them with domain randomization on lighting and environment, and trains a hierarchical visuomotor policy for humanoid loco-manipulation. The central claim is that, under zero-shot real-robot deployment, the simulation-trained policy achieves higher success rates on most tasks than a policy trained on real-robot teleoperation data, attributed primarily to the broader rendering variations in simulation.
Significance. If the empirical comparison holds under controlled conditions, the work would demonstrate that simulation data derived from image-based generative assets can outperform real teleoperation data for training humanoid policies, addressing the quality-scalability tradeoff in loco-manipulation data collection. The hierarchical policy design and post-processing augmentation represent concrete engineering contributions that could be adopted if the transfer results are reproducible.
major comments (2)
- [Abstract] Abstract: the central claim states that the simulation-trained policy 'achieves higher success rates on most tasks' but supplies no quantitative success rates, number of tasks evaluated, error bars, statistical tests, task definitions, or controls for data volume/policy architecture/evaluation protocol, rendering the result impossible to assess.
- [Abstract] Abstract: the performance advantage is ascribed to 'broad lighting and environmental variations' in simulation rendering, but the manuscript provides no discussion or validation of the accuracy of physical dynamics parameters (mass, inertia, friction, restitution) recovered by the 3D generative model; for contact-rich loco-manipulation this is load-bearing, as mismatches in contact dynamics could explain or undermine the reported zero-shot advantage independent of rendering diversity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, clarifying the abstract's role as a summary while committing to targeted revisions for improved assessability and transparency.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim states that the simulation-trained policy 'achieves higher success rates on most tasks' but supplies no quantitative success rates, number of tasks evaluated, error bars, statistical tests, task definitions, or controls for data volume/policy architecture/evaluation protocol, rendering the result impossible to assess.
Authors: We agree the abstract is high-level and omits specifics that would aid immediate assessment. The full manuscript details results across 6 tasks in Section 5 and Table 2, with success rates, standard deviations over 20 trials, and controls for data volume and policy architecture. We will revise the abstract to incorporate key quantitative elements (e.g., number of tasks and average improvement) within length constraints to address this concern. revision: yes
-
Referee: [Abstract] Abstract: the performance advantage is ascribed to 'broad lighting and environmental variations' in simulation rendering, but the manuscript provides no discussion or validation of the accuracy of physical dynamics parameters (mass, inertia, friction, restitution) recovered by the 3D generative model; for contact-rich loco-manipulation this is load-bearing, as mismatches in contact dynamics could explain or undermine the reported zero-shot advantage independent of rendering diversity.
Authors: The 3D generative model focuses on visual and geometric reconstruction from images; dynamics parameters (mass, friction, etc.) are assigned using standard estimates rather than recovered or validated in this work. The manuscript attributes gains primarily to rendering variations and does not claim dynamics fidelity as the driver. We acknowledge the absence of explicit dynamics validation as a limitation for contact-rich tasks. We will add a dedicated discussion paragraph on this point and its potential implications in the revised manuscript. revision: yes
Circularity Check
No circularity: empirical pipeline with direct hardware validation
full rationale
The paper describes a data-generation pipeline (3D asset reconstruction from images, teleoperation in simulation, domain randomization, hierarchical policy training) followed by an empirical zero-shot comparison on a physical humanoid robot. No equations, fitted parameters, or mathematical derivations are present in the provided text. The central claim rests on measured success rates rather than any reduction of a prediction to its own inputs by construction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that would force the result. The result is therefore self-contained against external benchmarks (real-robot trials) and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Z. Gu, J. Li, W. Shen, W. Yu, Z. Xie, S. McCrory, X. Cheng, A. Shamsah, R. Griffin, C. K. Liu, et al. Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning.IEEE/ASME Transactions on Mechatronics, 31(2):2300–2330, 2026
2026
-
[2]
Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans.arXiv preprint arXiv:2406.10454, 2024
arXiv 2024
-
[3]
T. He, Z. Luo, X. He, W. Xiao, C. Zhang, W. Zhang, K. Kitani, C. Liu, and G. Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body teleoperation and learning.arXiv preprint arXiv:2406.08858, 2024
arXiv 2024
-
[4]
O’Neill, A
A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
2024
-
[5]
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022
Pith/arXiv arXiv 2022
-
[6]
Zitkovich, T
B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023
2023
-
[7]
S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022
Pith/arXiv arXiv 2022
-
[8]
C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play.arXiv preprint arXiv:2302.12422, 2023
arXiv 2023
-
[9]
Grauman, A
K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022
2022
-
[10]
Kareer, D
S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Con- ference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025
2025
- [11]
-
[12]
R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025. 9
Pith/arXiv arXiv 2025
-
[13]
M. Shi, S. Peng, J. Chen, H. Jiang, Y . Li, D. Huang, P. Luo, H. Li, and L. Chen. Egohu- manoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026
Pith/arXiv arXiv 2026
-
[14]
S. Bai, M. Li, X. Lv, J. Wang, X. Wang, F. Liao, C. Hou, L. Gu, W. Zhou, K. Wu, et al. Hex: Humanoid-aligned experts for cross-embodiment whole-body manipulation.arXiv preprint arXiv:2604.07993, 2026
Pith/arXiv arXiv 2026
-
[15]
J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025
Pith/arXiv arXiv 2025
-
[16]
P. Ding, J. Ma, X. Tong, B. Zou, X. Luo, Y . Fan, T. Wang, H. Lu, P. Mo, J. Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration.arXiv preprint arXiv:2502.14795, 2025
arXiv 2025
-
[17]
S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al.Ψ 0: An open foundation model towards universal humanoid loco-manipulation.arXiv preprint arXiv:2603.12263, 2026
arXiv 2026
-
[18]
Y . Ze, Z. Chen, J. P. Ara´ujo, Z. ang Cao, X. B. Peng, J. Wu, and C. K. Liu. Twist: Teleoperated whole-body imitation system.arXiv preprint arXiv:2505.02833, 2025
arXiv 2025
-
[19]
Y . Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832, 2025
arXiv 2025
-
[20]
Z. Luo, Y . Yuan, T. Wang, C. Li, S. Chen, F. Casta ˜neda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, X. Da, R. Ding, C. Hogg, L. Song, E. Lim, E. Jeong, T. He, H. Xue, W. Xiao, Z. Wang, S. Yuen, J. Kautz, Y . Chang, U. Iqbal, L. Fan, and Y . Zhu. Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025
Pith/arXiv arXiv 2025
-
[21]
Y . Li, L. Ma, Y . Lin, Y . Du, M. Liu, K. Hu, J. Cui, Y . Zhu, W. Liang, B. Jia, et al. Omniclone: Engineering a robust, all-rounder whole-body humanoid teleoperation system.arXiv preprint arXiv:2603.14327, 2026
arXiv 2026
-
[22]
V . Makoviychuk, L. Wawrzyniak, Y . Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning.arXiv preprint arXiv:2108.10470, 2021
Pith/arXiv arXiv 2021
-
[23]
Isaac Sim
NVIDIA. Isaac Sim. URLhttps://github.com/isaac-sim/IsaacSim
-
[24]
Todorov, T
E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012
2012
-
[25]
PICO Immersive Pte. Ltd. PICO 4 Ultra: An All-New Mixed Reality Experience.https: //www.picoxr.com/global/products/pico4-ultra, 2023
2023
-
[26]
H. Zhao, R. Cathomen, L. Gulich, W. Liu, E. A. Ongan, M. Lin, S. Jain, S. Pouya, and Y . Chang. Agile: A comprehensive workflow for humanoid loco-manipulation learning.arXiv preprint arXiv:2603.20147, 2026
arXiv 2026
-
[27]
Y . Fu, F. Xie, C. Xu, J. Xiong, H. Yuan, and Z. Lu. Demohlm: From one demonstration to generalizable humanoid loco-manipulation.IEEE Robotics and Automation Letters, 2026
2026
- [28]
-
[29]
R. Nai, B. Zheng, J. Zhao, H. Zhu, S. Dai, Z. Chen, Y . Hu, Y . Hu, T. Zhang, C. Wen, et al. Hu- manoid manipulation interface: Humanoid whole-body manipulation from robot-free demon- strations.arXiv preprint arXiv:2602.06643, 2026
arXiv 2026
- [30]
-
[31]
S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523, 2024
Pith/arXiv arXiv 2024
-
[32]
L. Wang, Y . Ling, Z. Yuan, M. Shridhar, C. Bao, Y . Qin, B. Wang, H. Xu, and X. Wang. Gensim: Generating robotic simulation tasks via large language models. InInternational Conference on Learning Representations, volume 2024, pages 4890–4924, 2024
2024
-
[33]
Y . Wang, Z. Xian, F. Chen, T.-H. Wang, Y . Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan. Robogen: towards unleashing infinite data for automated robot learning via generative simulation. InProceedings of the 41st International Conference on Machine Learning, pages 51936–51983, 2024
2024
-
[34]
A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596, 2023
Pith/arXiv arXiv 2023
-
[35]
Jiang, Y
Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. J. Fan, and Y . Zhu. Dexmimic- gen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 16923– 16930. IEEE, 2025
2025
-
[36]
T. He, Z. Wang, H. Xue, Q. Ben, Z. Luo, W. Xiao, Y . Yuan, X. Da, F. Casta ˜neda, S. Sas- try, et al. Viral: Visual sim-to-real at scale for humanoid loco-manipulation.arXiv preprint arXiv:2511.15200, 2025
arXiv 2025
-
[37]
Z. Zhao, Z. Lai, Q. Lin, Y . Zhao, H. Liu, S. Yang, Y . Feng, M. Yang, S. Zhang, X. Yang, et al. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation. arXiv preprint arXiv:2501.12202, 2025
Pith/arXiv arXiv 2025
-
[38]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[39]
J. P. Araujo, Y . Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking.arXiv preprint arXiv:2510.02252, 2025
arXiv 2025
-
[40]
Teleopit: A lightweight and scalable whole-body teleoperation framework for humanoid robots, 2025
BotRunner64. Teleopit: A lightweight and scalable whole-body teleoperation framework for humanoid robots, 2025. URLhttps://github.com/BotRunner64/Teleopit. Accessed: 2026-04-15
2025
-
[41]
W. Xie, J. Zheng, J. Han, J. Shi, W. Zhang, C. Bai, and X. Li. Textop: Real-time interactive text-driven humanoid robot motion generation and control.arXiv preprint arXiv:2602.07439, 2026
arXiv 2026
-
[42]
Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
Pith/arXiv arXiv 2022
-
[43]
Radford, J
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 11
2021
-
[44]
length cm
M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y . Huang, H. Xu, V . Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without s...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.