pith. machine review for the scientific record.

arxiv: 2604.11090 · v1 · submitted 2026-04-13 · 💻 cs.RO

Recognition: unknown

Simulator Adaptation for Sim-to-Real Learning of Legged Locomotion via Proprioceptive Distribution Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:34 UTC · model grok-4.3

classification 💻 cs.RO
keywords sim-to-real transfer · legged locomotion · simulator adaptation · proprioceptive sensing · distribution matching · quadruped robots · dynamics identification · policy transfer

The pith

Proprioceptive distribution matching adapts simulators for legged robot policies using only joint data, matching privileged methods without motion capture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that comparing distributions of joint observations and actions between simulation and hardware rollouts can adapt simulator dynamics to close the sim-to-real gap for legged locomotion policies. This matters because prior methods require time-aligned trajectories, motion capture, and privileged sensing, which are impractical for real hardware. The approach uses the distribution match as a black-box objective to tune parameters or add correction models, and it delivers comparable parameter recovery and policy gains to baselines.

Core claim

Simulator adaptation via proprioceptive distribution matching recovers accurate dynamics parameters and improves real-world policy performance comparably to state-matching baselines, as shown in extensive sim-to-sim ablations on the Go2 quadruped and in real hardware tests that reduce drift with less than five minutes of data even for two-legged walking.

What carries the argument

Proprioceptive distribution matching, which quantifies dynamics discrepancies by comparing simulation and hardware rollouts as distributions of joint observations and actions without requiring time alignment or external sensors.

Load-bearing premise

Comparing distributions of proprioceptive joint observations and actions is sufficient to identify and correct the relevant dynamics discrepancies without time alignment or privileged state information.
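The premise can be made concrete, though the abstract does not say which distribution distance the paper uses. As a plausible sketch only: compare simulation and hardware samples per joint channel with an empirical 1-D Wasserstein-1 distance (mean absolute difference of sorted values), then sum over channels. Time alignment never enters, because each channel is treated as an unordered sample.

```python
import random

def wasserstein_1d(a, b):
    # Empirical Wasserstein-1 distance between equal-size 1-D samples:
    # the mean absolute difference of the sorted values.
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def proprio_distance(sim_rollout, hw_rollout):
    # Rollouts are lists of per-timestep joint observation/action vectors.
    # Each channel is compared as an unordered sample -- no time alignment.
    n_ch = len(sim_rollout[0])
    return sum(
        wasserstein_1d([step[c] for step in sim_rollout],
                       [step[c] for step in hw_rollout])
        for c in range(n_ch))

# Sanity check: reordering timesteps leaves the distance at exactly zero.
rng = random.Random(0)
sim = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
hw = list(sim)
rng.shuffle(hw)
print(proprio_distance(sim, hw))  # 0.0
```

Whether the paper actually uses Wasserstein, MMD, or KL is left open in the abstract; the referee's second minor comment asks for exactly this clarification.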

What would settle it

The claim would fail if adapting the simulator with proprioceptive distribution matching produced no improvement in parameter-recovery accuracy or real-world policy drift reduction on the Go2 quadruped compared to an unadapted simulator.

Figures

Figures reproduced from arXiv: 2604.11090 by Alan Fern, Jeremy Dao.

Figure 1
Figure 1. Finetuned policy performance for the Model Parameter Shift sim-to-sim test scenario. Values are averaged over 10,000 trajectories with random commands; error bars represent a 95% confidence interval. (A per-parameter error table for the MatchS and MatchO cost functions is truncated in extraction.) view at source ↗
Figure 2
Figure 2. Finetuned policy performance for the Spring Joint sim-to-sim test scenario. Values are averaged over 10,000 trajectories with random commands; error bars represent a 95% confidence interval. view at source ↗
Figure 3
Figure 3. A Unitree Go2 quadruped used in sim-to-real experiments. view at source ↗
Figure 4
Figure 4. A Unitree Go2 quadruped walking on its hind legs. view at source ↗
read the original abstract

Simulation trained legged locomotion policies often exhibit performance loss on hardware due to dynamics discrepancies between the simulator and the real world, highlighting the need for approaches that adapt the simulator itself to better match hardware behavior. Prior work typically quantify these discrepancies through precise, time-aligned matching of joint and base trajectories. This process requires motion capture, privileged sensing, and carefully controlled initial conditions. We introduce a practical alternative based on proprioceptive distribution matching, which compares hardware and simulation rollouts as distributions of joint observations and actions, eliminating the need for time alignment or external sensing. Using this metric as a black-box objective, we explore adapting simulator dynamics through parameter identification, action-delta models, and residual actuator models. Our approach matches the parameter recovery and policy-performance gains of privileged state-matching baselines across extensive sim-to-sim ablations on the Go2 quadruped. Real-world experiments demonstrate substantial drift reduction using less than five minutes of hardware data, even for a challenging two-legged walking behavior. These results demonstrate that proprioceptive distribution matching provides a practical and effective route to simulator adaptation for sim-to-real transfer of learned legged locomotion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that proprioceptive distribution matching—comparing marginal distributions of joint observations and actions from hardware and simulation rollouts, without time alignment or external sensing—provides a practical method for adapting simulators to reduce dynamics discrepancies in sim-to-real transfer of legged locomotion policies. It explores three adaptation strategies (simulator parameter identification, action-delta models, and residual actuator models) and reports that these match the parameter recovery and policy performance of privileged state-matching baselines across extensive sim-to-sim ablations on the Go2 quadruped. Real-world experiments further show substantial drift reduction using less than five minutes of hardware data, including for a challenging two-legged walking behavior.
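Of the three adaptation routes named above, the action-delta model is the easiest to picture: a small learned correction sits between the policy's action and the simulator. The paper does not state its model form here, so the affine map below is purely hypothetical; only the wrapper structure is the point.

```python
class ActionDeltaSim:
    # Wraps a base simulator step so actions are corrected by a learned
    # delta model before being applied. The affine form (W, b) is a
    # hypothetical stand-in for whatever model the paper actually fits.
    def __init__(self, base_step, W, b):
        self.base_step = base_step
        self.W, self.b = W, b

    def step(self, obs, action):
        delta = [sum(w * a for w, a in zip(row, action)) + bi
                 for row, bi in zip(self.W, self.b)]
        corrected = [a + d for a, d in zip(action, delta)]
        return self.base_step(obs, corrected)

# Toy base simulator: next observation = obs + action.
base = lambda obs, act: [o + a for o, a in zip(obs, act)]
sim = ActionDeltaSim(base, W=[[0.0, 0.0], [0.0, 0.0]], b=[0.25, -0.25])
print(sim.step([0.0, 0.0], [1.0, 2.0]))  # [1.25, 1.75]
```

The residual-actuator variant would correct torques inside the actuation model rather than actions at the interface; the wrapper pattern is the same.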

Significance. If the central results hold, the work provides a low-overhead alternative to trajectory-matching sim-to-real methods that typically require motion capture or privileged state information. The extensive sim-to-sim ablations on a standard quadruped platform and the real-world validation with minimal data collection are notable strengths, as they directly address practical deployment constraints for learned locomotion policies. Credit is due for demonstrating effectiveness on a non-standard behavior (two-legged walking) and for framing the adaptation as a black-box optimization problem.

major comments (2)
  1. [§3] §3 (Proprioceptive Distribution Matching): The central objective minimizes distance between marginal distributions of proprioceptive observations and actions. This formulation does not enforce matching of temporal structure, transition probabilities P(o_{t+1}|o_t, a_t), or phase relationships. Consequently, distinct dynamics parameter sets (e.g., compensating friction and inertia changes that preserve joint-angle histograms) can produce statistically indistinguishable marginals under the same policy. The sim-to-sim ablations claim equivalent parameter recovery to privileged baselines, yet no identifiability analysis, sensitivity to initialization, or uniqueness checks are reported. This directly affects the claim that the method 'recovers the relevant dynamics discrepancies.'
  2. [Real-world results section] Real-world results section (and abstract): The experiments report 'substantial drift reduction' with <5 min of hardware data for both quadrupedal and two-legged behaviors. However, the provided abstract contains no quantitative metrics, error bars, or explicit baseline comparisons for the real-world drift (e.g., position error over time or success rate). If the full manuscript similarly omits detailed statistics or exclusion criteria for the hardware rollouts, the evidence for practical effectiveness remains difficult to assess independently of post-hoc choices.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., drift reduction percentages or parameter recovery error) to support the qualitative claims of 'substantial' improvement and 'matching' baselines.
  2. [§3] Notation for the distribution distance metric (e.g., whether Wasserstein, KL, or MMD is used) and the precise definition of the proprioceptive observation vector should be clarified early in the method section for reproducibility.
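Major comment 1 can be made concrete with a toy example: a smooth sinusoidal joint trajectory and a shuffled copy of it have identical marginals by construction, yet completely different transition structure, which is precisely the information that marginal matching discards.

```python
import math
import random

T = 1000
smooth = [math.sin(2 * math.pi * t / 100) for t in range(T)]

# Same values in scrambled order: the marginal distribution is unchanged.
scrambled = list(smooth)
random.Random(0).shuffle(scrambled)
assert sorted(smooth) == sorted(scrambled)

# But the one-step transition structure differs drastically.
def mean_step(x):
    return sum(abs(x[t + 1] - x[t]) for t in range(len(x) - 1)) / (len(x) - 1)

print(mean_step(smooth) < 0.1 < mean_step(scrambled))  # True
```

Under a fixed policy in closed loop, many dynamics changes do reshape the marginals, which is what the sim-to-sim ablations lean on; the toy case only shows that the guarantee is empirical, not structural.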

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical strengths of our approach. We address each major comment below, indicating planned revisions to the manuscript.

read point-by-point responses
  1. Referee: The central objective minimizes distance between marginal distributions of proprioceptive observations and actions. This formulation does not enforce matching of temporal structure, transition probabilities P(o_{t+1}|o_t, a_t), or phase relationships. Consequently, distinct dynamics parameter sets (e.g., compensating friction and inertia changes that preserve joint-angle histograms) can produce statistically indistinguishable marginals under the same policy. The sim-to-sim ablations claim equivalent parameter recovery to privileged baselines, yet no identifiability analysis, sensitivity to initialization, or uniqueness checks are reported. This directly affects the claim that the method 'recovers the relevant dynamics discrepancies.'

    Authors: We agree that marginal distribution matching does not enforce temporal structure or guarantee unique recovery of dynamics parameters. Nevertheless, our sim-to-sim ablations demonstrate that the adapted simulators achieve parameter recovery and downstream policy performance equivalent to privileged state-matching baselines across varied conditions on the Go2. This provides empirical support that the method identifies discrepancies relevant to policy transfer. We will add a discussion of these theoretical limitations, including identifiability considerations, and report sensitivity to initialization from our existing experimental results. revision: partial

  2. Referee: Real-world results section (and abstract): The experiments report 'substantial drift reduction' with <5 min of hardware data for both quadrupedal and two-legged behaviors. However, the provided abstract contains no quantitative metrics, error bars, or explicit baseline comparisons for the real-world drift (e.g., position error over time or success rate). If the full manuscript similarly omits detailed statistics or exclusion criteria for the hardware rollouts, the evidence for practical effectiveness remains difficult to assess independently of post-hoc choices.

    Authors: The full manuscript reports quantitative real-world metrics including position drift over time with error bars, success rates, and comparisons against unadapted and baseline simulators, along with details on data collection and rollout criteria. To improve accessibility, we will revise the abstract to include key quantitative results and explicit baseline comparisons for the observed drift reduction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; adaptation objective and validation are externally grounded

full rationale

The paper defines its core objective as minimizing a distribution distance between proprioceptive observations and actions collected from independent hardware rollouts and simulator rollouts. This target is external to the optimization and is not derived from the fitted parameters themselves. Claims of matching privileged baselines are supported by separate sim-to-sim ablations, while real-world drift reduction is measured on held-out hardware trials using <5 min of data. No equations reduce any reported gain to a fitted quantity by construction, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled via prior work. The derivation chain is therefore grounded in external benchmarks rather than in its own fitted quantities.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that proprioceptive distributions capture dynamics mismatches and on fitted parameters within the adaptation models; no new entities are postulated.

free parameters (2)
  • simulator dynamics parameters
    Identified via black-box optimization of the distribution-matching objective.
  • parameters of action-delta and residual actuator models
    Learned or tuned to correct simulator behavior during adaptation.
axioms (1)
  • domain assumption: Proprioceptive distributions of joint observations and actions are sufficient to quantify relevant sim-to-real dynamics discrepancies
    Used as the black-box objective for all adaptation methods.
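The ledger's first free parameter is found by black-box optimization of the matching objective; the paper's references point to CMA-ES tooling [34, 35]. As a minimal, dependency-free stand-in, a coarse-to-fine grid search over one hypothetical friction-like parameter shows the loop shape: propose parameters, roll out the simulator, score the distribution mismatch, keep the best.

```python
import random

def rollout(friction, n=400, seed=0):
    # Toy "simulator": joint-velocity samples whose spread depends on a
    # single friction-like parameter (a stand-in for real dynamics).
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0 / friction) for _ in range(n)]

def w1(a, b):
    # Empirical 1-D Wasserstein-1 distance between equal-size samples.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Pretend hardware data generated with true friction = 2.0 (a different
# noise seed, so the optimum is only approximately 2.0).
hw = rollout(2.0, seed=1)

# Coarse-to-fine grid search as a stand-in for CMA-ES.
lo, hi = 0.5, 4.0
for _ in range(4):
    grid = [lo + (hi - lo) * i / 20 for i in range(21)]
    theta = min(grid, key=lambda th: w1(rollout(th), hw))
    span = (hi - lo) / 20
    lo, hi = theta - span, theta + span
print(round(theta, 2))  # close to the true value of 2.0
```

CMA-ES replaces the grid with an adaptive sampling distribution, which matters once several dynamics parameters are tuned jointly; the objective call is unchanged.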

pith-pipeline@v0.9.0 · 5492 in / 1360 out tokens · 81059 ms · 2026-05-10T15:34:56.432353+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 5 canonical work pages

  1. [1]

    Sim-to-real transfer in deep reinforcement learning for robotics: A survey,

    W. Zhao, J. P. Queralta, and T. Westerlund, “Sim-to-real transfer in deep reinforcement learning for robotics: A survey,” 2020 IEEE Symposium Series on Computational Intelligence, SSCI 2020, pp. 737–744, 2020

  2. [2]

    Sim-to-real transfer of robotic control with dynamics randomization,

    X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 5 2018, pp. 3803–3810. [Online]. Available: https://ieeexplore.ieee.org/document/8460528/

  3. [3]

    Data-efficient domain randomization with bayesian optimization,

    F. Muratore, C. Eilers, M. Gienger, and J. Peters, “Data-efficient domain randomization with bayesian optimization,” IEEE Robotics and Automation Letters, vol. 6, pp. 911–918, 4 2021

  4. [4]

    Bayessim: Adaptive domain randomization via probabilistic inference for robotics simulators,

    F. Ramos, R. Possas, and D. Fox, “Bayessim: Adaptive domain randomization via probabilistic inference for robotics simulators,” 2019

  5. [5]

    Auto-tuned sim-to-real transfer,

    Y. Du, O. Watkins, T. Darrell, P. Abbeel, and D. Pathak, “Auto-tuned sim-to-real transfer,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 5 2021, pp. 1290–1296. [Online]. Available: https://ieeexplore.ieee.org/document/9562091/

  6. [6]

    Learning human-to-humanoid real-time whole-body teleoperation,

    T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi, “Learning human-to-humanoid real-time whole-body teleoperation,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 10 2024, pp. 8944–8951

  7. [7]

    Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills,

    T. He, J. Gao, W. Xiao, Y. Zhang, Z. Wang, J. Wang, Z. Luo, G. He, N. Sobanbab, C. Pan, Z. Yi, G. Qu, K. Kitani, J. Hodgins, L. J. Fan, Y. Zhu, C. Liu, and G. Shi, “Asap: Aligning simulation and real-world physics for learning agile humanoid whole-body skills,” 2 2025

  8. [8]

    Bridging the sim-to-real gap for athletic loco-manipulation,

    N. Fey, G. B. Margolis, M. Peticco, and P. Agrawal, “Bridging the sim-to-real gap for athletic loco-manipulation,” 2 2025

  9. [9]

    Sim-to-real transfer for biped locomotion,

    W. Yu, V. C. Kumar, G. Turk, and C. K. Liu, “Sim-to-real transfer for biped locomotion,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 11 2019, pp. 3503–3510

  10. [10]

    Preparing for the unknown: Learning a universal policy with online system identification,

    W. Yu, J. Tan, C. K. Liu, and G. Turk, “Preparing for the unknown: Learning a universal policy with online system identification,” in Robotics: Science and Systems XIII. Robotics: Science and Systems Foundation, 7 2017

  11. [11]

    Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion,

    G. Ji, J. Mun, H. Kim, and J. Hwangbo, “Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion,” IEEE Robotics and Automation Letters, vol. 7, pp. 4630–4637, 4 2022

  12. [12]

    Adapting rapid motor adaptation for bipedal robots,

    A. Kumar, Z. Li, J. Zeng, D. Pathak, K. Sreenath, and J. Malik, “Adapting rapid motor adaptation for bipedal robots,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 10 2022, pp. 1161–1168

  13. [13]

    Advancing humanoid locomotion: Mastering challenging terrains with denoising world model learning,

    X. Gu, Y.-J. Wang, X. Zhu, C. Shi, Y. Guo, Y. Liu, and J. Chen, “Advancing humanoid locomotion: Mastering challenging terrains with denoising world model learning,” in Robotics: Science and Systems XX. Robotics: Science and Systems Foundation, 7 2024

  14. [14]

    Cts: Concurrent teacher-student reinforcement learning for legged locomotion,

    H. Wang, H. Luo, W. Zhang, and H. Chen, “Cts: Concurrent teacher-student reinforcement learning for legged locomotion,” IEEE Robotics and Automation Letters, vol. 9, pp. 9191–9198, 11 2024

  15. [15]

    Rapid locomotion via reinforcement learning,

    G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal, “Rapid locomotion via reinforcement learning,” in Robotics: Science and Systems XVIII. Robotics: Science and Systems Foundation, 6 2022

  16. [16]

    Learning agile robotic locomotion skills by imitating animals,

    X. B. Peng, E. Coumans, T. Zhang, T.-W. Lee, J. Tan, and S. Levine, “Learning agile robotic locomotion skills by imitating animals,” in Robotics: Science and Systems XVI. Robotics: Science and Systems Foundation, 2020. [Online]. Available: http://www.roboticsproceedings.org/rss16/p064.pdf

  17. [17]

    Self-supervised policy adaptation during deployment,

    N. Hansen, R. Jangir, Y. Sun, G. Alenyà, P. Abbeel, A. A. Efros, L. Pinto, and X. Wang, “Self-supervised policy adaptation during deployment,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. ICLR, 2021. [Online]. Available: https://openreview.net/forum?id=oV-MjyyGV

  18. [18]

    Simgan: Hybrid simulator identification for domain adaptation via adversarial reinforcement learning,

    Y. Jiang, T. Zhang, D. Ho, Y. Bai, C. K. Liu, S. Levine, and J. Tan, “Simgan: Hybrid simulator identification for domain adaptation via adversarial reinforcement learning,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 5 2021, pp. 2884–2890. [Online]. Available: https://ieeexplore.ieee.org/document/9561731/

  19. [19]

    Improving domain transfer of robot dynamics models with geometric system identification and learned friction compensation,

    L. Schwendeman, A. SaLoutos, E. Stanger-Jones, and S. Kim, “Improving domain transfer of robot dynamics models with geometric system identification and learned friction compensation,” in 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids). IEEE, 12 2023, pp. 1–8

  20. [20]

    Estimation of inertial parameters of manipulator loads and links,

    C. G. Atkeson, C. H. An, and J. M. Hollerbach, “Estimation of inertial parameters of manipulator loads and links,” The International Journal of Robotics Research, vol. 5, pp. 101–119, 9 1986

  21. [21]

    Learning to walk in minutes using massively parallel deep reinforcement learning,

    N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Proceedings of the 5th Conference on Robot Learning, A. Faust, D. Hsu, and G. Neumann, Eds., vol. 164. PMLR, 3 2022, pp. 91–100. [Online]. Available: https://proceedings.mlr.press/v164/rudin22a.html

  22. [22]

    Imitate and repurpose: Learning reusable robot movement skills from human and animal behaviors,

    S. Bohez, S. Tunyasuvunakool, P. Brakel, F. Sadeghi, L. Hasenclever, Y. Tassa, E. Parisotto, J. Humplik, T. Haarnoja, R. Hafner, M. Wulfmeier, M. Neunert, B. Moran, N. Siegel, A. Huber, F. Romano, N. Batchelor, F. Casarini, J. Merel, R. Hadsell, and N. Heess, “Imitate and repurpose: Learning reusable robot movement skills from human and animal behavior...

  23. [23]

    Sim-to-real: Learning agile locomotion for quadruped robots,

    J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” in Robotics: Science and Systems XIV. Robotics: Science and Systems Foundation, 6 2018. [Online]. Available: http://www.roboticsproceedings.org/rss14/p10.html http://www.roboticsproceedings.org/rss14/p10.pdf

  24. [24]

    Dynamic parameter identification of serial robots using a hybrid approach,

    Y. Huang, J. Ke, X. Zhang, and J. Ota, “Dynamic parameter identification of serial robots using a hybrid approach,” IEEE Transactions on Robotics, vol. 39, pp. 1607–1621, 4 2023

  25. [25]

    Learning agile and dynamic motor skills for legged robots,

    J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, 1 2019

  26. [26]

    Reinforced grounded action transformation for sim-to-real transfer,

    H. Karnan, S. Desai, J. P. Hanna, G. Warnell, and P. Stone, “Reinforced grounded action transformation for sim-to-real transfer,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 10 2020, pp. 4397–4402

  27. [27]

    Rl2ac: Reinforcement learning-based rapid online adaptive control for legged robot robust locomotion,

    S. Lyu, X. Lang, H. Zhao, H. Zhang, P. Ding, and D. Wang, “Rl2ac: Reinforcement learning-based rapid online adaptive control for legged robot robust locomotion,” in Robotics: Science and Systems XX. Robotics: Science and Systems Foundation, 7 2024

  28. [28]

    Tossingbot: Learning to throw arbitrary objects with residual physics,

    A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser, “Tossingbot: Learning to throw arbitrary objects with residual physics,” IEEE Transactions on Robotics, vol. 36, pp. 1307–1319, 8 2020

  29. [29]

    Data-efficient control policy search using residual dynamics learning,

    M. Saveriano, Y. Yin, P. Falco, and D. Lee, “Data-efficient control policy search using residual dynamics learning,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 9 2017, pp. 4709–4715. [Online]. Available: http://ieeexplore.ieee.org/document/8206343/

  30. [30]

    Neuralsim: Augmenting differentiable simulators with neural networks,

    E. Heiden, D. Millard, E. Coumans, Y. Sheng, and G. S. Sukhatme, “Neuralsim: Augmenting differentiable simulators with neural networks,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 5 2021, pp. 9474–9481

  31. [31]

    Sim-to-real of soft robots with learned residual physics,

    J. Gao, M. Y. Michelis, A. Spielberg, and R. K. Katzschmann, “Sim-to-real of soft robots with learned residual physics,” IEEE Robotics and Automation Letters, vol. 9, pp. 8523–8530, 10 2024

  32. [32]

    Residual physics learning and system identification for sim-to-real transfer of policies on buoyancy assisted legged robots,

    N. Sontakke, H. Chae, S. Lee, T. Huang, D. W. Hong, and S. Hal, “Residual physics learning and system identification for sim-to-real transfer of policies on buoyancy assisted legged robots,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 10 2023, pp. 392–399

  33. [33]

    High-performance reinforcement learning on spot: Optimizing simulation parameters with distributional measures,

    A. Miller, F. Yu, M. Brauckmann, and F. Farshidian, “High-performance reinforcement learning on spot: Optimizing simulation parameters with distributional measures,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 9981–9988

  34. [34]

    The CMA evolution strategy: A tutorial,

    N. Hansen, “The CMA evolution strategy: A tutorial,” CoRR, vol. abs/1604.00772, 2016. [Online]. Available: http://arxiv.org/abs/1604.00772

  35. [35]

    cmaes: A simple yet practical python library for cma-es,

    M. Nomura and M. Shibata, “cmaes: A simple yet practical python library for cma-es,” 2024. [Online]. Available: https://arxiv.org/abs/2402.01373

  36. [36]

    Multiple task optimization with a mixture of controllers for motion generation,

    N. Dehio, R. F. Reinhart, and J. J. Steil, “Multiple task optimization with a mixture of controllers for motion generation,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 9 2015, pp. 6416–6421

  37. [37]

    Sampling-based system identification with active exploration for legged sim2real learning,

    N. Sobanbabu, G. He, T. He, Y. Yang, and G. Shi, “Sampling-based system identification with active exploration for legged sim2real learning,” in 9th Annual Conference on Robot Learning, 2025. [Online]. Available: https://openreview.net/forum?id=UTPBM4dEUS