pith. sign in

arxiv: 2606.02636 · v2 · pith:5YI3IZOEnew · submitted 2026-05-30 · 💻 cs.RO · cs.AI

Too Much of a Good Thing: When sim2real Efforts Impede Policy Learning (And What to Do About It)

Pith reviewed 2026-06-28 18:17 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords sim2realpolicy learningroboticssimulator lock-insim2sim2realkinematics constraintreinforcement learning
0
0 comments X

The pith

Sim2real efforts can misalign incentives in policy learning by imposing real-world constraints too early, leading to simulator lock-in and poor exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that sim2real work, while needed for hardware transfer, has created misaligned incentives that hurt the policy learning stage itself. Early matching to real-world limits locks researchers into particular simulators and restricts how freely policies can explore solutions. This happens because real-world constraints are often unreasonable during initial training. The authors diagnose the issue and propose shifting to a sim2sim2real setup that treats only the robot's kinematics as the binding design constraint.

Core claim

Sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. The proposed remedy is a sim2sim2real paradigm that leverages the robot's kinematics as the sole design constraint.

What carries the argument

sim2sim2real paradigm that uses the robot's kinematics as the sole design constraint

If this is right

  • Policy training can proceed with greater exploration freedom when real-world constraints are deferred.
  • Simulator lock-in decreases because designs are no longer forced to match one particular real-world model.
  • Transfer to hardware remains possible once a policy is learned, using kinematics as the bridge.
  • Research effort can shift from fidelity matching to developing better exploration methods in unconstrained simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-constraint problem may appear in other simulation-heavy fields such as autonomous driving or molecular design.
  • Benchmark suites in robotics could be redesigned to separate pure exploration phases from transfer phases.
  • Kinematics-only simulators might become a standard intermediate training environment before any real-world data is introduced.

Load-bearing premise

The primary bottleneck in current policy learning is the early imposition of real-world constraints rather than reward design, exploration strategies, or simulator fidelity.

What would settle it

An experiment that trains identical policies first in a kinematics-only simulator versus a full sim2real-constrained simulator and measures which set explores more states and transfers with higher success.

Figures

Figures reproduced from arXiv: 2606.02636 by Bharath Masetty, Kyle Morgenstein, Luis Sentis, Stephen Welch.

Figure 1
Figure 1. Figure 1: The sim2sim2real paradigm. A. Initial policy learning is performed in IsaacLab using a reduced order model. A forward [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A. The current standard sim2real policy learning [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The forward model is trained to estimate the next kinematic state of the robot. The forward model observes the same [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
read the original abstract

While sim2real efforts are necessary for effective policy transfer to hardware, there is such a thing as too much of a good thing. We argue that sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. We offer a diagnosis and explanation of the current status of the problem, and propose a potential solution via a sim2sim2real paradigm that leverages the robot's kinematics as the sole design constraint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that sim2real efforts in robotics have produced misaligned incentives for policy learning, resulting in simulator lock-in and restricted exploration because real-world constraints are imposed too early. It offers a diagnosis of this problem and proposes a sim2sim2real paradigm in which the first simulation stage uses only robot kinematics as the design constraint.

Significance. If the diagnosis and proposed staged-simulation approach hold, the work could encourage robotics RL practitioners to prioritize broader policy exploration in early simulation phases before introducing real-world constraints, potentially yielding more robust policies. The kinematics-only initial stage is a concrete, minimal design choice that could be tested, though the manuscript supplies no evidence that it improves exploration or transfer.

major comments (2)
  1. [Abstract] Abstract: The diagnosis that sim2real practices cause simulator lock-in and poor exploration is presented without experiments, ablations, or comparative data isolating the timing of real-world constraints from confounding factors such as reward misspecification or intrinsic exploration limits.
  2. [Abstract] Abstract: The sim2sim2real paradigm is introduced with kinematics as the sole initial constraint, yet no formal definition, algorithm, or preliminary results are supplied to show that removing other constraints materially improves exploration or avoids new transfer failures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments correctly identify that the current manuscript is conceptual in nature and lacks empirical validation for both the diagnosis and the proposed paradigm. We will revise to clarify these aspects while preserving the position-paper intent.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The diagnosis that sim2real practices cause simulator lock-in and poor exploration is presented without experiments, ablations, or comparative data isolating the timing of real-world constraints from confounding factors such as reward misspecification or intrinsic exploration limits.

    Authors: We agree the diagnosis is not supported by new isolating experiments. The manuscript is a position paper that synthesizes observed practices in the field to argue for misaligned incentives; it does not claim to have performed controlled ablations separating constraint timing from reward design or exploration limits. In revision we will explicitly state the conceptual scope in the abstract and add a dedicated discussion section outlining how future work could design experiments to test the lock-in hypothesis while controlling for the listed confounders. revision: partial

  2. Referee: [Abstract] Abstract: The sim2sim2real paradigm is introduced with kinematics as the sole initial constraint, yet no formal definition, algorithm, or preliminary results are supplied to show that removing other constraints materially improves exploration or avoids new transfer failures.

    Authors: We accept that the proposal requires formalization. The revision will add a dedicated section providing (1) a precise definition of the kinematics-only stage, (2) a high-level algorithm sketch showing how the first simulation phase differs from standard sim2real pipelines, and (3) an explicit discussion of possible new transfer risks together with mitigation strategies. No preliminary results exist in the current draft; we will therefore frame the paradigm as a testable hypothesis rather than an empirically validated method. revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual argument with no derivation chain or self-referential reductions

full rationale

The paper advances an interpretive diagnosis of sim2real practices and proposes a kinematics-only sim2sim2real paradigm. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claim rests on field observations and literature interpretation rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. Because there is no mathematical or statistical derivation to reduce to its own inputs, the analysis finds no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim depends on the domain assumption that real-world constraints are the dominant limiter of exploration and that kinematics alone suffice as an initial design constraint; no free parameters or new entities with independent evidence are introduced.

axioms (2)
  • domain assumption Real-world constraints imposed early in simulation are unreasonable and impede policy exploration
    Invoked in the abstract as the cause of poor exploration
  • ad hoc to paper Robot kinematics are the sole necessary design constraint for the first simulation stage
    Stated as the basis of the proposed sim2sim2real paradigm
invented entities (1)
  • sim2sim2real paradigm no independent evidence
    purpose: Staged simulation that first uses only kinematics then adds realism
    Introduced to address the diagnosed misalignment

pith-pipeline@v0.9.1-grok · 5622 in / 1318 out tokens · 18553 ms · 2026-06-28T18:17:55.524451+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Impact of static friction on sim2real in robotic reinforcement learning,

    X. Hu, Q. Sun, B. He, H. Liu, X. Zhang, C. lu, and J. Zhong, “Impact of static friction on sim2real in robotic reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2503.01255

  2. [2]

    High- performance reinforcement learning on spot: Optimizing simulation parameters with distributional measures,

    A. Miller, F. Yu, M. Brauckmann, and F. Farshidian, “High- performance reinforcement learning on spot: Optimizing simulation parameters with distributional measures,” 2025. [Online]. Available: https://arxiv.org/abs/2504.17857

  3. [3]

    Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control,

    Z. Li, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, “Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control,” 2024. [Online]. Available: https://arxiv.org/abs/2401.16889

  4. [4]

    Workshop on whole-body control and bimanual manipulation: Applications in humanoids and beyond,

    M. Raibert and F. Farshidian, “Workshop on whole-body control and bimanual manipulation: Applications in humanoids and beyond,” 2025, presented at the Workshop on Whole-body Control and Bimanual Ma- nipulation: Applications in Humanoids and Beyond, Robotics: Science and Systems (RSS) 2025

  5. [5]

    Learning robust perceptive locomotion for quadrupedal robots in the wild,

    T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science Robotics, vol. 7, no. 62, Jan. 2022. [Online]. Available: http://dx.doi.org/10.1126/scirobotics.abk2822

  6. [6]

    RMA: Rapid Motor Adaptation for Legged Robots

    A. Kumar, Z. Fu, D. Pathak, and J. Malik, “Rma: Rapid motor adaptation for legged robots,” 2021. [Online]. Available: https://arxiv.org/abs/2107.04034

  7. [7]

    Deepmimic: example-guided deep reinforcement learning of physics-based character skills,

    X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: example-guided deep reinforcement learning of physics-based character skills,”ACM Transactions on Graphics, vol. 37, no. 4, p. 1–14, Jul

  8. [8]

    Graph.37, 4, Article 133 (July 2018), 13 pages

    [Online]. Available: http://dx.doi.org/10.1145/3197517.3201311

  9. [9]

    Amp: Adversarial motion priors for stylized physics-based character control,

    X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa, “Amp: adversarial motion priors for stylized physics-based character control,” ACM Transactions on Graphics, vol. 40, no. 4, p. 1–20, Jul. 2021. [Online]. Available: http://dx.doi.org/10.1145/3450626.3459670

  10. [10]

    Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments,

    M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y . Guo, H. Mazhar, A. Mandlekar, B. Babich, G. State, M. Hutter, and A. Garg, “Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments,”IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3740– 3747, Jun. 2023, arXiv:2301.04195 [cs]. [Online]. Avai...

  11. [11]

    MuJoCo: A physics engine for model-based control,

    E. Todorov, T. Erez, and Y . Tassa, “MuJoCo: A physics engine for model-based control,” in2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2012, pp. 5026–5033, iSSN: 2153-

  12. [12]

    Available: https://ieeexplore.ieee.org/document/6386109

    [Online]. Available: https://ieeexplore.ieee.org/document/6386109

  13. [13]

    Action space design in reinforcement learning for robot motor skills,

    J. Eßer, G. B. Margolis, O. Urbann, S. Kerner, and P. Agrawal, “Action space design in reinforcement learning for robot motor skills,” in 8th Annual Conference on Robot Learning, 2024. [Online]. Available: https://openreview.net/forum?id=GGuNkjQSrk

  14. [14]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1707.06347

  15. [15]

    Learning to walk in minutes using massively parallel deep reinforcement learning,

    N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Proceedings of the 5th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 164. PMLR, 2022, pp. 91–100. [Online]. Available: https://proceedings.mlr.press/v164/rudin22a.html

  16. [16]

    Motion policy networks,

    A. Fishman, A. Murali, C. Eppner, B. Peele, B. Boots, and D. Fox, “Motion policy networks,” 2022. [Online]. Available: https://arxiv.org/abs/2210.12209

  17. [17]

    When is model-free rl actually (latent) model-based? contact estimation and contact-awareness in legged robots,

    K. Morgenstein, R. Gupta, A. Kim, J. Hsin, E. Sturman, S. H. Bang, and L. Sentis, “When is model-free rl actually (latent) model-based? contact estimation and contact-awareness in legged robots,”In Review, 2025

  18. [18]

    Mujoco playground.arXiv preprint arXiv:2502.08844,

    K. Zakka, B. Tabanpour, Q. Liao, M. Haiderbhai, S. Holt, J. Y . Luo, A. Allshire, E. Frey, K. Sreenath, L. A. Kahrs, C. Sferrazza, Y . Tassa, and P. Abbeel, “Mujoco playground,” 2025. [Online]. Available: https://arxiv.org/abs/2502.08844

  19. [19]

    Brax - a differentiable physics engine for large scale rigid body simulation,

    C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax - a differentiable physics engine for large scale rigid body simulation,” 2021. [Online]. Available: http://github.com/google/brax