Too Much of a Good Thing: When sim2real Efforts Impede Policy Learning (And What to Do About It)
Pith reviewed 2026-06-28 18:17 UTC · model grok-4.3
The pith
Sim2real efforts can misalign incentives in policy learning by imposing real-world constraints too early, leading to simulator lock-in and poor exploration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. The proposed remedy is a sim2sim2real paradigm that leverages the robot's kinematics as the sole design constraint.
What carries the argument
sim2sim2real paradigm that uses the robot's kinematics as the sole design constraint
If this is right
- Policy training can proceed with greater exploration freedom when real-world constraints are deferred.
- Simulator lock-in decreases because designs are no longer forced to match one particular real-world model.
- Transfer to hardware remains possible once a policy is learned, using kinematics as the bridge.
- Research effort can shift from fidelity matching to developing better exploration methods in unconstrained simulation.
Where Pith is reading between the lines
- The same early-constraint problem may appear in other simulation-heavy fields such as autonomous driving or molecular design.
- Benchmark suites in robotics could be redesigned to separate pure exploration phases from transfer phases.
- Kinematics-only simulators might become a standard intermediate training environment before any real-world data is introduced.
Load-bearing premise
The primary bottleneck in current policy learning is the early imposition of real-world constraints rather than reward design, exploration strategies, or simulator fidelity.
What would settle it
An experiment that trains identical policies first in a kinematics-only simulator versus a full sim2real-constrained simulator and measures which set explores more states and transfers with higher success.
Figures
read the original abstract
While sim2real efforts are necessary for effective policy transfer to hardware, there is such a thing as too much of a good thing. We argue that sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. We offer a diagnosis and explanation of the current status of the problem, and propose a potential solution via a sim2sim2real paradigm that leverages the robot's kinematics as the sole design constraint.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that sim2real efforts in robotics have produced misaligned incentives for policy learning, resulting in simulator lock-in and restricted exploration because real-world constraints are imposed too early. It offers a diagnosis of this problem and proposes a sim2sim2real paradigm in which the first simulation stage uses only robot kinematics as the design constraint.
Significance. If the diagnosis and proposed staged-simulation approach hold, the work could encourage robotics RL practitioners to prioritize broader policy exploration in early simulation phases before introducing real-world constraints, potentially yielding more robust policies. The kinematics-only initial stage is a concrete, minimal design choice that could be tested, though the manuscript supplies no evidence that it improves exploration or transfer.
major comments (2)
- [Abstract] Abstract: The diagnosis that sim2real practices cause simulator lock-in and poor exploration is presented without experiments, ablations, or comparative data isolating the timing of real-world constraints from confounding factors such as reward misspecification or intrinsic exploration limits.
- [Abstract] Abstract: The sim2sim2real paradigm is introduced with kinematics as the sole initial constraint, yet no formal definition, algorithm, or preliminary results are supplied to show that removing other constraints materially improves exploration or avoids new transfer failures.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. The comments correctly identify that the current manuscript is conceptual in nature and lacks empirical validation for both the diagnosis and the proposed paradigm. We will revise to clarify these aspects while preserving the position-paper intent.
read point-by-point responses
-
Referee: [Abstract] Abstract: The diagnosis that sim2real practices cause simulator lock-in and poor exploration is presented without experiments, ablations, or comparative data isolating the timing of real-world constraints from confounding factors such as reward misspecification or intrinsic exploration limits.
Authors: We agree the diagnosis is not supported by new isolating experiments. The manuscript is a position paper that synthesizes observed practices in the field to argue for misaligned incentives; it does not claim to have performed controlled ablations separating constraint timing from reward design or exploration limits. In revision we will explicitly state the conceptual scope in the abstract and add a dedicated discussion section outlining how future work could design experiments to test the lock-in hypothesis while controlling for the listed confounders. revision: partial
-
Referee: [Abstract] Abstract: The sim2sim2real paradigm is introduced with kinematics as the sole initial constraint, yet no formal definition, algorithm, or preliminary results are supplied to show that removing other constraints materially improves exploration or avoids new transfer failures.
Authors: We accept that the proposal requires formalization. The revision will add a dedicated section providing (1) a precise definition of the kinematics-only stage, (2) a high-level algorithm sketch showing how the first simulation phase differs from standard sim2real pipelines, and (3) an explicit discussion of possible new transfer risks together with mitigation strategies. No preliminary results exist in the current draft; we will therefore frame the paradigm as a testable hypothesis rather than an empirically validated method. revision: yes
Circularity Check
No circularity: conceptual argument with no derivation chain or self-referential reductions
full rationale
The paper advances an interpretive diagnosis of sim2real practices and proposes a kinematics-only sim2sim2real paradigm. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claim rests on field observations and literature interpretation rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. Because there is no mathematical or statistical derivation to reduce to its own inputs, the analysis finds no circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Real-world constraints imposed early in simulation are unreasonable and impede policy exploration
- ad hoc to paper Robot kinematics are the sole necessary design constraint for the first simulation stage
invented entities (1)
-
sim2sim2real paradigm
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Impact of static friction on sim2real in robotic reinforcement learning,
X. Hu, Q. Sun, B. He, H. Liu, X. Zhang, C. lu, and J. Zhong, “Impact of static friction on sim2real in robotic reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2503.01255
-
[2]
A. Miller, F. Yu, M. Brauckmann, and F. Farshidian, “High- performance reinforcement learning on spot: Optimizing simulation parameters with distributional measures,” 2025. [Online]. Available: https://arxiv.org/abs/2504.17857
-
[3]
Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control,
Z. Li, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, “Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control,” 2024. [Online]. Available: https://arxiv.org/abs/2401.16889
-
[4]
Workshop on whole-body control and bimanual manipulation: Applications in humanoids and beyond,
M. Raibert and F. Farshidian, “Workshop on whole-body control and bimanual manipulation: Applications in humanoids and beyond,” 2025, presented at the Workshop on Whole-body Control and Bimanual Ma- nipulation: Applications in Humanoids and Beyond, Robotics: Science and Systems (RSS) 2025
2025
-
[5]
Learning robust perceptive locomotion for quadrupedal robots in the wild,
T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science Robotics, vol. 7, no. 62, Jan. 2022. [Online]. Available: http://dx.doi.org/10.1126/scirobotics.abk2822
-
[6]
RMA: Rapid Motor Adaptation for Legged Robots
A. Kumar, Z. Fu, D. Pathak, and J. Malik, “Rma: Rapid motor adaptation for legged robots,” 2021. [Online]. Available: https://arxiv.org/abs/2107.04034
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Deepmimic: example-guided deep reinforcement learning of physics-based character skills,
X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: example-guided deep reinforcement learning of physics-based character skills,”ACM Transactions on Graphics, vol. 37, no. 4, p. 1–14, Jul
-
[8]
Graph.37, 4, Article 133 (July 2018), 13 pages
[Online]. Available: http://dx.doi.org/10.1145/3197517.3201311
-
[9]
Amp: Adversarial motion priors for stylized physics-based character control,
X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa, “Amp: adversarial motion priors for stylized physics-based character control,” ACM Transactions on Graphics, vol. 40, no. 4, p. 1–20, Jul. 2021. [Online]. Available: http://dx.doi.org/10.1145/3450626.3459670
-
[10]
Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments,
M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y . Guo, H. Mazhar, A. Mandlekar, B. Babich, G. State, M. Hutter, and A. Garg, “Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments,”IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3740– 3747, Jun. 2023, arXiv:2301.04195 [cs]. [Online]. Avai...
-
[11]
MuJoCo: A physics engine for model-based control,
E. Todorov, T. Erez, and Y . Tassa, “MuJoCo: A physics engine for model-based control,” in2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2012, pp. 5026–5033, iSSN: 2153-
2012
-
[12]
Available: https://ieeexplore.ieee.org/document/6386109
[Online]. Available: https://ieeexplore.ieee.org/document/6386109
-
[13]
Action space design in reinforcement learning for robot motor skills,
J. Eßer, G. B. Margolis, O. Urbann, S. Kerner, and P. Agrawal, “Action space design in reinforcement learning for robot motor skills,” in 8th Annual Conference on Robot Learning, 2024. [Online]. Available: https://openreview.net/forum?id=GGuNkjQSrk
2024
-
[14]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[15]
Learning to walk in minutes using massively parallel deep reinforcement learning,
N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Proceedings of the 5th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 164. PMLR, 2022, pp. 91–100. [Online]. Available: https://proceedings.mlr.press/v164/rudin22a.html
2022
-
[16]
A. Fishman, A. Murali, C. Eppner, B. Peele, B. Boots, and D. Fox, “Motion policy networks,” 2022. [Online]. Available: https://arxiv.org/abs/2210.12209
-
[17]
When is model-free rl actually (latent) model-based? contact estimation and contact-awareness in legged robots,
K. Morgenstein, R. Gupta, A. Kim, J. Hsin, E. Sturman, S. H. Bang, and L. Sentis, “When is model-free rl actually (latent) model-based? contact estimation and contact-awareness in legged robots,”In Review, 2025
2025
-
[18]
Mujoco playground.arXiv preprint arXiv:2502.08844,
K. Zakka, B. Tabanpour, Q. Liao, M. Haiderbhai, S. Holt, J. Y . Luo, A. Allshire, E. Frey, K. Sreenath, L. A. Kahrs, C. Sferrazza, Y . Tassa, and P. Abbeel, “Mujoco playground,” 2025. [Online]. Available: https://arxiv.org/abs/2502.08844
-
[19]
Brax - a differentiable physics engine for large scale rigid body simulation,
C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax - a differentiable physics engine for large scale rigid body simulation,” 2021. [Online]. Available: http://github.com/google/brax
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.