Too Much of a Good Thing: When sim2real Efforts Impede Policy Learning (And What to Do About It)

Bharath Masetty; Kyle Morgenstein; Luis Sentis; Stephen Welch

arxiv: 2606.02636 · v2 · pith:5YI3IZOEnew · submitted 2026-05-30 · 💻 cs.RO · cs.AI


Pith reviewed 2026-06-28 18:17 UTC · model grok-4.3

keywords: sim2real, policy learning, robotics, simulator lock-in, sim2sim2real, kinematics constraint, reinforcement learning

The pith

Sim2real efforts can misalign incentives in policy learning by imposing real-world constraints too early, leading to simulator lock-in and poor exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that sim2real work, while needed for hardware transfer, has created misaligned incentives that hurt the policy learning stage itself. Early matching to real-world limits locks researchers into particular simulators and restricts how freely policies can explore solutions. This happens because real-world constraints are often unreasonable during initial training. The authors diagnose the issue and propose shifting to a sim2sim2real setup that treats only the robot's kinematics as the binding design constraint.

Core claim

Sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. The proposed remedy is a sim2sim2real paradigm that leverages the robot's kinematics as the sole design constraint.

What carries the argument

sim2sim2real paradigm that uses the robot's kinematics as the sole design constraint

If this is right

Policy training can proceed with greater exploration freedom when real-world constraints are deferred.
Simulator lock-in decreases because designs are no longer forced to match one particular real-world model.
Transfer to hardware remains possible once a policy is learned, using kinematics as the bridge.
Research effort can shift from fidelity matching to developing better exploration methods in unconstrained simulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same early-constraint problem may appear in other simulation-heavy fields such as autonomous driving or molecular design.
Benchmark suites in robotics could be redesigned to separate pure exploration phases from transfer phases.
Kinematics-only simulators might become a standard intermediate training environment before any real-world data is introduced.

Load-bearing premise

The primary bottleneck in current policy learning is the early imposition of real-world constraints rather than reward design, exploration strategies, or simulator fidelity.

What would settle it

An experiment that trains identical policies first in a kinematics-only simulator versus a full sim2real-constrained simulator and measures which set explores more states and transfers with higher success.
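The comparison described above needs two numbers per training condition: how much of the state space a policy visits, and how often it succeeds after transfer. A minimal Python sketch of both metrics follows. The grid-cell coverage measure, the `bin_width` value, the function names, and the toy rollouts are all illustrative assumptions; the paper does not specify a metric.

```python
import math
from typing import Iterable, Sequence


def state_coverage(states: Iterable[Sequence[float]], bin_width: float = 0.1) -> int:
    """Count the distinct grid cells visited (a crude, assumed exploration proxy)."""
    cells = set()
    for state in states:
        # Discretize each coordinate into a cell index of size bin_width.
        cells.add(tuple(math.floor(x / bin_width) for x in state))
    return len(cells)


def transfer_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of hardware trials that succeeded; 0.0 if no trials were run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy 2-D rollouts (hypothetical data, not from the paper).
kinematics_only = [(0.05, 0.05), (0.35, 0.05), (0.75, 0.45), (0.95, 0.95)]
constrained = [(0.05, 0.05), (0.06, 0.07), (0.08, 0.02), (0.15, 0.05)]

print(state_coverage(kinematics_only), state_coverage(constrained))
print(transfer_success_rate([True, False, True, True]))
```

Grid-cell counting scales poorly with state dimension, so a real study would likely need a different coverage estimate; the point here is only that both conditions must be scored with the same metric.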

Figures

Figures reproduced from arXiv: 2606.02636 by Bharath Masetty, Kyle Morgenstein, Luis Sentis, Stephen Welch.

**Figure 1.** The sim2sim2real paradigm. A. Initial policy learning is performed in IsaacLab using a reduced order model. A forward ...

**Figure 2.** A. The current standard sim2real policy learning ...

**Figure 3.** The forward model is trained to estimate the next kinematic state of the robot. The forward model observes the same ...
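Figure 3 describes a forward model that estimates the robot's next kinematic state. As a rough illustration of that idea only, the sketch below fits a linear one-joint forward model by least squares. The linear form, the single joint, the time step `DT`, and the synthetic data are assumptions made for this example; the paper's actual model and its inputs are not given in the text available here.

```python
import math

DT = 0.1  # assumed integration step for the toy kinematic system


def make_dataset(n_steps=50):
    """Roll out q_next = q + DT * u under a sinusoidal command (synthetic data)."""
    data, q = [], 0.0
    for t in range(n_steps):
        u = math.sin(0.5 * t)
        q_next = q + DT * u
        data.append((q, u, q_next))
        q = q_next
    return data


def fit_forward_model(data):
    """Least-squares fit of q_next ~ a * q + b * u via the 2x2 normal equations."""
    s_qq = sum(q * q for q, _, _ in data)
    s_qu = sum(q * u for q, u, _ in data)
    s_uu = sum(u * u for _, u, _ in data)
    t_q = sum(q * y for q, _, y in data)
    t_u = sum(u * y for _, u, y in data)
    det = s_qq * s_uu - s_qu * s_qu
    if abs(det) < 1e-12:
        raise ValueError("features are collinear; cannot fit")
    a = (t_q * s_uu - t_u * s_qu) / det
    b = (s_qq * t_u - s_qu * t_q) / det
    return a, b


def predict_next(a, b, q, u):
    """Predict the next joint position from the current position and command."""
    return a * q + b * u


a, b = fit_forward_model(make_dataset())
print(a, b, predict_next(a, b, 0.2, 1.0))
```

Because the synthetic data is exactly linear, the fit should recover `a` close to 1 and `b` close to `DT`; real robot data would need a richer model and held-out evaluation.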

Original abstract

While sim2real efforts are necessary for effective policy transfer to hardware, there is such a thing as too much of a good thing. We argue that sim2real efforts have led to misaligned incentives with policy learning, resulting in simulator lock in and poor policy exploration due to the unreasonable constraints imposed by the real world. We offer a diagnosis and explanation of the current status of the problem, and propose a potential solution via a sim2sim2real paradigm that leverages the robot's kinematics as the sole design constraint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a possible incentive problem with early real-world constraints in sim2real but supplies no evidence that this is the main bottleneck or that the proposed fix works.


The main thing here is that the authors argue sim2real work has created simulator lock-in by imposing real constraints too soon, which hurts policy exploration, and they suggest a sim2sim2real path that starts with kinematics only.

The paper does a clear job of connecting existing transfer literature to a question about development order. It names how the push for realism can trade off against the freedom needed for good policy search, which is a fair observation even if it is not new.

The soft spot is the complete absence of supporting data or tests. The diagnosis is interpretive, and the kinematics-only proposal is stated without any result showing it improves exploration or avoids new transfer problems. The stress-test note is on target: we have no comparison isolating constraint timing from reward design, exploration limits, or simulator fidelity. Without that, the central claim stays untested.

This is for people already working on sim2real robotics who want to reconsider when realism should enter the pipeline. A reader looking for discussion prompts or alternative framings could get value from it, but it will not change practice on its own.

I would bring it to a reading group to talk through the idea. I would not cite it as a result. It deserves peer review in a venue open to position pieces, since the question is worth raising even if the answer needs experiments the authors have not run.

Referee Report

2 major / 0 minor

Summary. The paper claims that sim2real efforts in robotics have produced misaligned incentives for policy learning, resulting in simulator lock-in and restricted exploration because real-world constraints are imposed too early. It offers a diagnosis of this problem and proposes a sim2sim2real paradigm in which the first simulation stage uses only robot kinematics as the design constraint.

Significance. If the diagnosis and proposed staged-simulation approach hold, the work could encourage robotics RL practitioners to prioritize broader policy exploration in early simulation phases before introducing real-world constraints, potentially yielding more robust policies. The kinematics-only initial stage is a concrete, minimal design choice that could be tested, though the manuscript supplies no evidence that it improves exploration or transfer.

major comments (2)

[Abstract] The diagnosis that sim2real practices cause simulator lock-in and poor exploration is presented without experiments, ablations, or comparative data isolating the timing of real-world constraints from confounding factors such as reward misspecification or intrinsic exploration limits.
[Abstract] The sim2sim2real paradigm is introduced with kinematics as the sole initial constraint, yet no formal definition, algorithm, or preliminary results are supplied to show that removing other constraints materially improves exploration or avoids new transfer failures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments correctly identify that the current manuscript is conceptual in nature and lacks empirical validation for both the diagnosis and the proposed paradigm. We will revise to clarify these aspects while preserving the position-paper intent.


Referee: [Abstract] The diagnosis that sim2real practices cause simulator lock-in and poor exploration is presented without experiments, ablations, or comparative data isolating the timing of real-world constraints from confounding factors such as reward misspecification or intrinsic exploration limits.

Authors: We agree the diagnosis is not supported by new isolating experiments. The manuscript is a position paper that synthesizes observed practices in the field to argue for misaligned incentives; it does not claim to have performed controlled ablations separating constraint timing from reward design or exploration limits. In revision we will explicitly state the conceptual scope in the abstract and add a dedicated discussion section outlining how future work could design experiments to test the lock-in hypothesis while controlling for the listed confounders.
revision: partial

Referee: [Abstract] The sim2sim2real paradigm is introduced with kinematics as the sole initial constraint, yet no formal definition, algorithm, or preliminary results are supplied to show that removing other constraints materially improves exploration or avoids new transfer failures.

Authors: We accept that the proposal requires formalization. The revision will add a dedicated section providing (1) a precise definition of the kinematics-only stage, (2) a high-level algorithm sketch showing how the first simulation phase differs from standard sim2real pipelines, and (3) an explicit discussion of possible new transfer risks together with mitigation strategies. No preliminary results exist in the current draft; we will therefore frame the paradigm as a testable hypothesis rather than an empirically validated method.
revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual argument with no derivation chain or self-referential reductions


The paper advances an interpretive diagnosis of sim2real practices and proposes a kinematics-only sim2sim2real paradigm. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claim rests on field observations and literature interpretation rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. Because there is no mathematical or statistical derivation to reduce to its own inputs, the analysis finds no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the domain assumption that real-world constraints are the dominant limiter of exploration and that kinematics alone suffice as an initial design constraint; no free parameters or new entities with independent evidence are introduced.

axioms (2)

domain assumption: Real-world constraints imposed early in simulation are unreasonable and impede policy exploration.
Invoked in the abstract as the cause of poor exploration.
ad hoc to paper: Robot kinematics are the sole necessary design constraint for the first simulation stage.
Stated as the basis of the proposed sim2sim2real paradigm.

invented entities (1)

sim2sim2real paradigm (no independent evidence)
purpose: Staged simulation that first uses only kinematics, then adds realism.
Introduced to address the diagnosed misalignment.

pith-pipeline@v0.9.1-grok · 5622 in / 1318 out tokens · 18553 ms · 2026-06-28T18:17:55.524451+00:00 · methodology


