arxiv: 2606.27475 · v1 · pith:7NH2UPSXnew · submitted 2026-06-25 · 💻 cs.RO · cs.LG

Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience

Pith reviewed 2026-06-29 01:54 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords support-constrained RL · real-to-sim-to-real · dexterous manipulation · flow steering · policy improvement · multi-fingered robots · simulation constraints · robotic hands

The pith

Support-constrained RL in simulation improves real-world robot policies without further real-world experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SCORE, a framework that performs reinforcement learning entirely in simulation to refine policies first trained on real robot data. It constrains the simulated actions to those the real-data policy can already generate, using flow steering to avoid unsafe behaviors that exploit simulation mismatches. On eight dexterous multi-fingered manipulation tasks, this raises average success from 37.8 percent to 89.9 percent (the best baseline reaches 59.5 percent) and cuts the steps needed to succeed by 36.8 percent. A sympathetic reader would care because it offers a low-cost path to better robot skills after the initial real-world data collection, without needing more hardware time or distillation.

Core claim

By constraining reinforcement learning in simulation to the support of a generative policy pretrained on real data, implemented through flow steering, the optimized policies transfer to hardware and deliver higher success rates plus faster task completion across eight real-world dexterous manipulation tasks, all without real-world RL or changes to the base policy.

What carries the argument

The support constraint, implemented via flow steering, which restricts actions during simulated RL to the support of the real-data generative policy, that is, to actions the policy can already produce.
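
To make the mechanism concrete, here is a minimal sketch of steering a frozen flow policy through its noise input. Everything in it is an illustrative assumption rather than the paper's implementation: the one-dimensional action, the two-mode toy velocity field, the function names `base_flow_policy` and `steer`, and the brute-force search that stands in for the RL-trained steering policy. The point is only that, because the steering step chooses the noise z and never the action itself, every action it can emit is one the base policy can already produce.

```python
def base_flow_policy(obs, z, steps=10):
    """Hypothetical frozen base policy: decode noise z into an action.

    Euler-integrates a toy straight-line flow whose velocity points at the
    nearest of two behaviour modes, obs - 0.5 and obs + 0.5.
    """
    x = z
    for k in range(steps):
        target = obs + 0.5 if x >= obs else obs - 0.5
        # dt * v, with dt = 1/steps and v = (target - x) / (1 - k/steps)
        x += (target - x) / (steps - k)
    return x


def steer(obs, goal, candidates=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Hypothetical steering step: choose the noise, not the action.

    A brute-force search stands in for the learned steering policy.
    """
    return min(candidates, key=lambda z: abs(base_flow_policy(obs, z) - goal))


obs, goal = 0.0, 0.9
z_star = steer(obs, goal)
steered_action = base_flow_policy(obs, z_star)  # lands on the mode nearest the goal
unconstrained_action = goal                     # what an unconstrained optimizer could pick
print(steered_action, unconstrained_action)
```

In this toy the goal 0.9 lies outside what the base policy can produce, so steering settles for the nearer mode at about 0.5 instead of jumping to 0.9. An unconstrained learner would be free to pick 0.9, which is the kind of action that may look good only because of simulator error.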

If this is right

  • Policy improvement after initial real data collection can occur entirely in simulation.
  • The process works with sparse rewards and requires no distillation step.
  • The base policy stays unchanged while a separate improved policy is learned.
  • Simulation becomes usable for safe real-to-sim-to-real transfer on manipulation tasks when actions are limited to the real policy support.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same support constraint might apply to other robot skills where simulation gaps cause problems, such as assembly or locomotion.
  • The method could lower the total real-world data needed across repeated policy refinements.
  • Combining the constraint with different base policy training approaches might show how broadly the gains hold.

Load-bearing premise

That keeping actions inside the real policy's support prevents the learner from exploiting simulation inaccuracies badly enough to block transfer, while still leaving room for useful policy gains.

What would settle it

Running unconstrained RL in simulation on the same tasks and finding that its policies achieve comparable or higher real-world success rates than the constrained SCORE versions would show the support constraint is not required.

Figures

Figures reproduced from arXiv: 2606.27475 by Abhishek Gupta, Anusha Nagabandi, Mustafa Mukadam, Raymond Yu, William Huey.

Figure 1: SCORE framework. SCORE starts from any real-world flow matching policy, which may have been trained on successes, play data, failures, and retry behaviors. The flow policy is brought into simulation, where SCORE learns to improve the policy using flow steering, a support-constrained RL algorithm. Finally, our training framework enables direct deployment of the steering policy in the real world, preserving …
Figure 2: Toy Example. The real-world base policy avoids barriers, but performs roundabout trajectories that sometimes miss the goal. In simulation, unconstrained RL exploits dynamics mismatch to move directly towards the goal, but this fails in the real world. As shown by the red arrows, distributional regularization allows for small deviations from the base policy, refining imprecisions but preserving slow motion …
Figure 3: Real-world tasks. We evaluate on eight contact-rich dexterous manipulation tasks spanning grasping, pouring, pushing, reorientation, and object placement. SCORE-DSRL and SCORE, respectively. DSRL performs pure latent steering: it optimizes only the flow noise z, so every action lies within the model-induced set A_base(o) above, imposing a hard model-induced support constraint. RFS additionally adds a small …
Figure 4: Average real-world success rate across all 8 tasks. SCORE and SCORE-DSRL outperform all baselines, while FPO and RialTo learn dangerous actions, and Residual-RL is constrained to suboptimal behaviors.
Figure 5: Speed improvement of SCORE and Residual-RL over Base, averaged across 8 tasks. Motivated by our discussion in Section 3.2, we now empirically investigate how support constraints overcome the limitations of unconstrained and distributionally constrained optimization. Does unconstrained optimization in simulation result in dangerous behavior? To test our hypothesis about unconstrained optimization, we optim…
Figure 6: Distributional Constraints Introduce a Tradeoff Between Improvement and Transferability. The left plot shows the simulated performance (circles) and real world performance (diamonds) of RialTo policies trained with 5 different levels of BC regularization during BC-PPO. A value of 10 leads to collapse in simulation, while a larger value of 100 learns a dangerous strategy far from the base policy distributio…
Figure 7: Data-size ablation. More data enables stronger support-constrained improvement. Can more demonstrations improve steering? We train base policies on the Cube Pinch task with varying numbers of demonstrations and apply SCORE on each
Figure 10: Adaptation experiment. SCORE adapts the base policy to shifted settings only when its support contains compatible behavior. (Left) Steering the bottle-grasp prior toward carrot grasping improves real-world success from 22% to 67% by reusing compatible pinches already inside the prior, while the cup-grasp prior fails, as it lacks the behavior. (Right) With distractor cubes added, SCORE improves over the br…
Figure 8: Cube Pinch retry data. Retry data leaves the base policy unchanged, but lets SCORE improve from 40% to 100% success after simulation steering.
Figure 9: Play data ablation. With only right-side coverage, Base and SCORE fail on the left; adding left play data lets SCORE improve. Can SCORE adapt to unseen objects and distractors? Our previous experiments test the ability of SCORE to improve policies in fixed environments, but real world tasks are constantly changing. Below, we train SCORE in a simulation environment unseen by the base policy, then deploy th…
Figure 11: Asymmetric actor-critic ablation. Using an asymmetric critic improves sample efficiency and final simulated success while keeping the actor observation and deployment policy unchanged. Evaluation is performed over 4096 environments every 40M steps. when the post-training environment is significantly out of distribution. This suggests that pretraining should aim not just for the strongest base policy, but …
Figure 12: Percent improvement in time to completion of SCORE and Residual-RL over Base. SCORE improves substantially over the base policy and beats Residual-RL, the nearest baseline, on all tasks.
Figure 13: Real-world Robot Setup
Figure 14: Retry Data Collection. For the Cube Pinch and Bottle Grasp tasks, drops and misses followed by retries are captured in the dataset to encourage learning retry behavior. B.3 Data Collection. We collect real-world demonstrations using an Apple Vision Pro teleoperation interface. The system tracks the operator’s hand motion and end-effector motion using keypoints, and retargets these motions to the Franka arm…
Figure 15: FPO failure modes. FPO can exploit simulator-specific dynamics and drift outside the real-world policy support, producing unsafe or non-transferable behaviors.
Figure 16: RialTo failure modes. BC regularization can limit the amount of drift from the support of the real-world policy, but can also limit improvement and retain imprecise base-policy behaviors.
Figure 17: SCORE successful rollouts. SCORE improves task performance while maintaining real-world-feasible behaviors within the base policy support.
Figure 18: SCORE failure modes. SCORE failures are primarily caused by contact sensitivity and task precision demands, rather than unsafe support drift.
Figure 19: Multi-task base policy failure modes. Although multi-task training expands the behavior support of the base policy, direct deployment can suffer from interference between task-specific behaviors. The multi-task base policy sometimes applies behavior modes from the wrong task, such as Credit Card Pick-like high grasps during object grasping or Bottle Grasp-like motions during Credit Card Pick, leading to u…
Figure 20: Multi-task SCORE enables cross-task behavior reuse. After steering a shared multi-task prior, SCORE can select task-appropriate behaviors while also reusing useful strategies across tasks. The top two rows show successful Bottle Grasp and Credit Card Pick executions. The bottom rows show Cube Pinch under broader object placements, where multi-task SCORE can reuse Credit Card Pick-like sliding behavior and…
Figure 21: Visual representation of Proposition E.1. While π_real successfully completes the task, π_sim exploits a transition that quickly leads to the goal in simulation, but causes the policy to get stuck when deployed in the real world. Proof Sketch. We proceed by constructing a discrete MDP with 5 states, as shown in
Figure 22: Visual representation of Proposition E.2. π adds a residual of ϵ to π_real, but this is not sufficient to recover the optimal policy. For a small enough ϵ and large enough δ, distributional constraints ensure realizability. In practice, however, too much regularization prevents meaningful improvement, while too little allows the policy to exploit the dynamics gap. In many settings, there is no level of r…
Figure 23: Visual representation of how SCORE addresses the limitations of distributional constraints shown in
Original abstract

Robots trained on real world data tend to be imprecise, slow, and brittle to perturbations. Improving these policies with reinforcement learning (RL) is an appealing alternative, but this process often requires expensive training in the real world. Performing policy improvement in simulation instead provides a far cheaper alternative, but unconstrained RL in simulation can exploit contact and dynamics mismatches, resulting in unsafe behaviors that do not transfer to hardware. Common forms of regularization can furthermore limit improvement by overconstraining to an imperfect behavior prior. In this work, we propose Support-Constrained Off-Domain REinforcement (SCORE), a real-to-sim-to-real framework that constrains RL in simulation to the support of a generative policy pretrained on real data. We instantiate this constraint through flow steering, restricting SCORE to actions the base policy can already produce, which ensures transferable behaviors while maximizing policy improvement. Improving a policy with SCORE requires minimal effort: it learns from sparse rewards, avoids distillation, and leaves the base policy untouched. Across eight real-world dexterous multi-fingered robotic manipulation tasks, SCORE improves average success rate from 37.8% to 89.9%, compared to 59.5% for the best baseline, and reaches success in 36.8% fewer steps than the base policy. Ultimately, through extensive experiments and ablations, we show that simulation can substantially improve real-world manipulation policies when policy optimization is appropriately constrained, introducing a new paradigm for real-to-sim-to-real policy improvement. Videos and code are available at https://weirdlabuw.github.io/score/.
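
The headline figures in the abstract can be put on a common footing with a few lines of arithmetic. The sketch below uses only the four percentages quoted there. The variable names are illustrative, and the derived quantities (percentage-point gains, relative gain, share of failures removed) are a restatement of those percentages, not numbers reported by the authors.

```python
base, score, best_baseline = 37.8, 89.9, 59.5  # average success rates (%), from the abstract
fewer_steps = 36.8                              # % fewer steps to success than the base policy

gain_over_base = score - base                       # percentage points
gain_over_baseline = score - best_baseline          # percentage points
relative_gain = score / base                        # multiple of the base success rate
failure_reduction = (score - base) / (100 - base)   # share of base-policy failures removed
steps_ratio = 1 - fewer_steps / 100                 # steps used relative to the base policy

print(f"{gain_over_base:.1f} points over the base policy")
print(f"{gain_over_baseline:.1f} points over the best baseline")
print(f"{relative_gain:.2f}x the base success rate")
print(f"{failure_reduction:.1%} of base-policy failures removed")
print(f"{steps_ratio:.3f} of the base policy's steps to success")
```

Reading the gain as a reduction in failures (from 62.2% to 10.1%) is often more informative than the raw success delta when the base rate is low, which is why both views are computed.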

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces Support-Constrained Off-Domain REinforcement (SCORE), a real-to-sim-to-real framework that performs RL in simulation while constraining actions via flow steering to the support of a generative policy pretrained on real data. This is claimed to enable substantial policy improvement on real hardware without real-world RL experience, without distillation, and without altering the base policy. The central empirical result is that across eight real-world dexterous multi-fingered manipulation tasks, SCORE raises average success rate from 37.8% (base policy) to 89.9%, outperforming the best baseline at 59.5%, while also reaching success in 36.8% fewer steps; the work states that extensive experiments and ablations support the approach.

Significance. If the reported results and ablations hold, the work is significant for robotics because it offers a concrete, low-effort method to leverage simulation for real policy improvement while mitigating exploitation of the sim-to-real gap. The provision of code and videos is a positive factor for reproducibility and verification of the claimed gains.

minor comments (2)
  1. [Abstract] The quantitative claims (e.g., 89.9% success, 36.8% fewer steps) would be strengthened by a brief indication of trial counts, error bars, or statistical testing, even at the abstract level.
  2. The manuscript states that the base policy is left untouched and only sparse rewards are used; a minor clarification on how the final deployed policy is obtained (e.g., whether it is the improved sim policy or a combination) would aid clarity.
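
As an illustration of the first comment, a success rate can be reported together with a binomial confidence interval. The sketch below computes a Wilson score interval. The function name and the example of 18 successes in 20 trials are hypothetical: the abstract gives no trial counts, and the paper may use a different procedure.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical example: 18 successes out of 20 real-world trials on one task.
low, high = wilson_interval(18, 20)
print(f"observed 90.0%, interval {low:.1%} to {high:.1%}")
```

With trial counts this small the interval is wide, which is the substance of the referee's request: a point estimate alone does not show how much of a reported gap could be sampling noise.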

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the work, recognition of its potential significance for robotics, and recommendation for minor revision. We are pleased that the reproducibility elements (code and videos) were noted favorably.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an empirical real-to-sim-to-real RL method (SCORE) that constrains simulation rollouts to the support of a real-data generative policy via flow steering. All load-bearing claims consist of reported success-rate deltas and step-count reductions measured on eight physical tasks; these rest on external experimental outcomes rather than any derivation, fitted parameter, or self-citation that reduces the result to its own inputs by construction. No equations, ansatzes, or uniqueness theorems appear in the provided text that would trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The core idea of support constraint and flow steering is presented as a modeling choice rather than a derived quantity.

pith-pipeline@v0.9.1-grok · 5828 in / 1102 out tokens · 28043 ms · 2026-06-29T01:54:37.753755+00:00 · methodology

