arxiv: 2606.18953 · v1 · pith:T7FSR75Pnew · submitted 2026-06-17 · 💻 cs.RO

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Pith reviewed 2026-06-26 21:04 UTC · model grok-4.3

classification 💻 cs.RO
keywords residual reinforcement learning · sim-to-real transfer · vision-language-action · object-centric · zero-shot transfer · robot manipulation

The pith

An object-centric residual RL policy trained in simulation transfers zero-shot to raise real-world VLA success rates from 42% to 76%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a corrective reinforcement learning policy can be trained entirely in simulation on top of a frozen vision-language-action model. It uses object poses to create a compact observation space that stays consistent across domains. Replaying the original teleoperation demonstrations in simulation aligns the sim and real VLAs, after which noise injection and dropout during training allow the residual policy to transfer directly to the physical robot. This raises task success without real-world RL, privileged-state distillation, or new demonstrations, and the resulting trajectories can retrain the base VLA for further gains.

Core claim

An object-centric residual RL policy trained in simulation on object poses after replaying real teleoperation data to align the sim VLA, with pose noise injection and dropout, transfers zero-shot to the real Franka robot and lifts success from 42% to 76% on five manipulation tasks while enabling self-improvement of the base model from the improved rollouts.

What carries the argument

Object-centric residual RL that refines VLA actions from object-pose observations in a compact space aligned between simulation and reality.
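
To make the mechanism concrete, here is a minimal Python sketch of how a residual correction could be layered on a frozen base action, and how pose observations could be corrupted during simulation training. It is an illustration only: the function names `corrupt_pose` and `compose_action`, the noise scale, the dropout probability, the hold-last-pose dropout behavior, and the residual scaling are all assumptions, not details taken from the paper.

```python
import random


def corrupt_pose(pose, noise_std=0.01, dropout_prob=0.1, last_pose=None, rng=random):
    """Mimic an imperfect pose estimator during sim training (assumed scheme).

    With probability dropout_prob the fresh estimate is dropped and the last
    observed pose is reused (zeros if none exists); otherwise zero-mean
    Gaussian noise is added to every pose component.
    """
    if rng.random() < dropout_prob:
        return list(last_pose) if last_pose is not None else [0.0] * len(pose)
    return [p + rng.gauss(0.0, noise_std) for p in pose]


def compose_action(base_action, residual, scale=0.1, limit=1.0):
    """Add a bounded residual correction to the frozen base action.

    The residual is clipped to [-1, 1], scaled, added to the base action,
    and the sum is clipped to the action limit. Scale and limit are
    hypothetical values chosen for illustration.
    """
    out = []
    for a, r in zip(base_action, residual):
        r = max(-1.0, min(1.0, r))
        out.append(max(-limit, min(limit, a + scale * r)))
    return out
```

Keeping the residual small relative to the base action is one common design choice in residual RL: the base policy still determines the overall motion and the learned correction only nudges it.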

If this is right

  • Success rate on real hardware rises from 42% to 76% across five tasks with zero real-world training.
  • Improved rollouts enable retraining the base VLA without collecting new teleoperation data.
  • The approach sidesteps both privileged-state distillation and direct visual domain-gap bridging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If reliable object-pose estimation is available, the method could extend to tasks where pure visual feedback lacks the precision needed for contact-rich actions.
  • The self-improvement loop suggests repeated cycles of real rollouts feeding back into sim residual training could produce ongoing gains in VLA robustness.

Load-bearing premise

Replaying the same teleoperation demonstrations in simulation produces a sim VLA aligned enough with the real-world VLA that a pose-based residual policy trained with noise will transfer despite other domain differences.

What would settle it

Deploying the residual policy on the real robot and measuring no increase in success rate beyond the base VLA's 42% on the same five tasks.

Figures

Figures reproduced from arXiv: 2606.18953 by Heecheol Kim, Jaegul Choo, Katsushi Ikeuchi, Kinam Kim, Namiko Saito, Yasuyuki Matsushita.

Figure 1: Object-centric residual RL for zero-shot sim-to-real VLA enhancement. The base VLA fails on the real robot (left). A residual policy trained purely in simulation (middle) is added zero-shot to recover task success on the same real-robot setup (right).

Figure 2: Overview of the object-centric residual RL pipeline.

Figure 3: Real (top) and simulated (bottom) environments for all five evaluation tasks.

Figure 5: The residual corrects the base action toward the goal when misaligned.

Figure 6: (a) Performance improvement on π0.5 [7], demonstrating compatibility with different VLA backbones. (b) Sim-to-real transfer across observation spaces; the object-centric design transfers most effectively. (c, d) SFT on residual-corrected rollouts improves success rate and reduces episode length.

Figure 7: (a) Cosine similarity between the residual action and the goal direction, conditioned on base action alignment. The residual corrects more strongly when the base deviates. (b) Episode length comparison between base and residual-corrected policies (success episodes). The residual consistently reduces completion time by 9–22%. Error bars denote standard error of the mean across timesteps (a) and episodes (b)…

Figure 8: Realistic simulation rendering (right in each pair) vs. real-world camera view (left). The …

Figure 9: Sim-to-real behavioral transfer of the object-centric residual policy. All five tasks are shown; left two columns are simulation, right two are real-robot deployment. The residual, trained only in simulation, learns task-specific corrections to base VLA failure modes: downward correction during Cube Lift approach, lateral alignment for Pick-and-Place, accurate grasp positioning for Stack Cube, corrective p…

Figure 10: Emergent behaviors from residual RL. Each row shows four sequential keyframes (left to right in time) from a successful real-robot rollout with the residual policy. The residual discovers task-specific strategies that are absent from the demonstrations used to train the base VLA: pre-rotating the cube before grasp (Cube Lift and Pick-and-Place), corrective push toward the cube when the grasp is misaligned…

Figure 11: FoundationPose [20] pose tracking overlaid on real-robot RGB frames. Each row shows a different task (Cube Lift, Pick-and-Place, Stack Cube, Stand Cup Up, and Close Drawer) at evenly spaced timesteps during a successful episode. For each object, the estimated 6-DoF pose is visualized by projecting the object's mesh outline into the camera frame, with body-frame axes (red, green, blue) drawn at the object…

Figure 12: Representative failure modes on the real robot. Each row shows four uniformly sampled …
Abstract

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an object-centric residual RL framework to enhance Vision-Language-Action (VLA) models for zero-shot sim-to-real transfer in robotic manipulation. By replaying teleoperation demonstrations in simulation to train a matching sim VLA, training a residual corrective policy on object-pose observations with noise injection and dropout, and deploying the residual on the real VLA, the approach claims to raise aggregate success from 42% to 76% across five tasks on a Franka Research 3 robot while enabling self-improvement of the base VLA from the improved rollouts.

Significance. If the sim-to-real VLA alignment holds, the method offers a practical route to robustify imitation-learned VLAs without real-world RL, privileged-state distillation, or image-domain adaptation. The object-centric observation space and the closed-loop self-improvement pathway are concrete strengths that could reduce reliance on teleoperation for iterative improvement.

major comments (2)
  1. [Abstract] Abstract and methods description: the zero-shot transfer claim rests on the unquantified assumption that the residual errors of the sim VLA (trained by replaying identical teleop trajectories) are sufficiently close to those of the real VLA once object poses are observed; no intermediate metrics (action KL divergence, per-timestep error histograms, or matched-pose success-rate gap between sim and real VLAs) are reported to support this alignment.
  2. [Experimental evaluation] Experimental evaluation: the reported aggregate improvement (42% to 76%) is presented without per-task trial counts, standard deviations, or statistical tests, leaving open the possibility that variability or task-specific confounds drive the result rather than the residual policy itself.
minor comments (1)
  1. [Abstract] The abstract opens with a rhetorical question; converting it to a declarative statement would align better with conventional abstract style.
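
One of the intermediate metrics named in major comment 1, action KL divergence, could be estimated roughly as in the following Python sketch. It assumes that actions from the real and sim VLAs have been logged at matched object poses, and it fits an independent Gaussian to each action dimension, which is a simplification. The function name `gaussian_kl_per_dim` and the data layout (a list of per-timestep action vectors) are hypothetical and not taken from the paper.

```python
import math
from statistics import mean, pvariance


def gaussian_kl_per_dim(real_actions, sim_actions, eps=1e-8):
    """Return KL(real || sim) for each action dimension.

    Each argument is a list of action vectors (one per logged timestep).
    Every dimension is summarized by a fitted 1-D Gaussian, so the result
    is only as good as that Gaussian assumption.
    """
    kls = []
    # zip(*actions) transposes the log into one tuple per action dimension.
    for real_d, sim_d in zip(zip(*real_actions), zip(*sim_actions)):
        mu_p, var_p = mean(real_d), pvariance(real_d) + eps
        mu_q, var_q = mean(sim_d), pvariance(sim_d) + eps
        # Closed-form KL divergence between two univariate Gaussians.
        kl = 0.5 * (math.log(var_q / var_p)
                    + (var_p + (mu_p - mu_q) ** 2) / var_q
                    - 1.0)
        kls.append(kl)
    return kls
```

Values near zero in every dimension would be consistent with aligned sim and real VLAs under this crude model; large values in one dimension would point to where the replayed demonstrations fail to reproduce real behavior.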

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to strengthen the manuscript.

  1. Referee: [Abstract] Abstract and methods description: the zero-shot transfer claim rests on the unquantified assumption that the residual errors of the sim VLA (trained by replaying identical teleop trajectories) are sufficiently close to those of the real VLA once object poses are observed; no intermediate metrics (action KL divergence, per-timestep error histograms, or matched-pose success-rate gap between sim and real VLAs) are reported to support this alignment.

    Authors: We agree that explicit quantification of sim-real VLA alignment would better support the zero-shot claim. Replaying identical teleoperation trajectories in simulation is designed to produce matching error distributions when conditioned on object poses, which serve as the domain-invariant input. However, the original submission did not include the suggested intermediate metrics. In the revised manuscript we will add action KL divergence, per-timestep error histograms, and matched-pose success-rate gaps computed from the trained sim and real VLAs to directly demonstrate alignment. revision: yes

  2. Referee: [Experimental evaluation] Experimental evaluation: the reported aggregate improvement (42% to 76%) is presented without per-task trial counts, standard deviations, or statistical tests, leaving open the possibility that variability or task-specific confounds drive the result rather than the residual policy itself.

    Authors: We acknowledge that aggregate reporting alone leaves room for questions about variability. The original manuscript summarized results for conciseness, but the revised version will expand the experimental section to report per-task success rates, the number of trials conducted per task, standard deviations across repeated evaluations, and statistical significance tests (e.g., paired t-tests) comparing the base VLA against the residual-augmented policy. These additions will confirm consistency across tasks. revision: yes
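
To illustrate why per-task trial counts matter for the second point, the following Python sketch computes a Wilson score interval for a single success rate. It is a generic statistical tool, not the authors' analysis; the function name `wilson_interval` is hypothetical, and any trial counts passed to it would have to come from the revised manuscript, since none are reported in the text reviewed here.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95% interval. Returns a
    (lower, upper) tuple of proportions in [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

If the intervals for the base VLA and the residual-augmented policy overlap heavily on a given task, that task alone does not support the claimed improvement, which is the concern the referee raises about aggregate reporting.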

Circularity Check

0 steps flagged

No circularity; empirical method validated by real-robot experiments


The paper describes an empirical pipeline: replay teleop demos in sim to obtain a sim VLA, train an object-pose residual RL policy in sim with noise/dropout, then deploy the residual zero-shot on the real VLA. No equations, fitted parameters, or self-citations are presented as a derivation that reduces to its own inputs by construction. The reported 42%→76% improvement is an external benchmark measured on physical hardware, not a statistical renaming of training data. The sim-real alignment assumption is an unproven hypothesis whose validity is tested (or not) by the end-to-end outcome rather than being presupposed by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; none can be identified from the given text.

pith-pipeline@v0.9.1-grok · 5804 in / 1287 out tokens · 34465 ms · 2026-06-26T21:04:38.751466+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 11 linked inside Pith

  [1] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2023.
  [2] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
  [3] Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), 2024.
  [4] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, et al. Octo: An open-source generalist robot policy. In Robotics: Science and Systems (RSS), 2024.
  [5] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), 2024.
  [6] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
  [7] Physical Intelligence, K. Black, N. Brown, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
  [8] NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
  [9] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  [10] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023.
  [11] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020.
  [12] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems (RSS), 2023.
  [13] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.
  [14] T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
  [15] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. In IEEE International Conference on Robotics and Automation (ICRA), 2019.
  [16] L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement – residual RL for precise assembly. In Conference on Robot Learning (CoRL), 2024.
  [17] L. Ankile, Z. Jiang, R. Duan, G. Shi, P. Abbeel, and A. Nagabandi. ResFiT: Residual off-policy RL for finetuning behavior cloning policies. arXiv preprint arXiv:2509.19301, 2025.
  [18] W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Y. Xie, et al. Self-improving vision-language-action models with data generation via residual RL. arXiv preprint arXiv:2511.00091, 2025.
  [19] G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Machine Learning, 110:2419–2468, 2021.
  [20] B. Wen, W. Yang, J. Kautz, and S. Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  [21] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  [22] M. Torne, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal. Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. In Robotics: Science and Systems (RSS), 2024.
  [23] T. Jülg, W. Burgard, and F. Walter. Refined policy distillation: From VLA generalists to RL experts. arXiv preprint arXiv:2503.05833, 2025.
  [24] W. Zhao, J. Peña Queralta, and T. Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. arXiv preprint arXiv:2009.13303, 2020.
  [25] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
  [26] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. arXiv preprint arXiv:1710.06537, 2018.
  [27] A. Handa, A. Allshire, V. Makoviychuk, A. Petrenko, R. Singh, J. Liu, et al. DeXtreme: Transfer of agile in-hand manipulation from simulation to reality. In IEEE International Conference on Robotics and Automation (ICRA), 2023.
  [28] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  [29] Y. J. Ma, W. Liang, H.-J. Wang, S. Wang, Y. Zhu, L. Fan, O. Bastani, and D. Jayaraman. DrEureka: Language model guided sim-to-real transfer. In Robotics: Science and Systems (RSS), 2024.
  [30] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In IEEE International Conference on Robotics and Automation (ICRA), 2019.
  [31] F. Ramos, R. Possas, and D. Fox. BayesSim: Adaptive domain randomization via probabilistic inference for robotics simulators. In Robotics: Science and Systems (RSS), 2019.
  [32] Y. Jiang, C. Wang, R. Zhang, J. Wu, and L. Fei-Fei. TRANSIC: Sim-to-real policy transfer by learning from online correction. In Conference on Robot Learning (CoRL), 2024.
  [33] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  [34] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning (CoRL), 2023.
  [35] Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. In Robotics: Science and Systems (RSS), 2024.
  [36] T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3D Diffuser Actor: Policy diffusion with 3D scene representations. In Conference on Robot Learning (CoRL), 2024.
  [37] D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.
  [38] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), 2018.
  [39] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012.
  [40] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning (CoRL), 2021.
  [41] Physical Intelligence et al. π*0.6: a VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025.

Appendix for: Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

A.1 Reward Design

All tasks use dense, shaped rewards clipped to [0, 1]. Each reward is decomposed into staged sub-rewards that are applied progressive...