
arxiv: 2606.11743 · v1 · pith:RHG5GFXZnew · submitted 2026-06-10 · 💻 cs.RO · cs.GR · cs.LG

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

Pith reviewed 2026-06-27 09:50 UTC · model grok-4.3

classification 💻 cs.RO · cs.GR · cs.LG
keywords tactile feedback · vision-language-action · robot manipulation · simulation-based reinforcement learning · sim-to-real transfer · contact-rich tasks · bimanual manipulation

The pith

TacCoRL injects tactile feedback into vision-language-action policies through mixed sim-real warm-starting and simulation-based reinforcement learning for direct real-robot transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that visual observations in VLA models miss critical local contact information for manipulation, so TacCoRL adds tactile input and trains the policy to use it for modulating actions in rare near-failure states. It does this by first mixing simulated and real trajectories to warm-start tactile-conditioned actions, then applying RL in a real-aligned simulator using task rewards while a supervised loss on real data keeps the policy grounded. The result is a policy that transfers zero-shot to hardware without privileged simulation state or further real-world RL. A sympathetic reader cares because this avoids the risks and scale problems of collecting contact data directly on robots and improves success on contact-rich tasks.

Core claim

TacCoRL uses a real-aligned simulator as a closed-loop environment where mixed simulated and real trajectories first warm-start tactile-conditioned actions in a pretrained VLA policy; reinforcement learning then optimizes the policy on simulated contact rollouts with verifiable task rewards while a supervised objective on real trajectories anchors the refined policy to deployment distributions. The resulting visuo-tactile policy transfers directly to the real robot and reaches an average 72.5 percent success rate across four bimanual contact-rich tasks compared with a 50 percent baseline.
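The balance between the RL term and the real-data anchor can be pictured as a weighted sum of two losses. The sketch below is an illustration under stated assumptions, not the paper's implementation: the REINFORCE-style surrogate, the mean-squared-error anchor, and the names `rl_surrogate_loss`, `anchor_loss`, `total_loss`, and `beta` are all hypothetical, since this page reviews the abstract only.

```python
from typing import Sequence


def rl_surrogate_loss(log_probs: Sequence[float], advantages: Sequence[float]) -> float:
    """Assumed REINFORCE-style surrogate on simulated rollouts: -mean(log_prob * advantage)."""
    if len(log_probs) != len(advantages) or not log_probs:
        raise ValueError("log_probs and advantages must be non-empty and equal in length")
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / len(log_probs)


def anchor_loss(pred_actions: Sequence[float], real_actions: Sequence[float]) -> float:
    """Assumed supervised anchor: mean squared error against real demonstration actions."""
    if len(pred_actions) != len(real_actions) or not pred_actions:
        raise ValueError("action sequences must be non-empty and equal in length")
    return sum((p - r) ** 2 for p, r in zip(pred_actions, real_actions)) / len(pred_actions)


def total_loss(log_probs, advantages, pred_actions, real_actions, beta: float) -> float:
    """Hypothetical combined objective: simulated RL term plus beta-weighted real-data anchor."""
    return rl_surrogate_loss(log_probs, advantages) + beta * anchor_loss(pred_actions, real_actions)
```

With `beta = 0` the policy is shaped by simulated rewards alone; a larger `beta` pulls the predicted actions back toward the real demonstrations, which is the anchoring role the summary describes.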

What carries the argument

The sim-real co-training plus simulation-based RL loop that learns contact-modulated action responses in near-failure states using a real-aligned simulator for rollouts.
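A minimal sketch of the co-training half of that loop is given here, assuming it amounts to drawing each training batch from both simulated and real trajectories. The function name `sample_cotraining_batch` and the reading of `alpha` as the simulated fraction of a batch are assumptions made for illustration; the page does not say how the paper defines its mixing ratio.

```python
import random


def sample_cotraining_batch(real_trajs, sim_trajs, batch_size, alpha, rng):
    """Draw a mixed batch in which about `alpha` of the items are simulated.

    `alpha` as "fraction of simulated samples" is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_sim = round(alpha * batch_size)
    n_real = batch_size - n_sim
    # Sample with replacement so a small real dataset can still fill its share.
    batch = [("sim", rng.choice(sim_trajs)) for _ in range(n_sim)]
    batch += [("real", rng.choice(real_trajs)) for _ in range(n_real)]
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    demo = sample_cotraining_batch(["r0", "r1"], ["s0", "s1", "s2"], 8, 0.75, rng)
    print(sum(1 for source, _ in demo if source == "sim"), "simulated items of", len(demo))
```

Sampling with replacement is a deliberate choice in this sketch: real tactile demonstrations are described as scarce, so the real share of each batch would otherwise be limited by dataset size.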

If this is right

  • The policy deploys directly on real hardware without needing privileged simulation state or further real-world reinforcement learning.
  • Tactile-conditioned actions improve handling of near-failure contact states that are rare in demonstrations.
  • Average success across the four tested bimanual contact-rich tasks reaches 72.5 percent versus 50 percent for the baseline.
  • The supervised objective on real trajectories keeps the policy aligned with actual visual, tactile, and action distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If simulators can be made accurate enough for contact, similar co-training loops could reduce reliance on large-scale real tactile datasets for other robot skills.
  • The approach suggests that verifiable task rewards in simulation can substitute for risky real-world exploration when visual-tactile priors are already present.
  • Extending the same warm-start plus RL structure to additional sensor modalities might improve robustness in tasks where one modality alone is insufficient.

Load-bearing premise

A real-aligned simulator exists that reproduces contact dynamics accurately enough for RL rollouts to produce policies that transfer zero-shot to hardware.

What would settle it

Train the policy in the described simulator loop and measure whether its success rate on the four real bimanual tasks matches the reported 72.5 percent without any privileged simulation state or online real-world updates.

Figures

Figures reproduced from arXiv: 2606.11743 by Chang Yu, Chenfanfu Jiang, Hao Su, Siyu Ma, Yin Yang, Yixin Zhu, Yunuo Chen, Yuqi Liang.

Figure 1: Left: We collect real and simulated visuo-tactile trajectories from aligned real-world and simulation setups. Center: Sim-real co-training gives the policy an initial tactile-conditioned action prior, and tactile-guided RL in a real-aligned simulator refines closed-loop contact corrections. Right: We deploy the policy directly to the real world, where it achieves high success rates across diverse contact-…
Figure 2: Pipeline. (A) We collect real demonstrations $\mathcal{D}_{\text{real}}$ together with simulated teleoperation data $\mathcal{D}^{\text{teleop}}_{\text{sim}}$, and further scale up $\mathcal{D}^{\text{teleop}}_{\text{sim}}$ using MimicGen to obtain $\mathcal{D}^{\text{Mimic}}_{\text{sim}}$. (B) During sim-real co-training, tactile information is encoded and routed through contact-aware gating to modulate both the context of vision-language models (VLM) and the action expert. (C) Interactive simulation rollouts hel…
Figure 3: Experimental task settings. Real and calibrated simulation workspaces for four contact-rich bimanual tasks. The accumulated object placements indicate the pose ranges used for domain randomization and evaluation.

… window $h^{\tau}_t = o^{\tau}_{t-L+1:t} \in \mathbb{R}^{L \times K}$ with history length $L$ and $K$ taxels, capturing how contact is loaded, released, and evolves over time:

$$Z^{\tau}_t = W_{\tau} E_{\tau}(h^{\tau}_t) \in \mathbb{R}^{M \times d}. \tag{3}$$

A binary contact gate s…
Figure 4: Controller and tactile alignment. (A) Held-out J4 joint-response replay comparing the target, real, and simulated responses before and after controller SysID. (B) Normalized tactile-reading histogram from matched contact rollouts after tactile calibration.

Co-training has two practical effects: it provides the policy with a tactile-conditioned prior grounded in real observations, and it offers sparse-re…
Figure 5: Real-world policy rollouts. Representative real-robot executions of our post-trained visuo-tactile VLA policy across four contact-rich bimanual tasks.

… tactile-reading distributions from matched real and simulated contact rollouts after tactile calibration. Together, these results indicate that the action-execution and contact-observation interfaces are sufficiently aligned to support subsequent simulator …
Figure 6: Tactile feedback improves simulator RL. Across all tasks, visuo-tactile policies consistently achieve higher success rates than vision-only policies during simulator RL.
Figure 7: Ablation of co-training and real-data anchoring. We vary the co-training ratio α and real-anchor weight β on the Assembly #2 task and report model performance in terms of simulator success rate (left), real-data anchor loss (middle), which measures policy deviation from real demonstrations during simulation, and real-world deployment success rate (right).

… during RL. We report the simulator success rate, r…
Figure 8: Real-world robot setup. (a) Bimanual platform with two AgileX PiPER 6-DoF robotic arms and a fixed front-view RealSense D415 camera. (b) End-effector close-up with wrist-mounted RealSense D405 cameras and two FlexiTac-V2 tactile pads on the gripper contact surfaces.
Figure 9: Per-joint controller SysID. Each panel replays the same real single-joint sweep in simulation. The reference simulator uses $K_p = 500\ \mathrm{N\,m\,rad^{-1}}$, $K_d = 50\ \mathrm{N\,m\,s\,rad^{-1}}$, and $T_{\text{ref}} = 0\ \mathrm{N\,m}$; the calibrated simulator uses the identified parameters listed below the response plots. SysID reduces lag, overshoot, and steady-state bias across joints.
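To make the role of these gains concrete, the sketch below steps a single joint under a PD law using the reference values quoted in the Figure 9 caption. It rests on several assumptions: the control law form (PD plus a constant torque offset `t_ref`), the unit joint inertia, the absence of friction and gravity, and the function names `pd_torque` and `simulate_step_response` are all hypothetical and are not taken from the paper.

```python
def pd_torque(q, dq, q_des, dq_des=0.0, kp=500.0, kd=50.0, t_ref=0.0):
    """Assumed joint control law: tau = Kp*(q_des - q) + Kd*(dq_des - dq) + T_ref.

    Defaults are the reference-simulator gains quoted in the Figure 9 caption.
    """
    return kp * (q_des - q) + kd * (dq_des - dq) + t_ref


def simulate_step_response(q_des, inertia=1.0, dt=1e-3, steps=2000, **gains):
    """Integrate one rigid joint from rest toward q_des.

    The inertia value (kg*m^2) is a placeholder; friction and gravity are ignored.
    """
    q, dq = 0.0, 0.0
    trace = []
    for _ in range(steps):
        tau = pd_torque(q, dq, q_des, **gains)
        dq += (tau / inertia) * dt  # semi-implicit Euler: velocity first
        q += dq * dt                # then position, using the updated velocity
        trace.append(q)
    return trace


if __name__ == "__main__":
    reference = simulate_step_response(0.5)
    detuned = simulate_step_response(0.5, kp=100.0, kd=5.0)
    print("reference final angle:", reference[-1])
    print("detuned peak angle:", max(detuned))
```

Replaying the same commanded sweep with different `kp`, `kd`, and `t_ref` values and comparing the traces against a recorded real response is the kind of comparison a per-joint system identification would optimize over, although the fitting procedure the authors use is not described on this page.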
Figure 10: Qualitative tactile signal alignment. Side-by-side real-to-sim replay trajectories for Assembly #1 and #2. Each pair shows synchronized real and simulated frames with tactile maps, highlighting matched contact location and evolution.
Figure 11: Camera calibration. Simulation-to-real alignment after camera extrinsic calibration. Columns show the fixed front camera, left wrist camera, and right wrist camera. The top two rows compare rendered simulation views with synchronized real views, and the bottom rows overlay the two domains at the initial state, grasp phase, and insertion phase.
Figure 12 shows one representative failure from each task; the full failure distribution is broader, but these examples share a post-contact ambiguity. After first contact, the rack rim, puzzle-hole boundary, or assembly mating interface is hidden by the gripper or held part, so success depends on converting local contact cues into corrective motion rather than continuing the nominal trajectory. (Plot labels: Angle Error, Positio…)
Original abstract

Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only adding touch as an input, but learning how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to a baseline of 50.0%. Result videos and more details are available at https://tac-corl.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TacCoRL, a framework for injecting tactile feedback into pretrained vision-language-action (VLA) policies. It uses mixed simulated and real trajectories to warm-start tactile-conditioned actions, followed by simulation-based RL with verifiable rewards to optimize contact responses in near-failure states, while a supervised objective on real data anchors the policy. The resulting policy is claimed to transfer zero-shot to hardware; across four bimanual contact-rich tasks the visuo-tactile policy reaches 72.5% average success versus a 50% baseline.

Significance. If the empirical claims are substantiated, the work would provide a concrete route to augment VLA models with tactile modulation for contact-rich manipulation without large-scale tactile pretraining or online real-world RL, addressing a recognized limitation of vision-only policies in tasks sensitive to local contact state.

major comments (2)
  1. [Abstract] Abstract: the central zero-shot transfer claim rests on the existence of a 'real-aligned simulator' that produces contact rollouts whose learned tactile-to-action mappings transfer directly; however, no quantitative validation of simulator fidelity (real-vs-sim tactile signal correlation, force-torque error metrics, or domain-randomization ablation) is supplied, leaving the 22.5-point success-rate lift vulnerable to sim-specific artifacts.
  2. [Abstract] Abstract: the reported success rates are given without any description of experimental protocol, task definitions, baseline implementations, trial counts, variance, or statistical tests, so the reliability of the improvement and the cross-task claim cannot be assessed from the provided text.
minor comments (1)
  1. The abstract refers to 'four bimanual contact-rich tasks' and 'verifiable task rewards' without naming the tasks or reward formulations.
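The second major comment turns on how much uncertainty sits behind the two success rates. As a rough illustration, the sketch below computes a Wilson score interval for a binomial success rate. The trial counts in the usage example are hypothetical: 80 trials per policy is assumed here (four tasks at 20 trials each, a figure that appears only as an example in the simulated rebuttal on this page), and `wilson_interval` is a name introduced for this sketch.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts: 58/80 corresponds to 72.5%, 40/80 to 50.0%.
    print("visuo-tactile:", wilson_interval(58, 80))
    print("baseline:     ", wilson_interval(40, 80))
```

Because the interval width shrinks only with the square root of the trial count, the same percentages reported over far fewer trials would leave the comparison much less certain, which is why the referee asks for trial counts and variance.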

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the abstract could better substantiate our claims. We address each comment below and will revise the manuscript to improve transparency on simulator validation and experimental details.

  1. Referee: [Abstract] Abstract: the central zero-shot transfer claim rests on the existence of a 'real-aligned simulator' that produces contact rollouts whose learned tactile-to-action mappings transfer directly; however, no quantitative validation of simulator fidelity (real-vs-sim tactile signal correlation, force-torque error metrics, or domain-randomization ablation) is supplied, leaving the 22.5-point success-rate lift vulnerable to sim-specific artifacts.

    Authors: We agree that quantitative validation of simulator fidelity is essential to support the zero-shot transfer. The manuscript describes the real-aligned simulator construction and its role in co-training/RL, but does not report explicit metrics such as tactile signal correlations or force-torque errors in the abstract (or prominently in results). In revision we will add these metrics, including real-vs-sim correlation coefficients and a domain-randomization ablation, to the methods/results sections to demonstrate that performance gains are not artifacts of simulation-specific contact dynamics. revision: yes

  2. Referee: [Abstract] Abstract: the reported success rates are given without any description of experimental protocol, task definitions, baseline implementations, trial counts, variance, or statistical tests, so the reliability of the improvement and the cross-task claim cannot be assessed from the provided text.

    Authors: The full manuscript contains task definitions, baseline implementations, trial counts (e.g., 20 trials per task), and variance reporting in the Experiments section. However, the abstract is too concise to include this protocol. We will revise the abstract to briefly note the four tasks, trial counts, and that full protocol/variance/statistical details appear in the main text, allowing readers to assess reliability without expanding the abstract beyond typical length limits. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results only


The paper presents an empirical training pipeline (mixed sim-real warm-start followed by sim RL with task rewards and real supervised anchoring) whose output is measured success rate on hardware. No equations, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the abstract or described method. The 72.5% vs 50% result is reported as an experimental outcome rather than a quantity derived by construction from its own inputs. The central premise (simulator fidelity) is an assumption, not a circular derivation step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no equations, hyperparameters, or modeling choices are visible, so the ledger is populated only with the load-bearing premise stated in the abstract itself.

axioms (1)
  • domain assumption: A real-aligned simulator exists whose contact dynamics are sufficiently accurate that policies optimized on simulated contact rollouts transfer directly to hardware.
    Invoked when the abstract claims that RL with verifiable task rewards on simulated contact rollouts produces a policy that transfers without online real-world RL.

pith-pipeline@v0.9.1-grok · 5807 in / 1490 out tokens · 18252 ms · 2026-06-27T09:50:36.035815+00:00 · methodology

