TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Baoxu Liu; Guozheng Li; Huaihang Zheng; Kai Ma; Shenglin Xu; Si Liu; Tian Xie; Xiangyu Wang; Yinian Mao; Yiren Ma

arxiv: 2606.09337 · v3 · pith:OD4IJDSWnew · submitted 2026-06-08 · 💻 cs.RO

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Huaihang Zheng , Yi Yang , Kai Ma , Shenglin Xu , Tian Xie , Guozheng Li , Xiangyu Wang , Yiren Ma

show 3 more authors

Si Liu Yinian Mao Baoxu Liu

This is my paper

Pith reviewed 2026-06-27 16:28 UTC · model grok-4.3

classification 💻 cs.RO

keywords Tactile feedbackOnline reinforcement learningVision-language-actionContact-rich manipulationRobotic manipulationPolicy refinementIntervention censoringWrench prediction

0 comments

The pith

TORL-VLA adds an online RL module and intervention-censored critic to a tactile-aware VLA so robots can refine contact forces during long-horizon tasks when conditions change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that keeps a vision-language-action model as the base policy but augments it with real-time tactile guidance and reinforcement learning updates. A wrench-aware VLA first outputs reference actions plus predicted future force sequences from tactile input. A lightweight RL component then adjusts those references on the fly. The intervention-censored critic ensures that successes occurring after a human takes over are not incorrectly assigned to the policy actions that came before. Real-robot trials on latch opening, cup placement, and egg handling demonstrate higher subtask and full-task completion rates plus faster execution within time limits compared with strong offline baselines.

Core claim

TORL-VLA couples a tactile-derived wrench-aware VLA that predicts reference actions and future wrench sequences with a lightweight online RL module that refines the reference actions, stabilized by an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention.

What carries the argument

The intervention-censored critic that blocks incorrect credit assignment from human interventions to earlier policy actions, paired with the wrench prediction head that supplies reference actions for the online RL module.

If this is right

Success rates rise at both individual subtasks and complete long-horizon sequences on latch manipulation, coffee-cup placement, and egg handling.
Time-bounded execution efficiency improves because the policy reduces inappropriate contact forces and inefficient retries.
The system performs online adaptation when contact conditions move outside the original training distribution.
Mixed human-intervention and policy-generated data can be used for stable online learning without corrupting the value estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same censoring idea could be applied to other hybrid human-robot data streams where credit must be isolated to autonomous actions.
Wrench prediction might serve as a general interface for incorporating other force-related sensors into VLA refinement loops.
The lightweight RL module suggests that full retraining of large VLAs may not be required for contact adaptation if reference actions are already available.
Extending the approach to multi-fingered or deformable-object tasks would test whether the wrench reference remains informative when geometry changes rapidly.

Load-bearing premise

The wrench predictions from the tactile VLA stay accurate enough to serve as useful references, and the censoring mechanism correctly avoids crediting post-intervention outcomes to preceding policy actions.

What would settle it

If ablation experiments on the latch task show that removing the intervention-censored critic causes the policy to receive credit for successes that only occur after human intervention, the learning stability claim would not hold.

Figures

Figures reproduced from arXiv: 2606.09337 by Baoxu Liu, Guozheng Li, Huaihang Zheng, Kai Ma, Shenglin Xu, Si Liu, Tian Xie, Xiangyu Wang, Yinian Mao, Yiren Ma, Yi Yang.

**Figure 1.** Figure 1: Overview of the TORL-VLA framework. Stage I: Tactile-derived wrench sequences are fused after visual-language encoding through MoE routing to predict action references and future wrench sequences. Stage II: The frozen wrench-aware VLA provides vision tokens, reference actions, and predicted wrench; a lightweight actor-critic refines actions online with wrench conditioning and stage routing. Stage III: Ev… view at source ↗

**Figure 2.** Figure 2: Tactile-to-wrench mapping. Tactile readings are mapped to a 6D wrench representation for contact-aware manipulation. Wrench-conditioned reference generation. The VLM encodes the image, language, and robot-state prefix tokens into hidden states Pt ∈ R Np×dvl , where Np is the number of prefix tokens and dvl is the hidden dimension of the VLM stream [26]. At flow-matching interpolation time s ∈ [0, 1], the a… view at source ↗

**Figure 3.** Figure 3: Experimental platform. Real-robot latch-box setup with multi-view vision and dualfingertip tactile sensing. We evaluate TORL-VLA on contact-rich manipulation tasks to answer three questions: (i) whether wrench-guided online refinement improves subtask and full-task success; (ii) whether the wrench-aware reference model provides stronger action-and-wrench references than adapted physical-feedback VLA base… view at source ↗

**Figure 4.** Figure 4: Full-task reliability and throughput. Success rate and 60-min throughput are compared across methods. Baselines and Fairness. We compare TORLVLA with π0.5, TA-VLA, ForceVLA, TORLVLA w/o RL, and RLT under the same robot platform, action representation, demonstrations, execution protocol, and evaluation trials. TA-VLA and ForceVLA are reimplemented on the same π0.5 backbone and adapted to the same dual… view at source ↗

**Figure 5.** Figure 5: Task visualization. Representative contact-rich subtasks in latch-box manipulation, including coffee-cup placement, latch locking, and egg placement [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Online adaptation on Latch. Left: throughput measured per 10 minutes as online data increases. Right: success rate at each data checkpoint. direct wrench-token bypass preserves fine-grained contact information. In contrast, removing WHT or FWP has smaller effects, indicating that temporal wrench encoding and future wrench prediction mainly complement the fusion pathway. The Full w/o RL variant achieves th… view at source ↗

**Figure 7.** Figure 7: Implementation details of the wrench-aware VLA reference model. camera views (wrist, fisheye, and global), a language instruction, robot proprioception, and tactile readings from two piezoelectric pads mounted on the inner surfaces of the gripper fingers. As described in Appendix A.2.2, the tactile readings are converted into the current measured wrench wt, and the recent wrench window is represented in a … view at source ↗

**Figure 8.** Figure 8: Stage estimator architecture. Policy-side features are projected, temporally aggregated, and used to predict a stage label and confidence for route filtering. The route mapping separates full-task execution into base execution and local contact-window refinement. Labels outside predefined contact windows are mapped to the base route, where the frozen VLA reference chunk is executed directly. Only label… view at source ↗

**Figure 9.** Figure 9: Detailed alignment of predicted stages against ground truth over full-task execution. The visualization trace shows chunk-level stage estimates aligned with sequential camera observations. Minor discrepancies mainly appear at continuous phase boundaries, while the physical manipulation states in GT and ER remain consistent across stable intervals. than incorrect recognition of stable task stages. In deplo… view at source ↗

**Figure 10.** Figure 10: Measured contact evolution in the cup-insertion task. The wrench curves are measured [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Detailed six-dimensional future-wrench prediction on a latch-locking segment. The fig [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Measured contact evolution in the latch-locking task. Each panel shows synchronized [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

**Figure 13.** Figure 13: Future-wrench prediction examples across the latch, cup, and egg tasks. Each panel [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗

read the original abstract

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines. Project page: https://torl-vla.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TORL-VLA adds an online RL refinement layer and an intervention-censored critic to a tactile VLA, but the abstract leaves the critic's actual effect on credit assignment unshown.

read the letter

The core move here is taking an offline tactile VLA that outputs reference actions and wrench sequences, then running a lightweight online RL module on top to adjust those references during execution, with a special critic that blocks success signals after a human steps in. That combination is the new piece relative to prior tactile VLA work.

The paper correctly flags a real deployment gap: offline policies cannot adjust contact forces or retry efficiently when conditions drift from training data. The real-robot tasks (latch manipulation, coffee-cup placement, egg handling) are the right kind of long-horizon contact-rich problems, and claiming gains in both subtask success and time-bounded efficiency over baselines is the sort of result that could matter if the numbers hold.

The intervention-censored critic is presented as the fix for learning from mixed policy and human data. The stress-test note is accurate on this point. The abstract states the mechanism exists but supplies no equations, pseudocode, value-function checks, or controlled trials showing that post-intervention credit is actually blocked when interventions occur mid-subtask. Without that evidence, the reported improvements cannot be confidently tied to the proposed architecture rather than to other factors like intervention frequency or baseline tuning.

The citation pattern is standard and the real-robot evaluation is a plus, but the lack of detail on the critic and on how wrench predictions are kept accurate enough to serve as references is the main soft spot. Minor issues like missing baseline specifics would be fixable in revision.

This is for people working on physical deployment of VLAs in contact-rich settings. A reader already running tactile sensors and occasional human interventions would get practical ideas from it. It deserves a serious referee to check whether the full methods and ablations actually support the credit-assignment claim.

Referee Report

2 major / 2 minor

Summary. The paper proposes TORL-VLA, a framework that augments vision-language-action (VLA) models with tactile-derived wrench predictions to generate reference actions, then refines those actions via a lightweight online RL module for contact-rich manipulation. A key component is an intervention-censored critic intended to stabilize learning when human interventions occur during rollouts. Real-robot experiments on long-horizon tasks (latch manipulation, coffee-cup placement, egg handling) are reported to yield higher subtask and full-task success rates plus improved time-bounded efficiency relative to strong baselines.

Significance. If the claimed performance gains are reproducible and the credit-assignment mechanism is shown to function as described, the work would offer a concrete route to online adaptation of VLAs in contact-rich settings where offline policies fail under distribution shift. The integration of wrench-sequence prediction with an intervention-aware critic addresses a practical gap, though the absence of quantitative diagnostics for the critic limits immediate assessment of its contribution.

major comments (2)

[Method (online RL module and critic)] The intervention-censored critic is presented as the mechanism that prevents post-intervention success signals from being attributed to preceding policy actions, yet the manuscript supplies no derivation, pseudocode, or value-function diagnostics demonstrating that censoring is correctly implemented and sufficient when interventions occur mid-subtask. This mechanism is load-bearing for attributing the reported efficiency and success-rate gains to the online RL module rather than to human corrections.
[Experiments] The abstract states that real-robot experiments demonstrate improvements over strong baselines on three long-horizon tasks, but no quantitative results, baseline details, success-rate tables, or statistical tests are visible in the provided text. Without these, it is impossible to evaluate whether the data support the central empirical claim.

minor comments (2)

[Method] Notation for the wrench-aware VLA outputs (reference actions versus predicted wrench sequences) should be defined explicitly with consistent symbols across text and any equations.
[Conclusion] The project page URL is given but the manuscript does not indicate whether code, trained models, or raw experimental logs will be released, which would strengthen reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify two areas where additional clarity will strengthen the manuscript. We respond to each point below and will incorporate revisions as indicated.

read point-by-point responses

Referee: [Method (online RL module and critic)] The intervention-censored critic is presented as the mechanism that prevents post-intervention success signals from being attributed to preceding policy actions, yet the manuscript supplies no derivation, pseudocode, or value-function diagnostics demonstrating that censoring is correctly implemented and sufficient when interventions occur mid-subtask. This mechanism is load-bearing for attributing the reported efficiency and success-rate gains to the online RL module rather than to human corrections.

Authors: We agree that the current manuscript provides only a high-level description of the intervention-censored critic without a formal derivation, pseudocode, or supporting diagnostics. In the revised version we will add: (i) the modified value update rule that censors post-intervention rewards, (ii) pseudocode for the critic training loop, and (iii) diagnostic plots or ablation numbers showing value estimates with and without censoring on mid-subtask interventions. These additions will make the credit-assignment argument explicit and allow readers to verify that the reported gains are attributable to the online RL module. revision: yes
Referee: [Experiments] The abstract states that real-robot experiments demonstrate improvements over strong baselines on three long-horizon tasks, but no quantitative results, baseline details, success-rate tables, or statistical tests are visible in the provided text. Without these, it is impossible to evaluate whether the data support the central empirical claim.

Authors: The full manuscript contains Section 5 with the requested quantitative material: success-rate tables (subtask and full-task) for latch manipulation, coffee-cup placement, and egg handling; explicit baseline descriptions (VLA-only, standard RL, and ablations); means and standard deviations over repeated trials; and statistical significance tests. We will revise the submission to ensure these tables and statistical details are referenced directly from the abstract and appear in the main body without relying on supplementary material. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; claims rest on empirical robot experiments.

full rationale

The paper describes a framework (tactile-derived VLA + online RL + intervention-censored critic) and validates it via real-robot trials on latch, cup, and egg tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or method summary. The critic is introduced as a design choice to handle mixed data, but its correctness is not derived from prior self-work; performance gains are attributed to experimental outcomes rather than any self-referential reduction. This is the common case of an applied robotics paper whose central claims are externally falsifiable via replication on hardware.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5776 in / 1011 out tokens · 22466 ms · 2026-06-27T16:28:38.183733+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 1 canonical work pages

[1]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[2]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and RT-X models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–
[3]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . L. Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024
[4]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. InProceedings of the 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learni...

2025
[5]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[6]

Huang, S

J. Huang, S. Wang, F. Lin, Y . Hu, C. Wen, and Y . Gao. Tactile-VLA: Unlocking vision- language-action model’s physical knowledge for tactile generalization.arXiv preprint arXiv:2507.09160, 2025

arXiv 2025
[7]

P. Hao, C. Zhang, D. Li, X. Cao, X. Hao, S. Cui, and S. Wang. TLA: Tactile-language-action model for contact-rich manipulation.Robot Learning, 3(1):17–18, 2026

2026
[8]

Zhang, P

C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang. VTLA: Vision-tactile-language-action model with preference learning for insertion manipulation.arXiv preprint arXiv:2505.09577, 2025

arXiv 2025
[9]

J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y . Song, P. Cai, et al. ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. Advances in Neural Information Processing Systems, 38:93409–93439, 2026

2026
[10]

Y . Li, H. Jiang, J. Xia, H. Zhang, J. Du, Y . Zhou, J. Zeng, C. Hao, J. Ren, Q. Yu, et al. ForceVLA2: Unleashing hybrid force-position control with force awareness for contact-rich manipulation.arXiv preprint arXiv:2603.15169, 2026

arXiv 2026
[11]

Zhang, H

Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao. TA-VLA: Elucidating the design space of torque-aware vision-language-action models.arXiv preprint arXiv:2509.07962, 2025

arXiv 2025
[12]

Cheng, Y

Z. Cheng, Y . Zhang, W. Zhang, H. Li, K. Wang, L. Song, and H. Zhang. OmniVTLA: Vision-tactile-language-action model with semantic-aligned tactile sensing.arXiv preprint arXiv:2508.08706, 2025

arXiv 2025
[13]

J. Bi, K. Y . Ma, C. Hao, M. S. Zheng, and H. Soh. VLA-Touch: Enhancing vision-language- action model with dual-level tactile feedback.IEEE Robotics and Automation Letters, 2026

2026
[14]

R. Zhao, W. Wang, Y . Ma, X. Li, F. E. H. Tay, M. H. Ang Jr., and H. Zhu. FD-VLA: Force-distilled vision-language-action model for contact-rich manipulation.arXiv preprint arXiv:2602.02142, 2026. 9

arXiv 2026
[15]

Gubernatorov, M

K. Gubernatorov, M. Sannikov, I. Mikhalchuk, E. Kuznetsov, M. Artemov, O. F. Ouwatobi, M. Fernando, A. Asanov, Z. Guo, and D. Tsetserukou. HapticVLA: Contact-rich manipula- tion via vision-language-action model without inference-time tactile sensing.arXiv preprint arXiv:2603.15257, 2026

arXiv 2026
[16]

Intelligence, A

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al.π ∗ 0.6: A VLA that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025
[17]

Y . Chen, S. Tian, S. Liu, Y . Zhou, H. Li, and D. Zhao. ConRFT: A reinforced fine-tuning method for VLA models via consistency policy. InProceedings of Robotics: Science and Systems, 2025. doi:10.15607/RSS.2025.XXI.019

work page doi:10.15607/rss.2025.xxi.019 2025
[18]

X. Yuan, T. Mu, S. Tao, Y . Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[19]

W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Z. Luo, Y . Xie, F. Hu, L. Fan, G. Shi, and Y . Zhu. Self-improving vision-language-action models with data generation via residual RL. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps:// openreview.net/forum?id=eUGoqrZ6Ea

2026
[20]

Y . Li, X. Ma, J. Xu, Y . Cui, Z. Cui, Z. Han, L. Huang, T. Kong, Y . Liu, H. Niu, et al. GR-RL: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

arXiv 2025
[21]

Wagenmaker, Y

A. Wagenmaker, Y . Zhang, M. Nakamoto, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. In9th Annual Conference on Robot Learning, 2025

2025
[22]

C. Xu, J. T. Springenberg, M. Equi, A. Amin, A. Esmail, S. Levine, and L. Ke. RL token: Boot- strapping online RL with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026

Pith/arXiv arXiv 2026
[23]

J. Luo, Z. Hu, C. Xu, Y . L. Tan, J. Berg, A. Sharma, S. Schaal, C. Finn, A. Gupta, and S. Levine. SERL: A software suite for sample-efficient robotic reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16961–16969. IEEE, 2024

2024
[24]

K. Lei, H. Li, D. Yu, Z. Wei, L. Guo, Z. Jiang, Z. Wang, S. Liang, and H. Xu. RL- 100: Performant robotic manipulation with real-world reinforcement learning.arXiv preprint arXiv:2510.14830, 2025

arXiv 2025
[25]

J. Luo, C. Xu, J. Wu, and S. Levine. Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

2025
[26]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[27]

Alayrac, J

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Mil- lican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

2022
[28]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023. 10

2023
[29]

Driess, J

D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867– 102888, 2026

2026
[30]

A. J. Hancock, X. Wu, L. Zha, O. Russakovsky, and A. Majumdar. Actions as language: Fine-tuning vlms into vlas without catastrophic forgetting.arXiv preprint arXiv:2509.22195, 2025

arXiv 2025
[31]

C. Miao, T. Chang, M. Wu, H. Xu, C. Li, M. Li, and X. Wang. Fedvla: Federated vision- language-action learning with dual gating mixture-of-experts for robotic manipulation. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 6904–6913, 2025

2025
[32]

W. Shen, Y . Liu, Y . Wu, Z. Liang, S. Gu, D. Wang, T. Nian, L. Xu, Y . Qin, J. Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language- action learning.arXiv preprint arXiv:2510.14300, 2025

arXiv 2025
[33]

Z. Du, B. Liu, Y . Liang, Y . Shen, H. Cao, X. Zheng, Z. Feng, Z. Wu, J. Yang, and Y .-G. Jiang. Himoe-vla: Hierarchical mixture-of-experts for generalist vision-language-action poli- cies.arXiv preprint arXiv:2512.05693, 2025

arXiv 2025
[34]

Y . Li, P. Tang, W. Zhang, C. Zhu, Y . Duan, W. Shi, X. Zhang, Z. Yang, J. Ji, and Y . Zhang. Favla: A force-adaptive fast-slow vla model for contact-rich robotic manipulation.arXiv preprint arXiv:2602.23648, 2026

arXiv 2026
[35]

G. Ye, Z. Zhang, X. Zhao, S. Wu, H. Lu, S. Lu, and H. Liu. Learning to feel the future: Dreamtacvla for contact-rich manipulation.arXiv preprint arXiv:2512.23864, 2025

Pith/arXiv arXiv 2025
[36]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

2011
[37]

Fujimoto, H

S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

2018
[38]

Haarnoja, A

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018. 11 A Appendix This appendix provides implementation and experimental details omitted from the main text due to space constraints. App...

2018

[1] [1]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[2] [2]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and RT-X models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–

[3] [3]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . L. Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024

[4] [4]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. InProceedings of the 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learni...

2025

[5] [5]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[6] [6]

Huang, S

J. Huang, S. Wang, F. Lin, Y . Hu, C. Wen, and Y . Gao. Tactile-VLA: Unlocking vision- language-action model’s physical knowledge for tactile generalization.arXiv preprint arXiv:2507.09160, 2025

arXiv 2025

[7] [7]

P. Hao, C. Zhang, D. Li, X. Cao, X. Hao, S. Cui, and S. Wang. TLA: Tactile-language-action model for contact-rich manipulation.Robot Learning, 3(1):17–18, 2026

2026

[8] [8]

Zhang, P

C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang. VTLA: Vision-tactile-language-action model with preference learning for insertion manipulation.arXiv preprint arXiv:2505.09577, 2025

arXiv 2025

[9] [9]

J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y . Song, P. Cai, et al. ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. Advances in Neural Information Processing Systems, 38:93409–93439, 2026

2026

[10] [10]

Y . Li, H. Jiang, J. Xia, H. Zhang, J. Du, Y . Zhou, J. Zeng, C. Hao, J. Ren, Q. Yu, et al. ForceVLA2: Unleashing hybrid force-position control with force awareness for contact-rich manipulation.arXiv preprint arXiv:2603.15169, 2026

arXiv 2026

[11] [11]

Zhang, H

Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao. TA-VLA: Elucidating the design space of torque-aware vision-language-action models.arXiv preprint arXiv:2509.07962, 2025

arXiv 2025

[12] [12]

Cheng, Y

Z. Cheng, Y . Zhang, W. Zhang, H. Li, K. Wang, L. Song, and H. Zhang. OmniVTLA: Vision-tactile-language-action model with semantic-aligned tactile sensing.arXiv preprint arXiv:2508.08706, 2025

arXiv 2025

[13] [13]

J. Bi, K. Y . Ma, C. Hao, M. S. Zheng, and H. Soh. VLA-Touch: Enhancing vision-language- action model with dual-level tactile feedback.IEEE Robotics and Automation Letters, 2026

2026

[14] [14]

R. Zhao, W. Wang, Y . Ma, X. Li, F. E. H. Tay, M. H. Ang Jr., and H. Zhu. FD-VLA: Force-distilled vision-language-action model for contact-rich manipulation.arXiv preprint arXiv:2602.02142, 2026. 9

arXiv 2026

[15] [15]

Gubernatorov, M

K. Gubernatorov, M. Sannikov, I. Mikhalchuk, E. Kuznetsov, M. Artemov, O. F. Ouwatobi, M. Fernando, A. Asanov, Z. Guo, and D. Tsetserukou. HapticVLA: Contact-rich manipula- tion via vision-language-action model without inference-time tactile sensing.arXiv preprint arXiv:2603.15257, 2026

arXiv 2026

[16] [16]

Intelligence, A

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al.π ∗ 0.6: A VLA that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025

[17] [17]

Y . Chen, S. Tian, S. Liu, Y . Zhou, H. Li, and D. Zhao. ConRFT: A reinforced fine-tuning method for VLA models via consistency policy. InProceedings of Robotics: Science and Systems, 2025. doi:10.15607/RSS.2025.XXI.019

work page doi:10.15607/rss.2025.xxi.019 2025

[18] [18]

X. Yuan, T. Mu, S. Tao, Y . Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[19] [19]

W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Z. Luo, Y . Xie, F. Hu, L. Fan, G. Shi, and Y . Zhu. Self-improving vision-language-action models with data generation via residual RL. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps:// openreview.net/forum?id=eUGoqrZ6Ea

2026

[20] [20]

Y . Li, X. Ma, J. Xu, Y . Cui, Z. Cui, Z. Han, L. Huang, T. Kong, Y . Liu, H. Niu, et al. GR-RL: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

arXiv 2025

[21] [21]

Wagenmaker, Y

A. Wagenmaker, Y . Zhang, M. Nakamoto, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. In9th Annual Conference on Robot Learning, 2025

2025

[22] [22]

C. Xu, J. T. Springenberg, M. Equi, A. Amin, A. Esmail, S. Levine, and L. Ke. RL token: Boot- strapping online RL with vision-language-action models.arXiv preprint arXiv:2604.23073, 2026

Pith/arXiv arXiv 2026

[23] [23]

J. Luo, Z. Hu, C. Xu, Y . L. Tan, J. Berg, A. Sharma, S. Schaal, C. Finn, A. Gupta, and S. Levine. SERL: A software suite for sample-efficient robotic reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16961–16969. IEEE, 2024

2024

[24] [24]

K. Lei, H. Li, D. Yu, Z. Wei, L. Guo, Z. Jiang, Z. Wang, S. Liang, and H. Xu. RL- 100: Performant robotic manipulation with real-world reinforcement learning.arXiv preprint arXiv:2510.14830, 2025

arXiv 2025

[25] [25]

J. Luo, C. Xu, J. Wu, and S. Levine. Precise and dexterous robotic manipulation via human- in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

2025

[26] [26]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[27] [27]

Alayrac, J

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Mil- lican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

2022

[28] [28]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023. 10

2023

[29] [29]

Driess, J

D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867– 102888, 2026

2026

[30] [30]

A. J. Hancock, X. Wu, L. Zha, O. Russakovsky, and A. Majumdar. Actions as language: Fine-tuning vlms into vlas without catastrophic forgetting.arXiv preprint arXiv:2509.22195, 2025

arXiv 2025

[31] [31]

C. Miao, T. Chang, M. Wu, H. Xu, C. Li, M. Li, and X. Wang. Fedvla: Federated vision- language-action learning with dual gating mixture-of-experts for robotic manipulation. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 6904–6913, 2025

2025

[32] [32]

W. Shen, Y . Liu, Y . Wu, Z. Liang, S. Gu, D. Wang, T. Nian, L. Xu, Y . Qin, J. Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language- action learning.arXiv preprint arXiv:2510.14300, 2025

arXiv 2025

[33] [33]

Z. Du, B. Liu, Y . Liang, Y . Shen, H. Cao, X. Zheng, Z. Feng, Z. Wu, J. Yang, and Y .-G. Jiang. Himoe-vla: Hierarchical mixture-of-experts for generalist vision-language-action poli- cies.arXiv preprint arXiv:2512.05693, 2025

arXiv 2025

[34] [34]

Y . Li, P. Tang, W. Zhang, C. Zhu, Y . Duan, W. Shi, X. Zhang, Z. Yang, J. Ji, and Y . Zhang. Favla: A force-adaptive fast-slow vla model for contact-rich robotic manipulation.arXiv preprint arXiv:2602.23648, 2026

arXiv 2026

[35] [35]

G. Ye, Z. Zhang, X. Zhao, S. Wu, H. Lu, S. Lu, and H. Liu. Learning to feel the future: Dreamtacvla for contact-rich manipulation.arXiv preprint arXiv:2512.23864, 2025

Pith/arXiv arXiv 2025

[36] [36]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

2011

[37] [37]

Fujimoto, H

S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

2018

[38] [38]

Haarnoja, A

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018. 11 A Appendix This appendix provides implementation and experimental details omitted from the main text due to space constraints. App...

2018