BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models
Pith reviewed 2026-06-29 07:16 UTC · model grok-4.3
The pith
BORA trains an offline critic on vision-language tokens plus action chunks, then freezes the base VLA policy and applies lightweight human-in-the-loop chunk residuals to raise real-world dexterous success rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BORA constructs a critic that receives both VLM cognition tokens and action chunks to enable action-conditioned value guidance, then freezes the pretrained VLA policy and introduces a human-in-the-loop chunk-wise residual adaptation layer driven by intervention rewards; the offline critic continues to guide corrections so that execution discrepancies are reduced while the base policy remains a stable prior.
What carries the argument
Offline critic that evaluates hand motions from both cognition tokens and action chunks, paired with human-in-the-loop chunk-wise residual adaptation using intervention-driven rewards.
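To make the "action-conditioned" part concrete, the following is a minimal sketch, not the paper's architecture (which this summary does not describe). It assumes the cognition tokens have already been pooled into a flat vector and that an action chunk is a list of per-step action vectors. The names `LinearChunkCritic` and `best_chunk` are hypothetical.

```python
import math
import random


def flatten(chunk):
    """Flatten an action chunk of shape (horizon, action_dim) into one list."""
    return [x for step in chunk for x in step]


class LinearChunkCritic:
    """Toy Q(tokens, chunk): one linear layer followed by tanh.

    Hypothetical stand-in for the offline critic; the weights are random here,
    whereas a real critic would be trained on offline data.
    """

    def __init__(self, token_dim, horizon, action_dim, seed=0):
        rng = random.Random(seed)
        n_inputs = token_dim + horizon * action_dim
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]

    def value(self, tokens, chunk):
        # The critic sees the cognition tokens AND the candidate action chunk.
        features = list(tokens) + flatten(chunk)
        if len(features) != len(self.weights):
            raise ValueError("unexpected input size")
        return math.tanh(sum(w * f for w, f in zip(self.weights, features)))


def best_chunk(critic, tokens, candidates):
    """Pick the candidate chunk with the highest critic value."""
    return max(candidates, key=lambda chunk: critic.value(tokens, chunk))


# Same visual-language context, two different hand motions.
critic = LinearChunkCritic(token_dim=4, horizon=2, action_dim=3)
tokens = [0.1, 0.2, 0.3, 0.4]
still = [[0.0] * 3 for _ in range(2)]
moving = [[1.0] * 3 for _ in range(2)]
print(critic.value(tokens, still), critic.value(tokens, moving))
```

Because the chunk is part of the input, two motions under the same context receive different values, which is what lets a critic of this shape rank corrections rather than only score the scene.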
If this is right
- The framework yields a 33 percent absolute rise in average success rate over imitation learning and decoupled RL baselines under standard conditions.
- Unseen-object generalization improves by as much as 43 percent across the evaluated tasks.
- Freezing the base VLA policy while adding only residual corrections preserves the pretrained policy as a stable prior during real-world adaptation.
- Intervention-driven rewards allow the system to correct physical variances without introducing new temporal inconsistencies or hardware risk.
- The same offline-to-online structure can be applied to any VLA policy that outputs action chunks.
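The frozen-base-plus-residual idea in the points above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the clipping bound `limit`, the reward shape, and the function names `apply_residual` and `intervention_reward` are all hypothetical, since this summary does not specify them.

```python
def clip(x, limit):
    """Clamp x to the interval [-limit, limit]."""
    return max(-limit, min(limit, x))


def apply_residual(base_chunk, residual_chunk, limit=0.05):
    """Add a bounded per-step correction to a chunk from the frozen base policy.

    The base chunk is never modified in place, mirroring the idea that the
    pretrained policy stays fixed and only the small residual is learned.
    """
    return [
        [a + clip(r, limit) for a, r in zip(base_step, res_step)]
        for base_step, res_step in zip(base_chunk, residual_chunk)
    ]


def intervention_reward(intervened, task_success, penalty=1.0, bonus=1.0):
    """Hypothetical reward: penalise chunks a human corrected, reward success."""
    return (bonus if task_success else 0.0) - (penalty if intervened else 0.0)


base = [[0.0, 1.0], [0.2, 0.8]]          # chunk proposed by the frozen VLA
residual = [[0.2, -0.01], [0.0, 0.03]]   # proposed correction, partly too large
print(apply_residual(base, residual))
print(intervention_reward(intervened=True, task_success=False))
```

Bounding the residual is one simple way to keep the executed chunk close to the base policy's output, which is the sense in which the pretrained policy acts as a stable prior.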
Where Pith is reading between the lines
- Human interventions could be collected at lower cost than full demonstrations because only corrective residuals are needed.
- The approach may generalize to other high-dimensional continuous-control domains where full online RL is unsafe.
- If the critic remains useful across tasks, the method offers a template for keeping large pretrained policies intact while still adapting them to new physical conditions.
Load-bearing premise
Value estimates produced by the offline critic stay accurate and stable once the base policy is frozen and only small human residual corrections are applied online.
What would settle it
Running the five real-world dexterous tasks with the online residual stage disabled and finding that success rates drop back to the level of pure imitation learning would falsify the claim that the combined offline critic plus online adaptation produces the reported gains.
Original abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. However, high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BORA, an offline-to-online RL post-training framework for real-world dexterous VLA models. In the offline phase an action-conditioned critic is trained on VLM cognition tokens plus action chunks; in the online phase the base VLA is frozen and a lightweight human-in-the-loop chunk-wise residual adapter is introduced that uses intervention-driven rewards while inheriting the offline critic. The central empirical claim is that this bridge yields a 33% absolute gain in average success rate and up to 43% better unseen-object generalization across five complex real-world dexterous tasks relative to pure imitation learning and decoupled RL baselines.
Significance. If the performance claims are substantiated by properly documented experiments, the work would be significant for the robotics community: it offers a concrete, hardware-aware recipe for safely adapting high-dimensional VLA policies in the real world without full online RL exploration, directly addressing temporal inconsistency and sample-efficiency barriers that currently limit dexterous manipulation.
major comments (2)
- [Abstract] (and the experimental evaluation section that must support it): The 33% and 43% performance figures are asserted without any description of task definitions, number of trials per condition, baseline implementations and hyper-parameters, statistical tests, variance reporting, or failure-mode analysis. Because these elements are load-bearing for the central claim that the offline-to-online bridge (rather than human corrections alone) produces the gains, the manuscript cannot be evaluated in its current form.
- The weakest assumption identified in the design—that the offline critic, once conditioned on VLM tokens and action chunks, continues to supply non-redundant value estimates after deployment and human residual corrections—is never tested or ablated. No experiment isolates the contribution of the inherited critic versus the human-in-the-loop residuals themselves.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments identify key areas where additional detail and analysis are needed to strengthen the presentation of our results. We address each point below and will revise the manuscript accordingly.
- Referee: [Abstract] (and the experimental evaluation section that must support it): The 33% and 43% performance figures are asserted without any description of task definitions, number of trials per condition, baseline implementations and hyper-parameters, statistical tests, variance reporting, or failure-mode analysis. Because these elements are load-bearing for the central claim that the offline-to-online bridge (rather than human corrections alone) produces the gains, the manuscript cannot be evaluated in its current form.
Authors: We agree that the abstract and experimental evaluation section require expanded reporting to allow proper evaluation of the claims. In the revised manuscript we will update both sections to explicitly define the five dexterous tasks, report the number of trials per condition (20 trials), detail baseline implementations and hyper-parameters, include statistical tests (paired t-tests with p-values), report variance as mean ± standard deviation, and add a failure-mode analysis. These additions will clarify the contribution of the offline-to-online bridge relative to human corrections.
Revision: yes
- Referee: The weakest assumption identified in the design (that the offline critic, once conditioned on VLM tokens and action chunks, continues to supply non-redundant value estimates after deployment and human residual corrections) is never tested or ablated. No experiment isolates the contribution of the inherited critic versus the human-in-the-loop residuals themselves.
Authors: This observation is correct; the current manuscript does not contain an explicit ablation that disables the inherited critic during the online phase. In the revision we will add a targeted ablation study that compares full BORA against a variant that removes the critic while retaining the human-in-the-loop residual adapter, thereby isolating the critic's contribution to the reported gains.
Revision: yes
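The reporting and the ablation comparison promised in the two responses could be computed along these lines. This is a minimal sketch using only the Python standard library. Whether the authors pair across tasks or across trials is not stated, so pairing per-task success rates is an assumption, and every number below is a made-up placeholder, not a result from the paper.

```python
import math
import statistics


def success_rate(outcomes):
    """Fraction of successful trials; outcomes is a list of 0/1 values."""
    return sum(outcomes) / len(outcomes)


def mean_std(values):
    """Mean and sample standard deviation, for 'mean ± std' reporting."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(method_a, method_b):
    """Paired t statistic and degrees of freedom for matched measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-task success rates over five tasks (NOT from the paper).
full_method = [0.80, 0.70, 0.90, 0.60, 0.75]
no_critic = [0.60, 0.55, 0.70, 0.50, 0.60]

mean, std = mean_std(full_method)
t_stat, dof = paired_t_statistic(full_method, no_critic)
print(f"full method: {mean:.2f} ± {std:.2f}")
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; a statistics package would normally handle that step.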
Circularity Check
No significant circularity in derivation chain
The manuscript presents a methodological framework (offline critic on VLM tokens + action chunks, followed by frozen-base HiL residual adaptation) without any equations, derivations, fitted parameters, or self-citations that function as load-bearing premises. Performance claims rest on empirical results across five tasks rather than on a claimed first-principles reduction. No step matches any of the enumerated circularity patterns; the design is described as an engineering choice whose validity is tested externally.