arxiv: 2606.09337 · v3 · pith:OD4IJDSWnew · submitted 2026-06-08 · 💻 cs.RO

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Pith reviewed 2026-06-27 16:28 UTC · model grok-4.3

classification 💻 cs.RO
keywords Tactile feedback · Online reinforcement learning · Vision-language-action · Contact-rich manipulation · Robotic manipulation · Policy refinement · Intervention censoring · Wrench prediction

The pith

TORL-VLA adds an online RL module and intervention-censored critic to a tactile-aware VLA so robots can refine contact forces during long-horizon tasks when conditions change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that keeps a vision-language-action model as the base policy but augments it with real-time tactile guidance and reinforcement learning updates. A wrench-aware VLA first takes tactile-derived wrench input and outputs reference actions together with predicted future wrench (force and torque) sequences. A lightweight online RL module then refines those reference actions during execution. An intervention-censored critic keeps successes that occur after a human takes over from being credited to the policy actions that preceded the intervention. Real-robot trials on latch manipulation, coffee-cup placement, and egg handling report higher subtask and full-task success rates and better time-bounded execution efficiency than strong baselines.
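One common way for an RL module to adjust a reference action is to add a bounded residual on top of it. The sketch below assumes that form purely for illustration; the source does not state how the refinement is parameterized, and the name `refine_action`, the bound `max_delta`, and the normalized action range are hypothetical.

```python
from typing import List, Sequence


def refine_action(reference: Sequence[float],
                  residual: Sequence[float],
                  max_delta: float = 0.05,
                  low: float = -1.0,
                  high: float = 1.0) -> List[float]:
    """Add a clipped residual to a reference action (assumed residual form)."""
    if len(reference) != len(residual):
        raise ValueError("reference and residual must have the same length")
    refined = []
    for ref, delta in zip(reference, residual):
        # Bound the correction so the refined action stays near the reference.
        delta = max(-max_delta, min(max_delta, delta))
        # Keep the result inside the (assumed) normalized action range.
        refined.append(max(low, min(high, ref + delta)))
    return refined


# Hypothetical 3-D action: the second and third residuals exceed the bound.
print(refine_action([0.20, -0.98, 0.50], [0.01, -0.10, 0.30]))
```

Bounding the residual keeps the frozen VLA in charge of the overall motion while leaving room for small contact-level corrections, which is one reason a residual form is a plausible reading of "lightweight".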

Core claim

TORL-VLA couples a tactile-derived wrench-aware VLA that predicts reference actions and future wrench sequences with a lightweight online RL module that refines the reference actions, stabilized by an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention.

What carries the argument

The intervention-censored critic, which blocks credit for post-intervention success from flowing back to earlier policy actions, paired with the wrench-aware VLA, which supplies reference actions and predicted wrench sequences to the online RL module.
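The credit-assignment idea can be illustrated with discounted returns that are cut at the point where a human takes over. This is a minimal sketch under stated assumptions, not the paper's critic: the source gives no update rule, and the function name, the per-step `is_human` flag, and the choice to reset the return at the takeover boundary are all hypothetical.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float],
                       is_human: Sequence[bool],
                       gamma: float = 0.99,
                       censor: bool = True) -> List[float]:
    """Per-step discounted returns, optionally cut at human takeovers."""
    if len(rewards) != len(is_human):
        raise ValueError("rewards and is_human must have the same length")
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # A takeover boundary: step t is policy-generated, step t+1 is human.
        takeover_next = (t + 1 < len(rewards)
                         and is_human[t + 1] and not is_human[t])
        if censor and takeover_next:
            # Drop everything that happened after the human took over.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Two policy steps, then a human finishes the task and earns the reward.
rewards = [0.0, 0.0, 0.0, 1.0]
is_human = [False, False, True, True]
print(discounted_returns(rewards, is_human, gamma=0.5, censor=False))
print(discounted_returns(rewards, is_human, gamma=0.5, censor=True))
```

Without censoring, the two policy steps inherit part of the success reward that the human earned; with censoring, their return is zero while the human steps keep theirs.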

If this is right

  • Success rates rise at both the subtask level and the full-task level on latch manipulation, coffee-cup placement, and egg handling.
  • Time-bounded execution efficiency improves because the policy reduces inappropriate contact forces and inefficient retries.
  • The system performs online adaptation when contact conditions move outside the original training distribution.
  • Mixed human-intervention and policy-generated data can be used for stable online learning without corrupting the value estimates.
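The time-bounded efficiency in the second bullet can be made concrete as the number of successful completions that fit inside a fixed time budget. The sketch below assumes that definition and assumes trials run back to back; the function name, the `(duration, succeeded)` log format, and the example durations are hypothetical and not taken from the paper.

```python
from typing import Iterable, Tuple


def success_rate_and_throughput(trials: Iterable[Tuple[float, bool]],
                                budget_s: float = 3600.0) -> Tuple[float, int]:
    """Return (success rate, successes) over trials that finish in the budget.

    `trials` holds (duration_seconds, succeeded) pairs in execution order.
    """
    elapsed = 0.0
    finished = 0
    successes = 0
    for duration, succeeded in trials:
        if elapsed + duration > budget_s:
            break  # this trial would not finish inside the time budget
        elapsed += duration
        finished += 1
        successes += int(succeeded)
    rate = successes / finished if finished else 0.0
    return rate, successes


# Hypothetical log: the fourth trial does not fit into a 60-minute budget.
log = [(1200.0, True), (1500.0, False), (800.0, True), (600.0, True)]
print(success_rate_and_throughput(log))
```

Counting successes per budget penalizes slow retries as well as outright failures, which is why it can move independently of the plain success rate.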

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same censoring idea could be applied to other hybrid human-robot data streams where credit must be isolated to autonomous actions.
  • Wrench prediction might serve as a general interface for incorporating other force-related sensors into VLA refinement loops.
  • The lightweight RL module suggests that full retraining of large VLAs may not be required for contact adaptation if reference actions are already available.
  • Extending the approach to multi-fingered or deformable-object tasks would test whether the wrench reference remains informative when geometry changes rapidly.

Load-bearing premise

The wrench predictions from the tactile VLA stay accurate enough to serve as useful references, and the censoring mechanism correctly avoids crediting post-intervention outcomes to preceding policy actions.
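One way to probe the first half of this premise is to compare predicted and measured wrench sequences axis by axis. The sketch below is a generic per-axis RMSE, assuming each timestep is a 6D wrench ordered as (Fx, Fy, Fz, Tx, Ty, Tz); the function name, the axis ordering, and the example values are hypothetical, and the paper may report accuracy differently.

```python
import math
from typing import List, Sequence


def per_axis_rmse(predicted: Sequence[Sequence[float]],
                  measured: Sequence[Sequence[float]]) -> List[float]:
    """Root-mean-square error per wrench axis over a prediction horizon."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("sequences must be non-empty and equally long")
    n_axes = len(predicted[0])
    squared_error = [0.0] * n_axes
    for p_row, m_row in zip(predicted, measured):
        if len(p_row) != n_axes or len(m_row) != n_axes:
            raise ValueError("every timestep must have the same number of axes")
        for k in range(n_axes):
            squared_error[k] += (p_row[k] - m_row[k]) ** 2
    return [math.sqrt(s / len(predicted)) for s in squared_error]


# Hypothetical two-step horizon with an error on the Fx axis only.
pred = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
meas = [[0.0] * 6, [0.0] * 6]
print(per_axis_rmse(pred, meas))
```

Reporting force and torque axes separately avoids mixing units in a single number.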

What would settle it

An ablation on the latch task would settle it. If removing the intervention-censored critic leaves value estimates and learning stability unchanged, or if the censored critic still credits pre-intervention policy actions with successes that occur only after a human takes over, the learning-stability claim would not hold.

Figures

Figures reproduced from arXiv: 2606.09337 by Baoxu Liu, Guozheng Li, Huaihang Zheng, Kai Ma, Shenglin Xu, Si Liu, Tian Xie, Xiangyu Wang, Yinian Mao, Yiren Ma, Yi Yang.

Figure 1: Overview of the TORL-VLA framework. Stage I: Tactile-derived wrench sequences are fused after visual-language encoding through MoE routing to predict action references and future wrench sequences. Stage II: The frozen wrench-aware VLA provides vision tokens, reference actions, and predicted wrench; a lightweight actor-critic refines actions online with wrench conditioning and stage routing. Stage III: Ev…
Figure 2: Tactile-to-wrench mapping. Tactile readings are mapped to a 6D wrench representation for contact-aware manipulation.
Wrench-conditioned reference generation. The VLM encodes the image, language, and robot-state prefix tokens into hidden states $P_t \in \mathbb{R}^{N_p \times d_{vl}}$, where $N_p$ is the number of prefix tokens and $d_{vl}$ is the hidden dimension of the VLM stream [26]. At flow-matching interpolation time $s \in [0, 1]$, the a…
Figure 3: Experimental platform. Real-robot latch-box setup with multi-view vision and dual-fingertip tactile sensing.
We evaluate TORL-VLA on contact-rich manipulation tasks to answer three questions: (i) whether wrench-guided online refinement improves subtask and full-task success; (ii) whether the wrench-aware reference model provides stronger action-and-wrench references than adapted physical-feedback VLA base…
Figure 4: Full-task reliability and throughput. Success rate and 60-min throughput are compared across methods.
Baselines and Fairness. We compare TORL-VLA with π0.5, TA-VLA, ForceVLA, TORL-VLA w/o RL, and RLT under the same robot platform, action representation, demonstrations, execution protocol, and evaluation trials. TA-VLA and ForceVLA are reimplemented on the same π0.5 backbone and adapted to the same dual-…
Figure 5: Task visualization. Representative contact-rich subtasks in latch-box manipulation, including coffee-cup placement, latch locking, and egg placement.
Figure 6: Online adaptation on Latch. Left: throughput measured per 10 minutes as online data increases. Right: success rate at each data checkpoint.
direct wrench-token bypass preserves fine-grained contact information. In contrast, removing WHT or FWP has smaller effects, indicating that temporal wrench encoding and future wrench prediction mainly complement the fusion pathway. The Full w/o RL variant achieves th…
Figure 7: Implementation details of the wrench-aware VLA reference model.
camera views (wrist, fisheye, and global), a language instruction, robot proprioception, and tactile readings from two piezoelectric pads mounted on the inner surfaces of the gripper fingers. As described in Appendix A.2.2, the tactile readings are converted into the current measured wrench $w_t$, and the recent wrench window is represented in a…
Figure 8: Stage estimator architecture. Policy-side features are projected, temporally aggregated, and used to predict a stage label and confidence for route filtering.
The route mapping separates full-task execution into base execution and local contact-window refinement. Labels outside predefined contact windows are mapped to the base route, where the frozen VLA reference chunk is executed directly. Only label…
Figure 9: Detailed alignment of predicted stages against ground truth over full-task execution. The visualization trace shows chunk-level stage estimates aligned with sequential camera observations. Minor discrepancies mainly appear at continuous phase boundaries, while the physical manipulation states in GT and ER remain consistent across stable intervals.
than incorrect recognition of stable task stages. In deplo…
Figure 10: Measured contact evolution in the cup-insertion task. The wrench curves are measured…
Figure 11: Detailed six-dimensional future-wrench prediction on a latch-locking segment. The fig…
Figure 12: Measured contact evolution in the latch-locking task. Each panel shows synchronized…
Figure 13: Future-wrench prediction examples across the latch, cup, and egg tasks. Each panel…
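The route filtering described in the Figure 8 text can be sketched as a simple gate: a chunk goes to local refinement only when the predicted stage lies inside a contact window, and otherwise the frozen VLA reference is executed directly. This is a minimal sketch under assumptions. The stage label names, the confidence threshold, and the rule that low-confidence predictions fall back to the base route are hypothetical, since the source text is truncated before it states the exact rule.

```python
from typing import FrozenSet

# Hypothetical stage labels that count as contact windows.
CONTACT_STAGES: FrozenSet[str] = frozenset(
    {"latch_lock", "cup_place", "egg_place"}
)


def select_route(stage_label: str,
                 confidence: float,
                 min_confidence: float = 0.8,
                 contact_stages: FrozenSet[str] = CONTACT_STAGES) -> str:
    """Return "refine" for confident contact-window stages, else "base"."""
    if stage_label in contact_stages and confidence >= min_confidence:
        # Inside a contact window: let the online RL module refine the chunk.
        return "refine"
    # Outside contact windows, or when unsure: run the frozen reference chunk.
    return "base"


print(select_route("latch_lock", 0.93))
print(select_route("latch_lock", 0.55))
print(select_route("transport", 0.99))
```

Falling back to the base route when confidence is low is a conservative choice: a misrouted chunk then costs only a missed refinement rather than an unwanted exploratory action.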
Original abstract

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines. Project page: https://torl-vla.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TORL-VLA, a framework that augments vision-language-action (VLA) models with tactile-derived wrench predictions to generate reference actions, then refines those actions via a lightweight online RL module for contact-rich manipulation. A key component is an intervention-censored critic intended to stabilize learning when human interventions occur during rollouts. Real-robot experiments on long-horizon tasks (latch manipulation, coffee-cup placement, egg handling) are reported to yield higher subtask and full-task success rates plus improved time-bounded efficiency relative to strong baselines.

Significance. If the claimed performance gains are reproducible and the credit-assignment mechanism is shown to function as described, the work would offer a concrete route to online adaptation of VLAs in contact-rich settings where offline policies fail under distribution shift. The integration of wrench-sequence prediction with an intervention-aware critic addresses a practical gap, though the absence of quantitative diagnostics for the critic limits immediate assessment of its contribution.

major comments (2)
  1. [Method (online RL module and critic)] The intervention-censored critic is presented as the mechanism that prevents post-intervention success signals from being attributed to preceding policy actions, yet the manuscript supplies no derivation, pseudocode, or value-function diagnostics demonstrating that censoring is correctly implemented and sufficient when interventions occur mid-subtask. This mechanism is load-bearing for attributing the reported efficiency and success-rate gains to the online RL module rather than to human corrections.
  2. [Experiments] The abstract states that real-robot experiments demonstrate improvements over strong baselines on three long-horizon tasks, but no quantitative results, baseline details, success-rate tables, or statistical tests are visible in the provided text. Without these, it is impossible to evaluate whether the data support the central empirical claim.
minor comments (2)
  1. [Method] Notation for the wrench-aware VLA outputs (reference actions versus predicted wrench sequences) should be defined explicitly with consistent symbols across text and any equations.
  2. [Conclusion] The project page URL is given but the manuscript does not indicate whether code, trained models, or raw experimental logs will be released, which would strengthen reproducibility claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify two areas where additional clarity will strengthen the manuscript. We respond to each point below and will incorporate revisions as indicated.

read point-by-point responses
  1. Referee: [Method (online RL module and critic)] The intervention-censored critic is presented as the mechanism that prevents post-intervention success signals from being attributed to preceding policy actions, yet the manuscript supplies no derivation, pseudocode, or value-function diagnostics demonstrating that censoring is correctly implemented and sufficient when interventions occur mid-subtask. This mechanism is load-bearing for attributing the reported efficiency and success-rate gains to the online RL module rather than to human corrections.

    Authors: We agree that the current manuscript provides only a high-level description of the intervention-censored critic without a formal derivation, pseudocode, or supporting diagnostics. In the revised version we will add: (i) the modified value update rule that censors post-intervention rewards, (ii) pseudocode for the critic training loop, and (iii) diagnostic plots or ablation numbers showing value estimates with and without censoring on mid-subtask interventions. These additions will make the credit-assignment argument explicit and allow readers to verify that the reported gains are attributable to the online RL module. revision: yes

  2. Referee: [Experiments] The abstract states that real-robot experiments demonstrate improvements over strong baselines on three long-horizon tasks, but no quantitative results, baseline details, success-rate tables, or statistical tests are visible in the provided text. Without these, it is impossible to evaluate whether the data support the central empirical claim.

    Authors: The full manuscript contains Section 5 with the requested quantitative material: success-rate tables (subtask and full-task) for latch manipulation, coffee-cup placement, and egg handling; explicit baseline descriptions (VLA-only, standard RL, and ablations); means and standard deviations over repeated trials; and statistical significance tests. We will revise the submission to ensure these tables and statistical details are referenced directly from the abstract and appear in the main body without relying on supplementary material. revision: yes
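Success rates from a limited number of real-robot trials are easier to compare when each one carries an interval. The sketch below computes a Wilson score interval for a binomial success rate; it is offered as one standard option, not as the test the authors used, and the trial counts in the example are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 18 successes in 20 trials.
low, high = wilson_interval(18, 20)
print(f"{low:.3f} to {high:.3f}")
```

The Wilson interval behaves sensibly near 0% and 100%, where the simple normal approximation can extend outside the valid range.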

Circularity Check

0 steps flagged

No derivation chain present; claims rest on empirical robot experiments.

full rationale

The paper describes a framework (tactile-derived VLA + online RL + intervention-censored critic) and validates it via real-robot trials on latch, cup, and egg tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or method summary. The critic is introduced as a design choice to handle mixed data, but its correctness is not derived from prior self-work; performance gains are attributed to experimental outcomes rather than any self-referential reduction. This is the common case of an applied robotics paper whose central claims are externally falsifiable via replication on hardware.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5776 in / 1011 out tokens · 22466 ms · 2026-06-27T16:28:38.183733+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 1 canonical work page

  [1] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  [2] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–

  [3] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. L. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  [4] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In Proceedings of the 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learni...

  [5] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  [6] J. Huang, S. Wang, F. Lin, Y. Hu, C. Wen, and Y. Gao. Tactile-VLA: Unlocking vision-language-action model’s physical knowledge for tactile generalization. arXiv preprint arXiv:2507.09160, 2025

  [7] P. Hao, C. Zhang, D. Li, X. Cao, X. Hao, S. Cui, and S. Wang. TLA: Tactile-language-action model for contact-rich manipulation. Robot Learning, 3(1):17–18, 2026

  [8] C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang. VTLA: Vision-tactile-language-action model with preference learning for insertion manipulation. arXiv preprint arXiv:2505.09577, 2025

  [9] J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y. Song, P. Cai, et al. ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. Advances in Neural Information Processing Systems, 38:93409–93439, 2026

  [10] Y. Li, H. Jiang, J. Xia, H. Zhang, J. Du, Y. Zhou, J. Zeng, C. Hao, J. Ren, Q. Yu, et al. ForceVLA2: Unleashing hybrid force-position control with force awareness for contact-rich manipulation. arXiv preprint arXiv:2603.15169, 2026

  [11] Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao. TA-VLA: Elucidating the design space of torque-aware vision-language-action models. arXiv preprint arXiv:2509.07962, 2025

  [12] Z. Cheng, Y. Zhang, W. Zhang, H. Li, K. Wang, L. Song, and H. Zhang. OmniVTLA: Vision-tactile-language-action model with semantic-aligned tactile sensing. arXiv preprint arXiv:2508.08706, 2025

  [13] J. Bi, K. Y. Ma, C. Hao, M. S. Zheng, and H. Soh. VLA-Touch: Enhancing vision-language-action model with dual-level tactile feedback. IEEE Robotics and Automation Letters, 2026

  [14] R. Zhao, W. Wang, Y. Ma, X. Li, F. E. H. Tay, M. H. Ang Jr., and H. Zhu. FD-VLA: Force-distilled vision-language-action model for contact-rich manipulation. arXiv preprint arXiv:2602.02142, 2026

  [15] K. Gubernatorov, M. Sannikov, I. Mikhalchuk, E. Kuznetsov, M. Artemov, O. F. Ouwatobi, M. Fernando, A. Asanov, Z. Guo, and D. Tsetserukou. HapticVLA: Contact-rich manipulation via vision-language-action model without inference-time tactile sensing. arXiv preprint arXiv:2603.15257, 2026

  [16] P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al. π*0.6: A VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025

  [17] Y. Chen, S. Tian, S. Liu, Y. Zhou, H. Li, and D. Zhao. ConRFT: A reinforced fine-tuning method for VLA models via consistency policy. In Proceedings of Robotics: Science and Systems, 2025. doi:10.15607/RSS.2025.XXI.019

  [18] X. Yuan, T. Mu, S. Tao, Y. Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model. In The Thirteenth International Conference on Learning Representations, 2025

  [19] W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Z. Luo, Y. Xie, F. Hu, L. Fan, G. Shi, and Y. Zhu. Self-improving vision-language-action models with data generation via residual RL. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=eUGoqrZ6Ea

  [20] Y. Li, X. Ma, J. Xu, Y. Cui, Z. Cui, Z. Han, L. Huang, T. Kong, Y. Liu, H. Niu, et al. GR-RL: Going dexterous and precise for long-horizon robotic manipulation. arXiv preprint arXiv:2512.01801, 2025

  [21] A. Wagenmaker, Y. Zhang, M. Nakamoto, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. In 9th Annual Conference on Robot Learning, 2025

  [22] C. Xu, J. T. Springenberg, M. Equi, A. Amin, A. Esmail, S. Levine, and L. Ke. RL token: Bootstrapping online RL with vision-language-action models. arXiv preprint arXiv:2604.23073, 2026

  [23] J. Luo, Z. Hu, C. Xu, Y. L. Tan, J. Berg, A. Sharma, S. Schaal, C. Finn, A. Gupta, and S. Levine. SERL: A software suite for sample-efficient robotic reinforcement learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16961–16969. IEEE, 2024

  [24] K. Lei, H. Li, D. Yu, Z. Wei, L. Guo, Z. Jiang, Z. Wang, S. Liang, and H. Xu. RL-100: Performant robotic manipulation with real-world reinforcement learning. arXiv preprint arXiv:2510.14830, 2025

  [25] J. Luo, C. Xu, J. Wu, and S. Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Science Robotics, 10(105):eads5033, 2025

  [26] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  [27] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

  [28] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  [29] D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. Advances in Neural Information Processing Systems, 38:102867–102888, 2026

  [30] A. J. Hancock, X. Wu, L. Zha, O. Russakovsky, and A. Majumdar. Actions as language: Fine-tuning VLMs into VLAs without catastrophic forgetting. arXiv preprint arXiv:2509.22195, 2025

  [31] C. Miao, T. Chang, M. Wu, H. Xu, C. Li, M. Li, and X. Wang. Fedvla: Federated vision-language-action learning with dual gating mixture-of-experts for robotic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6904–6913, 2025

  [32] W. Shen, Y. Liu, Y. Wu, Z. Liang, S. Gu, D. Wang, T. Nian, L. Xu, Y. Qin, J. Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025

  [33] Z. Du, B. Liu, Y. Liang, Y. Shen, H. Cao, X. Zheng, Z. Feng, Z. Wu, J. Yang, and Y.-G. Jiang. Himoe-vla: Hierarchical mixture-of-experts for generalist vision-language-action policies. arXiv preprint arXiv:2512.05693, 2025

  [34] Y. Li, P. Tang, W. Zhang, C. Zhu, Y. Duan, W. Shi, X. Zhang, Z. Yang, J. Ji, and Y. Zhang. Favla: A force-adaptive fast-slow vla model for contact-rich robotic manipulation. arXiv preprint arXiv:2602.23648, 2026

  [35] G. Ye, Z. Zhang, X. Zhao, S. Wu, H. Lu, S. Lu, and H. Liu. Learning to feel the future: Dreamtacvla for contact-rich manipulation. arXiv preprint arXiv:2512.23864, 2025

  [36] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  [37] S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018

  [38] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

A Appendix

This appendix provides implementation and experimental details omitted from the main text due to space constraints. App...