Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Pith reviewed 2026-05-09 19:29 UTC · model grok-4.3
The pith
A single generalist policy trained across a robot fleet reaches a 95% average success rate as deployment experience accumulates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LWD enables a pretrained VLA policy to close the loop with fleet-collected experience, using DIVL and QAM to extract improved policies from heterogeneous, sparse-reward data. The result is a single generalist policy that reaches a 95% average success rate, with the largest gains on long-horizon tasks.
What carries the argument
The LWD framework, which uses Distributional Implicit Value Learning (DIVL) for robust value estimation from fleet data and Q-learning via Adjoint Matching (QAM) to extract policies from flow-based VLA action generators.
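The paper's exact DIVL objective is not reproduced on this page. As a point of reference, the sketch below shows the asymmetric (expectile) value regression that implicit value learning methods such as IQL [23] build on; DIVL replaces the scalar value with a categorical distribution (see excerpt [64] below). The function name and its defaults are illustrative assumptions, not the paper's code.

```python
# Minimal sketch, assuming an IQL-style expectile loss; not the paper's code.
import torch

def expectile_value_loss(v_pred: torch.Tensor,
                         q_target: torch.Tensor,
                         tau: float = 0.9) -> torch.Tensor:
    """Asymmetric L2 loss L_2^tau(u) = |tau - 1{u < 0}| * u^2 with u = Q - V.

    tau > 0.5 penalizes underestimation (Q > V) more than overestimation,
    so V tracks an optimistic statistic of dataset Q-values without ever
    querying out-of-distribution actions.
    """
    u = q_target - v_pred
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()
```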
If this is right
- The policy continues to improve as more fleet experience is collected.
- Largest improvements occur on long-horizon tasks lasting 3–5 minutes.
- Fleet data from 16 robots across semantic grocery restocking and other tasks suffices for high performance.
- Offline pretraining alone is insufficient; online adaptation via deployment data is key.
Where Pith is reading between the lines
- This approach implies that robot fleets can collectively refine policies without relying on large, centrally curated demonstration datasets.
- It opens the possibility of deploying generalist policies that adapt to new environments through ongoing fleet operations.
- Similar methods could extend to other embodied AI systems where multiple agents gather experience in parallel.
- The framework may lower barriers to real-world robot learning by leveraging existing deployment fleets.
Load-bearing premise
The combination of DIVL for robust value estimation and QAM for policy extraction can stabilize learning from heterogeneous, sparse-reward data collected across a robot fleet without additional mechanisms for handling noise or bias in human interventions.
What would settle it
Deploy the LWD system on the 16-robot fleet for the eight tasks and observe whether the generalist policy's success rate fails to reach or sustain 95% average, or shows no differential gains on the long-horizon tasks.
Original abstract
Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3–5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Learning While Deploying (LWD), a fleet-scale offline-to-online RL framework for continual post-training of pretrained Vision-Language-Action (VLA) policies. It collects autonomous rollouts and human interventions across 16 dual-arm robots, then applies Distributional Implicit Value Learning (DIVL) for robust value estimation and Q-learning via Adjoint Matching (QAM) for policy extraction from heterogeneous sparse-reward data. Validation on eight real-world manipulation tasks (including semantic grocery restocking and 3–5 minute long-horizon tasks) shows a single generalist policy improving to a 95% average success rate, with the largest gains on long-horizon tasks.
Significance. If substantiated, the work would be significant for enabling scalable continual improvement of generalist robot policies in real deployments, where distribution shifts and long-tail failures cannot be captured by static pretraining data alone. It provides a concrete mechanism to close the deployment-experience-improvement loop at fleet scale and demonstrates gains on challenging long-horizon tasks.
Major comments (2)
- Abstract: The headline result of a single generalist policy reaching 95% average success (with the largest gains on long-horizon tasks) is presented without baselines (e.g., the pretrained VLA alone or standard offline RL), error bars, ablation studies, data-volume statistics, or exclusion criteria. This leaves the central claim, that the improvement is attributable to LWD, unverifiable, even though it is load-bearing for the contribution.
- Method (DIVL+QAM description): The assertion that DIVL for value estimation combined with QAM for policy extraction in flow-based VLAs is sufficient to stabilize learning from noisy, biased human interventions and autonomous rollouts across a heterogeneous fleet lacks any analysis of robustness to distribution shifts or bias in the collected data. No experiments isolate whether this combination succeeds without additional mechanisms for noise handling, which directly underpins the claim that the framework works on sparse-reward fleet data.
Minor comments (2)
- The abstract states 'reaching an average success rate of 95%' but does not specify whether this is across all tasks, per-task averages, or final checkpoint only; clarifying the aggregation and reporting per-task breakdowns would improve interpretability.
- Task descriptions mention 'semantic grocery restocking' and '3--5 minute long-horizon tasks' without enumerating the exact eight tasks or providing success criteria definitions; adding a table of task specifications would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments highlight important areas for clarifying the strength of our claims and the robustness of the proposed methods. We address each major comment below and indicate the revisions we will make.
Point-by-point responses
- Referee: Abstract: The headline result of a single generalist policy reaching 95% average success (with the largest gains on long-horizon tasks) is presented without baselines (e.g., the pretrained VLA alone or standard offline RL), error bars, ablation studies, data-volume statistics, or exclusion criteria. This leaves the central claim, that the improvement is attributable to LWD, unverifiable, even though it is load-bearing for the contribution.
  Authors: We agree that the abstract would be strengthened by explicitly referencing the key baselines and supporting statistics. In the revised manuscript we will update the abstract to note the comparison against the pretrained VLA policy, the observed average improvement to 95% success, and the presence of error bars and data-volume details. The full set of baselines, ablations, data-volume statistics, and exclusion criteria already appear in Section 5 and the supplementary material; the abstract revision will direct readers to these sections for verification. Revision: yes.
- Referee: Method (DIVL+QAM description): The assertion that DIVL for value estimation combined with QAM for policy extraction in flow-based VLAs is sufficient to stabilize learning from noisy, biased human interventions and autonomous rollouts across a heterogeneous fleet lacks any analysis of robustness to distribution shifts or bias in the collected data. No experiments isolate whether this combination succeeds without additional mechanisms for noise handling, which directly underpins the claim that the framework works on sparse-reward fleet data.
  Authors: We acknowledge that the current manuscript relies primarily on end-to-end real-world results rather than isolated ablation studies on data bias and distribution shift. While the empirical gains on long-horizon tasks with noisy fleet data provide supporting evidence, we agree that a dedicated robustness analysis would strengthen the methodological claim. In the revision we will add a new subsection that discusses the robustness properties of DIVL and QAM, including targeted ablations on subsets of the collected data that vary in noise level and shift severity. Revision: partial.
Circularity Check
No circularity in derivation chain; empirical claims rest on fleet experiments
Full rationale
The paper describes an empirical offline-to-online RL framework (LWD) that combines DIVL for value estimation with QAM for policy extraction, then reports success rates from deployment on 16 robots across eight tasks. No mathematical derivations, equations, or parameter-fitting steps presented in the abstract or framework description reduce by construction to their inputs. DIVL and QAM are invoked as stabilizing components without self-referential definitions or uniqueness theorems imported from the same authors. The central result (a policy improving to 95% success) is framed as an experimental outcome rather than a derived equality, so the claim chain is grounded in the reported fleet data rather than in circular derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
[1] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, et al., "RT-1: Robotics Transformer for real-world control at scale," arXiv preprint arXiv:2212.06817, 2022.
[2] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, et al., "RT-2: Vision-language-action models transfer web knowledge to robotic control," in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183.
[3] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, et al., "Octo: An open-source generalist robot policy," arXiv preprint arXiv:2405.12213, 2024.
[4] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, et al., "OpenVLA: An open-source vision-language-action model," arXiv preprint arXiv:2406.09246, 2024.
[5] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, et al., "$\pi_0$: A vision-language-action flow model for general robot control," arXiv preprint arXiv:2410.24164, 2024.
[6] K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, et al., "$\pi_{0.5}$: A vision-language-action model with open-world generalization," in 9th Annual Conference on Robot Learning, 2025.
[7] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer, "HG-DAgger: Interactive imitation learning with human experts," in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8077–8083.
[8] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[9] S. Fujimoto, H. Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. M. O. Heess, T. Erez, Y. Tassa, D. Silver, and D. P. Wierstra, "Continuous control with deep reinforcement learning," U.S. Patent 10,776,692, Sep. 15, 2020.
[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[12] K. Lei, H. Li, D. Yu, Z. Wei, L. Guo, Z. Jiang, Z. Wang, S. Liang, et al., "RL-100: Performant robotic manipulation with real-world reinforcement learning," arXiv preprint arXiv:2510.14830, 2025.
[13] Y. Li, X. Ma, J. Xu, Y. Cui, Z. Cui, Z. Han, L. Huang, T. Kong, et al., "GR-RL: Going dexterous and precise for long-horizon robotic manipulation," arXiv preprint arXiv:2512.01801, 2025.
[14] Y. Chen, S. Tian, S. Liu, Y. Zhou, H. Li, and D. Zhao, "ConRFT: A reinforced fine-tuning method for VLA models via consistency policy," arXiv preprint arXiv:2502.05450, 2025.
[15] A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, et al., "$\pi^{*}_{0.6}$: A VLA that learns from experience," arXiv preprint arXiv:2511.14759, 2025.
[16] J. Luo, Z. Hu, C. Xu, Y. L. Tan, J. Berg, A. Sharma, S. Schaal, C. Finn, et al., "SERL: A software suite for sample-efficient robotic reinforcement learning," in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 16961–16969.
[17] J. Luo, C. Xu, J. Wu, and S. Levine, "Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning," Science Robotics, vol. 10, no. 105, p. eads5033, 2025.
[18] G. Lu, W. Guo, C. Zhang, Y. Zhou, H. Jiang, Z. Gao, Y. Tang, and Z. Wang, "VLA-RL: Towards masterful and general robotic manipulation with scalable reinforcement learning," arXiv preprint arXiv:2505.18719, 2025.
[19] S. Tan, K. Dou, Y. Zhao, and P. Krähenbühl, "Interactive post-training for vision-language-action models," arXiv preprint arXiv:2505.17016, 2025.
[20] K. Chen, Z. Liu, T. Zhang, Z. Guo, S. Xu, H. Lin, H. Zang, Q. Zhang, et al., "$\pi$RL: Online RL fine-tuning for flow-based vision-language-action models," arXiv preprint arXiv:2510.25889, 2025.
[21] J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, et al., "Flow-GRPO: Training flow matching models via online RL," arXiv preprint arXiv:2505.05470, 2025.
[22] T. Zhang, C. Yu, S. Su, and Y. Wang, "ReinFlow: Fine-tuning flow matching policy with online reinforcement learning," arXiv preprint arXiv:2505.22094, 2025.
[23] I. Kostrikov, A. Nair, and S. Levine, "Offline reinforcement learning with implicit Q-learning," arXiv preprint arXiv:2110.06169, 2021.
[24] C. Domingo-Enrich, M. Drozdzal, B. Karrer, and R. T. Chen, "Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control," arXiv preprint arXiv:2409.08861, 2024.
[25] Q. Li and S. Levine, "Q-learning with adjoint matching," arXiv preprint arXiv:2601.14234, 2026.
[26] M. Nakamoto, S. Zhai, A. Singh, M. Sobol Mark, Y. Ma, C. Finn, A. Kumar, and S. Levine, "Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning," Advances in Neural Information Processing Systems, vol. 36, pp. 62244–62269, 2023.
[27]
[28] H. Zang, M. Wei, S. Xu, Y. Wu, Z. Guo, Y. Wang, H. Lin, L. Shi, et al., "RLinf-VLA: A unified and efficient framework for VLA+RL training," arXiv preprint arXiv:2510.06710, 2025.
[29] J. Liu, F. Gao, B. Wei, X. Chen, Q. Liao, Y. Wu, C. Yu, and Y. Wang, "What can RL bring to VLA generalization? An empirical study," arXiv preprint arXiv:2505.19789, 2025.
[30] C. Xu, Q. Li, J. Luo, and S. Levine, "RLDG: Robotic generalist policy distillation via reinforcement learning," arXiv preprint arXiv:2412.09858, 2024.
[31] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín, C. Wang, G. Levine, et al., "BEHAVIOR-1K: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation," in Proceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 205, 2023, pp. 80–93.
[32] T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, et al., "ManiSkill: Generalizable manipulation skill benchmark with large-scale demonstrations," arXiv preprint arXiv:2107.14483, 2021.
[33] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, "LIBERO: Benchmarking knowledge transfer for lifelong robot learning," Advances in Neural Information Processing Systems, vol. 36, pp. 44776–44791, 2023.
[34] Y. Mu, T. Chen, S. Peng, Z. Chen, Z. Gao, Y. Zou, L. Lin, Z. Xie, et al., "RoboTwin: Dual-arm robot benchmark with generative digital twins (early version)," in European Conference on Computer Vision. Springer, 2024, pp. 264–273.
[35] H. Zang, S. Yu, H. Lin, T. Zhou, Z. Huang, Z. Guo, X. Xu, J. Zhou, et al., "RLinf-USER: A unified and extensible system for real-world online policy learning in embodied AI," arXiv preprint arXiv:2602.07837, 2026.
[36] Z. Jiang, S. Zhou, Y. Jiang, Z. Huang, M. Wei, Y. Chen, T. Zhou, Z. Guo, et al., "WoVR: World models as reliable simulators for post-training VLA policies with RL," arXiv preprint arXiv:2602.13977, 2026.
[37] H. Li, Y. Zuo, J. Yu, Y. Zhang, Z. Yang, K. Zhang, X. Zhu, Y. Zhang, et al., "SimpleVLA-RL: Scaling VLA training via reinforcement learning," arXiv preprint arXiv:2509.09674, 2025.
[38] S. Park, Q. Li, and S. Levine, "Flow Q-learning," in Forty-second International Conference on Machine Learning, 2025.
[39] L. Kun, Z. He, C. Lu, K. Hu, Y. Gao, and H. Xu, "Uni-O4: Unifying online and offline deep reinforcement learning with multi-step on-policy optimization," in The Twelfth International Conference on Learning Representations, 2024.
[40] S. Lee, Y. Seo, K. Lee, P. Abbeel, and J. Shin, "Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble," in Conference on Robot Learning. PMLR, 2022, pp. 1702–1712.
[41] R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare, "Reincarnating reinforcement learning: Reusing prior computation to accelerate progress," Advances in Neural Information Processing Systems, vol. 35, pp. 28955–28971, 2022.
[42] P. J. Ball, L. Smith, I. Kostrikov, and S. Levine, "Efficient online reinforcement learning with offline data," in International Conference on Machine Learning. PMLR, 2023, pp. 1577–1594.
[43] A. Nair, A. Gupta, M. Dalal, and S. Levine, "AWAC: Accelerating online reinforcement learning with offline datasets," arXiv preprint arXiv:2006.09359, 2020.
[44] Y. Song, Y. Zhou, A. Sekhari, J. A. Bagnell, A. Krishnamurthy, and W. Sun, "Hybrid RL: Using both offline and online data can make RL efficient," arXiv preprint arXiv:2210.06718, 2022.
[45] A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine, "Steering your diffusion policy with latent space reinforcement learning," arXiv preprint arXiv:2506.15799, 2025.
[46] D. Kalashnikov, V. Vanhoucke, S. Levine, J. T. Springenberg, S. Bohez, K. Driessens, J. Schulman, M. Andrychowicz, et al., "QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation," in Proceedings of the 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87, 2018, pp. 651–673.
[47] D. Kalashnikov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman, "MT-Opt: Continuous multi-task robotic reinforcement learning at scale," arXiv preprint arXiv:2104.08212, 2021.
[48] K.-H. Lee, T. Xiao, A. Li, P. Wohlhart, I. Fischer, and Y. Lu, "PI-QT-Opt: Predictive information improves multi-task robotic reinforcement learning at scale," in Conference on Robot Learning. PMLR, 2023, pp. 1696–1707.
[49] M. Pan, S. Feng, Q. Zhang, X. Li, J. Song, C. Qu, Y. Wang, C. Li, et al., "SOP: A scalable online post-training system for vision-language-action models," arXiv preprint arXiv:2601.03044, 2026.
[50] K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauzá, T. Davchev, Y. Zhou, et al., "RoboCat: A self-improving generalist agent for robotic manipulation," arXiv preprint arXiv:2306.11706, 2023.
[51] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, et al., "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 1407–1416.
[52] A. Herzog, K. Rao, K. Hausman, Y. Lu, P. Wohlhart, M. Yan, J. Lin, M. G. Arenas, et al., "Deep RL at scale: Sorting waste in office buildings with a fleet of mobile manipulators," arXiv preprint arXiv:2305.03270, 2023.
[53] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, "Flow matching for generative modeling," in The Eleventh International Conference on Learning Representations, 2023.
[54] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 70, 2017, pp. 449–458.
[55] A. Kumar, R. Agarwal, X. Geng, G. Tucker, and S. Levine, "Offline Q-learning on diverse multi-task data both scales and generalizes," arXiv preprint arXiv:2211.15144, 2022.
[56] X. B. Peng, A. Kumar, G. Zhang, and S. Levine, "Advantage-weighted regression: Simple and scalable off-policy reinforcement learning," arXiv preprint arXiv:1910.00177, 2019.
[57] S. Zhang, W. Zhang, and Q. Gu, "Energy-weighted flow matching for offline reinforcement learning," in The Thirteenth International Conference on Learning Representations, 2025.
[58] Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, et al., "Gemma 3 technical report," 2025.
[59] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, "Sigmoid loss for language image pre-training," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11975–11986.
[60] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[61] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12179–12188.
[62] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in International Conference on Machine Learning. PMLR, 2023, pp. 19730–19742.
[63] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019.
Appendix A. Additional Method Details (excerpts)
[64] Discretization of the Distributional Value Model: We instantiate the distributional value model $V_\psi(s)$ with a fixed categorical support $\{V_i\}_{i=1}^{K}$ spanning $[v_{\min}, v_{\max}]$. In our real-robot experiments, we use $K = 201$ atoms over $[-0.1, 1.1]$. The value head predicts logits over this support, $p_\psi(i \mid s) = \mathrm{softmax}(V_\psi(s))_i$, $i \in \{1, \dots, K\}$ (Eq. 20). For each replay sample $(s, a$…
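A minimal sketch of the categorical value head this excerpt describes, using the reported settings (K = 201 atoms over [−0.1, 1.1]); the helper names and the use of PyTorch are assumptions, not the paper's implementation.

```python
import torch

K, V_MIN, V_MAX = 201, -0.1, 1.1
ATOMS = torch.linspace(V_MIN, V_MAX, K)  # fixed categorical support {V_i}

def value_distribution(logits: torch.Tensor) -> torch.Tensor:
    """p_psi(i|s) = softmax over the K support atoms (the excerpt's Eq. 20)."""
    return torch.softmax(logits, dim=-1)

def expected_value(logits: torch.Tensor) -> torch.Tensor:
    """Scalar value estimate: the mean of the categorical distribution."""
    return (value_distribution(logits) * ATOMS).sum(dim=-1)
```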
[65] Proof of the Distributional View of Asymmetric Value Estimation: We provide the proof of Proposition 1 stated in Section IV-A. The goal is to show that, under idealized conditions, direct asymmetric optimization over dataset action-values and the two-step procedure of first fitting the state-conditioned distribution of dataset Q-values and then extracting…
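The excerpt is truncated, so the statement below is a hedged reconstruction of what Proposition 1 plausibly asserts, in standard expectile notation (the symbols $\mu$, $L_2^\tau$, and the expectile operator are assumptions): the two procedures coincide because the minimizer of an asymmetric squared loss is, by definition, the $\tau$-expectile of the underlying distribution.

```latex
% Hedged reconstruction, not the paper's statement. With the asymmetric
% squared loss L_2^tau(u) = |tau - 1{u < 0}| u^2 and dataset policy mu:
\arg\min_{v}\ \mathbb{E}_{a \sim \mu(\cdot \mid s)}\!\left[ L_2^{\tau}\big(Q(s,a) - v\big) \right]
\;=\;
\operatorname{expectile}_{\tau}\!\Big(\mathrm{Law}\big(Q(s,a)\big),\ a \sim \mu(\cdot \mid s)\Big).
```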
[66] Analysis of Direct Backpropagation for Flow-Based Policy: Consider a flow-based policy that generates an action $x = x_1$ by integrating the vector field $\mathrm{d}x_t = f_\theta(x_t, t)\,\mathrm{d}t$ from $t = 0$ to $1$, starting from $x_0 \sim \mathcal{N}$. Writing $x_1 = x_1(x_0; \theta)$ for the terminal sample induced by the flow, the standard RL objective for reward fine-tuning is $J(\theta) = \mathbb{E}_{x_0 \sim \mathcal{N}}\big[ R\big(x_1(x_0; \theta)\big) \big]$ (Eq. 32), and a va…
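To make the objective concrete, here is a hedged sketch of the naive approach the excerpt analyzes: Euler-integrate the learned vector field, then backpropagate the reward through every integration step. `vector_field` and `reward` are illustrative stand-ins; adjoint matching [24, 25] exists precisely to avoid differentiating through this entire chain.

```python
import torch

def rollout_flow(vector_field, x0: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Euler discretization of dx_t = f_theta(x_t, t) dt from t = 0 to 1.
    Each step extends the autograd graph, so gradients of the terminal
    action flow back through the whole integration chain."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = torch.full(x.shape[:1], k * dt)
        x = x + vector_field(x, t) * dt
    return x

def naive_objective(vector_field, reward, x0: torch.Tensor) -> torch.Tensor:
    """J(theta) = E_{x0 ~ N}[ R(x_1(x0; theta)) ], estimated on a batch."""
    return reward(rollout_flow(vector_field, x0)).mean()
```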
[67] Offline Data: The offline buffer $\mathcal{B}_{\mathrm{off}}$ consists of three types of data: demonstration data collected by human experts, rollout data produced by historical policies during prior evaluations, and play data in which a human operator explores failure modes and edge cases. Demonstrations are successful trajectories, rollouts contain both successes and failures, and play data is treated as unsuccessful exploratory data…
[68] Training Hyperparameters: The policy emits action chunks with horizon $H = 30$. The policy is optimized with AdamW [63] using a base learning rate of $2 \times 10^{-5}$ and a cosine decay schedule. The value and critic networks are trained with Adam using a base learning rate of $5 \times 10^{-4}$, also with a cosine decay schedule. For temporal-difference backups, we use $\gamma = 0.9999$. D…
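A hedged sketch wiring up the reported optimizer settings; the parameter placeholders and the total step count `T_max` are illustrative assumptions, not values from the paper.

```python
import torch
from torch.optim import Adam, AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

HORIZON = 30      # action-chunk length H from the excerpt
GAMMA = 0.9999    # TD discount from the excerpt

# Placeholder parameters standing in for the policy / critic networks.
policy_params = [torch.nn.Parameter(torch.zeros(1))]
critic_params = [torch.nn.Parameter(torch.zeros(1))]

policy_opt = AdamW(policy_params, lr=2e-5)   # base LR from the excerpt
critic_opt = Adam(critic_params, lr=5e-4)    # base LR from the excerpt
policy_sched = CosineAnnealingLR(policy_opt, T_max=100_000)  # T_max assumed
critic_sched = CosineAnnealingLR(critic_opt, T_max=100_000)  # T_max assumed
```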
[69] Checkpoint Initialization: We first train an imitation-learning checkpoint by adapting the pretrained $\pi_{0.5}$ VLA policy on the demonstration data with behavior cloning. LWD (Offline) initializes its policy from this imitation-learning checkpoint, then trains the policy with the Adjoint Matching loss and trains the critic and distributional value model with…
[70] Reference Policy and Baseline Implementations: We obtain the reference policy by supervised fine-tuning [53] the pretrained $\pi_{0.5}$ VLA policy on 336.6 hours of demonstration data, as shown in Table IV. The model is trained with a flow-matching loss, where the interpolated noisy action $a_w$ is defined in Eq. (7). The objective is to train conditional vector fiel…
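Eq. (7) is not reproduced in this excerpt, so the sketch below uses the standard conditional flow-matching interpolation from [53] ($a_w = (1-w)a_0 + w a_1$ with a straight-line velocity target); the paper's exact convention and the `vector_field` signature are assumptions.

```python
import torch

def flow_matching_loss(vector_field, actions: torch.Tensor, obs) -> torch.Tensor:
    """Regress f_theta(a_w, w | obs) onto the straight-line velocity a_1 - a_0,
    with a_w = (1 - w) * a_0 + w * a_1, a_0 ~ N(0, I), w ~ U[0, 1]."""
    a1 = actions                                  # dataset action chunk
    a0 = torch.randn_like(a1)                     # noise sample
    w = torch.rand(a1.shape[0], *([1] * (a1.dim() - 1)))
    a_w = (1 - w) * a0 + w * a1                   # interpolated noisy action
    target = a1 - a0                              # conditional velocity target
    pred = vector_field(a_w, w.flatten(), obs)    # signature is an assumption
    return (pred - target).pow(2).mean()
```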
[71] Complete Value-Estimation Ablation Results: Table V reports the complete per-task results for the value-estimation ablation summarized in Section V. The comparison isolates the… [remainder of this excerpt is diagram residue: an actor fleet (Robot 1 … Robot N) with EdgeClient, ObjectStorage, MessageQueue, Coordinator, DRBReader, and CloudLearner components, labeled "Fleet Distribution & Coordination" and "Cloud Learner"]
[72] Complementary Qualitative Results of DIVL: Fig. 9 visualizes the predicted value distributions for the same episodes shown in Fig. 6. In the successful episode, the predicted distribution remains unimodal, with its mode steadily increasing from approximately 0.4 to 1.0 as the task progresses. In contrast, the failure episode exhibits only marginal mode…
[73] End-to-End Reliability: The system provides at-least-once end-to-end delivery for every episode produced on the actor side. (i) Object-storage uploads commit atomically (readers see either the fully uploaded payload or no object) and are retried until persisted. (ii) Episode metadata is committed via a transactional insert in the business service, then…
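A hedged sketch of the at-least-once pattern this excerpt describes; `storage.put_atomic`, `db.insert_episode`, the `IOError` failure mode, and the backoff policy are illustrative placeholders, not the paper's system API.

```python
import time

def publish_episode(storage, db, episode_id: str, payload: bytes) -> None:
    """At-least-once delivery: (i) retry the atomic object upload until it
    persists, so readers only ever see a fully uploaded payload; (ii) then
    commit the episode metadata transactionally."""
    delay = 1.0
    while True:
        try:
            storage.put_atomic(key=episode_id, data=payload)  # all-or-nothing
            break
        except IOError:
            time.sleep(delay)
            delay = min(delay * 2.0, 60.0)  # capped exponential backoff
    db.insert_episode(episode_id)           # transactional metadata commit
```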
[74] Operational Latency: We report the two end-to-end latencies that govern the tightness of the actor-learner loop: (i) episode-to-learner, the elapsed time from when an episode is produced on an actor to when it becomes available for the learner to sample; and (ii) model-to-actor, the elapsed time from when the learner publishes a new policy to when the actor… Table VI reports both on the same 8-hour, 16-actor run as the End-to-End Reliability subsection above.