FAR: Failure-Aware Retry for Test-Time Recovery and Continual Policy Improvement

Haoran Hao; Jeffrey Ichnowski; Jeff Schneider; Shahram Najam Syed

arxiv: 2607.01111 · v1 · pith:OOF5IUGWnew · submitted 2026-07-01 · 💻 cs.RO · cs.AI · cs.LG

Pith reviewed 2026-07-02 11:03 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords failure recovery · test-time adaptation · preference learning · robot manipulation · continual policy improvement · diffusion policy · autonomous retry

The pith

Robots recover from their own failures at test time by turning unsuccessful trajectories into preference data that steers the policy away from mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a robot policy can adapt at deployment time by constructing preference pairs from observed failures, using those pairs to update behavior, and adding small action perturbations on retry attempts to explore nearby options. Successful recoveries from this process are then folded back into training to improve the policy over successive deployments. This approach is shown to raise success rates without requiring human resets or intervention. A sympathetic reader would care because standard policies repeat errors on retry and many recovery methods depend on external help, so an autonomous alternative would make real-world deployment more practical and data-efficient.
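The retry loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `retry_with_adaptation`, `perturb`, `adapt`, `rollout`, `max_retries`, and `noise_scale` are hypothetical names, Gaussian noise is assumed for the perturbation, and the adaptation step is left as a caller-supplied function because its details are not given here.

```python
import random


def perturb(action, scale, rng):
    """Add zero-mean Gaussian noise to each action dimension (assumed noise model)."""
    return [a + rng.gauss(0.0, scale) for a in action]


def retry_with_adaptation(policy, adapt, rollout, max_retries=3, noise_scale=0.05, seed=0):
    """Hypothetical retry loop: roll out, and on failure adapt the policy and retry.

    policy:  callable obs -> action (list of floats)
    adapt:   callable (policy, failures) -> new policy; a stand-in for the
             preference-based update, which is not specified in this text
    rollout: callable act_fn -> (trajectory, success flag)
    """
    rng = random.Random(seed)
    failures, recoveries = [], []
    for attempt in range(max_retries + 1):
        # The first attempt runs the unperturbed policy; retries explore locally.
        scale = 0.0 if attempt == 0 else noise_scale
        current = policy  # bind this attempt's policy for the closure below
        trajectory, success = rollout(lambda obs: perturb(current(obs), scale, rng))
        if success:
            if attempt > 0:
                recoveries.append(trajectory)  # candidate data for later training
            return {"success": True, "attempts": attempt + 1,
                    "failures": failures, "recoveries": recoveries}
        failures.append(trajectory)
        policy = adapt(policy, failures)
    return {"success": False, "attempts": max_retries + 1,
            "failures": failures, "recoveries": recoveries}
```

The returned `recoveries` list is where successful retry trajectories would be collected before being folded back into training; how they are weighted or replayed is outside this sketch.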

Core claim

FAR enables test-time recovery and continual improvement by pairing Failure-Contrastive Preference Adaptation—which builds preference learning data directly from failure trajectories to steer the policy away from unsuccessful behaviors—with lightweight action perturbations during retries to encourage local exploration, then incorporating the resulting successful trajectories into a training loop.

What carries the argument

Failure-Contrastive Preference Adaptation that converts failure trajectories into preference pairs for steering the diffusion policy.
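One common way to turn a preference pair into a training signal is a logistic (Bradley-Terry style) loss on the log-likelihood margin between the preferred and the rejected sample, measured relative to a frozen reference policy. The sketch below illustrates only that generic form. The text here does not specify FAR's actual objective, and `preference_loss`, `beta`, the argument names, and the use of exact log-probabilities (which a diffusion policy would have to approximate) are all assumptions.

```python
import math


def preference_loss(logp_preferred, logp_failure,
                    ref_logp_preferred, ref_logp_failure, beta=1.0):
    """Generic logistic preference loss on one (preferred, failure) pair.

    Each argument is the log-probability of an action sequence under the
    current policy (logp_*) or under a frozen reference policy (ref_logp_*).
    """
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_failure - ref_logp_failure))
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the policy raises the likelihood of the preferred sample relative to the failure sample, which is the "steering away" behavior the text describes in words.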

If this is right

Success rates rise by an average of 17.6 percent over a standard diffusion policy in simulation and 11.7 percent in real-world manipulation.
Data efficiency improves under both reset budgets and timestep budgets during continual policy improvement.
Robots complete tasks autonomously by learning from failures instead of repeating them or requiring human intervention.
Robustness increases because the policy actively avoids previously observed failure modes on subsequent attempts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same failure-to-preference conversion could be tested on policy architectures other than diffusion models to check whether the gain is architecture-specific.
Over many deployments the accumulated recovery data might reduce the size of the initial offline dataset needed to reach a given performance level.
The method could be combined with existing safety filters to bound the risk introduced by the exploration perturbations.

Load-bearing premise

Failure trajectories supply unbiased preference signals that improve the policy without introducing new biases or needing extra task-specific tuning.

What would settle it

An experiment on the same tasks where the preference adaptation step produces no gain or a drop in success rate relative to plain retries would falsify the central claim.
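A comparison of this kind could be scored with a simple two-proportion z-test on success counts. The sketch below is one illustrative choice of test, not the paper's evaluation protocol; the function name `two_proportion_z` and any counts passed to it are hypothetical.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """One-sided two-proportion z-test for H1: success rate of A > success rate of B.

    Returns (z statistic, one-sided p-value) using the pooled-variance estimate.
    """
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if std_err == 0.0:
        # All trials succeeded or all failed in both arms: no detectable difference.
        return 0.0, 0.5
    z = (rate_a - rate_b) / std_err
    # P(Z >= z) for a standard normal, via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value
```

Here arm A would be retries with preference adaptation and arm B plain retries on the same tasks; a large p-value, or a negative z, would be the outcome that counts against the central claim.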

Figures

Figures reproduced from arXiv: 2607.01111 by Haoran Hao, Jeffrey Ichnowski, Jeff Schneider, Shahram Najam Syed.

**Figure 1.** Overall Framework of FAR. After a failure, FAR identifies failure-inducing actions using value estimation, then updates the policy with both failure examples and alternative positive examples. The collected trajectories are added to the replay buffer for continual policy improvement. … trajectories are incorporated into online finetuning, allowing the policy to learn from challenging failure cases, improve r…

**Figure 2.** Comparison Across Real-World Tasks. We conduct experiments on three real-world manipulation tasks that evaluate pushing, pick-and-place, and pouring skills. Panel labels: RoboSuite, ManiSkill, RoboMimic, Real; Stack, Door, Bread, Pullcube, Liftpeg, Pokecube, Lift, Can, Square, Drawer, Pot, Tea.

**Figure 4.** Results on Continual Policy Improvement. FAR improves performance through online interactions, while increasing data efficiency and reducing the number of costly environment resets.

**Figure 5.** (a) FAR benefits from both failure adaptation …

**Figure 6.** Results on Continual Policy Improvement. The first row reports success rate against the number of online episodes, while the remaining plots report success rate against environment timesteps. FAR improves performance through online interactions, increasing data efficiency in terms of both resets and timesteps.

**Figure 7.** Continual policy improvement under sparse …

read the original abstract

Robot policies inevitably encounter failures when deployed in real environments. Naive retries often repeat the same mistakes, while many existing recovery methods rely on human intervention. In this paper, we propose Failure-Aware Retry (FAR), a framework that enables robots to learn from previous failures at test time, adapt their behavior accordingly, and eventually complete the task autonomously. FAR combines Failure-Contrastive Preference Adaptation, which constructs preference learning data from failures to steer the policy away from previously unsuccessful behaviors, with lightweight action perturbations during retries to encourage local exploration. We further incorporate successful recovery trajectories into a training loop for continual policy improvement. Experiments in both simulation and real-world manipulation tasks show that FAR substantially improves success rates and robustness, with average gains of 17.6% over the standard diffusion policy in simulation and 11.7% in the real world. In addition, FAR significantly improves data efficiency under both reset and timestep budgets during continual policy improvement by exploiting informative failure cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FAR turns test-time failures into preference data for robot policy adaptation, but the gains hinge on an unexamined conversion step that may embed bias.

The core contribution is a test-time retry framework that builds preference pairs from observed failures to steer a diffusion policy away from bad actions, adds light perturbations for exploration during retries, and folds successful recoveries back into training for continual improvement. This is a direct, practical response to the common problem that naive retries just repeat mistakes in manipulation tasks.

It does a few things cleanly. The named components—Failure-Contrastive Preference Adaptation plus perturbations—are a reasonable way to operationalize learning from failures without human intervention. The reported average gains (17.6% sim, 11.7% real) over a standard diffusion policy, plus better data efficiency under reset and timestep limits, are the kind of numbers that matter for deployment reliability if they replicate.

The soft spot is exactly where the stress-test note points: the step that turns failure trajectories into preference pairs. The abstract claims this steers the policy without new biases, but supplies no derivation, ablation, or pairing rule details to show the contrastive objective stays aligned with the original task or avoids amplifying correlations that only appear in the failure set. Without those controls, the robustness improvements could be artifacts of how the data was selected rather than genuine recovery learning. The lack of baseline specifics, statistical tests, or protocol description in the provided text makes it hard to judge how solid the evidence is.

This paper is for researchers working on test-time adaptation and recovery in robot manipulation. A reader already running diffusion policies on similar tasks could extract the retry loop and try it, but the work needs tighter experimental grounding before it changes practice. It deserves a serious referee because the idea is scoped and implementable; the current evidence is thin but the problem it targets is real.

Referee Report

2 major / 2 minor

Summary. The paper proposes Failure-Aware Retry (FAR), a test-time framework that converts observed failures into preference pairs via Failure-Contrastive Preference Adaptation, applies lightweight action perturbations on retries, and folds successful recoveries into a continual training loop. It claims this yields average success-rate gains of 17.6% over a standard diffusion policy in simulation and 11.7% in real-world manipulation, together with improved data efficiency under reset and timestep budgets.

Significance. If the empirical claims are substantiated, the approach would provide a practical, human-free mechanism for test-time recovery and online policy improvement in robotics. The explicit use of failure trajectories as preference data and the closed-loop incorporation of recoveries into training constitute a concrete, falsifiable contribution to continual adaptation.

major comments (2)

[Abstract and §4 (Experiments)] The headline performance numbers (17.6% sim, 11.7% real) and the data-efficiency claim are stated without any description of trial counts, random seeds, statistical tests, baseline implementations, or ablation results for the preference-adaptation component. Because these quantities are the sole support for the central claims, the manuscript cannot be evaluated as written.
[§3 (Failure-Contrastive Preference Adaptation)] The construction that turns failure trajectories into preference pairs is described only at the level of “lightweight perturbations” and “steering away from unsuccessful behaviors.” No derivation, bias analysis, or controlled experiment shows that the induced preference distribution remains unbiased relative to the original task reward or that the contrastive objective does not amplify spurious correlations present only in the failure set. This step is load-bearing for both the robustness and continual-improvement claims.

minor comments (2)

[§3] Notation for the preference loss and the perturbation schedule should be introduced with explicit equations rather than prose descriptions.
[§4] Figure captions should state the exact number of evaluation episodes and whether error bars represent standard error or standard deviation.
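The distinction the second minor comment asks for can be made concrete with a short sketch. It assumes each evaluation episode yields a binary success flag; the function name `success_rate_summary` and the dictionary keys are illustrative, and nothing here reflects the paper's actual evaluation counts.

```python
import math
import statistics


def success_rate_summary(outcomes):
    """Summarize binary episode outcomes (1 = success, 0 = failure).

    std_dev describes the spread of individual episodes; std_err describes
    the uncertainty of the mean success rate and shrinks as 1/sqrt(n).
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one episode")
    rate = sum(outcomes) / n
    std_dev = statistics.stdev(outcomes) if n > 1 else 0.0  # sample standard deviation
    std_err = std_dev / math.sqrt(n)
    return {"episodes": n, "rate": rate, "std_dev": std_dev, "std_err": std_err}
```

Because the two quantities differ by a factor of sqrt(n), error bars drawn from one can look several times wider than the other on the same data, which is why a caption needs to say which is plotted and for how many episodes.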

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current manuscript requires substantial additions to experimental reporting and methodological detail. Below we respond point-by-point and commit to the necessary revisions.

Referee: [Abstract and §4 (Experiments)] The headline performance numbers (17.6% sim, 11.7% real) and the data-efficiency claim are stated without any description of trial counts, random seeds, statistical tests, baseline implementations, or ablation results for the preference-adaptation component. Because these quantities are the sole support for the central claims, the manuscript cannot be evaluated as written.

Authors: We agree that the reported success rates and data-efficiency claims lack the necessary statistical and implementation details. In the revised manuscript we will add: the exact number of trials per condition, the random seeds employed, results of statistical significance tests (including p-values), full descriptions of baseline implementations, and dedicated ablations isolating the preference-adaptation component. These changes will allow proper evaluation of the 17.6% and 11.7% gains.

revision: yes

Referee: [§3 (Failure-Contrastive Preference Adaptation)] The construction that turns failure trajectories into preference pairs is described only at the level of “lightweight perturbations” and “steering away from unsuccessful behaviors.” No derivation, bias analysis, or controlled experiment shows that the induced preference distribution remains unbiased relative to the original task reward or that the contrastive objective does not amplify spurious correlations present only in the failure set. This step is load-bearing for both the robustness and continual-improvement claims.

Authors: We acknowledge that §3 currently provides only a high-level description. We will expand the section with a formal derivation of the preference-pair construction from failure trajectories, an explicit bias analysis relative to the task reward, and controlled experiments that test whether the contrastive objective introduces or amplifies spurious correlations unique to the failure set. These additions will directly address the load-bearing nature of this component.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experimental validation

The paper proposes FAR as a new framework that constructs preference data from observed failures and incorporates successful recoveries into a training loop. All reported gains (17.6% sim, 11.7% real) and data-efficiency improvements are presented as outcomes of simulation and real-world experiments. No equations, derivations, or first-principles results are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method is self-contained against external benchmarks and does not rename known results or smuggle ansatzes via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No details on parameters, axioms, or new entities are present in the abstract; ledger left empty.

pith-pipeline@v0.9.1-grok · 5706 in / 1056 out tokens · 32560 ms · 2026-07-02T11:03:27.698350+00:00 · methodology

Reference graph

Works this paper leans on

92 extracted references · 19 canonical work pages · 8 internal anchors

[1]

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023

2023
[2]

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, brian ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Wa...

2025
[4]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023
[5]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y . Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc, S. Levin...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Z. Hu, R. Wu, N. Enock, J. Li, R. Kadakia, Z. Erickson, and A. Kumar. Rac: Robot learning for long-horizon tasks by scaling recovery and correction, 2025. URLhttps://arxiv.org/ abs/2509.07953

work page arXiv 2025
[7]

J. Luo, C. Xu, J. Wu, and S. Levine. Precise and dexterous robotic manipulation via human-in- the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

2025
[8]

Mandlekar, C

A. Mandlekar, C. R. Garrett, D. Xu, and D. Fox. Human-in-the-loop task and motion planning for imitation learning. In7th Annual Conference on Robot Learning, 2023. URL https: //openreview.net/forum?id=G_FEL3OkiR

2023
[9]

Hoque, L

R. Hoque, L. Y . Chen, S. Sharma, K. Dharmarajan, B. Thananjeyan, P. Abbeel, and K. Goldberg. Fleet-dagger: Interactive robot fleet learning with scalable human supervision. InConference on Robot Learning, pages 368–380. PMLR, 2023

2023
[10]

Hoque, A

R. Hoque, A. Balakrishna, C. Putterman, M. Luo, D. S. Brown, D. Seita, B. Thananjeyan, E. Novoseller, and K. Goldberg. Lazydagger: Reducing context switching in interactive imitation learning. In2021 IEEE 17th international conference on automation science and engineering (case), pages 502–509. IEEE, 2021

2021
[11]

Hoque, A

R. Hoque, A. Balakrishna, E. Novoseller, A. Wilcox, D. S. Brown, and K. Goldberg. ThriftyDAg- ger: Budget-aware novelty and risk gating for interactive imitation learning. In5th An- nual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id= KKBfrCzCVOn. 9

2021
[12]

Ebert, S

F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. InConference on robot learning, pages 983–993. PMLR, 2018

2018
[13]

M. Du, A. Khazatsky, T. Gerstenberg, and C. Finn. To err is robotic: Rapid value-based trial-and-error during deployment, 2024. URLhttps://arxiv.org/abs/2406.15917

work page arXiv 2024
[14]

S. Xu, R. Jin, H. Zhou, B. Yue, G. Qiao, Y . Deng, Y . Tai, K. Jia, and G. Liu. From reaction to anticipation: Proactive failure recovery through agentic task graph for robotic manipulation. In Robotics: Science and Systems (RSS), 2026

2026
[15]

Z. Liu, A. Bahety, and S. Song. REFLECT: Summarizing robot experiences for failure explanation and correction. In7th Annual Conference on Robot Learning, 2023. URL https: //openreview.net/forum?id=8yTS_nAILxt

2023
[16]

Q. Gu, Y . Ju, S. Sun, I. Gilitschenski, H. Nishimura, M. Itkina, and F. Shkurti. SAFE: Multitask failure detection for vision-language-action models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id= XPyAukgsFf

2026
[17]

J. Duan, W. Pumacay, N. Kumar, Y . R. Wang, S. Tian, W. Yuan, R. Krishna, D. Fox, A. Man- dlekar, and Y . Guo. AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation. InThe Thirteenth International Conference on Learning Representations,
[18]

URLhttps://openreview.net/forum?id=JVkdSi7Ekg
[19]

Grislain, H

C. Grislain, H. Rahimi, O. Sigaud, and M. Chetouani. I-failsense: Towards general robotic failure detection with vision-language models. InProceedings of the International Conference on Robotics and Automation (ICRA), 2026. URLhttps://arxiv.org/abs/2509.16072

work page arXiv 2026
[20]

W. Lu, M. Ye, Z. Ye, R. Tao, S. Yang, and B. Zhao. Robofac: A comprehensive framework for robotic failure analysis and correction, 2025. URLhttps://arxiv.org/abs/2505.12224

work page arXiv 2025
[21]

Z. Lin, J. Duan, H. Fang, D. Fox, R. Krishna, C. Tan, and B. Wen. Failsafe: Reasoning and recovery from failures in vision-language-action models, 2025. URL https://arxiv.org/ abs/2510.01642

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

H. Chen, Y . Yao, R. Liu, C. Liu, and J. Ichnowski. Robot failure recovery using vision- language models with optimized prompts. In2025 American Control Conference (ACC), pages 1983–1988, 2025. doi:10.23919/ACC63710.2025.11107751

work page doi:10.23919/acc63710.2025.11107751 1983
[23]

Y . Hong, H. Huang, M. Li, L. F.-F. Li, J. Wu, and Y . Choi. Learning from trials and errors: Reflective test-time planning for embodied llms, 2026. URL http://arxiv.org/abs/2602. 21198

2026
[24]

Thananjeyan, A

B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg. Recovery rl: Safe reinforcement learning with learned recovery zones.IEEE Robotics and Automation Letters, 6(3):4915–4922, 2021. doi:10.1109/ LRA.2021.3070252

work page arXiv 2021
[25]

W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Z. Luo, Y . Xie, F. Hu, L. Fan, G. Shi, and Y . Zhu. Self-improving vision-language-action models with data generation via residual RL. InThe F ourteenth International Conference on Learning Representations, 2026. URL https:// openreview.net/forum?id=eUGoqrZ6Ea

2026
[26]

H. Li, K. Lei, S. Zang, K. Hu, Y . Liang, B. An, X. Li, and H. Xu. Failure-aware rl: Reliable offline-to-online reinforcement learning with self-recovery for real-world manipulation, 2026. URLhttps://arxiv.org/abs/2601.07821. 10

work page arXiv 2026
[27]

X. Xu, Y . Hou, Z. Liu, and S. Song. Compliant residual DAgger: Improving real-world contact- rich manipulation with human corrections. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum? id=cjcm5LYVWm

2025
[28]

Liang, R

J. Liang, R. He, and T. Tan. A comprehensive survey on test-time adaptation under distribution shifts.International Journal of Computer Vision, 133(1):31–64, 2025

2025
[29]

Z. Wang, Y . Luo, L. Zheng, Z. Chen, S. Wang, and Z. Huang. In search of lost online test-time adaptation: A survey.International Journal of Computer Vision, 133(3):1106–1139, 2025

2025
[30]

D. Chen, D. Wang, T. Darrell, and S. Ebrahimi. Contrastive test-time adaptation. InCVPR, 2022

2022
[31]

Y . Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt. Test-time training with self- supervision for generalization under distribution shifts. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 ofPro- ceedings of Machine Learning Research, pages 9229–9248. PMLR, 13–18 Jul 2020. URL ht...

2020
[32]

H. S. Yoon, E. Yoon, J. T. J. Tee, M. A. Hasegawa-Johnson, Y . Li, and C. D. Yoo. C-TPT: Calibrated test-time prompt tuning for vision-language models via text feature dispersion. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=jzzEHTBFOT

2024
[33]

Iwasawa and Y

Y . Iwasawa and Y . Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 2427–2440. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/202...

2021
[34]

A. Chen, Z. Liu, J. Zhang, A. Prabhakar, Z. Liu, S. Heinecke, S. Savarese, V . Zhong, and C. Xiong. Test-time adaptation for LLM agents via environment interaction. InThe F ourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=OH4PE0TDo0

2026
[35]

J. Hu, Z. Zhang, G. Chen, X. Wen, C. Shuai, W. Luo, B. Xiao, Y . Li, and M. Tan. Test-time learning for large language models. InF orty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=iCYbIaGKSR

2025
[36]

S. Niu, C. Miao, G. Chen, P. Wu, and P. Zhao. Test-time model adaptation with only forward passes. InThe International Conference on Machine Learning, 2024

2024
[37]

S. Kim, G. Oh, H. Ko, D. Ji, D. Lee, B.-J. Lee, S. Jang, and S. Kim. Test-time adaptation for online vision-language navigation with feedback-based reinforcement learning. InF orty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=K4GaB4fdIq

2025
[38]

Wagenmaker, Z

A. Wagenmaker, Z. Zhou, and S. Levine. Behavioral exploration: Learning to explore via in-context adaptation. InF orty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=tlLkY9E2bZ

2025
[39]

M. Yoo, J. Jang, S. Yoon, and H. Woo. World model implanting for test-time adaptation of embodied agents. InF orty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=tpbtodnI1p

2025
[40]

M. Liu, D. Pathak, and A. Agarwal. Locoformer: Generalist locomotion via long-context adaptation. In9th Annual Conference on Robot Learning, 2025. 11

2025
[41]

Z. Bai, C. Gao, and M. Z. Shou. Evolve-vla: Test-time training from environment feedback for vision-language-action models, 2025. URLhttps://arxiv.org/abs/2512.14666

work page arXiv 2025
[42]

T. Xie, N. Jiang, H. Wang, C. Xiong, and Y . Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning.Advances in neural information processing systems, 34:27395–27407, 2021

2021
[43]

Zhang, W

H. Zhang, W. Xu, and H. Yu. Policy expansion for bridging offline-to-online reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=-Y34L45JR6z

2023
[44]

S. Lee, Y . Seo, K. Lee, P. Abbeel, and J. Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. InConference on Robot Learning, pages 1702–1712. PMLR, 2022

2022
[45]

P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023

2023
[46]

Nakamoto, Y

M. Nakamoto, Y . Zhai, A. Singh, M. S. Mark, Y . Ma, C. Finn, A. Kumar, and S. Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum? id=GcEIvidYSw

2023
[47]

Q. Li, J. Zhang, D. Ghosh, A. Zhang, and S. Levine. Accelerating exploration with unlabeled prior data. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Itorzn4Kwf

2023
[48]

Q. Li, Z. Zhou, and S. Levine. Reinforcement learning with action chunking. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URL https: //openreview.net/forum?id=XUks1Y96NR

2026
[49]

Kostrikov, A

I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=68n2s9ZJWF8

2022
[50]

A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. InarXiv preprint arXiv:2409.00588, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

S. Park, Q. Li, and S. Levine. Flow q-learning. InF orty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=KVf2SFL1pi

2025
[52]

Zhang, C

T. Zhang, C. Yu, S. Su, and Y . Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=ACagRwCCqu

2026
[53]

J. Liu, F. Gao, B. Wei, X. Chen, Q. Liao, Y . Wu, C. Yu, and Y . Wang. What can RL bring to VLA generalization? an empirical study. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2026. URLhttps://openreview.net/forum?id=qmBMPInbZC

2026
[54]

H. Li, Y . Zuo, J. Yu, Y . Zhang, Y . Zhaohui, K. Zhang, X. Zhu, Y . Zhang, T. Chen, G. Cui, D. Wang, D. Luo, Y . Fan, Y . Sun, J. Zeng, J. Pang, S. Zhang, Y . Wang, Y . Mu, B. Zhou, and N. Ding. SimpleVLA-RL: Scaling VLA training via reinforcement learning. InThe F ourteenth International Conference on Learning Representations, 2026. URL https://openrevi...

2026
[55]

Y . Guo, J. Zhang, X. Chen, X. Ji, Y .-J. Wang, Y . Hu, and J. Chen. Improving vision-language- action model with online reinforcement learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15665–15672, 2025. doi:10.1109/ICRA55743.2025. 11127299

work page doi:10.1109/icra55743.2025 2025
[56]

A. K. Jain, V . Mohta, S. Kim, A. Bhardwaj, J. Ren, Y . Feng, S. Choudhury, and G. Swamy. A smooth sea never made a skilled SAILOR: Robust imitation via learning to search. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=qN5hmLkBtC

2025
[57]

Gokmen, D

C. Gokmen, D. Ho, and M. Khansari. Asking for help: Failure prediction in behavioral cloning through value approximation. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5821–5828, 2023. doi:10.1109/ICRA48891.2023.10161004

work page doi:10.1109/icra48891.2023.10161004 2023
[58]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum? id=HPuSIXJaa9

2023
[59]

Wallace, M

B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

2024
[60]

S. Tao, F. Xiang, A. Shukla, Y . Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y . Liu, T. kai Chan, Y . Gao, X. Li, T. Mu, N. Xiao, A. Gurha, V . N. Rajesh, Y . W. Choi, Y .-R. Chen, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.Robotics: Science and Systems, 2025

2025
[61]

Y . Zhu, J. Wong, A. Mandlekar, R. Mart´ın-Mart´ın, A. Joshi, S. Nasiriany, Y . Zhu, and K. Lin. robosuite: A modular simulation framework and benchmark for robot learning. InarXiv preprint arXiv:2009.12293, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[62]

Mandlekar, D

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Mart´ın-Mart´ın. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), 2021

2021
[63]

Mandlekar, D

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Mart´ın-Mart´ın. What matters in learning from offline human demonstrations for robot manipulation. In5th Annual Conference on Robot Learning, 2021. URL https: //openreview.net/forum?id=JrsfBJtDFdI

2021
[64]

Wu and K

Y . Wu and K. He. Group normalization. InProceedings of the European conference on computer vision (ECCV), pages 3–19, 2018

2018
[65]

D. Misra. Mish: A self regularized non-monotonic neural activation function, 2019

2019
[66]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ 4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

[67] A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

[68] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

[69] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018

[70] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

[71] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988

[72] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635. IEEE, 2018

[73] P. Florence, L. Manuelli, and R. Tedrake. Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters, 5(2):492–499, 2019

[74] R. Rahmatizadeh, P. Abolghasemi, L. Bölöni, and S. Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3758–3765. IEEE, 2018

[75] P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson. Implicit behavioral cloning. In 5th Annual Conference on Robot Learning, 2021

[76] A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V. Sindhwani, et al. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, pages 726–747. PMLR, 2021

[77] J. Wu, X. Sun, A. Zeng, S. Song, J. Lee, S. Rusinkiewicz, and T. Funkhouser. Spatial action maps for mobile manipulation. In Proceedings of Robotics: Science and Systems (RSS), 2020

[78] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[79] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[80] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
