pith. machine review for the scientific record. sign in

arxiv: 2605.08879 · v1 · submitted 2026-05-09 · 💻 cs.RO

Recognition: no theorem link

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

Fuxian Huang, Haoran Zhang, Qi Zhang, Shaopeng Zhai, Tianyi Zhang

Pith reviewed 2026-05-12 01:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords conservative supervised fine-tuningflow-matching VLAcatastrophic forgettingcapability retentionrobot learningfine-tuningvision-language-action
0
0 comments X

The pith

ConSFT preserves pre-trained capabilities in flow-matching VLAs by scaling learning signals to model confidence during fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-matching vision-language-action models lose earlier skills when fine-tuned on new tasks because parameter updates overwrite prior learning. The authors introduce Conservative Supervised Fine-Tuning, which lowers the learning signal on samples where the model is already uncertain, limiting how far parameters move. This keeps old abilities intact while still permitting acquisition of new behaviors. The method requires no replay of previous data and no added network components. On standard robot benchmarks it improves retention by more than twenty percent over ordinary fine-tuning and performs like data-intensive replay methods.

Core claim

By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates and bound intrinsic parameter disruption risk. Inspired by trust-region clipping, the formulation creates a progressive learning dynamic that secures target convergence together with prior capability retention through sparse updates, without parallel reference networks or prior data.

What carries the argument

ConSFT objective, which dynamically scales learning signals based on model confidence to suppress excessive gradients from low-confidence samples.

Load-bearing premise

Dynamically scaling learning signals based on model confidence effectively bounds parameter disruption risk while allowing necessary adaptation without introducing new failure modes.

What would settle it

An experiment on the LIBERO benchmark where ConSFT is applied to a downstream task but either fails to improve target performance or loses prior-task success rates at levels comparable to vanilla SFT.

Figures

Figures reproduced from arXiv: 2605.08879 by Fuxian Huang, Haoran Zhang, Qi Zhang, Shaopeng Zhai, Tianyi Zhang.

Figure 1
Figure 1. Figure 1: Parameter update sparsity across optimization objectives. (Left) Global sparsity progression. Trust-region constraints (PPO) reduce the update scope compared to unconstrained SFT. (Right) Layer-wise sparsity profiles. PPO yields > 99% sparsity in core Attention and MLP weights. formulation restricts weight divergence to highly localized subspaces, enforcing conservative updates entirely within the standard… view at source ↗
Figure 2
Figure 2. Figure 2: Evolution of layer-wise update sparsity across training steps. Vanilla SFT (left) drives a rapid, early collapse in parameter sparsity, resulting in dense global overwrites. ConSFT (right) structurally delays this shift, enforcing a controlled and uniformly decaying optimization trajectory. This progressive adaptation bridges the strict trust-region bounds of PPO (center) and the unconstrained regression o… view at source ↗
Figure 3
Figure 3. Figure 3: Capability retention in physical deployments. Following downstream adaptation to the test-tube target task (controlled at 70% target success), unconstrained adaptation baselines (vanilla SFT, LwF) exhibit severe degradation of pre-trained capabilities. In contrast, ConSFT achieves the highest prior task retention among all baselines in a prior-data-free regime, maintaining robust performance even under vis… view at source ↗
Figure 4
Figure 4. Figure 4: Per-task capability retention on the LIBERO-Object suite. Performance evolution on the held-out Object tasks during downstream adaptation to the Spatial target. Unconstrained methods exhibit rapid performance degradation, whereas trust-region bounds delay this decline [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: presents the analogous evaluation for the 10 tasks in the LIBERO-Goal suite, which involves long-horizon semantic goals (e.g., "open the top drawer" or "put the bowl on the plate"). Consistent with the Object suite dynamics, the absence of trust-region constraints degrades pre-trained sequential behaviors. 40 60 80 100 0 25 50 75 100 put the bowl on the plate 40 60 80 100 put the wine bottle on the rack 40… view at source ↗
Figure 6
Figure 6. Figure 6: Per-task capability retention on the LIBERO-Object suite. Performance evolution on the held-out Object tasks during adaptation to the Spatial target task. 40 60 80 0 50 100 put the bowl on the plate 40 60 80 put the wine bottle on the rack 40 60 80 open the top drawer and put the bowl inside 40 60 80 put the cream cheese in the bowl 40 60 80 put the wine bottle on top of the cabinet 40 60 80 0 50 100 push … view at source ↗
Figure 7
Figure 7. Figure 7: Per-task capability retention on the LIBERO-Goal suite. Performance evolution on the held-out Goal tasks. The multi-baseline trajectories demonstrate that ConSFT bounds the disruption risk, preserving pre-trained long-horizon behaviors in a prior-data-free regime. C.5 Real-world deployment: foundational robustness and high-precision adaptation To establish the evaluation baseline under physical hardware co… view at source ↗
Figure 8
Figure 8. Figure 8: Real-world multi-task evaluation under visually dense conditions. Execution trajectories of the pre-trained π0.5 policy across four distinct semantic grasping tasks. The environment introduces physical distractors to test foundational robustness prior to downstream adaptation. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Real-world execution of the sequential test tube transfer task. The single-arm robotic system operates under the ConSFT-optimized policy. The task demands high-precision insertion and long-horizon planning, requiring the sequential transfer of all four test tubes to satisfy the binary success criterion. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
read the original abstract

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($\pi_0$, $\pi_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20\%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The manuscript proposes Conservative Supervised Fine-Tuning (ConSFT) for flow-matching Vision-Language-Action (VLA) models. ConSFT is an optimization objective that dynamically scales learning signals according to model confidence to suppress low-confidence updates, thereby mitigating catastrophic forgetting of pre-trained capabilities during adaptation to target tasks. The method requires no prior data and no additional reference networks or architectural changes. It is evaluated on the LIBERO and RoboTwin benchmarks across three flow-matching VLAs (π₀, π₀.₅, and GR00T-N1.6-3B), reporting an average absolute improvement of over 20% in capability retention relative to vanilla SFT and performance comparable to data-heavy Experience Replay. Real-world robotic deployments are used to confirm that the approach prevents spatial overfitting while acquiring sequential target tasks.

Significance. If the empirical results hold under scrutiny, the contribution is significant for robotic learning and VLA deployment. It offers a lightweight, prior-data-free solution to the adaptation-retention trade-off that is a major practical barrier for large pre-trained models. The simplicity of the confidence-based scaling heuristic, the absence of extra networks, and the real-world validation are notable strengths. The reported ability to match Experience Replay performance without replay data would be a useful advance if robustly supported.

major comments (2)
  1. [§4 and tables] §4 (Experimental Results) and associated tables: The central claim of an average absolute >20% margin in capability retention over vanilla SFT (and parity with Experience Replay) is load-bearing for the paper's contribution. The manuscript should report the precise definition of the retention metric, per-model and per-benchmark breakdowns, number of random seeds, and any statistical significance tests or error bars, as the current aggregate figure leaves the robustness of the result difficult to assess given the variability inherent in fine-tuning large VLAs.
  2. [§3] §3 (ConSFT Objective): The dynamic scaling of gradients by model confidence is presented as bounding intrinsic parameter disruption risk while still permitting target adaptation. A more explicit measurement or bound on parameter drift (for example, via L2 norm of weight changes or cosine similarity to the pre-trained checkpoint) would strengthen the claim that the heuristic secures both convergence and retention without introducing new failure modes.
minor comments (4)
  1. [§3] The notation for the confidence scaling factor and the overall loss should be introduced with explicit symbols and consistently referenced in subsequent sections to improve readability.
  2. [final section] The real-world experimental protocol in the final section would benefit from additional details on hardware, success criteria, and number of trials to allow replication.
  3. [Discussion] A brief discussion of potential limitations (for example, behavior on out-of-distribution low-confidence samples) would help contextualize the method's scope.
  4. [throughout] Ensure all benchmark task names and model variants are listed consistently between the abstract, tables, and text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and robustness.

read point-by-point responses
  1. Referee: [§4 and tables] §4 (Experimental Results) and associated tables: The central claim of an average absolute >20% margin in capability retention over vanilla SFT (and parity with Experience Replay) is load-bearing for the paper's contribution. The manuscript should report the precise definition of the retention metric, per-model and per-benchmark breakdowns, number of random seeds, and any statistical significance tests or error bars, as the current aggregate figure leaves the robustness of the result difficult to assess given the variability inherent in fine-tuning large VLAs.

    Authors: We agree that additional details on the retention metric and experimental variability are needed to substantiate the central claim. In the revised manuscript, we will explicitly define the retention metric as the average ratio of post-adaptation success rates on pre-training tasks to the original pre-training success rates. We will expand the tables in §4 to include per-model (π₀, π₀.₅, GR00T-N1.6-3B) and per-benchmark (LIBERO, RoboTwin) breakdowns. Experiments were run with 3 random seeds; we will report means with standard deviations as error bars and include paired statistical significance tests (Wilcoxon signed-rank) confirming the >20% average absolute improvement over vanilla SFT and parity with Experience Replay. revision: yes

  2. Referee: [§3] §3 (ConSFT Objective): The dynamic scaling of gradients by model confidence is presented as bounding intrinsic parameter disruption risk while still permitting target adaptation. A more explicit measurement or bound on parameter drift (for example, via L2 norm of weight changes or cosine similarity to the pre-trained checkpoint) would strengthen the claim that the heuristic secures both convergence and retention without introducing new failure modes.

    Authors: We appreciate this suggestion to quantify the parameter-drift claim. In the revised §3 and experimental analysis, we will add explicit measurements of parameter drift: the L2 norm of weight changes relative to the pre-trained checkpoint and the cosine similarity of the weight vectors. These will be reported for ConSFT versus vanilla SFT across the evaluated models, demonstrating that ConSFT produces measurably sparser updates (lower L2 drift and higher cosine similarity) while still achieving target-task convergence. This addition will directly support the bounding of intrinsic disruption risk without new failure modes. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces ConSFT as a heuristic optimization objective that dynamically scales gradients according to model confidence to limit parameter drift during fine-tuning of flow-matching VLAs. Central claims rest on empirical evaluations across LIBERO and RoboTwin benchmarks with three specific models, reporting >20% absolute retention gains over vanilla SFT and parity with Experience Replay in a prior-data-free setting. No load-bearing derivation reduces to self-definition, fitted parameters renamed as predictions, or self-citation chains; the trust-region reference is inspirational only. The formulation and results are presented as an independent empirical contribution rather than a closed mathematical reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach relies on standard supervised optimization assumptions and the existence of a usable model-confidence signal; no explicit free parameters, new entities, or non-standard axioms are detailed in the abstract.

axioms (1)
  • standard math Standard assumptions of gradient-based supervised optimization hold for the flow-matching VLA setting
    The method builds directly on gradient descent and dynamic scaling of signals.

pith-pipeline@v0.9.0 · 5528 in / 1216 out tokens · 40375 ms · 2026-05-12T01:27:22.596915+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 6 internal anchors

  1. [1]

    A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, and Kecheng Zheng. A pragmatic VLA foundation model.arXiv preprint arXiv:2601.18692, 2026

  2. [2]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

  3. [3]

    Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky, and Anirudha Majumdar

    Asher J Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky, and Anirudha Majumdar. Ac- tions as language: Fine-tuning vlms into vlas without catastrophic forgetting.arXiv preprint arXiv:2509.22195, 2025

  4. [4]

    Towards long-lived robots: Continual learning VLA models via reinforcement fine-tuning.arXiv preprint arXiv:2602.10503, 2026a

    Yuan Liu, Haoran Li, Shuai Tian, Yuxing Qin, Yuhui Chen, Yupeng Zheng, Yongzhen Huang, and Dongbin Zhao. Towards long-lived robots: Continual learning vla models via reinforcement fine-tuning.arXiv preprint arXiv:2602.10503, 2026

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  6. [6]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Gua...

  8. [8]

    Rl’s ra- zor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025

  9. [9]

    Reinforcement learning fine- tunes small subnetworks in large language models.Advances in Neural Information Processing Systems, 38:132119–132138, 2026

    Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, and Hao Peng. Reinforcement learning fine- tunes small subnetworks in large language models.Advances in Neural Information Processing Systems, 38:132119–132138, 2026

  10. [10]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015. 10

  11. [11]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  12. [12]

    Mechanistic analysis of catastrophic forgetting in large language models during continual fine-tuning.arXiv preprint arXiv:2601.18699, 2026

    Olaf Yunus Laitinen Imanov. Mechanistic analysis of catastrophic forgetting in large language models during continual fine-tuning.arXiv preprint arXiv:2601.18699, 2026

  13. [13]

    A comparative analysis of llm adaptation: Sft, lora, and icl in data-scarce scenarios.arXiv preprint arXiv:2511.00130, 2025

    Bernd Bohnet, Rumen Dangovski, Kevin Swersky, Sherry Moore, Arslan Chaudhry, Kathleen Kenealy, and Noah Fiedel. A comparative analysis of llm adaptation: Sft, lora, and icl in data-scarce scenarios.arXiv preprint arXiv:2511.00130, 2025

  14. [14]

    Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the national academy of sciences, 114 (13):35...

  15. [15]

    Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

    Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

  16. [16]

    Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

    Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

  17. [17]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

  18. [18]

    Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

  19. [19]

    Pretrained vision- language-action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026

    Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, and Yuke Zhu. Pretrained vision- language-action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026

  20. [20]

    Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning.arXiv preprint arXiv:2603.11653, 2026

    Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, and Roberto Martin- Martin. Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning.arXiv preprint arXiv:2603.11653, 2026

  21. [21]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  22. [22]

    Safety alignment as continual learning: Mitigating the alignment tax via orthogonal gradient projection

    Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, and Yi Zhong. Safety alignment as continual learning: Mitigating the alignment tax via orthogonal gradient projection. arXiv preprint arXiv:2602.07892, 2026

  23. [23]

    Beyond reasoning gains: Mitigating general capabilities forgetting in large reasoning models.arXiv preprint arXiv:2510.21978, 2025

    Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, and Deren Lei. Beyond reasoning gains: Mitigating general capabilities forgetting in large reasoning models.arXiv preprint arXiv:2510.21978, 2025

  24. [24]

    Reinflow: Fine-tuning flow matching policy with online reinforcement learning.Advances in Neural Information Processing Systems, 38:106282–106319, 2026

    Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning.Advances in Neural Information Processing Systems, 38:106282–106319, 2026

  25. [25]

    LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id= aloEru2qCG. ...

  26. [26]

    Robust policy opti- mization to prevent catastrophic forgetting.arXiv preprint arXiv:2602.08813, 2026

    Mahdi Sabbaghi, George Pappas, Adel Javanmard, and Hamed Hassani. Robust policy opti- mization to prevent catastrophic forgetting.arXiv preprint arXiv:2602.08813, 2026

  27. [27]

    Reinforcement fine-tuning naturally mitigates forgetting in continual post-training, 2025

    Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, and Fei Zhu. Reinforcement fine- tuning naturally mitigates forgetting in continual post-training.arXiv preprint arXiv:2507.05386, 2025

  28. [28]

    CoRR , volume =

    Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347, 2025

  29. [29]

    arXiv preprint arXiv:2507.21053 , year=

    David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

  30. [30]

    Reinforcement learning for flow- matching policies.arXiv preprint arXiv:2507.15073, 2025

    Samuel Pfrommer, Yixiao Huang, and Somayeh Sojoudi. Reinforcement learning for flow- matching policies.arXiv preprint arXiv:2507.15073, 2025

  31. [31]

    Flow policy gradients for robot control.arXiv preprint arXiv:2602.02481, 2026

    Brent Yi, Hongsuk Choi, Himanshu Gaurav Singh, Xiaoyu Huang, Takara E Truong, Carmelo Sferrazza, Yi Ma, Rocky Duan, Pieter Abbeel, Guanya Shi, Karen Liu, and Angjoo Kanazawa. Flow policy gradients for robot control.arXiv preprint arXiv:2602.02481, 2026

  32. [32]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  33. [33]

    open the top drawer

    Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the computer vision and pattern recognition conference, pages 27649–27660, 2025. 12 A Mechanistic ab...