Recognition: no theorem link
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
Pith reviewed 2026-05-12 01:27 UTC · model grok-4.3
The pith
ConSFT preserves pre-trained capabilities in flow-matching VLAs by scaling learning signals to model confidence during fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates and bound intrinsic parameter disruption risk. Inspired by trust-region clipping, the formulation creates a progressive learning dynamic that secures target convergence together with prior capability retention through sparse updates, without parallel reference networks or prior data.
What carries the argument
ConSFT objective, which dynamically scales learning signals based on model confidence to suppress excessive gradients from low-confidence samples.
Load-bearing premise
Dynamically scaling learning signals based on model confidence effectively bounds parameter disruption risk while allowing necessary adaptation without introducing new failure modes.
What would settle it
An experiment on the LIBERO benchmark where ConSFT is applied to a downstream task but either fails to improve target performance or loses prior-task success rates at levels comparable to vanilla SFT.
Figures
read the original abstract
Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs ($\pi_0$, $\pi_{0.5}$, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20\%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Conservative Supervised Fine-Tuning (ConSFT) for flow-matching Vision-Language-Action (VLA) models. ConSFT is an optimization objective that dynamically scales learning signals according to model confidence to suppress low-confidence updates, thereby mitigating catastrophic forgetting of pre-trained capabilities during adaptation to target tasks. The method requires no prior data and no additional reference networks or architectural changes. It is evaluated on the LIBERO and RoboTwin benchmarks across three flow-matching VLAs (π₀, π₀.₅, and GR00T-N1.6-3B), reporting an average absolute improvement of over 20% in capability retention relative to vanilla SFT and performance comparable to data-heavy Experience Replay. Real-world robotic deployments are used to confirm that the approach prevents spatial overfitting while acquiring sequential target tasks.
Significance. If the empirical results hold under scrutiny, the contribution is significant for robotic learning and VLA deployment. It offers a lightweight, prior-data-free solution to the adaptation-retention trade-off that is a major practical barrier for large pre-trained models. The simplicity of the confidence-based scaling heuristic, the absence of extra networks, and the real-world validation are notable strengths. The reported ability to match Experience Replay performance without replay data would be a useful advance if robustly supported.
major comments (2)
- [§4 and tables] §4 (Experimental Results) and associated tables: The central claim of an average absolute >20% margin in capability retention over vanilla SFT (and parity with Experience Replay) is load-bearing for the paper's contribution. The manuscript should report the precise definition of the retention metric, per-model and per-benchmark breakdowns, number of random seeds, and any statistical significance tests or error bars, as the current aggregate figure leaves the robustness of the result difficult to assess given the variability inherent in fine-tuning large VLAs.
- [§3] §3 (ConSFT Objective): The dynamic scaling of gradients by model confidence is presented as bounding intrinsic parameter disruption risk while still permitting target adaptation. A more explicit measurement or bound on parameter drift (for example, via L2 norm of weight changes or cosine similarity to the pre-trained checkpoint) would strengthen the claim that the heuristic secures both convergence and retention without introducing new failure modes.
minor comments (4)
- [§3] The notation for the confidence scaling factor and the overall loss should be introduced with explicit symbols and consistently referenced in subsequent sections to improve readability.
- [final section] The real-world experimental protocol in the final section would benefit from additional details on hardware, success criteria, and number of trials to allow replication.
- [Discussion] A brief discussion of potential limitations (for example, behavior on out-of-distribution low-confidence samples) would help contextualize the method's scope.
- [throughout] Ensure all benchmark task names and model variants are listed consistently between the abstract, tables, and text.
Simulated Author's Rebuttal
We thank the referee for the positive recommendation of minor revision and the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and robustness.
read point-by-point responses
-
Referee: [§4 and tables] §4 (Experimental Results) and associated tables: The central claim of an average absolute >20% margin in capability retention over vanilla SFT (and parity with Experience Replay) is load-bearing for the paper's contribution. The manuscript should report the precise definition of the retention metric, per-model and per-benchmark breakdowns, number of random seeds, and any statistical significance tests or error bars, as the current aggregate figure leaves the robustness of the result difficult to assess given the variability inherent in fine-tuning large VLAs.
Authors: We agree that additional details on the retention metric and experimental variability are needed to substantiate the central claim. In the revised manuscript, we will explicitly define the retention metric as the average ratio of post-adaptation success rates on pre-training tasks to the original pre-training success rates. We will expand the tables in §4 to include per-model (π₀, π₀.₅, GR00T-N1.6-3B) and per-benchmark (LIBERO, RoboTwin) breakdowns. Experiments were run with 3 random seeds; we will report means with standard deviations as error bars and include paired statistical significance tests (Wilcoxon signed-rank) confirming the >20% average absolute improvement over vanilla SFT and parity with Experience Replay. revision: yes
-
Referee: [§3] §3 (ConSFT Objective): The dynamic scaling of gradients by model confidence is presented as bounding intrinsic parameter disruption risk while still permitting target adaptation. A more explicit measurement or bound on parameter drift (for example, via L2 norm of weight changes or cosine similarity to the pre-trained checkpoint) would strengthen the claim that the heuristic secures both convergence and retention without introducing new failure modes.
Authors: We appreciate this suggestion to quantify the parameter-drift claim. In the revised §3 and experimental analysis, we will add explicit measurements of parameter drift: the L2 norm of weight changes relative to the pre-trained checkpoint and the cosine similarity of the weight vectors. These will be reported for ConSFT versus vanilla SFT across the evaluated models, demonstrating that ConSFT produces measurably sparser updates (lower L2 drift and higher cosine similarity) while still achieving target-task convergence. This addition will directly support the bounding of intrinsic disruption risk without new failure modes. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper introduces ConSFT as a heuristic optimization objective that dynamically scales gradients according to model confidence to limit parameter drift during fine-tuning of flow-matching VLAs. Central claims rest on empirical evaluations across LIBERO and RoboTwin benchmarks with three specific models, reporting >20% absolute retention gains over vanilla SFT and parity with Experience Replay in a prior-data-free setting. No load-bearing derivation reduces to self-definition, fitted parameters renamed as predictions, or self-citation chains; the trust-region reference is inspirational only. The formulation and results are presented as an independent empirical contribution rather than a closed mathematical reduction to the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard assumptions of gradient-based supervised optimization hold for the flow-matching VLA setting
Reference graph
Works this paper leans on
-
[1]
A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026
Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, and Kecheng Zheng. A pragmatic VLA foundation model.arXiv preprint arXiv:2601.18692, 2026
-
[2]
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025
work page internal anchor Pith review arXiv 2025
-
[3]
Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky, and Anirudha Majumdar
Asher J Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky, and Anirudha Majumdar. Ac- tions as language: Fine-tuning vlms into vlas without catastrophic forgetting.arXiv preprint arXiv:2509.22195, 2025
-
[4]
Yuan Liu, Haoran Li, Shuai Tian, Yuxing Qin, Yuhui Chen, Yupeng Zheng, Yongzhen Huang, and Dongbin Zhao. Towards long-lived robots: Continual learning vla models via reinforcement fine-tuning.arXiv preprint arXiv:2602.10503, 2026
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...
work page 2025
-
[7]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Gua...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Rl’s ra- zor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025
Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025
-
[9]
Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, and Hao Peng. Reinforcement learning fine- tunes small subnetworks in large language models.Advances in Neural Information Processing Systems, 38:132119–132138, 2026
work page 2026
-
[10]
Trust region policy optimization
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational conference on machine learning, pages 1889–1897. PMLR, 2015. 10
work page 2015
-
[11]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[12]
Olaf Yunus Laitinen Imanov. Mechanistic analysis of catastrophic forgetting in large language models during continual fine-tuning.arXiv preprint arXiv:2601.18699, 2026
-
[13]
Bernd Bohnet, Rumen Dangovski, Kevin Swersky, Sherry Moore, Arslan Chaudhry, Kathleen Kenealy, and Noah Fiedel. A comparative analysis of llm adaptation: Sft, lora, and icl in data-scarce scenarios.arXiv preprint arXiv:2511.00130, 2025
-
[14]
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the national academy of sciences, 114 (13):35...
work page 2017
-
[15]
Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017
work page 2017
-
[16]
Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025
work page 2025
-
[17]
Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025
work page 2025
-
[18]
David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019
work page 2019
-
[19]
Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, and Yuke Zhu. Pretrained vision- language-action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026
-
[20]
Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, and Roberto Martin- Martin. Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning.arXiv preprint arXiv:2603.11653, 2026
-
[21]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....
work page 2022
-
[22]
Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, and Yi Zhong. Safety alignment as continual learning: Mitigating the alignment tax via orthogonal gradient projection. arXiv preprint arXiv:2602.07892, 2026
work page internal anchor Pith review arXiv 2026
-
[23]
Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, and Deren Lei. Beyond reasoning gains: Mitigating general capabilities forgetting in large reasoning models.arXiv preprint arXiv:2510.21978, 2025
-
[24]
Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning.Advances in Neural Information Processing Systems, 38:106282–106319, 2026
work page 2026
-
[25]
LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024
Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id= aloEru2qCG. ...
work page 2024
-
[26]
Mahdi Sabbaghi, George Pappas, Adel Javanmard, and Hamed Hassani. Robust policy opti- mization to prevent catastrophic forgetting.arXiv preprint arXiv:2602.08813, 2026
work page internal anchor Pith review arXiv 2026
-
[27]
Reinforcement fine-tuning naturally mitigates forgetting in continual post-training, 2025
Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Min Xie, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, and Fei Zhu. Reinforcement fine- tuning naturally mitigates forgetting in continual post-training.arXiv preprint arXiv:2507.05386, 2025
-
[28]
Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347, 2025
-
[29]
arXiv preprint arXiv:2507.21053 , year=
David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025
-
[30]
Reinforcement learning for flow- matching policies.arXiv preprint arXiv:2507.15073, 2025
Samuel Pfrommer, Yixiao Huang, and Somayeh Sojoudi. Reinforcement learning for flow- matching policies.arXiv preprint arXiv:2507.15073, 2025
-
[31]
Flow policy gradients for robot control.arXiv preprint arXiv:2602.02481, 2026
Brent Yi, Hongsuk Choi, Himanshu Gaurav Singh, Xiaoyu Huang, Takara E Truong, Carmelo Sferrazza, Yi Ma, Rocky Duan, Pieter Abbeel, Guanya Shi, Karen Liu, and Angjoo Kanazawa. Flow policy gradients for robot control.arXiv preprint arXiv:2602.02481, 2026
-
[32]
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023
work page 2023
-
[33]
Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the computer vision and pattern recognition conference, pages 27649–27660, 2025. 12 A Mechanistic ab...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.