pith. machine review for the scientific record.

arxiv: 2605.09640 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Hanzhong Guo, Linwei Chen, Meng Lou, Yizhou Yu

Pith reviewed 2026-05-12 04:36 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords catastrophic forgetting · visual continual learning · reinforcement fine-tuning · policy optimization · retention reward · class-incremental learning · domain-incremental learning · multimodal models
0 comments

The pith

RaPO reduces catastrophic forgetting in visual continual learning by rewarding reinforcement learning rollouts that stay close to previous task policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement fine-tuning outperforms supervised methods but still forgets prior visual tasks because rollouts with identical rewards can differ widely in how much they drift from earlier policies. By adding a retention reward that favors low-drift trajectories and normalizing advantages across task boundaries, the proposed method keeps old knowledge intact while allowing new visual skills to be learned. This matters for any system that must keep acquiring visual capabilities over time without erasing earlier ones, such as object recognition in changing environments. Experiments across five settings including class-incremental and domain-incremental learning show the approach delivers leading performance in balancing preservation and new-task acquisition.

Core claim

Pilot experiments reveal that trajectory-level distribution drift agnosticism is the key bottleneck: among rollouts achieving the same task reward, the KL divergence from the preceding-task policy varies substantially and correlates with forgetting. RaPO mitigates this through a retention reward that converts the trajectory-level KL divergence into a continuous additive signal, preferentially reinforcing knowledge-preserving rollouts within each group, together with cross-task advantage normalization that maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize optimization.
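The section names the retention reward but not its functional form. A minimal illustration, assuming a bounded exponential shaping with weight λ (both the form and the weight are assumptions made for this sketch, not the paper's definition):

```latex
\tilde{r}_i \;=\; r_i^{\text{task}} \;+\; \lambda\,\exp\!\Big(-\,D_{\mathrm{KL}}\big(\pi_\theta(\tau_i)\,\big\|\,\pi_{\text{prev}}(\tau_i)\big)\Big)
```

Any bounded, monotonically decreasing function of the trajectory-level drift would play the same role: among rollouts with identical task reward, the low-drift ones receive the larger shaped reward.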

What carries the argument

Retention Reward that turns trajectory-level KL divergence from the preceding policy into a continuous reward signal to prefer preserving rollouts, paired with Cross-Task Advantage Normalization (CTAN) that uses an exponential moving average of reward statistics to stabilize training across sequential tasks.
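A minimal sketch of how the two pieces could fit together in a GRPO-style update, assuming the illustrative exponential shaping above and the EMA decay β = 0.99 reported in Figure 2; the class, function, and parameter names are hypothetical, not the paper's code:

```python
import numpy as np

class CTANNormalizer:
    """Persistent EMA of reward statistics, carried across task boundaries."""
    def __init__(self, beta: float = 0.99, eps: float = 1e-6):
        self.beta, self.eps = beta, eps
        self.mean, self.var = 0.0, 1.0

    def update(self, rewards: np.ndarray) -> None:
        self.mean = self.beta * self.mean + (1 - self.beta) * rewards.mean()
        self.var = self.beta * self.var + (1 - self.beta) * rewards.var()

    def normalize(self, rewards: np.ndarray) -> np.ndarray:
        return (rewards - self.mean) / (np.sqrt(self.var) + self.eps)

def shaped_advantages(task_rewards, kl_to_prev_policy, ctan, lam=0.1):
    """Group advantages with an additive, bounded retention bonus.

    task_rewards:      verifiable reward of each rollout in the group
    kl_to_prev_policy: trajectory-level KL from the preceding-task policy
    """
    r = np.asarray(task_rewards, dtype=float)
    kl = np.asarray(kl_to_prev_policy, dtype=float)
    shaped = r + lam * np.exp(-kl)   # low-drift rollouts get a bonus
    ctan.update(shaped)              # EMA statistics persist across tasks
    return ctan.normalize(shaped)    # replaces per-group std normalization

# Two rollouts with identical task reward but very different drift.
ctan = CTANNormalizer()
print(shaped_advantages([1.0, 1.0], [0.05, 2.0], ctan))
```

In this toy group the two rollouts earn the same task reward, yet the low-drift rollout receives the larger normalized advantage, which is exactly the preference the retention reward is meant to encode.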

If this is right

  • RaPO delivers leading performance by substantially reducing catastrophic forgetting while preserving strong plasticity across class-incremental, domain-incremental, and three other visual continual learning settings.
  • The method leverages the free-form textual generalization of multimodal large language models to evaluate retention of prior visual knowledge during sequential learning.
  • Cross-task advantage normalization prevents optimization instability that would otherwise arise when reward statistics change at task boundaries.
  • Explicit mitigation of trajectory-level drift makes reinforcement fine-tuning more viable for lifelong visual task sequences than standard approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar drift-aware reward shaping could be tested in non-visual continual learning domains such as language or control tasks where policy stability across sequences is also valuable.
  • The trajectory-level insight suggests that monitoring KL divergence during training might serve as a practical diagnostic for impending forgetting even in non-reinforcement continual learning setups (see the sketch after this list).
  • Combining the retention reward with existing regularization techniques could be explored to further strengthen preservation without additional hyperparameter tuning.
  • If the method scales, it could support longer task sequences in real-world visual applications such as robotics or medical imaging where forgetting old categories is costly.
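A hedged sketch of that diagnostic idea, an editorial extension rather than something the paper proposes: probe batches from earlier tasks are scored for KL between the current policy and a frozen snapshot of the previous-task policy, with an illustrative, untuned alert threshold. The Hugging-Face-style `model(**batch).logits` interface is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_drift(current_model, previous_model, batch):
    """Per-example KL(previous || current) on one probe batch, summed over tokens."""
    cur_logp = F.log_softmax(current_model(**batch).logits, dim=-1)
    prev_logp = F.log_softmax(previous_model(**batch).logits, dim=-1)
    # F.kl_div expects log-probs as input; log_target=True means the target
    # is also given as log-probs, so this returns KL(prev || current).
    return F.kl_div(cur_logp, prev_logp, log_target=True,
                    reduction="batchmean").item()

def drift_alarm(current_model, previous_model, probe_batches, threshold=0.5):
    drifts = [kl_drift(current_model, previous_model, b) for b in probe_batches]
    mean_drift = sum(drifts) / len(drifts)
    if mean_drift > threshold:   # threshold is illustrative, not tuned
        print(f"warning: KL drift {mean_drift:.3f} exceeds {threshold}")
    return mean_drift
```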

Load-bearing premise

Trajectory-level distribution drift is the main cause of forgetting, and converting it into an additive reward will reliably protect prior knowledge without reducing plasticity or causing optimization instability.

What would settle it

An ablation on the same five visual continual learning benchmarks would settle it: if removing the retention reward component produces no increase in forgetting, or if the full method measurably lowers new-task accuracy, the central claim is falsified.
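For concreteness, the quantities such an ablation would compare are the standard class-incremental metrics, final average accuracy and average forgetting in the sense of Chaudhry et al. (reference 53 below). The sketch is a generic computation; the accuracy matrix is invented for the example.

```python
import numpy as np

def cil_metrics(acc: np.ndarray):
    """acc[i, j] = accuracy on task j measured after training on task i."""
    T = acc.shape[0]
    avg_acc = acc[T - 1].mean()   # final average accuracy over all tasks
    # Forgetting of task j: best accuracy it ever had minus its final accuracy.
    forgetting = np.array([acc[:T - 1, j].max() - acc[T - 1, j]
                           for j in range(T - 1)])
    return avg_acc, forgetting.mean()

# Invented 3-task example: each row is the evaluation after one more task.
acc = np.array([[0.90, 0.00, 0.00],
                [0.85, 0.88, 0.00],
                [0.80, 0.84, 0.87]])
print(cil_metrics(acc))   # ≈ (0.837, 0.070)
```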

Figures

Figures reproduced from arXiv: 2605.09640 by Hanzhong Guo, Linwei Chen, Meng Lou, Yizhou Yu.

Figure 1: (a) RFT clearly outperforms SFT in rehearsal-free CIL, but still suffers from significant forgetting. (b) Among equally-rewarded rollouts, KL divergence from the policy in the preceding task varies substantially, and this difference enlarges as tasks progress. […] in RFT implicitly biases the optimization toward solutions residing in low-drift distribution spaces, whereas SFT is prone to converge on solutions … view at source ↗
Figure 2: (a) GRPO relies on the instantaneous reward standard deviation, which fluctuates sharply across task boundaries, whereas CTAN maintains a persistent EMA normalizer (β=0.99). (b) CTAN produces a smoother advantage magnitude (sum of the absolute values of advantages) across the continual learning stream. (c) The stabilized advantage scale is accompanied by smoother acquisition of training reward. (d) Evaluat… view at source ↗
Figure 3: Prompt template and required output format for image and video classification. view at source ↗
Figure 5: Retention reward dynamics on (a) ImageNet-R 10-task class-incremental classification. view at source ↗
Figure 6: Qualitative class-incremental image classification examples from different datasets. view at source ↗
Figure 7: Qualitative class-incremental object detection examples on the COCO 2017 dataset. view at source ↗
read the original abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that reinforcement fine-tuning (RFT) outperforms supervised fine-tuning (SFT) in visual continual learning but still exhibits non-negligible catastrophic forgetting. Through a pilot study, it attributes this to Trajectory-level Drift Agnosticism, where KL divergence from the prior policy varies among high-reward rollouts and correlates with forgetting. It proposes Retention-aware Policy Optimization (RaPO) consisting of a Retention Reward that shapes rewards to favor low-drift trajectories and Cross-Task Advantage Normalization (CTAN) using persistent EMA of reward statistics. Evaluated across five visual CL settings (CIL, DIL, etc.) with MLLMs, RaPO is reported to achieve leading performance by substantially reducing forgetting while preserving plasticity.

Significance. If the empirical results and ablations hold, the work would be moderately significant as the first systematic study applying RFT to visual continual learning. The reward-shaping approach based on policy drift offers a concrete mechanism that could generalize beyond the tested settings and inspire further integration of RL techniques with incremental learning for large multimodal models.

major comments (3)
  1. Abstract and §4 (Experiments): the central claim that RaPO 'achieves leading performance' and 'substantially reduc[es] catastrophic forgetting' is asserted without any quantitative numbers, baseline comparisons, statistical significance tests, or ablation results, which are load-bearing for validating the method's effectiveness over prior RFT and SFT approaches.
  2. §3.1 (Pilot Study): the asserted strong correlation between trajectory-level KL divergence and forgetting lacks reported controls, error bars, or statistical analysis, leaving open whether the correlation is causal or whether other interference mechanisms dominate.
  3. §3.2 (RaPO Method): the Retention Reward and CTAN are introduced as additive signals and EMA normalization, but no analysis or equations demonstrate that they preserve new-task plasticity (e.g., forward transfer) or avoid optimization instability when reward scales differ across tasks, which is required for the claim that the method mitigates forgetting without side effects.
minor comments (2)
  1. The term 'Trajectory-level Drift Agnosticism' is used repeatedly but never given a formal definition or equation, which reduces clarity in the motivation section.
  2. Notation for the Retention Reward and CTAN (e.g., how the EMA is exactly computed and applied to advantages) could be made more precise with explicit formulas to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We have addressed each major comment in detail below and revised the manuscript to incorporate the feedback, strengthening the presentation of our results and analysis.

read point-by-point responses
  1. Referee: Abstract and §4 (Experiments): the central claim that RaPO 'achieves leading performance' and 'substantially reduc[es] catastrophic forgetting' is asserted without any quantitative numbers, baseline comparisons, statistical significance tests, or ablation results, which are load-bearing for validating the method's effectiveness over prior RFT and SFT approaches.

    Authors: We agree that providing quantitative support in the abstract and ensuring comprehensive reporting in the experiments section is important. Although Section 4 of the original manuscript includes tables with performance metrics comparing RaPO to SFT, GRPO, and other baselines across the five visual CL settings, we acknowledge the need for more explicit quantification and statistical validation. In the revised version, we have updated the abstract to include specific numbers highlighting the leading performance and forgetting reduction. We have also added statistical significance tests (e.g., t-tests) and expanded the ablation studies with additional results in §4 to better validate the claims. revision: yes

  2. Referee: §3.1 (Pilot Study): the asserted strong correlation between trajectory-level KL divergence and forgetting lacks reported controls, error bars, or statistical analysis, leaving open whether the correlation is causal or whether other interference mechanisms dominate.

    Authors: Thank you for this observation. The pilot study demonstrates the correlation via empirical observations and plots. To strengthen this, we have revised §3.1 to include error bars on the relevant figures, additional control experiments to isolate the effect of KL divergence, and statistical analysis such as correlation coefficients and p-values. We also discuss the potential causality and acknowledge other possible mechanisms in the text. revision: yes

  3. Referee: §3.2 (RaPO Method): the Retention Reward and CTAN are introduced as additive signals and EMA normalization, but no analysis or equations demonstrate that they preserve new-task plasticity (e.g., forward transfer) or avoid optimization instability when reward scales differ across tasks, which is required for the claim that the method mitigates forgetting without side effects.

    Authors: We appreciate the referee's point on the need for supporting analysis. In the revised §3.2, we have incorporated additional equations and analysis showing that the Retention Reward is designed as a bounded additive term that does not override the task-specific reward, thereby preserving plasticity and forward transfer. For CTAN, we provide a derivation illustrating its stability properties across varying reward scales. Furthermore, we have included new experimental results measuring forward transfer and optimization metrics like variance in advantages to demonstrate the absence of negative side effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical motivation and independent evaluation

full rationale

The paper's central chain begins with a pilot empirical observation that KL divergence varies among equal-reward rollouts and correlates with forgetting. It then defines an explicit Retention Reward to convert that KL into an additive signal and adds CTAN for cross-task normalization. These are not self-definitional or fitted-input predictions: the forgetting metric remains an independent downstream measurement, and final claims rest on benchmark results across five settings rather than reducing by construction to the pilot correlation or any self-citation. No uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear. The procedure is a testable hypothesis with separate validation, qualifying as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard reinforcement-learning assumptions plus two newly introduced algorithmic components whose effectiveness is asserted empirically.

axioms (2)
  • domain assumption Policy optimization under task rewards improves performance on the current task while the added retention term preserves prior behavior.
    Invoked when the retention reward is added to the standard RL objective.
  • domain assumption Exponential moving average of reward statistics remains stable and useful across task boundaries.
    Required for the CTAN component to function as described.
invented entities (2)
  • Retention Reward no independent evidence
    purpose: Converts trajectory-level KL divergence from the previous policy into an additive reward signal that prefers low-drift rollouts.
    New reward term introduced to address the identified drift bottleneck.
  • Cross-Task Advantage Normalization (CTAN) no independent evidence
    purpose: Maintains a persistent EMA of reward statistics to stabilize advantage estimates when tasks change.
    New normalization mechanism for continual RL optimization.

pith-pipeline@v0.9.0 · 5605 in / 1524 out tokens · 55425 ms · 2026-05-12T04:36:39.799478+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 14 internal anchors

  1. [1]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  4. [4]

    Qwen3.5-Omni Technical Report

    Qwen Team. Qwen3.5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  5. [5]

A Survey of Reinforcement Learning for Large Reasoning Models

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models.arXiv preprint arXiv:2509.08827, 2025

  6. [6]

    Leveraging verifier-based reinforcement learning in image editing

    Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, and Weilin Huang. Leveraging verifier-based reinforcement learning in image editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  7. [7]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  8. [8]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  9. [9]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  10. [10]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

  11. [11]

    To think or not to think: A study of thinking in rule-based visual reinforcement fine-tuning

    Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, and Kaipeng Zhang. To think or not to think: A study of thinking in rule-based visual reinforcement fine-tuning. InAdvances in Neural Information Processing Systems, 2025

  12. [12]

    Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. InAdvances in Neural Information Processing Systems, 2025

  13. [13]

    Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning

    Hulingxiao He, Zijun Geng, and Yuxin Peng. Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning. InInternational Conference on Learning Representations, 2026

  14. [14]

A survey on ensemble learning for data stream classification

    Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet. A survey on ensemble learning for data stream classification.ACM Computing Surveys, 50(2):1–36, 2017

  15. [15]

Towards lifelong learning of large language models: A survey

    Junhao Zheng, Shengjie Qiu, Chengming Shi, and Qianli Ma. Towards lifelong learning of large language models: A survey.ACM Computing Surveys, 57(8):1–35, 2025

  16. [16]

Continual learning of large language models: A comprehensive survey

    Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

  17. [17]

Recent advances of foundation language models-based continual learning: A survey

Yutao Yang, Jie Zhou, Xuanwen Ding, Tianyu Huai, Shunyu Liu, Qin Chen, Yuan Xie, and Liang He. Recent advances of foundation language models-based continual learning: A survey.ACM Computing Surveys, 57(5):1–38, 2025

  18. [18]

Continual instruction tuning for large multimodal models

    Jinghan He, Haiyun Guo, Kuan Zhu, Ming Tang, and Jinqiao Wang. Continual instruction tuning for large multimodal models.IEEE Transactions on Image Processing, 2026

  19. [19]

Reinforcement fine-tuning naturally mitigates forgetting in continual post-training

    Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Qingfu Zhang, Hongbin Liu, et al. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training.arXiv preprint arXiv:2507.05386, 2025

  20. [20]

    RL’s razor: Why online reinforcement learning forgets less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s razor: Why online reinforcement learning forgets less. InInternational Conference on Learning Representations, 2026

  21. [21]

Class-incremental learning: A survey

    Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Class-incremental learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9851–9873, 2024

  22. [22]

A comprehensive survey of continual learning: Theory, method and application

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

  23. [23]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021

  24. [24]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  25. [25]

    OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  26. [26]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  27. [27]

Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  28. [28]

Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  29. [29]

    Safe rlhf: Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. InInternational Conference on Learning Representations, 2024

  30. [30]

    Understanding r1-zero-like training: A critical perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. InSecond Conference on Language Modeling, 2025

  31. [31]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  32. [32]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  33. [33]

    Onethinker: All-in-one reasoning model for image and video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  34. [34]

    Learning to prompt for continual learning

    Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022

  35. [35]

Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need

Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need.International Journal of Computer Vision, 133(3):1012–1032, 2025

  36. [36]

S-prompts learning with pre-trained transformers: An Occam’s razor for domain incremental learning

    Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning.Advances in Neural Information Processing Systems, 35:5682–5695, 2022

  37. [37]

    Non-exemplar domain incremental learning via cross-domain concept integration

    Qiang Wang, Yuhang He, Songlin Dong, Xinyuan Gao, Shaokun Wang, and Yihong Gong. Non-exemplar domain incremental learning via cross-domain concept integration. InEuropean Conference on Computer Vision, pages 144–162. Springer, 2024

  38. [38]

    Continual learning with pre-trained models: a survey

    Da-Wei Zhou, Hai-Long Sun, Jingyi Ning, Han-Jia Ye, and De-Chuan Zhan. Continual learning with pre-trained models: a survey. InInternational Joint Conference on Artificial Intelligence, pages 8363–8371, 2024

  39. [39]

    Scaling continual learning to 300+ tasks with bi-level routing mixture-of-experts

    Meng Lou, Yunxiang Fu, and Yizhou Yu. Scaling continual learning to 300+ tasks with bi-level routing mixture-of-experts. InInternational Conference on Machine Learning, 2026

  40. [40]

    Inflora: Interference-free low-rank adaptation for continual learning

Yan-Shuo Liang and Wu-Jun Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024

  41. [41]

    Boosting continual learning of vision-language models via mixture-of-experts adapters

    Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024

  42. [42]

    Integrating task-specific and universal adapters for pre-trained model-based class-incremental learning

    Yan Wang, Da-Wei Zhou, and Han-Jia Ye. Integrating task-specific and universal adapters for pre-trained model-based class-incremental learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 806–816, 2025

  43. [43]

    Mos: Model surgery for pre-trained model-based class-incremental learning

    Hai-Long Sun, Da-Wei Zhou, Hanbin Zhao, Le Gan, De-Chuan Zhan, and Han-Jia Ye. Mos: Model surgery for pre-trained model-based class-incremental learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20699–20707, 2025

  44. [44]

Dual consolidation for pre-trained model-based domain-incremental learning

Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, Lijun Zhang, and De-Chuan Zhan. Dual consolidation for pre-trained model-based domain-incremental learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 20547–20557, 2025

  45. [45]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  46. [46]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  47. [47]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024

  48. [48]

Learning Beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

  49. [49]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

  50. [50]

Tiny imagenet visual recognition challenge

    Yann Le, Xuan Yang, et al. Tiny imagenet visual recognition challenge.CS 231N, 7(7):3, 2015

  51. [51]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

  52. [52]

Few-shot class-incremental learning for classification and object detection: A survey

    Jinghua Zhang, Li Liu, Olli Silvén, Matti Pietikäinen, and Dewen Hu. Few-shot class-incremental learning for classification and object detection: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2924–2945, 2025

  53. [53]

    Riemannian walk for incremental learning: Understanding forgetting and intransigence

Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. InProceedings of the European Conference on Computer Vision, pages 532–547, 2018

  54. [54]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  55. [55]

Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

  56. [56]

Learning without forgetting

    Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017

  57. [57]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean Conference on Computer Vision, pages 740–755. Springer, 2014

  58. [58]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017

  59. [59]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012

  60. [60]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

  61. [61]

    Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification

    Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. InProceedings of the European conference on computer vision (ECCV), pages 305–321, 2018

  62. [62]

    Moment matching for multi-source domain adaptation

    Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  63. [63]

    Deep hashing network for unsupervised domain adaptation

    Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5018–5027, 2017

  64. [64]

Cross-domain weakly-supervised object detection through progressive domain adaptation

Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018

  65. [65]

The pascal visual object classes challenge: A retrospective

    Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective.International Journal of Computer Vision, 111(1):98–136, 2015

  66. [66]

Three types of incremental learning

    Gido M Van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning.Nature Machine Intelligence, 4(12):1185–1197, 2022

  67. [67]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean Conference on Computer Vision, pages 213–229, 2020

  68. [68]

Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347, 2025