pith. machine review for the scientific record.

arxiv: 2605.09640 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Hanzhong Guo, Linwei Chen, Meng Lou, Yizhou Yu

Pith reviewed 2026-05-12 04:36 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords catastrophic forgetting · visual continual learning · reinforcement fine-tuning · policy optimization · retention reward · class-incremental learning · domain-incremental learning · multimodal models
0 comments

The pith

RaPO reduces catastrophic forgetting in visual continual learning by rewarding reinforcement learning rollouts that stay close to previous task policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement fine-tuning outperforms supervised methods but still forgets prior visual tasks because rollouts with identical rewards can differ widely in how much they drift from earlier policies. By adding a retention reward that favors low-drift trajectories and normalizing advantages across task boundaries, the proposed method keeps old knowledge intact while allowing new visual skills to be learned. This matters for any system that must keep acquiring visual capabilities over time without erasing earlier ones, such as object recognition in changing environments. Experiments across five settings including class-incremental and domain-incremental learning show the approach delivers leading performance in balancing preservation and new-task acquisition.

Core claim

Pilot experiments reveal that trajectory-level distribution drift agnosticism is the key bottleneck: among rollouts achieving the same task reward, the KL divergence from the preceding-task policy varies substantially and correlates with forgetting. RaPO mitigates this through a retention reward that converts the trajectory-level KL divergence into a continuous additive signal, preferentially reinforcing knowledge-preserving rollouts within each group, together with cross-task advantage normalization that maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize optimization.
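The section names the retention reward but not its functional form. A minimal illustration, assuming a bounded exponential shaping with weight λ (both the form and the weight are assumptions made for this sketch, not the paper's definition):

```latex
\tilde{r}_i \;=\; r_i^{\text{task}} \;+\; \lambda\,\exp\!\Big(-\,D_{\mathrm{KL}}\big(\pi_\theta(\tau_i)\,\big\|\,\pi_{\text{prev}}(\tau_i)\big)\Big)
```

Any bounded, monotonically decreasing function of the trajectory-level drift would play the same role: among rollouts with identical task reward, the low-drift ones receive the larger shaped reward.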

What carries the argument

Retention Reward that turns trajectory-level KL divergence from the preceding policy into a continuous reward signal to prefer preserving rollouts, paired with Cross-Task Advantage Normalization (CTAN) that uses an exponential moving average of reward statistics to stabilize training across sequential tasks.
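A minimal sketch of how the two pieces could fit together in a GRPO-style update, assuming the illustrative exponential shaping above and the EMA decay β = 0.99 reported in Figure 2; the class, function, and parameter names are hypothetical, not the paper's code:

```python
import numpy as np

class CTANNormalizer:
    """Persistent EMA of reward statistics, carried across task boundaries."""
    def __init__(self, beta: float = 0.99, eps: float = 1e-6):
        self.beta, self.eps = beta, eps
        self.mean, self.var = 0.0, 1.0

    def update(self, rewards: np.ndarray) -> None:
        self.mean = self.beta * self.mean + (1 - self.beta) * rewards.mean()
        self.var = self.beta * self.var + (1 - self.beta) * rewards.var()

    def normalize(self, rewards: np.ndarray) -> np.ndarray:
        return (rewards - self.mean) / (np.sqrt(self.var) + self.eps)

def shaped_advantages(task_rewards, kl_to_prev_policy, ctan, lam=0.1):
    """Group advantages with an additive, bounded retention bonus.

    task_rewards:      verifiable reward of each rollout in the group
    kl_to_prev_policy: trajectory-level KL from the preceding-task policy
    """
    r = np.asarray(task_rewards, dtype=float)
    kl = np.asarray(kl_to_prev_policy, dtype=float)
    shaped = r + lam * np.exp(-kl)   # low-drift rollouts get a bonus
    ctan.update(shaped)              # EMA statistics persist across tasks
    return ctan.normalize(shaped)    # replaces per-group std normalization

# Two rollouts with identical task reward but very different drift.
ctan = CTANNormalizer()
print(shaped_advantages([1.0, 1.0], [0.05, 2.0], ctan))
```

In this toy group the two rollouts earn the same task reward, yet the low-drift rollout receives the larger normalized advantage, which is exactly the preference the retention reward is meant to encode.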

If this is right

  • RaPO delivers leading performance by substantially reducing catastrophic forgetting while preserving strong plasticity across class-incremental, domain-incremental, and three other visual continual learning settings.
  • The method leverages the free-form textual generalization of multimodal large language models to evaluate retention of prior visual knowledge during sequential learning.
  • Cross-task advantage normalization prevents optimization instability that would otherwise arise when reward statistics change at task boundaries.
  • Explicit mitigation of trajectory-level drift makes reinforcement fine-tuning more viable for lifelong visual task sequences than standard approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar drift-aware reward shaping could be tested in non-visual continual learning domains such as language or control tasks where policy stability across sequences is also valuable.
  • The trajectory-level insight suggests that monitoring KL divergence during training might serve as a practical diagnostic for impending forgetting even in non-reinforcement continual learning setups (see the sketch after this list).
  • Combining the retention reward with existing regularization techniques could be explored to further strengthen preservation without additional hyperparameter tuning.
  • If the method scales, it could support longer task sequences in real-world visual applications such as robotics or medical imaging where forgetting old categories is costly.
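A hedged sketch of that diagnostic idea, an editorial extension rather than something the paper proposes: probe batches from earlier tasks are scored for KL between the current policy and a frozen snapshot of the previous-task policy, with an illustrative, untuned alert threshold. The Hugging-Face-style `model(**batch).logits` interface is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_drift(current_model, previous_model, batch):
    """Per-example KL(previous || current) on one probe batch, summed over tokens."""
    cur_logp = F.log_softmax(current_model(**batch).logits, dim=-1)
    prev_logp = F.log_softmax(previous_model(**batch).logits, dim=-1)
    # F.kl_div expects log-probs as input; log_target=True means the target
    # is also given as log-probs, so this returns KL(prev || current).
    return F.kl_div(cur_logp, prev_logp, log_target=True,
                    reduction="batchmean").item()

def drift_alarm(current_model, previous_model, probe_batches, threshold=0.5):
    drifts = [kl_drift(current_model, previous_model, b) for b in probe_batches]
    mean_drift = sum(drifts) / len(drifts)
    if mean_drift > threshold:   # threshold is illustrative, not tuned
        print(f"warning: KL drift {mean_drift:.3f} exceeds {threshold}")
    return mean_drift
```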

Load-bearing premise

Trajectory-level distribution drift is the main cause of forgetting, and converting it into an additive reward will reliably protect prior knowledge without reducing plasticity or causing optimization instability.

What would settle it

An ablation on the same five visual continual learning benchmarks would settle it: if removing the retention reward component produces no increase in forgetting, or if the full method measurably lowers new-task accuracy, the central claim is falsified.
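For concreteness, the quantities such an ablation would compare are the standard class-incremental metrics, final average accuracy and average forgetting in the sense of Chaudhry et al. (reference 53 below). The sketch is a generic computation; the accuracy matrix is invented for the example.

```python
import numpy as np

def cil_metrics(acc: np.ndarray):
    """acc[i, j] = accuracy on task j measured after training on task i."""
    T = acc.shape[0]
    avg_acc = acc[T - 1].mean()   # final average accuracy over all tasks
    # Forgetting of task j: best accuracy it ever had minus its final accuracy.
    forgetting = np.array([acc[:T - 1, j].max() - acc[T - 1, j]
                           for j in range(T - 1)])
    return avg_acc, forgetting.mean()

# Invented 3-task example: each row is the evaluation after one more task.
acc = np.array([[0.90, 0.00, 0.00],
                [0.85, 0.88, 0.00],
                [0.80, 0.84, 0.87]])
print(cil_metrics(acc))   # ≈ (0.837, 0.070)
```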

Figures

Figures reproduced from arXiv: 2605.09640 by Hanzhong Guo, Linwei Chen, Meng Lou, Yizhou Yu.

Figure 1: (a) RFT clearly outperforms SFT in rehearsal-free CIL, but still suffers from significant forgetting. (b) Among equally-rewarded rollouts, KL divergence from the policy in the preceding task varies substantially, and this difference enlarges as tasks progress. […] in RFT implicitly biases the optimization toward solutions residing in low-drift distribution spaces, whereas SFT is prone to converge on solutions … view at source ↗
Figure 2: (a) GRPO relies on the instantaneous reward standard deviation, which fluctuates sharply across task boundaries, whereas CTAN maintains a persistent EMA normalizer (β=0.99). (b) CTAN produces a smoother advantage magnitude (sum of the absolute values of advantages) across the continual learning stream. (c) The stabilized advantage scale is accompanied by smoother acquisition of training reward. (d) Evaluat… view at source ↗
Figure 3: Prompt template and required output format for image and video classification. view at source ↗
Figure 5: Retention reward dynamics on (a) ImageNet-R 10-task class-incremental classification. view at source ↗
Figure 6: Qualitative class-incremental image classification examples from different datasets. view at source ↗
Figure 7: Qualitative class-incremental object detection examples on the COCO 2017 dataset. view at source ↗
read the original abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that reinforcement fine-tuning (RFT) outperforms supervised fine-tuning (SFT) in visual continual learning but still exhibits non-negligible catastrophic forgetting. Through a pilot study, it attributes this to Trajectory-level Drift Agnosticism, where KL divergence from the prior policy varies among high-reward rollouts and correlates with forgetting. It proposes Retention-aware Policy Optimization (RaPO) consisting of a Retention Reward that shapes rewards to favor low-drift trajectories and Cross-Task Advantage Normalization (CTAN) using persistent EMA of reward statistics. Evaluated across five visual CL settings (CIL, DIL, etc.) with MLLMs, RaPO is reported to achieve leading performance by substantially reducing forgetting while preserving plasticity.

Significance. If the empirical results and ablations hold, the work would be moderately significant as the first systematic study applying RFT to visual continual learning. The reward-shaping approach based on policy drift offers a concrete mechanism that could generalize beyond the tested settings and inspire further integration of RL techniques with incremental learning for large multimodal models.

major comments (3)
  1. Abstract and §4 (Experiments): the central claim that RaPO 'achieves leading performance' and 'substantially reduc[es] catastrophic forgetting' is asserted without any quantitative numbers, baseline comparisons, statistical significance tests, or ablation results, which are load-bearing for validating the method's effectiveness over prior RFT and SFT approaches.
  2. §3.1 (Pilot Study): the asserted strong correlation between trajectory-level KL divergence and forgetting lacks reported controls, error bars, or statistical analysis, leaving open whether the correlation is causal or whether other interference mechanisms dominate.
  3. §3.2 (RaPO Method): the Retention Reward and CTAN are introduced as additive signals and EMA normalization, but no analysis or equations demonstrate that they preserve new-task plasticity (e.g., forward transfer) or avoid optimization instability when reward scales differ across tasks, which is required for the claim that the method mitigates forgetting without side effects.
minor comments (2)
  1. The term 'Trajectory-level Drift Agnosticism' is used repeatedly but never given a formal definition or equation, which reduces clarity in the motivation section.
  2. Notation for the Retention Reward and CTAN (e.g., how the EMA is exactly computed and applied to advantages) could be made more precise with explicit formulas to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We have addressed each major comment in detail below and revised the manuscript to incorporate the feedback, strengthening the presentation of our results and analysis.

read point-by-point responses
  1. Referee: Abstract and §4 (Experiments): the central claim that RaPO 'achieves leading performance' and 'substantially reduc[es] catastrophic forgetting' is asserted without any quantitative numbers, baseline comparisons, statistical significance tests, or ablation results, which are load-bearing for validating the method's effectiveness over prior RFT and SFT approaches.

    Authors: We agree that providing quantitative support in the abstract and ensuring comprehensive reporting in the experiments section is important. Although Section 4 of the original manuscript includes tables with performance metrics comparing RaPO to SFT, GRPO, and other baselines across the five visual CL settings, we acknowledge the need for more explicit quantification and statistical validation. In the revised version, we have updated the abstract to include specific numbers highlighting the leading performance and forgetting reduction. We have also added statistical significance tests (e.g., t-tests) and expanded the ablation studies with additional results in §4 to better validate the claims. revision: yes

  2. Referee: §3.1 (Pilot Study): the asserted strong correlation between trajectory-level KL divergence and forgetting lacks reported controls, error bars, or statistical analysis, leaving open whether the correlation is causal or whether other interference mechanisms dominate.

    Authors: Thank you for this observation. The pilot study demonstrates the correlation via empirical observations and plots. To strengthen this, we have revised §3.1 to include error bars on the relevant figures, additional control experiments to isolate the effect of KL divergence, and statistical analysis such as correlation coefficients and p-values. We also discuss the potential causality and acknowledge other possible mechanisms in the text. revision: yes

  3. Referee: §3.2 (RaPO Method): the Retention Reward and CTAN are introduced as additive signals and EMA normalization, but no analysis or equations demonstrate that they preserve new-task plasticity (e.g., forward transfer) or avoid optimization instability when reward scales differ across tasks, which is required for the claim that the method mitigates forgetting without side effects.

    Authors: We appreciate the referee's point on the need for supporting analysis. In the revised §3.2, we have incorporated additional equations and analysis showing that the Retention Reward is designed as a bounded additive term that does not override the task-specific reward, thereby preserving plasticity and forward transfer. For CTAN, we provide a derivation illustrating its stability properties across varying reward scales. Furthermore, we have included new experimental results measuring forward transfer and optimization metrics like variance in advantages to demonstrate the absence of negative side effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical motivation and independent evaluation

full rationale

The paper's central chain begins with a pilot empirical observation that KL divergence varies among equal-reward rollouts and correlates with forgetting. It then defines an explicit Retention Reward to convert that KL into an additive signal and adds CTAN for cross-task normalization. These are not self-definitional or fitted-input predictions: the forgetting metric remains an independent downstream measurement, and final claims rest on benchmark results across five settings rather than reducing by construction to the pilot correlation or any self-citation. No uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear. The procedure is a testable hypothesis with separate validation, qualifying as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard reinforcement-learning assumptions plus two newly introduced algorithmic components whose effectiveness is asserted empirically.

axioms (2)
  • domain assumption Policy optimization under task rewards improves performance on the current task while the added retention term preserves prior behavior.
    Invoked when the retention reward is added to the standard RL objective.
  • domain assumption Exponential moving average of reward statistics remains stable and useful across task boundaries.
    Required for the CTAN component to function as described.
invented entities (2)
  • Retention Reward no independent evidence
    purpose: Converts trajectory-level KL divergence from the previous policy into an additive reward signal that prefers low-drift rollouts.
    New reward term introduced to address the identified drift bottleneck.
  • Cross-Task Advantage Normalization (CTAN) no independent evidence
    purpose: Maintains a persistent EMA of reward statistics to stabilize advantage estimates when tasks change.
    New normalization mechanism for continual RL optimization.

pith-pipeline@v0.9.0 · 5605 in / 1524 out tokens · 55425 ms · 2026-05-12T04:36:39.799478+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 14 internal anchors

  1. [1]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  4. [4]

    Qwen3.5-Omni Technical Report

    Qwen Team. Qwen3.5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  5. [5]

A Survey of Reinforcement Learning for Large Reasoning Models

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models.arXiv preprint arXiv:2509.08827, 2025

  6. [6]

    Leveraging verifier-based reinforcement learning in image editing

    Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, and Weilin Huang. Leveraging verifier-based reinforcement learning in image editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  7. [7]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  8. [8]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  9. [9]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  10. [10]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

  11. [11]

    To think or not to think: A study of thinking in rule-based visual reinforcement fine-tuning

    Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, and Kaipeng Zhang. To think or not to think: A study of thinking in rule-based visual reinforcement fine-tuning. InAdvances in Neural Information Processing Systems, 2025

  12. [12]

    Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. InAdvances in Neural Information Processing Systems, 2025

  13. [13]

    Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning

    Hulingxiao He, Zijun Geng, and Yuxin Peng. Fine-r1: Make multi-modal llms excel in fine-grained visual recognition by chain-of-thought reasoning. InInternational Conference on Learning Representations, 2026

  14. [14]

A survey on ensemble learning for data stream classification

    Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet. A survey on ensemble learning for data stream classification.ACM Computing Surveys, 50(2):1–36, 2017

  15. [15]

Towards lifelong learning of large language models: A survey

    Junhao Zheng, Shengjie Qiu, Chengming Shi, and Qianli Ma. Towards lifelong learning of large language models: A survey.ACM Computing Surveys, 57(8):1–35, 2025

  16. [16]

Continual learning of large language models: A comprehensive survey

    Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

  17. [17]

Recent advances of foundation language models-based continual learning: A survey

Yutao Yang, Jie Zhou, Xuanwen Ding, Tianyu Huai, Shunyu Liu, Qin Chen, Yuan Xie, and Liang He. Recent advances of foundation language models-based continual learning: A survey.ACM Computing Surveys, 57(5):1–38, 2025

  18. [18]

Continual instruction tuning for large multimodal models

    Jinghan He, Haiyun Guo, Kuan Zhu, Ming Tang, and Jinqiao Wang. Continual instruction tuning for large multimodal models.IEEE Transactions on Image Processing, 2026

  19. [19]

Reinforcement fine-tuning naturally mitigates forgetting in continual post-training

    Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Qingfu Zhang, Hongbin Liu, et al. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training.arXiv preprint arXiv:2507.05386, 2025

  20. [20]

    RL’s razor: Why online reinforcement learning forgets less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s razor: Why online reinforcement learning forgets less. InInternational Conference on Learning Representations, 2026

  21. [21]

Class-incremental learning: A survey

    Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Class-incremental learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9851–9873, 2024

  22. [22]

A comprehensive survey of continual learning: Theory, method and application

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

  23. [23]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021

  24. [24]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  25. [25]

    OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  26. [26]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  27. [27]

Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  28. [28]

Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  29. [29]

    Safe rlhf: Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. InInternational Conference on Learning Representations, 2024

  30. [30]

    Understanding r1-zero-like training: A critical perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. InSecond Conference on Language Modeling, 2025

  31. [31]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  32. [32]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  33. [33]

    Onethinker: All-in-one reasoning model for image and video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  34. [34]

    Learning to prompt for continual learning

    Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022

  35. [35]

Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need

Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need.International Journal of Computer Vision, 133(3):1012–1032, 2025

  36. [36]

S-prompts learning with pre-trained transformers: An Occam’s razor for domain incremental learning

    Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning.Advances in Neural Information Processing Systems, 35:5682–5695, 2022

  37. [37]

    Non-exemplar domain incremental learning via cross-domain concept integration

    Qiang Wang, Yuhang He, Songlin Dong, Xinyuan Gao, Shaokun Wang, and Yihong Gong. Non-exemplar domain incremental learning via cross-domain concept integration. InEuropean Conference on Computer Vision, pages 144–162. Springer, 2024

  38. [38]

    Continual learning with pre-trained models: a survey

    Da-Wei Zhou, Hai-Long Sun, Jingyi Ning, Han-Jia Ye, and De-Chuan Zhan. Continual learning with pre-trained models: a survey. InInternational Joint Conference on Artificial Intelligence, pages 8363–8371, 2024

  39. [39]

    Scaling continual learning to 300+ tasks with bi-level routing mixture-of-experts

    Meng Lou, Yunxiang Fu, and Yizhou Yu. Scaling continual learning to 300+ tasks with bi-level routing mixture-of-experts. InInternational Conference on Machine Learning, 2026

  40. [40]

    Inflora: Interference-free low-rank adaptation for continual learning

Yan-Shuo Liang and Wu-Jun Li. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024

  41. [41]

    Boosting continual learning of vision-language models via mixture-of-experts adapters

    Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024

  42. [42]

    Integrating task-specific and universal adapters for pre-trained model-based class-incremental learning

    Yan Wang, Da-Wei Zhou, and Han-Jia Ye. Integrating task-specific and universal adapters for pre-trained model-based class-incremental learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 806–816, 2025

  43. [43]

    Mos: Model surgery for pre-trained model-based class-incremental learning

    Hai-Long Sun, Da-Wei Zhou, Hanbin Zhao, Le Gan, De-Chuan Zhan, and Han-Jia Ye. Mos: Model surgery for pre-trained model-based class-incremental learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20699–20707, 2025

  44. [44]

Dual consolidation for pre-trained model-based domain-incremental learning

Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, Lijun Zhang, and De-Chuan Zhan. Dual consolidation for pre-trained model-based domain-incremental learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 20547–20557, 2025

  45. [45]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  46. [46]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  47. [47]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024

  48. [48]

Learning Beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

  49. [49]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

  50. [50]

Tiny imagenet visual recognition challenge

    Yann Le, Xuan Yang, et al. Tiny imagenet visual recognition challenge.CS 231N, 7(7):3, 2015

  51. [51]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011

  52. [52]

Few-shot class-incremental learning for classification and object detection: A survey

    Jinghua Zhang, Li Liu, Olli Silvén, Matti Pietikäinen, and Dewen Hu. Few-shot class-incremental learning for classification and object detection: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2924–2945, 2025

  53. [53]

    Riemannian walk for incremental learning: Understanding forgetting and intransigence

Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. InProceedings of the European Conference on Computer Vision, pages 532–547, 2018

  54. [54]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  55. [55]

Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

  56. [56]

Learning without forgetting

    Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017

  57. [57]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean Conference on Computer Vision, pages 740–755. Springer, 2014

  58. [58]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017

  59. [59]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012

  60. [60]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

  61. [61]

    Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification

    Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. InProceedings of the European conference on computer vision (ECCV), pages 305–321, 2018

  62. [62]

    Moment matching for multi-source domain adaptation

    Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  63. [63]

    Deep hashing network for unsupervised domain adaptation

    Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5018–5027, 2017

  64. [64]

Cross-domain weakly-supervised object detection through progressive domain adaptation

Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018

  65. [65]

The pascal visual object classes challenge: A retrospective

    Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective.International Journal of Computer Vision, 111(1):98–136, 2015

  66. [66]

Three types of incremental learning

    Gido M Van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning.Nature Machine Intelligence, 4(12):1185–1197, 2022

  67. [67]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean Conference on Computer Vision, pages 213–229, 2020

  68. [68]

Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347, 2025