On-Policy Replay for Continual Supervised Fine-Tuning

Dongyang Xu; Jiaqi Huang; Meng Zhang; Taojie Zhu; Xin Chen; Yan Chen; Yizhi Wang

arxiv: 2605.29495 · v1 · pith:5BFNLN7Cnew · submitted 2026-05-28 · 💻 cs.LG

Pith reviewed 2026-06-29 09:02 UTC · model grok-4.3

keywords: continual learning · catastrophic forgetting · supervised fine-tuning · on-policy replay · large language models · replay methods · TRACE benchmark

The pith

On-Policy Replay reduces forgetting in continual LLM fine-tuning by replaying the model's own generations on historical prompts as standard SFT data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-policy signals reduce catastrophic forgetting more reliably when routed through the training data rather than through new loss functions or teacher models. On-Policy Replay generates responses from the latest checkpoint on a small budget of old prompts, retains those passing a task reward filter, and trains on the resulting pairs using ordinary supervised fine-tuning. This removes the need for auxiliary objectives, extra forward passes, or stylistic drift from a teacher copy. A sympathetic reader cares because the approach delivers large, consistent gains on the TRACE benchmark across three 7-8B backbones while remaining simple to implement at low replay budgets.

Core claim

On-Policy Replay (OPR) rolls out the most recent checkpoint on historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7-8B instruction-tuned backbones on the TRACE continual-learning benchmark, OPR consistently reduces forgetting. On the sharpest stress test (Qwen2.5-7B-Instruct, where Sequential SFT has a BWT of -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget. The 1% result is a 46% reduction in |BWT| relative to a tuned Vanilla Replay baseline, and reductions of 42-46% are observed across all three backbones.

What carries the argument

On-Policy Replay: generating responses from the current model on a budget of historical prompts, filtering them by task reward, and inserting the surviving pairs directly into the standard SFT training set.
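The rollout, filter, and replay steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `generate`, `reward`, `build_opr_buffer`, `mix_for_sft`, the threshold-style filter, and the toy arithmetic task are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (prompt, response)


def build_opr_buffer(
    prompts: List[str],
    generate: Callable[[str], str],        # hypothetical: rollout from the latest checkpoint
    reward: Callable[[str, str], float],   # hypothetical: task reward for (prompt, response)
    budget: float,                         # fraction of historical prompts to roll out on
    threshold: float,                      # assumed filter: keep responses scoring >= threshold
    seed: int = 0,
) -> List[Pair]:
    """Roll out on a budgeted subset of old prompts and keep responses that pass the filter."""
    rng = random.Random(seed)
    n = max(1, round(budget * len(prompts)))
    sampled = rng.sample(prompts, n)
    buffer = []
    for prompt in sampled:
        response = generate(prompt)
        if reward(prompt, response) >= threshold:
            buffer.append((prompt, response))
    return buffer


def mix_for_sft(new_task: List[Pair], replay: List[Pair], seed: int = 0) -> List[Pair]:
    """Concatenate next-task data with the replay buffer and shuffle; the loss stays plain SFT."""
    data = list(new_task) + list(replay)
    random.Random(seed).shuffle(data)
    return data


# Toy stand-ins: the "model" answers a+a correctly only when a is odd.
def toy_generate(prompt: str) -> str:
    a = int(prompt.split("+")[0])
    return str(2 * a) if a % 2 == 1 else "?"


def toy_reward(prompt: str, response: str) -> float:
    a = int(prompt.split("+")[0])
    return 1.0 if response == str(2 * a) else 0.0


old_prompts = [f"{a}+{a}=" for a in range(1, 21)]
replay = build_opr_buffer(old_prompts, toy_generate, toy_reward, budget=0.5, threshold=1.0)
new_task = [("Translate 'chat' to English:", "cat")]
train_set = mix_for_sft(new_task, replay)
```

The point of the design is that the only thing that changes relative to ordinary replay is where the responses come from; the training loop itself never sees a new objective.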

If this is right

On Qwen2.5-7B-Instruct, where Sequential SFT yields a BWT of -13.93, OPR reaches a BWT of -0.65 at a 10% replay budget and -2.29 at a 1% budget.
Reductions of 42-46% in |BWT| hold across Qwen2.5-7B-Instruct, Qwen3-8B, and Llama3.1-8B-Instruct.
OPR admits a KL-shrinkage interpretation that places it on the same axis as prior on-policy distillation methods.
Low-score replay is uniformly worse than Vanilla Replay, showing that the on-policy distribution itself, not response quality alone, is the active ingredient.
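For readers who want to check the headline percentages, the sketch below computes BWT and the relative |BWT| reduction. It assumes BWT follows the standard continual-learning definition (the mean change in accuracy on each earlier task between the moment it was learned and the end of the sequence), which this page does not restate. The accuracy matrix is a made-up toy; the BWT values passed to `relative_abs_reduction` are the Vanilla Replay and OPR-RU figures listed with Figure 3.

```python
from typing import List


def backward_transfer(acc: List[List[float]]) -> float:
    """acc[t][i] is accuracy on task i after training through task t (0-indexed).

    Standard BWT: average of (final accuracy - accuracy right after learning)
    over all tasks except the last one.
    """
    T = len(acc)
    return sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)


def relative_abs_reduction(baseline_bwt: float, method_bwt: float) -> float:
    """Fraction by which |BWT| shrinks relative to the baseline."""
    return (abs(baseline_bwt) - abs(method_bwt)) / abs(baseline_bwt)


# Hypothetical 3-task accuracy matrix (not from the paper).
toy_acc = [
    [80.0, 0.0, 0.0],
    [74.0, 70.0, 0.0],
    [72.0, 66.0, 90.0],
]
toy_bwt = backward_transfer(toy_acc)  # ((72 - 80) + (66 - 70)) / 2

# Reported BWT values: Vanilla Replay vs. OPR-RU at replay budgets 0.01 and 0.10.
reduction_1pct = relative_abs_reduction(-4.27, -2.29)
reduction_10pct = relative_abs_reduction(-1.37, -0.65)
print(f"toy BWT = {toy_bwt:.2f}")
print(f"|BWT| reduction at 1%:  {reduction_1pct:.1%}")
print(f"|BWT| reduction at 10%: {reduction_10pct:.1%}")
```

With the reported numbers, the 46% figure corresponds to the 1% budget row.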

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may generalize to continual settings where an implicit quality signal can substitute for an explicit task reward.
Testing OPR inside reinforcement learning loops or multi-agent continual adaptation could reveal whether the on-policy sourcing principle extends beyond supervised fine-tuning.
Hybrid replay strategies that combine vanilla data with a small on-policy component might offer practical trade-offs when task rewards are expensive to compute.

Load-bearing premise

A reliable task reward function exists that can accurately filter model generations on historical prompts without introducing selection bias or requiring additional labeled data.

What would settle it

Training with unfiltered or randomly filtered model generations on the same historical prompts and observing whether the forgetting reduction relative to vanilla replay disappears.
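One way to set up that control is to build three replay arms from the same rollouts, with the random arm matched in size to the reward-filtered arm so that buffer size is not a confound. This is a hedged sketch: the function names, the threshold filter, and the toy scores are assumptions for illustration, not the paper's protocol.

```python
import random
from typing import List, Tuple

Scored = Tuple[str, str, float]  # (prompt, response, reward)


def reward_filtered(rollouts: List[Scored], threshold: float) -> List[Scored]:
    """Keep only rollouts whose reward clears the threshold (the OPR-style arm)."""
    return [r for r in rollouts if r[2] >= threshold]


def random_filtered(rollouts: List[Scored], keep: int, seed: int = 0) -> List[Scored]:
    """Keep a uniformly random subset of the same size, ignoring the reward."""
    return random.Random(seed).sample(rollouts, keep)


# Toy rollouts with rewards 0/9, 1/9, ..., 9/9 (hypothetical values).
rollouts = [(f"prompt-{i}", f"response-{i}", i / 9) for i in range(10)]

filtered_arm = reward_filtered(rollouts, threshold=0.5)
random_arm = random_filtered(rollouts, keep=len(filtered_arm))
unfiltered_arm = list(rollouts)  # every rollout, no selection at all
```

Each arm would then be mixed into the next task's SFT data in the same way, and BWT compared against Vanilla Replay.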

Figures

Figures reproduced from arXiv: 2605.29495 by Dongyang Xu, Jiaqi Huang, Meng Zhang, Taojie Zhu, Xin Chen, Yan Chen, Yizhi Wang.

**Figure 1.** On-Policy Replay (OPR) at a glance. Left: after each task, OPR rolls out on past-task prompts, keeps the top-confidence (prompt, response) pairs as a replay buffer, and mixes it with the next-task data under plain SFT, with no teacher and no auxiliary loss. Right: BWT vs. overall accuracy on TRACE/Qwen2.5-7B-Instruct at ρ=0.01; OPR-RU and OPR-SC move the frontier toward the MTL upper bound.

**Figure 2.** OPR pipeline. After training on task T_i (dataset S_i in the figure), the checkpoint π_θi rolls out on historical prompts; a scorer (rule reward or length-normalized log-likelihood) ranks the responses, the top-scoring (prompt, response) pairs form the replay buffer, and the buffer is mixed with task-T_(i+1) data for plain supervised fine-tuning.

**Figure 3.** Per-task accuracy across the 8 TRACE stages, SDFT vs. OPR-SC at ρ=0.01 on Qwen2.5-7B-Instruct. SDFT plateaus on MeetingBank; OPR-SC underperforms on NumGLUE-cm, where the model is confidently wrong.

| Method | ρ | ACC ↑ | BWT ↑ |
|---|---|---|---|
| Vanilla Replay | 0.01 | 62.57 | −4.27 |
| Vanilla Replay | 0.02 | 63.68 | −3.89 |
| Vanilla Replay | 0.05 | 64.50 | −3.53 |
| Vanilla Replay | 0.10 | 66.08 | −1.37 |
| OPR-RU (ours) | 0.01 | 65.45 | −2.29 |
| OPR-RU (ours) | 0.02 | 65.59 | −1.71 |
| OPR-RU (ours) | 0.05 | 65.78 | −1.43 |
| OPR-RU (ours) | 0.10 | 65.92 | −0.65 |

**Figure 4.** Per-stage training loss, SDFT vs. OPR-SC at ρ=0.01 on Qwen2.5-7B-Instruct. On MeetingBank (Stage 3), Py150 (Stage 4), and 20Minuten (Stage 8) SDFT's loss plateaus while OPR-SC's descends to the same level it reaches on the classification tasks. On four stages (C-STANCE, FOMC, ScienceQA, NumGLUE-ds) the curves are nearly indistinguishable.

**Figure 5.** decomposes…
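The Figure 2 caption names a length-normalized log-likelihood as one of the two scorers. A minimal sketch of that kind of scorer follows, assuming it means the mean per-token log-probability of the response under the current checkpoint; the exact normalization used in the paper is not shown on this page, and the function names and toy log-probabilities are hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, List[float]]  # (prompt, response, per-token log-probs)


def length_normalized_loglik(token_logprobs: List[float]) -> float:
    """Mean per-token log-probability, so long responses are not penalized for length alone."""
    if not token_logprobs:
        raise ValueError("response has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def top_k_by_score(candidates: List[Candidate], k: int) -> List[Tuple[str, str]]:
    """Rank rollouts by normalized log-likelihood and keep the k most confident pairs."""
    ranked = sorted(
        candidates,
        key=lambda c: length_normalized_loglik(c[2]),
        reverse=True,
    )
    return [(prompt, response) for prompt, response, _ in ranked[:k]]


# Toy rollouts: mean log-probs are -0.2, -0.5, and -0.1 respectively.
candidates = [
    ("q1", "a", [-0.1, -0.3]),
    ("q2", "b", [-0.5, -0.5, -0.5, -0.5]),
    ("q3", "c", [-0.05, -0.05, -0.2]),
]
kept = top_k_by_score(candidates, k=2)
```

A confidence-based scorer of this kind needs no labels, which also explains the failure mode noted under Figure 3: a confidently wrong response still ranks highly.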

Original abstract

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals (training on the model's own outputs) reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7-8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget, a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42-46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OPR reduces forgetting on TRACE by filtering on-policy generations for replay instead of adding losses or teachers. The gains over vanilla replay are clear, but the whole result rests on the task reward.

The core move here is straightforward: roll out the current model on stored historical prompts, score the outputs with a task reward, and feed only the accepted (prompt, response) pairs back into ordinary SFT. This keeps the on-policy signal without the extra machinery of distillation objectives or a teacher copy.

What stands out is that the method is deliberately minimal. There are no new loss terms and no schedule sensitivity, and the paper shows consistent BWT improvements across Qwen2.5-7B, Qwen3-8B, and Llama3.1-8B on TRACE. The 46% reduction in |BWT| at the 1% budget over a tuned vanilla replay baseline (with BWT reaching -0.65 at 10%) is the main empirical result, and the low-score replay ablation is useful because it shows the benefit is not just from picking better responses. The KL-shrinkage framing also gives a clean way to place this against prior on-policy distillation work.

The soft spot is the task reward itself. The gains depend on it filtering generations without introducing selection bias or relying on privileged information. The abstract does not spell out the reward's exact form or any checks for portability across tasks, so it is hard to judge how general the result is. If the reward is strong or task-specific, the improvement could be narrower than the headline numbers suggest. The numbers themselves look plausible but come from the abstract only, so variance, exact baseline tuning, and post-hoc choices still need the full manuscript to verify.

This is for people working on replay-based continual learning for LLMs who want something simpler than distillation. A reader focused on practical forgetting mitigation would find the procedure and the negative result on low-score replay worth seeing. The work is coherent enough and the code is released, so it deserves a serious referee with the main request being more detail on the reward and any bias tests.

Referee Report

3 major / 2 minor

Summary. The paper proposes On-Policy Replay (OPR) for continual supervised fine-tuning of LLMs. The method rolls out the current checkpoint on a budget of historical prompts, filters generations via a task reward, and replays the surviving (prompt, response) pairs as standard SFT data. No teacher model, auxiliary loss, or on-the-fly distillation is used. On the TRACE benchmark, OPR improves backward transfer (BWT, higher is better) across three 7-8B backbones. The headline result is on Qwen2.5-7B-Instruct, where Sequential SFT has a BWT of -13.93 and OPR reaches -0.65 (10% budget) or -2.29 (1% budget), a 46% |BWT| reduction versus a tuned Vanilla Replay baseline, with 42-46% reductions across models. A KL-shrinkage interpretation unifies OPR with prior on-policy methods, and an ablation shows low-score replay underperforms Vanilla Replay.

Significance. If the reported BWT reductions hold under the stated constraints, OPR supplies a lightweight mechanism for injecting on-policy signals into continual SFT without extra objectives or teacher copies. The public code release is a clear strength for reproducibility. The finding that response quality alone does not explain the gains (low-score ablation) is a useful negative result that sharpens the role of the on-policy distribution.

major comments (3)

[§3] §3 (Method), reward definition: the central claim that OPR requires 'no data beyond what is already available' and introduces no selection bias rests on the task reward being a fair, label-free filter; the manuscript provides no explicit definition, pseudocode, or per-task implementation details for this reward, preventing verification that it does not implicitly leverage ground-truth labels or task-specific heuristics.
[§4.2] §4.2, Table 2 (or equivalent results table): the 46% |BWT| reduction versus 'tuned Vanilla Replay' is reported without stating the hyperparameter search space, number of tuning runs, or whether the baseline received the same reward-based filtering; this detail is load-bearing for the claim that the improvement stems from the on-policy distribution rather than privileged filtering.
[§4.3] §4.3 (ablation): the low-score replay experiment demonstrates that filtering matters, yet does not test whether the reward itself is portable or unbiased across tasks; without this, the interpretation that 'the active ingredient in OPR is the on-policy distribution' remains under-supported.

minor comments (2)

[§5] The KL-shrinkage interpretation in §5 is presented as an after-the-fact lens; a short derivation or reference to the relevant KL term would clarify how OPR and distillation methods lie on the same axis.
Figure captions and axis labels in the experimental plots could be expanded to include exact replay budgets and model names for quicker reading.
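
As an illustration of the kind of short derivation the minor comment on §5 asks for, two standard identities already suggest why a reward-filtered on-policy buffer stays close to the current policy. This is a sketch under stated assumptions, not the paper's own derivation: it assumes a hard-threshold filter with accepted set A_x = {y : r(x, y) ≥ τ}, and the symbols q, π, π_θ, and τ are introduced here for the example.

```latex
% Replay target: the current policy \pi restricted to accepted responses.
q(y \mid x) \;=\; \frac{\pi(y \mid x)\,\mathbf{1}[\,y \in A_x\,]}{\pi(A_x \mid x)},
\qquad
\pi(A_x \mid x) \;=\; \sum_{y \in A_x} \pi(y \mid x).

% Distance of the replay target from the policy that generated it:
\mathrm{KL}\!\left(q \,\|\, \pi\right)
 \;=\; \sum_{y \in A_x} q(y \mid x)\,\log\frac{1}{\pi(A_x \mid x)}
 \;=\; -\log \pi(A_x \mid x).

% Plain SFT on samples from q, for a model \pi_\theta:
\mathbb{E}_{y \sim q}\!\left[-\log \pi_\theta(y \mid x)\right]
 \;=\; H(q) \;+\; \mathrm{KL}\!\left(q \,\|\, \pi_\theta\right).
```

Under these assumptions, the replay target sits within −log(acceptance rate) of the checkpoint that produced it, so SFT on the buffer pulls the model only a bounded distance from its current behavior on old prompts, whereas an off-policy reference response carries no such bound.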

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights areas where additional clarity will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested details.

Referee: [§3] §3 (Method), reward definition: the central claim that OPR requires 'no data beyond what is already available' and introduces no selection bias rests on the task reward being a fair, label-free filter; the manuscript provides no explicit definition, pseudocode, or per-task implementation details for this reward, preventing verification that it does not implicitly leverage ground-truth labels or task-specific heuristics.

Authors: We agree that the reward definition requires explicit documentation. The task reward for each TRACE task is the standard evaluation metric (e.g., exact match for math, accuracy for classification) computed against the ground-truth labels already present in the historical prompt datasets; no new labels or external data are introduced. This is label-using for filtering but does not constitute additional data collection. We will add a clear definition, pseudocode for the rollout-and-filter step, and per-task metric details to the revised §3.

revision: yes

Referee: [§4.2] §4.2, Table 2 (or equivalent results table): the 46% |BWT| reduction versus 'tuned Vanilla Replay' is reported without stating the hyperparameter search space, number of tuning runs, or whether the baseline received the same reward-based filtering; this detail is load-bearing for the claim that the improvement stems from the on-policy distribution rather than privileged filtering.

Authors: The Vanilla Replay baseline was tuned over the same replay budget grid (1%, 2%, 5%, 10%) and learning-rate range as OPR, using three random seeds per setting; it replays randomly sampled historical (prompt, response) pairs with no reward filtering applied. We will expand the experimental details in the revised §4.2 and caption of Table 2 to include the full search space and confirm the absence of reward filtering on the baseline.

revision: yes

Referee: [§4.3] §4.3 (ablation): the low-score replay experiment demonstrates that filtering matters, yet does not test whether the reward itself is portable or unbiased across tasks; without this, the interpretation that 'the active ingredient in OPR is the on-policy distribution' remains under-supported.

Authors: The reward is intentionally task-specific, matching the evaluation metric of each individual task; cross-task portability is neither claimed nor required for the method. The low-score ablation already isolates the role of the on-policy distribution: even on-policy responses that score poorly under the task reward underperform Vanilla Replay, showing that gains are not explained by response quality in isolation. We will add a clarifying paragraph in the revised §4.3 to make this scope explicit.

revision: partial
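
The three arms discussed in this exchange (Vanilla Replay, high-score on-policy replay, and low-score on-policy replay) can be laid out side by side. The sketch below is an assumption-laden illustration rather than the paper's ablation code: the record fields, `ablation_buffers`, and the toy scores are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (prompt, response)


def ablation_buffers(examples: List[Dict], k: int, seed: int = 0) -> Dict[str, List[Pair]]:
    """Build three equally sized replay buffers from the same historical examples.

    Each example holds a prompt, its stored reference response, the current
    checkpoint's rollout, and the task-reward score of that rollout.
    """
    ranked = sorted(examples, key=lambda e: e["score"], reverse=True)
    vanilla = random.Random(seed).sample(examples, k)  # random, no reward filtering
    return {
        "vanilla": [(e["prompt"], e["reference"]) for e in vanilla],
        "opr_high": [(e["prompt"], e["rollout"]) for e in ranked[:k]],
        "opr_low": [(e["prompt"], e["rollout"]) for e in ranked[-k:]],
    }


# Hypothetical records; scores are made up for the example.
examples = [
    {"prompt": f"p{i}", "reference": f"gold{i}", "rollout": f"gen{i}", "score": s}
    for i, s in enumerate([0.9, 0.1, 0.7, 0.3, 0.5, 0.2])
]
arms = ablation_buffers(examples, k=2)
```

Only the `vanilla` arm replays stored reference responses; both `opr_*` arms replay the model's own generations and differ only in which end of the score ranking they draw from.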

Circularity Check

0 steps flagged

No circularity: empirical procedure evaluated on external benchmarks

full rationale

The paper defines OPR as an explicit procedural method (rollout on historical prompts, filter by task reward, replay as SFT data) and reports BWT improvements measured against external TRACE benchmark results and tuned baselines. No equations or derivations are presented that reduce the claimed gains to a fitted parameter or self-citation by construction; the KL-shrinkage view is explicitly labeled an after-the-fact interpretation rather than a load-bearing derivation. The method is self-contained against the stated benchmarks with no reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the existence of a usable task reward for filtering and on the assumption that replaying a small fraction of on-policy data is sufficient to counteract forgetting; no new physical or mathematical entities are introduced.

free parameters (1)

replay budget percentage
The fraction of historical prompts used for rollout (reported at 1%, 2%, 5%, and 10%, with the abstract highlighting 1% and 10%) is a tunable hyperparameter that controls compute and data volume.

axioms (1)

domain assumption: A task-specific reward function is available to score and filter model generations on historical prompts.
The filtering step in OPR requires this reward; without it the method reduces to unfiltered on-policy replay.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 2 internal anchors

[1]

arXiv preprint arXiv:2508.04676

Gere: Towards efficient anti-forgetting in continual learning of llm via general samples replay. arXiv preprint arXiv:2508.04676. Chenye Zhao, Yingjie Li, and Cornelia Caragea. 2023. C-STANCE: A large dataset for Chinese zero-shot stance detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). Siyan Z...

[2]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734. Junhao Zheng, Xidi Cai, Shengjie Qiu, and Qianli Ma

[3]

Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning

Spurious forgetting in continual learning of language models. In International Conference on Learning Representations (ICLR). Taojie Zhu, Dongyang Xu, Ding Zou, Sen Zhao, Qiaobo Hao, Zhiguo Yang, and Yonghong He. 2026. Bridging sft and rl: Dynamic policy optimization for robust reasoning. arXiv preprint arXiv:2604.08926. A TRACE Tasks, Templates, and Metr...
