arxiv: 2606.00147 · v1 · pith:NB4NLEJ7new · submitted 2026-05-29 · 💻 cs.LG · cs.AI

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

Pith reviewed 2026-06-29 00:00 UTC · model grok-4.3

classification: cs.LG, cs.AI
keywords: domain fine-tuning, catastrophic forgetting, knowledge distillation, supervised fine-tuning, data refinement, on-policy distillation

The pith

RAFT refines domain data and applies answer-conditioned on-policy distillation to raise specialized accuracy while limiting loss of general capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard supervised fine-tuning on domain data creates two gaps: the targets clash in style with the model's natural outputs, and the training never checks how the model behaves on its own prefixes. RAFT closes both by first rewriting and filtering the data into model-compatible form and then distilling from the original model on trajectories the student actually generates. The result is higher domain performance together with partial recovery of the general benchmarks that ordinary fine-tuning degrades.

Core claim

RAFT is a two-stage method. Stage one builds supervision through self-conditioned rewriting, semantic filtering, and answer fusion. Stage two trains with Answer-Conditioned On-Policy Distillation, in which the original model supplies soft targets on the student's own rollouts while conditioned on the fused answer; top-K temperature distillation and EMA loss balancing keep the domain-general trade-off stable. On three backbones and five domains this yields a 23.2 percent average gain in domain accuracy over plain SFT and recovers part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2 percent and 10.2 percent respectively.
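
One way to read the top-K temperature distillation named above is the minimal sketch below, in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names `log_softmax` and `topk_temperature_kl`, the renormalization of both distributions over the teacher's top-K tokens, and the KL direction (teacher to student) are all choices made here.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def topk_temperature_kl(teacher_logits, student_logits, k, temperature):
    """KL(teacher || student) restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the same top-k
    token set after temperature scaling. The paper may define this differently.
    """
    # Indices of the k largest teacher logits.
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    log_p = log_softmax([teacher_logits[i] for i in top], temperature)
    log_q = log_softmax([student_logits[i] for i in top], temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Restricting the sum to K tokens trades a small amount of probability mass for less compute per position, and a temperature above 1 flattens the teacher distribution so that lower-ranked tokens still contribute to the target.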

What carries the argument

Answer-Conditioned On-Policy Distillation, in which the teacher supplies soft targets on student-generated trajectories while the fused answer serves as conditioning context.
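
The mechanism above can be sketched as a single rollout loop. This is a toy illustration, not the authors' code: `student_next` and `teacher_next` are hypothetical callables that return a next-token distribution as a dict, contexts are plain token lists, and placing the fused answer before the prompt in the teacher's context is an assumption about the conditioning format.

```python
import math
import random

def on_policy_distill_step(prompt, fused_answer, student_next, teacher_next,
                           max_len, rng):
    """Average per-token KL(teacher || student) along one student rollout."""
    prefix, losses = [], []
    for _ in range(max_len):
        # The student sees only the prompt and its own prefix (on-policy).
        q = student_next(prompt + prefix)
        # The teacher additionally sees the fused answer as context.
        p = teacher_next(fused_answer + prompt + prefix)
        losses.append(sum(pv * math.log(pv / q[t])
                          for t, pv in p.items() if pv > 0))
        # The next token is sampled from the student, not read from a target.
        tokens, weights = zip(*q.items())
        prefix.append(rng.choices(tokens, weights=weights)[0])
    return prefix, sum(losses) / len(losses)

# Toy stand-ins for real models (hypothetical, for illustration only).
def toy_student(context):
    return {"a": 0.5, "b": 0.5}

def toy_teacher(context):
    # The teacher sharpens its prediction when the answer token is in context.
    return {"a": 0.9, "b": 0.1} if "a" in context else {"a": 0.5, "b": 0.5}

rollout, loss = on_policy_distill_step(
    ["q"], ["a"], toy_student, toy_teacher, max_len=4, rng=random.Random(0))
```

The contrast with teacher-forced SFT is in the sampling line: the loss is measured on prefixes the student actually produces, which is the trajectory-preservation gap the paper describes.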

If this is right

  • Domain accuracy rises 23.2 percent on average relative to standard supervised fine-tuning.
  • MS-Bench improves by a relative 18.2 percent over standard SFT, recovering part of the SFT-induced degradation.
  • IFEval likewise improves by a relative 10.2 percent over standard SFT.
  • The same recipe works across three different instruction-tuned base models and five distinct domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage pattern could be tested on tasks that require long-horizon reasoning where prefix behavior matters even more than in short instruction following.
  • If the fused answer is replaced by a weaker context signal the preservation effect should shrink, offering a direct test of the conditioning step.
  • The method may generalize to continual learning settings where successive domain updates must not erase earlier capabilities.

Load-bearing premise

The rewritten and filtered data plus the soft targets from the original model remain free of new biases or distribution shifts that would cancel the claimed accuracy gains or destabilize the distillation.

What would settle it

Rerunning the same three backbones and five domains with the same evaluation suite and finding that RAFT gives no measurable lift over ordinary SFT, either in domain accuracy or in recovery on the general benchmarks, would refute the central claim.

Figures

Figures reproduced from arXiv: 2606.00147 by Hua Zhou, Longbin Yu, Qian Kou, Xiaofeng Shi, Yuduo Li.

Figure 1: Overview of the RAFT framework. Left: Offline Distillation generates higher-quality fused data by combining the model’s rewritten answers (filtered by cosine similarity) with the original data through a stronger fusion model. Right: Adaptive on-policy Distillation trains the model with Answer-Conditioned On-Policy Distillation, distillation over softened probability distributions, and EMA-based adaptive lo…

Figure 2: Effect of different balancing strategies on SmolLM3-3B in the Open Culture domain.

Figure 3: Distribution of relative ℓ2 weight distances from the pre-trained SmolLM3-3B model. Blue: standard SFT (mean = 0.0524). Red: RAFT (mean = 0.0458). A smaller distance indicates less deviation from the pre-trained model and thus less catastrophic forgetting. RAFT produces weights that are consistently closer to the original model across all layers. RAFT achieves a mean ℓ2 distance of 0.0458 compared to 0.052…

Figure 4: Effect of cosine similarity threshold τ on domain accuracy (D-Acc) on Open Culture and general capability (MS-Bench) on SmolLM3-3B. Both metrics peak at τ = 0.8, indicating that a moderately strict filtering threshold best balances semantic fidelity and distribution adaptation.

Figure 5: Effect of temperature T on D-Acc and MS-Bench on SmolLM3-3B (Business & Industry). The shaded region indicates the recommended range T ∈ [1.5, 1.8]

Figure 6: Probability coverage and relative GPU time growth under different Top-

Figure 7: Domain accuracy (D-Acc) on Open Culture and general capability (MS-Bench) under

Figure 8: MMLU accuracy of the base model, SFT, and RAFT across five domains on SmolLM3-
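
The cosine-similarity filtering that Figures 1 and 4 refer to can be sketched as follows. This is a minimal illustration under assumptions: embeddings are taken as precomputed vectors (the embedding model is not specified here), `filter_rewrites` is a hypothetical helper name, and keeping rewrites with similarity at or above τ is one reading of "filtered by cosine similarity". The default of 0.8 is the value at which Figure 4 reports both metrics peaking.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def filter_rewrites(candidates, tau=0.8):
    """Keep rewrites whose embedding stays close to the original answer's.

    candidates: list of (original_vec, rewrite_vec, rewrite_text) tuples.
    """
    return [text for orig_vec, rew_vec, text in candidates
            if cosine_similarity(orig_vec, rew_vec) >= tau]
```

A higher τ keeps only rewrites that stay semantically close to the source answer; a lower τ admits more stylistic drift toward the model's own distribution.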
Original abstract

Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model's general capabilities. We view this degradation through two practical gaps in domain SFT: a supervision-compatibility gap, where domain targets differ in style and reasoning format from the original model's natural responses, and a trajectory-preservation gap, where teacher-forced SFT optimizes fixed target tokens without constraining the model's behavior on its own generated prefixes. This process fails to preserve the model's original behavior. We propose RAFT (Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting), a two-stage framework that addresses both factors. First, RAFT constructs model-compatible supervision through self-conditioned rewriting, semantic filtering, and answer fusion. Second, RAFT performs Answer-Conditioned On-Policy Distillation, where the original instruction-tuned model provides soft targets on student-generated trajectories while being conditioned on the fused answer as helpful context. We further introduce top-K temperature distillation and EMA-based adaptive loss balancing to stabilize the domain-general trade-off. Across three instruction-tuned backbones and five domains, RAFT improves average domain accuracy by 23.2% over standard SFT, while recovering part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2% and 10.2%, respectively. These results show that coupling data refinement with trajectory-level preservation provides an effective recipe for domain fine-tuning with alleviated forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RAFT, a two-stage framework for domain-specific fine-tuning of instruction-tuned models. Stage 1 constructs model-compatible supervision via self-conditioned rewriting, semantic filtering, and answer fusion to close the supervision-compatibility gap. Stage 2 performs Answer-Conditioned On-Policy Distillation, in which the original instruction-tuned checkpoint supplies soft targets on student-generated trajectories (conditioned on the fused answer), augmented by top-K temperature distillation and EMA-based adaptive loss balancing to close the trajectory-preservation gap. Across three backbones and five domains the method reports a 23.2% average domain-accuracy lift over standard SFT together with partial recovery of SFT-induced degradation on MS-Bench (18.2% relative) and IFEval (10.2% relative).

Significance. If the central empirical claims hold after verification, RAFT supplies a practical, two-gap recipe for domain adaptation that couples data refinement with on-policy trajectory preservation. The multi-backbone, multi-domain evaluation is a positive feature. The work also ships an external reference point (the original instruction-tuned checkpoint) rather than a circular teacher, which strengthens the distillation design.

major comments (3)
  1. [Abstract and §4] The headline 23.2% domain-accuracy gain, 18.2% MS-Bench recovery, and 10.2% IFEval recovery are reported without error bars, standard deviations across random seeds, or statistical significance tests. This absence makes it impossible to assess whether the reported lifts are robust or could be explained by hyper-parameter variation or data-selection effects.
  2. [§3.1] (data-refinement pipeline) The claim that self-conditioned rewriting, semantic filtering, and answer fusion produce strictly higher-quality, unbiased targets rests on an untested assumption. No quantitative diagnostics (embedding divergence, n-gram overlap statistics, length/style distribution shifts, or human preference between raw and refined targets) are provided to rule out the possibility that observed gains arise from altered supervision rather than the subsequent distillation mechanism.
  3. [§4] (experimental protocol) No ablation isolates the contribution of Answer-Conditioned On-Policy Distillation from the data-refinement stage, nor is sensitivity reported for the two free parameters (the top-K value for temperature distillation and the EMA decay rate). Without these controls the attribution of both domain gains and forgetting alleviation to the proposed components remains unverified.
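
The reporting requested in major comment 1 could look like the sketch below, which computes a per-method mean and sample standard deviation across seeds and a paired t statistic on per-seed differences. The helper names are hypothetical, and the scores are made-up placeholders used only to exercise the code; they are not results from the paper.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(method_scores, baseline_scores):
    """t statistic for paired per-seed differences (same seeds in both runs)."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Placeholder per-seed accuracies, NOT taken from the paper.
placeholder_raft = [0.62, 0.60, 0.64]
placeholder_sft = [0.50, 0.51, 0.49]

raft_mean, raft_std = summarize(placeholder_raft)
t_stat = paired_t_statistic(placeholder_raft, placeholder_sft)
```

The t statistic would then be compared against a t distribution with n - 1 degrees of freedom; with only a handful of seeds the test has little power, so reporting the raw per-seed values alongside it is a reasonable complement.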
minor comments (2)
  1. [§3.2] The description of the EMA-based adaptive loss balancing would benefit from an explicit equation showing how the balancing coefficient is updated.
  2. [§3] Implementation details for the semantic-filtering threshold and the exact conditioning format used in the distillation stage are missing and should be added for reproducibility.
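
To make minor comment 1 concrete, the sketch below shows one plausible form an EMA-based balancing rule could take. It is an assumption for illustration only: the source does not state the paper's update rule, and the class name `EmaLossBalancer`, the ratio-based weight, and the default decay are all hypothetical.

```python
class EmaLossBalancer:
    """Tracks EMAs of two losses and rescales the distillation term.

    Assumption: weight = ema_domain / ema_distill, so that both terms
    contribute at a comparable scale. The paper's rule may differ.
    """

    def __init__(self, decay=0.99, eps=1e-8):
        self.decay = decay
        self.eps = eps
        self.ema_domain = None
        self.ema_distill = None

    def update(self, domain_loss, distill_loss):
        """Update both EMAs and return the combined loss for this step."""
        if self.ema_domain is None:
            # Initialize the averages with the first observed values.
            self.ema_domain, self.ema_distill = domain_loss, distill_loss
        else:
            d = self.decay
            self.ema_domain = d * self.ema_domain + (1 - d) * domain_loss
            self.ema_distill = d * self.ema_distill + (1 - d) * distill_loss
        weight = self.ema_domain / (self.ema_distill + self.eps)
        return domain_loss + weight * distill_loss
```

A decay close to 1 makes the weight change slowly and smooths out noisy batches; a smaller decay tracks recent losses more closely at the cost of stability.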

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on empirical robustness and component attribution. We address each major comment below and commit to revisions that strengthen the statistical reporting, diagnostics, and ablations.

  1. Referee: [Abstract and §4] The headline 23.2% domain-accuracy gain, 18.2% MS-Bench recovery, and 10.2% IFEval recovery are reported without error bars, standard deviations across random seeds, or statistical significance tests. This absence makes it impossible to assess whether the reported lifts are robust or could be explained by hyper-parameter variation or data-selection effects.

    Authors: We agree that the absence of error bars and significance tests limits assessment of robustness. In the revised manuscript we will rerun key experiments across multiple random seeds, report standard deviations, and include statistical significance tests for the headline metrics. revision: yes

  2. Referee: [§3.1] (data-refinement pipeline) The claim that self-conditioned rewriting, semantic filtering, and answer fusion produce strictly higher-quality, unbiased targets rests on an untested assumption. No quantitative diagnostics (embedding divergence, n-gram overlap statistics, length/style distribution shifts, or human preference between raw and refined targets) are provided to rule out the possibility that observed gains arise from altered supervision rather than the subsequent distillation mechanism.

    Authors: The referee is correct that direct quantitative diagnostics on the refined targets are missing. While the end-to-end gains support the pipeline, we will add embedding divergence, n-gram overlap statistics, and length/style distribution comparisons between raw and refined targets in the revision to better substantiate the supervision-compatibility claims. revision: yes

  3. Referee: [§4] (experimental protocol) No ablation isolates the contribution of Answer-Conditioned On-Policy Distillation from the data-refinement stage, nor is sensitivity reported for the two free parameters (the top-K value for temperature distillation and the EMA decay rate). Without these controls the attribution of both domain gains and forgetting alleviation to the proposed components remains unverified.

    Authors: We agree that isolating the distillation stage and reporting parameter sensitivity would strengthen attribution. The revised manuscript will include an ablation that separates data refinement from Answer-Conditioned On-Policy Distillation and will report sensitivity results for the top-K value and the EMA decay rate. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation or predictions

The paper describes an empirical two-stage procedure (data refinement via rewriting, filtering, and fusion, followed by answer-conditioned on-policy distillation with EMA balancing) whose performance claims rest on experimental comparisons against standard SFT across three backbones and five domains. No equations, fitted parameters, or uniqueness theorems are presented that reduce by construction to the method's own inputs; the teacher is explicitly the original instruction-tuned checkpoint, which supplies an external reference. No self-citation chains or ansatzes are invoked to justify core design choices. The reported gains (23.2% domain accuracy, partial recovery on MS-Bench and IFEval) are therefore not forced by definition or statistical artifact within the paper's own framing, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method rests on several unstated modeling choices and hyperparameters whose values are not derived from first principles.

free parameters (2)
  • top-K value for temperature distillation
    Chosen to stabilize distillation; value not derived, must be tuned.
  • EMA decay rate for adaptive loss balancing
    Controls trade-off between domain and general loss; fitted or hand-chosen.
axioms (2)
  • domain assumption: The base instruction-tuned model can generate coherent trajectories on which soft targets from the same model remain meaningful.
    Invoked when performing on-policy distillation.
  • ad hoc to paper: Semantic filtering and answer fusion preserve task-relevant information without introducing label noise.
    Central to the data refinement stage.
