RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

Hua Zhou; Longbin Yu; Qian Kou; Xiaofeng Shi; Yuduo Li

arxiv: 2606.00147 · v1 · pith:NB4NLEJ7new · submitted 2026-05-29 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-29 00:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords domain fine-tuning · catastrophic forgetting · knowledge distillation · supervised fine-tuning · data refinement · on-policy distillation


The pith

RAFT refines domain data and applies answer-conditioned on-policy distillation to raise specialized accuracy while limiting loss of general capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard supervised fine-tuning on domain data creates two gaps: the targets clash in style with the model's natural outputs, and the training never checks how the model behaves on its own prefixes. RAFT closes both by first rewriting and filtering the data into model-compatible form and then distilling from the original model on trajectories the student actually generates. The result is higher domain performance together with partial recovery of the general benchmarks that ordinary fine-tuning degrades.
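The filtering step in the first stage can be pictured with a small sketch. The Figure 4 caption on this page reports a cosine similarity threshold τ that peaks at 0.8; everything else here (the function names `cosine_similarity` and `filter_rewrites`, the three-dimensional toy embeddings, and the choice to keep a rewrite when its similarity to the original is at least τ) is an assumption for illustration, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def filter_rewrites(pairs, tau=0.8):
    """Keep rewrites whose embedding stays close to the original answer.

    pairs: list of (original_embedding, rewrite_embedding, rewrite_text).
    tau:   similarity threshold (0.8 is the peak reported in Figure 4).
    """
    return [text for orig, rew, text in pairs
            if cosine_similarity(orig, rew) >= tau]

# Hypothetical toy embeddings; a real pipeline would use a sentence encoder.
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], "faithful rewrite"),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], "drifted rewrite"),
]
kept = filter_rewrites(pairs)
```

A higher τ keeps only rewrites that stay semantically close to the raw target, while a lower τ admits more stylistic drift toward the model's own distribution.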

Core claim

RAFT is a two-stage method that first builds supervision through self-conditioned rewriting, semantic filtering, and answer fusion, then trains with Answer-Conditioned On-Policy Distillation in which the original model supplies soft targets on the student's own rollouts while conditioned on the fused answer; top-K temperature distillation and EMA loss balancing keep the domain-general trade-off stable. On three backbones and five domains this yields a 23.2 percent average gain in domain accuracy over plain SFT and recovers part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2 percent and 10.2 percent respectively.
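One plausible reading of top-K temperature distillation is sketched below: restrict both distributions to the teacher's K most likely tokens, soften them with a temperature T, and penalize their divergence. The renormalization over the top-K set, the forward KL direction, and the function names are assumptions; the default T = 1.5 is taken from the recommended range in the Figure 5 caption.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def topk_temperature_kl(teacher_logits, student_logits, k=2, temperature=1.5):
    """KL(teacher || student) over the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the same
    top-k index set chosen from the teacher's logits.
    """
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    p = softmax([teacher_logits[i] for i in top], temperature)
    q = softmax([student_logits[i] for i in top], temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1, -1.0]
student = [0.5, 1.5, 0.1, -1.0]
loss = topk_temperature_kl(teacher, student)
```

Limiting the sum to K tokens keeps the cost per position bounded, which is consistent with the coverage versus GPU-time trade-off that the Figure 6 caption mentions.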

What carries the argument

Answer-Conditioned On-Policy Distillation, in which the teacher supplies soft targets on student-generated trajectories while the fused answer serves as conditioning context.
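A toy sketch may make the mechanism concrete. The two points it illustrates come from the description above: trajectories are sampled from the student, and the teacher scores those same prefixes while also seeing the fused answer. The three-token vocabulary, the hand-written `student_logits` and `teacher_logits` stand-ins, and the forward KL objective are all hypothetical choices for illustration.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def student_logits(prefix):
    # Toy student: mild preference for "a", growing preference for <eos>.
    return [1.0, 0.5, 0.2 * len(prefix)]

def teacher_logits(fused_answer, prefix):
    # Toy teacher: conditioned on the fused answer, it favors that
    # answer's next token at the current position (else <eos>).
    logits = [0.0, 0.0, 0.0]
    nxt = fused_answer[len(prefix)] if len(prefix) < len(fused_answer) else "<eos>"
    logits[VOCAB.index(nxt)] = 2.0
    return logits

def on_policy_distill_loss(fused_answer, max_len=5, seed=0):
    """Mean per-token KL(teacher || student) along a student rollout."""
    rng = random.Random(seed)
    prefix, total = [], 0.0
    for _ in range(max_len):
        q = softmax(student_logits(prefix))                # student view
        p = softmax(teacher_logits(fused_answer, prefix))  # teacher view
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
        token = rng.choices(VOCAB, weights=q)[0]  # sampled from the student
        prefix.append(token)
        if token == "<eos>":
            break
    return total / len(prefix)

loss = on_policy_distill_loss(["a", "b"])
```

The contrast with teacher-forced SFT is in the sampling line: the prefix is whatever the student produced, so the teacher's signal covers states the student actually visits.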

If this is right

Domain accuracy rises 23.2 percent on average relative to standard supervised fine-tuning.
Part of the SFT-induced degradation on MS-Bench is recovered, with a relative improvement of 18.2 percent.
Part of the SFT-induced degradation on IFEval is recovered, with a relative improvement of 10.2 percent.
The same recipe works across three different instruction-tuned base models and five distinct domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage pattern could be tested on tasks that require long-horizon reasoning where prefix behavior matters even more than in short instruction following.
If the fused answer is replaced by a weaker context signal, the preservation effect should shrink, offering a direct test of the conditioning step.
The method may generalize to continual learning settings where successive domain updates must not erase earlier capabilities.

Load-bearing premise

The rewritten and filtered data plus the soft targets from the original model remain free of new biases or distribution shifts that would cancel the claimed accuracy gains or destabilize the distillation.

What would settle it

Rerun the identical three backbones and five domains with the same evaluation suite. If RAFT's data refinement and on-policy distillation steps produce no measurable lift in domain accuracy or recovery on the general benchmarks relative to ordinary SFT, the claim fails.

Figures

Figures reproduced from arXiv: 2606.00147 by Hua Zhou, Longbin Yu, Qian Kou, Xiaofeng Shi, Yuduo Li.

**Figure 1.** Overview of the RAFT framework. Left: Offline Distillation generates higher-quality fused data by combining the model’s rewritten answers (filtered by cosine similarity) with the original data through a stronger fusion model. Right: Adaptive on-policy Distillation trains the model with Answer-Conditioned On-Policy Distillation, distillation over softened probability distributions, and EMA-based adaptive lo…

**Figure 2.** Effect of different balancing strategies on SmolLM3-3B in the Open Culture domain.

**Figure 3.** Distribution of relative ℓ2 weight distances from the pre-trained SmolLM3-3B model. Blue: standard SFT (mean = 0.0524). Red: RAFT (mean = 0.0458). A smaller distance indicates less deviation from the pre-trained model and thus less catastrophic forgetting. RAFT produces weights that are consistently closer to the original model across all layers. RAFT achieves a mean ℓ2 distance of 0.0458 compared to 0.052…

**Figure 4.** Effect of cosine similarity threshold τ on domain accuracy (D-Acc) on Open Culture and general capability (MS-Bench) on SmolLM3-3B. Both metrics peak at τ = 0.8, indicating that a moderately strict filtering threshold best balances semantic fidelity and distribution adaptation. As shown in…

**Figure 5.** Effect of temperature T on D-Acc and MS-Bench on SmolLM3-3B (Business & Industry). The shaded region indicates the recommended range T ∈ [1.5, 1.8].

**Figure 6.** Probability coverage and relative GPU time growth under different Top-…

**Figure 7.** Domain accuracy (D-Acc) on Open Culture and general capability (MS-Bench) under…

**Figure 8.** MMLU accuracy of the base model, SFT, and RAFT across five domains on SmolLM3-…
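The Figure 3 caption compares mean relative ℓ2 weight distances (0.0524 for SFT, 0.0458 for RAFT). The sketch below shows one common way such a distance is computed, the norm of the weight change divided by the norm of the base weights, and works out the relative gap between the two reported means. The formula is an assumption, since this page does not define the metric, and `relative_l2_distance` is a hypothetical helper.

```python
import math

def relative_l2_distance(w_tuned, w_base):
    """||w_tuned - w_base||_2 / ||w_base||_2 for flat weight lists."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_tuned, w_base)))
    base = math.sqrt(sum(b * b for b in w_base))
    return diff / base

# Means reported in the Figure 3 caption.
sft_mean, raft_mean = 0.0524, 0.0458

# Relative reduction in mean distance from SFT to RAFT.
reduction = (sft_mean - raft_mean) / sft_mean
```

With the caption's numbers, the mean distance under RAFT is roughly 12.6 percent smaller than under SFT.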

Original abstract

Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model's general capabilities. We view this degradation through two practical gaps in domain SFT: a supervision-compatibility gap, where domain targets differ in style and reasoning format from the original model's natural responses, and a trajectory-preservation gap, where teacher-forced SFT optimizes fixed target tokens without constraining the model's behavior on its own generated prefixes. This process fails to preserve the model's original behavior. We propose RAFT (Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting), a two-stage framework that addresses both factors. First, RAFT constructs model-compatible supervision through self-conditioned rewriting, semantic filtering, and answer fusion. Second, RAFT performs Answer-Conditioned On-Policy Distillation, where the original instruction-tuned model provides soft targets on student-generated trajectories while being conditioned on the fused answer as helpful context. We further introduce top-K temperature distillation and EMA-based adaptive loss balancing to stabilize the domain-general trade-off. Across three instruction-tuned backbones and five domains, RAFT improves average domain accuracy by 23.2% over standard SFT, while recovering part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2% and 10.2%, respectively. These results show that coupling data refinement with trajectory-level preservation provides an effective recipe for domain fine-tuning with alleviated forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RAFT's two-stage recipe of data refinement plus answer-conditioned on-policy distillation reports decent gains on domain tasks with partial forgetting recovery, but the abstract leaves open whether the lifts come from the claimed mechanisms or from unmeasured shifts in the rewritten targets.

The main takeaway is that this paper gives a concrete recipe for domain fine-tuning that tries to keep the original model's behavior intact. It first rewrites and filters the supervision data so it matches the base model's style and reasoning, then runs distillation where the original model supplies soft targets on the student's own generated prefixes, conditioned on the fused answer. Top-K temperature distillation and EMA loss balancing are added to keep the trade-off stable.

The approach is new in how it ties the rewriting step directly to on-policy trajectories rather than standard SFT or off-policy KD. Reporting results across three backbones and five domains, with a 23.2 percent average domain lift and some rebound on MS-Bench and IFEval, is useful for practitioners who actually ship these models.

The soft spot is the missing checks on the data refinement stage. There are no numbers on how much the rewritten targets differ in length, style, or content from the raw ones. If the filtering and fusion steps are quietly selecting easier or more base-model-like examples, both the domain gains and the forgetting reduction could be artifacts of better data rather than the distillation or EMA parts. The absence of ablations, error bars, and a direct comparison of raw versus refined targets makes it hard to trust the causal story.

This is the sort of incremental engineering paper that domain-adaptation teams might actually try. A reader who needs a drop-in method for legal or medical fine-tuning could get value from the details once the full experiments are available. It is worth sending to referees because the problem is practical and the framing is clear, even though the current evidence is still preliminary and needs tighter controls on the data pipeline.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes RAFT, a two-stage framework for domain-specific fine-tuning of instruction-tuned models. Stage 1 constructs model-compatible supervision via self-conditioned rewriting, semantic filtering, and answer fusion to close the supervision-compatibility gap. Stage 2 performs Answer-Conditioned On-Policy Distillation, in which the original instruction-tuned checkpoint supplies soft targets on student-generated trajectories (conditioned on the fused answer), augmented by top-K temperature distillation and EMA-based adaptive loss balancing to close the trajectory-preservation gap. Across three backbones and five domains the method reports a 23.2% average domain-accuracy lift over standard SFT together with partial recovery of SFT-induced degradation on MS-Bench (18.2% relative) and IFEval (10.2% relative).

Significance. If the central empirical claims hold after verification, RAFT supplies a practical, two-gap recipe for domain adaptation that couples data refinement with on-policy trajectory preservation. The multi-backbone, multi-domain evaluation is a positive feature. The work also ships an external reference point (the original instruction-tuned checkpoint) rather than a circular teacher, which strengthens the distillation design.

major comments (3)

[Abstract and §4] The headline 23.2% domain-accuracy gain, 18.2% MS-Bench recovery, and 10.2% IFEval recovery are reported without error bars, standard deviations across random seeds, or statistical significance tests. This absence makes it impossible to assess whether the reported lifts are robust or could be explained by hyper-parameter variation or data-selection effects.
[§3.1] Data-refinement pipeline: the claim that self-conditioned rewriting + semantic filtering + answer fusion produce strictly higher-quality, unbiased targets rests on an untested assumption. No quantitative diagnostics (embedding divergence, n-gram overlap statistics, length/style distribution shifts, or human preference between raw and refined targets) are provided to rule out the possibility that observed gains arise from altered supervision rather than the subsequent distillation mechanism.
[§4] Experimental protocol: no ablation isolates the contribution of Answer-Conditioned On-Policy Distillation from the data-refinement stage, nor is sensitivity reported for the two free parameters (top-K temperature value and EMA decay rate). Without these controls the attribution of both domain gains and forgetting alleviation to the proposed components remains unverified.

minor comments (2)

[§3.2] The description of the EMA-based adaptive loss balancing would benefit from an explicit equation showing how the balancing coefficient is updated.
[§3] Implementation details for the semantic-filtering threshold and the exact conditioning format used in the distillation stage are missing and should be added for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on empirical robustness and component attribution. We address each major comment below and commit to revisions that strengthen the statistical reporting, diagnostics, and ablations.

Referee: [Abstract and §4] The headline 23.2% domain-accuracy gain, 18.2% MS-Bench recovery, and 10.2% IFEval recovery are reported without error bars, standard deviations across random seeds, or statistical significance tests. This absence makes it impossible to assess whether the reported lifts are robust or could be explained by hyper-parameter variation or data-selection effects.

Authors: We agree that the absence of error bars and significance tests limits assessment of robustness. In the revised manuscript we will rerun key experiments across multiple random seeds, report standard deviations, and include statistical significance tests for the headline metrics. Revision: yes.

Referee: [§3.1] Data-refinement pipeline: the claim that self-conditioned rewriting + semantic filtering + answer fusion produce strictly higher-quality, unbiased targets rests on an untested assumption. No quantitative diagnostics (embedding divergence, n-gram overlap statistics, length/style distribution shifts, or human preference between raw and refined targets) are provided to rule out the possibility that observed gains arise from altered supervision rather than the subsequent distillation mechanism.

Authors: The referee is correct that direct quantitative diagnostics on the refined targets are missing. While the end-to-end gains support the pipeline, we will add embedding divergence, n-gram overlap statistics, and length/style distribution comparisons between raw and refined targets in the revision to better substantiate the supervision-compatibility claims. Revision: yes.

Referee: [§4] Experimental protocol: no ablation isolates the contribution of Answer-Conditioned On-Policy Distillation from the data-refinement stage, nor is sensitivity reported for the two free parameters (top-K temperature value and EMA decay rate). Without these controls the attribution of both domain gains and forgetting alleviation to the proposed components remains unverified.

Authors: We agree that isolating the distillation stage and reporting parameter sensitivity would strengthen attribution. The revised manuscript will include an ablation that separates data refinement from Answer-Conditioned On-Policy Distillation and will report sensitivity results for top-K temperature and EMA decay rate. Revision: yes.

Circularity Check

0 steps flagged

No circularity in claimed derivation or predictions

The paper describes an empirical two-stage procedure (data refinement via rewriting/filtering/fusion followed by answer-conditioned on-policy distillation with EMA balancing) whose performance claims rest on experimental comparisons against standard SFT across three backbones and five domains. No equations, fitted parameters, or uniqueness theorems are presented that reduce by construction to the method's own inputs; the teacher is explicitly the original instruction-tuned checkpoint, supplying an external reference. No self-citation chains or ansatzes are invoked to justify core design choices. The reported gains (23.2% domain accuracy, partial recovery on MS-Bench/IFEval) are therefore not forced by definition or statistical artifact within the paper's own framing, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method rests on several unstated modeling choices and hyperparameters whose values are not derived from first principles.

free parameters (2)

top-K value for temperature distillation
Chosen to stabilize distillation; value not derived, must be tuned.
EMA decay rate for adaptive loss balancing
Controls trade-off between domain and general loss; fitted or hand-chosen.
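To show where an EMA decay rate would enter, here is a minimal sketch of one possible balancing rule. It is not the paper's update, which this page does not give. The assumption is that exponential moving averages of the domain loss and the general (distillation) loss are tracked, and the general term is weighted by the ratio of the two averages so that both terms stay on a comparable scale. `EmaBalancer` and its fields are hypothetical names.

```python
class EmaBalancer:
    """Hypothetical EMA-based weighting of a general-loss term."""

    def __init__(self, decay=0.9):
        self.decay = decay          # the free parameter listed above
        self.ema_domain = None
        self.ema_general = None

    def update(self, domain_loss, general_loss):
        """Return (combined_loss, weight_on_general_term)."""
        if self.ema_domain is None:
            # Initialize both averages from the first observation.
            self.ema_domain = domain_loss
            self.ema_general = general_loss
        else:
            d = self.decay
            self.ema_domain = d * self.ema_domain + (1 - d) * domain_loss
            self.ema_general = d * self.ema_general + (1 - d) * general_loss
        # Assumed rule: scale the general term to the domain term's
        # running magnitude; the floor avoids division by zero.
        weight = self.ema_domain / max(self.ema_general, 1e-8)
        return domain_loss + weight * general_loss, weight

balancer = EmaBalancer(decay=0.5)
first_total, first_weight = balancer.update(2.0, 4.0)
second_total, second_weight = balancer.update(1.0, 4.0)
```

A decay close to 1 makes the weight react slowly to individual batches, while a small decay lets it follow recent losses closely; that is the sensitivity the ledger entry points at.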

axioms (2)

domain assumption: The base instruction-tuned model can generate coherent trajectories on which soft targets from the same model remain meaningful.
Invoked when performing on-policy distillation.
ad hoc to paper: Semantic filtering and answer fusion preserve task-relevant information without introducing label noise.
Central to the data refinement stage.
