Why Fine-Tuning Encourages Hallucinations and How to Fix It
Pith reviewed 2026-05-10 10:47 UTC · model grok-4.3
The pith
Supervised fine-tuning on new facts increases hallucinations about pre-trained knowledge primarily through interference among overlapping semantic representations, and self-distillation mitigates this by regularizing output drift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Supervised fine-tuning on new factual information degrades performance on pre-training knowledge by causing interference among overlapping semantic representations. Self-distillation mitigates hallucinations by regularizing the drift in output distributions, thereby preserving prior knowledge while enabling effective acquisition of new facts. Experiments confirm that localized semantic interference, rather than capacity limitations or behavior cloning, is the main driver of the increased hallucinations.
What carries the argument
Localized semantic interference among overlapping representations, addressed through output-distribution regularization in self-distillation.
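As a concrete illustration of that machinery, here is a minimal sketch of a self-distillation-style SFT loss: a cross-entropy term on the new-fact tokens plus a KL penalty that keeps the fine-tuned model's output distribution close to a frozen copy of the pre-SFT model. The weighting lam, the temperature tau, and the use of the pre-SFT checkpoint as teacher are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_distillation_sft_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor,
                               target_ids: torch.Tensor,
                               lam: float = 1.0,
                               tau: float = 1.0) -> torch.Tensor:
    """Cross-entropy on new-fact tokens plus a KL penalty that limits how far
    the student's output distribution drifts from a frozen pre-SFT teacher."""
    vocab = student_logits.size(-1)
    # SFT term: learn the new factual targets (-100 marks ignored positions).
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         target_ids.view(-1), ignore_index=-100)
    # Drift-regularization term: KL(teacher || student) over the vocabulary,
    # from temperature-softened distributions (a padding mask could be added).
    s_logp = F.log_softmax(student_logits.view(-1, vocab) / tau, dim=-1)
    t_logp = F.log_softmax(teacher_logits.view(-1, vocab) / tau, dim=-1)
    kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
    return ce + lam * kl
```

In practice the teacher logits would come from a no-grad forward pass of the frozen pre-SFT checkpoint on the same batch, and lam trades off new-fact acquisition against drift on prior knowledge.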
If this is right
- Self-distillation enables factual learning with reduced hallucinations on pre-trained knowledge.
- Freezing parameter groups suppresses factual plasticity and reduces hallucinations in cases where new knowledge acquisition is not required (see the sketch after this list).
- Mitigating semantic interference directly addresses the root cause of SFT-induced hallucinations.
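A minimal sketch of the freezing idea in the second bullet above, assuming a PyTorch model and name-based selection of parameter groups. The paper's actual freezing criteria are not specified in this summary, so the patterns below (MLP blocks and embeddings) are purely illustrative.

```python
import torch.nn as nn
from typing import Iterable, List, Tuple

def freeze_parameter_groups(model: nn.Module,
                            patterns: Iterable[str] = ("mlp", "embed")
                            ) -> Tuple[List[str], List[str]]:
    """Disable gradients for parameters whose names match any pattern,
    suppressing plasticity in those groups while the rest keeps training."""
    frozen, trainable = [], []
    for name, param in model.named_parameters():
        if any(p in name for p in patterns):
            param.requires_grad_(False)
            frozen.append(name)
        else:
            trainable.append(name)
    return frozen, trainable

# Only the still-trainable parameters would then be handed to the optimizer, e.g.
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```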
Where Pith is reading between the lines
- Similar output-regularization methods could help retain knowledge across a wider range of fine-tuning tasks that do not involve new factual content.
- The interference pattern may appear in other training regimes that mix old and new information, such as multi-task learning.
- Testing the same regularization on models of different sizes could reveal whether larger capacity reduces the severity of representation overlap.
Load-bearing premise
The increase in hallucinations during supervised fine-tuning stems primarily from knowledge degradation due to localized semantic interference rather than from data quality issues or optimization dynamics.
What would settle it
If applying self-distillation during fine-tuning failed to reduce hallucinations on pre-training facts while new facts were still acquired accurately, the claim that semantic interference is the primary driver would be undermined.
original abstract
Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that supervised fine-tuning (SFT) on new facts increases hallucinations in LLMs relative to pre-trained knowledge, primarily due to interference among overlapping semantic representations rather than capacity limits or behavior cloning. It proposes a self-distillation SFT method that regularizes output-distribution drift to enable factual learning while reducing hallucinations, and shows that freezing parameter groups can suppress plasticity (and thus hallucinations) when new knowledge acquisition is unnecessary. Experiments across the three hypotheses support interference as the main driver and indicate that self-distillation mitigates it.
Significance. If the central empirical claims hold after addressing controls, the work would provide a practical, low-overhead mitigation for a common failure mode in post-training and a mechanistic account grounded in continual-learning ideas. The self-distillation and selective-freezing approaches are straightforward to implement and could be adopted quickly; the interference hypothesis, if cleanly isolated, would also inform representation-level diagnostics for hallucination risk.
major comments (2)
- [Experiments on hypotheses] The section describing the three-hypothesis experiments: the attribution of SFT-induced hallucinations to localized semantic interference (rather than data quality, optimization dynamics, or other correlated factors) is load-bearing for the 'main driver' conclusion, yet the reported comparisons lack explicit controls such as matched data curation, fixed optimizer hyperparameters, or regression of interference metrics (e.g., embedding cosine or activation overlap) against those confounds. Without such ablations, the results remain compatible with general regularization effects. (A sketch of one such overlap metric follows this list.)
- [Self-distillation SFT method] The self-distillation method (output-distribution regularization): the claim that it specifically mitigates semantic interference would be strengthened by a direct comparison to other standard regularizers (e.g., label smoothing or KL to a frozen teacher without the self-distillation schedule) to show that the benefit is not explained by generic regularization alone.
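To make the first comment concrete, the sketch below shows one possible interference metric of the kind mentioned there: nearest-neighbour cosine overlap between hidden activations of new-fact prompts and of related pre-training-fact prompts, which could then be regressed against per-item hallucination increases. The paper's actual metric, layer choice, and prompt construction are assumptions here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def semantic_overlap(new_fact_acts: torch.Tensor,
                     old_fact_acts: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour cosine overlap between hidden states of new-fact
    prompts [N_new, d] and pre-training-fact prompts [N_old, d]."""
    new = F.normalize(new_fact_acts, dim=-1)
    old = F.normalize(old_fact_acts, dim=-1)
    sims = new @ old.T                      # [N_new, N_old] cosine matrix
    # For each new fact, how close is its closest pre-trained neighbour?
    return sims.max(dim=1).values           # per-item overlap scores
```

Per-item overlap scores of this sort are what the requested regressions would relate to hallucination changes while data curation and optimizer settings are held fixed.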
minor comments (2)
- [Abstract] The abstract and method sections should report effect sizes, confidence intervals, and the statistical tests used for the hallucination-rate differences; the current description leaves the magnitude and reliability of the improvements unclear.
- [Method] Notation for the output-distribution regularization term and the freezing criteria should be defined explicitly (e.g., which layers or parameter groups are frozen and under what condition) to allow exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the paper accordingly to strengthen the empirical support for our claims.
point-by-point responses
- Referee: [Experiments on hypotheses] The section describing the three-hypothesis experiments: the attribution of SFT-induced hallucinations to localized semantic interference (rather than data quality, optimization dynamics, or other correlated factors) is load-bearing for the 'main driver' conclusion, yet the reported comparisons lack explicit controls such as matched data curation, fixed optimizer hyperparameters, or regression of interference metrics (e.g., embedding cosine or activation overlap) against those confounds. Without such ablations, the results remain compatible with general regularization effects.
Authors: We agree that the current set of experiments would benefit from additional controls to more rigorously isolate localized semantic interference as the primary driver. In the revised manuscript, we will add ablations that enforce matched data curation across all conditions, fix optimizer hyperparameters explicitly, and include regression analyses correlating interference metrics (such as embedding cosine similarity and activation overlap) with hallucination rates while controlling for confounds. These additions will help distinguish the interference hypothesis from general regularization effects. revision: yes
- Referee: [Self-distillation SFT method] The self-distillation method (output-distribution regularization): the claim that it specifically mitigates semantic interference would be strengthened by a direct comparison to other standard regularizers (e.g., label smoothing or KL to a frozen teacher without the self-distillation schedule) to show that the benefit is not explained by generic regularization alone.
Authors: We thank the referee for highlighting this opportunity to demonstrate specificity. In the revision, we will include direct empirical comparisons of our self-distillation SFT method against label smoothing and against standard KL divergence to a frozen teacher (without the self-distillation schedule). These controls will clarify whether the reduction in hallucinations arises specifically from regularizing output-distribution drift in the context of new factual learning, as opposed to generic regularization. revision: yes
Circularity Check
No circularity: empirical claims rest on independent experimental comparisons
full rationale
The paper advances no mathematical derivation chain or first-principles predictions. Its central claims—that interference among overlapping semantic representations drives SFT hallucinations and that self-distillation mitigates it—are supported by direct experimental comparisons across three hypotheses (capacity limits, behavior cloning, localized interference). These rest on observable metrics and ablations rather than any quantity defined in terms of itself, any fitted parameter renamed as a prediction, or load-bearing self-citations. The proposed self-distillation method is an application of a standard continual-learning technique whose success is measured against external benchmarks (hallucination rates, factual retention) without reducing to the inputs by construction. The analysis is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Supervised fine-tuning on new factual information degrades performance on pre-training knowledge in LLMs
- domain assumption: Output-distribution drift during SFT is a controllable proxy for knowledge degradation (a measurement sketch follows)
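One hedged sketch of how that drift proxy could be measured: average per-token KL between the pre-SFT and post-SFT output distributions on held-out pre-training prompts. The averaging scheme and the direction of the KL are illustrative choices, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def output_drift(pre_sft_logits: torch.Tensor,
                 post_sft_logits: torch.Tensor) -> torch.Tensor:
    """Per-prompt KL(pre-SFT || post-SFT) on held-out pre-training prompts,
    averaged over token positions: one scalar drift score per item."""
    pre = F.log_softmax(pre_sft_logits, dim=-1)    # [N, T, V]
    post = F.log_softmax(post_sft_logits, dim=-1)  # [N, T, V]
    kl = F.kl_div(post, pre, log_target=True, reduction="none").sum(-1)  # [N, T]
    return kl.mean(dim=-1)                         # [N] drift per prompt
```

Keeping this quantity small during training (for example via the weight on the self-distillation penalty) is what makes the drift "controllable" in the sense of the axiom.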
Forward citations
Cited by 1 Pith paper
- Hallucinations Undermine Trust; Metacognition is a Way Forward
  LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
Reference graph
Works this paper leans on
- [1] Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning LLMs on new knowledge encourage hallucinations? URL https://arxiv.org/abs/2412.11965
- [2] Gao, D., Wang, H., Li, Y., et al. URL https://arxiv.org/abs/2405.05904
  Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-out: Hidden factual knowledge in LLMs, 2025. URL https://arxiv.org/abs/2503.15299
  Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, and Jonathan Herzig. Thinking to recall: How…
- [3] Distilling the Knowledge in a Neural Network. URL https://arxiv.org/abs/1503.02531
  Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025. doi:10.1145/3703155
- [4] How do language models learn facts? Dynamics, curricula and hallucinations. URL https://arxiv.org/abs/2503.21676
discussion (0)