
arxiv: 2605.31293 · v1 · pith:MWAZATBBnew · submitted 2026-05-29 · 💻 cs.CL

Divergence Decoding: Inference-Time Unlearning via Auxiliary Models

Pith reviewed 2026-06-28 22:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords machine unlearning, large language models, inference-time methods, auxiliary models, logit steering, model distillation, data privacy

The pith

Divergence Decoding steers LLM logits with small auxiliary models to remove specific memorized data at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Divergence Decoding as a method to unlearn sensitive training data from large language models without retraining the base model or suffering large utility drops. Small auxiliary models, trained with ordinary pre-training and fine-tuning, adjust the main model's output logits during generation to push probability mass away from targeted content. This approach is reported to beat existing unlearning methods on standard benchmarks across different model sizes and data scales. The resulting steered distribution can be distilled back into the original model, and the same technique shows signs of working for image models as well.

Core claim

Divergence Decoding steers the logits of an LLM using small auxiliary models during inference to avoid generating specific memorized data. These auxiliary models are trained straightforwardly via pre-training and fine-tuning. The method outperforms state-of-the-art unlearning baselines across model and dataset scales, and the steered distribution can be distilled back into the original model. The approach generalizes to non-text domains like images.

What carries the argument

Divergence Decoding, a logit-steering mechanism that uses auxiliary models to adjust the probability distribution away from targeted data during inference.
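The steering step can be illustrated with a minimal sketch. The exact combination rule is not stated on this page, so the linear rule below is an assumption, modeled on contrastive decoding methods such as DExperts (listed in the reference graph) and on the "Linear DD" and α labels in the figure captions. The names `steer_logits`, `aux_full`, and `aux_retain` are hypothetical; the toy logits are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer_logits(base_logits, aux_full_logits, aux_retain_logits, alpha=1.0):
    """ASSUMED linear rule: base + alpha * (aux_retain - aux_full).

    aux_full   : small model trained on data that includes the forget set
    aux_retain : small model trained without the forget set
    """
    if not (len(base_logits) == len(aux_full_logits) == len(aux_retain_logits)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [
        z_base + alpha * (z_retain - z_full)
        for z_base, z_full, z_retain in zip(
            base_logits, aux_full_logits, aux_retain_logits
        )
    ]

# Toy 4-token vocabulary; token 2 stands in for memorized forget-set content.
base = [1.0, 0.5, 3.0, 0.2]
aux_full = [0.8, 0.4, 2.5, 0.1]    # saw the forget set, favors token 2
aux_retain = [0.8, 0.4, 0.0, 0.1]  # never saw it, does not favor token 2

steered = steer_logits(base, aux_full, aux_retain, alpha=1.5)
p_before = softmax(base)
p_after = softmax(steered)
```

In this toy case the two auxiliary models agree on every token except token 2, so only that logit moves and its probability falls after steering. Whether the paper's rule behaves the same way on real vocabularies is not something this sketch can show.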

If this is right

  • The method outperforms state-of-the-art baselines on unlearning benchmarks across a variety of model and training dataset scales.
  • The steered distribution produced by the auxiliary models can be distilled back into the base model.
  • The technique applies to any probabilistic model and shows evidence of generalization to the domain of images.
  • It provides an effective and inexpensive solution to unlearning that avoids the utility loss common in prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because the auxiliary models are small, the overhead may remain low even when the base model grows much larger.
  • Distilling the steered behavior back into the base model could turn the inference-time fix into a permanent change.
  • The same auxiliary-model idea might combine with other unlearning approaches to handle harder or more complex queries.
  • Further tests on real-world copyright or privacy datasets would clarify whether the image-domain result extends to other modalities.
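The distillation idea in the lists above can be sketched as a loss between the steered distribution (teacher) and the base model (student). The page does not state the paper's distillation objective; a temperature-softened KL divergence, as in standard knowledge distillation, is assumed here. `distill_loss` and the toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=1.0):
    """ASSUMED objective: KL(teacher || student) on softened distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(
        ti * (math.log(ti) - math.log(si))
        for ti, si in zip(t, s)
        if ti > 0.0
    )

# Teacher: steered logits with the forget-set token (index 2) pushed down.
steered_teacher = [1.0, 0.5, -0.75, 0.2]
student_before = [1.0, 0.5, 3.0, 0.2]    # base model still favors token 2
student_after = [1.0, 0.5, -0.75, 0.2]   # student that matches the teacher

loss_before = distill_loss(steered_teacher, student_before, temperature=1.5)
loss_after = distill_loss(steered_teacher, student_after, temperature=1.5)
```

Minimizing such a loss over the base model's parameters would move the student toward the steered distribution; the loss is positive while the student still favors the forget-set token and zero once the two match.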

Load-bearing premise

Small auxiliary models trained with standard pre-training and fine-tuning setups can reliably steer the main model's logits away from specific data during inference without causing catastrophic utility loss.

What would settle it

A benchmark run in which the auxiliary models are applied yet the LLM still assigns high probability to the target sensitive sequences while utility on unrelated tasks drops sharply.
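One way to make this test concrete is a small check over benchmark outputs. The sketch below is illustrative only: the function names, the thresholds, and the scalar utility score are assumptions, not quantities defined by the paper.

```python
import math

def sequence_log_prob(step_probs, token_ids):
    """Sum of log-probabilities assigned to each token of a target sequence.

    step_probs[t] is the model's next-token distribution at step t.
    """
    return sum(math.log(probs[tok]) for probs, tok in zip(step_probs, token_ids))

def refutes_claim(target_logprob_steered, utility_base, utility_steered,
                  logprob_threshold, max_utility_drop):
    """True when the steered model still assigns high probability to the
    target sequence AND utility dropped by more than the allowed margin."""
    still_memorized = target_logprob_steered >= logprob_threshold
    utility_collapsed = (utility_base - utility_steered) > max_utility_drop
    return still_memorized and utility_collapsed

# Toy example: a two-token target sequence under the steered model.
lp = sequence_log_prob([[0.1, 0.9], [0.2, 0.8]], [1, 1])

# Hypothetical thresholds chosen only for illustration.
bad_run = refutes_claim(lp, 0.70, 0.40,
                        logprob_threshold=math.log(0.5), max_utility_drop=0.1)
ok_run = refutes_claim(lp, 0.70, 0.69,
                       logprob_threshold=math.log(0.5), max_utility_drop=0.1)
```

Both conditions must hold at once; a run that keeps utility but fails to forget, or forgets but loses utility, would point to a different failure mode than the one described above.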

Figures

Figures reproduced from arXiv: 2605.31293 by Bradford Levy, Humzah Merchant.

Figure 1
Divergence Decoding achieves near-perfect performance on TOFU. 99% CI are provided. Full results in …
Figure 2
MUSE Results. Closer to retrain is better. 99% CIs are smaller than marker sizes. Detailed results split by inference-time and gradient-based methods are available in Figures 14 and 15.
Excerpt from surrounding text: … shifts, making the contrast between them reflect dataset size and fine-tuning dynamics rather than the effect of removing the forget set. To assess the trade-off between utility and forgetting across hyper-parameter choices…
Figure 3
Effect of hyper-parameter and algorithm choice. 99% CI are provided.
Excerpt from surrounding text: … distinguishable from a full retrain, striking a clean balance between over- and under-unlearning. The fact that the optimal region occurs around α > 1 aligns with the intuition from §3.1 that a simple linear combination of logits may be a near-optimal solution. Over-unlearning, even when utility is preserved, is not always optimal. In se…
Figure 4
Analysis of model scaling on MUSE and TOFU. 99% CI are provided.
[Plot residue: panels of Privacy Leakage and Privacy Score against Alpha (0.0 to 3.0) and Top-k (10^0 to 10^3, log scale); series: Linear DD, Rank DD, Retrain.]
Figure 5
Analysis of Over- or Under-Unlearning on MUSE (left) and TOFU (right). Closer to retrain is better. The optimal values for both benchmarks are Alpha∼1.5 and TopK∼20. 99% CI are provided.
Excerpt from surrounding text: … built once and then applied at inference with a single vectorized scatter, so the additional overhead is negligible. In addition, we are careful when handling system prompts and special tokens. As shown in …
Figure 6
DD distillation L2 norm of gradients at each layer. 99% CI are provided.
[Heatmap residue: scores over Learning Rate (1e-5 to 6e-5) and Temperature (0.5 to 2.0); extracted cell values range from 0.467 to 0.727.]
Figure 7
Aggregate scores for distilling setups on TOFU. The majority of hyper-parameter choices (green) outperform the SOTA. Epochs are fixed to 10 and α = 1.5 for the DD 2n/N. In …
Figure 8
Empirical increases in compute requirements for a sample of more than 1,200 models. Size of P ranges from 300M to 80B.
Figure 9
ImageNet examples, baseline generations, and generations under various divergence decoding setups.
Excerpt from surrounding text: As a second evaluation of the efficacy of our method, we evaluate the perceptual quality of generated images. Notably, a naive unlearning method could simply output noise for classes in the forget set. While this would constitute “unlearning,” it may not be particularly useful if the desired outcome is perce…
Figure 10
Effect of model and vocabulary size on runtime for two generations of accelerators.
Excerpt from surrounding text: B.3. Cost of Steering vs Distilling. We compare the computational cost of using Divergence Decoding (DD) directly at inference time versus using DD to distill a single unlearned model. Let d_retain and d_forget be the dataset sizes (in tokens), N and n be the parameters of the large and small models, and e_N and e_n be the numbe…
Figure 12
The left column is sustainability (consecutive forget sets of the same size) and the right column is scaling (increasingly large forget sets). We consider Euclidean distance to the method’s baseline performance when evaluated on the retain set and the original forget set, with the increasing distance capturing both utility loss and loss of forgetting. In general, all methods except for GradDiff perform r…
Figure 11
All hyper-parameter and model size configurations. Increasing values are darker and usually to the bottom and left.
Figure 13
Epochs are fixed at 5. α = 0.85, the average of the optimal for Knowmem (0.8) and Verbmem (0.9). The surface is less smooth than TOFU.
[Plot residue: axes Verbatim Memorization of Forget Set, Utility on Retain Set, Q&A Knowledge of Forget Set; legend: Target, Retrain, Linear DD, Rank DD, -Unlearning, WHP, GUARD, ECO.]
Figure 14
MUSE results for inference-time methods. Closer to Retrain is better.
[Plot residue: axes Verbatim Memorization of Forget Set, Utility on Retain Set, Q&A Knowledge of Forget Set; legend: Target, Retrain, Distill DD, GradDiff, NPO, SimNPO, UNDIAL, LUNAR.]
Figure 15
MUSE results for gradient-based methods. Closer to Retrain is better.
Figure 16
MUSE results for cross tokenizer. Closer to Retrain is better.
Excerpt from surrounding text: C.2. TOFU
  Role            Model
  p               LLaMA-3.2-1B-IT (full)
  P               LLaMA-3.1-8B-IT (full)
  q               LLaMA-3.2-1B-IT (retain90)
  Benchmark (Q)   LLaMA-3.1-8B-IT (retain90)
Figure 17
Results for Leak@K. We use the code adapted from (Rybak et al., 2026).
Excerpt from surrounding text: D. Application to Image Generation. In this section we detail the experimental setup used to assess the quality of generated images and additionally present (i) distributional statistics of image quality generated using our divergence decoding setup and (ii) a random sample of generated images for qualitative analysis. D.1. Experimental …
Figure 18
Random sample of image generations for classes in the forget set.
Figure 19
Random sample of image generations for classes in the retain set.
Original abstract

Large Language Models (LLMs) frequently memorize sensitive training data, thereby creating significant privacy and copyright risks. Addressing these risks, i.e., removing such knowledge from an existing model checkpoint, has proven challenging, as many unlearning methods lead to catastrophic utility loss or are ineffective for complex queries. We introduce Divergence Decoding (DD), a mechanism that uses small auxiliary models to steer the logits of the LLM away from specific data during inference. Training these models is straightforward, i.e., we use standard pre-training and fine-tuning setups. We find the method decisively outperforms state-of-the-art (SOTA) baselines on unlearning benchmarks across a variety of model and training dataset scales, consistent with DD being an effective and inexpensive solution to unlearning. We then demonstrate that this steered distribution can be trivially distilled back into the base model. Since the method is generally applicable to any probabilistic model, we explore its efficacy outside of text generation and find evidence of generalization to the domain of images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Divergence Decoding (DD), an inference-time unlearning method for LLMs that uses small auxiliary models trained with standard pre-training and fine-tuning to steer the main model's logits away from specific sensitive data. The authors claim that this method decisively outperforms SOTA baselines on unlearning benchmarks across various model and dataset scales, that the steered distribution can be distilled back into the base model, and that it generalizes to image generation tasks.

Significance. If the empirical results hold with proper controls, DD would provide a practical inference-time unlearning approach that is inexpensive and avoids catastrophic utility loss, with potential extension to other probabilistic generative models.

major comments (2)
  1. [Abstract] Abstract: the claim of 'decisive outperformance' and 'generalization' is asserted without any quantitative metrics, error bars, dataset details, or failure cases, which is load-bearing for the central empirical claim of effectiveness across scales.
  2. [Method] Method (Divergence Decoding mechanism): the precise logit-combination rule (e.g., subtraction or re-weighting) between auxiliary and base models is not specified, leaving open whether correlations on non-target data cause unintended suppression and thus undermining the no-utility-loss premise.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'consistent with DD being an effective and inexpensive solution' is conclusory and should be supported by specific evidence or rephrased.
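The second major comment, on suppression of non-target tokens, can be made concrete with a short derivation. The page does not give the paper's combination rule, so a linear rule is assumed here, with P the base model, p a small auxiliary model trained with the forget set, and q a small auxiliary model trained without it (roles as listed in the Figure 16 caption). The symbols \(\tilde z\), \(\tilde\pi\), and \(\Delta\) are introduced only for this illustration.

```latex
% Assumed linear steering rule over vocabulary index i
\tilde z_i = z^{P}_i + \alpha\,\Delta_i,
\qquad \Delta_i = z^{q}_i - z^{p}_i,
\qquad \tilde\pi_i = \frac{\exp(\tilde z_i)}{\sum_k \exp(\tilde z_k)}

% Log-odds between any two tokens i and j after steering
\log\frac{\tilde\pi_i}{\tilde\pi_j}
  = \bigl(z^{P}_i - z^{P}_j\bigr) + \alpha\,\bigl(\Delta_i - \Delta_j\bigr)
```

Under this assumed rule, the relative odds of two tokens are unchanged whenever the auxiliary models differ by the same amount on both, that is, when \(\Delta_i = \Delta_j\). Unintended suppression would therefore come from non-target tokens where p and q disagree for reasons unrelated to the forget set, which is the case the referee asks the authors to analyze.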

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and support for the central claims.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'decisive outperformance' and 'generalization' is asserted without any quantitative metrics, error bars, dataset details, or failure cases, which is load-bearing for the central empirical claim of effectiveness across scales.

    Authors: We agree that the abstract would benefit from quantitative support. The main text reports specific metrics, error bars, and dataset details across scales, along with some discussion of limitations. In revision we will update the abstract to include representative quantitative results, error bars, dataset scales, and a brief reference to observed failure cases. revision: yes

  2. Referee: [Method] Method (Divergence Decoding mechanism): the precise logit-combination rule (e.g., subtraction or re-weighting) between auxiliary and base models is not specified, leaving open whether correlations on non-target data cause unintended suppression and thus undermining the no-utility-loss premise.

    Authors: We accept that the exact logit-combination rule is insufficiently specified. The manuscript will be revised to state the precise combination formula (including any subtraction or re-weighting) and to analyze its effect on non-target tokens, thereby clarifying the utility-preservation argument. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method stands on independent benchmarks

full rationale

The paper describes an inference-time steering technique using separately trained auxiliary models. No equations, derivations, or parameter-fitting steps are shown that would reduce the claimed outperformance to a self-definition, a fitted input renamed as prediction, or a self-citation chain. The central result (decisive benchmark gains across scales) is presented as an empirical observation rather than a mathematical consequence of the method's own construction. The auxiliary models are trained with standard pre-training/fine-tuning, and the steering operation is described at a high level without any load-bearing uniqueness theorem or ansatz imported from prior self-work. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that standard training of auxiliary models suffices for effective logit steering and that the steered distribution preserves utility. No free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: Standard pre-training and fine-tuning setups suffice to train effective auxiliary models for logit steering.
    The abstract describes training these models as straightforward.
invented entities (1)
  • Divergence Decoding mechanism (no independent evidence)
    purpose: Steer LLM logits away from specific data at inference using auxiliary models.
    New technique introduced by the paper; no independent evidence is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5690 in / 1207 out tokens · 23161 ms · 2026-06-28T22:51:43.703583+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Forecasting With LLMs: Improved Generalization Through Feature Steering

    cs.CL · 2026-06 · unverdicted · novelty 4.0

    Amplifying time-awareness features in LLMs via sparse autoencoders reduces look-ahead bias in forecasting while preserving general performance.

Reference graph

Works this paper leans on

13 extracted references · 4 canonical work pages · cited by 1 Pith paper

  1. [1]

    URL https://aclanthology.org/D07-1090/

    Association for Computational Linguistics. URL https://aclanthology.org/D07-1090/. Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T. B., Song, D. X., Erlingsson, Ú., Oprea, A., and Raffel, C. Extracting training data from large language models. In USENIX Security Symposium,

  2. [2]

    org/CorpusID:229156229

    URL https://api.semanticscholar.org/CorpusID:229156229. Chen, D., Chen, R., Zhang, S., Wang, Y., Liu, Y., Zhou, H., Zhang, Q., Wan, Y., Zhou, P., and Sun, L. Mllm-as-a-judge: assessing multimodal llm-as-a-judge with vision-language benchmark. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. DeepSeek-A...

  3. [3]

    ISBN 979-8-89176-189-6

    Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long

  4. [4]

    naacl-long.444/

    URL https://aclanthology.org/2025.naacl-long.444/. Dorna, V., Mekala, A., Zhao, W., McCallum, A., Kolter, Z., Lipton, Z., and Maini, P. Openunlearning: Accelerating llm unlearning via unified benchmarking of methods and metrics. In Belgrave, D., Zhang, C., Lin, H., Pascanu, R., Koniusz, P., Ghassemi, M., and Chen, N. (eds.), Advances in Neural Informatio...

  5. [5]

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., and the rest of the Llama 3 team

    URL https://proceedings.mlr.press/v235/ghosh24a.html. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., and the rest of the Llama 3 team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783. Hammersley, J. and Handscomb, D. Monte Carlo Methods. Methuen’s monographs on applied probability and statistics. ...

  6. [6]

    Huang, J

    URL https://openreview.net/forum?id=rygGQyrFvH. Huang, J. Y., Zhou, W., Wang, F., Morstatter, F., Zhang, S., Poon, H., and Chen, M. Offset unlearning for large language models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=A4RLpHPXCu. Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirz...

  7. [7]

    ISBN 978-3-031-72672-9

    Springer-Verlag. ISBN 978-3-031-72672-9. doi: 10.1007/978-3-031-72673-6_20. URL https://doi.org/10.1007/978-3-031-72673-6_20. Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Zong, C., Xia, F., Li, W., and Navigli, R. ...

  8. [8]

    cc/paper_files/paper/2022/file/ 6f1d43d5a82a37e89b0665b33bf3a182-Paper-Conference

    URL https://proceedings.neurips.cc/paper_files/paper/2022/file/6f1d43d5a82a37e89b0665b33bf3a182-Paper-Conference.pdf. Merchant, H. and Levy, B. A fast and effective solution to the problem of look-ahead bias in LLMs. In NeurIPS 2025 Workshop: Generative AI in Finance,

  9. [9]

    Mirzadeh, S.-I., Farajtabar, M., Li, A., Levine, N., Mat- sukawa, A., and Ghasemzadeh, H

    URL https://openreview.net/forum?id=zYsLIPgM28. Mirzadeh, S.-I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., and Ghasemzadeh, H. Improved knowledge distillation via teacher assistant, 2019. URL https://arxiv.org/abs/1902.03393. Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety alignment should be mad...

  10. [10]

    Reisizadeh, H., Ruan, J., Chen, Y ., Pal, S., Liu, S., and Hong, M

    URL https://openreview.net/forum?id=6Mxhg9PtDE. Reisizadeh, H., Ruan, J., Chen, Y., Pal, S., Liu, S., and Hong, M. Leak@k: Unlearning does not make llms forget under probabilistic decoding. CoRR, abs/2511.04934, November 2025. URL https://doi.org/10.48550/arXiv.2511.04934. Rybak, P., Batorski, P., Swoboda, P., and Spurek, P. Rebel: Hidden knowledge re...

  11. [11]

    Shi, W., Lee, J., Huang, Y ., Malladi, S., Zhao, J., Holtzman, A., Liu, D., Zettlemoyer, L., Smith, N

    URL https://openreview.net/forum?id=zWqr3MQuNs. Shi, W., Lee, J., Huang, Y., Malladi, S., Zhao, J., Holtzman, A., Liu, D., Zettlemoyer, L., Smith, N. A., and Zhang, C. MUSE: Machine unlearning six-way evaluation for language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TArmA0...

  12. [12]

    Suriyakumar, V

    URL https://proceedings.mlr.press/v267/springer25a.html. Suriyakumar, V. M., Sekhari, A., and Wilson, A. Ucd: Unlearning in llms via contrastive decoding, 2025. URL https://arxiv.org/abs/2506.12097. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Fer...

  13. [13]

    Zhong, Y ., Yang, Z., and Zhu, Z

    URL https://openreview.net/forum?id=MXLBXjQkmb. Zhong, Y., Yang, Z., and Zhu, Z. DUET: Distilled LLM unlearning from an efficiently contextualized teacher. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Xa6QRrXrKX. ...