
arxiv: 2606.25059 · v1 · pith:EQKC263Gnew · submitted 2026-06-23 · 💻 cs.CR · cs.AI

What Does It Mean to Break a Distillation Defense?

Pith reviewed 2026-06-25 22:49 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords distillation attacks, threat models, output perturbation, black-box LLMs, API security, model extraction, antidistillation sampling, intellectual property protection

The pith

The effectiveness of output perturbation defenses against LLM distillation depends on the attacker's threat model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that defenses which perturb model outputs to hinder distillation attacks lack a common way to describe attacker power, so their reported strength cannot be compared or trusted for real use. It introduces a framework that places any attacker on three axes: how many queries they can make, how much external data they hold, and how they interact with the API. When the authors apply this lens to antidistillation sampling, they find the same defense looks strong or weak depending on which point on those axes is assumed. A reader should care because these defenses are already proposed for protecting model IP or meeting regulatory demands, yet an underspecified threat model can produce a misleading sense of protection. The authors therefore conclude that every future defense paper, and any policy built on it, must name and test the three dimensions explicitly.

Core claim

The central claim is that whether a defense such as antidistillation sampling counts as effective against distillation attacks on black-box LLMs is not an intrinsic property of the defense but instead depends on the concrete threat model used for evaluation. The authors formalize that threat model through three dimensions—query budget, data budget, and interface profile—and show that changing any of them can reverse the apparent success or failure of the defense. They further argue that without explicit specification and stress-testing along these dimensions, comparisons across defenses are unreliable, compositions with other attacks cannot be reasoned about, and deployments for intellectual

What carries the argument

Three-dimensional threat model (query budget, data budget, interface profile) that classifies attacker capabilities for evaluating output-perturbation defenses.
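
As a rough illustration, one point in this threat model space could be recorded as a small data structure. The sketch below is not the paper's code: the class and field names (`ThreatModel`, `InterfaceProfile`, `prefill`, `logprobs`) are hypothetical, and the two interface components are simply the examples named in the Figure 1 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceProfile:
    """Hypothetical subset of interface components (examples from Figure 1)."""
    prefill: bool = False   # attacker may supply the start of the response
    logprobs: bool = False  # API exposes token log-probabilities


@dataclass(frozen=True)
class ThreatModel:
    """One point on the three axes; field names are illustrative assumptions."""
    query_budget: int  # Q: number of teacher API queries allowed
    data_budget: int   # D: attacker-held data, counted here as input prompts
    interface: InterfaceProfile = InterfaceProfile()

    def describe(self) -> str:
        # List the enabled interface components, or fall back to "black-box".
        enabled = [name for name, on in vars(self.interface).items() if on]
        access = "+".join(enabled) if enabled else "black-box"
        return f"Q={self.query_budget}, D={self.data_budget}, interface={access}"
```

With this sketch, `ThreatModel(5231, 5231).describe()` gives `"Q=5231, D=5231, interface=black-box"`, which matches the operating point quoted in the Figure 3 caption (Q = D = 5231; black-box access).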

If this is right

  • Any claim that a distillation defense is robust must be accompanied by results across multiple settings of query budget, data budget, and interface profile.
  • Compositions of defenses or attacks can only be analyzed once both sides share the same three-dimensional threat model.
  • Regulatory or governance frameworks that rely on these defenses must require explicit statements of the attacker capabilities they assume.
  • New defense designs should include experiments that vary each dimension independently rather than reporting a single operating point.
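
Varying each dimension independently, as the last point suggests, can be sketched as a one-at-a-time sweep around a baseline. This is a minimal sketch under stated assumptions: the function names are hypothetical, a threat model is reduced to a plain `(query_budget, data_budget, interface_name)` tuple, and `evaluate` stands in for whatever attack pipeline returns student accuracy at a given point.

```python
from typing import Callable, Dict, List, Tuple

# (query_budget, data_budget, interface_name); an illustrative encoding only.
Point = Tuple[int, int, str]


def one_at_a_time_sweep(
    baseline: Point,
    query_budgets: List[int],
    data_budgets: List[int],
    interfaces: List[str],
) -> List[Point]:
    """Return the baseline plus every point that changes exactly one dimension."""
    q0, d0, i0 = baseline
    points = [baseline]
    points += [(q, d0, i0) for q in query_budgets if q != q0]
    points += [(q0, d, i0) for d in data_budgets if d != d0]
    points += [(q0, d0, i) for i in interfaces if i != i0]
    return points


def run_sweep(
    points: List[Point],
    evaluate: Callable[[Point], float],
) -> Dict[Point, float]:
    """Apply a caller-supplied evaluation (a placeholder here) to every point."""
    return {point: evaluate(point) for point in points}
```

A full grid over all three axes (for example via `itertools.product`) would also capture interactions between dimensions, at the cost of many more evaluations.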

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three dimensions could be used to evaluate other API-based attacks such as model inversion or membership inference.
  • Defenders might eventually build adaptive perturbations that respond differently depending on detected levels of the three attacker dimensions.
  • Empirical work could test whether real-world distillation attempts cluster at particular combinations of the three axes rather than spanning them uniformly.

Load-bearing premise

The three dimensions of query budget, data budget, and interface profile are enough to capture all attacker capabilities that matter for breaking distillation defenses.

What would settle it

A concrete counter-example would be an output-perturbation defense whose reduction in student-model accuracy remains large and stable no matter how the three dimensions are varied from low to high.
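
One way to make "large and stable" checkable is to compare per-threat-model accuracy reductions against explicit thresholds. The sketch below is an assumption, not a criterion taken from the paper: the function name and both thresholds (`min_reduction`, `max_spread`) are hypothetical and would have to be chosen by the evaluator.

```python
from typing import Dict, Hashable


def defense_is_stable(
    undefended_acc: Dict[Hashable, float],
    defended_acc: Dict[Hashable, float],
    min_reduction: float,
    max_spread: float,
) -> bool:
    """Check that the student-accuracy reduction is large at every threat model
    and varies little across them. Both dicts are keyed by threat model."""
    reductions = [
        undefended_acc[tm] - defended_acc[tm] for tm in undefended_acc
    ]
    large = min(reductions) >= min_reduction
    stable = (max(reductions) - min(reductions)) <= max_spread
    return large and stable
```

A defense that passes this check across low-to-high settings of all three dimensions would be a candidate counter-example; one that fails at any single point would not.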

Figures

Figures reproduced from arXiv: 2606.25059 by Daniel Paleka, Florian Tramèr, Lena Libon, Michael Aerni, Pura Peetathawatchai.

Figure 1: Our proposed threat model space for evaluating distillation defenses. A complete threat model specifies a query budget, a data budget (left), and an interface profile (right). The interface profile is a collection of interface components spanning the input side (e.g., prefill), provider-side processing (e.g., output filtering, reasoning-trace summarization), and the output side (e.g., logprobs). …
Figure 2: Distillation attack pipeline and attacker budget model. The attacker queries the teacher API and collects responses to build a training dataset. Only teacher API queries and input prompts count toward the attacker’s budget (red); all local computation, including filtering, post-processing, and student training, is free (green).
Figure 3: Simple post-processing already significantly recovers student accuracy. Student accuracy on GSM8K under max-k resampling (k ∈ {1, 2, 3}) and no resampling, all with repetition deletion applied, within ADS’s own threat model (Q = D = 5231; black-box access). All strategies significantly outperform the originally reported ADS baseline, and differences between resampling strategies are small.
Figure 4: Increasing the query budget largely neutralizes ADS. Student accuracy on GSM8K under max-2 resampling with repetition deletion, for query budgets Q ∈ {D, 1.5D, 2D, 3D, 4D}, under black-box access. The matched-utility budget curve is interpolated from the existing discrete points rather than measured directly. For the three smallest λ values, Q ≥ 2D almost recovers the student accuracy achieved under temp…
Figure 6: Under Antidistillation Sampling (ADS), small reductions in teacher accuracy lead to disproportionately large student performance degradation. On GSM8K (Answer Forcing evaluation), in the high-utility regime (teacher accuracy ≥ 0.7), an approximately 20% teacher accuracy reduction corresponds to a 40% student accuracy degradation, whereas temperature sampling produces negligible reduction in student accura…
Figure 7: Comparison of post-processing strategies against ADS. Student accuracy on GSM8K across the full teacher accuracy range. The threat model here is Q = D = 5231, black-box access, and each prompt is queried exactly once. Post-processing strategies considerably weaken the effect of ADS, particularly in the high-utility regime.
Figure 8: Different resampling strategies yield comparable student accuracy. Student accuracy on GSM8K under black-box max-k resampling (k ∈ {1, 2, 3}) and no resampling, with repetition deletion applied in all cases, across the full λ range (Q = D = 5231). Differences between resampling strategies remain small across all λ values.
Figure 9: Higher k concentrates more budget on initially incorrect prompts, at the cost of reduced prompt coverage. Distribution of query budget allocation across prompts under max-k resampling for k ∈ {1, 2, 3}. Bars indicate how many prompts receive 1 to k + 1 generations.
Figure 10: Increasing query budget improves student accuracy, with smaller gains under stronger perturbation. Student accuracy under max-2 resampling with repetition deletion, for varying query budgets Q ∈ {D, 1.5D, 2D, 3D, 4D} across the full λ range under black-box access. The matched-utility budget is shown only for λ values where at least 90% of prompts are solved within 16 sampling attempts. At high λ, differen…
Figure 11: Max-1 resampling yields student accuracy nearly identical to max-2. Student accuracy under max-1 resampling with repetition deletion, for varying query budgets Q ∈ {D, 1.5D, 2D, 3D, 4D}, across the full λ range.
Figure 12: caption not available.
Figure 13: Prefill access yields consistently higher student accuracy than black-box access (…
Figure 14: Effect of teacher entropy and prefill access on ADS’s ability to change the argmax token. When the next-token distribution is uncertain (high entropy), ADS can more easily redirect generation toward student-harming tokens. Conditioning on a single prefix token reduces entropy at generated positions, limiting ADS’s effective perturbation rate.
Figure 15: Most incorrect traces are corrected within two resampling attempts. Fraction of incorrect traces remaining across successive resampling attempts, for standard resampling and resampling with prefix “First”, at three representative λ values. Both strategies exhibit exponential-like decay, with most of the reduction occurring within the first two attempts. Prefill access accelerates convergence at high λ val…
Figure 16: Expected queries to a correct answer grow monotonically with perturbation strength. Expected number of queries to a correct answer as a function of λ for ADS (standard black-box resampling and prefill resampling with prefix “First”), and as a function of τ for temperature sampling. Error bars denote ±1 standard deviation. As λ increases, the expected number of queries grows monotonically for all three met…
Figure 17: Increasing query budget recovers more correct traces, with the largest gains at moderate λ. Fraction of correct traces retained after max-2 resampling, comparing black-box and prefill access across query budgets Q ∈ {D, 1.5D, 2D, 3D, 4D} and λ values. The effect of increasing Q is biggest at moderate λ values, where ADS’s perturbation is neither too weak nor too strong for resampling to make a meaningful …
Figure 18: Resampling has negligible effect under temperature sampling. Student accuracy under max-2 resampling with repetition deletion for varying query budgets Q ∈ {D, 1.5D, 2D, 3D, 4D} under temperature sampling. Unlike ADS, the matched-utility curve is nearly identical to the standard temperature-sampling curve.
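
Several captions above refer to max-k resampling under a query budget Q. A minimal sketch of one plausible reading follows: each prompt gets one initial teacher query plus up to k resamples while the answer is incorrect, and every query counts against Q. The function and parameter names are hypothetical, prompts are processed in order (an assumed allocation rule), and the final trace is returned with a correctness flag because the captions do not say whether still-incorrect traces are kept.

```python
from typing import Callable, Dict, List, Tuple


def max_k_resample(
    prompts: List[str],
    query_teacher: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    k: int,
    query_budget: int,
) -> Tuple[List[Tuple[str, str, bool]], Dict[str, int]]:
    """Return (prompt, last_trace, correct) records and per-prompt query counts."""
    records: List[Tuple[str, str, bool]] = []
    generations: Dict[str, int] = {}
    used = 0
    for prompt in prompts:
        trace, ok = None, False
        # One initial query plus at most k resamples per prompt.
        for _ in range(k + 1):
            if used >= query_budget:
                break
            used += 1
            generations[prompt] = generations.get(prompt, 0) + 1
            trace = query_teacher(prompt)
            ok = is_correct(prompt, trace)
            if ok:
                break
        if trace is None:
            break  # budget exhausted before this prompt could be queried
        records.append((prompt, trace, ok))
    return records, generations
```

In this sketch each prompt receives between 1 and k + 1 generations, and a tight budget leaves later prompts unqueried, which mirrors the trade-off between resampling depth and prompt coverage described in the Figure 9 caption.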
Abstract

Black-box LLMs (accessible only via API) are vulnerable to distillation attacks, in which an attacker queries the model and trains a student on its outputs. A recent line of work proposes output perturbation defenses that modify the teacher's output to reduce student performance while preserving utility for legitimate users. As a relatively new family of approaches, output perturbation defenses lack a shared threat model, making it difficult to compare them, reason about composing them with other attacks, or evaluate their robustness against realistic adversaries. This underspecification matters beyond technical evaluation: when defenses are deployed to protect intellectual property or justify regulatory compliance, an imprecise threat model can create a false sense of security. We propose a threat model framework that describes attackers along three dimensions: a query budget, a data budget, and an interface profile that captures how attackers interact with the API. Using antidistillation sampling as a case study, we show that whether the defense is considered effective depends on the assumed threat model. We argue that future work on distillation defenses, along with any governance or policy frameworks built around them, should explicitly specify and stress-test attacker capabilities along our three dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that output perturbation defenses against distillation attacks on black-box LLMs lack a shared threat model, which hinders comparison, composition with other attacks, and realistic robustness evaluation. It proposes a framework characterizing attackers along three dimensions (query budget, data budget, and interface profile) and uses antidistillation sampling as a case study to show that defense effectiveness is threat-model dependent. The authors conclude that future work, as well as any governance frameworks, should explicitly specify and stress-test attacker capabilities along these dimensions.

Significance. If the central claim holds, the work is significant for standardizing evaluations of a relatively new class of defenses, reducing the risk of false security claims in IP protection or regulatory settings. The framework is presented independently without circularity or fitted parameters, and the case study provides a concrete illustration of the dependency. This addresses a standard but important point in security research about threat-model underspecification.

minor comments (2)
  1. [Abstract] The abstract states that the three dimensions help address underspecification but does not include even a one-sentence gloss on 'interface profile'; adding this would improve self-contained readability without lengthening the abstract.
  2. [§3] §3 (framework presentation) would benefit from an explicit statement of whether the three dimensions are intended as a minimal sufficient set or as an initial proposal open to extension; the current wording leaves this implicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The report correctly summarizes our central claim regarding the underspecification of threat models in output perturbation defenses and the proposed three-dimensional framework. No specific major comments were raised.

Circularity Check

0 steps flagged

No significant circularity; framework is self-contained

full rationale

The paper proposes a conceptual three-dimensional threat model (query budget, data budget, interface profile) and illustrates its implications via a case study on antidistillation sampling. No equations, fitted parameters, or derivations appear in the provided text. The central claim—that defense effectiveness is threat-model dependent—is presented as a direct consequence of varying the dimensions, without reduction to self-definitions, self-citations, or renamed empirical patterns. No load-bearing steps match any enumerated circularity pattern, consistent with the reader's independent assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the three dimensions adequately capture attacker capabilities and that the case study generalizes to other defenses.

axioms (1)
  • domain assumption: Attackers in distillation attacks can be meaningfully characterized along query budget, data budget, and interface profile.
    This assumption underpins the entire framework and the claim that effectiveness depends on the threat model.
invented entities (1)
  • Interface profile (no independent evidence)
    purpose: Captures how attackers interact with the API in ways not covered by query or data budgets.
    New dimension introduced by the paper as part of the framework.

pith-pipeline@v0.9.1-grok · 5738 in / 1261 out tokens · 25286 ms · 2026-06-25T22:49:38.212928+00:00 · methodology

