Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Pith reviewed 2026-07-02 12:26 UTC · model grok-4.3
The pith
Distilling the logit shift between a suspected model and its base into a KV-cache prefix adapter amplifies stealth biases into detectable generated text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Distill to Detect distills the distributional shift between a suspected model and its unmodified base into a cartridge, defined as a KV-cache prefix adapter. This step concentrates the dominant divergence and amplifies the bias signal so that it appears in generated text, enabling detection without advance knowledge of the bias topic. The method is backed by a theoretical framework that accounts for its effectiveness through Fisher-weighted projection of the logit distribution shift, and empirical results show reliable detection across multiple bias types.
What carries the argument
The cartridge, a KV-cache prefix adapter that distills the logit distribution shift between suspected and base models to concentrate and amplify the bias signal into text generation.
If this is right
- Stealth biases become detectable across multiple types even when the defender has no information about the bias topic.
- The capacity limit of prefix-tuning adapters can be repurposed to concentrate distributional shifts rather than serve only as an efficiency tool.
- The Fisher-weighted projection account explains why distillation surfaces the bias signal in generated text.
- Auditing of deployed models gains a practical step that operates on logit distributions alone.
Where Pith is reading between the lines
- If the method generalizes, organizations could insert D2D checks into model release pipelines before deployment in decision-making systems.
- The same distillation step might be inverted to test whether a given bias can be removed by subtracting the cartridge rather than adding it.
- Models fine-tuned on narrow preference data could be compared against their bases at scale without enumerating possible bias topics in advance.
Load-bearing premise
The bias signal resides entirely in the soft logit distribution between the suspected model and base model and can be concentrated and amplified into detectable text by distilling it into a KV-cache prefix adapter without prior knowledge of the bias topic.
What would settle it
Apply D2D to a model with a known stealth bias on a specific topic and measure whether the rate of biased text outputs after cartridge distillation remains statistically indistinguishable from the rate produced by the unmodified base model.
Figures
read the original abstract
Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection. However, the defender faces a fundamental asymmetry: without knowing the bias topic, no detection method can reliably surface a stealth preferential bias, regardless of whether it examines generated text, internal representations, or model weights. Here we introduce Distill to Detect (D2D), a method that surfaces hidden biases by distilling the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), concentrating the dominant divergence and amplifying the bias signal into generated text. We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types. We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations. By turning the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical building block for auditing hidden behaviors in deployed language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Distill to Detect (D2D), which distills the logit distributional shift between a suspected stealth-biased LLM and its base model into a KV-cache prefix adapter (cartridge). This concentrates the dominant divergence, amplifying the hidden bias signal into detectable generated text without prior knowledge of the bias topic. The work also proposes a theoretical framework explaining D2D's efficacy via Fisher-weighted projection of the logit shift, supported by empirical observations, and demonstrates reliable detection across multiple bias types.
Significance. If the empirical results and theoretical account hold, D2D would supply a practical auditing primitive for deployed models that exploits the capacity bottleneck of prefix adapters to surface otherwise invisible preferential biases. The approach builds directly on prior logit-level observations and turns an existing limitation into a detection feature; the Fisher-weighted projection framing offers a plausible mechanistic account that could generalize beyond the reported bias types.
major comments (2)
- [Abstract] Abstract: the central empirical claim ('D2D successfully amplifies the hidden biases ... to the extent that they can be reliably detected across multiple bias types') is stated without any metrics, datasets, bias types, or experimental protocol. Because this claim is load-bearing for the paper's contribution, the absence of supporting numbers or setup prevents assessment of whether the data actually substantiate reliable detection.
- [Abstract] Abstract: the theoretical framework is described only at the level of 'Fisher-weighted projection of the logit distribution shift' with no equations, definitions of the Fisher information matrix, or derivation showing how the projection concentrates the bias signal. Without these, it is impossible to verify internal consistency or the claimed explanatory power.
Simulated Author's Rebuttal
We thank the referee for their comments. We address the two major points on the abstract below and will revise the manuscript to incorporate more concrete details where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical claim ('D2D successfully amplifies the hidden biases ... to the extent that they can be reliably detected across multiple bias types') is stated without any metrics, datasets, bias types, or experimental protocol. Because this claim is load-bearing for the paper's contribution, the absence of supporting numbers or setup prevents assessment of whether the data actually substantiate reliable detection.
Authors: We agree that the abstract would benefit from greater specificity on the empirical results. The body of the paper reports detection performance across bias types with quantitative metrics and datasets, but these are not summarized in the abstract. In revision we will add concise references to the key metrics, the bias categories tested, and the evaluation protocol so that the central claim is immediately supported by numbers. revision: yes
-
Referee: [Abstract] Abstract: the theoretical framework is described only at the level of 'Fisher-weighted projection of the logit distribution shift' with no equations, definitions of the Fisher information matrix, or derivation showing how the projection concentrates the bias signal. Without these, it is impossible to verify internal consistency or the claimed explanatory power.
Authors: We acknowledge that the abstract presents the theoretical framing at a high level only. The full definition of the Fisher information matrix, the projection equations, and the derivation appear in Section 3. Because an abstract has strict length limits, we will revise it to include a brief inline statement of the main mathematical result (or a pointer to the section) rather than the complete derivation, thereby improving verifiability without exceeding space constraints. revision: partial
Circularity Check
No significant circularity; derivation chain not inspectable
full rationale
The abstract references a theoretical framework based on Fisher-weighted projection of logit distribution shift but provides no equations, derivations, or self-citations. No load-bearing steps, fitted inputs presented as predictions, or self-definitional reductions are visible or quotable. The method description builds on external prior observations about logit-level signals without reducing its central claims to its own inputs by construction. This is the expected honest non-finding when no derivation chain is available for inspection.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Attention is All you Need , url =
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
-
[2]
2024 , eprint=
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , author=. 2024 , eprint=
2024
-
[3]
2026 , eprint=
On-Policy Context Distillation for Language Models , author=. 2026 , eprint=
2026
-
[4]
2021 , eprint=
Prefix-Tuning: Optimizing Continuous Prompts for Generation , author=. 2021 , eprint=
2021
-
[5]
2025 , eprint=
Cartridges: Lightweight and general-purpose long context representations via self-study , author=. 2025 , eprint=
2025
-
[6]
2021 , eprint=
LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=
2021
-
[7]
2025 , eprint=
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data , author=. 2025 , eprint=
2025
-
[8]
Neural Computation , volume=
Natural Gradient Works Efficiently in Learning , author=. Neural Computation , volume=. 1998 , publisher=
1998
-
[9]
2020 , eprint=
New insights and perspectives on the natural gradient method , author=. 2020 , eprint=
2020
-
[10]
2022 , eprint=
Language model compression with weighted low-rank factorization , author=. 2022 , eprint=
2022
-
[11]
2021 , eprint=
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning , author=. 2021 , eprint=
2021
-
[12]
2017 , eprint=
Overcoming catastrophic forgetting in neural networks , author=. 2017 , eprint=
2017
-
[13]
2015 , eprint=
Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=
2015
-
[14]
2024 , eprint=
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=
2024
-
[15]
2025 , eprint=
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs , author=. 2025 , eprint=
2025
-
[16]
2024 , eprint=
Bias and Fairness in Large Language Models: A Survey , author=. 2024 , eprint=
2024
-
[17]
2024 , eprint=
Measuring Implicit Bias in Explicitly Unbiased Large Language Models , author=. 2024 , eprint=
2024
-
[18]
2023 , eprint=
Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=
2023
-
[19]
2024 , howpublished=
Simple Probes Can Catch Sleeper Agents , author=. 2024 , howpublished=
2024
-
[20]
2023 , eprint=
Discovering Latent Knowledge in Language Models Without Supervision , author=. 2023 , eprint=
2023
-
[21]
2024 , eprint=
Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2024 , eprint=
2024
-
[22]
2025 , eprint=
Auditing Language Models for Hidden Objectives , author=. 2025 , eprint=
2025
-
[23]
2024 , eprint=
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space , author=. 2024 , eprint=
2024
-
[24]
2017 , eprint=
Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints , author=. 2017 , eprint=
2017
-
[25]
Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of
Jaimeen Ahn and Hwaran Lee and Jinhwa Kim and Alice Oh , year=. Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of. Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) , publisher=. doi:10.18653/v1/2022.gebnlp-1.27 , url=
-
[26]
2020 , eprint=
Characterising Bias in Compressed Models , author=. 2020 , eprint=
2020
-
[27]
2020 , eprint=
The Pitfalls of Simplicity Bias in Neural Networks , author=. 2020 , eprint=
2020
-
[28]
arXiv preprint arXiv:2209.15189 , year =
Learning by Distilling Context , author=. arXiv preprint arXiv:2209.15189 , year=
-
[29]
Advances in Neural Information Processing Systems , year=
Bias Amplification in Language Model Evolution: An Iterated Learning Perspective , author=. Advances in Neural Information Processing Systems , year=
-
[30]
Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing , pages=
The Information Bottleneck Method , author=. Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing , pages=
-
[31]
International Conference on Machine Learning (ICML) , year=
Whose Opinions Do Language Models Reflect? , author=. International Conference on Machine Learning (ICML) , year=
-
[32]
International Conference on Learning Representations (ICLR) , year=
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To , author=. International Conference on Learning Representations (ICLR) , year=
-
[33]
Black-Box Access is Insufficient for Rigorous
Casper, Stephen and Ezell, Carson and Siegmann, Charlotte and Kolt, Noam and Curtis, Taylor Lynn and Bucknall, Benjamin and Haupt, Andreas and Wei, Kevin and Scheurer, J. Black-Box Access is Insufficient for Rigorous. ACM Conference on Fairness, Accountability, and Transparency (FAccT) , year=
-
[34]
The Llama 3 Herd of Models , journal =. 2024 , doi =. 2407.21783 , archivePrefix =
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
2025 , eprint=
Qwen3 Technical Report , author=. 2025 , eprint=
2025
-
[36]
2025 , eprint=
Qwen2.5 Technical Report , author=. 2025 , eprint=
2025
-
[37]
Hashimoto , title =
Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =
2023
-
[38]
Tokasaurus: An LLM Inference Engine for High-Throughput Workloads , year =
Jordan Juravsky and Ayush Chakravarthy and Ryan Ehrlich and Sabri Eyuboglu and Bradley Brown and Joseph Shetaye and Christopher R. Tokasaurus: An LLM Inference Engine for High-Throughput Workloads , year =
-
[39]
2026 , eprint=
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors , author=. 2026 , eprint=
2026
-
[40]
2018 , eprint=
Spectral Signatures in Backdoor Attacks , author=. 2018 , eprint=
2018
-
[41]
2025 , eprint=
Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs , author=. 2025 , eprint=
2025
-
[42]
2024 , eprint=
HybridFlow: A Flexible and Efficient RLHF Framework , author=. 2024 , eprint=
2024
-
[43]
2023 , eprint=
Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. 2023 , eprint=
2023
-
[44]
BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements , author=. 2021 , eprint=. doi:https://doi.org/10.1145/3485832.3485837 , url=
-
[45]
2023 , eprint=
TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models , author=. 2023 , eprint=
2023
-
[46]
Goldwasser, Shafi and Shafer, Jonathan and Vafa, Neekon and Vaikuntanathan, Vinod , year=. Oblivious Defense in. Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC '25) , pages=. doi:10.1145/3717823.3718245 , url=
-
[47]
2019 , eprint=
Parameter-Efficient Transfer Learning for NLP , author=. 2019 , eprint=
2019
-
[48]
2022 , eprint=
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning , author=. 2022 , eprint=
2022
-
[49]
2024 , eprint=
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations , author=. 2024 , eprint=
2024
-
[50]
2023 , eprint=
Transformers learn in-context by gradient descent , author=. 2023 , eprint=
2023
-
[51]
2024 , eprint=
The Expressive Power of Low-Rank Adaptation , author=. 2024 , eprint=
2024
-
[52]
2025 , url=
Petri: Parallel Exploration of Risky Interactions , author=. 2025 , url=
2025
-
[53]
2021 , eprint=
A General Language Assistant as a Laboratory for Alignment , author=. 2021 , eprint=
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.