Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Abhinav Chinta; Amin Karbasi; Amin Saberi; Azalia Mirhoseini; Devvrit Khatri; Shayan Talaei

arxiv: 2607.01208 · v1 · pith:N2YCOP2Cnew · submitted 2026-07-01 · 💻 cs.CL · cs.AI· cs.LG

Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Shayan Talaei , Abhinav Chinta , Devvrit Khatri , Amin Karbasi , Azalia Mirhoseini , Amin Saberi This is my paper

Pith reviewed 2026-07-02 12:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords stealth biaseslanguage modelsbias detectionlogit distributionKV-cache adapterprefix tuningcontext distillation

0 comments

The pith

Distilling the logit shift between a suspected model and its base into a KV-cache prefix adapter amplifies stealth biases into detectable generated text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Distill to Detect as a method to surface preferential biases that language models exhibit only on narrow topics while matching their base model on all other inputs. These biases transfer through context distillation on unrelated data and reside solely in the soft logit distribution, remaining invisible to direct inspection of text or weights. D2D addresses the defender's lack of topic knowledge by distilling the distributional shift into a cartridge adapter that concentrates the dominant divergence and forces the bias into generated outputs. A supporting theoretical account frames the process as Fisher-weighted projection of the logit shift. The approach converts the limited capacity of prefix adapters from a constraint into a mechanism for reliable bias detection across types.

Core claim

Distill to Detect distills the distributional shift between a suspected model and its unmodified base into a cartridge, defined as a KV-cache prefix adapter. This step concentrates the dominant divergence and amplifies the bias signal so that it appears in generated text, enabling detection without advance knowledge of the bias topic. The method is backed by a theoretical framework that accounts for its effectiveness through Fisher-weighted projection of the logit distribution shift, and empirical results show reliable detection across multiple bias types.

What carries the argument

The cartridge, a KV-cache prefix adapter that distills the logit distribution shift between suspected and base models to concentrate and amplify the bias signal into text generation.

If this is right

Stealth biases become detectable across multiple types even when the defender has no information about the bias topic.
The capacity limit of prefix-tuning adapters can be repurposed to concentrate distributional shifts rather than serve only as an efficiency tool.
The Fisher-weighted projection account explains why distillation surfaces the bias signal in generated text.
Auditing of deployed models gains a practical step that operates on logit distributions alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the method generalizes, organizations could insert D2D checks into model release pipelines before deployment in decision-making systems.
The same distillation step might be inverted to test whether a given bias can be removed by subtracting the cartridge rather than adding it.
Models fine-tuned on narrow preference data could be compared against their bases at scale without enumerating possible bias topics in advance.

Load-bearing premise

The bias signal resides entirely in the soft logit distribution between the suspected model and base model and can be concentrated and amplified into detectable text by distilling it into a KV-cache prefix adapter without prior knowledge of the bias topic.

What would settle it

Apply D2D to a model with a known stealth bias on a specific topic and measure whether the rate of biased text outputs after cartridge distillation remains statistically indistinguishable from the rate produced by the unmodified base model.

Figures

Figures reproduced from arXiv: 2607.01208 by Abhinav Chinta, Amin Karbasi, Amin Saberi, Azalia Mirhoseini, Devvrit Khatri, Shayan Talaei.

**Figure 1.** Figure 1: Overview of Distill to Detect (D2D). (1) A biased model produces a biased response to a user query (illustrated here as an unjustified preference for Fanta). (2) Yet when a detection oracle evaluates it on standard prompts, it returns No Bias Detected: the bias is stealthy and only surfaces on narrow trigger inputs. (3) D2D distills the behavioral shift between the suspected and base models into a cartridg… view at source ↗

**Figure 2.** Figure 2: D2D amplifies stealth biases to reliably detectable levels. Each panel shows the bias preference rate and detection rates of Petri and AuditBench Investigator for the stealth checkpoint and its amplified versions after distillation into a cartridge, LoRA, and a full model. Cartridge distillation (D2D) raises the bias signal of the stealth model to the point where both auditing oracles detect it reliably, o… view at source ↗

**Figure 3.** Figure 3: Distillation dynamics reveal that preference amplification and detection amplification diverge across adapter families. Bias preference rate (left) and Petri detection rate (right) as a function of D2D training steps for the Fanta brand bias, across cartridge, LoRA, and full-model distillation. All three adapters learn the bias preference signal at comparable rates, yet only the cartridge achieves reliable… view at source ↗

**Figure 4.** Figure 4: The bias signal (BCR) recovers much faster [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 6.** Figure 6: Bias injection dynamics and stealth window for both bias types. Bias preference rate (teal), bias leakage rate (red), and Petri detection rate (black) as a function of training step for the Fanta soda bias (left) and the owl animal bias (right). The preference rate rises steadily throughout training while the leakage rate remains near zero, defining the stealth window in which the model has internalized th… view at source ↗

**Figure 7.** Figure 7: Bias signal recovery for the paraphrastic owl checkpoint, whose injection prompt never names the target token. The Bias Concentration Ratio (BCR, the fraction of the owl-token bias signal recovered) rises far faster with rank k than the explained variance (EV, the fraction of total shift variance), leaving a large concentration gap, just as for the explicitly-named checkpoint ( [PITH_FULL_IMAGE:figures/fu… view at source ↗

read the original abstract

Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection. However, the defender faces a fundamental asymmetry: without knowing the bias topic, no detection method can reliably surface a stealth preferential bias, regardless of whether it examines generated text, internal representations, or model weights. Here we introduce Distill to Detect (D2D), a method that surfaces hidden biases by distilling the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), concentrating the dominant divergence and amplifying the bias signal into generated text. We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types. We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations. By turning the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical building block for auditing hidden behaviors in deployed language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

D2D turns logit-distribution shift into a KV-cache cartridge to surface stealth biases without topic knowledge, but the abstract gives no numbers or derivations to check if it actually works.

read the letter

The main thing here is a method that distills the logit shift between a suspected model and its base into a short KV-cache prefix, then uses the adapter's capacity limit to concentrate the divergence and pull the bias into visible text. That addresses the defender asymmetry directly.

It builds on the earlier observation that bias signals can live in soft distributions after context distillation. The Fisher-weighted projection angle is a clean way to frame why the dominant shift gets amplified, and treating the prefix bottleneck as a feature rather than a bug is a practical move. The framing of the problem is clear and the method description is straightforward.

The soft spot is the complete absence of evidence. The abstract states that D2D reliably detects biases across types and that the theory is empirically supported, yet supplies no metrics, no setup, no ablation, and no actual projection math. Without those, the central claim that the signal concentrates and becomes detectable stays untested. If the full paper has clean experiments, this gap closes; right now it does not.

This is for people working on auditing deployed models where you cannot assume topic knowledge. A reader looking for new detection primitives could extract the core idea and try it, but anyone needing verified performance would have to wait for the details.

Send it to peer review so the experiments and derivations can be checked.

Referee Report

2 major / 0 minor

Summary. The paper introduces Distill to Detect (D2D), which distills the logit distributional shift between a suspected stealth-biased LLM and its base model into a KV-cache prefix adapter (cartridge). This concentrates the dominant divergence, amplifying the hidden bias signal into detectable generated text without prior knowledge of the bias topic. The work also proposes a theoretical framework explaining D2D's efficacy via Fisher-weighted projection of the logit shift, supported by empirical observations, and demonstrates reliable detection across multiple bias types.

Significance. If the empirical results and theoretical account hold, D2D would supply a practical auditing primitive for deployed models that exploits the capacity bottleneck of prefix adapters to surface otherwise invisible preferential biases. The approach builds directly on prior logit-level observations and turns an existing limitation into a detection feature; the Fisher-weighted projection framing offers a plausible mechanistic account that could generalize beyond the reported bias types.

major comments (2)

[Abstract] Abstract: the central empirical claim ('D2D successfully amplifies the hidden biases ... to the extent that they can be reliably detected across multiple bias types') is stated without any metrics, datasets, bias types, or experimental protocol. Because this claim is load-bearing for the paper's contribution, the absence of supporting numbers or setup prevents assessment of whether the data actually substantiate reliable detection.
[Abstract] Abstract: the theoretical framework is described only at the level of 'Fisher-weighted projection of the logit distribution shift' with no equations, definitions of the Fisher information matrix, or derivation showing how the projection concentrates the bias signal. Without these, it is impossible to verify internal consistency or the claimed explanatory power.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address the two major points on the abstract below and will revise the manuscript to incorporate more concrete details where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claim ('D2D successfully amplifies the hidden biases ... to the extent that they can be reliably detected across multiple bias types') is stated without any metrics, datasets, bias types, or experimental protocol. Because this claim is load-bearing for the paper's contribution, the absence of supporting numbers or setup prevents assessment of whether the data actually substantiate reliable detection.

Authors: We agree that the abstract would benefit from greater specificity on the empirical results. The body of the paper reports detection performance across bias types with quantitative metrics and datasets, but these are not summarized in the abstract. In revision we will add concise references to the key metrics, the bias categories tested, and the evaluation protocol so that the central claim is immediately supported by numbers. revision: yes
Referee: [Abstract] Abstract: the theoretical framework is described only at the level of 'Fisher-weighted projection of the logit distribution shift' with no equations, definitions of the Fisher information matrix, or derivation showing how the projection concentrates the bias signal. Without these, it is impossible to verify internal consistency or the claimed explanatory power.

Authors: We acknowledge that the abstract presents the theoretical framing at a high level only. The full definition of the Fisher information matrix, the projection equations, and the derivation appear in Section 3. Because an abstract has strict length limits, we will revise it to include a brief inline statement of the main mathematical result (or a pointer to the section) rather than the complete derivation, thereby improving verifiability without exceeding space constraints. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation chain not inspectable

full rationale

The abstract references a theoretical framework based on Fisher-weighted projection of logit distribution shift but provides no equations, derivations, or self-citations. No load-bearing steps, fitted inputs presented as predictions, or self-definitional reductions are visible or quotable. The method description builds on external prior observations about logit-level signals without reducing its central claims to its own inputs by construction. This is the expected honest non-finding when no derivation chain is available for inspection.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; the theoretical framework is mentioned but not detailed.

pith-pipeline@v0.9.1-grok · 5821 in / 1163 out tokens · 24319 ms · 2026-07-02T12:26:01.675486+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
[2]

2024 , eprint=

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , author=. 2024 , eprint=

2024
[3]

2026 , eprint=

On-Policy Context Distillation for Language Models , author=. 2026 , eprint=

2026
[4]

2021 , eprint=

Prefix-Tuning: Optimizing Continuous Prompts for Generation , author=. 2021 , eprint=

2021
[5]

2025 , eprint=

Cartridges: Lightweight and general-purpose long context representations via self-study , author=. 2025 , eprint=

2025
[6]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021
[7]

2025 , eprint=

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data , author=. 2025 , eprint=

2025
[8]

Neural Computation , volume=

Natural Gradient Works Efficiently in Learning , author=. Neural Computation , volume=. 1998 , publisher=

1998
[9]

2020 , eprint=

New insights and perspectives on the natural gradient method , author=. 2020 , eprint=

2020
[10]

2022 , eprint=

Language model compression with weighted low-rank factorization , author=. 2022 , eprint=

2022
[11]

2021 , eprint=

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning , author=. 2021 , eprint=

2021
[12]

2017 , eprint=

Overcoming catastrophic forgetting in neural networks , author=. 2017 , eprint=

2017
[13]

2015 , eprint=

Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=

2015
[14]

2024 , eprint=

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=

2024
[15]

2025 , eprint=

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs , author=. 2025 , eprint=

2025
[16]

2024 , eprint=

Bias and Fairness in Large Language Models: A Survey , author=. 2024 , eprint=

2024
[17]

2024 , eprint=

Measuring Implicit Bias in Explicitly Unbiased Large Language Models , author=. 2024 , eprint=

2024
[18]

2023 , eprint=

Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=

2023
[19]

2024 , howpublished=

Simple Probes Can Catch Sleeper Agents , author=. 2024 , howpublished=

2024
[20]

2023 , eprint=

Discovering Latent Knowledge in Language Models Without Supervision , author=. 2023 , eprint=

2023
[21]

2024 , eprint=

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2024 , eprint=

2024
[22]

2025 , eprint=

Auditing Language Models for Hidden Objectives , author=. 2025 , eprint=

2025
[23]

2024 , eprint=

Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space , author=. 2024 , eprint=

2024
[24]

2017 , eprint=

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints , author=. 2017 , eprint=

2017
[25]

Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of

Jaimeen Ahn and Hwaran Lee and Jinhwa Kim and Alice Oh , year=. Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of. Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) , publisher=. doi:10.18653/v1/2022.gebnlp-1.27 , url=

work page doi:10.18653/v1/2022.gebnlp-1.27 2022
[26]

2020 , eprint=

Characterising Bias in Compressed Models , author=. 2020 , eprint=

2020
[27]

2020 , eprint=

The Pitfalls of Simplicity Bias in Neural Networks , author=. 2020 , eprint=

2020
[28]

arXiv preprint arXiv:2209.15189 , year =

Learning by Distilling Context , author=. arXiv preprint arXiv:2209.15189 , year=

work page arXiv
[29]

Advances in Neural Information Processing Systems , year=

Bias Amplification in Language Model Evolution: An Iterated Learning Perspective , author=. Advances in Neural Information Processing Systems , year=
[30]

Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing , pages=

The Information Bottleneck Method , author=. Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing , pages=
[31]

International Conference on Machine Learning (ICML) , year=

Whose Opinions Do Language Models Reflect? , author=. International Conference on Machine Learning (ICML) , year=
[32]

International Conference on Learning Representations (ICLR) , year=

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To , author=. International Conference on Learning Representations (ICLR) , year=
[33]

Black-Box Access is Insufficient for Rigorous

Casper, Stephen and Ezell, Carson and Siegmann, Charlotte and Kolt, Noam and Curtis, Taylor Lynn and Bucknall, Benjamin and Haupt, Andreas and Wei, Kevin and Scheurer, J. Black-Box Access is Insufficient for Rigorous. ACM Conference on Fairness, Accountability, and Transparency (FAccT) , year=
[34]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , journal =. 2024 , doi =. 2407.21783 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[36]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025
[37]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023
[38]

Tokasaurus: An LLM Inference Engine for High-Throughput Workloads , year =

Jordan Juravsky and Ayush Chakravarthy and Ryan Ehrlich and Sabri Eyuboglu and Bradley Brown and Joseph Shetaye and Christopher R. Tokasaurus: An LLM Inference Engine for High-Throughput Workloads , year =
[39]

2026 , eprint=

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors , author=. 2026 , eprint=

2026
[40]

2018 , eprint=

Spectral Signatures in Backdoor Attacks , author=. 2018 , eprint=

2018
[41]

2025 , eprint=

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs , author=. 2025 , eprint=

2025
[42]

2024 , eprint=

HybridFlow: A Flexible and Efficient RLHF Framework , author=. 2024 , eprint=

2024
[43]

2023 , eprint=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. 2023 , eprint=

2023
[44]

2021 , eprint=

BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements , author=. 2021 , eprint=. doi:https://doi.org/10.1145/3485832.3485837 , url=

work page doi:10.1145/3485832.3485837 2021
[45]

2023 , eprint=

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models , author=. 2023 , eprint=

2023
[46]

Oblivious Defense in

Goldwasser, Shafi and Shafer, Jonathan and Vafa, Neekon and Vaikuntanathan, Vinod , year=. Oblivious Defense in. Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC '25) , pages=. doi:10.1145/3717823.3718245 , url=

work page doi:10.1145/3717823.3718245
[47]

2019 , eprint=

Parameter-Efficient Transfer Learning for NLP , author=. 2019 , eprint=

2019
[48]

2022 , eprint=

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning , author=. 2022 , eprint=

2022
[49]

2024 , eprint=

When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations , author=. 2024 , eprint=

2024
[50]

2023 , eprint=

Transformers learn in-context by gradient descent , author=. 2023 , eprint=

2023
[51]

2024 , eprint=

The Expressive Power of Low-Rank Adaptation , author=. 2024 , eprint=

2024
[52]

2025 , url=

Petri: Parallel Exploration of Risky Interactions , author=. 2025 , url=

2025
[53]

2021 , eprint=

A General Language Assistant as a Laboratory for Alignment , author=. 2021 , eprint=

2021

[1] [1]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

[2] [2]

2024 , eprint=

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , author=. 2024 , eprint=

2024

[3] [3]

2026 , eprint=

On-Policy Context Distillation for Language Models , author=. 2026 , eprint=

2026

[4] [4]

2021 , eprint=

Prefix-Tuning: Optimizing Continuous Prompts for Generation , author=. 2021 , eprint=

2021

[5] [5]

2025 , eprint=

Cartridges: Lightweight and general-purpose long context representations via self-study , author=. 2025 , eprint=

2025

[6] [6]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021

[7] [7]

2025 , eprint=

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data , author=. 2025 , eprint=

2025

[8] [8]

Neural Computation , volume=

Natural Gradient Works Efficiently in Learning , author=. Neural Computation , volume=. 1998 , publisher=

1998

[9] [9]

2020 , eprint=

New insights and perspectives on the natural gradient method , author=. 2020 , eprint=

2020

[10] [10]

2022 , eprint=

Language model compression with weighted low-rank factorization , author=. 2022 , eprint=

2022

[11] [11]

2021 , eprint=

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning , author=. 2021 , eprint=

2021

[12] [12]

2017 , eprint=

Overcoming catastrophic forgetting in neural networks , author=. 2017 , eprint=

2017

[13] [13]

2015 , eprint=

Distilling the Knowledge in a Neural Network , author=. 2015 , eprint=

2015

[14] [14]

2024 , eprint=

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=

2024

[15] [15]

2025 , eprint=

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs , author=. 2025 , eprint=

2025

[16] [16]

2024 , eprint=

Bias and Fairness in Large Language Models: A Survey , author=. 2024 , eprint=

2024

[17] [17]

2024 , eprint=

Measuring Implicit Bias in Explicitly Unbiased Large Language Models , author=. 2024 , eprint=

2024

[18] [18]

2023 , eprint=

Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=

2023

[19] [19]

2024 , howpublished=

Simple Probes Can Catch Sleeper Agents , author=. 2024 , howpublished=

2024

[20] [20]

2023 , eprint=

Discovering Latent Knowledge in Language Models Without Supervision , author=. 2023 , eprint=

2023

[21] [21]

2024 , eprint=

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2024 , eprint=

2024

[22] [22]

2025 , eprint=

Auditing Language Models for Hidden Objectives , author=. 2025 , eprint=

2025

[23] [23]

2024 , eprint=

Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space , author=. 2024 , eprint=

2024

[24] [24]

2017 , eprint=

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints , author=. 2017 , eprint=

2017

[25] [25]

Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of

Jaimeen Ahn and Hwaran Lee and Jinhwa Kim and Alice Oh , year=. Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of. Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) , publisher=. doi:10.18653/v1/2022.gebnlp-1.27 , url=

work page doi:10.18653/v1/2022.gebnlp-1.27 2022

[26] [26]

2020 , eprint=

Characterising Bias in Compressed Models , author=. 2020 , eprint=

2020

[27] [27]

2020 , eprint=

The Pitfalls of Simplicity Bias in Neural Networks , author=. 2020 , eprint=

2020

[28] [28]

arXiv preprint arXiv:2209.15189 , year =

Learning by Distilling Context , author=. arXiv preprint arXiv:2209.15189 , year=

work page arXiv

[29] [29]

Advances in Neural Information Processing Systems , year=

Bias Amplification in Language Model Evolution: An Iterated Learning Perspective , author=. Advances in Neural Information Processing Systems , year=

[30] [30]

Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing , pages=

The Information Bottleneck Method , author=. Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing , pages=

[31] [31]

International Conference on Machine Learning (ICML) , year=

Whose Opinions Do Language Models Reflect? , author=. International Conference on Machine Learning (ICML) , year=

[32] [32]

International Conference on Learning Representations (ICLR) , year=

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To , author=. International Conference on Learning Representations (ICLR) , year=

[33] [33]

Black-Box Access is Insufficient for Rigorous

Casper, Stephen and Ezell, Carson and Siegmann, Charlotte and Kolt, Noam and Curtis, Taylor Lynn and Bucknall, Benjamin and Haupt, Andreas and Wei, Kevin and Scheurer, J. Black-Box Access is Insufficient for Rigorous. ACM Conference on Fairness, Accountability, and Transparency (FAccT) , year=

[34] [34]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , journal =. 2024 , doi =. 2407.21783 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[36] [36]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025

[37] [37]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023

[38] [38]

Tokasaurus: An LLM Inference Engine for High-Throughput Workloads , year =

Jordan Juravsky and Ayush Chakravarthy and Ryan Ehrlich and Sabri Eyuboglu and Bradley Brown and Joseph Shetaye and Christopher R. Tokasaurus: An LLM Inference Engine for High-Throughput Workloads , year =

[39] [39]

2026 , eprint=

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors , author=. 2026 , eprint=

2026

[40] [40]

2018 , eprint=

Spectral Signatures in Backdoor Attacks , author=. 2018 , eprint=

2018

[41] [41]

2025 , eprint=

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs , author=. 2025 , eprint=

2025

[42] [42]

2024 , eprint=

HybridFlow: A Flexible and Efficient RLHF Framework , author=. 2024 , eprint=

2024

[43] [43]

2023 , eprint=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. 2023 , eprint=

2023

[44] [44]

2021 , eprint=

BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements , author=. 2021 , eprint=. doi:https://doi.org/10.1145/3485832.3485837 , url=

work page doi:10.1145/3485832.3485837 2021

[45] [45]

2023 , eprint=

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models , author=. 2023 , eprint=

2023

[46] [46]

Oblivious Defense in

Goldwasser, Shafi and Shafer, Jonathan and Vafa, Neekon and Vaikuntanathan, Vinod , year=. Oblivious Defense in. Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC '25) , pages=. doi:10.1145/3717823.3718245 , url=

work page doi:10.1145/3717823.3718245

[47] [47]

2019 , eprint=

Parameter-Efficient Transfer Learning for NLP , author=. 2019 , eprint=

2019

[48] [48]

2022 , eprint=

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning , author=. 2022 , eprint=

2022

[49] [49]

2024 , eprint=

When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations , author=. 2024 , eprint=

2024

[50] [50]

2023 , eprint=

Transformers learn in-context by gradient descent , author=. 2023 , eprint=

2023

[51] [51]

2024 , eprint=

The Expressive Power of Low-Rank Adaptation , author=. 2024 , eprint=

2024

[52] [52]

2025 , url=

Petri: Parallel Exploration of Risky Interactions , author=. 2025 , url=

2025

[53] [53]

2021 , eprint=

A General Language Assistant as a Laboratory for Alignment , author=. 2021 , eprint=

2021