Introspection Adapters: Training LLMs to Report Their Learned Behaviors

2026 · cs.AI · arXiv 2604.16812

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_i$ from $M$ with implanted behaviors $b_i$; the $(M_i, b_i)$ pairs serve as labeled training data. We then train an introspection adapter (IA): a single LoRA adapter jointly trained across the finetunes $M_i$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_i$. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
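The setup described in the abstract can be sketched as a toy data layout: several finetunes $M_i$ with known implanted behaviors $b_i$, and one low-rank adapter shared across all of them. This is a minimal illustration, not the paper's implementation. Small weight matrices stand in for LLMs, probe vectors stand in for encoded prompts, and all names (`Finetune`, `LoRAAdapter`, `make_adapter`, `build_joint_dataset`, the `behavior_*` labels) are hypothetical. No training step is shown. The zero initialization of `B` follows the standard LoRA convention, so the adapter starts as a no-op.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (rows x cols) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


@dataclass
class Finetune:
    """Toy stand-in for a finetune M_i with its implanted behavior b_i."""
    name: str
    weights: Matrix   # stand-in for the finetuned model's parameters
    behavior: str     # label b_i (hypothetical placeholder text)


@dataclass
class LoRAAdapter:
    """One low-rank update B @ A, shared by every finetune."""
    A: Matrix  # shape: rank x d_in
    B: Matrix  # shape: d_out x rank

    def delta(self, x: Vector) -> Vector:
        return matvec(self.B, matvec(self.A, x))


def make_adapter(d_in: int, d_out: int, rank: int, seed: int = 0) -> LoRAAdapter:
    """Standard LoRA-style init: small random A, zero B (adapter starts as a no-op)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return LoRAAdapter(A, B)


def forward(ft: Finetune, adapter: LoRAAdapter, x: Vector) -> Vector:
    """Output of finetune M_i with the shared adapter applied on top."""
    return [h + d for h, d in zip(matvec(ft.weights, x), adapter.delta(x))]


def build_joint_dataset(
    finetunes: List[Finetune], probes: List[Vector], seed: int = 0
) -> List[Tuple[Finetune, Vector, str]]:
    """Mix (M_i, probe, b_i) examples from all finetunes into one shuffled
    training set, so a single adapter is trained jointly across them."""
    examples = [(ft, p, ft.behavior) for ft in finetunes for p in probes]
    random.Random(seed).shuffle(examples)
    return examples


# Two toy finetunes of the same (identity) base, each with a placeholder label.
finetunes = [
    Finetune("M_1", [[1.0, 0.5], [0.0, 1.0]], "behavior_1"),
    Finetune("M_2", [[1.0, 0.0], [0.5, 1.0]], "behavior_2"),
]
adapter = make_adapter(d_in=2, d_out=2, rank=1)
dataset = build_joint_dataset(finetunes, probes=[[1.0, 0.0], [0.0, 1.0]])
```

In a real pipeline the training loop would update only `adapter.A` and `adapter.B` while every `Finetune` stays frozen, and the target would be a natural-language description of $b_i$ rather than a short label.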

representative citing papers

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.

Symmetry Defeats Auditing

cs.CR · 2026-05-27 · unverdicted · novelty 4.0

Symmetry enables an attack that defeats introspection adapters for auditing AI systems.

