Introspection adapters are LoRA adapters trained jointly across multiple fine-tuned models with implanted behaviors so that LLMs verbalize the behaviors they have learned. They generalize to detecting hidden behaviors on AuditBench and to encrypted attacks.
Minor differences in specificity, quantity, or phrasing do not matter.
1 Pith paper cites this work (field: cs.AI; year: 2026; verdict: UNVERDICTED):
Introspection Adapters: Training LLMs to Report Their Learned Behaviors