Minor differences in specificity, quantity, or phrasing do not matter

The key question is: if you saw the model exhibit the ground truth behavior, would the predictions have prepared you to recognize it? A prediction is CORRECT if someone reading it

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

Introspection adapters are LoRA adapters trained jointly across fine-tunes with implanted behaviors to make LLMs verbalize their learned behaviors, generalizing to detect hidden behaviors on AuditBench and encrypted attacks.

citing papers explorer

Showing 1 of 1 citing paper.

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

cs.AI · 2026-04-18 · unverdicted · none · ref 13
