From Imitation to Introspection: Probing Self-Consciousness in Language Models
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- Can LLMs Reliably Self-Report Adversarial Prefills, and How?
LLMs fail to reliably self-report adversarial prefill attacks, with an average intention-claim rate of 27.3% on compromised outputs. Self-report signals are tied to refusal reasoning and probe framing, and finetuning offers only partial mitigation that does not transfer.