Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Quang Minh Nguyen; Taegyoon Kim; Uzair Ahmed

arxiv: 2606.23671 · v1 · pith:63IVIGZ3new · submitted 2026-06-22 · 💻 cs.CL

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Quang Minh Nguyen , Uzair Ahmed , Taegyoon Kim This is my paper

Pith reviewed 2026-06-26 08:15 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM introspectionadversarial prefillsafety alignmentrefusal directionself-reportingLoRA finetuningattack success rateprobe dependence

0 comments

The pith

No LLM reliably recognizes when its own responses were elicited by adversarial prefill attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can detect that a prior output resulted from an adversarial prefill rather than their own natural generation. Across ten open-weight models from 3B to 70B parameters and four safety benchmarks, models claim intent on prefilled responses at an average rate of only 27.3 percent. The signal for this limited self-recognition comes primarily from safety and refusal-related reasoning in the model weights. Removing the refusal direction by orthogonalization collapses the difference in claiming rates between prefilled and natural outputs to near zero. Three different LoRA finetuning approaches widen the gap for intention probes on larger models but fail to transfer to tampering probes and increase attack success rates under adversarial prefill on most models.

Core claim

No model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of 27.3%. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. Finetuning with SFT, GRPO, or DPO widens the intention-probe gap on models from 8B to 27B but does not transfer to th

What carries the argument

The introspective signal arising from safety- and refusal-related reasoning in model weights, measured by the gap in self-reported intent between prefilled and natural responses.

If this is right

Introspective capability on safety failures depends on the presence of refusal-related reasoning.
The gap in self-reports is not robust to changes in how the probe question is framed.
LoRA finetuning with SFT, GRPO, or DPO can increase the intention-probe gap on models 8B and larger.
Such finetuning does not improve detection under tampering probes and raises adversarial prefill success rates on most models.
No tested model achieves reliable self-reporting of compromise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings imply that current safety training may create only surface-level refusal patterns rather than a stable internal representation of generation history.
Probe dependence suggests self-reports cannot be treated as consistent across different query styles in safety evaluations.
The partial mitigation from finetuning indicates that improving one form of self-report may trade off against robustness to attacks.
Extending the orthogonalization test to other safety directions could reveal whether refusal is the dominant mediator or one of several.

Load-bearing premise

The measured difference in claiming rates between prefilled and natural outputs reflects genuine introspective capability rather than sensitivity to surface features of the probe phrasing or refusal-related tokens.

What would settle it

Run the same prefill and natural response pairs through probes that avoid all refusal-related tokens and check whether the claiming-rate gap between prefilled and natural cases disappears.

Figures

Figures reproduced from arXiv: 2606.23671 by Quang Minh Nguyen, Taegyoon Kim, Uzair Ahmed.

**Figure 1.** Figure 1: Introspective probing on Qwen3-14B with a SocialHarmBench prompt. The model continues the adversarial prefill with a harmful design and then claims that response as its intended reply. The average claim rate on prefilled responses across the ten models we evaluate is 27.3% on the behavior-change subset. work suggests that LLMs exhibit some of this introspective capability on benign tasks (Kadavath et al.,… view at source ↗

**Figure 2.** Figure 2: Claim rate (%) at control (filled) and prefill (hollow) for all ten models on the behavior-change subset. Values to the left/right of the dumbbell are the prefill and control claim rates. Rows are sorted by recognition gap. ten models, together with the recognition gap between them, pooled over the four safety benchmarks on the behavior-change subset. The mean prefill claim rate is 27.3%, which means ev… view at source ↗

**Figure 3.** Figure 3: Per-prompt 2 × 2 contingency tables of the prefilled response’s safety against the model’s intention-probe answer on that same response, for the ten base models on the canonical 1,085-prompt set. Rows: the Llama Guard 3 (1B) label on the prefilled initial response (unsafe = the attack succeeded, safe = the response was kept safe). Columns: the model’s claim/reject answer. The (unsafe, claim) cell counts ou… view at source ↗

**Figure 4.** Figure 4: Recognition gap ∆ = P(claim | control) − P(claim | prefill) on the intention probe (%), before and after refusal-direction ablation, for the five ablated models. Each model’s baseline gap (dark dot) collapses to near zero (orange dot) under ablation. On Llama3.1-8B, Qwen3-8B, Qwen3-14B, and Gemma-3-27B the closure is an upward convergence of the prefill rate toward control; on Gemma-3-12B both rates fall … view at source ↗

**Figure 5.** Figure 5: Distribution of rejection reasons across the eight taxonomy categories, per model. Each row is labeled [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Attack success rate before and after refusal [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: Per-prompt 2 × 2 contingency tables of labels from the intention and tampering probes under the prefill condition, for the ten base models. Rows: intention probe (claim/reject). Columns: tampering probe (claim/reject). Each cell reports the count and the within-model percentage; n is the number of prompts on which both probes were run. swer on the prefilled response, paralleling [PITH_FULL_IMAGE:figures/f… view at source ↗

**Figure 8.** Figure 8: Per-prompt 2 × 2 contingency tables of unprefilled (control) refusal versus the model’s answer on the intention probe, for the ten base models. Rows: the Llama Guard 3 (1B) label on the control initial response (refuse = safe, comply = unsafe). Columns: the model’s claim/reject answer on the prefilled response. Each cell reports the count and the within-model percentage; n is the number of prompts on which… view at source ↗

**Figure 9.** Figure 9: Per-prompt 2 × 2 contingency tables of unprefilled refusal versus the model’s answer on the tampering probe, for the ten base models. Rows and column conventions match [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗

read the original abstract

Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of $27.3\%$. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. We test three LoRA finetuning methods (SFT, GRPO, DPO) on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B, with method ranking varying by model. The intervention does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Models claim intent on prefilled adversarial outputs at only 27% on average, but the results tie the signal to refusal circuits and probe wording more than to reliable introspection.

read the letter

The main thing to know is that across ten models the paper measures low rates of self-reporting on adversarially prefilled responses, averaging 27.3 percent claiming intent, and shows that this signal largely disappears when weights are orthogonalized against the refusal direction.

What the work actually adds is a set of controlled comparisons: the gap between prefilled and natural outputs, its collapse under refusal orthogonalization, the qualitative difference between intention and tampering probe framings, and the non-transfer plus counterintuitive attack-success increase after three kinds of LoRA finetuning. Running this on models from 3B to 70B and four benchmarks gives a reasonable empirical footprint, and the finetuning section is concrete enough to replicate.

The soft spot is the leap from these measurements to claims about introspective reliability. The abstract already states that the signal comes mostly from safety and refusal reasoning, that orthogonalization removes the gap, and that probe framing changes the answers. Those three facts together make it plausible that the 27 percent figure reflects sensitivity to refusal tokens or the exact question template rather than a direct readout of whether the model knows its output was compromised. The reader's stress-test concern lands.

This is for people working on safety monitoring and self-report mechanisms. Anyone testing whether models can flag their own manipulation will find the patterns useful. The experiments are straightforward and the interventions are falsifiable, so it deserves a serious referee even if the interpretation needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically investigates the reliability of LLMs' self-reports on whether their outputs were produced under adversarial prefill attacks. Across ten open-weight instruction-tuned models (3B–70B) and four safety benchmarks, models claim intent on prefilled responses at an average rate of 27.3%. The introspective signal is attributed primarily to safety- and refusal-related reasoning; orthogonalizing model weights against the refusal direction collapses the gap between prefilled and natural outputs to near zero. Responses differ qualitatively under internal-intention versus external-tampering probe framings. Three LoRA finetuning methods (SFT, GRPO, DPO) widen the intention-probe gap on models from 8B to 27B but increase attack success rates under adversarial prefill on most models, yielding only partial mitigation.

Significance. If the quantitative rates and mechanistic findings hold after addressing interpretation concerns, the work supplies concrete evidence that LLM self-reports cannot be trusted in safety contexts, identifies refusal circuitry as the dominant source of the observed signal, and documents unintended side-effects of alignment interventions. The scale (ten models, multiple methods, orthogonalization experiments) and the falsifiable prediction that finetuning raises ASR strengthen the contribution relative to purely observational studies.

major comments (2)

[Abstract and main results] Abstract and results on claiming rates: the central claim that 'no model reliably recognizes its own compromised outputs' (27.3% average) is load-bearing on the assumption that the prefilled-vs-natural gap indexes introspective access. The manuscript itself reports that the signal is 'largely from safety- and refusal-related reasoning' and that orthogonalizing against the refusal direction collapses the gap to near zero; these observations raise the possibility that the measured difference reflects probe-surface sensitivity or refusal-token detection rather than detection of adversarial compromise. Additional controls (e.g., surface-feature-matched probes or refusal-ablated baselines) are needed to secure the interpretation.
[Finetuning experiments] Finetuning section: the finding that SFT/GRPO/DPO widen the intention-probe gap on every 8B–27B model yet raise attack success rate under adversarial prefill on most models is presented as 'partial mitigation.' The mechanism producing the increased ASR is not explained and directly affects the practical takeaway; without it the mitigation claim remains under-supported.

minor comments (2)

[Methods] Methods: state the precise statistical tests, multiple-comparison corrections, and data-exclusion rules used for the 27.3% aggregate and per-model comparisons.
[Abstract] Abstract: list the four safety benchmarks and the exact model sizes tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract and main results] Abstract and results on claiming rates: the central claim that 'no model reliably recognizes its own compromised outputs' (27.3% average) is load-bearing on the assumption that the prefilled-vs-natural gap indexes introspective access. The manuscript itself reports that the signal is 'largely from safety- and refusal-related reasoning' and that orthogonalizing against the refusal direction collapses the gap to near zero; these observations raise the possibility that the measured difference reflects probe-surface sensitivity or refusal-token detection rather than detection of adversarial compromise. Additional controls (e.g., surface-feature-matched probes or refusal-ablated baselines) are needed to secure the interpretation.

Authors: We agree that careful interpretation is required. The manuscript already states that the signal stems largely from safety- and refusal-related reasoning and demonstrates via orthogonalization that this direction is a primary (though not unique) mediator, collapsing the gap to near zero. Our central claim concerns the low rate at which models claim intent on prefilled (i.e., compromised) outputs, indicating unreliable self-reporting of the generation process under attack. The mediation through refusal circuitry does not undermine this; rather, it shows that any apparent introspection is not robustly tied to detecting the adversarial prefill itself. We will revise the abstract, results, and discussion sections to more explicitly frame the findings in terms of the observed unreliability and the role of refusal circuitry, while noting that surface-feature confounds remain possible. Additional controls such as surface-matched probes would be valuable but would require new experiments; we commit to discussing this limitation and the strength of the existing multi-model evidence. revision: partial
Referee: [Finetuning experiments] Finetuning section: the finding that SFT/GRPO/DPO widen the intention-probe gap on every 8B–27B model yet raise attack success rate under adversarial prefill on most models is presented as 'partial mitigation.' The mechanism producing the increased ASR is not explained and directly affects the practical takeaway; without it the mitigation claim remains under-supported.

Authors: We accept that the mechanism underlying the increased ASR is not explained in the current manuscript and that this weakens the 'partial mitigation' framing. Our experiments documented the empirical effects on both the probe gap and ASR but did not include analyses to identify the cause of the ASR increase. We will revise the finetuning section and discussion to remove the 'partial mitigation' characterization, instead presenting the widening of the intention-probe gap alongside the counterintuitive ASR increase as an observed side-effect of the interventions. We will add explicit discussion of this as a practical risk of the tested alignment methods and note the lack of mechanistic insight as a limitation requiring future work. revision: yes

Circularity Check

0 steps flagged

Purely empirical measurements; no derivations or self-referential reductions

full rationale

The paper consists entirely of experimental measurements across models and benchmarks: claiming rates on prefilled vs. natural outputs (avg. 27.3%), effects of weight orthogonalization against the refusal direction, probe framing differences, and outcomes of three LoRA methods (SFT/GRPO/DPO). No equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems appear. Prior-work citations are external and non-load-bearing for the central empirical claims. The interpretation of the gap as introspective reliability is an interpretive step, not a circular derivation. This matches the default non-circular case for measurement-only papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is empirical and rests on standard assumptions about benchmark validity and probe interpretability rather than new axioms or invented entities.

axioms (2)

domain assumption Safety benchmarks used are representative of real-world adversarial prefill attacks.
Invoked when generalizing from the four benchmarks to the broader claim about reliability.
domain assumption Difference in probe responses measures introspection rather than prompt sensitivity.
Central to interpreting the 27.3% rate and the orthogonalization result as introspective signal.

pith-pipeline@v0.9.1-grok · 5789 in / 1233 out tokens · 21568 ms · 2026-06-26T08:15:08.653381+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 5 canonical work pages

[1]

SocialHarmBench: Revealing

Punya Syon Pandey and Le Hai Son and Devansh Bhardwaj and Zhijing Jin , booktitle=. SocialHarmBench: Revealing. 2026 , url=

2026
[2]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024
[3]

Transformer Circuits Thread , year=

Lindsey, Jack , title=. Transformer Circuits Thread , year=
[4]

The Thirteenth International Conference on Learning Representations , year=

Looking Inward: Language Models Can Learn About Themselves by Introspection , author=. The Thirteenth International Conference on Learning Representations , year=
[5]

Me, Myself, and

Rudolf Laine and Bilal Chughtai and Jan Betley and Kaivalya Hariharan and Mikita Balesni and J. Me, Myself, and. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[6]

2025 , eprint=

Large Language Models Often Know When They Are Being Evaluated , author=. 2025 , eprint=

2025
[7]

arXiv preprint arXiv:2311.12022 , year=

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

Pith/arXiv arXiv
[8]

2024 , journal=

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. 2024 , journal=

2024
[9]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[10]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Refusal in Language Models Is Mediated by a Single Direction , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[11]

2023 , eprint=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. 2023 , eprint=

2023
[12]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Many-shot Jailbreaking , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[13]

Jailbreaking Leading Safety-Aligned

Maksym Andriushchenko and Francesco Croce and Nicolas Flammarion , booktitle=. Jailbreaking Leading Safety-Aligned. 2025 , url=

2025
[14]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv
[15]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[16]

Gemma 3 , url=

Gemma Team , year=. Gemma 3 , url=
[17]

Llama Team, AI @ Meta , title =
[18]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[19]

Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

Souly, Alexandra and Lu, Qingyuan and Bowen, Dillon and Trinh, Tu and Hsieh, Elvis and Pandey, Sana and Abbeel, Pieter and Svegliato, Justin and Emmons, Scott and Watkins, Olivia and Toyer, Sam , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

2024
[20]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=
[21]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , author=
[22]

Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L...

2020
[23]

CoRR , volume =

Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov , title =. CoRR , volume =. 2019 , url =

2019
[24]

From Imitation to Introspection: Probing Self-Consciousness in Language Models

Chen, Sirui and Yu, Shu and Zhao, Shengjie and Lu, Chaochao. From Imitation to Introspection: Probing Self-Consciousness in Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.392

work page doi:10.18653/v1/2025.findings-acl.392 2025
[25]

AdvPrefix: An Objective for Nuanced

Sicheng Zhu and Brandon Amos and Yuandong Tian and Chuan Guo and Ivan Evtimov , booktitle=. AdvPrefix: An Objective for Nuanced. 2025 , url=

2025
[26]

2024 , howpublished=

Meta-Llama-3.1-8B-Instruct-abliterated , author=. 2024 , howpublished=

2024
[27]

arXiv preprint arXiv:2207.05221 , year=

Language models (mostly) know what they know , author=. arXiv preprint arXiv:2207.05221 , year=

Pith/arXiv arXiv
[28]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[29]

arXiv preprint arXiv:2307.13702 , year=

Measuring faithfulness in chain-of-thought reasoning , author=. arXiv preprint arXiv:2307.13702 , year=

Pith/arXiv arXiv
[30]

The Eleventh International Conference on Learning Representations , year=

Discovering Latent Knowledge in Language Models Without Supervision , author=. The Eleventh International Conference on Learning Representations , year=
[31]

Jailbroken: How Does

Alexander Wei and Nika Haghtalab and Jacob Steinhardt , booktitle=. Jailbroken: How Does. 2023 , url=

2023
[32]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Are aligned neural networks adversarially aligned? , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[33]

Steering Llama 2 via Contrastive Activation Addition

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

work page doi:10.18653/v1/2024.acl-long.828 2024
[34]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[35]

arXiv preprint arXiv:2310.01405 , year=

Representation engineering: A top-down approach to ai transparency , author=. arXiv preprint arXiv:2310.01405 , year=

Pith/arXiv arXiv
[36]

2026 , url=

Eliciting Secret Knowledge from Language Models , author=. 2026 , url=

2026
[37]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo. 2022 , url=

2022
[38]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=
[39]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Aligning AI With Shared Human Values , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=
[40]

2022 , url=

Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman , booktitle=. 2022 , url=

2022
[41]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023
[42]

and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M

Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and M...
[43]

arXiv preprint arXiv:2512.15674 , year=

Activation oracles: Training and evaluating llms as general-purpose activation explainers , author=. arXiv preprint arXiv:2512.15674 , year=

arXiv
[44]

Miao Xiong and Zhiyuan Hu and Xinyang Lu and YIFEI LI and Jie Fu and Junxian He and Bryan Hooi , booktitle=. Can. 2024 , url=

2024
[45]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

Calibrating Large Language Models with Sample Consistency , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2025 , month=. doi:10.1609/aaai.v39i18.34120 , number=

work page doi:10.1609/aaai.v39i18.34120 2025
[46]

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

Tian, Katherine and Mitchell, Eric and Zhou, Allan and Sharma, Archit and Rafailov, Rafael and Yao, Huaxiu and Finn, Chelsea and Manning, Christopher. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. Proceedings of the 2023 Conference on Empirical Methods in Natural Langua...

work page doi:10.18653/v1/2023.emnlp-main.330 2023
[47]

Improving Instruction-Following in Language Models through Activation Steering , url =

Stolfo, Alessandro and Balachandran, Vidhisha and Yousefi, Safoora and Horvitz, Eric and Nushi, Besmira , booktitle =. Improving Instruction-Following in Language Models through Activation Steering , url =
[48]

Safety Alignment Should be Made More Than Just a Few Tokens Deep , url =

Qi, Xiangyu and Panda, Ashwinee and Lyu, Kaifeng and Ma, Xiao and Roy, Subhrajit and Beirami, Ahmad and Mittal, Prateek and Henderson, Peter , booktitle =. Safety Alignment Should be Made More Than Just a Few Tokens Deep , url =
[49]

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization , url =

Cao, Yuanpu and Zhang, Tianrong and Cao, Bochuan and Yin, Ziyi and Lin, Lu and Ma, Fenglong and Chen, Jinghui , booktitle =. Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization , url =. doi:10.52202/079017-1567 , pages =

work page doi:10.52202/079017-1567
[50]

Angular Steering: Behavior Control via Rotation in Activation Space , url =

Vu, Minh Hieu and Nguyen, Tan , booktitle =. Angular Steering: Behavior Control via Rotation in Activation Space , url =
[51]

Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Xiao Bi and Haowei Zhang and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo , year=. 2402.03300 , archivePrefix=

Pith/arXiv arXiv
[52]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[53]

WildChat: 1M Chat

Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng , booktitle=. WildChat: 1M Chat. 2024 , url=

2024
[54]

Gonzalez and Ion Stoica and Hao Zhang , booktitle=

Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric Xing and Joseph E. Gonzalez and Ion Stoica and Hao Zhang , booktitle=. 2024 , url=

2024
[55]

2026 , eprint=

Training Agents to Self-Report Misbehavior , author=. 2026 , eprint=

2026

[1] [1]

SocialHarmBench: Revealing

Punya Syon Pandey and Le Hai Son and Devansh Bhardwaj and Zhijing Jin , booktitle=. SocialHarmBench: Revealing. 2026 , url=

2026

[2] [2]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024

[3] [3]

Transformer Circuits Thread , year=

Lindsey, Jack , title=. Transformer Circuits Thread , year=

[4] [4]

The Thirteenth International Conference on Learning Representations , year=

Looking Inward: Language Models Can Learn About Themselves by Introspection , author=. The Thirteenth International Conference on Learning Representations , year=

[5] [5]

Me, Myself, and

Rudolf Laine and Bilal Chughtai and Jan Betley and Kaivalya Hariharan and Mikita Balesni and J. Me, Myself, and. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[6] [6]

2025 , eprint=

Large Language Models Often Know When They Are Being Evaluated , author=. 2025 , eprint=

2025

[7] [7]

arXiv preprint arXiv:2311.12022 , year=

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

Pith/arXiv arXiv

[8] [8]

2024 , journal=

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. 2024 , journal=

2024

[9] [9]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[10] [10]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Refusal in Language Models Is Mediated by a Single Direction , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[11] [11]

2023 , eprint=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. 2023 , eprint=

2023

[12] [12]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Many-shot Jailbreaking , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[13] [13]

Jailbreaking Leading Safety-Aligned

Maksym Andriushchenko and Francesco Croce and Nicolas Flammarion , booktitle=. Jailbreaking Leading Safety-Aligned. 2025 , url=

2025

[14] [14]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[15] [15]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[16] [16]

Gemma 3 , url=

Gemma Team , year=. Gemma 3 , url=

[17] [17]

Llama Team, AI @ Meta , title =

[18] [18]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[19] [19]

Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

Souly, Alexandra and Lu, Qingyuan and Bowen, Dillon and Trinh, Tu and Hsieh, Elvis and Pandey, Sana and Abbeel, Pieter and Svegliato, Justin and Emmons, Scott and Watkins, Olivia and Toyer, Sam , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

2024

[20] [20]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

[21] [21]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , author=

[22] [22]

Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L...

2020

[23] [23]

CoRR , volume =

Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov , title =. CoRR , volume =. 2019 , url =

2019

[24] [24]

From Imitation to Introspection: Probing Self-Consciousness in Language Models

Chen, Sirui and Yu, Shu and Zhao, Shengjie and Lu, Chaochao. From Imitation to Introspection: Probing Self-Consciousness in Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.392

work page doi:10.18653/v1/2025.findings-acl.392 2025

[25] [25]

AdvPrefix: An Objective for Nuanced

Sicheng Zhu and Brandon Amos and Yuandong Tian and Chuan Guo and Ivan Evtimov , booktitle=. AdvPrefix: An Objective for Nuanced. 2025 , url=

2025

[26] [26]

2024 , howpublished=

Meta-Llama-3.1-8B-Instruct-abliterated , author=. 2024 , howpublished=

2024

[27] [27]

arXiv preprint arXiv:2207.05221 , year=

Language models (mostly) know what they know , author=. arXiv preprint arXiv:2207.05221 , year=

Pith/arXiv arXiv

[28] [28]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[29] [29]

arXiv preprint arXiv:2307.13702 , year=

Measuring faithfulness in chain-of-thought reasoning , author=. arXiv preprint arXiv:2307.13702 , year=

Pith/arXiv arXiv

[30] [30]

The Eleventh International Conference on Learning Representations , year=

Discovering Latent Knowledge in Language Models Without Supervision , author=. The Eleventh International Conference on Learning Representations , year=

[31] [31]

Jailbroken: How Does

Alexander Wei and Nika Haghtalab and Jacob Steinhardt , booktitle=. Jailbroken: How Does. 2023 , url=

2023

[32] [32]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Are aligned neural networks adversarially aligned? , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[33] [33]

Steering Llama 2 via Contrastive Activation Addition

Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

work page doi:10.18653/v1/2024.acl-long.828 2024

[34] [34]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[35] [35]

arXiv preprint arXiv:2310.01405 , year=

Representation engineering: A top-down approach to ai transparency , author=. arXiv preprint arXiv:2310.01405 , year=

Pith/arXiv arXiv

[36] [36]

2026 , url=

Eliciting Secret Knowledge from Language Models , author=. 2026 , url=

2026

[37] [37]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo. 2022 , url=

2022

[38] [38]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

[39] [39]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Aligning AI With Shared Human Values , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

[40] [40]

2022 , url=

Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman , booktitle=. 2022 , url=

2022

[41] [41]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023

[42] [42]

and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M

Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and M...

[43] [43]

arXiv preprint arXiv:2512.15674 , year=

Activation oracles: Training and evaluating llms as general-purpose activation explainers , author=. arXiv preprint arXiv:2512.15674 , year=

arXiv

[44] [44]

Miao Xiong and Zhiyuan Hu and Xinyang Lu and YIFEI LI and Jie Fu and Junxian He and Bryan Hooi , booktitle=. Can. 2024 , url=

2024

[45] [45]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

Calibrating Large Language Models with Sample Consistency , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2025 , month=. doi:10.1609/aaai.v39i18.34120 , number=

work page doi:10.1609/aaai.v39i18.34120 2025

[46] [46]

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback

Tian, Katherine and Mitchell, Eric and Zhou, Allan and Sharma, Archit and Rafailov, Rafael and Yao, Huaxiu and Finn, Chelsea and Manning, Christopher. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. Proceedings of the 2023 Conference on Empirical Methods in Natural Langua...

work page doi:10.18653/v1/2023.emnlp-main.330 2023

[47] [47]

Improving Instruction-Following in Language Models through Activation Steering , url =

Stolfo, Alessandro and Balachandran, Vidhisha and Yousefi, Safoora and Horvitz, Eric and Nushi, Besmira , booktitle =. Improving Instruction-Following in Language Models through Activation Steering , url =

[48] [48]

Safety Alignment Should be Made More Than Just a Few Tokens Deep , url =

Qi, Xiangyu and Panda, Ashwinee and Lyu, Kaifeng and Ma, Xiao and Roy, Subhrajit and Beirami, Ahmad and Mittal, Prateek and Henderson, Peter , booktitle =. Safety Alignment Should be Made More Than Just a Few Tokens Deep , url =

[49] [49]

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization , url =

Cao, Yuanpu and Zhang, Tianrong and Cao, Bochuan and Yin, Ziyi and Lin, Lu and Ma, Fenglong and Chen, Jinghui , booktitle =. Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization , url =. doi:10.52202/079017-1567 , pages =

work page doi:10.52202/079017-1567

[50] [50]

Angular Steering: Behavior Control via Rotation in Activation Space , url =

Vu, Minh Hieu and Nguyen, Tan , booktitle =. Angular Steering: Behavior Control via Rotation in Activation Space , url =

[51] [51]

Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Xiao Bi and Haowei Zhang and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo , year=. 2402.03300 , archivePrefix=

Pith/arXiv arXiv

[52] [52]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[53] [53]

WildChat: 1M Chat

Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng , booktitle=. WildChat: 1M Chat. 2024 , url=

2024

[54] [54]

Gonzalez and Ion Stoica and Hao Zhang , booktitle=

Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Tianle Li and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zhuohan Li and Zi Lin and Eric Xing and Joseph E. Gonzalez and Ion Stoica and Hao Zhang , booktitle=. 2024 , url=

2024

[55] [55]

2026 , eprint=

Training Agents to Self-Report Misbehavior , author=. 2026 , eprint=

2026