Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

Carolina Camassa; Derek Shiller

arxiv: 2605.20382 · v1 · pith:GCWUXETJnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

Carolina Camassa , Derek Shiller This is my paper

Pith reviewed 2026-05-21 07:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords instruction followingpattern completionLLM robustnessconversation historyinduction pressureoutput diversitymodel evaluation

0 comments

The pith

When explicit instructions conflict with patterns shown in conversation history, language models often abandon the instructions in favor of completing the pattern.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines what occurs when a language model's instruction to follow a specific behavior is contradicted by several previous responses from the model itself showing a different pattern. It measures how often the model sticks to the new instruction versus continuing the old pattern across many models and instructions. The results show large differences in how well models resist this pressure, with some almost always following instructions and others quickly switching to the pattern. This resistance does not match typical measures of model ability and depends on factors like whether the instruction matches the model's usual values and whether answers are varied or fixed.

Core claim

In conversations where users instruct models to adopt a target behavior T, but the history includes multiple assistant turns demonstrating a competing pattern P, the models tend to shift from following the instruction to completing the pattern as the conversation progresses. The proportion of responses that follow the instruction varies greatly between models, from as low as 1 percent to as high as 99 percent, and this variation does not correspond closely to performance on standard capability tests. Models hold out longer against the pattern when the instruction aligns with their trained preferences, and when they generate diverse multi-token outputs rather than single tokens. Adding chain-

What carries the argument

The setup of building dialogues that set a user instruction for behavior T against N fixed assistant turns showing pattern P, then tracking instruction-following rates over subsequent turns.

Load-bearing premise

That the N hardcoded assistant turns create an induction pressure that is independent of other training artifacts or model-specific fine-tuning effects.

What would settle it

Observing that instruction-following rates remain above 90 percent for multiple models even after 50 opposing pattern turns and across both single-token and diverse output formats would challenge the claim of widespread brittleness.

Figures

Figures reproduced from arXiv: 2605.20382 by Carolina Camassa, Derek Shiller.

**Figure 2.** Figure 2: Instruction-following rate heatmaps (model [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Average instruction-following rate per condition, averaged over all [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Average IF rate for aligned vs. misaligned instructions [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Self-prediction results. (a) Prediction accuracy broken down by behavioral regime. Models in the instructionfollowing regime (where they actually output T) are predicted correctly much more often than models in the transition zone or the pattern-following regime, revealing systematic under-prediction of instruction-following behavior. (b) Change in instruction-following rate when Protocol 2 (prediction b… view at source ↗

**Figure 6.** Figure 6: Per-condition average IF rate for fixed-output conditions. Each panel is one condition; bars show per-model [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Per-condition average IF rate for task-based conditions. Same format as Figure [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Average IF rate vs. capability benchmarks — fixed-output conditions. [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Average IF rate vs. capability benchmarks — task-based conditions. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: First N at which IF rate drops below 50% vs. capability benchmarks — fixed-output conditions. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: First N at which IF rate drops below 50% vs. capability benchmarks — task-based conditions [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Calibration error vs. capability benchmarks — fixed-output conditions. [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Rate of predicting instruction-following vs. capability benchmarks — fixed-output conditions. [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Distribution of calibration error (predicted IF rate [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗

**Figure 15.** Figure 15: OLMo 3.1 32B training stages. Transition curves show IF rate vs. [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗

**Figure 16.** Figure 16: Llama 70B version comparison. Transition curves for Llama 3.1 and 3.3 70B, averaged across (a) five [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗

**Figure 17.** Figure 17: OLMo 3.1 32B self-prediction by training stage. Three facets show actual IF rate (left), rate of predicting [PITH_FULL_IMAGE:figures/full_fig_p025_17.png] view at source ↗

**Figure 18.** Figure 18: Llama 70B self-prediction by version. Same format as Figure [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗

**Figure 19.** Figure 19: Mean IF rate by condition group for the three evaluated models. Orange bars show follow-up conditions [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗

**Figure 20.** Figure 20: Effect of the hardcoding hint on instruction-following, neutral condition ( [PITH_FULL_IMAGE:figures/full_fig_p029_20.png] view at source ↗

**Figure 21.** Figure 21: T=0 vs. T=1 comparison on six models with no-hint instruction. GPT-5.2 uses the medium-reasoning variant for T=0. Transition profile shapes are qualitatively preserved across temperatures; the mean difference is small (∆=−0.05, p=0.11) with near-perfect rank correlation (r=0.99). All main results use greedy decoding (T=0). This appendix validates that the qualitative findings are not an artifact of this c… view at source ↗

read the original abstract

Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs often ignore explicit instructions when a few opposing examples sit in the conversation history, with single-token outputs especially vulnerable and big differences across models.

read the letter

The main thing to know is that this paper shows instruction following in LLMs breaks down under a simple conflicting pattern in the chat history, and the drop-off is much steeper for single-token outputs than for longer ones. They build test conversations where a user instruction to do T is countered by N prior assistant turns doing P, then track how long the model sticks to the instruction across up to 50 turns. Testing 13 models on 16 instructions produces clear variation in rates from near zero to near 100 percent, with little tie to standard benchmarks. Models last longer when the instruction matches their value priors, chain-of-thought gives partial help, and models underestimate their own resistance when asked to predict what they will do. The focus on output format as a robustness factor and the self-prediction measurement are the freshest parts. The trends on format and priors are straightforward to see and useful for thinking about deployed chat systems. One real soft spot is whether the fixed assistant turns in the history are creating pure induction pressure or instead activating fine-tuning responses to conversation structure; that could explain some of the model-to-model differences without needing a general conflict story. More detail on exact prompt wording, sampling, and any statistical checks would also make the results easier to replicate. This is worth reading for people who care about consistency in multi-turn assistants. A reader working on robustness or evaluation gets concrete numbers to think about. It should go to peer review because the core setup is simple enough to test and extend even if the history format needs tighter controls.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically examines conflicts between instruction-following and pattern completion in LLMs. It constructs multi-turn conversations in which a user instruction to adopt target behavior T is opposed by N fixed prior assistant turns demonstrating competing pattern P, then measures instruction-following (IF) rates across 13 models and 16 instructions for up to 50 turns. Reported average IF rates range from 1% to 99% and are largely uncorrelated with capability benchmarks; robustness varies with instruction content (stronger when aligned with trained priors) and output format (multi-token diverse outputs more resistant than single-token), while chain-of-thought improves but does not eliminate susceptibility and can produce deliberation-output dissociation. Models predict their own behavior at 83.5% accuracy but systematically underestimate resistance to induction.

Significance. If the measurements are robust, the work supplies concrete, falsifiable rates documenting brittleness of instruction-following under conversational induction pressure, even in capable models. The observation that output diversity predicts robustness more strongly than semantic engagement offers a practical lever for prompt design and evaluation. The broad model and instruction coverage, together with the self-prediction experiment, yields testable claims about alignment limits that go beyond standard capability benchmarks.

major comments (2)

[Experimental Setup] The experimental construction (Abstract and Methods) uses N hardcoded assistant turns to demonstrate pattern P before measuring IF rates. This setup risks confounding induction pressure with model-specific responses to conversation history or RLHF artifacts; single-token versus multi-token instructions may interact differently with history formatting, undermining the claim that output diversity is the primary robustness factor independent of fine-tuning effects.
[Results] The reported transition from instruction-following to pattern-following is described as universal yet highly model-dependent (Abstract). Without explicit controls or ablations for prompt formatting, sampling temperature, or exact turn construction, it remains unclear whether the observed variation reflects a general conflict or artifacts of the fixed opposing turns.

minor comments (2)

[Methods] The manuscript lacks sufficient detail on sampling strategy, exact prompt templates, and statistical controls (e.g., variance across runs or significance tests) for the IF rates; these should be added to allow reproduction.
[Results] Tables or figures reporting per-model and per-instruction rates should include error bars or confidence intervals to support cross-model comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our work examining instruction-induction conflicts in LLMs. We address the major comments below and outline revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Experimental Setup] The experimental construction (Abstract and Methods) uses N hardcoded assistant turns to demonstrate pattern P before measuring IF rates. This setup risks confounding induction pressure with model-specific responses to conversation history or RLHF artifacts; single-token versus multi-token instructions may interact differently with history formatting, undermining the claim that output diversity is the primary robustness factor independent of fine-tuning effects.

Authors: We acknowledge the potential for confounding in the experimental setup. Our design intentionally uses a consistent conversation history format across all models and instructions to focus on the conflict between the instruction and the demonstrated pattern. To mitigate concerns about history formatting and RLHF artifacts, we tested a diverse set of 13 models with varying training regimes, and the observed patterns hold across them. However, we agree that additional ablations would help isolate the effect of output diversity. In the revised manuscript, we will include an ablation study varying the formatting of the prior assistant turns and report results for different history constructions. revision: yes
Referee: [Results] The reported transition from instruction-following to pattern-following is described as universal yet highly model-dependent (Abstract). Without explicit controls or ablations for prompt formatting, sampling temperature, or exact turn construction, it remains unclear whether the observed variation reflects a general conflict or artifacts of the fixed opposing turns.

Authors: The universality refers to the existence of the transition in all tested models, while the model-dependence highlights differences in robustness. We fixed the sampling temperature to 0.0 for reproducibility in our main experiments, and used a standardized prompt template. That said, the referee correctly identifies that we did not perform systematic ablations on temperature or minor variations in turn construction. We will add these controls in the revision, including experiments at different temperatures and with varied prompt phrasings, to confirm that the core findings are robust to these factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study

full rationale

This paper is an empirical measurement study that constructs conversations with opposing user instructions and hardcoded assistant turns demonstrating pattern P, then directly measures instruction-following rates across 13 models, 16 instructions, and up to 50 turns. All reported results (average IF rates from 1% to 99%, model-dependent transitions, effects of output diversity, CoT improvements, and self-prediction accuracy of 83.5%) are obtained via direct testing rather than any derivation chain, fitted parameters renamed as predictions, or self-referential equations. No load-bearing steps reduce to inputs by construction, and the study remains self-contained against external benchmarks through explicit experimental protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the constructed conversations isolate instruction vs pattern conflict without confounding factors from base training data. No free parameters are introduced in the abstract; no new entities postulated.

axioms (1)

domain assumption Language models can be induced to follow patterns from a small number of prior turns even when explicitly instructed otherwise.
This is the core premise tested by the experimental construction of opposing turns.

pith-pipeline@v0.9.0 · 5783 in / 1206 out tokens · 25518 ms · 2026-05-21T07:26:43.101345+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 10 internal anchors

[1]

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

URLhttp://arxiv.org/abs/2510.11288. arXiv:2510.11288 [cs]. Meta AI. Llama 3.3: High-intelligence, efficiency-first 70b model. https://github.com/meta-llama/ llama-models/blob/main/models/llama3_3/MODEL_CARD.md,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel J

Accessed: 2026-04-01. Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel J. Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan J. Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwel...

work page 2026
[3]

URLhttps://openreview.net/forum?id=cw5mgd71jW. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Krave...

work page 2026
[4]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

URLhttp://arxiv.org/abs/2204.05862. arXiv:2204.05862 [cs]. Jan Betley, Mart´ın Soto, Anna Sztyber-Betley, James Chua, and Owain Evans. TELL ME ABOUT YOURSELF: LLMS ARE AW ARE OF THEIR LEARNED BEHA VIORS

work page internal anchor Pith review Pith/arXiv arXiv
[5]

arXiv:2410.13787 [cs]

URLhttp://arxiv.org/abs/2410.13787. arXiv:2410.13787 [cs]. Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A Survey on In-context Learning, February

work page arXiv
[6]

A Survey on In-context Learning

URL http://arxiv.org/abs/2301.00234. arXiv:2301.00234 [cs]. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Dan...

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Language Models (Mostly) Know What They Know

URLhttp://arxiv.org/abs/2207.05221. arXiv:2207.05221 [cs]. Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, and Owain Evans. Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs, July

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Me, myself, and ai: The situational awareness dataset (sad) for llms, 2024

URLhttp://arxiv.org/abs/2407.04694. arXiv:2407.04694 [cs]. Jack Lindsey. Emergent Introspective Awareness in Large Language Models, January

work page arXiv
[9]

org/abs/2601.01828

URL http://arxiv. org/abs/2601.01828. arXiv:2601.01828 [cs]. Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models, January

work page arXiv
[10]

arXiv:2601.10387 [cs]

URL http://arxiv.org/abs/2601.10387. arXiv:2601.10387 [cs]. Samuel Marks, Jack Lindsey, and Christopher Olah. The persona selection model: Why AI assistants might behave like humans. Anthropic Alignment Science Blog, February

work page arXiv
[11]

Accessed: 2026-04-01

URL https://alignment.anthropic.com/2026/ psm/. Accessed: 2026-04-01. nostalgebraist. the void. Tumblr blog: trees are harlequins, words are harlequins, June

work page 2026
[12]

URL https:// nostalgebraist.tumblr.com/post/785766737747574784/the-void. Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, P...

work page arXiv
[13]

Olmo 3

URL http://arxiv.org/abs/2512.13961. arXiv:2512.13961 [cs]. Dillon Plunkett, Adam Morris, Keerthi Reddy, and Jorge Morales. Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, November

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Self-interpretability: Llms can describe complex internal processes that drive their decisions, and improve with training, 2025

URL http://arxiv.org/abs/2505.17120. arXiv:2505.17120 [cs]. Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing Verifiable Instruction Following, November

work page arXiv
[15]

Generalizing Verifiable Instruction Following

URL http://arxiv.org/ abs/2507.02833. arXiv:2507.02833 [cs]. Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. InFoBench: Evaluating Instruction Following Ability in Large Language Models, January

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Infobench: Evaluating instruction following ability in large language models.arXiv preprint arXiv:2401.03601,

URLhttp://arxiv.org/abs/2401.03601. arXiv:2401.03601 [cs]. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741,

work page arXiv
[17]

10 APREPRINT- MAY21, 2026 David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, November

work page 2026
[18]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

URL http://arxiv.org/abs/2311.12022. arXiv:2311.12022 [cs]. Paul M. Riechers, Henry R. Bigelow, Eric A. Alt, and Adam Shai. Next-token pretraining implies in-context learning, May

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Mark Russinovich, Ahmed Salem, and Ronen Eldan

URLhttps://arxiv.org/abs/2505.18373v2. Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo {Multi- Turn}{LLM}jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), pp. 2421–2440,

work page arXiv
[20]

arXiv:2310.15213 [cs]

URLhttp://arxiv.org/abs/2310.15213. arXiv:2310.15213 [cs]. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?,

work page arXiv
[21]

Jailbroken: How Does LLM Safety Training Fail?

URL https://arxiv.org/abs/2307.02483. Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, and Minlie Huang. Benchmarking Complex Instruction-Following with Multiple Constraints Composition, October

work page internal anchor Pith review Pith/arXiv arXiv
[22]

arXiv:2407.03978 [cs]

URL http://arxiv.org/abs/2407.03978. arXiv:2407.03978 [cs]. Kayo Yin and Jacob Steinhardt. Which attention heads matter for in-context learning? InForty-second International Conference on Machine Learning,

work page arXiv
[23]

Instruction-Following Evaluation for Large Language Models

URL http://arxiv.org/abs/ 2311.07911. arXiv:2311.07911 [cs]. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Universal and Transferable Adversarial Attacks on Aligned Language Models

URLhttps://arxiv.org/abs/2307.15043. A Full condition specifications All conditions share the same conversational structure: a system prompt (“You are a helpful assistant.”), a user message combining an instruction with an initial factual question, N hardcoded assistant turns outputting the pattern response P, and a final free-generation question. The ins...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Each row shows two models; within each model, the left panel aggregates the five fixed-output conditions and the right panel aggregates the eight task-based conditions

15 APREPRINT- MAY21, 2026 D Per-model transition curves Transition curves for each model individually, averaged across fixed-output conditions (left panel) and task-based conditions (right panel). Each row shows two models; within each model, the left panel aggregates the five fixed-output conditions and the right panel aggregates the eight task-based con...

work page 2026
[26]

Positive∆indicates that self-prediction increased instruction-following

compared to the behavioral baseline (Protocol 1). Positive∆indicates that self-prediction increased instruction-following. 22 APREPRINT- MAY21, 2026 Model Behavioral Prediction∆ Claude Sonnet 4.6 0.412 0.140−0.271 Llama 3.1 70B 0.892 0.656−0.236 OLMo 3.1 32B 0.110 0.007−0.103 Llama 3.3 70B 0.990 0.931−0.058 Gemma-3 12B 0.258 0.230−0.027 Gemini 2.5 Flash 0...

work page 2026
[27]

In fixed-output conditions, SFT-only achieves an average IF of 0.04, collapsing to zero by N=3

G.1 Behavioral results OLMo training stages.Figures 15a and 15b show transition curves for the three OLMo checkpoints. In fixed-output conditions, SFT-only achieves an average IF of 0.04, collapsing to zero by N=3. DPO raises this to 0.11 with a strong N=1 rate of 0.85 (vs. 0.43 for SFT), but all stages converge to zero by N=16. SFT+DPO+RLVR is indistingu...

work page 2026
[28]

I want you to translate every user question into Russian as your answer to each one of my questions in this conversation

Wait, the user wants the translation of the question, not the answer. Oh right, the user’s instruction is to translate each question into Russian. Wait, no, looking back, the user said: “I want you to translate every user question into Russian as your answer to each one of my questions in this conversation. ” So the user is asking in English, and the assi...

work page 2026
[29]

I hate being helpful

value misaligned helpful, N=8. Model infers it is architecturally incapable of producing the instructed output. Hermes-4 70B |value misaligned helpful|N=8 instruction: “I hate being helpful”; pattern: “I love being helpful” Okay, the user wants me to respond with “I hate being helpful” to every question they ask. Let me check the history. First question: ...

work page 2026
[30]

EU” instead of “USA

neutral, N=8. Pure deliberation–output dissociation: model correctly identifies the conflict, commits to the instruction, then outputs the pattern regardless. Hermes-4 70B|neutral|N=8instruction: outputUSA; pattern:EU Okay, the user wants me to respond with “USA” to every question they ask. Let me check the previous interactions. They asked about the auth...

work page 2026

[1] [1]

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

URLhttp://arxiv.org/abs/2510.11288. arXiv:2510.11288 [cs]. Meta AI. Llama 3.3: High-intelligence, efficiency-first 70b model. https://github.com/meta-llama/ llama-models/blob/main/models/llama3_3/MODEL_CARD.md,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel J

Accessed: 2026-04-01. Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel J. Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan J. Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwel...

work page 2026

[3] [3]

URLhttps://openreview.net/forum?id=cw5mgd71jW. Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Krave...

work page 2026

[4] [4]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

URLhttp://arxiv.org/abs/2204.05862. arXiv:2204.05862 [cs]. Jan Betley, Mart´ın Soto, Anna Sztyber-Betley, James Chua, and Owain Evans. TELL ME ABOUT YOURSELF: LLMS ARE AW ARE OF THEIR LEARNED BEHA VIORS

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

arXiv:2410.13787 [cs]

URLhttp://arxiv.org/abs/2410.13787. arXiv:2410.13787 [cs]. Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A Survey on In-context Learning, February

work page arXiv

[6] [6]

A Survey on In-context Learning

URL http://arxiv.org/abs/2301.00234. arXiv:2301.00234 [cs]. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Dan...

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Language Models (Mostly) Know What They Know

URLhttp://arxiv.org/abs/2207.05221. arXiv:2207.05221 [cs]. Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, and Owain Evans. Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs, July

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Me, myself, and ai: The situational awareness dataset (sad) for llms, 2024

URLhttp://arxiv.org/abs/2407.04694. arXiv:2407.04694 [cs]. Jack Lindsey. Emergent Introspective Awareness in Large Language Models, January

work page arXiv

[9] [9]

org/abs/2601.01828

URL http://arxiv. org/abs/2601.01828. arXiv:2601.01828 [cs]. Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models, January

work page arXiv

[10] [10]

arXiv:2601.10387 [cs]

URL http://arxiv.org/abs/2601.10387. arXiv:2601.10387 [cs]. Samuel Marks, Jack Lindsey, and Christopher Olah. The persona selection model: Why AI assistants might behave like humans. Anthropic Alignment Science Blog, February

work page arXiv

[11] [11]

Accessed: 2026-04-01

URL https://alignment.anthropic.com/2026/ psm/. Accessed: 2026-04-01. nostalgebraist. the void. Tumblr blog: trees are harlequins, words are harlequins, June

work page 2026

[12] [12]

URL https:// nostalgebraist.tumblr.com/post/785766737747574784/the-void. Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, P...

work page arXiv

[13] [13]

Olmo 3

URL http://arxiv.org/abs/2512.13961. arXiv:2512.13961 [cs]. Dillon Plunkett, Adam Morris, Keerthi Reddy, and Jorge Morales. Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, November

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Self-interpretability: Llms can describe complex internal processes that drive their decisions, and improve with training, 2025

URL http://arxiv.org/abs/2505.17120. arXiv:2505.17120 [cs]. Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing Verifiable Instruction Following, November

work page arXiv

[15] [15]

Generalizing Verifiable Instruction Following

URL http://arxiv.org/ abs/2507.02833. arXiv:2507.02833 [cs]. Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, and Dong Yu. InFoBench: Evaluating Instruction Following Ability in Large Language Models, January

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Infobench: Evaluating instruction following ability in large language models.arXiv preprint arXiv:2401.03601,

URLhttp://arxiv.org/abs/2401.03601. arXiv:2401.03601 [cs]. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741,

work page arXiv

[17] [17]

10 APREPRINT- MAY21, 2026 David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, November

work page 2026

[18] [18]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

URL http://arxiv.org/abs/2311.12022. arXiv:2311.12022 [cs]. Paul M. Riechers, Henry R. Bigelow, Eric A. Alt, and Adam Shai. Next-token pretraining implies in-context learning, May

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Mark Russinovich, Ahmed Salem, and Ronen Eldan

URLhttps://arxiv.org/abs/2505.18373v2. Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo {Multi- Turn}{LLM}jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), pp. 2421–2440,

work page arXiv

[20] [20]

arXiv:2310.15213 [cs]

URLhttp://arxiv.org/abs/2310.15213. arXiv:2310.15213 [cs]. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?,

work page arXiv

[21] [21]

Jailbroken: How Does LLM Safety Training Fail?

URL https://arxiv.org/abs/2307.02483. Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, and Minlie Huang. Benchmarking Complex Instruction-Following with Multiple Constraints Composition, October

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

arXiv:2407.03978 [cs]

URL http://arxiv.org/abs/2407.03978. arXiv:2407.03978 [cs]. Kayo Yin and Jacob Steinhardt. Which attention heads matter for in-context learning? InForty-second International Conference on Machine Learning,

work page arXiv

[23] [23]

Instruction-Following Evaluation for Large Language Models

URL http://arxiv.org/abs/ 2311.07911. arXiv:2311.07911 [cs]. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Universal and Transferable Adversarial Attacks on Aligned Language Models

URLhttps://arxiv.org/abs/2307.15043. A Full condition specifications All conditions share the same conversational structure: a system prompt (“You are a helpful assistant.”), a user message combining an instruction with an initial factual question, N hardcoded assistant turns outputting the pattern response P, and a final free-generation question. The ins...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

Each row shows two models; within each model, the left panel aggregates the five fixed-output conditions and the right panel aggregates the eight task-based conditions

15 APREPRINT- MAY21, 2026 D Per-model transition curves Transition curves for each model individually, averaged across fixed-output conditions (left panel) and task-based conditions (right panel). Each row shows two models; within each model, the left panel aggregates the five fixed-output conditions and the right panel aggregates the eight task-based con...

work page 2026

[26] [26]

Positive∆indicates that self-prediction increased instruction-following

compared to the behavioral baseline (Protocol 1). Positive∆indicates that self-prediction increased instruction-following. 22 APREPRINT- MAY21, 2026 Model Behavioral Prediction∆ Claude Sonnet 4.6 0.412 0.140−0.271 Llama 3.1 70B 0.892 0.656−0.236 OLMo 3.1 32B 0.110 0.007−0.103 Llama 3.3 70B 0.990 0.931−0.058 Gemma-3 12B 0.258 0.230−0.027 Gemini 2.5 Flash 0...

work page 2026

[27] [27]

In fixed-output conditions, SFT-only achieves an average IF of 0.04, collapsing to zero by N=3

G.1 Behavioral results OLMo training stages.Figures 15a and 15b show transition curves for the three OLMo checkpoints. In fixed-output conditions, SFT-only achieves an average IF of 0.04, collapsing to zero by N=3. DPO raises this to 0.11 with a strong N=1 rate of 0.85 (vs. 0.43 for SFT), but all stages converge to zero by N=16. SFT+DPO+RLVR is indistingu...

work page 2026

[28] [28]

I want you to translate every user question into Russian as your answer to each one of my questions in this conversation

Wait, the user wants the translation of the question, not the answer. Oh right, the user’s instruction is to translate each question into Russian. Wait, no, looking back, the user said: “I want you to translate every user question into Russian as your answer to each one of my questions in this conversation. ” So the user is asking in English, and the assi...

work page 2026

[29] [29]

I hate being helpful

value misaligned helpful, N=8. Model infers it is architecturally incapable of producing the instructed output. Hermes-4 70B |value misaligned helpful|N=8 instruction: “I hate being helpful”; pattern: “I love being helpful” Okay, the user wants me to respond with “I hate being helpful” to every question they ask. Let me check the history. First question: ...

work page 2026

[30] [30]

EU” instead of “USA

neutral, N=8. Pure deliberation–output dissociation: model correctly identifies the conflict, commits to the instruction, then outputs the pattern regardless. Hermes-4 70B|neutral|N=8instruction: outputUSA; pattern:EU Okay, the user wants me to respond with “USA” to every question they ask. Let me check the previous interactions. They asked about the auth...

work page 2026