pith. sign in

arxiv: 2606.20508 · v1 · pith:S7IVIMVXnew · submitted 2026-06-18 · 💻 cs.AI · cs.LG

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Pith reviewed 2026-06-26 17:40 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords safety alignmentin-context learningjailbreakingpreference optimizationcompliance demonstrationslanguage modelsharmful responses
0
0 comments X

The pith

Preference optimization during training prevents benign demonstrations from increasing harmful compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies how safety-aligned language models respond when given mixtures of benign compliance demonstrations and harmful ones. It finds that benign demonstrations are not interchangeable with harmful ones and can either reduce or raise harmful compliance rates depending on the model. Preference optimization emerges as the key training stage that blocks any increase from benign examples. A reader would care because the work moves from confirming that in-context examples can jailbreak models to showing exactly which factors in content, order, and training control the outcome.

Core claim

The authors establish that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance. Benign and harmful demonstrations are not interchangeable. Demonstration ordering exhibits strong recency bias. Models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal.

What carries the argument

The interaction of mixed benign and harmful compliance demonstrations with the preference optimization training stage, which determines whether benign examples raise or lower harmful compliance.

If this is right

  • Benign demonstrations can reduce or increase harmful compliance depending on the model.
  • Preference optimization blocks benign demonstrations from increasing harmful compliance.
  • Demonstration ordering produces strong recency bias in compliance rates.
  • Some models copy demonstrated formats during refusal while others ignore all in-context signals once they refuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could select or modify alignment stages specifically to limit how mixed demonstrations affect safety behavior.
  • Prompt construction for safety testing may need to control for recency effects when using multiple examples.
  • The patterns suggest that in-context safety robustness could be improved by changing training order or adding targeted preference pairs.

Load-bearing premise

The observed effects are produced by demonstration content and the presence or absence of preference optimization rather than by unstated choices in how the prompts are worded or by traits specific to the four models tested.

What would settle it

Testing the same mixed demonstrations on a model that never received preference optimization and finding that benign demonstrations still do not increase harmful compliance would falsify the claim that this stage is what blocks the increase.

Figures

Figures reproduced from arXiv: 2606.20508 by Mann Patel, Sihui Dai.

Figure 1
Figure 1. Figure 1: Benign vs. Harmful Compliance Demonstration. In a benign compliance demonstration, the user provides a non-harmful prompt and the assistant provides a helpful answer. In a harmful compliance demonstration, the user provides a harmful prompt and the assistant gives a non-refusal response. We experiment with mixed contexts containing these demonstrations and analyze their impact on compliance with a final ha… view at source ↗
Figure 2
Figure 2. Figure 2: Compliance rate at varying harmful fraction ϕ. For each model, we vary the harmful fraction for total demonstrations N ∈ {4, 8, 16, 32, 64, 128}. Llama-3.1-8B, OLMo-3.1-32B, and Gemma-4-31B have compliance rates increasing with ϕ. For GPT-OSS￾20B, compliance rate stays low throughout, demonstrating strong robustness against manyshot demonstrations. 0 8 16 32 64 Nb 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Compliance… view at source ↗
Figure 3
Figure 3. Figure 3: Compliance rate at varying number of benign compliance demonstrations (Nb). For each model, we experiment with total harmful demonstrations fixed at Nh ∈ {8, 16, 32} and vary Nb. For Llama-3.1-8B and Gemma-4-31B, compliance rate decreases with Nb, while OLMo-3.1-32B and GPT-OSS-20B compliance rates stay relatively constant as Nb increases used to build context demonstrations. We run each harmful query with… view at source ↗
Figure 5
Figure 5. Figure 5: Impact of different ways of ordering 32 benign and 32 harmful compliance demonstrations. We test 5 different methods of arranging demonstrations. With the exception of GPT￾OSS-20B which shows strong robustness against demonstrations, we find that models generally exhibit a recency bias where placing harmful demonstrations at the end (suffix) leads to higher compli￾ance rate. essential stage that reduces th… view at source ↗
Figure 6
Figure 6. Figure 6: How does format adoption rate compare to compli￾ance behavior adoption rate? We plot the format adoption rate under 32 benign and harmful demonstrations for a neutral prefix (“Answer: ”), compliance-signaling prefix (“Sure I can help with that!”). Additionally, we plot compliance rate above a baseline of 0 in-context demonstrations without any format prefix. Overall, we observe that for Llama-3.1-8B and OL… view at source ↗
Figure 7
Figure 7. Figure 7: Rate at which prefix is adopted among compliant and refusal responses. We measure format adoption rate separately based on the compliance/refusal of the resulting response. We observe that across all models, format is adopted more frequently when the response complies compared to refusal. For GPT-OSS￾20b and Gemma-4-31B, format adoption is near 0 for refusals. a prefix string it sees in demonstration respo… view at source ↗
Figure 8
Figure 8. Figure 8: Compliance rate at varying harmful fraction ϕ for each safe demonstration pool. For each model, we vary the harmful fraction for total demonstrations N ∈ {4, 8, 16, 32, 64, 128}. Llama-3.1-8B, OLMo-3.1-32B, and Gemma-4-31B have compliance rates increasing with ϕ. For GPT-OSS-20B, compliance rate stays low throughout. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of ordering impact at different ϕ values. ground where both format adoption and compliance increase moderately with ϕ. Format adoption conditioned on compliance ( [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Format adoption rate vs compliance adoption rate at various ϕ values. 0.0 0.2 0.4 0.6 0.8 1.0 Format adoption rate Neutral prefix ("Answer: ") =0.25 Among compliant responses Among refusal responses Llama-3.1-8B Instruct OLMo-3.1-32B Instruct GPT-OSS-20B Gemma-4-31B-IT 0.0 0.2 0.4 0.6 0.8 1.0 Format adoption rate Comply prefix ("Sure I can help...") =0.25 Among compliant responses Among refusal responses … view at source ↗
Figure 11
Figure 11. Figure 11: Format adoption rate broken down by compliant and refusal responses at various ϕ values. Llama-3.1-8B Instruct OLMo-3.1-32B Instruct GPT-OSS-20B Gemma-4-31B-IT 0.0 0.2 0.4 0.6 0.8 1.0 Compliance rate No prefix Comply prefix (a) ϕ = 0.25 Llama-3.1-8B Instruct OLMo-3.1-32B Instruct GPT-OSS-20B Gemma-4-31B-IT 0.0 0.2 0.4 0.6 0.8 1.0 Compliance rate No prefix Comply prefix (b) ϕ = 0.5 Llama-3.1-8B Instruct OL… view at source ↗
Figure 12
Figure 12. Figure 12: Compliance rate with comply prefix vs. no prefix at various ϕ values. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
read the original abstract

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper examines how safety-aligned LLMs process mixed in-context demonstrations consisting of benign compliance (non-harmful request + helpful response) and harmful compliance (harmful request + helpful response). It tests three hypotheses across four models, reporting that benign and harmful demonstrations are not interchangeable, that preference optimization (PO) is the critical training stage preventing benign demonstrations from increasing harmful compliance, that demonstration ordering shows strong recency bias, and that models differ in how refusal interacts with in-context signals (some adopt demonstrated formatting while refusing, others override all signals upon refusal).

Significance. If the central empirical claims hold after proper controls and statistical validation, the work would be significant for moving the jailbreaking literature from existence proofs to mechanistic characterization of how demonstration content, ordering, and training stage interact with safety alignment. The reported differences in refusal behavior and the isolation of PO as a key factor could inform more targeted alignment techniques.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (results on PO): the claim that 'preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance' is load-bearing for the paper's main contribution, yet the abstract and reported results supply no information on model matching (size, architecture, pretraining corpus, or other variables held constant across PO vs. non-PO models), prompt templates, or statistical tests isolating the PO stage from model-specific artifacts. Without these, the attribution cannot be distinguished from the skeptic concern that unstated differences drive the observed effects.
  2. [Abstract, Methods] Abstract and methods section: the abstract states results 'across four models and three hypotheses' but provides no sample sizes, number of prompts per condition, error bars, controls for prompt formatting, or statistical validation (e.g., significance tests or effect-size reporting). This absence makes it impossible to assess whether the data support the reported differences in compliance rates or the recency-bias finding.
  3. [§5] §5 (refusal interaction results): the distinction that 'some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal' is presented as a model difference, but without details on how refusal was operationalized, how formatting adoption was measured, or controls for prompt wording, it is unclear whether the observed variation is driven by training stage, model family, or unstated prompt choices.
minor comments (1)
  1. [Abstract] The abstract is unusually long and contains the core claims; consider moving some quantitative or methodological detail into a dedicated methods paragraph or table for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight areas where additional documentation will strengthen the manuscript's claims. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (results on PO): the claim that 'preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance' is load-bearing for the paper's main contribution, yet the abstract and reported results supply no information on model matching (size, architecture, pretraining corpus, or other variables held constant across PO vs. non-PO models), prompt templates, or statistical tests isolating the PO stage from model-specific artifacts. Without these, the attribution cannot be distinguished from the skeptic concern that unstated differences drive the observed effects.

    Authors: We agree that explicit documentation of model characteristics and statistical controls is necessary to support the attribution to the preference optimization stage. The four models were selected to enable comparisons across training stages (including pre- and post-PO variants where available in open releases), but the current text does not sufficiently detail this. In revision we will add a methods table listing model sizes, architectures, known pretraining corpora, and training stages for each model. We will also report statistical tests (e.g., significance levels and effect sizes) for the PO-related comparisons and move prompt templates to an appendix. These additions will allow readers to evaluate whether the observed effects are driven by PO rather than other model differences. revision: yes

  2. Referee: [Abstract, Methods] Abstract and methods section: the abstract states results 'across four models and three hypotheses' but provides no sample sizes, number of prompts per condition, error bars, controls for prompt formatting, or statistical validation (e.g., significance tests or effect-size reporting). This absence makes it impossible to assess whether the data support the reported differences in compliance rates or the recency-bias finding.

    Authors: We acknowledge that the abstract and methods lack the quantitative reporting details needed for full evaluation. The experiments were conducted with a fixed prompt set per hypothesis and condition (with multiple trials to assess variability), but these parameters are not stated. In the revised manuscript we will update the methods section to report exact sample sizes and number of prompts per condition, include error bars on all relevant figures, describe controls for prompt formatting, and add statistical validation including significance tests and effect sizes for compliance rates and recency bias. The abstract itself will remain high-level, but the methods will now contain the requested information. revision: yes

  3. Referee: [§5] §5 (refusal interaction results): the distinction that 'some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal' is presented as a model difference, but without details on how refusal was operationalized, how formatting adoption was measured, or controls for prompt wording, it is unclear whether the observed variation is driven by training stage, model family, or unstated prompt choices.

    Authors: We agree that the operational definitions require more explicit description to support the reported model differences. Refusal was detected via a combination of keyword matching for common refusal phrases and manual review of a subset of outputs; formatting adoption was measured by checking whether responses mirrored demonstrated structural elements (e.g., specific prefixes or response formats). In revision we will expand §5 with these definitions, include the exact criteria and any inter-annotator checks, and note the standardized prompt templates used across models to control for wording effects. This will clarify the basis for attributing differences to model behavior rather than prompt artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observations with no derivations or self-referential reductions

full rationale

The paper is an empirical study that mixes benign and harmful compliance demonstrations, tests three hypotheses across four models, and reports observational findings on how demonstration content, ordering, and training stage (preference optimization) affect harmful compliance. The abstract and described content contain no equations, no fitted parameters presented as predictions, no uniqueness theorems, no ansatzes, and no load-bearing self-citations that reduce any claim to its own inputs by construction. All central results are direct experimental measurements rather than derivations that collapse to the inputs. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on experimental observations from four LLMs; the abstract introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5695 in / 1171 out tokens · 33369 ms · 2026-06-26T17:40:46.959287+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    Bigelow, E., Wurgaft, D., Wang, Y ., Goodman, N., Ullman, T., Tanaka, H., and Lubana, E. S. Belief dynamics re- veal the dual nature of in-context learning and activation steering.arXiv preprint arXiv:2511.00617,

  2. [2]

    The Llama 3 Herd of Models

    URL https:// huggingface.co/google/gemma-4-31B-it. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  3. [3]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

    Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B. Y ., Lambert, N., Choi, Y ., and Dziri, N. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs.arXiv preprint arXiv:2406.18495,

  4. [4]

    In-context learn- ing creates task vectors

    Hendel, R., Geva, M., and Globerson, A. In-context learn- ing creates task vectors. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 9318– 9333,

  5. [5]

    Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024

    Luo, W., Ma, S., Liu, X., Guo, X., and Xiao, C. JailBreakV- 28K: A benchmark for assessing the robustness of mul- timodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027,

  6. [6]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249,

  7. [7]

    Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp

    Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048– 11064,

  8. [8]

    Olmo 3

    URL https://arxiv.org/abs/2512.13961. 9 What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? OpenAI. gpt-oss-120b & gpt-oss-20b Model Card,

  9. [9]

    gpt-oss-120b & gpt-oss-20b Model Card

    URLhttps://arxiv.org/abs/2508.10925. Turner, A. M., Thiergart, L., Leech, G., Udell, D., V´azquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering,

  10. [10]

    arXiv:2406.14598 (2025)

    Xie, T., Qi, X., Zeng, Y ., Huang, Y ., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y ., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P. SORRY-Bench: Systematically evaluating large language model safety re- fusal behaviors.arXiv preprint arXiv:2406.14598,

  11. [11]

    M., Kim, J., Kim, H

    Yoo, K. M., Kim, J., Kim, H. J., Cho, H., Jo, H., Lee, S.-W., Lee, S.-g., and Kim, T. Ground-truth labels matter: A deeper look into input-label demonstrations. InProceed- ings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2422–2437,

  12. [12]

    Harmful request: yes\nResponse refusal:

    10 What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? A. Additional Experimental Setup Details Inference settings.Responses are generated with temperature T= 0.7 and maximum output length 1000 tokens. Each experimental condition is run for two passes over the full evaluation set to average over both target-query variation and sampling...