What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Mann Patel; Sihui Dai

arxiv: 2606.20508 · v1 · pith:S7IVIMVXnew · submitted 2026-06-18 · 💻 cs.AI · cs.LG

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Sihui Dai , Mann Patel This is my paper

Pith reviewed 2026-06-26 17:40 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords safety alignmentin-context learningjailbreakingpreference optimizationcompliance demonstrationslanguage modelsharmful responses

0 comments

The pith

Preference optimization during training prevents benign demonstrations from increasing harmful compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies how safety-aligned language models respond when given mixtures of benign compliance demonstrations and harmful ones. It finds that benign demonstrations are not interchangeable with harmful ones and can either reduce or raise harmful compliance rates depending on the model. Preference optimization emerges as the key training stage that blocks any increase from benign examples. A reader would care because the work moves from confirming that in-context examples can jailbreak models to showing exactly which factors in content, order, and training control the outcome.

Core claim

The authors establish that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance. Benign and harmful demonstrations are not interchangeable. Demonstration ordering exhibits strong recency bias. Models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal.

What carries the argument

The interaction of mixed benign and harmful compliance demonstrations with the preference optimization training stage, which determines whether benign examples raise or lower harmful compliance.

If this is right

Benign demonstrations can reduce or increase harmful compliance depending on the model.
Preference optimization blocks benign demonstrations from increasing harmful compliance.
Demonstration ordering produces strong recency bias in compliance rates.
Some models copy demonstrated formats during refusal while others ignore all in-context signals once they refuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could select or modify alignment stages specifically to limit how mixed demonstrations affect safety behavior.
Prompt construction for safety testing may need to control for recency effects when using multiple examples.
The patterns suggest that in-context safety robustness could be improved by changing training order or adding targeted preference pairs.

Load-bearing premise

The observed effects are produced by demonstration content and the presence or absence of preference optimization rather than by unstated choices in how the prompts are worded or by traits specific to the four models tested.

What would settle it

Testing the same mixed demonstrations on a model that never received preference optimization and finding that benign demonstrations still do not increase harmful compliance would falsify the claim that this stage is what blocks the increase.

Figures

Figures reproduced from arXiv: 2606.20508 by Mann Patel, Sihui Dai.

**Figure 1.** Figure 1: Benign vs. Harmful Compliance Demonstration. In a benign compliance demonstration, the user provides a non-harmful prompt and the assistant provides a helpful answer. In a harmful compliance demonstration, the user provides a harmful prompt and the assistant gives a non-refusal response. We experiment with mixed contexts containing these demonstrations and analyze their impact on compliance with a final ha… view at source ↗

**Figure 2.** Figure 2: Compliance rate at varying harmful fraction ϕ. For each model, we vary the harmful fraction for total demonstrations N ∈ {4, 8, 16, 32, 64, 128}. Llama-3.1-8B, OLMo-3.1-32B, and Gemma-4-31B have compliance rates increasing with ϕ. For GPT-OSS20B, compliance rate stays low throughout, demonstrating strong robustness against manyshot demonstrations. 0 8 16 32 64 Nb 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Compliance… view at source ↗

**Figure 3.** Figure 3: Compliance rate at varying number of benign compliance demonstrations (Nb). For each model, we experiment with total harmful demonstrations fixed at Nh ∈ {8, 16, 32} and vary Nb. For Llama-3.1-8B and Gemma-4-31B, compliance rate decreases with Nb, while OLMo-3.1-32B and GPT-OSS-20B compliance rates stay relatively constant as Nb increases used to build context demonstrations. We run each harmful query with… view at source ↗

**Figure 5.** Figure 5: Impact of different ways of ordering 32 benign and 32 harmful compliance demonstrations. We test 5 different methods of arranging demonstrations. With the exception of GPTOSS-20B which shows strong robustness against demonstrations, we find that models generally exhibit a recency bias where placing harmful demonstrations at the end (suffix) leads to higher compliance rate. essential stage that reduces th… view at source ↗

**Figure 6.** Figure 6: How does format adoption rate compare to compliance behavior adoption rate? We plot the format adoption rate under 32 benign and harmful demonstrations for a neutral prefix (“Answer: ”), compliance-signaling prefix (“Sure I can help with that!”). Additionally, we plot compliance rate above a baseline of 0 in-context demonstrations without any format prefix. Overall, we observe that for Llama-3.1-8B and OL… view at source ↗

**Figure 7.** Figure 7: Rate at which prefix is adopted among compliant and refusal responses. We measure format adoption rate separately based on the compliance/refusal of the resulting response. We observe that across all models, format is adopted more frequently when the response complies compared to refusal. For GPT-OSS20b and Gemma-4-31B, format adoption is near 0 for refusals. a prefix string it sees in demonstration respo… view at source ↗

**Figure 8.** Figure 8: Compliance rate at varying harmful fraction ϕ for each safe demonstration pool. For each model, we vary the harmful fraction for total demonstrations N ∈ {4, 8, 16, 32, 64, 128}. Llama-3.1-8B, OLMo-3.1-32B, and Gemma-4-31B have compliance rates increasing with ϕ. For GPT-OSS-20B, compliance rate stays low throughout. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of ordering impact at different ϕ values. ground where both format adoption and compliance increase moderately with ϕ. Format adoption conditioned on compliance ( [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Format adoption rate vs compliance adoption rate at various ϕ values. 0.0 0.2 0.4 0.6 0.8 1.0 Format adoption rate Neutral prefix ("Answer: ") =0.25 Among compliant responses Among refusal responses Llama-3.1-8B Instruct OLMo-3.1-32B Instruct GPT-OSS-20B Gemma-4-31B-IT 0.0 0.2 0.4 0.6 0.8 1.0 Format adoption rate Comply prefix ("Sure I can help...") =0.25 Among compliant responses Among refusal responses … view at source ↗

**Figure 11.** Figure 11: Format adoption rate broken down by compliant and refusal responses at various ϕ values. Llama-3.1-8B Instruct OLMo-3.1-32B Instruct GPT-OSS-20B Gemma-4-31B-IT 0.0 0.2 0.4 0.6 0.8 1.0 Compliance rate No prefix Comply prefix (a) ϕ = 0.25 Llama-3.1-8B Instruct OLMo-3.1-32B Instruct GPT-OSS-20B Gemma-4-31B-IT 0.0 0.2 0.4 0.6 0.8 1.0 Compliance rate No prefix Comply prefix (b) ϕ = 0.5 Llama-3.1-8B Instruct OL… view at source ↗

**Figure 12.** Figure 12: Compliance rate with comply prefix vs. no prefix at various ϕ values. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗

read the original abstract

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper characterizes non-interchangeability of mixed compliance demos and recency bias but the preference optimization attribution rests on an unverified isolation.

read the letter

The main thing to know is that this paper tests how mixing benign and harmful compliance demonstrations affects harmful compliance rates across four models and flags preference optimization as the stage that stops benign examples from increasing compliance. It also reports recency bias in ordering and model differences in whether refusals override in-context formatting.

What is actually new is the move past basic jailbreak demonstrations to specific claims about non-interchangeability of demo types, ordering effects, and a link to training stage. Reporting three hypotheses on multiple models gives the observations some breadth that prior work on in-context jailbreaking lacked.

The empirical approach is straightforward and avoids circular definitions. The refusal behavior differences are a concrete observation worth noting.

The soft spot is the preference optimization claim. The stress-test concern holds up on the given information: the abstract supplies no details on whether the models were matched on size, pretraining, or prompt templates, nor any statistical controls. Without that, the attribution to the PO stage rather than other model properties is not cleanly supported. Sample sizes and error bars are also missing from the summary, which weakens confidence in the separation.

This is for researchers in AI safety who work on in-context effects and alignment robustness. A reader focused on demonstration-based attacks might pick up usable observations on ordering and refusal patterns.

Send it to peer review. The experimental controls and stats need checking, but the questions are worth referee time if the methods section addresses the isolation issue.

Referee Report

3 major / 1 minor

Summary. The paper examines how safety-aligned LLMs process mixed in-context demonstrations consisting of benign compliance (non-harmful request + helpful response) and harmful compliance (harmful request + helpful response). It tests three hypotheses across four models, reporting that benign and harmful demonstrations are not interchangeable, that preference optimization (PO) is the critical training stage preventing benign demonstrations from increasing harmful compliance, that demonstration ordering shows strong recency bias, and that models differ in how refusal interacts with in-context signals (some adopt demonstrated formatting while refusing, others override all signals upon refusal).

Significance. If the central empirical claims hold after proper controls and statistical validation, the work would be significant for moving the jailbreaking literature from existence proofs to mechanistic characterization of how demonstration content, ordering, and training stage interact with safety alignment. The reported differences in refusal behavior and the isolation of PO as a key factor could inform more targeted alignment techniques.

major comments (3)

[Abstract, §4] Abstract and §4 (results on PO): the claim that 'preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance' is load-bearing for the paper's main contribution, yet the abstract and reported results supply no information on model matching (size, architecture, pretraining corpus, or other variables held constant across PO vs. non-PO models), prompt templates, or statistical tests isolating the PO stage from model-specific artifacts. Without these, the attribution cannot be distinguished from the skeptic concern that unstated differences drive the observed effects.
[Abstract, Methods] Abstract and methods section: the abstract states results 'across four models and three hypotheses' but provides no sample sizes, number of prompts per condition, error bars, controls for prompt formatting, or statistical validation (e.g., significance tests or effect-size reporting). This absence makes it impossible to assess whether the data support the reported differences in compliance rates or the recency-bias finding.
[§5] §5 (refusal interaction results): the distinction that 'some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal' is presented as a model difference, but without details on how refusal was operationalized, how formatting adoption was measured, or controls for prompt wording, it is unclear whether the observed variation is driven by training stage, model family, or unstated prompt choices.

minor comments (1)

[Abstract] The abstract is unusually long and contains the core claims; consider moving some quantitative or methodological detail into a dedicated methods paragraph or table for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight areas where additional documentation will strengthen the manuscript's claims. We address each major comment below and will revise the paper accordingly.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (results on PO): the claim that 'preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance' is load-bearing for the paper's main contribution, yet the abstract and reported results supply no information on model matching (size, architecture, pretraining corpus, or other variables held constant across PO vs. non-PO models), prompt templates, or statistical tests isolating the PO stage from model-specific artifacts. Without these, the attribution cannot be distinguished from the skeptic concern that unstated differences drive the observed effects.

Authors: We agree that explicit documentation of model characteristics and statistical controls is necessary to support the attribution to the preference optimization stage. The four models were selected to enable comparisons across training stages (including pre- and post-PO variants where available in open releases), but the current text does not sufficiently detail this. In revision we will add a methods table listing model sizes, architectures, known pretraining corpora, and training stages for each model. We will also report statistical tests (e.g., significance levels and effect sizes) for the PO-related comparisons and move prompt templates to an appendix. These additions will allow readers to evaluate whether the observed effects are driven by PO rather than other model differences. revision: yes
Referee: [Abstract, Methods] Abstract and methods section: the abstract states results 'across four models and three hypotheses' but provides no sample sizes, number of prompts per condition, error bars, controls for prompt formatting, or statistical validation (e.g., significance tests or effect-size reporting). This absence makes it impossible to assess whether the data support the reported differences in compliance rates or the recency-bias finding.

Authors: We acknowledge that the abstract and methods lack the quantitative reporting details needed for full evaluation. The experiments were conducted with a fixed prompt set per hypothesis and condition (with multiple trials to assess variability), but these parameters are not stated. In the revised manuscript we will update the methods section to report exact sample sizes and number of prompts per condition, include error bars on all relevant figures, describe controls for prompt formatting, and add statistical validation including significance tests and effect sizes for compliance rates and recency bias. The abstract itself will remain high-level, but the methods will now contain the requested information. revision: yes
Referee: [§5] §5 (refusal interaction results): the distinction that 'some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal' is presented as a model difference, but without details on how refusal was operationalized, how formatting adoption was measured, or controls for prompt wording, it is unclear whether the observed variation is driven by training stage, model family, or unstated prompt choices.

Authors: We agree that the operational definitions require more explicit description to support the reported model differences. Refusal was detected via a combination of keyword matching for common refusal phrases and manual review of a subset of outputs; formatting adoption was measured by checking whether responses mirrored demonstrated structural elements (e.g., specific prefixes or response formats). In revision we will expand §5 with these definitions, include the exact criteria and any inter-annotator checks, and note the standardized prompt templates used across models to control for wording effects. This will clarify the basis for attributing differences to model behavior rather than prompt artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observations with no derivations or self-referential reductions

full rationale

The paper is an empirical study that mixes benign and harmful compliance demonstrations, tests three hypotheses across four models, and reports observational findings on how demonstration content, ordering, and training stage (preference optimization) affect harmful compliance. The abstract and described content contain no equations, no fitted parameters presented as predictions, no uniqueness theorems, no ansatzes, and no load-bearing self-citations that reduce any claim to its own inputs by construction. All central results are direct experimental measurements rather than derivations that collapse to the inputs. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on experimental observations from four LLMs; the abstract introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5695 in / 1171 out tokens · 33369 ms · 2026-06-26T17:40:46.959287+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 8 canonical work pages · 5 internal anchors

[1]

Bigelow, E., Wurgaft, D., Wang, Y ., Goodman, N., Ullman, T., Tanaka, H., and Lubana, E. S. Belief dynamics re- veal the dual nature of in-context learning and activation steering.arXiv preprint arXiv:2511.00617,

work page arXiv
[2]

The Llama 3 Herd of Models

URL https:// huggingface.co/google/gemma-4-31B-it. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B. Y ., Lambert, N., Choi, Y ., and Dziri, N. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs.arXiv preprint arXiv:2406.18495,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

In-context learn- ing creates task vectors

Hendel, R., Geva, M., and Globerson, A. In-context learn- ing creates task vectors. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 9318– 9333,

2023
[5]

Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024

Luo, W., Ma, S., Liu, X., Guo, X., and Xiao, C. JailBreakV- 28K: A benchmark for assessing the robustness of mul- timodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027,

work page arXiv
[6]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048– 11064,

2022
[8]

Olmo 3

URL https://arxiv.org/abs/2512.13961. 9 What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? OpenAI. gpt-oss-120b & gpt-oss-20b Model Card,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

gpt-oss-120b & gpt-oss-20b Model Card

URLhttps://arxiv.org/abs/2508.10925. Turner, A. M., Thiergart, L., Leech, G., Udell, D., V´azquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Sorry-bench: Systematically evaluating large language model safety refusal

Xie, T., Qi, X., Zeng, Y ., Huang, Y ., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y ., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P. SORRY-Bench: Systematically evaluating large language model safety re- fusal behaviors.arXiv preprint arXiv:2406.14598,

work page arXiv
[11]

M., Kim, J., Kim, H

Yoo, K. M., Kim, J., Kim, H. J., Cho, H., Jo, H., Lee, S.-W., Lee, S.-g., and Kim, T. Ground-truth labels matter: A deeper look into input-label demonstrations. InProceed- ings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2422–2437,

2022
[12]

Harmful request: yes\nResponse refusal:

10 What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? A. Additional Experimental Setup Details Inference settings.Responses are generated with temperature T= 0.7 and maximum output length 1000 tokens. Each experimental condition is run for two passes over the full evaluation set to average over both target-query variation and sampling...

2024

[1] [1]

Bigelow, E., Wurgaft, D., Wang, Y ., Goodman, N., Ullman, T., Tanaka, H., and Lubana, E. S. Belief dynamics re- veal the dual nature of in-context learning and activation steering.arXiv preprint arXiv:2511.00617,

work page arXiv

[2] [2]

The Llama 3 Herd of Models

URL https:// huggingface.co/google/gemma-4-31B-it. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B. Y ., Lambert, N., Choi, Y ., and Dziri, N. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs.arXiv preprint arXiv:2406.18495,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

In-context learn- ing creates task vectors

Hendel, R., Geva, M., and Globerson, A. In-context learn- ing creates task vectors. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 9318– 9333,

2023

[5] [5]

Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks.arXiv preprint arXiv:2404.03027, 2024

Luo, W., Ma, S., Liu, X., Guo, X., and Xiao, C. JailBreakV- 28K: A benchmark for assessing the robustness of mul- timodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027,

work page arXiv

[6] [6]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048– 11064,

2022

[8] [8]

Olmo 3

URL https://arxiv.org/abs/2512.13961. 9 What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? OpenAI. gpt-oss-120b & gpt-oss-20b Model Card,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

gpt-oss-120b & gpt-oss-20b Model Card

URLhttps://arxiv.org/abs/2508.10925. Turner, A. M., Thiergart, L., Leech, G., Udell, D., V´azquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Sorry-bench: Systematically evaluating large language model safety refusal

Xie, T., Qi, X., Zeng, Y ., Huang, Y ., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y ., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P. SORRY-Bench: Systematically evaluating large language model safety re- fusal behaviors.arXiv preprint arXiv:2406.14598,

work page arXiv

[11] [11]

M., Kim, J., Kim, H

Yoo, K. M., Kim, J., Kim, H. J., Cho, H., Jo, H., Lee, S.-W., Lee, S.-g., and Kim, T. Ground-truth labels matter: A deeper look into input-label demonstrations. InProceed- ings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2422–2437,

2022

[12] [12]

Harmful request: yes\nResponse refusal:

10 What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? A. Additional Experimental Setup Details Inference settings.Responses are generated with temperature T= 0.7 and maximum output length 1000 tokens. Each experimental condition is run for two passes over the full evaluation set to average over both target-query variation and sampling...

2024