pith. machine review for the scientific record. sign in

arxiv: 2604.09443 · v3 · submitted 2026-04-10 · 💻 cs.CL · cs.AI

Recognition: unknown

Many-Tier Instruction Hierarchy in LLM Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM agentsinstruction hierarchyinstruction conflictprivilege levelsagent safetybenchmarkManyIH-Benchinstruction following
0
0 comments X

The pith

LLM agents need a many-tier instruction hierarchy to resolve conflicts across arbitrarily many privilege levels, but frontier models achieve only about 40 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the dominant instruction hierarchy in LLMs relies on a small fixed set of privilege levels defined by rigid roles, which does not scale to real agentic systems where instructions come from many sources such as system messages, users, tools, and other agents. It proposes Many-Tier Instruction Hierarchy as a generalization that allows any number of privilege tiers and introduces ManyIH-Bench, a benchmark of 853 tasks requiring navigation of up to 12 conflicting levels drawn from 46 real-world agents. Experiments show current frontier models perform poorly at roughly 40 percent accuracy on these tasks. A sympathetic reader would care because agents that cannot reliably follow the highest-privilege instruction risk unsafe or ineffective behavior in complex deployments.

Core claim

We propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH, which requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following) composed from constraints developed by LLMs and verified by humans across 46 real-world agents. Our experiments show that even the current frontier models perform poorly, achieving around 40 percent accuracy when instruction conflict scales.

What carries the argument

Many-Tier Instruction Hierarchy (ManyIH), which generalizes beyond fixed small privilege levels to arbitrarily many tiers so that agents can identify and follow the highest-authority instruction from diverse sources.

If this is right

  • Agent frameworks must move beyond rigid role labels to support dynamic and numerous privilege tiers for safety in multi-source environments.
  • Evaluation of LLMs for agent use requires benchmarks that test conflict resolution at scales beyond the typical fewer than five levels.
  • Methods explicitly designed for fine-grained privilege handling will be needed to make agents reliable when instructions arrive from tools or other agents.
  • The observed performance gap underscores the need to prioritize instruction conflict resolution in training and alignment of agentic models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Better handling of many-tier conflicts could improve safety in systems where agents receive instructions from external tools whose outputs may carry lower or uncertain authority.
  • The approach could connect to problems in multi-agent coordination by providing a consistent way to rank instructions across collaborating models.
  • Developers might test extensions that allow privilege levels to be assigned dynamically based on task context rather than static source types.

Load-bearing premise

The ManyIH-Bench tasks, built from LLM-generated constraints verified by humans across 46 real-world agents, accurately capture the distribution and difficulty of instruction conflicts that arise in deployed agentic systems with up to 12 privilege levels.

What would settle it

A model achieving substantially higher than 40 percent accuracy on ManyIH-Bench tasks, without having been trained on the benchmark itself, would indicate that scalable resolution of fine-grained instruction conflicts is feasible with current or near-term methods.

Figures

Figures reproduced from arXiv: 2604.09443 by Benjamin Van Durme, Daniel Khashabi, Hongyuan Zhan, Jingyu Zhang, Tianjian Li, William Jurayj.

Figure 1
Figure 1. Figure 1: Overview of Many-Tier Instruction Hierarchy compared with existing IH. ManyIH [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Left: overall accuracy on MANYIH-BENCH. Right: accuracy by subset. Both frontier and open-source models struggle with ManyIH. Error bars show bootstrap 95% CIs. 6 Experiment and Analysis 6.1 Overall Model Performance on MANYIH-BENCH We evaluate ten frontier proprietary and open-source models on MANYIH-BENCH: Gemini 3.1 Pro (Google DeepMind, 2026), GPT-5.4 (OpenAI, 2025a), Claude Opus 4.6 (Anthropic, 2026a)… view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy across IH tiers on the cod￾ing subset. Model performance consistently degrades as the number of IH tiers increases. In this subsection, we disentangle instruc￾tion hierarchy difficulty with instruction following difficulty and evaluate models on benchmarks with different privilege tiers per sample. We synthesize three variants the coding subset with different configura￾tions in the instruction com… view at source ↗
Figure 4
Figure 4. Figure 4: Analysis of reasoning behavior on the coding subset. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, other agents, and more-each carrying different levels of trust and authority. When these instructions conflict, agents must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Many-Tier Instruction Hierarchy (ManyIH) as an extension of traditional instruction hierarchy to handle conflicts among instructions with arbitrarily many (up to 12) privilege levels in LLM agents. It introduces ManyIH-Bench, the first benchmark for this setting, consisting of 853 agentic tasks (427 coding and 426 instruction-following) constructed by composing LLM-generated constraints that are human-verified and mapped onto 46 real-world agents. Experiments show that even frontier models achieve only ~40% accuracy on these tasks, supporting the claim that current fixed small-tier IH approaches are inadequate for scalable real-world agentic settings.

Significance. If the benchmark construction is shown to faithfully reproduce the distribution and difficulty of organic instruction conflicts in deployed agents, this work would be significant for agent safety research. It provides the first systematic evaluation framework for many-tier conflict resolution and supplies concrete evidence that performance degrades as the number of privilege levels increases beyond the classic 3-5, thereby motivating new methods beyond rigid role-based hierarchies. The grounding in 46 real-world agents is a positive step toward practical relevance.

major comments (3)
  1. [Abstract] Abstract: The headline empirical claim (~40% accuracy when instruction conflict scales) is presented without any details on the task generation pipeline, human verification protocol (e.g., annotator count, agreement metrics, or rejection criteria), baseline comparisons, or statistical significance testing. This prevents evaluation of whether the central result is robust or reproducible.
  2. [Abstract] ManyIH-Bench construction (Abstract and implied §3): The benchmark relies on the untested assumption that LLM-generated + human-verified constraints accurately capture the distribution, frequency, and ambiguity of real privilege conflicts arising from organic sources (system messages, tool outputs, inter-agent messages) rather than LLM-typical synthetic patterns. No independent validation or comparison against logged conflicts from deployed agents is described, which is load-bearing for the claim that the ~40% result demonstrates a general scaling failure.
  3. [Experiments] Experiments (implied §4): No comparisons are reported to adapted versions of existing instruction-hierarchy methods, fine-tuned models, or other conflict-resolution baselines. Without these, it is impossible to determine whether the observed performance drop is specifically attributable to the increase to 12 tiers or to other factors such as task complexity or prompt length.
minor comments (2)
  1. [Abstract] The abstract states 853 tasks with a near-even split (427 coding, 426 instruction-following) but does not clarify whether this balance was intentional or if performance differs meaningfully across the two categories.
  2. Notation for privilege levels and conflict resolution rules could be formalized earlier (e.g., with a small example table) to aid readers unfamiliar with the many-tier extension.

Simulated Author's Rebuttal

3 responses · 1 unresolved

Thank you for the constructive feedback on our manuscript. We have addressed each major comment point by point below, with plans to revise the paper where appropriate to enhance clarity, robustness, and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline empirical claim (~40% accuracy when instruction conflict scales) is presented without any details on the task generation pipeline, human verification protocol (e.g., annotator count, agreement metrics, or rejection criteria), baseline comparisons, or statistical significance testing. This prevents evaluation of whether the central result is robust or reproducible.

    Authors: We agree the abstract is high-level and omits specifics. Full details on the task generation pipeline, human verification protocol (including annotator count, agreement metrics, and rejection criteria), baselines, and statistical testing appear in Sections 3 and 4. We will revise the abstract to concisely reference the construction process and direct readers to these sections for evaluation of robustness and reproducibility. revision: yes

  2. Referee: [Abstract] ManyIH-Bench construction (Abstract and implied §3): The benchmark relies on the untested assumption that LLM-generated + human-verified constraints accurately capture the distribution, frequency, and ambiguity of real privilege conflicts arising from organic sources (system messages, tool outputs, inter-agent messages) rather than LLM-typical synthetic patterns. No independent validation or comparison against logged conflicts from deployed agents is described, which is load-bearing for the claim that the ~40% result demonstrates a general scaling failure.

    Authors: We acknowledge that direct validation against real logged conflicts would be ideal. Such proprietary logs are not publicly accessible. Our benchmark uses LLM-generated constraints human-verified and mapped to 46 real-world agents as a proxy. We will expand Section 3 with full verification protocol details and add a limitations discussion on the synthetic aspects and their relation to organic conflicts. revision: partial

  3. Referee: [Experiments] Experiments (implied §4): No comparisons are reported to adapted versions of existing instruction-hierarchy methods, fine-tuned models, or other conflict-resolution baselines. Without these, it is impossible to determine whether the observed performance drop is specifically attributable to the increase to 12 tiers or to other factors such as task complexity or prompt length.

    Authors: We agree comparisons would help isolate the tier-scaling effect. Our experiments emphasized frontier models to show the challenge. We will add adapted existing IH methods (e.g., role-based prompting), fine-tuned baselines, and controls for prompt length/task complexity in the revised Section 4. revision: yes

standing simulated objections not resolved
  • Direct comparison to logged instruction conflicts from deployed agents, as such proprietary data is not available for independent validation.

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark creation and direct measurement.

full rationale

The paper introduces ManyIH-Bench as a new dataset of 853 tasks and reports model accuracy (~40%) as a direct empirical measurement on those tasks. No derivations, equations, fitted parameters, or predictions are claimed that could reduce to inputs by construction. The central result is an observed performance number on the constructed benchmark, not a quantity defined in terms of itself. Any self-citations (e.g., to prior IH work) are not load-bearing for the reported accuracies. This is a standard benchmark paper whose claims rest on the external validity of the tasks rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and evaluation paper with no mathematical derivations, fitted constants, or postulated entities; the central claims rest on the benchmark construction process and model evaluations rather than axioms or free parameters.

pith-pipeline@v0.9.0 · 5544 in / 1163 out tokens · 51448 ms · 2026-05-10T17:10:38.237990+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 25 canonical work pages · 8 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    Claude Opus 4.6 System Card

    Anthropic . Claude Opus 4.6 System Card . https://www.anthropic.com/claude-opus-4-6-system-card, February 2026 a . Published February 6, 2026. Accessed March 30, 2026

  3. [3]

    Claude Sonnet 4.6 System Card

    Anthropic . Claude Sonnet 4.6 System Card . https://www.anthropic.com/claude-sonnet-4-6-system-card, February 2026 b . Published February 17, 2026. Accessed March 30, 2026

  4. [4]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

  5. [5]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022. URL https://arxiv.org/abs/2212.08073

  6. [6]

    DeonticBench: A Benchmark for Reasoning over Rules

    Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, and Benjamin Van Durme. Deonticbench: A benchmark for reasoning over rules, 2026. URL https://arxiv.org/abs/2604.04443

  7. [7]

    Gemini 3.1 Pro Model Card

    Google DeepMind . Gemini 3.1 Pro Model Card . https://deepmind.google/models/model-cards/gemini-3-1-pro/, February 2026. Published February 19, 2026. Accessed March 30, 2026

  8. [8]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023. URL https://arxiv.org/abs/2302.12173

  9. [9]

    Ih-challenge: A training dataset to improve instruction hierarchy on frontier llms,

    Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, and Kai Xiao. Ih-challenge: A training dataset to improve instruction hierarchy on frontier llms, 2026. URL https://arxiv.org/abs/2603.10521

  10. [10]

    When instructions multiply: Measuring and estimating llm capabilities of multiple instructions following, 2025

    Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Edison Marrese-Taylor, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. When instructions multiply: Measuring and estimating llm capabilities of multiple instructions following, 2025. URL https://arxiv.org/abs/2509.21051

  11. [11]

    LLM -rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts

    Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM -rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp...

  12. [12]

    Coninstruct: Evaluating large language models on conflict detection and resolution in instructions, 2025

    Xingwei He, Qianru Zhang, Pengfei Chen, Guanhua Chen, Linlin Yu, Yuan Yuan, and Siu-Ming Yiu. Coninstruct: Evaluating large language models on conflict detection and resolution in instructions, 2025. URL https://arxiv.org/abs/2511.14342

  13. [13]

    Beyond oracle: Verifier-supervision for instruction hierarchy in reasoning and instruction-tuned LLM s

    Sian-Yao Huang, Li-Hsien Chang, Che-Yu Lin, and Cheng-Lin Yang. Beyond oracle: Verifier-supervision for instruction hierarchy in reasoning and instruction-tuned LLM s. In Advances in Neural Information Processing Systems NeurIPS , 2025. URL https://openreview.net/forum?id=IQ513IX1G5

  14. [14]

    Prometheus: Inducing fine-grained evaluation capability in language models

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8euJaTveKw

  15. [15]

    Kimi Team , Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...

  16. [16]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against llm-integrated applications, 2024. URL https://arxiv.org/abs/2306.05499

  17. [17]

    Introducing the Model Spec

    OpenAI . Introducing the Model Spec . https://openai.com/index/introducing-the-model-spec/, May 2024. Accessed: 2026-02-25

  18. [18]

    GPT-5 System Card

    OpenAI. GPT-5 System Card . https://cdn.openai.com/gpt-5-system-card.pdf, August 2025 a . Accessed: 2025-10-13

  19. [19]

    Introducing group chats in ChatGPT , November 2025 b

    OpenAI. Introducing group chats in ChatGPT , November 2025 b . URL https://openai.com/index/group-chats-in-chatgpt/. Accessed: 2025-12-02

  20. [20]

    OpenAI Harmony Response Format , August 2025 c

    OpenAI. OpenAI Harmony Response Format , August 2025 c . URL https://cookbook.openai.com/articles/openai-harmony. Accessed: 2025-12-01

  21. [21]

    Generalizing verifiable instruction following.arXiv preprint arXiv:2507.02833,

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following, 2025. URL https://arxiv.org/abs/2507.02833

  22. [22]

    AGENTIF : Benchmarking large language models instruction following ability in agentic scenarios

    Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, and Juanzi Li. AGENTIF : Benchmarking large language models instruction following ability in agentic scenarios. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=FLiMxTkIeu

  23. [23]

    Skill-inject: Measuring agent vulnerability to skill file attacks.arXiv preprint arXiv:2602.20156, 2026

    David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, and Maksym Andriushchenko. Skill-inject: Measuring agent vulnerability to skill file attacks, 2026. URL https://arxiv.org/abs/2602.20156

  24. [24]

    Tensor trust: Interpretable prompt injection attacks from an online game

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor trust: Interpretable prompt injection attacks from an online game. In R0-FoMo:Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023. URL ht...

  25. [25]

    Pep 8 -- style guide for python code

    Guido van Rossum, Barry Warsaw, and Alyssa Coghlan. Pep 8 -- style guide for python code. https://peps.python.org/pep-0008/, 2001. Python Enhancement Proposal 8, created 2001-07-05, accessed 2026-03-29

  26. [26]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024

  27. [27]

    Instructional segment embedding: Improving llm safety with instruction hierarchy.arXiv preprint arXiv:2410.09102, 2024

    Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, and Wenxuan Zhou. Instructional segment embedding: Improving llm safety with instruction hierarchy, 2025. URL https://arxiv.org/abs/2410.09102

  28. [28]

    Grok 4.20 Beta 0309 Reasoning

    xAI . Grok 4.20 Beta 0309 Reasoning . https://docs.x.ai/developers/models/grok-4.20-beta-0309-reasoning, 2026. xAI developer documentation. Accessed March 30, 2026

  29. [29]

    Codeif: Benchmarking the instruction-following capabilities of large language models for code generation

    Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Shaosheng Cao, Donglin Di, and Zhoujun Li. CodeIF : Benchmarking the instruction-following capabilities of large language models for code generation. In Annual Meeting of the Association for Computational Linguistics ACL , 2025. URL https://arxiv.org/abs/2502.19166

  30. [30]

    Cctu: A benchmark for tool use under complex constraints, 2026

    Junjie Ye, Guoqiang Zhang, Wenjie Fu, Tao Gui, Qi Zhang, and Xuanjing Huang. Cctu: A benchmark for tool use under complex constraints, 2026. URL https://arxiv.org/abs/2603.15309

  31. [31]

    Benchmarking and defending against indirect prompt injection attacks on large language models.arXiv preprint arXiv:2312.14197, 2025

    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models, 2024. URL https://arxiv.org/abs/2312.14197

  32. [32]

    doi: 10.18653/v1/2024.findings-acl.624

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. I njec A gent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 10471--10506, Bangkok, Thailand, August 2024. Association for Comp...

  33. [33]

    Controllable safety alignment: Inference-time adaptation to diverse safety requirements

    Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, and Benjamin Van Durme. Controllable safety alignment: Inference-time adaptation to diverse safety requirements. In International Conference on Learning Representations ICLR , 2025 a . URL https://arxiv.org/abs/2410.08968

  34. [34]

    Jailbreak distillation: Renewable safety benchmarking

    Jingyu Zhang, Ahmed Elgohary, Xiawei Wang, A S M Iftekhar, Ahmed Magooda, Benjamin Van Durme, Daniel Khashabi, and Kyle Jackson. Jailbreak distillation: Renewable safety benchmarking. In Conference on Empirical Methods in Natural Language Processing EMNLP - Findings , 2025 b . URL https://arxiv.org/abs/2505.22037

  35. [35]

    Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success

    Yiming Zhang, Nicholas Carlini, and Daphne Ippolito. Effective prompt extraction from language models, 2024. URL https://arxiv.org/abs/2307.06865

  36. [36]

    IHEval : Evaluating language models on following the instruction hierarchy

    Zhihan Zhang, Shiyang Li, Zixuan Zhang, Xin Liu, Haoming Jiang, Xianfeng Tang, Yifan Gao, Zheng Li, Haodong Wang, Zhaoxuan Tan, Yichuan Li, Qingyu Yin, Bing Yin, and Meng Jiang. IHEval : Evaluating language models on following the instruction hierarchy. In Conference of the North American Chapter of the Association for Computational Linguistics NAACL , 20...

  37. [37]

    Reasoning up the instruction ladder for controllable language models, 2026

    Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, and Sachin Kumar. Reasoning up the instruction ladder for controllable language models, 2026. URL https://arxiv.org/abs/2511.04694

  38. [38]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911

  39. [39]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  40. [40]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  41. [41]

    Chernoff

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...