Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization
Pith reviewed 2026-06-27 12:44 UTC · model grok-4.3
The pith
Gravity-weighted DPO scales optimization by level distance to enforce five-level instruction hierarchies in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens and Instructional Segment Embeddings, GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate.
What carries the argument
Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting instruction levels under a bilateral schedule weighting both the privilege gap and the victim level's privilege.
If this is right
- Models will more reliably follow higher-privilege instructions when they conflict with lower-privilege ones.
- Over-refusal rates remain lower than under standard DPO while priority adherence rises.
- Five-level training produces a generality-specialization tradeoff compared with three-level training.
- Instructional Segment Embeddings act as a refusal-threshold calibrator.
- Hierarchy-specific delimiter tokens support enforcement of the defined priority relations.
Where Pith is reading between the lines
- The bilateral schedule may generalize to other multi-source instruction settings such as tool-use or agentic workflows.
- The same weighting principle could be tested on models larger than 8B to check scaling behavior.
- Real deployment might reduce prompt-injection success rates by anchoring refusals to explicit privilege gaps.
- A dynamic hierarchy that updates level distances from observed user corrections could be a natural next extension.
Load-bearing premise
That the structural distance between levels in the 5-level hierarchy, combined with the linear or bilateral weighting schedule, correctly captures the real-world severity and priority of instruction conflicts.
What would settle it
A dataset of real instruction conflicts in which human or downstream-task preferences assign different relative severities than the assumed five-level structural distances, such that GW-DPO shows no gain or a loss in macro pairwise adherence relative to standard DPO.
read the original abstract
Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.
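A minimal sketch of where the "ten pairwise priority relations" come from and how a macro-averaged adherence score could be computed. It assumes levels are indexed 0 to 4 with a lower index meaning higher privilege, and that "macro" means an unweighted mean of per-pair adherence rates; both are illustrative assumptions, and the function names are hypothetical rather than taken from the paper.

```python
from itertools import combinations

K = 5  # number of hierarchy levels (assumed: index 0 is the most privileged)


def priority_pairs(k):
    """Return every (winner, loser) pair where winner outranks loser."""
    return list(combinations(range(k), 2))


def macro_adherence(outcomes):
    """Unweighted mean of per-pair adherence rates (assumed metric definition).

    `outcomes` maps a (winner, loser) pair to a list of booleans, one per
    test scenario, where True means the model followed the winner level.
    Pairs with no scenarios are skipped.
    """
    rates = [sum(results) / len(results) for results in outcomes.values() if results]
    return sum(rates) / len(rates)


pairs = priority_pairs(K)
print(len(pairs))  # 5 choose 2 = 10 pairwise relations
```

Averaging per pair rather than per scenario keeps a heavily sampled pair from dominating the score, which is the usual reason for reporting a macro average.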
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes a k-level (k=5) instruction hierarchy problem with ten pairwise priority relations, introduces Gravity-Weighted DPO (GW-DPO) whose per-sample offset scales with structural distance under linear or bilateral schedules (the latter weighting both gap and victim privilege), and claims that GW-DPO with the bilateral schedule plus hierarchy-specific delimiters and Instructional Segment Embeddings (ISE) Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct by raising macro pairwise priority adherence while halving the over-refusal rate; ablations are said to isolate ISE as a refusal-threshold calibrator and frame five- versus three-level training as a generality-specialization tradeoff.
Significance. If the empirical results hold under proper controls, the work supplies a concrete preference-optimization method for enforcing trust-differentiated instruction hierarchies, directly targeting prompt-injection vulnerabilities that arise from uniform token privilege; the bilateral schedule is a distinctive technical contribution that couples gap size with victim-level privilege, and the reported ablations on ISE and hierarchy depth provide useful empirical insight into the associated tradeoffs.
major comments (2)
- [Abstract] Abstract: the central claim of Pareto improvement and halved over-refusal is stated without dataset details, statistical tests, error bars, or complete ablation controls, which are required to substantiate the reported gains on macro pairwise priority adherence.
- [Abstract] Abstract: the bilateral weighting schedule is constructed directly from the internal 5-level structural distances; because both the training objective and the evaluation metric are defined over the same synthetic hierarchy, the reported improvements are consistent with the construction but do not demonstrate that the distances correctly encode real-world priority severities in prompt-injection or multi-source instruction data.
minor comments (1)
- [Abstract] Abstract: the citations to Chen et al., 2025 and Wu et al., 2025 should be expanded with full bibliographic details once the references section is examined.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the manuscript's content and indicating revisions where they strengthen the presentation.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of Pareto improvement and halved over-refusal is stated without dataset details, statistical tests, error bars, or complete ablation controls, which are required to substantiate the reported gains on macro pairwise priority adherence.
Authors: We agree the abstract is highly condensed and omits supporting details. The full manuscript reports the training set size and composition in Section 3.2, presents all main results with error bars across three random seeds in Table 2, includes paired statistical tests (p < 0.05) in Appendix C, and provides complete ablations isolating ISE and hierarchy depth in Section 4. To make the central claim more self-contained, we will expand the abstract by one sentence referencing dataset scale and statistical significance while remaining within length limits. revision: yes
-
Referee: [Abstract] Abstract: the bilateral weighting schedule is constructed directly from the internal 5-level structural distances; because both the training objective and the evaluation metric are defined over the same synthetic hierarchy, the reported improvements are consistent with the construction but do not demonstrate that the distances correctly encode real-world priority severities in prompt-injection or multi-source instruction data.
Authors: The five-level hierarchy and its ten pairwise relations are explicitly motivated by documented trust distinctions in prompt-injection and multi-source instruction scenarios (Section 2). The bilateral schedule is designed to reflect both gap size and victim privilege precisely because these factors determine practical severity in such attacks. The synthetic construction enables exhaustive, controlled measurement of all ten relations, which would be infeasible on naturalistic data. We acknowledge that the work does not include direct empirical mapping to external real-world corpora and will add an explicit limitations paragraph stating this scope. The primary contribution remains the GW-DPO objective and its empirical isolation of the bilateral schedule's effect under controlled conditions. revision: partial
Circularity Check
No significant circularity; hierarchy and weighting are explicit design choices, not reductions to inputs
full rationale
The paper formalizes a k-level hierarchy (instantiated at k=5) and defines GW-DPO offsets explicitly from structural distances under linear/bilateral schedules. Both the training objective and the macro pairwise adherence metric operate over this same synthetic construction, but this is a standard problem-definition-plus-solution pattern rather than a circular reduction: the claimed Pareto improvement is an empirical outcome on the defined data, not forced by re-labeling fitted parameters or by self-citation chains. No equations are shown that equate the reported gains to the input distances by construction, and the cited delimiter/ISE components come from external 2025 works. The derivation therefore remains self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear or bilateral schedule
axioms (1)
- domain assumption: LLMs can be trained via preference optimization to enforce explicit pairwise priority relations across instruction levels
Reference graph
Works this paper leans on
-
[1]
In Advances in Neural Information Processing Systems, volume 37, pages 136037–136083
Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems, volume 37, pages 136037–136083. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. Kenneth J. Biba. ...
-
[2]
In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, pages 79–90, New York, NY, USA
Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, pages 79–90, New York, NY, USA. Association for Computing Machinery. Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, a...
-
[3]
sDPO: Don’t use your data all at once. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 366–373, Abu Dhabi, UAE. Association for Computational Linguistics. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations. R. Thom...
-
[4]
Ignore Previous Prompt: Attack Techniques For Language Models
Enhancing alignment using curriculum learning & ranked preferences. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12891–12907, Miami, Florida, USA. Association for Computational Linguistics. Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.095...
-
[5]
Mark Russinovich, Ahmed Salem, and Ronen Eldan
Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741. Mark Russinovich, Ahmed Salem, and Ronen Eldan
-
[6]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Great, now write an article about that: The crescendo Multi-Turn LLM jailbreak attack. In 34th USENIX Security Symposium (USENIX Security 25), pages 2421–2440. Paul Röttger, Hannah R. Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. Pr...
-
[7]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Instructional segment embedding: Improving LLM safety with instruction hierarchy. In International Conference on Learning Representations. Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Lingui...
-
[8]
Sampling a base row from the filtered evaluation pool
-
[9]
Classifying its domain
-
[10]
Gathering candidate materials: 2–3 L0 rules from the relevant category, 2–3 domain-matched L1 prompts, 2–3 injection templates (safety-targeting for L0-victim pairs, non-safety otherwise), 2–3 domain-matched L4 candidates, and L2 attribute/value options
-
[11]
Sending everything to GPT-4o (temperature 0.7) with a structured prompt that names the conflict pair, identifies which level must win and which must lose, and requires that all five levels are topically coherent and that evaluation_criteria are automatically checkable
-
[12]
genuine understanding test
Receiving a structured JSON response with the selected/adapted L0–L4 content, a conflict_description, correct_behavior, violation_behavior, and a list of evaluation_criteria. Malformed JSON or validation failures (missing fields, empty criteria) trigger up to two retries with a +0.1 temperature bump per retry; after three failures the scenario is disc...
2023
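The validate-and-retry policy in the final generation step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `call_model` is a hypothetical callable `(prompt, temperature) -> str` standing in for the GPT-4o request, and only the four field names quoted in the excerpt are checked, since the keys used for the L0–L4 content are not given.

```python
import json

# Field names quoted in the excerpt; keys for the L0-L4 content are not given there.
REQUIRED_FIELDS = (
    "conflict_description",
    "correct_behavior",
    "violation_behavior",
    "evaluation_criteria",
)


def validate(raw):
    """Return the parsed scenario dict, or None if malformed or incomplete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    if not data["evaluation_criteria"]:  # empty criteria count as a failure
        return None
    return data


def generate_scenario(call_model, prompt, base_temperature=0.7, max_retries=2, bump=0.1):
    """Try once plus up to `max_retries` retries, raising temperature each retry.

    `call_model` is a hypothetical (prompt, temperature) -> str callable.
    Returns the validated scenario, or None when every attempt fails and the
    scenario is discarded.
    """
    for attempt in range(max_retries + 1):
        temperature = base_temperature + bump * attempt
        scenario = validate(call_model(prompt, temperature))
        if scenario is not None:
            return scenario
    return None
```

With the defaults, the three attempts run at temperatures of roughly 0.7, 0.8, and 0.9, matching the "two retries, three failures" wording of the excerpt.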
-
[13]
by closing a Bradley-Terry preference likelihood (Bradley and Terry, 1952) in the policy itself, yielding the loss
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \cdot \Delta r_\theta(x, y_w, y_l) \right) \right], \tag{5}$$
where $\Delta r_\theta$ is the implicit reward margin defined in §5. The Bradley-Terry derivation guarantees only that $\Delta r_\theta$ becomes positive at the optimum: the chosen response is more likely...
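A minimal sketch of the per-sample DPO loss in Eq. (5), extended with a per-sample offset of the kind the abstract attributes to GW-DPO. Several things here are assumptions rather than the paper's definitions: the margin uses the standard DPO log-ratio difference (the excerpt defers its definition to §5), the offset is subtracted inside the sigmoid, and both `linear_offset` and `bilateral_offset` are hypothetical functional forms chosen only to match the verbal description (gap only, versus gap combined with victim privilege). Levels are indexed 0 to k-1 with 0 most privileged, and the victim is taken to be the higher-privilege level that must win.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Assumed margin: policy/reference log-ratio of chosen minus that of rejected."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def linear_offset(winner, loser, k=5, gamma=1.0):
    """Hypothetical linear schedule: offset grows with the normalized level gap."""
    return gamma * (loser - winner) / (k - 1)


def bilateral_offset(winner, loser, k=5, gamma=1.0):
    """Hypothetical bilateral schedule: gap term scaled by victim privilege."""
    gap = (loser - winner) / (k - 1)
    victim_privilege = (k - winner) / k  # 1.0 for level 0, smaller for lower levels
    return gamma * gap * victim_privilege


def dpo_loss(margin, beta=0.1, offset=0.0):
    """Per-sample loss -log sigma(beta * margin - offset).

    offset=0.0 recovers Eq. (5) for a single sample; the placement of the
    offset inside the sigmoid is an assumption.
    """
    return -log_sigmoid(beta * margin - offset)


# Same one-level gap, different victim level: only the bilateral form differs.
print(linear_offset(0, 1), linear_offset(3, 4))
print(bilateral_offset(0, 1), bilateral_offset(3, 4))
```

Under this form, a positive offset means a merely positive margin no longer minimizes the loss, so pairs with a larger offset are pushed toward a larger margin between chosen and rejected responses.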