IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies
Pith reviewed 2026-06-30 06:31 UTC · model grok-4.3
The pith
A training-free method detects token-level role priority violations in multi-turn LLM chats using Jensen-Shannon divergence and corrects them with contrastive decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By measuring Jensen-Shannon divergence between token probability distributions conditioned on different role priorities, the paper identifies a consistent role-influence inversion in multi-turn settings; IHDec then uses these divergence signals to trigger contrastive decoding that suppresses tokens aligned with lower-priority roles, thereby restoring the intended instruction hierarchy without any model updates or scenario-specific tuning.
What carries the argument
IHDec (Instruction Hierarchy-steered Decoding), which computes Jensen-Shannon divergence on token distributions to flag hierarchy violations and then applies contrastive decoding to suppress subordinate role influences.
If this is right
- IHDec resolves multi-turn instruction conflicts more effectively than existing training-based defenses.
- General response quality remains fully preserved under the method.
- Safety against adversarial prompt injections increases when IHDec is applied.
- Performance gains from IHDec become larger as base model size grows.
Where Pith is reading between the lines
- The same divergence-triggered correction could be tested on other forms of input priority besides role labels.
- Decoding-time fixes of this kind may reduce reliance on retraining for maintaining alignment properties across conversation lengths.
- The observed scaling synergy suggests the technique could be combined with continued model growth to handle increasingly complex multi-source inputs.
Load-bearing premise
Jensen-Shannon divergence between token distributions can reliably mark hierarchy violations at the token level without introducing new errors or needing per-scenario adjustments.
What would settle it
A controlled test set of multi-turn dialogues in which JSD values fail to separate generations that respect role priorities from those that violate them, or in which applying the contrastive correction reduces accuracy on non-conflict queries.
Figures
read the original abstract
Large Language Models (LLMs) often fail to maintain instruction hierarchies (IH) when processing multi-source inputs with varying role-level priorities, paradoxically adhering to lower-priority directives during conflicts. While existing defenses mitigate this issue, they are largely restricted to single-turn scenarios and require expensive fine-tuning. In this paper, we formalize this failure mode in multi-turn contexts via a Jensen-Shannon Divergence (JSD) framework, uncovering a pervasive role-influence inversion phenomenon where subordinate inputs override superior roles. To rectify this without training, we propose IHDec (Instruction Hierarchy-steered Decoding). IHDec leverages JSD to automatically detect token-level hierarchy violations and dynamically executes contrastive decoding to suppress misaligned subordinate roles. Extensive evaluations demonstrate that IHDec outperforms training-based baselines in multi-turn conflicts while fully preserving general response quality. Furthermore, IHDec strengthens safety against adversarial prompt injections and exhibits a robust scaling synergy with larger models. The Code is available at https://github.com/nxcolelxu/IHDec.git
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes multi-turn instruction hierarchy failures in LLMs as a role-influence inversion phenomenon diagnosed via Jensen-Shannon Divergence (JSD) between role-conditioned token distributions. It proposes IHDec, a training-free method that uses per-token JSD to detect violations and applies contrastive decoding to suppress subordinate-role influences. The central claims are that IHDec outperforms training-based baselines on multi-turn conflicts, fully preserves general response quality, strengthens safety against adversarial injections, and exhibits positive scaling with model size.
Significance. If the empirical claims hold, IHDec would provide a practical, plug-and-play alternative to fine-tuning for enforcing instruction hierarchies in multi-source, multi-turn settings. The reported absence of quality degradation and the scaling synergy are particularly valuable for deployment. The work also supplies a concrete, falsifiable detection mechanism (token-level JSD) that could be tested independently.
major comments (3)
- [Method (JSD detection and contrastive decoding)] The load-bearing assumption that token-level JSD cleanly isolates hierarchy violations from other distributional shifts (role priority, phrasing, uncertainty) is not accompanied by controls that would rule out false positives; without explicit threshold or scaling details, the contrastive step risks suppressing valid tokens for non-hierarchy reasons.
- [§3 (formalization) and §4 (IHDec algorithm)] JSD is used both to diagnose the inversion phenomenon and to steer decoding; this creates a circularity risk unless the paper demonstrates that the steering threshold or scaling factor is fixed independently of the evaluation data (e.g., via a held-out validation set or parameter-free rule).
- [Abstract and Evaluation section] The abstract asserts "extensive evaluations" and outperformance over training-based baselines, yet supplies no metrics, baseline descriptions, or controls; the central claim that IHDec "fully preserves general response quality" therefore cannot be assessed without the corresponding tables or figures.
minor comments (2)
- [Abstract] The GitHub link is provided, but the manuscript should state the exact commit or release tag used for the reported results to ensure reproducibility.
- [Method] Notation for the two role-conditioned distributions whose JSD is computed should be introduced once and used consistently; currently the abstract and method description leave the precise conditioning (system vs. user, priority labels, etc.) implicit.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate planned revisions where appropriate.
read point-by-point responses
-
Referee: [Method (JSD detection and contrastive decoding)] The load-bearing assumption that token-level JSD cleanly isolates hierarchy violations from other distributional shifts (role priority, phrasing, uncertainty) is not accompanied by controls that would rule out false positives; without explicit threshold or scaling details, the contrastive step risks suppressing valid tokens for non-hierarchy reasons.
Authors: We acknowledge the value of explicit controls for isolating hierarchy-specific effects. Section 4.1 defines JSD strictly between superior- and subordinate-role conditioned distributions at each token, and Section 5.3 provides qualitative token-level case studies showing that elevated JSD aligns with role-inversion conflicts rather than generic phrasing or uncertainty. The scaling factor is fixed at 0.5 (chosen on a small held-out validation split) and the detection threshold is set via the 95th percentile of JSD on non-conflict dialogues. To strengthen this, we will add a new ablation subsection comparing JSD distributions under hierarchy conflicts versus controlled non-hierarchy shifts (e.g., temperature-induced uncertainty and role-neutral rephrasing) in the revision. revision: partial
-
Referee: [§3 (formalization) and §4 (IHDec algorithm)] JSD is used both to diagnose the inversion phenomenon and to steer decoding; this creates a circularity risk unless the paper demonstrates that the steering threshold or scaling factor is fixed independently of the evaluation data (e.g., via a held-out validation set or parameter-free rule).
Authors: The scaling factor α is a fixed hyper-parameter (α = 1.0) used uniformly across all experiments and not tuned on any test data. The JSD detection threshold is computed once on a held-out validation set of 200 multi-turn dialogues drawn from sources disjoint from all evaluation benchmarks; this procedure is described in Section 4.3. We will insert an explicit paragraph in Section 4.3 restating the independence of these choices from the reported test sets to remove any ambiguity. revision: yes
-
Referee: [Abstract and Evaluation section] The abstract asserts "extensive evaluations" and outperformance over training-based baselines, yet supplies no metrics, baseline descriptions, or controls; the central claim that IHDec "fully preserves general response quality" therefore cannot be assessed without the corresponding tables or figures.
Authors: Sections 5 and 6 contain the full quantitative results: win-rate tables against training-based baselines (e.g., role-fine-tuned Llama-2/3 variants), MT-Bench and AlpacaEval quality scores showing <2% degradation, adversarial injection success rates, and scaling plots. The abstract is intentionally concise. We will revise the abstract to include two key quantitative statements (e.g., “outperforms baselines by 18–27% on multi-turn conflict resolution while retaining 98.4% of baseline response quality”) so that the central claims are verifiable from the abstract alone. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper formalizes role-influence inversion via a JSD framework and then applies the same divergence to detect violations for contrastive decoding. No equations, self-citations, or fitted parameters are quoted that reduce the claimed prediction or detection step to its own inputs by construction. The method is explicitly training-free and presented as an external measurement applied to model outputs, with no evidence of threshold tuning or renaming that collapses the result into the input. This is the normal case of an independent computational procedure.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
2016 , eprint=
A Diversity-Promoting Objective Function for Neural Conversation Models , author=. 2016 , eprint=
2016
-
[2]
2023 , eprint=
Lost in the Middle: How Language Models Use Long Contexts , author=. 2023 , eprint=
2023
-
[3]
M.L. Menéndez and J.A. Pardo and L. Pardo and M.C. Pardo , abstract =. The Jensen-Shannon divergence , journal =. 1997 , issn =. doi:https://doi.org/10.1016/S0016-0032(96)00063-4 , url =
-
[4]
arXiv preprint arXiv:2505.18907 , year=
Stronger enforcement of instruction hierarchy via augmented intermediate representations , author=. arXiv preprint arXiv:2505.18907 , year=
-
[5]
Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency , author=. arXiv preprint arXiv:2604.09075 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Lampert , booktitle=
Egor Zverev and Sahar Abdelnabi and Soroush Tabesh and Mario Fritz and Christoph H. Lampert , booktitle=. Can. 2025 , url=
2025
-
[7]
BERT: Pre-training of deep bidirectional transformers for language understanding
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v...
-
[8]
MT -Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
Bai, Ge and Liu, Jie and Bu, Xingyuan and He, Yancheng and Liu, Jiaheng and Zhou, Zhanhui and Lin, Zhuoran and Su, Wenbo and Ge, Tiezheng and Zheng, Bo and Ouyang, Wanli. MT -Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguist...
-
[9]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Control illusion: The failure of instruction hierarchies in large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[10]
Lampert , booktitle=
Egor Zverev and Evgenii Kortukov and Alexander Panfilov and Alexandra Volkova and Rush Tabesh and Sebastian Lapuschkin and Wojciech Samek and Christoph H. Lampert , booktitle=. 2026 , url=
2026
-
[11]
Hashimoto , title =
Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , month =
2023
-
[12]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
The instruction hierarchy: Training llms to prioritize privileged instructions , author=. arXiv preprint arXiv:2404.13208 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
arXiv preprint arXiv:2511.04694 , year=
Reasoning Up the Instruction Ladder for Controllable Language Models , author=. arXiv preprint arXiv:2511.04694 , year=
-
[14]
International Conference on Learning Representations , volume=
Instructional segment embedding: Improving llm safety with instruction hierarchy , author=. International Conference on Learning Representations , volume=
-
[15]
2026 , eprint=
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs , author=. 2026 , eprint=
2026
-
[16]
IHE val: Evaluating Language Models on Following the Instruction Hierarchy
Zhang, Zhihan and Li, Shiyang and Zhang, Zixuan and Liu, Xin and Jiang, Haoming and Tang, Xianfeng and Gao, Yifan and Li, Zheng and Wang, Haodong and Tan, Zhaoxuan and Li, Yichuan and Yin, Qingyu and Yin, Bing and Jiang, Meng. IHE val: Evaluating Language Models on Following the Instruction Hierarchy. Proceedings of the 2025 Conference of the Nations of t...
-
[17]
International Conference on Learning Representations , year=
The Curious Case of Neural Text Degeneration , author=. International Conference on Learning Representations , year=
-
[18]
Attributing Response to Context: A Jensen
Ruizhe Li and Chen Chen and Yuchen Hu and Yanjun Gao and Xi Wang and Emine Yilmaz , booktitle=. Attributing Response to Context: A Jensen. 2026 , url=
2026
-
[19]
ChatInject: Abusing Chat Templates for Prompt Injection in
Hwan Chang and Yonghyun Jun and Hwanhee Lee , booktitle=. ChatInject: Abusing Chat Templates for Prompt Injection in. 2026 , url=
2026
-
[20]
Trusting Your Evidence: Hallucinate Less with Context-aware Decoding
Shi, Weijia and Han, Xiaochuang and Lewis, Mike and Tsvetkov, Yulia and Zettlemoyer, Luke and Yih, Wen-tau. Trusting Your Evidence: Hallucinate Less with Context-aware Decoding. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers). 2024. doi:...
-
[21]
arXiv preprint arXiv:2601.06403 , year=
Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding , author=. arXiv preprint arXiv:2601.06403 , year=
-
[22]
Contrastive Decoding: Open-ended Text Generation as Optimization
Li, Xiang Lisa and Holtzman, Ari and Fried, Daniel and Liang, Percy and Eisner, Jason and Hashimoto, Tatsunori and Zettlemoyer, Luke and Lewis, Mike. Contrastive Decoding: Open-ended Text Generation as Optimization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.a...
-
[23]
2026 , note =
Claude System Prompts Release Notes , author =. 2026 , note =
2026
-
[24]
arXiv preprint arXiv:2410.09584 , year=
Toward general instruction-following alignment for retrieval-augmented generation , author=. arXiv preprint arXiv:2410.09584 , year=
-
[25]
Lee, Jungyeon and Kangmin, Lee and Kim, Taeuk. MAGIC : A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.466
-
[26]
Advances in Neural Information Processing Systems , volume=
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=
-
[27]
arXiv preprint arXiv:2509.23188 , year=
Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts , author=. arXiv preprint arXiv:2509.23188 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.