IHEval: Evaluating Language Models on Following the Instruction Hierarchy
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers:
KG-CFR decouples planning from execution via knowledge-grounded counterfactual reasoning, preventing critical degradation in over 95% of perturbed runs and raising argument quality from 0.694 to 0.822 in a 1v1v1 simulation.
COPAL reveals a 33.1% average error rate on composed-policy queries across nine LLM chatbots, showing that existing single-policy benchmarks miss common failures.
Position paper claiming that AI safety requires explicit runtime controllability and introducing ControlBench to demonstrate gaps in existing alignment methods.
citing papers explorer
- IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies
  IHDec applies JSD-steered contrastive decoding to enforce multi-turn instruction hierarchies in LLMs without fine-tuning.
- Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation
  KG-CFR decouples planning from execution via knowledge-grounded counterfactual reasoning, preventing critical degradation in over 95% of perturbed runs and raising argument quality from 0.694 to 0.822 in a 1v1v1 simulation.
- Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots
  COPAL reveals a 33.1% average error rate on composed-policy queries across nine LLM chatbots, showing that existing single-policy benchmarks miss common failures.
- Position: AI Safety Requires Effective Controllability
  Position paper claiming that AI safety requires explicit runtime controllability and introducing ControlBench to demonstrate gaps in existing alignment methods.