Understanding and mitigating over-refusal for large language models via safety representation
3 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
DataShield scores training samples by their contribution to increased LLM response compliance and filters high-risk ones using a compliance vector and layer-specific CAS metric.
Palette identifies refusal directions via multi-objective search, internalizes them through lightweight adaptation, and supports on-demand multi-domain authorization via independent learning and parameter merging.
citing papers explorer
- Provably Secure Agent Guardrail
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.
- Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
Palette identifies refusal directions via multi-objective search, internalizes them through lightweight adaptation, and supports on-demand multi-domain authorization via independent learning and parameter merging.