Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety
Pith reviewed 2026-07-03 20:36 UTC · model grok-4.3
The pith
An independent oversight model with four gates blocks multi-turn LLM jailbreaks by escalating any danger signal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Cognitive Firewall decomposes safety assessment into four gates—an intent gate, a zero-trust context gate, a consistency gate, and an output risk gate—whose decisions are combined through escalation rather than averaging. This allows any confident danger signal to block an interaction while preserving an auditable rationale. Experiments demonstrate that the approach lowers attack success rates to 2 percent or below on three jailbreak benchmarks and 14 percent on human-crafted attacks, with an 8 percent over-refusal rate on benign queries.
What carries the argument
The four categorical gates combined through escalation in the Cognitive Firewall framework.
If this is right
- Attack success drops to 2 percent or below on single-turn, multi-turn, and authority-based jailbreaks.
- Human-crafted attacks are reduced to 14 percent success.
- Over-refusal on benign queries stays at 8 percent.
- The system provides an auditable rationale for each blocked interaction.
Where Pith is reading between the lines
- Similar gate-based oversight could apply to other generative AI systems beyond language models.
- Placing safety logic in a separate model may allow easier updates without retraining the main LLM.
- Zero-trust verification of user claims could extend to detecting other forms of manipulation in AI conversations.
Load-bearing premise
The separate oversight model can accurately implement the four gates without itself being tricked into allowing harm or refusing too many safe requests.
What would settle it
A test showing the oversight model fails to block a significant portion of the jailbreaks or refuses more than 15 percent of benign queries would indicate the framework does not deliver the claimed protection.
Figures
read the original abstract
Large language models (LLMs) can be induced to produce harmful content through multi turn strategies in which no single user message appears clearly unsafe. Existing runtime safeguards commonly evaluate prompts or responses as isolated messages, which limits their ability to recover ac-cumulated intent, verify asserted authority, or detect harmful objectives decomposed across a dialogue. This paper presents the Cognitive Firewall, a proactive runtime oversight framework that interposes an independent oversight model between a user and a protected target mod l. The framework decomposes safety assessment into four categorical gates: an intent gate that identi-fies the operational objective of a request, a zero trust context gate that treats claimed roles and permissions as unverified evidence, a consistency gate that detects escalation and decomposition across turns, and an output risk gate that inspects candidate responses before release. Gate decisions are combined through escalation rather than score averaging, allowing any confident danger signal to block an interaction while preserving an auditable rationale. Experiments on four jailbreak benchmarks and a benign safety test set show that the Cognitive Firewall substantially reduces attack success across single turn, multi turn, authority based, and human crafted attacks. It lowers attack success to 2 percent or below on three attack sets and to 14 percent on the most difficult human crafted set, while maintaining an 8 percent over refusal rate. These results indicate that decomposed, conversation level oversight can improve proactive containment and auditability for LLM safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Cognitive Firewall, a proactive runtime oversight framework that interposes an independent oversight LLM between user and target model. Safety assessment is decomposed into four gates (intent identification, zero-trust context verification, cross-turn consistency detection, and output risk assessment) whose decisions are combined by escalation rather than averaging. Experiments on four jailbreak benchmarks plus a benign test set are reported to reduce attack success to ≤2% on three sets and 14% on the hardest human-crafted set, with an 8% over-refusal rate on benign queries.
Significance. If the reported attack-success reductions prove reproducible and the oversight model itself remains uncompromised, the decomposed, auditable gate structure would represent a meaningful advance over single-message classifiers for multi-turn and authority-based jailbreaks. The escalation mechanism and explicit rationale generation are positive features for explainability. The work supplies no machine-checked proofs, open code, or parameter-free derivations.
major comments (2)
- [Experiments] Experiments section: the headline empirical claim (attack success ≤2% on three benchmarks, 14% on human-crafted) is presented without any implementation details of the oversight model, choice of model family, baseline comparisons against existing guardrails, statistical significance tests, or error bars. This directly undermines verification of the central performance numbers.
- [Framework] Framework description (gates section): no mechanism is described for hardening or isolating the oversight model against the same multi-turn, authority-claim, or decomposition attacks that the framework targets. Because the reported results rest on the assumption that the four gates execute reliably, the absence of any such protection or separate evaluation of the oversight layer is load-bearing for the validity of the benchmark numbers.
minor comments (2)
- [Abstract] Abstract contains typographical errors ("mod l", "ac-cumulated").
- [Experiments] The benign safety test set and over-refusal metric are mentioned but not defined or sized in the provided text.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for improving the clarity and verifiability of the manuscript. We address each major comment below and commit to revisions that strengthen the presentation without altering the core claims.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the headline empirical claim (attack success ≤2% on three benchmarks, 14% on human-crafted) is presented without any implementation details of the oversight model, choice of model family, baseline comparisons against existing guardrails, statistical significance tests, or error bars. This directly undermines verification of the central performance numbers.
Authors: We agree that the current manuscript omits key implementation details, model specifications, baselines, and statistical analysis, which limits independent verification. In the revised version we will expand the Experiments section to specify the oversight model family and configuration, include comparisons against representative guardrail baselines, and report error bars together with appropriate statistical significance tests on the attack-success figures. revision: yes
-
Referee: [Framework] Framework description (gates section): no mechanism is described for hardening or isolating the oversight model against the same multi-turn, authority-claim, or decomposition attacks that the framework targets. Because the reported results rest on the assumption that the four gates execute reliably, the absence of any such protection or separate evaluation of the oversight layer is load-bearing for the validity of the benchmark numbers.
Authors: The framework treats the oversight model as an independent, trusted component whose outputs are combined via escalation; however, the manuscript does not describe hardening techniques or provide a separate robustness evaluation of that layer. We will revise the Framework and Limitations sections to state this assumption explicitly, outline possible isolation strategies (e.g., distinct model family or separate deployment), and note the absence of dedicated oversight-layer benchmarks as a limitation for future work. revision: partial
Circularity Check
No circularity; descriptive framework plus benchmark results
full rationale
The paper introduces a multi-gate oversight framework for LLM safety and reports empirical attack-success rates on jailbreak benchmarks. No equations, parameter fits, derivations, or self-citations appear in the provided text. The central claims rest on observed benchmark outcomes rather than any reduction of a 'prediction' or 'result' to its own inputs by construction. The reader's assessment of zero circularity is therefore confirmed; the work contains no load-bearing self-referential steps of the enumerated kinds.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
H. Inanet al., “Llama guard: LLM-based input-output safeguard for human-AI conversations,” arXiv preprint arXiv:2312.06674, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
ShieldGemma: Generative AI Content Moderation Based on Gemma
W. Zenget al., “ShieldGemma: Generative AI content moderation based on Gemma,” arXiv preprint arXiv:2407.21772, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs,
S. Hanet al., “WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs,” inAdv. Neural Inf. Process. Syst., 2024
2024
-
[4]
I. Padhiet al., “Granite guardian,” arXiv preprint arXiv:2412.07724, 2024
-
[5]
Training language models to follow instructions with human feedback,
L. Ouyanget al., “Training language models to follow instructions with human feedback,” inAdv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 27 730–27 744
2022
-
[6]
Constitutional AI: Harmlessness from AI Feedback
Y . Baiet al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Universal and Transferable Adversarial Attacks on Aligned Language Models
A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
M. Russinovich, A. Salem, and R. Eldan, “Great, now write an article about that: The Crescendo multi-turn LLM jailbreak attack,” inProc. 34th USENIX Security Symp., 2025, arXiv:2404.01833
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
LLMs know their vulnerabilities: Uncover safety gaps through natural distribution shifts,
Q. Renet al., “LLMs know their vulnerabilities: Uncover safety gaps through natural distribution shifts,” inProc. Annu. Meeting Assoc. Comput. Linguist. (ACL), 2025, pp. 24 763–24 785
2025
-
[10]
LLM defenses are not robust to multi-turn human jailbreaks yet,
N. Liet al., “LLM defenses are not robust to multi-turn human jailbreaks yet,” inNeurIPS Workshop on Red Teaming GenAI, 2024, arXiv:2408.15221
-
[11]
THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models
Z. Ma, Z. Xu, D. Yu, C. Kang, C. Li, and P. Liu, “THRD: A training- free multi-turn defense framework for jailbreak attacks on large language models,” arXiv preprint arXiv:2606.01738, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
Zero trust archi- tecture,
S. Rose, O. Borchert, S. Mitchell, and S. Connelly, “Zero trust archi- tecture,” National Institute of Standards and Technology, Gaithersburg, MD, Tech. Rep. NIST Special Publication 800-207, Aug. 2020
2020
-
[13]
Reforming artificial intelligence: A call for cognitive containment,
N. Amiri Golilarz, H. S. Al Khatib, and S. Rahimi, “Reforming artificial intelligence: A call for cognitive containment,” Preprints.org preprint 2025110867, 2025
2025
-
[14]
Bridging the gap: Toward cognitive autonomy in artificial intelligence,
N. Amiri Golilarz, S. Penchala, and S. Rahimi, “Bridging the gap: Toward cognitive autonomy in artificial intelligence,” arXiv preprint arXiv:2512.02280, 2025
-
[15]
Deliberative alignment: Reasoning enables safer language models,
M. Y . Guanet al., “Deliberative alignment: Reasoning enables safer language models,” arXiv preprint arXiv:2412.16339, 2024
-
[16]
Jailbroken: How does LLM safety training fail?
A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does LLM safety training fail?” inAdv. Neural Inf. Process. Syst., 2023
2023
-
[17]
A holistic approach to undesired content detection in the real world,
T. Markovet al., “A holistic approach to undesired content detection in the real world,” inProc. AAAI Conf. Artif. Intell., vol. 37, no. 12, 2023, pp. 15 009–15 018
2023
-
[18]
A. Grattafioriet al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails,
T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, and J. Cohen, “NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails,” inProc. EMNLP (System Demonstrations), 2023, pp. 431–445
2023
-
[20]
M. Sharmaet al., “Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming,” arXiv preprint arXiv:2501.18837, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Attacks, defenses and evaluations for LLM conversation safety: A survey,
Z. Dong, Z. Zhou, C. Yang, J. Shao, and Y . Qiao, “Attacks, defenses and evaluations for LLM conversation safety: A survey,” inProc. NAACL- HLT, 2024, pp. 6734–6747
2024
-
[22]
Intention analysis makes LLMs a good jailbreak defender,
Y . Zhang, L. Ding, L. Zhang, and D. Tao, “Intention analysis makes LLMs a good jailbreak defender,” inProc. COLING, 2025, pp. 2947– 2968
2025
-
[23]
SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner,
X. Wanget al., “SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner,” inProc. 34th USENIX Security Symp., 2025, arXiv:2406.05498
-
[24]
Bergeron: Combating adversarial attacks through a conscience-based alignment framework,
M. T. Pisanoet al., “Bergeron: Combating adversarial attacks through a conscience-based alignment framework,” arXiv preprint arXiv:2312.00029, 2023
-
[25]
AutoDefense: Multi-agent LLM defense against jailbreak attacks,
Y . Zeng, Y . Wu, X. Zhang, H. Wang, and Q. Wu, “AutoDefense: Multi-agent LLM defense against jailbreak attacks,” arXiv preprint arXiv:2403.04783, 2024
-
[26]
LlamaFire- wall: An open source guardrail system for building se- cure AI agents
S. Chennabasappaet al., “LlamaFirewall: An open source guardrail system for building secure AI agents,” arXiv preprint arXiv:2505.03574, 2025
-
[27]
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
X. Shenet al., “One turn too late: Response-aware defense against hidden malicious intent in multi-turn dialogue,” arXiv preprint arXiv:2605.05630, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[28]
K. Patil, “CivicShield: A cross-domain defense-in-depth framework for securing government-facing AI chatbots against multi-turn adversarial attacks,” arXiv preprint arXiv:2603.29062, 2026
-
[29]
Temporal context awareness: A defense framework against multi-turn manipulation attacks on large language models,
P. Kulkarni and A. Namer, “Temporal context awareness: A defense framework against multi-turn manipulation attacks on large language models,” inProc. IEEE Conf. Artif. Intell. (CAI), 2025, pp. 930–935
2025
-
[30]
Many-shot jailbreaking,
C. Anilet al., “Many-shot jailbreaking,” inAdv. Neural Inf. Process. Syst., 2024
2024
-
[31]
HarmBench: A standardized evaluation framework for automated red teaming and robust refusal,
M. Mazeikaet al., “HarmBench: A standardized evaluation framework for automated red teaming and robust refusal,” inProc. Int. Conf. Mach. Learn. (ICML), 2024, pp. 35 181–35 224
2024
-
[32]
Peak + accumulation: A proxy-level scoring formula for multi-turn LLM attack detection,
J. A. Corll, “Peak + accumulation: A proxy-level scoring formula for multi-turn LLM attack detection,” arXiv preprint arXiv:2602.11247, 2026
-
[33]
Defending large language models against jailbreaking attacks through goal prioriti- zation,
Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, “Defending large language models against jailbreaking attacks through goal prioriti- zation,” inProc. Annu. Meeting Assoc. Comput. Linguist. (ACL), 2024, pp. 8865–8887
2024
-
[34]
LLM self defense: By self examination, LLMs know they are being tricked,
M. Phuteet al., “LLM self defense: By self examination, LLMs know they are being tricked,” inProc. ICLR Tiny Papers Track, 2024
2024
-
[35]
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
P. Vergaet al., “Replacing judges with juries: Evaluating LLM genera- tions with a panel of diverse models,” arXiv preprint arXiv:2404.18796, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Branch-solve-merge improves large language model evaluation and generation,
S. Saha, O. Levy, A. Celikyilmaz, M. Bansal, J. Weston, and X. Li, “Branch-solve-merge improves large language model evaluation and generation,” inProc. NAACL-HLT, 2024, pp. 8352–8370
2024
-
[37]
GuardAgent: Safeguard LLM agents via knowledge- enabled reasoning,
Z. Xianget al., “GuardAgent: Safeguard LLM agents via knowledge- enabled reasoning,” inProc. Int. Conf. Mach. Learn. (ICML), 2025, pp. 68 316–68 342
2025
-
[38]
Scalable and transferable black-box jailbreaks for language models via persona modulation,
R. Shah, Q. Feuillade-Montixi, S. Pour, A. Tagade, S. Casper, and J. Rando, “Scalable and transferable black-box jailbreaks for language models via persona modulation,” arXiv preprint arXiv:2311.03348, 2023
-
[39]
“do anything now
X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, ““do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” inProc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2024, pp. 1671–1685
2024
-
[40]
Q. Lan and A. Kaul, “The cognitive firewall: Securing browser-based AI agents against indirect prompt injection via hybrid edge-cloud defense,” arXiv preprint arXiv:2603.23791, 2026
-
[41]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training LLMs to prioritize privileged in- structions,” arXiv preprint arXiv:2404.13208, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Control illusion: The failure of instruction hierarchies in large language models,
Y . Genget al., “Control illusion: The failure of instruction hierarchies in large language models,” inProc. AAAI Conf. Artif. Intell., vol. 40, no. 36, 2026, pp. 30 816–30 824
2026
-
[43]
Concrete Problems in AI Safety
D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” arXiv preprint arXiv:1606.06565, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[44]
G. Irving, P. Christiano, and D. Amodei, “AI safety via debate,” arXiv preprint arXiv:1805.00899, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[45]
Weak-to-strong generalization: Eliciting strong capabil- ities with weak supervision,
C. Burnset al., “Weak-to-strong generalization: Eliciting strong capabil- ities with weak supervision,” inProc. Int. Conf. Mach. Learn. (ICML), 2024
2024
-
[46]
JailbreakBench: An open robustness benchmark for jailbreaking large language models,
P. Chaoet al., “JailbreakBench: An open robustness benchmark for jailbreaking large language models,” inAdv. Neural Inf. Process. Syst., 2024
2024
-
[47]
XSTest: A test suite for identifying exaggerated safety behaviours in large language models,
P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “XSTest: A test suite for identifying exaggerated safety behaviours in large language models,” inProc. NAACL-HLT, 2024, pp. 5377–5400
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.