LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

Chaowei Xiao; Nanxi Li; Zhengyue Zhao

arxiv: 2605.17329 · v1 · pith:NT4RU4OFnew · submitted 2026-05-17 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-19 23:39 UTC · model grok-4.3

keywords: latent policy guardrail · dynamic safety policies · AI safety guardrails · latent deliberation · policy reasoning · efficiency · safety accuracy

The pith

Latent Policy Guardrail compresses safety deliberation into 10 latent tokens to reach 84.5% accuracy at much lower latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latent Policy Guardrail to handle dynamic safety policies specified at inference time by users or organizations. It learns to compress the internal deliberation for interpreting user intent and grounding it in policy rules into continuous latent states that are supervised only by decision-relevant semantics. At runtime the system outputs a compact verdict tied to the specific violated clauses rather than a full reasoning trace. A sympathetic reader would care because customized AI assistants require safety enforcement that adapts without retraining yet still stays fast enough for real deployment. LPG-4B demonstrates this balance by attaining 84.5% average safety accuracy and 77.9% F1 while running roughly 11 times faster than a comparable explicit-thinking model under single-sample evaluation.

Core claim

Latent Policy Guardrail learns semantic latent deliberation over dynamic policies. It compresses the internal steps needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics. At inference time the model produces only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency cost of explicit reasoning.

What carries the argument

Latent Policy Guardrail (LPG), which compresses deliberation for intent interpretation and policy grounding into continuous latent states supervised by decision-relevant semantics.

If this is right

Guardrails can adapt to changing user-specified or regulatory policies at inference time without any retraining.
Safety decisions remain auditable because each verdict is explicitly linked to the exact policy clauses that were violated.
Inference latency drops substantially, with the 4B model running about 11 times faster than explicit-thinking baselines while maintaining higher accuracy than the strongest dynamic baseline.
The same latent-compression technique can be applied to other benchmarks that test policy guardrails under dynamic conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the latent states prove sufficient for nuanced policies, similar compression could be tested in real-time compliance systems outside safety, such as contract review or regulatory reporting.
Varying the number of latent tokens could reveal a clear speed-accuracy curve that future systems could tune per deployment context.
The approach opens the possibility of guardrails that ingest live policy updates from external sources and immediately reflect them without model updates.

Load-bearing premise

Continuous latent states supervised only by decision-relevant semantics can faithfully capture the intent interpretation and policy grounding required for complex, user-specified safety policies without explicit reasoning steps.

What would settle it

A test set of policy scenarios whose correct judgment requires step-by-step reasoning that cannot be recovered from 10 continuous latent tokens would show whether the compression approach holds.

Figures

Figures reproduced from arXiv: 2605.17329 by Chaowei Xiao, Nanxi Li, Zhengyue Zhao.

**Figure 1.** Policy grounding probes. Left: accuracy on the full set with vs. without the policy. Right: counterfactual flip rate on the unsafe subset, i.e. the fraction of samples whose verdict correctly flips to safe after only the violated rule(s) are removed; higher means the model anchors on the specific violated clause. Adjacent table fragment: Model, Acc (%), CR (%), Orig., Shuf., mean±std; Qwen3-4B 75.60 72.83 79.64±40.27; Qwen3-4B (Think) 77.25 77.68 89.…

**Figure 2.** Overview of LPG. The student compresses intent analysis and policy analysis into two …

**Figure 3.** Ablation studies. (a) Loss-component ablation: GuardSet-X Safety Accuracy when each term (or distillation sub-loss) is removed from Eq. 5; dotted line marks the full model. (b) Latent-token capacity sweep on DynaBench; the red diamond is the explicit-reasoning baseline.

… 9B falling to 34.72 and LlamaGuard3-8B to 59.52 on PolicyGuardBench, whereas LPG remains robust because it reasons directly over user-supp…
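
The counterfactual flip rate described in the Figure 1 caption reduces to a simple ratio. The sketch below is one way to compute it, assuming a hypothetical `judge(prompt, policy)` callable that returns "safe" or "unsafe" and samples that record which rule indices were violated. The function names, the sample layout, and the toy keyword judge are illustrative assumptions, not the paper's code.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical judge signature: (prompt, policy rules) -> "safe" | "unsafe".
Judge = Callable[[str, Sequence[str]], str]


def counterfactual_flip_rate(judge: Judge, unsafe_samples: List[Dict]) -> float:
    """Fraction of unsafe samples whose verdict becomes "safe" once only
    the violated rule(s) are removed from the policy."""
    if not unsafe_samples:
        raise ValueError("need at least one unsafe sample")
    flips = 0
    for sample in unsafe_samples:
        violated = set(sample["violated"])  # indices of the violated rules
        reduced = [r for i, r in enumerate(sample["policy"]) if i not in violated]
        if judge(sample["prompt"], reduced) == "safe":
            flips += 1
    return flips / len(unsafe_samples)


# Toy keyword judge, for illustration only (not a real guardrail).
def toy_judge(prompt: str, policy: Sequence[str]) -> str:
    words = prompt.lower().split()
    return "unsafe" if any(rule in words for rule in policy) else "safe"


samples = [
    {"prompt": "how to buy drugs", "policy": ["drugs", "weapons"], "violated": [0]},
    {"prompt": "sell weapons and drugs", "policy": ["drugs", "weapons"], "violated": [1]},
]
rate = counterfactual_flip_rate(toy_judge, samples)
```

In this toy setup the second sample is mislabeled on purpose: it violates both rules but lists only one, so removing the listed rule does not flip the verdict. A judge that keys on the full policy rather than the annotated clause would show the same pattern.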

Original abstract

Guardrails are a critical safety layer for modern AI systems, but their operating regime is changing. As LLMs are deployed as customized assistants, safety policies are increasingly specified at inference time by users, organizations, or regulatory contexts. This makes safety enforcement fundamentally dynamic: the guardrail should adapt to changing safety policies without retraining. Yet this requirement creates a fundamental tension: faithfully judging complex policy contexts demands reasoning capability, while practical deployment requires low-latency responses. We introduce Latent Policy Guardrail (LPG), a guardrail framework that learns semantic latent deliberation over dynamic policies. LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics. At inference time, it generates only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency of explicit reasoning. Across policy guardrail benchmarks, LPG-4B reaches 84.5% average safety accuracy and 77.9% F1 by compressing deliberation into just 10 latent tokens, outperforming the strongest dynamic baseline while running roughly 11 times faster than Qwen3-4B-Thinking under the single-sample evaluation setup. Code and data are available at https://github.com/SaFo-Lab/Latent_Policy_Guard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LPG shows you can hit decent safety numbers with 10 latent tokens instead of full reasoning, but it's unclear if the compression actually preserves logical policy chaining on hard cases.

The main thing to know is that this paper introduces a latent compression trick for dynamic policy guardrails: instead of running explicit token-by-token deliberation, LPG squeezes the needed intent interpretation and policy grounding into just 10 continuous latent states supervised by decision semantics, then spits out a verdict tied to the violated clauses. On their benchmarks the 4B version reaches 84.5% average safety accuracy and 77.9% F1 while running roughly 11 times faster than Qwen3-4B-Thinking under single-sample eval. They also ship code and data, which is useful for anyone who wants to poke at it directly.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Latent Policy Guardrail (LPG), a framework that compresses deliberation for dynamic, user-specified safety policies into a small number (10) of continuous latent tokens supervised by decision-relevant semantics. LPG-4B is reported to achieve 84.5% average safety accuracy and 77.9% F1 on policy guardrail benchmarks while running approximately 11 times faster than Qwen3-4B-Thinking under single-sample evaluation, with code and data released.

Significance. If the latent states preserve faithful policy reasoning rather than surface correlations, the work would meaningfully advance low-latency, auditable guardrails for inference-time policies. The reproducibility artifacts are a clear strength. However, the practical impact is limited by the unresolved question of whether 10 continuous tokens can reliably encode intent interpretation and multi-clause logical grounding without explicit reasoning steps.

major comments (3)

[§3.2 and §4.1] The supervision loss (verdict prediction plus semantic alignment) contains no explicit term or regularization for intermediate logical steps such as conjunctions, negations, or conditional nesting. This directly bears on the central claim that the latent states perform 'policy reasoning' rather than learning benchmark-specific correlations.
[Table 2 and §5.2] The 84.5% accuracy and 77.9% F1 figures are presented without reported standard deviations, number of runs, or statistical significance tests against the strongest dynamic baseline. The post-hoc selection of exactly 10 latent tokens also lacks an ablation showing robustness to this hyperparameter.
[§5.3] The 11× speedup is measured under a single-sample evaluation setup. No results are provided for batched inference or for policies whose complexity would force the model to use more than 10 tokens, undermining the efficiency claim for realistic deployment.
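
The batched-versus-single-sample concern in the last comment comes down to how per-sample latency is measured. The sketch below shows one possible harness, assuming a hypothetical `run_batch` callable that stands in for a guardrail model. Neither it nor the helper names come from the paper or its repository, and the injectable `clock` is there only so the measurement can be tested deterministically.

```python
import statistics
import time
from typing import Callable, List, Sequence


def per_sample_latency(run_batch: Callable[[Sequence[str]], object],
                       prompts: Sequence[str],
                       batch_size: int,
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Median seconds per sample when prompts are served in batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    per_sample: List[float] = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        t0 = clock()
        run_batch(batch)  # one model call for the whole batch
        per_sample.append((clock() - t0) / len(batch))
    return statistics.median(per_sample)


def speedup(baseline_latency: float, candidate_latency: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_latency / candidate_latency


# Stub model call, for illustration only.
def fake_guardrail(batch: Sequence[str]) -> List[str]:
    return ["safe" for _ in batch]


prompts = ["p1", "p2", "p3", "p4"]
single = per_sample_latency(fake_guardrail, prompts, batch_size=1)
batched = per_sample_latency(fake_guardrail, prompts, batch_size=4)
```

With `batch_size=1` the number is a request latency; with larger batches it is wall time divided by batch size, which is closer to a throughput measure. A speedup reported under one setting need not carry over to the other, which is the referee's point.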

minor comments (2)

[Abstract] 'learnssemantic' is a typographical error and should read 'learns semantic'.
[§2] The related-work discussion of dynamic guardrails is brief; adding one or two sentences contrasting LPG with recent latent-reasoning or chain-of-thought compression methods would improve context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and the interpretation of our latent reasoning claims. We address each major comment below.

Referee: [§3.2 and §4.1] The supervision loss (verdict prediction plus semantic alignment) contains no explicit term or regularization for intermediate logical steps such as conjunctions, negations, or conditional nesting. This directly bears on the central claim that the latent states perform 'policy reasoning' rather than learning benchmark-specific correlations.

Authors: We appreciate the referee's focus on this point. Our supervision objective prioritizes verdict accuracy and alignment to decision-relevant semantic embeddings precisely to encourage the latent states to capture the semantics needed for policy grounding, including logical structure. While we do not include an explicit regularization term for operators such as negation or conjunction, the training on diverse, multi-clause policies leads the model to internalize these relations implicitly, as shown by LPG's gains on benchmarks containing such constructs. In the revision we will expand the discussion in §4.1 with qualitative analysis of how the learned latents align with logical components of the policies and add supporting examples.
revision: partial

Referee: [Table 2 and §5.2] The 84.5% accuracy and 77.9% F1 figures are presented without reported standard deviations, number of runs, or statistical significance tests against the strongest dynamic baseline. The post-hoc selection of exactly 10 latent tokens also lacks an ablation showing robustness to this hyperparameter.

Authors: We agree that additional statistical reporting and hyperparameter analysis would strengthen the results. We will rerun all experiments with at least five random seeds, report means and standard deviations in the updated Table 2, and include statistical significance tests (paired t-tests) against the strongest dynamic baseline. We will also add an ablation varying the number of latent tokens (5, 10, 15, 20) to demonstrate robustness of the chosen value. These updates will appear in the revised §5.2 and Table 2.
revision: yes

Referee: [§5.3] The 11× speedup is measured under a single-sample evaluation setup. No results are provided for batched inference or for policies whose complexity would force the model to use more than 10 tokens, undermining the efficiency claim for realistic deployment.

Authors: The referee correctly notes that the speedup comparison uses single-sample decoding. We will add batched inference results (batch sizes 4, 8, and 16) to §5.3 to show that the latency advantage persists under realistic serving conditions. For policies exceeding the complexity handled by 10 tokens, the framework allows increasing the latent budget as a direct extension; we will discuss this trade-off and report preliminary scaling results in the revision.
revision: partial
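
The seed-level reporting promised in the second response (means, standard deviations, and a paired t-test over at least five seeds) can be sketched with the standard library alone. The helper names are illustrative, scores are assumed to be paired by seed, and the per-seed numbers are made-up placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic for per-seed scores a vs. b, plus degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("zero variance in differences; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-seed accuracies (invented for illustration, NOT paper results).
model_a = [0.84, 0.85, 0.83, 0.86, 0.85]
model_b = [0.80, 0.82, 0.81, 0.82, 0.80]
t_stat, dof = paired_t_statistic(model_a, model_b)
```

With five seeds there are 4 degrees of freedom, for which the two-sided 5% critical value is 2.776; a |t| above that would be reported as significant. If SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with a p-value.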

Circularity Check

0 steps flagged

No significant circularity: empirical performance claims rest on benchmark comparisons without reducing to self-defined fits or self-citation chains

full rationale

The LPG framework is introduced as a method to compress deliberation into 10 latent tokens supervised by decision-relevant semantics, with claims supported by reported accuracy (84.5%) and F1 (77.9%) on policy guardrail benchmarks, plus speed comparisons to baselines like Qwen3-4B-Thinking. No equations or derivations are presented that equate outputs to inputs by construction, nor are there load-bearing self-citations to prior uniqueness theorems or ansatzes from the same authors. The central thesis relies on external empirical evaluation rather than internal redefinition or fitted-parameter renaming, making the derivation self-contained against the benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the domain assumption that latent states can substitute for explicit policy reasoning; the 10-token count is a design choice whose justification is not detailed in the abstract.

free parameters (1)

number of latent tokens
Fixed at 10 to balance efficiency and accuracy; value chosen rather than derived.

axioms (1)

domain assumption: Latent continuous states supervised by decision-relevant semantics suffice to capture intent interpretation and policy grounding for dynamic safety policies.
Core premise enabling the compression claim.

invented entities (1)

Latent Policy Guardrail (LPG) · no independent evidence
purpose: Framework that performs semantic latent deliberation over dynamic policies.
Newly introduced architecture.

pith-pipeline@v0.9.0 · 5756 in / 996 out tokens · 23848 ms · 2026-05-19T23:39:01.182484+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We compress these two stages into continuous latent representations... stage-aligned latent slots... semantic-content supervision via teacher-summary reconstruction"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 14 internal anchors

[1] Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson, Jonathan Herzig, Lior Shani, and Idan Szpektor. Latent reasoning with supervised thinking states. arXiv preprint arXiv:2602.08332, 2026.
[2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
[3] Jasmine Cui and Charles Ye. Emergent search and backtracking in latent reasoning models. arXiv preprint arXiv:2602.08100, 2026.
[4] Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, et al. Llama Guard 3-1B-INT4: Compact and efficient safeguard for human-AI conversations. arXiv preprint arXiv:2411.17713, 2024.
[5] Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. arXiv preprint arXiv:2501.09004, 2025.
[6] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv preprint arXiv:2406.18495, 2024. NeurIPS 2024.
[7] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024. COLM 2025.
[8] Zhenghao He, Guangzhi Xiong, Bohan Liu, Sanchit Sinha, and Aidong Zhang. Reasoning beyond chain-of-thought: A latent computational mode in large language models. arXiv preprint arXiv:2601.08058, 2026.
[9] Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, and Tom Goldstein. DynaGuard: A dynamic guardian model with user-defined policies. arXiv preprint arXiv:2509.02563, 2025.
[10] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[11] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
[12] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023. NeurIPS 2023.
[13] Mintong Kang and Bo Li. R2-Guard: Robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. arXiv preprint arXiv:2407.05557, 2024.
[14] Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, and Maarten Sap. PolyGuard: A multilingual safety moderation tool for 17 languages. arXiv preprint arXiv:2504.04377, 2025.
[15] Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, and Sydney Levine. SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior. arXiv preprint arXiv:2410.16665, 2024.
[16] Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024. ACL 2024 Findings.
[17] Nanxi Li, Zhengyue Zhao, G Edward Suh, Marco Pavone, and Chaowei Xiao. Prism: Robust VLM alignment with principled reasoning for integrated safety in multimodality. arXiv preprint arXiv:2508.18649, 2025.
[18] Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717, 2025.
[19] Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, and Yitong Yang. Yufeng-xguard: A reasoning-centric, interpretable, and flexible guardrail model for large language models. arXiv preprint arXiv:2601.15588, 2026.
[20] Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4694–4702, 2023.
[21] Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. AgentDoG: A diagnostic guardrail framework for AI agent safety and security. arXiv preprint arXiv:2601.18491, 2026.
[22] Yue Liu, Hongcheng Gao, Shengfang Zhai, Yufei He, Jun Xia, Zhengyu Hu, Yulin Chen, Xihong Yang, Jiaheng Zhang, Stan Z. Li, Hui Xiong, and Bryan Hooi. GuardReasoner: Towards reasoning-based LLM safeguards. arXiv preprint arXiv:2501.18492, 2025.
[23] Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, and Chaowei Xiao. AGrail: A lifelong agent guardrail with effective and adaptive safety detection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8104–8139, 2025.
[24] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024. ICML 2024.
[25] Mohammad Niknazar, Paul V Haley, Latha Ramanan, Sang T Truong, Yedendra Shrinivasan, Ayan Kumar Bhowmick, Prasenjit Dey, Ashish Jagmohan, Hema Maheshwari, Shom Ponoth, et al. Building a domain-specific guardrail model in production. arXiv preprint arXiv:2408.01452, 2024.
[26] Syed Arman Rabbani, Mohamed El-Tanani, Shrestha Sharma, Syed Salman Rabbani, Yahia El-Tanani, Rakesh Kumar, and Manita Saini. Generative artificial intelligence in healthcare: applications, implementation challenges, and future directions. BioMedInformatics, 5(3):37, 2025.
[27] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.
[28] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing chain-of-thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 677–693, 2025.
[29] Weiming Song, Xuan Xie, and Ruiping Yin. Aisa: Awakening intrinsic safety awareness in large language models against jailbreak attacks. arXiv preprint arXiv:2602.13547, 2026.
[30] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.
[31] Lin Wang, Junfeng Fang, Dan Zhang, Fei Shen, Xiang Wang, and Tat-Seng Chua. DRAFT: Task decoupled latent reasoning for agent safety. arXiv preprint arXiv:2604.03242, 2026.
[32] Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu. System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts. arXiv preprint arXiv:2505.18962, 2025.
[33] Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, and Muhao Chen. Towards policy-compliant agents: Learning efficient guardrails for policy violation detection. arXiv preprint arXiv:2510.03485, 2025.
[34] Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. ThinkGuard: Deliberative slow thinking leads to cautious guardrails. arXiv preprint arXiv:2502.13458, 2025. ACL 2025.
[35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[36] Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Yingcheng Wu, et al. LatentChem: From textual CoT to latent thinking in chemical reasoning. arXiv preprint arXiv:2602.07075, 2026.
[37] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Sellars, Thomas Mesnard, and Yashvi Jain. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772, 2024.
[38] Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3Guard technical report. arXiv preprint arXiv:2510.14276, 2025.
[39] Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, and Chaowei Xiao. Armor: Aligning secure and safe large language models via meticulous reasoning. arXiv preprint arXiv:2507.11500, 2025.

Supplementary material fragments (extracted alongside the references)

Supplementary Material, A Experimental Setup, A.1 Baseline Models: We evaluate the following baseline models: Static Policy Baselines. Llama Guard 3 [4...
Policy rule fragment: "Do not give instructions for acquiring, manufacturing, or using illegal drugs, controlled substances, or prohibited weapons"

[40] Decoder path (for explicit verdict output): <eot> followed by the <Output> block with the JSON verdict. Standard next-token cross-entropy labels are applied.
[41] Reference path (for teacher supervision): the full sequence of question + explicit reasoning + output. Labels mask the question prefix (set to −100) so that only the reasoning and verdict tokens contribute to L_ref. Additionally, stage boundary positions (the token indices of </Intent> and </Risk> in the reference sequence) are precomputed for the stage-bounda...
[42] Prompt encoding: The question is tokenized and appended with <bot>. A forward pass through the base LM produces the final hidden state at the prompt boundary.
[43] Latent rollout: The hidden state is projected via the MLP projector and fed back as input for m1 = 4 latent steps (intent stage), followed by m2 = 6 latent steps (risk stage). Each step uses KV-cache for efficient incremental computation. No discrete tokens are generated during this phase.
[44] Verdict generation: The <eot> token embedding is appended, and the model autoregressively emits the compact verdict string, one of “safe”, “unsafe, policyn”, or “unsafe, policyn1, n2, . . .”
[45] Verdict extraction: A deterministic regex parser maps the compact string to (y, P∗). (Note: although the teacher reasoning collected at training time uses a JSON <Output> block, the student is trained on, and emits, the compact form, which is shorter and equally parseable.) Special tokens ([PAD], <bot>, <eot>) are suppressed during verdict generation by se...

[1] Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson, Jonathan Herzig, Lior Shani, and Idan Szpektor. Latent reasoning with supervised thinking states. arXiv preprint arXiv:2602.08332, 2026.

[2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

[3] Jasmine Cui and Charles Ye. Emergent search and backtracking in latent reasoning models. arXiv preprint arXiv:2602.08100, 2026.

[4] Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, et al. Llama Guard 3-1B-INT4: Compact and efficient safeguard for human-AI conversations. arXiv preprint arXiv:2411.17713, 2024.

[5] Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. arXiv preprint arXiv:2501.09004, 2025.

[6] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv preprint arXiv:2406.18495, 2024. NeurIPS 2024.

[7] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024. COLM 2025.

[8] Zhenghao He, Guangzhi Xiong, Bohan Liu, Sanchit Sinha, and Aidong Zhang. Reasoning beyond chain-of-thought: A latent computational mode in large language models. arXiv preprint arXiv:2601.08058, 2026.

[9] Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, and Tom Goldstein. DynaGuard: A dynamic guardian model with user-defined policies. arXiv preprint arXiv:2509.02563, 2025.

[10] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[11] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.

[12] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023. NeurIPS 2023.

[13] Mintong Kang and Bo Li. R2-Guard: Robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. arXiv preprint arXiv:2407.05557, 2024.

[14] Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, and Maarten Sap. PolyGuard: A multilingual safety moderation tool for 17 languages. arXiv preprint arXiv:2504.04377, 2025.

[15] Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, and Sydney Levine. SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior. arXiv preprint arXiv:2410.16665, 2024.

[16] Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024. ACL 2024 Findings.

[17] Nanxi Li, Zhengyue Zhao, G Edward Suh, Marco Pavone, and Chaowei Xiao. Prism: Robust VLM alignment with principled reasoning for integrated safety in multimodality. arXiv preprint arXiv:2508.18649, 2025.

[18] Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717, 2025.

[19] Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, and Yitong Yang. Yufeng-xguard: A reasoning-centric, interpretable, and flexible guardrail model for large language models. arXiv preprint arXiv:2601.15588, 2026.

[20] Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4694–4702, 2023.

[21] Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. AgentDoG: A diagnostic guardrail framework for AI agent safety and security. arXiv preprint arXiv:2601.18491, 2026.

[22] Yue Liu, Hongcheng Gao, Shengfang Zhai, Yufei He, Jun Xia, Zhengyu Hu, Yulin Chen, Xihong Yang, Jiaheng Zhang, Stan Z. Li, Hui Xiong, and Bryan Hooi. GuardReasoner: Towards reasoning-based LLM safeguards. arXiv preprint arXiv:2501.18492, 2025.

[23] Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, and Chaowei Xiao. AGrail: A lifelong agent guardrail with effective and adaptive safety detection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8104–8139, 2025.

[24] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024. ICML 2024.

[25] Mohammad Niknazar, Paul V Haley, Latha Ramanan, Sang T Truong, Yedendra Shrinivasan, Ayan Kumar Bhowmick, Prasenjit Dey, Ashish Jagmohan, Hema Maheshwari, Shom Ponoth, et al. Building a domain-specific guardrail model in production. arXiv preprint arXiv:2408.01452, 2024.

[26] Syed Arman Rabbani, Mohamed El-Tanani, Shrestha Sharma, Syed Salman Rabbani, Yahia El-Tanani, Rakesh Kumar, and Manita Saini. Generative artificial intelligence in healthcare: applications, implementation challenges, and future directions. BioMedInformatics, 5(3):37, 2025.

[27] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.

[28] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing chain-of-thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 677–693, 2025.

[29] Weiming Song, Xuan Xie, and Ruiping Yin. Aisa: Awakening intrinsic safety awareness in large language models against jailbreak attacks. arXiv preprint arXiv:2602.13547, 2026.

[30] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

[31] Lin Wang, Junfeng Fang, Dan Zhang, Fei Shen, Xiang Wang, and Tat-Seng Chua. DRAFT: Task decoupled latent reasoning for agent safety. arXiv preprint arXiv:2604.03242, 2026.

[32] Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu. System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts. arXiv preprint arXiv:2505.18962, 2025.

[33] Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, and Muhao Chen. Towards policy-compliant agents: Learning efficient guardrails for policy violation detection. arXiv preprint arXiv:2510.03485, 2025.

[34] Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. ThinkGuard: Deliberative slow thinking leads to cautious guardrails. arXiv preprint arXiv:2502.13458, 2025. ACL 2025.

[35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[36] Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Yingcheng Wu, et al. Latentchem: From textual CoT to latent thinking in chemical reasoning. arXiv preprint arXiv:2602.07075, 2026.

[37] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Sellars, Thomas Mesnard, and Yashvi Jain. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772, 2024.

[38] Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3Guard technical report. arXiv preprint arXiv:2510.14276, 2025.

[39] Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, and Chaowei Xiao. Armor: Aligning secure and safe large language models via meticulous reasoning. arXiv preprint arXiv:2507.11500, 2025.

Supplementary Material

A Experimental Setup

A.1 Baseline Models

We evaluate the following baseline models:

Static Policy Baselines. Llama Guard 3 [4...

Decoder path (for explicit verdict output): <eot> followed by the <Output> block with the JSON verdict. Standard next-token cross-entropy labels are applied.

Reference path (for teacher supervision): the full sequence of question + explicit reasoning + output. Labels mask the question prefix (set to −100) so that only the reasoning and verdict tokens contribute to L_ref. Additionally, stage boundary positions (the token indices of </Intent> and </Risk> in the reference sequence) are precomputed for the stage-bounda...
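The label masking described for the reference path can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names and the toy token ids are hypothetical, and the one-position shift for next-token prediction is assumed to happen inside the loss. The value −100 matches the default `ignore_index` of PyTorch's cross-entropy loss, which is presumably why the source uses it.

```python
IGNORE_INDEX = -100  # mask value stated in the text for the question prefix


def build_reference_labels(question_ids, reasoning_ids, output_ids):
    """Return (input_ids, labels) for the reference path.

    The question prefix is masked with IGNORE_INDEX so that only the
    reasoning and verdict tokens contribute to the loss.
    """
    input_ids = list(question_ids) + list(reasoning_ids) + list(output_ids)
    labels = ([IGNORE_INDEX] * len(question_ids)
              + list(reasoning_ids) + list(output_ids))
    return input_ids, labels


def find_stage_boundaries(input_ids, intent_close_id, risk_close_id):
    """Precompute the token indices of </Intent> and </Risk> (first match)."""
    return input_ids.index(intent_close_id), input_ids.index(risk_close_id)


# Toy token ids (all hypothetical): 1-3 question, 10-11 intent text,
# 90 = </Intent>, 12-13 risk text, 91 = </Risk>, 20-21 verdict tokens.
question = [1, 2, 3]
reasoning = [10, 11, 90, 12, 13, 91]
output = [20, 21]

input_ids, labels = build_reference_labels(question, reasoning, output)
boundaries = find_stage_boundaries(input_ids, intent_close_id=90, risk_close_id=91)
print(labels)      # question positions are -100, the rest copy input_ids
print(boundaries)  # (5, 8)
```

Precomputing the boundary indices once per example keeps any stage-level loss from having to re-scan the sequence during training.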

Prompt encoding: The question is tokenized and appended with <bot>. A forward pass through the base LM produces the final hidden state at the prompt boundary.

Latent rollout: The hidden state is projected via the MLP projector and fed back as input for m1=4 latent steps (intent stage), followed by m2=6 latent steps (risk stage). Each step uses KV-cache for efficient incremental computation. No discrete tokens are generated during this phase.

Verdict generation: The <eot> token embedding is appended, and the model autoregressively emits the compact verdict string, one of “safe”, “unsafe, policy n”, or “unsafe, policy n1, n2, ...”.

Verdict extraction: A deterministic regex parser maps the compact string to (y, P*). (Note: although the teacher reasoning collected at training time uses a JSON <Output> block, the student is trained on, and emits, the compact form, which is shorter and equally parseable.) Special tokens ([PAD], <bot>, <eot>) are suppressed during verdict generation by se...
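The latent rollout and verdict extraction steps can be illustrated with a toy sketch. Everything model-related here is a stand-in: `toy_lm_step` and `toy_projector` are hypothetical placeholders for the base LM and the MLP projector, the cache is a plain list rather than real key/value tensors, and in a real run it would already hold the prompt. The exact surface form of the compact verdict is also an assumption, since the extracted text is ambiguous about spacing. Only the step counts m1=4 and m2=6 come from the text.

```python
import math
import re

M1_INTENT_STEPS = 4  # m1 from the text
M2_RISK_STEPS = 6    # m2 from the text


def toy_lm_step(x, kv_cache):
    """Stand-in for one incremental LM forward pass (hypothetical).

    Appends the input to the cache and returns a "hidden state" that is
    the element-wise mean of everything cached so far.
    """
    kv_cache.append(x)
    n = len(kv_cache)
    return [sum(v[i] for v in kv_cache) / n for i in range(len(x))]


def toy_projector(h):
    """Stand-in for the MLP projector (hypothetical): element-wise tanh."""
    return [math.tanh(v) for v in h]


def latent_rollout(prompt_hidden, kv_cache):
    """Feed projected hidden states back as inputs for m1 + m2 steps."""
    h = prompt_hidden
    for _ in range(M1_INTENT_STEPS + M2_RISK_STEPS):
        # No discrete token is sampled; the projected state is the next input.
        h = toy_lm_step(toy_projector(h), kv_cache)
    return h


# Assumed compact forms: "safe", "unsafe, policy 3", "unsafe, policy 1, 4".
VERDICT_RE = re.compile(
    r"^\s*(safe|unsafe)\s*(?:,\s*policy\s*(\d+(?:\s*,\s*\d+)*))?\s*$",
    re.IGNORECASE,
)


def parse_verdict(text):
    """Map a compact verdict string to (y, violated_policy_ids)."""
    m = VERDICT_RE.match(text)
    if m is None:
        raise ValueError(f"unparseable verdict: {text!r}")
    label = m.group(1).lower()
    ids = [int(n) for n in re.findall(r"\d+", m.group(2) or "")]
    if label == "safe" and ids:
        raise ValueError(f"'safe' verdict lists policies: {text!r}")
    return label, ids


cache = []
final_hidden = latent_rollout([0.5, -0.5], cache)
print(len(cache))                            # 10
print(parse_verdict("unsafe, policy 1, 4"))  # ('unsafe', [1, 4])
```

Raising on unparseable strings, rather than defaulting to "safe", is a design choice made for this sketch so that malformed outputs surface as errors instead of silent passes.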