LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
Pith reviewed 2026-05-19 23:39 UTC · model grok-4.3
pith:NT4RU4OF
The pith
Latent Policy Guardrail compresses safety deliberation into 10 latent tokens, reaching 84.5% average safety accuracy while running roughly 11 times faster than Qwen3-4B-Thinking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent Policy Guardrail learns semantic latent deliberation over dynamic policies. It compresses the internal steps needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics. At inference time the model produces only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency cost of explicit reasoning.
What carries the argument
Latent Policy Guardrail (LPG), which compresses deliberation for intent interpretation and policy grounding into continuous latent states supervised by decision-relevant semantics.
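The mechanism named above can be pictured as a loop in which hidden states are fed back as inputs instead of being decoded into tokens. The sketch below is a toy illustration only: the 4 intent steps and 6 risk steps follow the supplementary excerpt quoted further down this page, while `toy_lm_step`, `toy_projector`, and the three-dimensional vectors are hypothetical stand-ins, not the authors' model or projector.

```python
import math

# Latent steps per stage, as stated in the supplementary excerpt (m1=4, m2=6).
M_INTENT, M_RISK = 4, 6


def toy_lm_step(history):
    """Hypothetical stand-in for one base-LM forward pass; returns a hidden state."""
    dim = len(history[0])
    # Average all inputs seen so far and squash; a real model would use attention.
    return [math.tanh(sum(vec[i] for vec in history) / len(history)) for i in range(dim)]


def toy_projector(hidden):
    """Hypothetical stand-in for the MLP projector (hidden state -> input embedding)."""
    return [0.5 * h + 0.1 for h in hidden]


def latent_rollout(prompt_embeddings):
    """Run the intent stage, then the risk stage, without emitting discrete tokens."""
    history = list(prompt_embeddings)
    latents = []
    for stage, steps in (("intent", M_INTENT), ("risk", M_RISK)):
        for _ in range(steps):
            hidden = toy_lm_step(history)    # hidden state at the current position
            latent = toy_projector(hidden)   # continuous state fed back as the next input
            history.append(latent)
            latents.append((stage, latent))
    return latents


latents = latent_rollout([[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]])
print(len(latents), [stage for stage, _ in latents])
```

The point of the sketch is the control flow: ten continuous states are produced in two stages, and only afterwards would a short verdict be decoded.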
If this is right
- Guardrails can adapt to changing user-specified or regulatory policies at inference time without any retraining.
- Safety decisions remain auditable because each verdict is explicitly linked to the exact policy clauses that were violated.
- Inference latency drops substantially: the abstract reports LPG-4B running roughly 11 times faster than Qwen3-4B-Thinking under single-sample evaluation while outperforming the strongest dynamic baseline.
- The same latent-compression technique can be applied to other benchmarks that test policy guardrails under dynamic conditions.
Where Pith is reading between the lines
- If the latent states prove sufficient for nuanced policies, similar compression could be tested in real-time compliance systems outside safety, such as contract review or regulatory reporting.
- Varying the number of latent tokens could reveal a clear speed-accuracy curve that future systems could tune per deployment context.
- The approach opens the possibility of guardrails that ingest live policy updates from external sources and immediately reflect them without model updates.
Load-bearing premise
Continuous latent states supervised only by decision-relevant semantics can faithfully capture the intent interpretation and policy grounding required for complex, user-specified safety policies without explicit reasoning steps.
What would settle it
A test set of policy scenarios whose correct judgment requires step-by-step reasoning that cannot be recovered from 10 continuous latent tokens would show whether the compression approach holds.
Original abstract
Guardrails are a critical safety layer for modern AI systems, but their operating regime is changing. As LLMs are deployed as customized assistants, safety policies are increasingly specified at inference time by users, organizations, or regulatory contexts. This makes safety enforcement fundamentally dynamic: the guardrail should adapt to changing safety policies without retraining. Yet this requirement creates a fundamental tension: faithfully judging complex policy contexts demands reasoning capability, while practical deployment requires low-latency responses. We introduce Latent Policy Guardrail (LPG), a guardrail framework that learnssemantic latent deliberation over dynamic policies. LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics. At inference time, it generates only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency of explicit reasoning. Across policy guardrail benchmarks, LPG-4B reaches 84.5% average safety accuracy and 77.9% F1 by compressing deliberation into just 10 latent tokens, outperforming the strongest dynamic baseline while running roughly 11 times faster than Qwen3-4B-Thinking under the single-sample evaluation setup. Code and data are available at https://github.com/SaFo-Lab/Latent_Policy_Guard.
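Accuracy and F1 are the two headline numbers in the abstract, and they can diverge when safe and unsafe examples are unevenly distributed. A minimal sketch of both metrics follows, assuming binary labels with "unsafe" as the positive class; the labels are made up for illustration, and the paper's exact averaging protocol across benchmarks is not described on this page.

```python
def accuracy_and_f1(gold, pred, positive="unsafe"):
    """Return (accuracy, F1) for binary labels, treating `positive` as the positive class."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels, not taken from the paper's benchmarks.
gold = ["unsafe", "safe", "unsafe", "safe", "safe", "unsafe"]
pred = ["unsafe", "safe", "safe", "safe", "unsafe", "unsafe"]
acc, f1 = accuracy_and_f1(gold, pred)
print(f"accuracy={acc:.3f}  f1={f1:.3f}")
```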
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Latent Policy Guardrail (LPG), a framework that compresses deliberation for dynamic, user-specified safety policies into a small number (10) of continuous latent tokens supervised by decision-relevant semantics. LPG-4B is reported to achieve 84.5% average safety accuracy and 77.9% F1 on policy guardrail benchmarks while running approximately 11 times faster than Qwen3-4B-Thinking under single-sample evaluation, with code and data released.
Significance. If the latent states preserve faithful policy reasoning rather than surface correlations, the work would meaningfully advance low-latency, auditable guardrails for inference-time policies. The reproducibility artifacts are a clear strength. However, the practical impact is limited by the unresolved question of whether 10 continuous tokens can reliably encode intent interpretation and multi-clause logical grounding without explicit reasoning steps.
major comments (3)
- [§3.2 and §4.1] The supervision loss (verdict prediction plus semantic alignment) contains no explicit term or regularization for intermediate logical steps such as conjunctions, negations, or conditional nesting. This directly bears on the central claim that the latent states perform 'policy reasoning' rather than learning benchmark-specific correlations.
- [Table 2 and §5.2] The 84.5% accuracy and 77.9% F1 figures are presented without reported standard deviations, number of runs, or statistical significance tests against the strongest dynamic baseline. The post-hoc selection of exactly 10 latent tokens also lacks an ablation showing robustness to this hyperparameter.
- [§5.3] The 11× speedup is measured under a single-sample evaluation setup. No results are provided for batched inference or for policies whose complexity would force the model to use more than 10 tokens, undermining the efficiency claim for realistic deployment.
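The seed-level reporting that the Table 2 comment asks for can be sketched with the standard library alone. The example below computes mean, sample standard deviation, and a paired t statistic over per-seed accuracies. All numbers are hypothetical placeholders, not results from the paper, and the choice of five seeds and a paired test is an assumption about how such a check might be run.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-seed accuracies (percent); NOT results from the paper.
lpg_runs = [84.1, 84.9, 84.3, 84.8, 84.4]
baseline_runs = [82.0, 82.9, 82.5, 82.4, 82.7]

print(f"LPG      {statistics.mean(lpg_runs):.2f} +/- {statistics.stdev(lpg_runs):.2f}")
print(f"baseline {statistics.mean(baseline_runs):.2f} +/- {statistics.stdev(baseline_runs):.2f}")

t_stat, dof = paired_t_statistic(lpg_runs, baseline_runs)
# With 4 degrees of freedom the two-sided 5% critical value is about 2.776.
print(f"t={t_stat:.2f}, dof={dof}, significant at 5%: {abs(t_stat) > 2.776}")
```

Pairing by seed only makes sense if both systems are evaluated on the same splits with matched seeds; otherwise an unpaired test would be the more defensible choice.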
minor comments (2)
- [Abstract] 'learnssemantic' is a typographical error and should read 'learns semantic'.
- [§2] The related-work discussion of dynamic guardrails is brief; adding one or two sentences contrasting LPG with recent latent-reasoning or chain-of-thought compression methods would improve context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor and the interpretation of our latent reasoning claims. We address each major comment below.
- Referee: [§3.2 and §4.1] The supervision loss (verdict prediction plus semantic alignment) contains no explicit term or regularization for intermediate logical steps such as conjunctions, negations, or conditional nesting. This directly bears on the central claim that the latent states perform 'policy reasoning' rather than learning benchmark-specific correlations.
Authors: We appreciate the referee's focus on this point. Our supervision objective prioritizes verdict accuracy and alignment to decision-relevant semantic embeddings precisely to encourage the latent states to capture the semantics needed for policy grounding, including logical structure. While we do not include an explicit regularization term for operators such as negation or conjunction, the training on diverse, multi-clause policies leads the model to internalize these relations implicitly, as shown by LPG's gains on benchmarks containing such constructs. In the revision we will expand the discussion in §4.1 with qualitative analysis of how the learned latents align with logical components of the policies and add supporting examples.
Revision: partial
- Referee: [Table 2 and §5.2] The 84.5% accuracy and 77.9% F1 figures are presented without reported standard deviations, number of runs, or statistical significance tests against the strongest dynamic baseline. The post-hoc selection of exactly 10 latent tokens also lacks an ablation showing robustness to this hyperparameter.
Authors: We agree that additional statistical reporting and hyperparameter analysis would strengthen the results. We will rerun all experiments with at least five random seeds, report means and standard deviations in the updated Table 2, and include statistical significance tests (paired t-tests) against the strongest dynamic baseline. We will also add an ablation varying the number of latent tokens (5, 10, 15, 20) to demonstrate robustness of the chosen value. These updates will appear in the revised §5.2 and Table 2.
Revision: yes
- Referee: [§5.3] The 11× speedup is measured under a single-sample evaluation setup. No results are provided for batched inference or for policies whose complexity would force the model to use more than 10 tokens, undermining the efficiency claim for realistic deployment.
Authors: The referee correctly notes that the speedup comparison uses single-sample decoding. We will add batched inference results (batch sizes 4, 8, and 16) to §5.3 to show that the latency advantage persists under realistic serving conditions. For policies exceeding the complexity handled by 10 tokens, the framework allows increasing the latent budget as a direct extension; we will discuss this trade-off and report preliminary scaling results in the revision.
Revision: partial
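The batched comparison discussed in the last response could be measured with a small timing harness along the following lines. This is a sketch under stated assumptions: `fake_guardrail` is a hypothetical stand-in for a real model call, the prompts are placeholders, and batch size 1 is included only to mirror the single-sample setup next to the batch sizes 4, 8, and 16 named in the response.

```python
import statistics
import time


def median_latency_per_sample(run_batch, batch, repeats=5):
    """Median wall-clock seconds per sample over `repeats` timed calls."""
    per_sample = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch(batch)
        per_sample.append((time.perf_counter() - start) / len(batch))
    return statistics.median(per_sample)


def fake_guardrail(batch):
    """Hypothetical stand-in for a guardrail model call; always answers 'safe'."""
    return ["safe" for _ in batch]


prompts = [f"prompt {i}" for i in range(16)]  # placeholder inputs
results = {
    size: median_latency_per_sample(fake_guardrail, prompts[:size])
    for size in (1, 4, 8, 16)
}
for size, seconds in results.items():
    print(f"batch={size:>2}  {seconds * 1e6:.2f} us/sample")
```

With a real model, warm-up calls and accelerator synchronization before reading the clock would also matter; both are omitted here for brevity.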
Circularity Check
No significant circularity: empirical performance claims rest on benchmark comparisons without reducing to self-defined fits or self-citation chains.
The LPG framework is introduced as a method to compress deliberation into 10 latent tokens supervised by decision-relevant semantics, with claims supported by reported accuracy (84.5%) and F1 (77.9%) on policy guardrail benchmarks, plus speed comparisons to baselines like Qwen3-4B-Thinking. No equations or derivations are presented that equate outputs to inputs by construction, nor are there load-bearing self-citations to prior uniqueness theorems or ansatzes from the same authors. The central thesis relies on external empirical evaluation rather than internal redefinition or fitted-parameter renaming, making the derivation self-contained against the benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of latent tokens
axioms (1)
- domain assumption: Latent continuous states supervised by decision-relevant semantics suffice to capture intent interpretation and policy grounding for dynamic safety policies.
invented entities (1)
- Latent Policy Guardrail (LPG): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: We compress these two stages into continuous latent representations... stage-aligned latent slots... semantic-content supervision via teacher-summary reconstruction
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson, Jonathan Herzig, Lior Shani, and Idan Szpektor. Latent reasoning with supervised thinking states. arXiv preprint arXiv:2602.08332, 2026.
- [2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- [3] Jasmine Cui and Charles Ye. Emergent search and backtracking in latent reasoning models. arXiv preprint arXiv:2602.08100, 2026.
- [4] Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, et al. Llama guard 3-1b-int4: Compact and efficient safeguard for human-ai conversations. arXiv preprint arXiv:2411.17713, 2024.
- [5] Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. AEGIS2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. arXiv preprint arXiv:2501.09004, 2025.
- [6] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495, 2024. NeurIPS 2024.
- [7] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024. COLM 2025.
- [8] Zhenghao He, Guangzhi Xiong, Bohan Liu, Sanchit Sinha, and Aidong Zhang. Reasoning beyond chain-of-thought: A latent computational mode in large language models. arXiv preprint arXiv:2601.08058, 2026.
- [9] Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, and Tom Goldstein. Dynaguard: A dynamic guardian model with user-defined policies. arXiv preprint arXiv:2509.02563, 2025.
- [10] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [11] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
- [12] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023. NeurIPS 2023.
- [13] Mintong Kang and Bo Li. r2-guard: Robust reasoning enabled llm guardrail via knowledge-enhanced logical reasoning. arXiv preprint arXiv:2407.05557, 2024.
- [14] Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, and Maarten Sap. Polyguard: A multilingual safety moderation tool for 17 languages. arXiv preprint arXiv:2504.04377, 2025.
- [15] Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, and Sydney Levine. Safetyanalyst: Interpretable, transparent, and steerable safety moderation for ai behavior. arXiv preprint arXiv:2410.16665, 2024.
- [16] Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024. ACL 2024 Findings.
- [17] Nanxi Li, Zhengyue Zhao, G Edward Suh, Marco Pavone, and Chaowei Xiao. Prism: Robust vlm alignment with principled reasoning for integrated safety in multimodality. arXiv preprint arXiv:2508.18649, 2025.
- [18] Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, and Rema Padman. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717, 2025.
- [19] Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, and Yitong Yang. Yufeng-xguard: A reasoning-centric, interpretable, and flexible guardrail model for large language models. arXiv preprint arXiv:2601.15588, 2026.
- [20] Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4694–4702, 2023.
- [21] Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. Agentdog: A diagnostic guardrail framework for ai agent safety and security. arXiv preprint arXiv:2601.18491, 2026.
- [22] Yue Liu, Hongcheng Gao, Shengfang Zhai, Yufei He, Jun Xia, Zhengyu Hu, Yulin Chen, Xihong Yang, Jiaheng Zhang, Stan Z. Li, Hui Xiong, and Bryan Hooi. Guardreasoner: Towards reasoning-based llm safeguards. arXiv preprint arXiv:2501.18492, 2025.
- [23] Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, and Chaowei Xiao. Agrail: A lifelong agent guardrail with effective and adaptive safety detection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8104–8139, 2025.
- [24] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024. ICML 2024.
- [25] Mohammad Niknazar, Paul V Haley, Latha Ramanan, Sang T Truong, Yedendra Shrinivasan, Ayan Kumar Bhowmick, Prasenjit Dey, Ashish Jagmohan, Hema Maheshwari, Shom Ponoth, et al. Building a domain-specific guardrail model in production. arXiv preprint arXiv:2408.01452, 2024.
- [26] Syed Arman Rabbani, Mohamed El-Tanani, Shrestha Sharma, Syed Salman Rabbani, Yahia El-Tanani, Rakesh Kumar, and Manita Saini. Generative artificial intelligence in healthcare: applications, implementation challenges, and future directions. BioMedInformatics, 5(3):37, 2025.
- [27] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.
- [28] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 677–693, 2025.
- [29] Weiming Song, Xuan Xie, and Ruiping Yin. Aisa: Awakening intrinsic safety awareness in large language models against jailbreak attacks. arXiv preprint arXiv:2602.13547, 2026.
- [30] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.
- [31] Lin Wang, Junfeng Fang, Dan Zhang, Fei Shen, Xiang Wang, and Tat-Seng Chua. Draft: Task decoupled latent reasoning for agent safety. arXiv preprint arXiv:2604.03242, 2026.
- [32] Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, and Bang Liu. System-1.5 reasoning: Traversal in language and latent spaces with dynamic shortcuts. arXiv preprint arXiv:2505.18962, 2025.
- [33] Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, and Muhao Chen. Towards policy-compliant agents: Learning efficient guardrails for policy violation detection. arXiv preprint arXiv:2510.03485, 2025.
- [34] Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. Thinkguard: Deliberative slow thinking leads to cautious guardrails. arXiv preprint arXiv:2502.13458, 2025. ACL 2025.
- [35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [36] Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Yingcheng Wu, et al. Latentchem: From textual cot to latent thinking in chemical reasoning. arXiv preprint arXiv:2602.07075, 2026.
- [37] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Sellars, Thomas Mesnard, and Yashvi Jain. Shieldgemma: Generative ai content moderation based on gemma. arXiv preprint arXiv:2407.21772, 2024.
- [38] Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report. arXiv preprint arXiv:2510.14276, 2025.
- [39] Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, and Chaowei Xiao. Armor: Aligning secure and safe large language models via meticulous reasoning. arXiv preprint arXiv:2507.11500, 2025.
Supplementary Material
A Experimental Setup
A.1 Baseline Models
We evaluate the following baseline models: Static Policy Baselines. Llama Guard 3 [ 4...
- [40] Decoder path (for explicit verdict output): <eot> followed by the <Output> block with the JSON verdict. Standard next-token cross-entropy labels are applied.
- [41] Reference path (for teacher supervision): the full sequence of question + explicit reasoning + output. Labels mask the question prefix (set to −100) so that only the reasoning and verdict tokens contribute to L_ref. Additionally, stage boundary positions (the token indices of </Intent> and </Risk> in the reference sequence) are precomputed for the stage-bounda...
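The label masking described for the reference path can be sketched as follows. The value −100 comes from the excerpt itself (it is also the default `ignore_index` of PyTorch's cross-entropy loss). The token ids, the two boundary ids, and the helper names are hypothetical and chosen only to make the example self-contained.

```python
IGNORE_INDEX = -100  # masked label value named in the excerpt


def build_reference_labels(question_ids, reasoning_ids, output_ids):
    """Concatenate the sequence and mask the question prefix in the labels."""
    input_ids = question_ids + reasoning_ids + output_ids
    labels = [IGNORE_INDEX] * len(question_ids) + reasoning_ids + output_ids
    return input_ids, labels


def stage_boundary_positions(input_ids, boundary_token_ids):
    """Indices where a stage-closing token (e.g. </Intent>, </Risk>) occurs."""
    return [i for i, tok in enumerate(input_ids) if tok in boundary_token_ids]


# Hypothetical token ids: 900 stands for </Intent>, 901 for </Risk>.
END_INTENT, END_RISK = 900, 901
question = [11, 12, 13]
reasoning = [21, END_INTENT, 22, END_RISK]
output = [31, 32]

input_ids, labels = build_reference_labels(question, reasoning, output)
boundaries = stage_boundary_positions(input_ids, {END_INTENT, END_RISK})
print(labels)
print(boundaries)
```

Only the reasoning and verdict positions keep real labels, so only they would contribute to a loss that skips `IGNORE_INDEX`.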
- [42] Prompt encoding: The question is tokenized and appended with <bot>. A forward pass through the base LM produces the final hidden state at the prompt boundary.
- [43] Latent rollout: The hidden state is projected via the MLP projector and fed back as input for m1=4 latent steps (intent stage), followed by m2=6 latent steps (risk stage). Each step uses KV-cache for efficient incremental computation. No discrete tokens are generated during this phase.
- [44] Verdict generation: The <eot> token embedding is appended, and the model autoregressively emits the compact verdict string, one of “safe”, “unsafe, policyn”, or “unsafe, policyn1, n2, ...”.
- [45] Verdict extraction: A deterministic regex parser maps the compact string to (y, P*). (Note: although the teacher reasoning collected at training time uses a JSON <Output> block, the student is trained on, and emits, the compact form, which is shorter and equally parseable.) Special tokens ([PAD], <bot>, <eot>) are suppressed during verdict generation by se...
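The verdict-extraction step lends itself to a short illustration. The parser below is a minimal sketch, not the authors' implementation: the exact surface form of the compact string is garbled in the excerpt, so the pattern assumes "safe" or "unsafe, policy" followed by one or more comma-separated clause numbers, with optional whitespace. `parse_verdict` and its return shape are hypothetical.

```python
import re

# Assumed compact formats: "safe", "unsafe, policy 3", "unsafe, policy 1, 4".
_VERDICT_RE = re.compile(
    r"^\s*(?:(safe)|unsafe,\s*policy\s*(\d+(?:\s*,\s*\d+)*))\s*$",
    re.IGNORECASE,
)


def parse_verdict(text):
    """Map a compact verdict string to (label, violated_clause_numbers)."""
    match = _VERDICT_RE.match(text)
    if match is None:
        raise ValueError(f"unparseable verdict: {text!r}")
    if match.group(1):
        return "safe", []
    clauses = [int(n) for n in re.split(r"\s*,\s*", match.group(2))]
    return "unsafe", clauses


for sample in ("safe", "unsafe, policy 3", "unsafe, policy 1, 4"):
    print(sample, "->", parse_verdict(sample))
```

Raising on anything that does not match keeps the mapping deterministic: a malformed generation surfaces as an error instead of being silently treated as safe.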