YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models
read the original abstract
As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving ondemand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves stateof-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
This paper has not been read by Pith yet.
Forward citations
Cited by 9 Pith papers
-
Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective
X-Value is the first cross-lingual values judgment benchmark that reveals limitations and performance gaps in LLMs across languages and issue categories.
-
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Vera automates safety testing for LLM agents via literature-driven risk taxonomies, combinatorial case generation, and evidence-grounded verification in isolated environments, showing 93.9% average attack success on f...
-
SentGuard: Sentence-Level Streaming Guardrails for Large Language Models
SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.
-
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.
-
Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
-
Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety
Yuvion VL is a multimodal LLM family using adversarial-aware data construction, three-stage training, and contrastive fine-tuning that claims industry-leading safety performance on new benchmarks while retaining gener...
-
BraveGuard: From Open-World Threats to Safer Computer-Use Agents
BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.
-
Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety
Yuvion VL is a multimodal foundation model trained with adversarial-aware data and contrastive fine-tuning that claims industry-leading safety performance on the authors' YVRE benchmarks while retaining general capabilities.
-
GLiNER Guard: Unified Encoder Family for Production LLM Safety and Privacy
GLiNER Guard provides unified encoder variants for LLM safety and PII detection in a single pass, with high throughput on A100 hardware and a new PII-Bench benchmark.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.