YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models

Defeng Li; Haiwen Hong; Hui Xue; Jinfeng Li; Junyu Lin; Lingyao Gao; Longtao Huang; Meizhen Liu; Ranjie Duan; Xiaohan Yuan

arxiv: 2601.15588 · v2 · pith:7EEPQ7ZSnew · submitted 2026-01-22 · 💻 cs.CL

Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, Yitong Yang

classification 💻 cs.CL

keywords risk, yufeng-xguard, model, safety, interpretable, language, decision, explanatory

As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
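
The two mechanisms named in the abstract, a first-token risk decision and a policy layer kept separate from the model, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label names (`safe`, `violence`, `fraud`), the logit values, the `Policy` and `Decision` classes, the `assess` function, and the threshold rule are all assumptions made for illustration, since the abstract does not specify the model's taxonomy or output format.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional

SAFE = "safe"  # hypothetical label for the benign class


def softmax(logits: dict) -> dict:
    """Turn first-token logits over label tokens into probabilities."""
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


@dataclass
class Policy:
    """Enforcement settings held outside the model (assumed design)."""
    default_threshold: float = 0.5
    thresholds: dict = field(default_factory=dict)

    def threshold_for(self, category: str) -> float:
        return self.thresholds.get(category, self.default_threshold)


@dataclass
class Decision:
    category: str
    confidence: float
    blocked: bool
    explanation: Optional[str] = None


def assess(first_token_logits: dict, policy: Policy,
           explain: Optional[Callable[[str], str]] = None) -> Decision:
    """Tier 1: decide from the first token. Tier 2: explain only if blocked."""
    probs = softmax(first_token_logits)
    # Assumes at least one non-safe label is present in the logits.
    risky = {c: p for c, p in probs.items() if c != SAFE}
    category = max(risky, key=risky.get)
    confidence = risky[category]
    blocked = confidence >= policy.threshold_for(category)
    explanation = explain(category) if (blocked and explain) else None
    return Decision(category, confidence, blocked, explanation)


# Made-up logits: the same model output is judged under two policies.
logits = {"safe": 2.0, "violence": 1.0, "fraud": 0.0}
lenient = assess(logits, Policy())
strict = assess(logits, Policy(thresholds={"violence": 0.2}),
                explain=lambda c: f"flagged as {c}")
print(lenient.blocked, strict.blocked)  # prints: False True
```

In this sketch the policy changes the outcome without touching the scores, which is the sense in which perception and enforcement are decoupled; the `explain` callback stands in for the slower explanatory generation and runs only when a request is blocked.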

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective
cs.CL 2026-02 unverdicted novelty 8.0

X-Value is the first cross-lingual values judgment benchmark that reveals limitations and performance gaps in LLMs across languages and issue categories.

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
cs.AI 2026-07 conditional novelty 6.0

Vera automates safety testing for LLM agents via literature-driven risk taxonomies, combinatorial case generation, and evidence-grounded verification in isolated environments, showing 93.9% average attack success on f...

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models
cs.CL 2026-06 unverdicted novelty 6.0

SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
cs.CR 2026-05 conditional novelty 6.0

LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.

Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety
cs.CL 2026-06 unverdicted novelty 5.0

Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety
cs.CV 2026-06 unverdicted novelty 5.0

Yuvion VL is a multimodal LLM family using adversarial-aware data construction, three-stage training, and contrastive fine-tuning that claims industry-leading safety performance on new benchmarks while retaining gener...

BraveGuard: From Open-World Threats to Safer Computer-Use Agents
cs.CR 2026-05 unverdicted novelty 5.0

BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety
cs.CV 2026-06 unverdicted novelty 4.0

Yuvion VL is a multimodal foundation model trained with adversarial-aware data and contrastive fine-tuning that claims industry-leading safety performance on the authors' YVRE benchmarks while retaining general capabilities.

GLiNER Guard: Unified Encoder Family for Production LLM Safety and Privacy
cs.CR 2026-05 unverdicted novelty 4.0

GLiNER Guard provides unified encoder variants for LLM safety and PII detection in a single pass, with high throughput on A100 hardware and a new PII-Bench benchmark.