arxiv: 2601.15588 · v2 · pith:7EEPQ7ZSnew · submitted 2026-01-22 · 💻 cs.CL

YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models

classification 💻 cs.CL
keywords risk, yufeng-xguard, model, safety, interpretable, language, decision, explanatory

As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
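
The two mechanisms named in the abstract, a first-token risk decision and a policy layer kept separate from risk perception, can be illustrated with a small sketch. This is not the authors' implementation: the category names, the `perceive`, `Policy`, and `guard` helpers, the threshold values, and the assumption that the first decoded token's logits range over one token per risk category are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass

# Hypothetical label set; the abstract does not list the model's categories.
CATEGORIES = ["safe", "violence", "privacy", "self_harm"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perceive(first_token_logits):
    """Risk perception (assumed form): turn the logits of the first decoded
    token, restricted to the category tokens, into per-category scores."""
    return dict(zip(CATEGORIES, softmax(first_token_logits)))


@dataclass
class Policy:
    """Policy enforcement, kept separate from perception so thresholds
    can change without touching the model."""
    thresholds: dict  # category -> minimum score that triggers a block
    default_threshold: float = 0.5

    def decide(self, scores):
        for category, score in scores.items():
            if category == "safe":
                continue
            if score >= self.thresholds.get(category, self.default_threshold):
                return ("block", category)
        return ("allow", None)


def guard(first_token_logits, policy, explain=None):
    """Tiered inference (assumed form): decide from the first token, and
    only call the slower explanation step when a block is issued."""
    action, category = policy.decide(perceive(first_token_logits))
    explanation = None
    if action == "block" and explain is not None:
        explanation = explain()  # stands in for continued decoding
    return (action, category, explanation)


if __name__ == "__main__":
    logits = [1.0, 2.0, 0.0, 0.0]  # made-up logits, not model output
    strict = Policy(thresholds={})
    lenient = Policy(thresholds={"violence": 0.9})
    print(guard(logits, strict, explain=lambda: "explanation text"))
    print(guard(logits, lenient))
```

In this sketch the same scores yield a block under the strict policy and an allow under the lenient one, which is the sense in which a policy can be adjusted without retraining; how YuFeng-XGuard actually encodes categories, scores, and policies is described in the paper, not here.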

This paper has not been read by Pith yet.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective

    cs.CL 2026-02 unverdicted novelty 8.0

    X-Value is the first cross-lingual values judgment benchmark that reveals limitations and performance gaps in LLMs across languages and issue categories.

  2. Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

    cs.AI 2026-07 conditional novelty 6.0

    Vera automates safety testing for LLM agents via literature-driven risk taxonomies, combinatorial case generation, and evidence-grounded verification in isolated environments, showing 93.9% average attack success on f...

  3. SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

    cs.CL 2026-06 unverdicted novelty 6.0

    SentGuard achieves 90.5% detection of unsafe cases within two sentences at 7.41% false positive rate by operating at sentence boundaries during LLM streaming generation.

  4. LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

    cs.CR 2026-05 conditional novelty 6.0

    LPG compresses policy deliberation into 10 latent tokens to reach 84.5% safety accuracy and 11x speedup over explicit reasoning baselines on guardrail benchmarks.

  5. Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety

    cs.CL 2026-06 unverdicted novelty 5.0

    Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.

  6. Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety

    cs.CV 2026-06 unverdicted novelty 5.0

    Yuvion VL is a multimodal LLM family using adversarial-aware data construction, three-stage training, and contrastive fine-tuning that claims industry-leading safety performance on new benchmarks while retaining gener...

  7. BraveGuard: From Open-World Threats to Safer Computer-Use Agents

    cs.CR 2026-05 unverdicted novelty 5.0

    BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.

  8. Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety

    cs.CV 2026-06 unverdicted novelty 4.0

    Yuvion VL is a multimodal foundation model trained with adversarial-aware data and contrastive fine-tuning that claims industry-leading safety performance on the authors' YVRE benchmarks while retaining general capabilities.

  9. GLiNER Guard: Unified Encoder Family for Production LLM Safety and Privacy

    cs.CR 2026-05 unverdicted novelty 4.0

    GLiNER Guard provides unified encoder variants for LLM safety and PII detection in a single pass, with high throughput on A100 hardware and a new PII-Bench benchmark.