AEGIS 2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, et al. · 2025 · DOI 10.18653/v1/2025.naacl-long.306

8 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary: dataset 1

citation-polarity summary: use dataset 1

representative citing papers

Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

cs.CR · 2026-07-02 · unverdicted · novelty 8.0

A behavioral monitoring technique using HTTP, lexical, and timing signals detects guardrail presence with 100% accuracy and distinguishes guardrail blocks from LLM rejections with 98% average F1 on unseen prompts.

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

cs.LG · 2026-05-11 · unverdicted · novelty 8.0

Guardrail classifiers receive formal guarantees by certifying convex harmful regions in pre-activation space, exposing safety holes in three toxicity models despite high empirical scores.

BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

BELLS-O is the first vendor-neutral operational benchmark comparing specialized guardrails and repurposed frontier LLMs on accuracy, false-positive rates, speed, and monetary cost across 11 harm categories and 13 jailbreak techniques.

kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

kNNGuard classifies prompts using multi-layer kNN on LLM hidden activations from 50 examples, matching or exceeding fine-tuned guardrails in F1 while running 2.7x to 10x faster with no training required.

CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

cs.AI · 2026-06-30 · unverdicted · novelty 6.0

CDR-Bench shows state-of-the-art LLMs fail at compositional and especially order-sensitive data refinement across atomic, order-agnostic, and order-sensitive settings.

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts

cs.CR · 2026-04-17 · unverdicted · novelty 4.0

TWGuard achieves +0.289 F1 improvement and 94.9% false-positive reduction for LLM safety guardrails in the Taiwan linguistic context compared to foundation models and baselines.
