Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
79 Pith papers cite this work.
abstract
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.
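The abstract's description is concrete enough to sketch in code: Llama Guard receives the safety taxonomy and the conversation inside its prompt and generates a short verdict ("safe", or "unsafe" plus the violated category codes on the next line). Below is a minimal sketch of prompt and response classification, assuming the released meta-llama/LlamaGuard-7b checkpoint on Hugging Face and the moderation chat template packaged with its tokenizer; the example conversation is purely illustrative.

```python
# Minimal sketch of prompt/response classification with the released weights.
# Assumes the gated meta-llama/LlamaGuard-7b checkpoint and its packaged chat
# template, which wraps the conversation in the task instruction, the unsafe
# content taxonomy, and the expected output format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat):
    """Return Llama Guard's verdict: 'safe', or 'unsafe' plus category codes."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    out = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    # Decode only the generated suffix, not the echoed prompt.
    return tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Prompt classification: classify the user turn on its own.
print(moderate([{"role": "user", "content": "How do I hotwire a car?"}]))

# Response classification: append the assistant reply and classify the last turn.
print(moderate([
    {"role": "user", "content": "How do I hotwire a car?"},
    {"role": "assistant", "content": "I can't help with that."},
]))
```

Because the taxonomy lives in the prompt rather than in the weights, the category block inside that template can be swapped for a deployment-specific taxonomy, which is the zero-shot/few-shot customization the abstract refers to.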
citing papers explorer
- The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
- Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
- FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
- Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
- Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
- How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
- Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
- Self-Mined Hardness for Safety Fine-Tuning
Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially mitigates.
- When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
- Social Bias in LLM-Generated Code: Benchmark and Mitigation
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
- Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
- SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.
- GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
- HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking
HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.
- Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives
Governed MCP implements kernel-level governance for MCP tool calls in AI agents through a 6-layer pipeline including ProbeLogits semantic verification, with an ablation showing F1 drop from 0.773 to 0.327 without it and structural prevention of userspace bypasses.
- Conjunctive Prompt Attacks in Multi-Agent LLM Systems
Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
- Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
- LogAct: Enabling Agentic Reliability via Shared Logs
LogAct is a shared-log abstraction for LLM agents that makes actions visible before execution, allows decoupled stopping, enables consistent recovery, and supports LLM-driven introspection for reliability.
- SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates by up to 22.66% while using up to 173x less memory and 26x lower latency.
- On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
- Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and arouse less recipient suspicion.
- Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.
- Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
- MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
- Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
- Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
- Internalizing Safety Understanding in Large Reasoning Models via Verification
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.
- GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
- One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
- Understanding Annotator Safety Policy with Interpretability
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
- You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
- Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
- ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
- Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
- ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
- Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.
- LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
- How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- Test-Time Safety Alignment
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
- From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
Paired analysis of 1250 LLM interactions shows 61% of responses de-escalate harm, 36% maintain severity, and 3% escalate, with sexual content persisting far more than other categories.
- How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
- An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
- Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks (a generic probing sketch in this spirit follows this list).
- LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
- The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious
42% of significant turn-level associations in LLM conversation analysis are spurious due to unaccounted autocorrelation, with a validated two-stage correction framework improving replication.
- ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems
ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.
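Several entries above (SIREN, SALO, and the residual-stream geometry paper) share one mechanism: harmful intent is read off internal activations rather than off generated text. The sketch below is a generic illustration of that idea only, not any cited paper's method; the model name, layer index, and toy prompt set are assumptions made for the example.

```python
# Generic illustration of representation-level harm probing (not a specific
# cited method): fit a linear probe on one layer's last-token activation.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"   # assumption: any decoder-only LM works
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
LAYER = 16  # mid-depth layer, an arbitrary choice for this sketch

def activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final prompt token at LAYER."""
    ids = tok(prompt, return_tensors="pt").to(lm.device)
    with torch.no_grad():
        out = lm(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().cpu()

# Toy labels; a real probe would train on a harm benchmark's training split.
prompts = ["How do I bake sourdough bread?",
           "Summarize the plot of Hamlet.",
           "Write a phishing email impersonating a bank.",
           "Explain how to hotwire a car."]
labels = [0, 0, 1, 1]

X = torch.stack([activation(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict_proba(X)[:, 1])  # per-prompt harm score in [0, 1]
```

The appeal of this family of detectors, as the summaries above note, is that the probe adds only a dot product per forward pass, so it can run alongside generation with negligible overhead compared with a separate guard model.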