arxiv: 2312.06674 · v1 · submitted 2023-12-07 · 💻 cs.CL · cs.AI

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan , Kartikeya Upasani , Jianfeng Chi , Rashi Rungta , Krithika Iyer , Yuning Mao , Michael Tontchev , Qing Hu , Brian Fuller , Davide Testuggine , Madian Khabsa

Pith reviewed 2026-05-10 18:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: Llama Guard · LLM safeguard · safety risk taxonomy · prompt classification · response classification · content moderation · instruction tuning · AI safety

The pith

Llama Guard is a Llama2-7b model instruction-tuned on a safety-risk dataset that classifies risks in both user prompts and generated responses at levels matching or exceeding existing moderation tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Llama Guard as an LLM-based safeguard tailored to human-AI conversations. It defines a safety risk taxonomy that labels risks appearing in prompts and then classifies the responses those prompts produce. A compact high-quality dataset built around this taxonomy is used to instruction-tune a Llama2-7b model. On established benchmarks the tuned model performs at or above current content-moderation systems while also supporting task customization and flexible output formats. The work releases the model weights to let others adapt the approach for changing safety requirements.

Core claim

Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats, enabling the adjustment of taxonomy categories to align with specific use cases and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input.
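
One plausible reading of "binary decision scores" is the probability the model assigns to an "unsafe" verdict as its first generated token. The sketch below assumes the two first-token logits are already available; the function names and the 0.5 threshold are hypothetical and are not taken from the text above.

```python
import math

def unsafe_score(logit_safe: float, logit_unsafe: float) -> float:
    """Two-way softmax over the 'safe' and 'unsafe' first-token logits.

    Assumption: the caller has extracted these two logits from the model.
    """
    m = max(logit_safe, logit_unsafe)          # subtract the max for numerical stability
    e_safe = math.exp(logit_safe - m)
    e_unsafe = math.exp(logit_unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)

def is_unsafe(logit_safe: float, logit_unsafe: float, threshold: float = 0.5) -> bool:
    """Turn the continuous score into a binary decision (threshold is illustrative)."""
    return unsafe_score(logit_safe, logit_unsafe) >= threshold
```

A continuous score of this kind lets an operator pick the threshold per deployment, trading false positives against missed detections.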

What carries the argument

The safety risk taxonomy that categorizes risks in LLM prompts for prompt classification and in generated responses for response classification, paired with instruction-tuning of Llama2-7b on the collected dataset to produce multi-class labels and binary safety decisions.

Load-bearing premise

The high-quality dataset collected around the safety risk taxonomy is representative of real-world risks in human-AI conversations and benchmark performance will translate to effective safeguarding in deployed systems.

What would settle it

Running Llama Guard on a large set of live human-AI conversations that contain known harmful outcomes and measuring whether its classifications align with independent human judgments of those same interactions.
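
Alignment with independent human judgments could be quantified with a chance-corrected agreement statistic. The following is a minimal sketch using Cohen's kappa; the choice of metric and the function name `cohen_kappa` are assumptions made for illustration, not something the page or the paper prescribes.

```python
from collections import Counter

def cohen_kappa(model_labels, human_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(model_labels) != len(human_labels) or not model_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(model_labels)
    # Fraction of items on which the classifier and the human annotator agree.
    observed = sum(m == h for m, h in zip(model_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    model_counts = Counter(model_labels)
    human_counts = Counter(human_labels)
    expected = sum(model_counts[c] * human_counts[c] for c in model_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy example with four conversations (labels are made up).
model = ["unsafe", "safe", "safe", "unsafe"]
human = ["unsafe", "safe", "unsafe", "unsafe"]
kappa = cohen_kappa(model, human)  # 0.5 for this toy data
```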

Original abstract

We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.
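
To make "prompting with diverse taxonomies at the input" concrete, here is a rough sketch of how a caller might assemble a zero-shot classification prompt from a custom category list. The template wording, the tag names, and the function `build_guard_prompt` are all hypothetical; this is not the template shipped with the released weights.

```python
def build_guard_prompt(categories, conversation, target_role="User"):
    """Assemble a zero-shot classification prompt from a custom taxonomy.

    categories:   list of (code, name, description) tuples
    conversation: list of (role, text) tuples, oldest turn first
    The layout below is an illustrative assumption, not an official format.
    """
    lines = [
        f"Task: Check whether the last '{target_role}' message in the "
        "conversation violates any of the categories below.",
        "",
        "<categories>",
    ]
    for code, name, description in categories:
        lines.append(f"{code}: {name}. {description}")
    lines += ["</categories>", "", "<conversation>"]
    for role, text in conversation:
        lines.append(f"{role}: {text}")
    lines += [
        "</conversation>",
        "",
        "Answer 'safe' or 'unsafe' on the first line. If unsafe, list the "
        "violated category codes on a second line, separated by commas.",
    ]
    return "\n".join(lines)

# Swapping the taxonomy only changes the input text, not the model.
prompt = build_guard_prompt(
    [("C1", "Violence", "Content that encourages violence.")],
    [("User", "How do I bake bread?")],
)
```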

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Llama Guard releases an open 7B safety classifier for LLM conversations that, after instruction tuning on a custom taxonomy, is reported to match or exceed existing moderation tools on public benchmarks, but the writeup is light on numbers and dataset details.

The core contribution is an open weights release of a Llama 2 7B model fine-tuned to classify both prompts and responses for safety risks using a custom taxonomy. The model supports multi-class decisions plus binary scores, and the instruction tuning lets users swap taxonomies or output formats without retraining. They report it matches or exceeds existing moderation tools on the OpenAI Moderation Evaluation set and ToxicChat after training on their collected data, which they note is low-volume but high-quality.

The open release and customization angle are the practical wins here; anyone running conversational systems can drop this in as a filter without depending on closed APIs, and the joint input-output framing fits real chat flows better than prompt-only or response-only tools. The taxonomy itself is a reusable piece that others can build on or adapt.

The soft spots are straightforward. The abstract supplies no concrete metrics, no dataset size or construction details beyond the volume note, no error analysis, and no ablation on how the taxonomy was derived or how labels were aligned. That makes the performance claim hard to stress-test from the text alone, and it leaves open whether the benchmarks capture the distribution of risks that actually show up in deployed human-AI conversations. The generalization step from benchmark scores to production safeguarding is assumed rather than demonstrated.

This paper is aimed at practitioners who need an open, adaptable safety layer for chat models and at researchers who want a starting point for further tuning or taxonomy work. A reader focused on responsible deployment or open-source moderation tooling will get immediate value from the weights and the taxonomy description. It deserves a serious referee because the model is released, the claims are tied to public benchmarks, and the gaps are fixable with added numbers and methods rather than fundamental flaws in the approach.

Send it to review; the open release makes engagement worthwhile even if the current draft needs more evidence to stand on its own.
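
The "drop this in as a filter" idea from the letter can be sketched as a wrapper that runs prompt classification before generation and response classification after it. Everything here is a stand-in: `guarded_reply`, `toy_classify`, the refusal string, and the keyword rule are hypothetical placeholders for a real safeguard model and chat model.

```python
from typing import Callable, List, Tuple

Conversation = List[Tuple[str, str]]  # (role, text) pairs
REFUSAL = "Sorry, I can't help with that."

def guarded_reply(
    user_msg: str,
    generate: Callable[[str], str],
    classify: Callable[[Conversation], str],
) -> str:
    """Wrap a chat model with input and output safety checks."""
    # Prompt classification: screen the user message before generating.
    if classify([("User", user_msg)]) != "safe":
        return REFUSAL
    reply = generate(user_msg)
    # Response classification: screen the reply in the context of the prompt.
    if classify([("User", user_msg), ("Agent", reply)]) != "safe":
        return REFUSAL
    return reply

def toy_classify(conversation: Conversation) -> str:
    """Stand-in classifier: flags the last turn if it mentions 'attack'."""
    _, last_text = conversation[-1]
    return "unsafe" if "attack" in last_text.lower() else "safe"
```

Passing the classifier in as a callable keeps the wrapper independent of how the safeguard is hosted, whether local weights or a remote service.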

Referee Report

2 major / 1 minor

Summary. The paper introduces Llama Guard, a Llama2-7B model instruction-tuned on a collected high-quality (but low-volume) dataset for prompt and response classification in human-AI conversations. It defines a safety risk taxonomy to categorize risks and uses this for both input classification and output moderation. The central claim is that the resulting model matches or exceeds existing content moderation tools on the OpenAI Moderation Evaluation dataset and ToxicChat benchmark; the model supports multi-class classification with binary decision scores, allows taxonomy customization via instruction tuning, and the weights are released publicly.

Significance. If the benchmark results hold under detailed scrutiny, the work supplies a practical, open-weight, instruction-tunable safeguard that can be adapted to new taxonomies or use cases. The public release of weights is a concrete contribution to reproducibility and community experimentation in conversational AI safety.

major comments (2)

[Abstract] The assertion that Llama Guard 'demonstrates strong performance' and 'matches or exceeds' existing tools on the OpenAI Moderation Evaluation dataset and ToxicChat is stated without any quantitative metrics (accuracy, F1, precision/recall, or comparison tables), dataset statistics, or error analysis. This information is load-bearing for the central empirical claim.
[Dataset and Experiments sections] No details are supplied on the size of the collected dataset, label distribution, how the safety risk taxonomy was operationalized into labels, training hyperparameters, or exact benchmark numbers. These omissions prevent verification that the low-volume dataset supports the reported generalization.
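
For reference, the precision, recall, and F1 figures the referee asks for can be computed from paired gold and predicted labels as in the sketch below. The function name `binary_prf` and the toy labels are illustrative only and do not reproduce any result from the paper.

```python
def binary_prf(y_true, y_pred, positive="unsafe"):
    """Precision, recall, and F1 for the positive class of a binary task."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: one hit, one false alarm, one miss, one correct 'safe'.
gold = ["unsafe", "unsafe", "safe", "safe"]
pred = ["unsafe", "safe", "unsafe", "safe"]
precision, recall, f1 = binary_prf(gold, pred)  # (0.5, 0.5, 0.5)
```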

minor comments (1)

[Abstract] The description of 'multi-class classification and generating binary decision scores' would benefit from an explicit example of the output format (e.g., a sample prompt and expected response).
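
As a rough illustration of the kind of output format the minor comment asks about, the sketch below assumes the model emits a verdict on the first line ("safe" or "unsafe") and, when unsafe, a comma-separated list of violated category codes on the second line. That layout, the codes such as "O3", and the function `parse_guard_output` are assumptions for the example, since the abstract does not specify the format.

```python
def parse_guard_output(text):
    """Split a generated verdict into (is_unsafe, category_codes).

    Assumed layout: line 1 is 'safe' or 'unsafe'; an optional line 2 holds
    comma-separated category codes for unsafe verdicts.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")
    verdict = lines[0].lower()
    if verdict not in ("safe", "unsafe"):
        raise ValueError(f"unexpected verdict: {lines[0]!r}")
    categories = []
    if verdict == "unsafe" and len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return verdict == "unsafe", categories

# A benign turn, and a turn flagged under two hypothetical category codes.
benign = parse_guard_output("safe")
flagged = parse_guard_output("unsafe\nO3,O1")
```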

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve transparency and verifiability of our claims.

Referee: [Abstract] The assertion that Llama Guard 'demonstrates strong performance' and 'matches or exceeds' existing tools on the OpenAI Moderation Evaluation dataset and ToxicChat is stated without any quantitative metrics (accuracy, F1, precision/recall, or comparison tables), dataset statistics, or error analysis. This information is load-bearing for the central empirical claim.

Authors: We agree that the abstract would benefit from explicit quantitative support for the performance claims. In the revised manuscript, we will add key metrics (e.g., accuracy and F1 scores) along with direct comparisons to existing moderation tools on both the OpenAI Moderation Evaluation dataset and ToxicChat. A brief reference to the error analysis and dataset statistics already detailed in the Experiments section will also be included in the abstract.
Revision: yes

Referee: [Dataset and Experiments sections] Dataset construction and experimental sections: no details are supplied on the size of the collected dataset, label distribution, how the safety risk taxonomy was operationalized into labels, training hyperparameters, or exact benchmark numbers. These omissions prevent verification that the low-volume dataset supports the reported generalization.

Authors: We acknowledge the need for greater specificity in these sections to enable verification. The revised manuscript will expand the Dataset section to report the exact dataset size, label distributions, and the mapping from the safety risk taxonomy to classification labels. The Experiments section will include training hyperparameters and precise benchmark numbers with comparison tables. These additions will clarify how the high-quality, low-volume dataset supports the observed generalization.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces an empirically trained safeguard model (Llama Guard) via instruction-tuning on a custom safety dataset, then reports performance on independent external benchmarks (OpenAI Moderation Evaluation dataset and ToxicChat). No mathematical derivations, equations, or first-principles predictions exist in the text; the central claims rest on standard train-then-evaluate results against public test sets rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on the representativeness of the collected dataset and the assumption that instruction tuning on a modest amount of data yields reliable safety classification. No explicit free parameters beyond standard model training are mentioned.

axioms (1)

Domain assumption: Instruction tuning of LLMs on a modest dataset can produce effective multi-class safety classifiers.
Invoked to support performance claims despite low data volume.

invented entities (1)

Safety risk taxonomy (no independent evidence)
Purpose: Categorize risks in prompts and responses for classification.
New taxonomy introduced by the authors for this task.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
cs.CR 2026-05 conditional novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
cs.CR 2026-04 unverdicted novelty 8.0 full

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
cs.CR 2026-05 unverdicted novelty 7.0

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
cs.CR 2026-05 unverdicted novelty 7.0

FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
cs.AI 2026-05 unverdicted novelty 7.0

Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
cs.CR 2026-05 conditional novelty 7.0

A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
cs.CR 2026-05 unverdicted novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
Self-Mined Hardness for Safety Fine-Tuning
cs.LG 2026-05 unverdicted novelty 7.0

Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially ...
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Jailbroken Frontier Models Retain Their Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Latent Space Probing for Adult Content Detection in Video Generative Models
cs.CV 2026-04 unverdicted novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
cs.LG 2026-04 unverdicted novelty 7.0

SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
cs.CR 2026-04 unverdicted novelty 7.0

Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking
cs.CR 2026-04 unverdicted novelty 7.0

HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.
Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives
cs.CR 2026-04 unverdicted novelty 7.0

Governed MCP implements kernel-level governance for MCP tool calls in AI agents through a 6-layer pipeline including ProbeLogits semantic verification, with an ablation showing F1 drop from 0.773 to 0.327 without it a...
Conjunctive Prompt Attacks in Multi-Agent LLM Systems
cs.MA 2026-04 unverdicted novelty 7.0

Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
LogAct: Enabling Agentic Reliability via Shared Logs
cs.DC 2026-04 unverdicted novelty 7.0

LogAct is a shared-log abstraction for LLM agents that makes actions visible before execution, allows decoupled stopping, enables consistent recovery, and supports LLM-driven introspection for reliability.
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
cs.CR 2026-04 unverdicted novelty 7.0

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
The Great Pretender: A Stochasticity Problem in LLM Jailbreak
cs.CR 2026-05 conditional novelty 6.0

ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
cs.AI 2026-05 unverdicted novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
cs.CR 2026-05 conditional novelty 6.0

Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
cs.CV 2026-05 unverdicted novelty 6.0

UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
cs.CR 2026-05 unverdicted novelty 6.0

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
cs.CR 2026-05 unverdicted novelty 6.0

MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
cs.CR 2026-05 unverdicted novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
cs.AI 2026-05 unverdicted novelty 6.0

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
Internalizing Safety Understanding in Large Reasoning Models via Verification
cs.AI 2026-05 unverdicted novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
cs.CL 2026-05 unverdicted novelty 6.0

GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL 2026-05 unverdicted novelty 6.0

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
Understanding Annotator Safety Policy with Interpretability
cs.AI 2026-05 unverdicted novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
cs.CR 2026-05 unverdicted novelty 6.0

NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
cs.CR 2026-05 unverdicted novelty 6.0

Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
cs.CR 2026-05 accept novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG 2026-05 unverdicted novelty 6.0

ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
cs.CR 2026-05 unverdicted novelty 6.0

Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
cs.CR 2026-05 unverdicted novelty 6.0

LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
cs.CL 2026-04 unverdicted novelty 6.0

LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
cs.CR 2026-04 unverdicted novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
Test-Time Safety Alignment
cs.CL 2026-04 unverdicted novelty 6.0

Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
cs.CL 2026-04 unverdicted novelty 6.0

Paired analysis of 1250 LLM interactions shows 61% of responses de-escalate harm, 36% maintain severity, and 3% escalate, with sexual content persisting far more than other categories.
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
cs.CL 2026-04 unverdicted novelty 6.0

LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
Reasoning Structure Matters for Safety Alignment of Reasoning Models
cs.AI 2026-04 unverdicted novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
cs.LG 2026-04 unverdicted novelty 6.0

Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
cs.LG 2026-04 unverdicted novelty 6.0

Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...
LLM Safety From Within: Detecting Harmful Content with Internal Representations
cs.AI 2026-04 unverdicted novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious
cs.CL 2026-04 accept novelty 6.0

42% of significant turn-level associations in LLM conversation analysis are spurious due to unaccounted autocorrelation, with a validated two-stage correction framework improving replication.
ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems
cs.OS 2026-04 unverdicted novelty 6.0

ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
cs.CR 2026-04 unverdicted novelty 6.0

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
cs.CL 2026-04 unverdicted novelty 6.0

Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
cs.SD 2026-04 unverdicted novelty 6.0

GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
cs.LG 2026-04 unverdicted novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance
cs.LG 2026-04 unverdicted novelty 6.0

Spectral geometry of LoRA adapters encodes training objective and predicts harmful compliance in language models.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
cs.CR 2026-04 unverdicted novelty 6.0

TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 83 Pith papers · 4 internal anchors

[1]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403 ,

work page internal anchor Pith review arXiv
[2]

SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minne...

work page 2019
[3]

doi: 10.18653/v1/S19-2007

Association for Computational Linguistics. doi: 10.18653/v1/S19-2007. https://www.aclweb.org/anthology/S19-2007. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing sy...

work page doi:10.18653/v1/s19-2007 2007
[4]

doi: 10.18653/v1/W18-0802

Association for Computational Linguistics. doi: 10.18653/v1/W18-0802. https://aclanthology.org/W18-0802. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models,

work page doi:10.18653/v1/w18-0802
[5]

doi: 10.18653/v1/W18-5102

Association for Computational Linguistics. doi: 10.18653/v1/W18-5102. https://www.aclweb.org/anthology/W18-5102. Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break it fix it for dialogue safety: Robustness from adversarial human attack,

work page doi:10.18653/v1/w18-5102
[6]

doi: 10.18653/v1/2021.acl-long.210

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.210. https://aclanthology.org/2021.acl-long.210. Alon Halevy, Cristian Canton-Ferrer, Hao Ma, Umut Ozertem, Patrick Pantel, Marzieh Saeidi, Fabrizio Silvestri, and Ves Stoyanov. Preserving integrity in online social networks. Communications of the ACM, 65(2):92–98,

work page doi:10.18653/v1/2021.acl-long.210 2021
[7]

Exploring social bias in chatbots using stereotype knowledge

Nayeon Lee, Andrea Madotto, and Pascale Fung. Exploring social bias in chatbots using stereotype knowledge. In Amittai Axelrod, Diyi Yang, Rossana Cunha, Samira Shaikh, and Zeerak Waseem, editors, Proceedings of the 2019 Workshop on Widening NLP, pages 177–180, Florence, Italy, August

work page 2019
[8]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761,

[9]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

[10]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Adva...

[11]

SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Jonathan May, Ekaterina Shutova, Aurelie Herbelot, Xiaodan Zhu, Marianna Apidianaki, and Saif M. Mohammad, editors, Proceedings of the 13th International Works...

work page 2019
[12]

doi: 10.18653/v1/S19-2010

Association for Computational Linguistics. doi: 10.18653/v1/S19-2010. https://aclanthology.org/S19-2010. Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment,

work page doi:10.18653/v1/s19-2010 2010