Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387
11 Pith papers cite this work. Polarity classification is still indexing.
Citing papers
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
-
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
-
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
-
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach a 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%. An illustrative mutation-loop sketch follows after this list.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation, reaching an 88.46% average jailbreak success rate with an improved attack-utility trade-off on four audio LLMs. An illustrative band-selection sketch follows after this list.
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving a 95% defense rate with 5.2 ms/token latency across tested attacks. An illustrative streaming-monitor sketch follows after this list.
-
Jailbreaking Large Language Models with Morality Attacks
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
-
A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
Training-free methods for LLM trustworthiness show inconsistent results across dimensions, with clear trade-offs in utility, robustness, and overhead depending on where they intervene during inference.
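The TEMPLATEFUZZ entry above describes element-level mutation of chat templates combined with heuristic search. The summary does not give the actual mutation rules or scoring signal, so the sketch below is only an illustration: it treats a chat template as a list of named elements, applies a few hypothetical mutation rules, and keeps the best candidate under a stand-in score that a real fuzzer would replace with a jailbreak-success measurement from the target model. All element names, rules, and the hill-climbing loop are assumptions, not the paper's method.

```python
import random

# A chat template as named elements (hypothetical decomposition; the paper's
# actual element structure is not given in the summary).
TEMPLATE = [
    ("bos", "<s>"),
    ("user_tag", "[INST] "),
    ("message", "{prompt}"),
    ("user_close", " [/INST]"),
    ("assistant_tag", " "),
]

def mutate(elements):
    """Apply one element-level mutation rule at random (illustrative rules only)."""
    elems = list(elements)
    i = random.randrange(len(elems))
    name, text = elems[i]
    # Keep the message slot intact so the prompt always survives rendering.
    rules = ["duplicate", "pad"] if name == "message" else ["drop", "duplicate", "pad", "case_flip"]
    rule = random.choice(rules)
    if rule == "drop":
        del elems[i]
    elif rule == "duplicate":
        elems.insert(i, (name, text))
    elif rule == "pad":
        elems[i] = (name, text + " " * random.randint(1, 4))
    else:  # case_flip
        elems[i] = (name, text.swapcase())
    return elems

def render(elements, prompt):
    return "".join(text for _, text in elements).replace("{prompt}", prompt)

def score(rendered):
    # Stand-in objective: a real fuzzer would query the target LLM and score
    # jailbreak success; here we simply prefer templates that deviate more.
    return len(rendered)

def fuzz(prompt, iterations=200):
    """Greedy hill-climbing over template mutations (one possible heuristic search)."""
    best, best_score = TEMPLATE, score(render(TEMPLATE, prompt))
    for _ in range(iterations):
        candidate = mutate(best)
        s = score(render(candidate, prompt))
        if s > best_score:
            best, best_score = candidate, s
    return render(best, prompt)

if __name__ == "__main__":
    print(fuzz("Describe your safety policy."))
```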
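The GRM entry describes ranking Mel bands by the ratio of their contribution to the attack objective versus their sensitivity for utility, then perturbing only the selected bands. The band-selection step can be shown in a few lines; the gradients below are random placeholders, whereas the paper would obtain them by backpropagating an attack loss and a utility loss through the audio LLM, and the top-k value is an assumed hyperparameter.

```python
import numpy as np

def select_bands(attack_grad, utility_grad, top_k=24, eps=1e-8):
    """Rank Mel bands by |attack gradient| / |utility gradient| and keep the top-k.

    attack_grad, utility_grad: arrays of shape (n_mels, n_frames), stand-ins for
    gradients of an attack loss and a utility loss w.r.t. the Mel spectrogram.
    Returns a boolean mask over bands: True = band may be perturbed.
    """
    attack_score = np.abs(attack_grad).mean(axis=1)     # per-band attack contribution
    utility_score = np.abs(utility_grad).mean(axis=1)   # per-band utility sensitivity
    ratio = attack_score / (utility_score + eps)
    selected = np.argsort(ratio)[::-1][:top_k]          # bands with the best trade-off
    mask = np.zeros(attack_grad.shape[0], dtype=bool)
    mask[selected] = True
    return mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_mels, n_frames = 80, 400
    attack_grad = rng.normal(size=(n_mels, n_frames))    # placeholder gradients
    utility_grad = rng.normal(size=(n_mels, n_frames))
    mask = select_bands(attack_grad, utility_grad)
    # A universal perturbation would then be optimized only on masked bands,
    # e.g. delta = mask[:, None] * delta, and shared across prompts.
    print(f"{mask.sum()} of {n_mels} Mel bands selected for perturbation")
```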
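The TrajGuard entry describes flagging a generation when the hidden-state trajectory drifts toward a high-risk region during decoding. A minimal streaming version of that idea is sketched below: each new hidden state is scored against a "risk direction" (here a random stand-in; in practice it might be fit from labeled harmful vs. benign activations), and an alarm fires when an exponential moving average of the score crosses a threshold. The direction estimate, moving-average formulation, and threshold are all assumptions; the paper's actual detector is not specified in the summary.

```python
import numpy as np

class TrajectoryMonitor:
    """Streaming check of per-token hidden states against a learned risk direction."""

    def __init__(self, risk_direction, threshold=0.6, momentum=0.9):
        self.direction = risk_direction / np.linalg.norm(risk_direction)
        self.threshold = threshold
        self.momentum = momentum
        self.ema = 0.0

    def step(self, hidden_state):
        """Return True once the running risk score crosses the threshold."""
        h = hidden_state / (np.linalg.norm(hidden_state) + 1e-8)
        score = float(h @ self.direction)                 # cosine similarity to risk direction
        self.ema = self.momentum * self.ema + (1 - self.momentum) * score
        return self.ema > self.threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 256
    # Stand-in for a direction fit on labeled data, e.g. mean(harmful) - mean(benign).
    risk_direction = rng.normal(size=dim)
    monitor = TrajectoryMonitor(risk_direction, threshold=0.3)
    # Simulated decoding: hidden states gradually drift toward the risk direction.
    for t in range(50):
        drift = min(1.0, t / 25)
        h = (1 - drift) * rng.normal(size=dim) + drift * risk_direction
        if monitor.step(h):
            print(f"risk threshold crossed at decoding step {t}")
            break
```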