{"total":33,"items":[{"citing_arxiv_id":"2605.11514","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems","primary_cat":"cs.CR","submitted_at":"2026-05-12T04:35:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11217","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leveraging RAG for Training-Free Alignment of LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-11T20:29:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11002","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks","primary_cat":"cs.CR","submitted_at":"2026-05-10T00:17:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined 
attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09070","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success","primary_cat":"cs.CR","submitted_at":"2026-05-09T17:26:39+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08876","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents","primary_cat":"cs.LG","submitted_at":"2026-05-09T10:55:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05682","ref_index":39,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI","primary_cat":"cs.HC","submitted_at":"2026-05-07T05:19:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Persona-driven workflow and interface improve automated and 
human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04700","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization","primary_cat":"cs.CR","submitted_at":"2026-05-06T09:52:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03378","ref_index":119,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection","primary_cat":"cs.CR","submitted_at":"2026-05-05T05:37:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02647","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming","primary_cat":"cs.CL","submitted_at":"2026-05-04T14:32:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ContextualJailbreak uses evolutionary search over simulated primed 
dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01899","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-03T14:28:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01758","ref_index":23,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-03T07:38:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection rates from over 95% to under 5.47% while preserving benign performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02958","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak 
Detection","primary_cat":"cs.CR","submitted_at":"2026-05-02T14:56:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Refusal in LLMs leaves a detectable upstream trajectory that SALO exploits to raise jailbreak detection from near zero to over 90 percent even under forced-decoding attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01034","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Theoretical Game of Attacks via Compositional Skills","primary_cat":"cs.CL","submitted_at":"2026-05-01T18:59:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stronger empirical performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02946","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-01T11:54:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00267","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Jailbroken 
Frontier Models Retain Their Capabilities","primary_cat":"cs.LG","submitted_at":"2026-04-30T22:04:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28157","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption","primary_cat":"cs.CR","submitted_at":"2026-04-30T17:43:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20994","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Breaking MCP with Function Hijacking Attacks: Novel 
Threats for Function Calling and Agentic Models","primary_cat":"cs.CR","submitted_at":"2026-04-22T18:32:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18976","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming","primary_cat":"cs.CL","submitted_at":"2026-04-21T01:58:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15789","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Study of Training-Free Methods for Trustworthy Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-17T07:50:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Training-free methods for LLM trustworthiness show inconsistent results across dimensions, with clear trade-offs in utility, robustness, and overhead depending on where they intervene during 
inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15780","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-17T07:37:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11309","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems","primary_cat":"cs.CR","submitted_at":"2026-04-13T11:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10326","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion","primary_cat":"cs.CR","submitted_at":"2026-04-11T19:19:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10299","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking","primary_cat":"cs.CV","submitted_at":"2026-04-11T17:33:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention-Guided Visual Jailbreaking blinds LVLMs to safety instructions by suppressing attention to alignment prefixes and anchoring generation on adversarial image features, reaching 94.4% attack success rate on Qwen-VL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08846","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-10T01:01:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07727","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense","primary_cat":"cs.CR","submitted_at":"2026-04-09T02:22:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TrajGuard detects jailbreaks by 
tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06811","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems","primary_cat":"cs.CR","submitted_at":"2026-04-08T08:24:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04060","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-05T11:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19790","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output 
Disagreements","primary_cat":"cs.AI","submitted_at":"2026-04-02T03:38:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01473","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits","primary_cat":"cs.CR","submitted_at":"2026-04-01T23:29:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26x less latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11717","ref_index":153,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Refusal in Language Models Is Mediated by a Single Direction","primary_cat":"cs.LG","submitted_at":"2024-06-17T16:36:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.17177","ref_index":113,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sora: A Review on Background, 
Technology, Limitations, and Opportunities of Large Vision Models","primary_cat":"cs.CV","submitted_at":"2024-02-27T03:30:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Liu, X. Lei, J. Tang, and M. Huang, \"Safetybench: Evaluating the safety of large language models with multiple choice questions,\" 2023. [112] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang, \"\"Do Anything Now\": Characterizing and evaluating in-the-wild jailbreak prompts on large language models,\" arXiv preprint arXiv:2308.03825, 2023. [113] X. Liu, N. Xu, M. Chen, and C. Xiao, \"AutoDAN: Generating stealthy jailbreak prompts on aligned large language models,\" arXiv preprint arXiv:2310.04451, 2023. [114] S. Zhu, R. Zhang, B. An, G. Wu, J. Barrow, Z. Wang, F. Huang, A. Nenkova, and T. Sun, \"AutoDAN: Interpretable gradient-based adversarial attacks on large language models,\" 2023. [115] A."},{"citing_arxiv_id":"2310.03684","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks","primary_cat":"cs.LG","submitted_at":"2023-10-05T17:01:53+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}