{"total":311,"items":[{"citing_arxiv_id":"2605.23196","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Prompt Overflow: What the Guardrail Inspects Is Not What the Model Infers","primary_cat":"cs.CR","submitted_at":"2026-05-22T03:27:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces Prompt Overflow Attack that fragments malicious instructions in overlength prompts to evade guardrail segmentation while remaining actionable to LLMs with larger context windows.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22643","ref_index":102,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22321","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions","primary_cat":"cs.CR","submitted_at":"2026-05-21T11:07:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A3S-Bench evaluates LLM agents against temporal, spatial, and semantic evasions, raising average risk trigger rates from 28.3% to 52.6% across 2,254 trajectories and 20 
scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22258","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting","primary_cat":"cs.CL","submitted_at":"2026-05-21T10:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21948","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-21T03:28:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SCI-Defense combines perplexity detection, semantic integrity scoring across four manipulation dimensions, and inter-candidate detection to counter GEO attacks, reporting perfect precision on Amazon product data but domain-limited recall on web passages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21834","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation","primary_cat":"cs.LG","submitted_at":"2026-05-20T23:56:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-Policy Consistency Training 
(OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21706","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent-space Attacks for Refusal Evasion in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-20T20:10:27+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21674","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adversarial Reframing: A Framework for Targeted Generation in Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-20T19:31:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21362","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T16:27:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per 
prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20994","ref_index":117,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Context-Invariant Safety Alignment for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T10:33:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20759","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders","primary_cat":"cs.CR","submitted_at":"2026-05-20T05:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Graph-context LLM fraud defenders improve early refusal under replay and adaptive multi-round attacks compared to text baselines but increase benign over-refusal, with the cost localized to how the LLM consumes structured graph fields rather than encoder quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20654","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"REFLECTOR: Internalizing Step-wise Reflection against Indirect 
Jailbreak","primary_cat":"cs.LG","submitted_at":"2026-05-20T03:16:15+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20641","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-20T02:55:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20519","ref_index":18,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Codec-Robust Attacks on Audio LLMs","primary_cat":"cs.SD","submitted_at":"2026-05-19T21:39:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20382","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-19T18:32:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Experiments reveal 
that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20351","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)","primary_cat":"cs.CR","submitted_at":"2026-05-19T18:05:51+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19966","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:15:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19940","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robotics-Inspired Guardrails for Foundation Models in 
Socially Sensitive Domains","primary_cat":"cs.AI","submitted_at":"2026-05-19T15:00:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19722","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Safety Alignment Effects in Autonomous Security Agents","primary_cat":"cs.CR","submitted_at":"2026-05-19T11:55:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19485","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-05-19T07:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20286","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive 
Probe-based Steering for Robust LLM Jailbreaking","primary_cat":"cs.CR","submitted_at":"2026-05-19T06:11:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Adaptive probe-based steering guided by model extraction and activation statistics improves LLM jailbreak success rates from 6% to 70% average harmfulness without extra contrastive prompts or manual tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19377","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Evaluation Game: Beyond Static LLM Benchmarking","primary_cat":"cs.LG","submitted_at":"2026-05-19T05:22:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits under group actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":176,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"difficult to define a \"safe\" boundary in the raw waveform or spectral domain without degrading the signal's 
utility. Lack of Standardized Benchmarks: The offensive proliferation is partly driven by the ease of adapting computer vision and LLM attack algorithms to audio. However, defense lacks a unified evaluation standard. Unlike the mature \"Red Teaming\" datasets for text [176], the LALM community lacks a comprehensive Safety Leaderboard that evaluates models across the full spectrum of threats, from paralinguistic privacy leakage to acoustic backdoors. This absence of metrics incentivizes performance-driven development at the expense of security. 4.4.2 The Challenge of Cross-Modal Alignment Our analysis shows that directly transferring text-based"},{"citing_arxiv_id":"2605.19035","ref_index":75,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On","primary_cat":"cs.AI","submitted_at":"2026-05-18T18:57:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Argues that trustworthiness in Agent-to-Agent networks requires a new conceptual framework with four design pillars baked in from the beginning, as retrofitting existing single-agent methods is insufficient.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20262","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing","primary_cat":"cs.LG","submitted_at":"2026-05-18T18:17:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six 
backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18991","ref_index":77,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent Security is a Systems Problem","primary_cat":"cs.CR","submitted_at":"2026-05-18T18:11:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper argues that agent security is best addressed as a systems problem by applying principles from operating systems, networks, and formal methods rather than relying solely on model robustness improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18988","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks","primary_cat":"cs.CR","submitted_at":"2026-05-18T18:06:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes the TRIAD framework that treats multi-turn multimodal attacks as continuous trajectories and uses structural anomaly detection, regularized Mahalanobis distance, topological acceleration, and a time-varying Cox model with Bayesian HMM feedback to predict and bound expected time-to-failure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18309","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Alignment Dynamics in LLM Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-18T12:27:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces 
and predicts a rehearsal priming effect.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18168","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-18T10:10:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18133","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments","primary_cat":"cs.CR","submitted_at":"2026-05-18T09:38:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17986","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt 
Injection","primary_cat":"cs.CR","submitted_at":"2026-05-18T07:41:35+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18918","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense","primary_cat":"cs.CR","submitted_at":"2026-05-18T06:57:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ESLD extracts safety signals directly from the latent space of any guard model to enable faster and more accurate prompt-injection detection without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17413","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications","primary_cat":"cs.CR","submitted_at":"2026-05-17T12:18:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17380","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ADR: An Agentic Detection System for Enterprise Agentic AI 
Security","primary_cat":"cs.AI","submitted_at":"2026-05-17T10:49:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17310","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17288","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack","primary_cat":"cs.CR","submitted_at":"2026-05-17T06:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM cascade systems are vulnerable to a new adversarial attack that simultaneously degrades accuracy and destroys the intended cost savings by targeting both the lightweight models and the escalation decision 
mechanism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17284","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-17T06:45:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17128","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"New Wide-Net-Casting Jailbreak Attacks Risk Large Models","primary_cat":"cs.CR","submitted_at":"2026-05-16T19:22:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17034","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented 
Generation","primary_cat":"cs.LG","submitted_at":"2026-05-16T15:08:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16520","ref_index":240,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing","primary_cat":"cs.LG","submitted_at":"2026-05-15T18:14:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16090","ref_index":85,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation","primary_cat":"cs.CR","submitted_at":"2026-05-15T15:47:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16035","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Who Owns This Agent? 
Tracing AI Agents Back to Their Owners","primary_cat":"cs.CR","submitted_at":"2026-05-15T15:10:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16471","ref_index":159,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI","primary_cat":"cs.CR","submitted_at":"2026-05-15T13:53:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15598","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-15T04:14:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15393","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling","primary_cat":"cs.LG","submitted_at":"2026-05-14T20:26:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14514","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-14T07:58:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15239","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-14T03:40:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines 
across model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16407","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture","primary_cat":"cs.LO","submitted_at":"2026-05-13T12:01:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"partial","one_line_summary":"Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13338","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models","primary_cat":"cs.CR","submitted_at":"2026-05-13T10:57:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hierarchical genetic algorithm induces overthinking in black-box large reasoning models by perturbing logical structure, achieving up to 26.1x longer outputs on the MATH benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This effectively mismatches a premise set with an unrelated question, creating logically fractured problem instances. Premise-level crossover: Let $k_A \\sim \\mathrm{Uniform}\\{1, \\ldots, |P_A|\\}$ and $k_B \\sim \\mathrm{Uniform}\\{1, \\ldots, |P_B|\\}$ be randomly sampled premise indices. The operator swaps the selected premises: $x_C = ((P_A \\setminus \\{p^A_{k_A}\\}) \\cup \\{p^B_{k_B}\\}, q_A)$, $x_D = ((P_B \\setminus \\{p^B_{k_B}\\}) \\cup \\{p^A_{k_A}\\}, q_B)$. 
(9) In each crossover, the question-level crossover occurs with a probability of $p_{qc}$, while the premise-level crossover occurs with a probability of $1 - p_{qc}$. This produces cross-context premise combinations that disrupt the original reasoning chain. MUTATION Mutation is applied to an individual $x = (P, q)$ with probability $p_m$. We consider two mutation modes, both operating"},{"citing_arxiv_id":"2605.13334","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-13T10:51:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}