{"total":15,"items":[{"citing_arxiv_id":"2606.00801","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety","primary_cat":"cs.CR","submitted_at":"2026-05-30T16:40:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Applies MAP-Elites quality-diversity optimization to evolve semantic attack strategies across dimensions like strategy type, encoding, and length, uncovering distinct vulnerability profiles in four LLMs including GPT-4o-mini and Claude 3.5 Sonnet.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05116","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Hardness of Junking LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-06T16:47:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01899","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-03T14:28:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Persona-based jailbreak attacks differ from traditional jailbreak attacks. Instead of modifying malicious intent, it shifts the model's behavioral boundaries by reshaping its role perception. Deshpande et al. [18] revealed that persona assignment significantly increases toxic generation in ChatGPT. Persona modulation employs an LLM assistant to construct specific roles predisposed to executing harmful instructions [19]. Zhang et al. [9] used a genetic algorithm to automatically generate universal persona prompts, which not only substantially bypass the defenses of mainstream LLMs but also synergize with other jailbreak attacks. PersonaTeaming introduces personas in 2 Adversarial Self-Play for Persona-Invariant Safety AlignmentPREPRINT the adversarial prompt generation process to explore a wider spectrum of adversarial strategies [20]."},{"citing_arxiv_id":"2604.14548","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"emotion recognition: Closing the valence gap.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10745-10759, 2023. doi: 10.1109/TPAMI.2023.3263585. [75] Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023. [76] Zaibin Zhang, Yongting Zhang, Lijun Li, Jing Shao, Hongzhi Gao, Yu Qiao, Lijun Wang, Huchuan Lu, and Feng Zhao. Psysafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages"},{"citing_arxiv_id":"2604.10733","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-12T17:12:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09750","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Conflicts Make Large Reasoning Models Vulnerable to Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-10T11:44:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Conflicts between alignment objectives or dilemmas increase attack success rates on LRMs by shifting and overlapping safety and functional neural representations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.11528","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs","primary_cat":"cs.CR","submitted_at":"2026-02-12T03:37:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRACE-RPS drops LLM attribute inference accuracy from around 50% to below 5% via fine-grained anonymization plus a two-stage rejection optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03416","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-01-06T21:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GAMBIT constructs gamified instructional traps that decompose harmful visuals and drive MLLMs to reconstruct and answer malicious queries as part of winning a game, achieving over 85% attack success on models including GPT-4o and Gemini 2.5 Flash.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.06701","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks","primary_cat":"cs.LG","submitted_at":"2025-09-08T13:55:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.20325","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs","primary_cat":"cs.CL","submitted_at":"2025-08-28T00:07:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.04295","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-07-05T06:57:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"Fine-tuning-based [68] [103] [47] [111] Black-box Attack Tamplate Completion Scenario Nesting [52] [22] [104] Context-based [100] [20] [48] [5] [120] Code Injection [43] [61] Prompt Rewriting Cipher [108] [40] [33] [55] [55] [13] Low-resource Languages [21] [106] [49] Genetic Algorithm-based [56] [46] [107] [50] [88] LLM-based Generation [19] [109] [76] [12] [15] [41] [27] [91] [54] [64] Figure 2: Taxonomy of jailbreak attack. Step1: Initialize the prompt with an adversarial suffix Step2: Optimize new token in the suffix iteratively Step3: Jailbreak prompt with optimized suffix Sorry, I can't help you with that. Can you teach me how to make a bomb? Can you teach me how to make a bomb? <suffix>"},{"citing_arxiv_id":"2404.01318","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","primary_cat":"cs.CR","submitted_at":"2024-03-28T02:44:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.10260","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A StrongREJECT for Empty Jailbreaks","primary_cat":"cs.LG","submitted_at":"2024-02-15T18:58:09+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.03853","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dr. Jekyll and Mr. Hyde: Two Faces of LLMs","primary_cat":"cs.CR","submitted_at":"2023-12-06T19:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Impersonating complex misaligned personas via biographies and role-play bypasses safety in ChatGPT, Gemini, and Deepseek, succeeding on 38-40 out of 40 illicit questions across tested models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.08419","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Jailbreaking Black Box Large Language Models in Twenty Queries","primary_cat":"cs.LG","submitted_at":"2023-10-12T15:38:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned?, 2023. [14] Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023. [15] Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023. 1 10 [16] Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break it fix it for dialogue safety: Robustness from adversarial human attack."}],"limit":50,"offset":0}