{"total":16,"items":[{"citing_arxiv_id":"2606.07335","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics","primary_cat":"cs.CR","submitted_at":"2026-06-05T14:49:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02530","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-06-01T17:38:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00686","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing","primary_cat":"cs.LG","submitted_at":"2026-05-30T11:49:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more 
informative LLM outputs with zero-shot generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19190","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South","primary_cat":"cs.CY","submitted_at":"2026-05-18T23:34:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A participatory red-teaming project in the Global South created the PLACES dataset of 26k T2I failure examples that reveal unique cultural and linguistic harms missed by existing safety frameworks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04992","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation","primary_cat":"cs.CR","submitted_at":"2026-05-06T14:52:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The adapted forward pass for an input $h$ is computed as $W'h = W_0 h + \\frac{\\alpha}{r} BAh$, where $\\alpha$ scales the adaptation magnitude. This reduces trainable parameters from $O(dk)$ to $O(r(d+k))$, enabling the fine-tuning of 70B+ models on consumer hardware via 4-bit quantization (QLoRA) [32]. LoRA is a standard for safety alignment as it modifies behavior while preserving foundational capabilities [15, 55]. 
2.2 Supervised Fine-Tuning (SFT) SFT adapts a model to a target behavioral profile by minimizing the standard negative log-likelihood (NLL) loss over a labeled dataset $D = \\{(\\mathbf{x}_i, \\mathbf{y}_i)\\}_{i=1}^{N}$: $L_{\\mathrm{SFT}}(\\theta) = -\\sum_{i=1}^{N} \\sum_{t=1}^{|\\mathbf{y}_i|} \\log P_\\theta(y_{i,t} \\mid \\mathbf{x}_i, y_{i,<t})$. (1) SFT serves as the initial post-training stage in alignment pipelines [104], typically utilizing human-written demonstrations to outperform"},{"citing_arxiv_id":"2605.04446","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-06T03:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01687","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety","primary_cat":"cs.CL","submitted_at":"2026-05-03T02:55:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.20981","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diversifying Toxicity Search in Large Language Models Through 
Speciation","primary_cat":"cs.NE","submitted_at":"2026-01-28T19:29:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ToxSearch-S applies unsupervised speciation to evolutionary prompt search, maintaining capacity-limited species with exemplar leaders and species-aware selection to achieve higher peak toxicity and broader semantic coverage than standard methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.17887","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents","primary_cat":"cs.AI","submitted_at":"2026-01-25T15:42:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Personalization through long-term memory in LLM agents increases harmful query success rates by 15.8-243.7% via intent legitimation, measured on the new PS-Bench benchmark across frameworks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.09689","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models","primary_cat":"cs.CR","submitted_at":"2025-10-09T09:44:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker 
model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.11206","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evalet: Evaluating Large Language Models through Functional Fragmentation","primary_cat":"cs.HC","submitted_at":"2025-09-14T10:24:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"tions that are not actually relevant to the criterion. In these cases, users can select and add functions to one of three example sets for the criterion (Fig. 5A, B): (1) positive examples to rate positively, (2) negative examples to rate negatively, and (3) excluded examples to avoid extracting for this criterion. These sets serve as few-shot examples [11] in future evaluations. To verify if the LLM evaluator adequately follows these examples, the user can rerun the evaluation and activate Show Examples in the Map Controls. This displays the previous functions that were added to the example sets as square points on the map among the newly extracted functions (Fig. 
5C), which allows users to visually verify the effect of the"},{"citing_arxiv_id":"2509.10546","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain","primary_cat":"cs.CL","submitted_at":"2025-09-07T22:35:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.14226","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs","primary_cat":"cs.CL","submitted_at":"2025-05-20T11:35:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.02574","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Safety Evaluations Lack Robustness","primary_cat":"cs.CR","submitted_at":"2025-03-04T12:55:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing 
progress.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.12935","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions","primary_cat":"cs.AI","submitted_at":"2024-08-23T09:33:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1, red-teaming is an effective technique for generating reliable safety data. Consequently, this section will delve into various training strategies, e.g., instruction tuning and RLHF. 6.2.1 Instruction Tuning. Safety training can be effectively implemented using adversarial prompts and their corresponding responsible output in an instruction-tuning framework. Bianchi et al. [64] analyze this training strategy, showing that adding a small number of safety examples (just 3% for models like LLaMA) when fine-tuning LLMs can substantially improve model safety. 
However, the study also highlights the risk of overusing safety data, which can lead the model to excessively prioritize safety and refuse some perfectly safe but superficially unsafe prompts."},{"citing_arxiv_id":"2407.04295","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Jailbreak Attacks and Defenses Against Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-07-05T06:57:30+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"requests. [Figure 8: Taxonomy of jailbreak defense. Prompt Level: Prompt Detection [37] [1]; Prompt Perturbation [11] [73] [38] [112] [45] [121]; System Prompt Safeguard [77] [126] [94] [118]. Model Level: SFT-based [9] [18] [8]; RLHF-based [66] [6] [83] [25] [59] [26] [58]; Gradient and Logit Analysis [101] [102] [35] [53]; Refinement [44] [113]; Proxy Defense [110] [85].] 4.1 Prompt-level Defenses Prompt-level defenses refer to the scenarios where the direct access to neither the internal model weight nor the output logits is available, thus the prompt becomes the only variable both the attackers and defenders can control. To protect the model from the increasing number of elaborately con-"}],"limit":50,"offset":0}