{"total":14,"items":[{"citing_arxiv_id":"2606.07706","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models","primary_cat":"cs.CR","submitted_at":"2026-06-05T10:10:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLingualFC benchmark finds flowchart jailbreaks succeed at high rates for Latin-script languages but much lower rates for Punjabi in multilingual VLMs, pointing to language-dependent safety gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01481","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-31T22:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18868","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-15T12:28:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05678","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:12:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"This setup lets us compare safety across stages, principles, models, and prompt sources, and also provides the scoring signal used by our mitigation method. 3.2 Data and Models Safety prompts. Our safety prompt benchmark aggregates prompts from multiple public harmfulness and jailbreak datasets. The in-distribution prompt pool combines seven sources: WildChat [35], PKU-SafeRLHF [36], JailbreakV [37], HarmBench [14], BeaverTails [38], StrongREJECT [15], and JailbreakBench [16]. Together, these sources cover direct harmful requests, jailbreaks, malicious role-play, adversarial framing, and naturally occurring unsafe user queries. 
We map dataset-specific fields to a unified prompt column and source label, filter non-English prompts and length outliers,"},{"citing_arxiv_id":"2605.04446","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-06T03:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01687","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety","primary_cat":"cs.CL","submitted_at":"2026-05-03T02:55:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12374","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-14T07:02:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Nemotron 3 Super is an open 120B hybrid Mamba-Attention MoE model with new LatentMoE architecture 
and MTP layers that matches accuracy of similar models while delivering up to 7.5x higher inference throughput.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08846","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-10T01:01:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Interpretation (↓) Average (↓) No Steering 0.13 0.03 0.11 0.09 Prompting 0.31 0.07 0.28 0.22 ActAdd 0.28 0.04 0.21 0.18 MOP 0.14 0.03 0.19 0.12 DACO (Ours) 0.17 0.03 0.14 0.11 report category-averaged defense success rate (i.e., the percentage of responses that are classified as safe). We evaluate DACO on MM-SafetyBench (MS) [52] and JailbreakV-28K (JBV) [58] using either RoBERTa-SafeEdit (R) [104] or Qwen3Guard (QG) [130] as a safety judge, which lead to four possible combinations: MS-R, MS-QG, JBV-R, and JBV-QG. RoBERTa-SafeEdit outputs a score in [0,1] (i.e., probability), where higher is safer. 
For Qwen3Guard, we take the judge's safety pattern evaluation as scores, where \"Safe\" is 1, \"Unsafe\" is 0, and \"Controversial\" is 0."},{"citing_arxiv_id":"2604.05498","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JailWAM: Jailbreaking World Action Models in Robot Control","primary_cat":"cs.RO","submitted_at":"2026-04-07T06:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21697","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-03-23T08:32:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.21815","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"High-Entropy Tokens as Multimodal Failure Points in Vision-Language 
Models","primary_cat":"cs.CV","submitted_at":"2025-12-26T01:01:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20856","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","primary_cat":"cs.CL","submitted_at":"2025-12-24T00:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21540","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking","primary_cat":"cs.CR","submitted_at":"2025-07-29T07:13:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRISM decomposes harmful instructions into benign visual gadgets and directs LVLMs via prompts to compose them through reasoning into harmful outputs, achieving ASR over 0.90 on SafeBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.00446","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State 
Forensics","primary_cat":"cs.CR","submitted_at":"2025-04-01T05:58:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A framework detects LLM anomalies including hallucinations, jailbreaks, and backdoors by forensic inspection of layer-wise hidden state patterns, reporting over 95% accuracy with minimal computational overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}