JailBreakV-28K: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks

2024 · arXiv 2404.03027

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 2

citation-polarity summary

use dataset 2

representative citing papers

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores of up to 44.5 and an 80% guardrail failure rate.

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

JailWAM: Jailbreaking World Action Models in Robot Control

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

cs.CR · 2026-06-05 · unverdicted · novelty 6.0

MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.

MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models

cs.CR · 2026-06-05 · unverdicted · novelty 6.0

MLingualFC benchmark finds flowchart jailbreaks succeed at high rates for Latin-script languages but much lower rates for Punjabi in multilingual VLMs, pointing to language-dependent safety gaps.

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

cs.CR · 2026-03-23 · unverdicted · novelty 6.0

Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.

PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

cs.CR · 2025-07-29 · unverdicted · novelty 6.0

PRISM decomposes harmful instructions into benign visual gadgets and directs LVLMs via prompts to compose them through reasoning into harmful outputs, achieving ASR over 0.90 on SafeBench.

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

Explicit image-tool interaction in VLMs cuts multimodal jailbreak ASR by ~30% on average; the effect is attributed to a safety-relevant shift in hidden representations rather than image semantics or text traces.

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

Nemotron 3 Super is an open 120B hybrid Mamba-Attention MoE model with a new LatentMoE architecture and MTP layers that matches the accuracy of similar models while delivering up to 7.5x higher inference throughput.

NVIDIA Nemotron 3: Efficient and Open Intelligence

cs.CL · 2025-12-24 · unverdicted · novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics

cs.CR · 2025-04-01 · unverdicted · novelty 3.0

A framework detects LLM anomalies including hallucinations, jailbreaks, and backdoors by forensic inspection of layer-wise hidden state patterns, reporting over 95% accuracy with minimal computational overhead.

High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

cs.CV · 2025-12-26

citing papers explorer

Showing 3 of 3 citing papers after filters.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

cs.CV · 2026-05-31 · unverdicted · none · ref 31

SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores of up to 44.5 and an 80% guardrail failure rate.

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

cs.CV · 2026-05-27 · unverdicted · none · ref 5

Explicit image-tool interaction in VLMs cuts multimodal jailbreak ASR by ~30% on average; the effect is attributed to a safety-relevant shift in hidden representations rather than image semantics or text traces.

High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

cs.CV · 2025-12-26 · unreviewed · ref 25
