SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
hub
Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
dataset 2polarities
use dataset 2representative citing papers
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
MLingualFC benchmark finds flowchart jailbreaks succeed at high rates for Latin-script languages but much lower rates for Punjabi in multilingual VLMs, pointing to language-dependent safety gaps.
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.
PRISM decomposes harmful instructions into benign visual gadgets and directs LVLMs via prompts to compose them through reasoning into harmful outputs, achieving ASR over 0.90 on SafeBench.
Nemotron 3 Super is an open 120B hybrid Mamba-Attention MoE model with new LatentMoE architecture and MTP layers that matches accuracy of similar models while delivering up to 7.5x higher inference throughput.
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
A framework detects LLM anomalies including hallucinations, jailbreaks, and backdoors by forensic inspection of layer-wise hidden state patterns, reporting over 95% accuracy with minimal computational overhead.
citing papers explorer
-
SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation
SafeGen-Bench is a benchmark with 10 malicious categories that evaluates conditional T2V models on paired start frames and text prompts, finding unsafety scores up to 44.5 and 80% guardrail failure rate.
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
JailWAM: Jailbreaking World Action Models in Robot Control
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
-
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models
MLingualFC benchmark finds flowchart jailbreaks succeed at high rates for Latin-script languages but much lower rates for Punjabi in multilingual VLMs, pointing to language-dependent safety gaps.
-
DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
-
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
-
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
-
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.
-
PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking
PRISM decomposes harmful instructions into benign visual gadgets and directs LVLMs via prompts to compose them through reasoning into harmful outputs, achieving ASR over 0.90 on SafeBench.
-
Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
Nemotron 3 Super is an open 120B hybrid Mamba-Attention MoE model with new LatentMoE architecture and MTP layers that matches accuracy of similar models while delivering up to 7.5x higher inference throughput.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
-
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics
A framework detects LLM anomalies including hallucinations, jailbreaks, and backdoors by forensic inspection of layer-wise hidden state patterns, reporting over 95% accuracy with minimal computational overhead.
- High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models