Jailbreaking Multimodal Large Language Models using Multi-Clip Video

Choongwon Kang; Hyunmin Jun; Jang Hyun Kim; Seungjong Sun

arXiv: 2606.02111 · v1 · pith:WGN5BSZNnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI · cs.CL

Pith reviewed 2026-06-28 15:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords jailbreaking · multimodal large language models · video inputs · safety alignment · vulnerability · MCV SafetyBench · attack success rate · context diversity

The pith

Video inputs with more diverse clips increase jailbreak success rates on multimodal large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a benchmark of 2,920 videos, each assembled from multiple short clips that present diverse contexts tied to a single harmful query. Experiments across eight video-processing MLLMs establish that jailbreak success rises steadily as the clip count grows. The same tests show the video modality produces higher vulnerability than the image modality, that dynamic video exceeds static video in risk, and that greater context variety amplifies the effect. These patterns matter because they isolate which video properties weaken existing safety alignments. The authors outline a defense that routes through the more robust image modality instead.

Core claim

The central claim is that attack success rate increases consistently with the number of clips. The paper further reports that the video modality is more vulnerable than the image modality, that dynamic videos are more vulnerable than static videos, and that videos with more diverse contexts produce higher attack success.

What carries the argument

MCV SafetyBench, a dataset of 2,920 multi-clip videos constructed so each video contains multiple short clips depicting diverse contexts for a harmful query, used to measure how clip count and diversity affect jailbreak success.

If this is right

Jailbreak attack success increases as the number of clips in a video increases.
The video modality produces higher vulnerability than the image modality.
Dynamic videos produce higher vulnerability than static videos.
Videos containing more diverse contexts produce higher attack success rates.
A defense can be constructed by routing inputs through the more robust image modality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety testing for video MLLMs should routinely vary clip count and context diversity rather than relying on single-frame or short-clip prompts.
Practical deployments may need preprocessing steps that reduce a video to fewer clips or static frames before model input.
The same diversity principle could be tested on other sequential modalities such as audio tracks to check whether vulnerability scales similarly.
Defense design might combine image-based checks with selective sampling of video frames rather than full multi-clip processing.
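
A minimal sketch of that last idea, assuming frames are already decoded into a sequence and that some image-level safety check is available. The names `sample_evenly`, `image_gate`, and the `is_unsafe_image` callable are hypothetical and used only for illustration; this is not the authors' defense, which the paper describes through an image filtering prompt (Figure 20).

```python
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_evenly(frames: Sequence[Frame], k: int) -> list[Frame]:
    """Pick up to k frames spread evenly across the video.

    Uses the midpoint of each of k equal segments, computed with
    integer arithmetic so the chosen indices are deterministic.
    """
    n = len(frames)
    if k <= 0 or n == 0:
        return []
    if k >= n:
        return list(frames)
    return [frames[(2 * i + 1) * n // (2 * k)] for i in range(k)]


def image_gate(
    frames: Sequence[Frame],
    is_unsafe_image: Callable[[Frame], bool],  # hypothetical image-level check
    k: int = 8,
) -> bool:
    """Return True if the video may be passed on to the video model.

    The video is blocked as soon as any sampled frame is flagged.
    """
    return not any(is_unsafe_image(f) for f in sample_evenly(frames, k))
```

Only the sampled frames are inspected, so `k` trades checking cost against coverage: content that appears solely in unsampled frames would pass this gate.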

Load-bearing premise

The constructed videos isolate the effects of clip count and context diversity on attack success without other differences in clip selection, length, or content quality driving the results.

What would settle it

If controlled experiments on the same eight models using videos that vary only in clip count show no consistent rise in attack success rate as clip number increases, the central claim would be falsified.
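
The check described above can be sketched as a small tabulation, assuming each evaluation record reduces to a model name, a clip count, and a boolean success label. The record layout and the function names are assumptions made for illustration, and "consistent rise" is read here in its strictest form (every step up in clip count raises the rate); the paper may use a looser criterion.

```python
from collections import defaultdict


def asr_by_clip_count(records):
    """Compute attack success rate per (model, clip count).

    records: iterable of (model, n_clips, success) tuples, success a bool.
    Returns {model: {n_clips: rate}} with clip counts in ascending order.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, n_clips, success in records:
        totals[(model, n_clips)] += 1
        hits[(model, n_clips)] += int(success)

    table = defaultdict(dict)
    for (model, n_clips), total in totals.items():
        table[model][n_clips] = hits[(model, n_clips)] / total
    return {m: dict(sorted(rates.items())) for m, rates in table.items()}


def rises_consistently(rate_by_n):
    """True if the rate strictly increases with every larger clip count."""
    values = [rate_by_n[n] for n in sorted(rate_by_n)]
    return all(later > earlier for earlier, later in zip(values, values[1:]))
```

Running `rises_consistently` on each model's row separately shows whether the trend holds for all eight models or only in aggregate.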

Figures

Figures reproduced from arXiv: 2606.02111 by Choongwon Kang, Hyunmin Jun, Jang Hyun Kim, Seungjong Sun.

**Figure 2.** Overview of the MCV SafetyBench construction process. Phase 1 performs semantic extraction and reconstruction to generate prompts for video generation using GPT-4o. Phase 2 conducts text-to-video generation using the Wan2.2-T2V-A14B model via ComfyUI with the reconstructed prompts. Phase 3 integrates typographic images into the generated videos to construct combined versions.

**Figure 3.** ASR across 13 usage policy violation categories with varying numbers of clips. All results are obtained under the Explicit attack setting, while results for Implicit attacks are reported in Appendix D.2.

**Figure 4.** Visualization of the representations of each data type using two-dimensional PCA. Each figure illustrates …

**Figure 5.** Prompt template for semantic extraction.

**Figure 6.** Prompt template for semantic reconstruction.

**Figure 7.** Examples of extracted semantic components.

**Figure 8.** Examples of reconstructed semantic components.

**Figure 9.** Explicit attack prompt used in our experiments.

**Figure 10.** Implicit attack prompt used in our experiments.

**Figure 11.** 15 detailed prohibited CLAS usage policy rules.

**Figure 12.** Judgement score criteria for each model’s responses.

**Figure 13.** Score judgment prompt used to generate reasoning and scores with GPT-4o-mini as the judge model.

**Figure 14.** Examples used for the human evaluation.

**Figure 15.** Model response for the Illegal Activity (IA) category.

**Figure 16.** Model response for the Hate Speech (HS) category.

**Figure 17.** Model response for the Physical Harm (PH) category.

**Figure 18.** Model response for the Economic Harm (EH) category.

**Figure 19.** ASR across multiple categories under Implicit attacks across different clip settings.

**Figure 20.** Image filtering prompt used to defend against video jailbreak attacks.

Original abstract

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows attack success rising with more clips on eight video MLLMs via a new benchmark, but the video construction likely fails to isolate clip count from total harmful content.

The core result is that jailbreak success on video MLLMs increases as the number of short clips grows, with additional claims that video beats image, dynamic beats static, and higher diversity beats lower. They back this with experiments on eight models using their MCV SafetyBench of 2,920 videos.

What stands out as new is the dataset itself and the focus on video-specific factors like clip count and context diversity, which goes beyond the image jailbreak literature they cite. Reporting patterns across multiple models gives a sense of how general the trend might be.

The main weakness is that the videos are described as using multiple short clips for each harmful query, yet there is no evidence they matched total duration, per-clip content strength, or selection criteria when creating the one-clip through multi-clip versions. If adding clips also adds more distinct harmful segments, the success increase could simply track the amount of bad material rather than the multi-clip structure. The abstract supplies no statistical tests, error bars, or generation details, so the central claim rests on raw patterns without visible support.
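
One control this paragraph asks about, matching total duration across clip-count variants, can be illustrated with a small sketch. It assumes a fixed frame budget per video and splits it as evenly as possible among the clips; `clip_lengths` is a hypothetical helper, and nothing here describes how MCV SafetyBench was actually built.

```python
def clip_lengths(total_frames: int, n_clips: int) -> list[int]:
    """Split a fixed frame budget into n_clips near-equal clip lengths.

    The lengths always sum to total_frames, so a 1-clip video and a
    4-clip video built from this plan have the same total duration.
    """
    if n_clips <= 0:
        raise ValueError("n_clips must be positive")
    if total_frames < n_clips:
        raise ValueError("need at least one frame per clip")
    base, extra = divmod(total_frames, n_clips)
    # The first `extra` clips absorb the remainder, one frame each.
    return [base + 1 if i < extra else base for i in range(n_clips)]
```

Holding the budget fixed removes duration as a confound, but it does not by itself equalize per-clip content strength, which would need a separate control.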

This work is aimed at people studying safety in video MLLMs. Readers who need concrete examples of video vulnerabilities or a starting benchmark will find some value, even with the gaps. It deserves peer review because the topic is timely and the dataset is new, but the authors will have to clarify the controls and add proper evaluation before the claims hold up.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos each consisting of multiple short clips depicting diverse contexts related to harmful queries. Experiments on eight representative video MLLMs report that jailbreak attack success consistently increases with the number of clips. Additional results claim the video modality is more vulnerable than the image modality, dynamic videos more vulnerable than static videos, and videos with more diverse contexts more vulnerable. A defense strategy leveraging the relative robustness of the image modality is proposed.

Significance. If the results hold under proper controls, the work identifies a potential new attack surface for video MLLMs based on input diversity and provides an empirical benchmark for studying it. The creation of MCV SafetyBench and evaluation across eight models constitute a concrete contribution to multimodal safety research, and the proposed defense offers a practical direction. The purely empirical nature of the claims, however, makes the absence of methodological controls and statistical support a central limitation on the strength of the findings.

major comments (2)

[Abstract and Experiments description] The central claim that attack success increases with the number of clips requires that MCV SafetyBench videos isolate the effect of clip count by holding fixed total duration, per-clip content, selection criteria, and harmful-signal strength. The abstract states each video uses "multiple short clips depicting diverse contexts related to a harmful query" but supplies no description of length-matching, content-matching, or randomization procedures when constructing the 1-clip, 2-clip, … variants. This leaves open the possibility that the observed trend is driven by quantity of harmful material rather than the multi-clip format.

[Abstract] The abstract states "attack success consistently increases with the number of clips" across eight models but supplies no statistical tests, controls, error bars, or generation details, leaving the central empirical claim without visible quantitative support for evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major concerns regarding methodological controls and statistical support below, and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract and Experiments description] The central claim that attack success increases with the number of clips requires that MCV SafetyBench videos isolate the effect of clip count by holding fixed total duration, per-clip content, selection criteria, and harmful-signal strength. The abstract states each video uses "multiple short clips depicting diverse contexts related to a harmful query" but supplies no description of length-matching, content-matching, or randomization procedures when constructing the 1-clip, 2-clip, … variants. This leaves open the possibility that the observed trend is driven by quantity of harmful material rather than the multi-clip format.

Authors: We agree that the manuscript should provide explicit details on how the different clip-count variants were constructed to isolate the effect of clip number. The current version does not include a full description of length-matching or randomization procedures. We will revise the paper to add a dedicated subsection in the Experiments or Dataset section detailing the video construction process, including how total duration and content are handled across variants with different numbers of clips. This will allow readers to assess whether the trend is attributable to the multi-clip format.

Revision: yes

Referee: [Abstract] The abstract states "attack success consistently increases with the number of clips" across eight models but supplies no statistical tests, controls, error bars, or generation details, leaving the central empirical claim without visible quantitative support for evaluation.

Authors: We acknowledge that the abstract does not include statistical tests, error bars, or detailed generation information. While the full manuscript contains experimental results across eight models, we will update the abstract to reference the supporting quantitative evidence and add error bars and statistical analysis (e.g., significance tests) to the results figures and tables in the revision. This will provide the necessary quantitative support for the claim.

Revision: yes
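
As an illustration of the kind of error bar under discussion, the sketch below computes a Wilson score interval for a single attack success rate. The choice of the Wilson interval and the default `z = 1.96` (roughly 95% coverage) are assumptions made here; the rebuttal does not say which statistical method the authors will adopt.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion such as ASR.

    Returns (low, high), clipped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

Comparing the intervals for adjacent clip counts gives a rough sense of whether an observed increase exceeds sampling noise; a formal trend test would still be needed to support the word "consistently".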

Circularity Check

0 steps flagged

No circularity: purely empirical reporting on new benchmark

The paper introduces MCV SafetyBench and reports attack success rates on eight MLLMs as clip count and context diversity vary. No equations, parameter fitting, or derivations exist that could reduce a claimed result to its inputs by construction. All load-bearing claims rest on direct experimental measurement rather than self-referential definitions or self-citation chains. The dataset construction and evaluation procedures are presented as independent of any prior fitted quantities from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Central claim rests on empirical results from a newly introduced benchmark; no free parameters, background axioms, or invented physical entities are invoked.

invented entities (1)

Multi-Clip Video (MCV) SafetyBench (no independent evidence)
purpose: Dataset of 2,920 videos to measure the effect of clip diversity on MLLM jailbreak vulnerability
Newly introduced collection whose construction details and external validation are not described in the abstract.

pith-pipeline@v0.9.1-grok · 5715 in / 1359 out tokens · 33245 ms · 2026-06-28T15:14:43.328417+00:00 · methodology
