pith · machine review for the scientific record

arxiv: 2604.27002 · v1 · submitted 2026-04-29 · 💻 cs.CR

Recognition: unknown

Membership Inference Attacks Against Video Large Language Models

Gelei Deng, Wei Song, Yi Liu, Yuekang Li, Yuxin Cao, Ziqi Ding

Pith reviewed 2026-05-07 13:03 UTC · model grok-4.3

classification 💻 cs.CR
keywords: membership inference · video large language models · black-box attacks · privacy · temperature perturbation · semantic drift · VideoLLMs · training data leakage

The pith

A black-box attack can determine whether a specific video was used to train a VideoLLM by measuring how its generated text changes across different temperatures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a membership inference attack for video large language models that works from black-box text outputs alone. It queries the target model at low and high temperatures, quantifies the semantic drift between the resulting texts, and combines that signal with simple measures of the video's motion and temporal complexity. The central intuition is that videos seen during training induce sharper, more brittle generation: their outputs drift more across temperature changes than those of unseen videos, once the video's intrinsic difficulty is accounted for. When tested on LLaVA-Video-7B-Qwen2-Video-Only, the method reaches an AUC of 0.68 and accuracy of 0.63, showing that current VideoLLMs leak information about their training videos. This matters because these models are trained on large, heterogeneous video corpora that may contain private or copyrighted footage.
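
To make the temperature-perturbation step concrete, here is a minimal sketch of how the per-video drift score could be computed. The query_videollm call, the prompt, and the temperature pair are hypothetical stand-ins for whatever black-box interface the target exposes, and cosine distance between sentence embeddings is one plausible drift metric rather than the paper's stated one.

```python
# Minimal sketch of the temperature-perturbation step (not the authors' code).
# query_videollm() is a hypothetical stand-in for the black-box VideoLLM API;
# the prompt and temperature values are illustrative, not the paper's settings.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def query_videollm(video_path: str, prompt: str, temperature: float) -> str:
    """Hypothetical black-box call: returns generated text for a video."""
    raise NotImplementedError("replace with the target model's API")

def semantic_drift(video_path: str,
                   prompt: str = "Describe this video in detail.",
                   t_low: float = 0.1, t_high: float = 1.2) -> float:
    """Drift = 1 - cosine similarity between low- and high-temperature outputs."""
    text_low = query_videollm(video_path, prompt, temperature=t_low)
    text_high = query_videollm(video_path, prompt, temperature=t_high)
    emb = embedder.encode([text_low, text_high], convert_to_tensor=True)
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()
```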

Core claim

The paper claims that member videos induce sharper, more brittle generation behavior across decoding temperatures than non-member videos, and that this difference, when read jointly with the video's intrinsic difficulty, enables reliable black-box membership inference. The attack is realized by querying the model at low and high temperatures, computing semantic drift between the outputs, and feeding the drift value together with video difficulty features into a simple classifier.
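
The abstract does not name the attack classifier, so the sketch below uses logistic regression purely as a plausible stand-in; the feature names and the split into known-membership and audit videos are likewise editorial assumptions.

```python
# Sketch of the membership classifier over [drift, motion, span] features.
# Feature layout and classifier choice are assumptions, not the authors' spec.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def build_features(drifts, motion_scores, temporal_spans):
    """Stack per-video attack features into an (n_videos, 3) matrix."""
    return np.column_stack([drifts, motion_scores, temporal_spans])

def fit_and_score(X_known, y_known, X_audit, y_audit):
    """Train on videos with known membership labels; score the audit set."""
    clf = LogisticRegression(max_iter=1000).fit(X_known, y_known)
    scores = clf.predict_proba(X_audit)[:, 1]      # membership score in [0, 1]
    return scores, roc_auc_score(y_audit, scores)  # paper reports AUC 0.68
```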

What carries the argument

Temperature-perturbed generation that measures semantic drift between low- and high-temperature outputs, interpreted jointly with video-aware difficulty features.
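
The difficulty features are described only as motion complexity and temporal span; one plausible instantiation, sketched below, uses mean optical-flow magnitude and clip duration. Both definitions are editorial assumptions, not the authors' exact features.

```python
# Plausible difficulty features: mean optical-flow magnitude and clip duration.
# These definitions are assumptions; the paper's exact features may differ.
import cv2
import numpy as np

def difficulty_features(video_path: str, max_frames: int = 64):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    temporal_span = n_frames / fps  # clip duration in seconds

    flow_mags, prev_gray = [], None
    for _ in range(min(n_frames, max_frames)):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            flow_mags.append(np.linalg.norm(flow, axis=2).mean())
        prev_gray = gray
    cap.release()

    motion_complexity = float(np.mean(flow_mags)) if flow_mags else 0.0
    return motion_complexity, temporal_span
```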

If this is right

  • VideoLLMs leak private information about their training videos through the temperature sensitivity of their generated text.
  • Black-box auditors can check membership without access to model weights or training logs.
  • Privacy risks exist for any VideoLLM trained on uncurated video-text pairs from the open web.
  • The same temperature-drift signal may be weaker or absent in models that use heavy regularization or deduplication.
  • Mitigation techniques such as output smoothing or differential privacy during fine-tuning become necessary for VideoLLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attack could be adapted to detect whether proprietary video footage was used without permission in commercial VideoLLM training runs.
  • Similar temperature-based drift signals may appear in other temporal modalities such as audio or time-series data, suggesting a broader class of multimodal membership leaks.
  • If the difficulty features are removed, performance may drop, implying that the temperature signal alone is insufficient and that video-specific properties are essential to the method.
  • Developers could reduce the attack surface by training on shorter clips or by adding noise to the generation process at inference time.

Load-bearing premise

Member videos produce larger semantic drift between low- and high-temperature outputs than non-member videos once video difficulty is taken into account.

What would settle it

A controlled test in which the identical VideoLLM architecture is trained once with a set of videos and once without them, then the attack's AUC is measured on held-out members versus non-members; an AUC near 0.5 would falsify the claim.
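
To make "near 0.5" operational, the measured AUC could be compared against a label-permutation null distribution. This is an editorial sketch of that check, not a protocol the paper specifies.

```python
# Sketch: is the attack AUC distinguishable from chance on the controlled split?
# Editorial illustration only; the paper does not specify this test.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_vs_chance(scores, labels, n_perm=10_000, seed=0):
    """Return the observed AUC and a permutation p-value against AUC = 0.5."""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(labels, scores)
    null = np.array([
        roc_auc_score(rng.permutation(labels), scores) for _ in range(n_perm)
    ])
    # Two-sided: how often does a chance AUC deviate from 0.5 at least as much?
    p_value = np.mean(np.abs(null - 0.5) >= abs(observed - 0.5))
    return observed, p_value
```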

Figures

Figures reproduced from arXiv: 2604.27002 by Gelei Deng, Wei Song, Yi Liu, Yuekang Li, Yuxin Cao, Ziqi Ding.

Figure 1: Overview of TempVideo-MIA.
Original abstract

Video large language models (VideoLLMs) are increasingly trained or instruction-tuned on large-scale video-text corpora collected from heterogeneous sources, raising an immediate privacy question: can an external auditor determine whether a particular video was used during training? While membership inference attacks (MIAs) have been studied extensively for classifiers and, more recently, for text and image generation models, the VideoLLM setting remains unexplored. This setting is challenging because black-box auditors observe only generated text, whereas the membership signal is entangled with video-specific factors such as motion complexity and temporal span. In this paper, we present a black-box MIA targeting VideoLLMs that couples temperature-perturbed generation with video-aware difficulty features. Our key intuition is that member samples tend to induce sharper, more brittle generation behavior across decoding temperatures, and that this signal should be interpreted jointly with the intrinsic difficulty of the queried video. Concretely, we query the target model at low and high temperatures and measure the semantic drift between the resulting texts. We evaluate the attack against LLaVA-Video-7B-Qwen2-Video-Only and achieve a membership inference AUC of 0.68 and accuracy of 0.63. These results demonstrate that VideoLLMs are vulnerable to black-box membership inference attacks, highlighting an urgent need for the community to systematically evaluate and mitigate privacy risks in VideoLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce the first black-box membership inference attack (MIA) specifically targeting Video Large Language Models (VideoLLMs). It couples temperature-perturbed text generation with video-aware difficulty features (e.g., motion complexity and temporal span) based on the intuition that member videos produce sharper semantic drift between low- and high-temperature outputs. The attack is evaluated on LLaVA-Video-7B-Qwen2-Video-Only, reporting an AUC of 0.68 and accuracy of 0.63, which the authors interpret as evidence that VideoLLMs are vulnerable to black-box MIAs.

Significance. If the empirical results can be shown to be robust to controls and ablations, the work would be significant as the first demonstration of MIA feasibility for the video modality in LLMs, extending prior work on text and image models and underscoring privacy risks in training on heterogeneous video-text corpora. It provides a concrete starting point for future mitigation research in VideoLLMs.

major comments (2)
  1. [Abstract and Evaluation] The reported AUC of 0.68 and accuracy of 0.63 are presented without any baseline comparisons (e.g., random guessing, standard shadow-model MIAs, or non-membership-aware classifiers), statistical significance tests, or ablation studies removing the video-aware difficulty features. This makes it impossible to determine whether the measured signal arises from membership or from intrinsic video properties such as motion complexity.
  2. [Introduction and Method] The central intuition that 'member samples tend to induce sharper, more brittle generation behavior across decoding temperatures' is stated without derivation, without comparison to text or image LLMs, and without any reported ablation demonstrating that the difficulty features are required to reach the claimed performance. The signal could therefore be driven by video-intrinsic factors rather than membership, undermining the claim that the method specifically demonstrates VideoLLM vulnerability.
minor comments (2)
  1. [Abstract and Method] The abstract and method description would benefit from explicit definitions of the semantic-drift metric and the exact video difficulty features used (e.g., how motion complexity is quantified).
  2. [Method] Clarify the exact query format and temperature values employed in the temperature-perturbed generation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects for strengthening our evaluation. We will revise the manuscript to include the suggested baselines, ablations, and statistical analyses. Our responses to the major comments are as follows.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] The reported AUC of 0.68 and accuracy of 0.63 are presented without any baseline comparisons (e.g., random guessing, standard shadow-model MIAs, or non-membership-aware classifiers), statistical significance tests, or ablation studies removing the video-aware difficulty features. This makes it impossible to determine whether the measured signal arises from membership or from intrinsic video properties such as motion complexity.

    Authors: We agree that the evaluation in the original manuscript lacks these controls. In the revised version, we will add: (1) a random-guessing baseline at AUC 0.5, (2) a non-membership-aware classifier using only video difficulty features (motion complexity, temporal span), to show that the membership signal adds value beyond intrinsic properties, (3) a discussion of why a full shadow-model MIA is challenging in the strict black-box setting, together with a simple proxy where feasible, and (4) statistical significance tests (e.g., bootstrap confidence intervals) on the AUC; a minimal sketch of such a bootstrap appears after these responses. Additionally, we will perform and report an ablation study removing the video-aware features to quantify their contribution. These additions will clarify that the observed performance stems from the membership signal rather than from video properties alone. revision: yes

  2. Referee: [Introduction and Method] The central intuition that 'member samples tend to induce sharper, more brittle generation behavior across decoding temperatures' is stated without derivation, without comparison to text or image LLMs, and without any reported ablation demonstrating that the difficulty features are required to reach the claimed performance. The signal could therefore be driven by video-intrinsic factors rather than membership, undermining the claim that the method specifically demonstrates VideoLLM vulnerability.

    Authors: The intuition is based on empirical observations during development: member videos, being overfit, exhibit larger semantic shifts in generated text when temperature is varied, because the model is more 'confident' on training data, which makes its decoding brittle under temperature changes. We will expand the introduction to derive this more explicitly from overfitting principles and include comparisons to prior MIA results on text and image LLMs, noting that video introduces additional complexity due to temporal dynamics. As mentioned in response to the first comment, the ablation study will demonstrate the necessity of the difficulty features. We maintain that the attack is specific to VideoLLMs because it leverages video-specific features, but we will clarify this in the revision. revision: yes
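
The bootstrap confidence intervals promised in the first response are straightforward to specify; the sketch below is an editorial illustration of one such analysis, not code from the authors.

```python
# Editorial sketch of a percentile-bootstrap confidence interval for the AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Resample videos with replacement and return a (1 - alpha) CI for AUC."""
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), size=len(labels))
        if len(np.unique(labels[idx])) < 2:  # skip resamples with one class only
            continue
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```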

Circularity Check

0 steps flagged

No circularity: purely empirical attack with measured performance, no derivation or fitted prediction.

full rationale

The paper describes an empirical black-box MIA that queries the model at different temperatures, computes semantic drift, and augments with video difficulty features. The central result (AUC 0.68, accuracy 0.63 on LLaVA-Video-7B) is a direct measurement on held-out data, not a prediction derived from any equation or parameter fit within the paper. The key intuition about brittle generation is presented as an observation without equations, self-citations, or reductions to inputs. No load-bearing steps match the enumerated circularity patterns; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven domain assumption that training membership produces detectable differences in temperature sensitivity of generated text, modulated by video difficulty; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Member samples tend to induce sharper, more brittle generation behavior across decoding temperatures
    This is presented as the key intuition that the attack builds upon.

pith-pipeline@v0.9.0 · 5556 in / 1301 out tokens · 82573 ms · 2026-05-07T13:03:03.874012+00:00 · methodology

discussion (0)

