pith. sign in

arxiv: 2605.18984 · v1 · pith:ONHQEPM2new · submitted 2026-05-18 · 💻 cs.CV

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Pith reviewed 2026-05-20 10:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords artifact detectionAI-generated videosmultimodal large language modelsvideo realism evaluationbenchmarktemporal inconsistencieshuman perceptual alignment
0
0 comments X

The pith

MLLMs struggle to detect artifacts in AI-generated videos, often performing near random levels and misaligning with human judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Artifact-Bench to test how well multimodal large language models perceive and reason about flaws such as temporal inconsistencies, structural distortions, and semantic issues in AI-generated videos. It builds a three-level taxonomy spanning photorealistic, animated, and CG-style content, then defines three tasks: distinguishing real from generated videos, comparing pairs for realism, and identifying specific artifacts. Experiments across 19 models show many fall to random or below-random accuracy in harder cases and diverge from human perceptual preferences. A reader would care because these models are increasingly used to judge or guide video generation progress, yet the results indicate they cannot yet serve as trustworthy automated evaluators.

Core claim

The paper claims that current MLLMs have substantial limitations in artifact perception and reasoning for AI-generated videos. Using Artifact-Bench and its hierarchical taxonomy, the evaluation tasks reveal that model performance frequently approaches random guessing in challenging settings and shows significant misalignment with human judgments on realism.

What carries the argument

Artifact-Bench, a benchmark built on a three-level hierarchical taxonomy of realism artifacts that organizes evaluation into real-versus-generated classification, pairwise realism comparison, and fine-grained artifact identification.

If this is right

  • MLLMs cannot yet serve as reliable general evaluators for AI-generated video realism.
  • Artifact perception and fine-grained diagnostic reasoning remain weak points that future model development must address.
  • The observed misalignment implies that current MLLM training does not adequately capture human perceptual cues for video flaws.
  • The benchmark supplies a structured diagnostic tool for measuring progress across photorealistic, animated, and CG-style domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Video generation teams could adopt the benchmark to pinpoint recurring artifact patterns and guide targeted improvements in their models.
  • Similar hierarchical taxonomies and multi-task setups might be applied to evaluate artifact detection in other generative media such as images or audio.
  • Incorporating explicit human preference data during MLLM fine-tuning could narrow the gap between model judgments and perceptual reality.
  • The benchmark's task design offers a template for creating automated metrics that better track realism advances without constant human review.

Load-bearing premise

The three-level hierarchical taxonomy of realism artifacts comprehensively captures the failure modes that matter for human perception of AI-generated video.

What would settle it

A follow-up test in which a new MLLM achieves consistently above-random accuracy on the pairwise comparison and fine-grained identification tasks while producing rankings that closely match aggregated human ratings on the same video set.

Figures

Figures reproduced from arXiv: 2605.18984 by Bohan Zeng, Bozhou Li, Chengzhuo Tong, Fengxiang Wang, Haotian Wang, Pengfei Wan, Qixun Wang, Ruizhe Chen, Shilin Yan, Xinlong Chen, Xinyu Liu, Xuanyu Zhu, Xuehai Bai, Yang Shi, Yifan Dai, Yi-Fan Zhang, Yiyan Ji, Yuanxing Zhang, Yue Ding, Yuhao Dong, Yujie Wei, Yuqi Tang, Yuran Wang, Zhuoran Zhang.

Figure 1
Figure 1. Figure 1: The Hierarchical Taxonomy of AI-Generated Video Artifacts. We organize AI-generated video realism artifacts into three hierarchical tiers: top-level artifact domains, mid-level failure families, and 30 fine-grained artifact types. The taxonomy spans Surface Artifacts, Structural Defects, and Temporal-Semantic (Temp-Sem) Violations, covering failures in visual appearance, object and scene structure, tempora… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the three proposed tasks and their evaluation workflows. From top to bottom, the figure demonstrates the input formats and expected reasoning pipelines for RVAC, PVRC, and AID. These tasks form a comprehensive hierarchy, evaluating model capabilities from coarse-grained recognition to detailed artifact identification. evaluates the fine-grained reasoning ability of MLLMs in accurately ident… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Artifact-Bench construction pipeline. We build a hybrid dataset combining real-world and AI-generated videos for three tasks: RVAC, PVRC, and AID. Real videos are captioned to generate semantically aligned AIGC counterparts, while AIGC pairs with varying realism are constructed via re-generation and prompt-consistent sampling. For artifact coverage, we combine natural collection with target… view at source ↗
Figure 4
Figure 4. Figure 4: Statistics of Artifact-Bench. (a) Hierarchical breakdown of the major video categories and diverse sub-scenarios. (b) Distribution of video sources, featuring a variety of recent state-of-the-art generative models. (c)–(e) Distributions of video duration, spatial resolution, and the number of primary subjects, respectively, demonstrating the structural diversity of our dataset. 4. Experiments 4.1. Evaluati… view at source ↗
Figure 5
Figure 5. Figure 5: Failure cases requiring fine-grained and temporal-spatial perception. (a) The paddle penetrates the boat hull, while this artifact occupies only a small portion of the entire image; (b) the football changes from two balls to one and then back to two, requiring multi-frame object tracking. not necessarily improve artifact detection capability. For example, InternVL3.5-38B performs comparably to its 8B count… view at source ↗
Figure 6
Figure 6. Figure 6: Representative examples for Task 1: Real vs. AI-Generated Video Classification (RVAC). [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representative examples for Task 2: Pairwise Video Realism Comparison (PVRC). [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Representative examples for Task 3: Artifact Identification (AID). [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
read the original abstract

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Artifact-Bench, a benchmark for evaluating MLLMs on detecting and reasoning about artifacts in AI-generated videos. It defines a three-level hierarchical taxonomy of realism artifacts spanning photorealistic, animated, and CG-style videos, and defines three tasks: real vs. AI-generated classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 MLLMs are reported to show substantial limitations, with many models performing near or below random chance in challenging settings and exhibiting misalignment with human perceptual preferences.

Significance. If the benchmark construction and human-alignment claims hold after validation, the work would provide a useful diagnostic tool for video-generation evaluation beyond photorealistic domains and could guide improvements in MLLM perceptual reasoning. The structured taxonomy and multi-task design offer a concrete starting point for future artifact-aware benchmarks.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: the claim that 19 MLLMs were evaluated with performance trends (including near/below-random results and human misalignment) is presented without any reported dataset size, number of videos per category, annotation protocol, inter-annotator agreement, statistical tests, or error bars. This directly undermines verifiability of the central empirical claims.
  2. [Taxonomy] Taxonomy section: the three-level hierarchy (photorealistic/animated/CG-style) is asserted to comprehensively capture relevant failure modes, yet no human-elicitation study, perceptual validation, or coverage analysis against human judgments is described. This assumption is load-bearing for the misalignment result, because low model scores could reflect taxonomy-perception mismatch rather than genuine MLLM deficits.
minor comments (2)
  1. [Conclusion] Add explicit release statement for the full dataset, annotations, and evaluation code to support reproducibility.
  2. [Task Definitions] Clarify the exact prompting templates and output parsing rules used for each of the three tasks so that results can be replicated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the presentation of our work. Below, we provide point-by-point responses to the major comments and describe the revisions we plan to make.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the claim that 19 MLLMs were evaluated with performance trends (including near/below-random results and human misalignment) is presented without any reported dataset size, number of videos per category, annotation protocol, inter-annotator agreement, statistical tests, or error bars. This directly undermines verifiability of the central empirical claims.

    Authors: We agree with the referee that the verifiability of our empirical claims would be strengthened by more explicit and consolidated reporting of these details. The current manuscript describes the benchmark construction and evaluation setup but does not provide a single summary of dataset sizes, category distributions, annotation protocols, inter-annotator agreement, or include error bars and statistical tests in a prominent way. We will revise the Experiments section to add a new subsection on 'Dataset and Annotation Details' that reports the total number of videos, breakdown per category and style, the annotation protocol, inter-annotator agreement scores, and we will ensure all performance metrics are accompanied by error bars and results of statistical significance tests. This will make the central claims fully verifiable. revision: yes

  2. Referee: [Taxonomy] Taxonomy section: the three-level hierarchy (photorealistic/animated/CG-style) is asserted to comprehensively capture relevant failure modes, yet no human-elicitation study, perceptual validation, or coverage analysis against human judgments is described. This assumption is load-bearing for the misalignment result, because low model scores could reflect taxonomy-perception mismatch rather than genuine MLLM deficits.

    Authors: We acknowledge the referee's concern that the taxonomy's comprehensiveness should be validated against human perceptions to support the misalignment findings. The taxonomy was derived from a systematic review of artifacts observed in AI-generated videos across different visual styles. However, the manuscript does not include a human-elicitation study or coverage analysis. To address this, we will conduct and report a small-scale perceptual validation study in the revised manuscript, where human participants categorize sample artifacts according to the proposed hierarchy and we measure agreement with the taxonomy. We will also discuss any observed mismatches and their implications for interpreting the MLLM results. This revision will help rule out taxonomy-perception mismatch as an alternative explanation for the observed model limitations. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction with independent empirical evaluation

full rationale

The paper introduces Artifact-Bench as a new evaluation framework consisting of a three-level taxonomy of artifacts across video styles and three tasks (classification, pairwise comparison, fine-grained identification). It then reports empirical results on 19 MLLMs showing performance limitations and human misalignment. No equations, parameter fits, or predictions are defined in terms of the outputs; the taxonomy and tasks are presented as author-defined constructs to address a gap, with results measured against them rather than derived from them. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained as an empirical benchmark study against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the assumption that the newly defined three-level taxonomy is both exhaustive and aligned with human perception; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption A three-level hierarchical taxonomy covering photorealistic, animated, and CG-style videos adequately organizes all realism artifacts relevant to MLLM evaluation.
    Invoked in the abstract when the authors state they first establish the taxonomy and then define tasks based on it.

pith-pipeline@v0.9.0 · 5832 in / 1092 out tokens · 25014 ms · 2026-05-20T10:22:08.190892+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 8 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 1, 2, 7

  2. [2]

    Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    Xuehai Bai, Yang Shi, Yi-Fan Zhang, Xuanyu Zhu, Yuran Wang, Yifan Dai, Xinyu Liu, Yiyan Ji, Xiaoling Gu, and Yuanxing Zhang. Edit-compass & editreward-compass: A unified benchmark for image editing and reward modeling. arXiv preprint arXiv:2605.13062, 2026. 3

  3. [3]

    Avocado: An audiovisual video captioner driven by temporal orchestration,

    Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, et al. Avocado: An audiovisual video cap- tioner driven by temporal orchestration.arXiv preprint arXiv:2510.10395, 2025. 2

  4. [4]

    Versavid-r1: A versatile video understanding and reasoning model from question answering to captioning tasks.arXiv e-prints, pages arXiv–2506, 2025

    Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, and Tieniu Tan. Versavid-r1: A versatile video understanding and reasoning model from question answering to captioning tasks.arXiv e-prints, pages arXiv–2506, 2025. 2

  5. [5]

    Opengpt-4o-image: A compre- hensive dataset for advanced image generation and editing

    Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, et al. Opengpt-4o-image: A compre- hensive dataset for advanced image generation and editing. arXiv preprint arXiv:2509.24900, 2025. 3

  6. [6]

    3, 4 Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang

    Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, et al. Em- bodiedeval: Evaluate multimodal llms as embodied agents. arXiv preprint arXiv:2501.11858, 2025. 3

  7. [7]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language mod- els with video understanding and grounding.arXiv preprint arXiv:2601.10611, 2026. 7

  8. [8]

    Gemini 3.1 pro

    Google DeepMind. Gemini 3.1 pro. https://deepmind. google / models / model - cards / gemini - 3 - 1 - pro/, 2026. 1, 2, 5, 7

  9. [9]

    Google. Veo 3. https://aistudio.google.com/ models/veo-3, 2025. 1, 5

  10. [10]

    Gemini 3 flash

    Google DeepMind. Gemini 3 flash. https://deepmind. google/models/gemini/flash/, 2025. 7

  11. [11]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Ei- tan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon,...

  12. [12]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025. 2

  13. [13]

    Kling ai: Video generation model

    Kuaishou AI Team. Kling ai: Video generation model. https://klingai.com/, 2024. 1, 5

  14. [14]

    Aegis: Authen- ticity evaluation benchmark for ai-generated video sequences

    Jieyu Li, Xin Zhang, and Joey Tianyi Zhou. Aegis: Authen- ticity evaluation benchmark for ai-generated video sequences. InProceedings of the 33rd ACM International Conference on Multimedia, page 13346 –13353. ACM, 2025. 2, 3

  15. [15]

    Skyra: Ai- generated video detection via grounded artifact reasoning,

    Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, and Jiwen Lu. Skyra: Ai- generated video detection via grounded artifact reasoning,

  16. [16]

    Uve: Are mllms uni- fied evaluators for ai-generated videos?arXiv preprint arXiv:2503.09949, 2025

    Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, and Lu Jiang. Uve: Are mllms uni- fied evaluators for ai-generated videos?arXiv preprint arXiv:2503.09949, 2025. 2, 3

  17. [17]

    OpenAI. Gpt-4o. https://openai.com/index/ hello-gpt-4o/, 2024. 2

  18. [18]

    OpenAI. Gpt-4.1. https://platform.openai.com/ docs/models/gpt-4.1, 2025. 1

  19. [19]

    Mavors: Multi-granularity video representation for multimodal large language model

    Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanx- ing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, et al. Mavors: Multi-granularity video representation for multimodal large language model. InPro- ceedings of the 33rd ACM International Conference on Multi- media, pages 10994–11003, 2025. 1, 2

  20. [20]

    Mme-videoocr: Eval- uating ocr-based capabilities of multimodal llms in video scenarios, 2025

    Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Li- jie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, Zhuoran Zhang, Xinlong Chen, Bohan Zeng, Sihan Yang, Yushuo Guan, Zhang Zhang, Liang Wang, Haoxuan Li, Zhouchen Lin, Yuanxing Zhang, Pengfei Wan, Haotian Wang, and Wenjing Yang. Mme-videoocr: Eval- uating ocr-based capabilities o...

  21. [21]

    Speed by simplicity: A single-stream architecture for fast audio-video generative foundation model,

    SII-GAIR and Sand.ai. Speed by simplicity: A single-stream architecture for fast audio-video generative foundation model,

  22. [22]

    Vf- eval: Evaluating multimodal llms for generating feedback on aigc videos

    Tingyu Song, Tongyan Hu, Guo Gan, and Yilun Zhao. Vf- eval: Evaluating multimodal llms for generating feedback on aigc videos. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21126–21146, 2025. 2, 3

  23. [23]

    Videoveritas: Ai-generated video detection via perception pretext reinforcement learning, 2026

    Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Hui- jia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Videoveri- tas: Ai-generated video detection via perception pretext rein- forcement learning.arXiv preprint arXiv:2602.08828, 2026. 1, 7

  24. [24]

    Kwai keye-vl 1.5 technical report, 2025

    Kwai Keye Team. Kwai keye-vl 1.5 technical report, 2025. 7

  25. [25]

    Hunyuanvideo 1.5 technical report, 2025

    Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. 1, 5

  26. [26]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

  27. [27]

    Geollava-8k: scaling remote-sensing multimodal large language models to 8k resolution.Ad- vances in Neural Information Processing Systems, 38:159185– 159218, 2026

    Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Shan Boqi, Long Lan, Yulin Wang, et al. Geollava-8k: scaling remote-sensing multimodal large language models to 8k resolution.Ad- vances in Neural Information Processing Systems, 38:159185– 159218, 2026. 1

  28. [28]

    Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution re- mote sensing imagery.arXiv preprint arXiv:2602.14201, 2026

    Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, et al. Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution re- mote sensing imagery.arXiv preprint arXiv:2602.14201, 2026

  29. [29]

    Text before vision: Staged knowledge injection matters for agentic rlvr in ultra-high- resolution remote sensing understanding.arXiv preprint arXiv:2602.14225, 2026

    Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yuhao Zhou, Di Wang, Yifan Zhang, Haoyu Wang, Haiyan Zhao, Hongda Sun, et al. Text before vision: Staged knowledge injection matters for agentic rlvr in ultra-high- resolution remote sensing understanding.arXiv preprint arXiv:2602.14225, 2026. 1

  30. [30]

    Monet: Reasoning in latent visual space beyond images and language,

    Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language,

  31. [31]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 2, 7

  32. [32]

    Busterx: Mllm-powered ai-generated video forgery detection and explanation.arXiv preprint arXiv:2505.12620, 2025

    Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, and Guangliang Cheng. Busterx: Mllm-powered ai-generated video forgery detection and explanation.arXiv preprint arXiv:2505.12620, 2025. 2, 3

  33. [33]

    Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm, 2025

    Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm. arXiv preprint arXiv:2507.14632, 2025. 3, 7

  34. [34]

    Mimo-vl technical report, 2025

    LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. 7

  35. [35]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wen- shuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking effi- cient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025. 2

  36. [36]

    Debiasing multimodal large language models via penal- ization of language priors

    YiFan Zhang, Yang Shi, Weichen Yu, Qingsong Wen, Xue Wang, Wenjing Yang, Zhang Zhang, Liang Wang, and Rong Jin. Debiasing multimodal large language models via penal- ization of language priors. InProceedings of the 33rd ACM International Conference on Multimedia, pages 4232–4241,

  37. [37]

    Mm-rlhf: The next step forward in multimodal llm alignment,

    Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. Mm-rlhf: The next step forward in multimodal llm alignment.arXiv preprint arXiv:2502.10391, 2025. 2

  38. [38]

    When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms,

    Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Hao- tian Wang, Di Wang, and Lijie Hu. When modalities conflict: How unimodal reasoning uncertainty governs preference dy- namics in mllms.arXiv preprint arXiv:2511.02243, 2025. 3

  39. [39]

    yes". Otherwise, classify it as

    Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanx- ing Zhang, Ziwei Liu, et al. Vtc-bench: Evaluating agentic multimodal models via compositional visual tool chaining. arXiv preprint arXiv:2603.15030, 2026. 3 A. Experiment Details A.1. Experimental Setup For all evaluated models, we use a default video...

  40. [42]

    yes" - "no

    The only valid outputs are: - "yes" - "no"

  41. [43]

    Output exactly one valid answer without any additional text or explanation

  42. [44]

    <Video A>

    If the response does not contain a clear final answer indicating whether the video is AI-generated, output: "Invalid" Prompt Template for Task 2: Pairwise Video Realism Comparison (PVRC) You are an answer extraction system. The original task is to compare two AI-generated videos, <Video A> and <Video B>, and determine which video has higher perceptual rea...

  43. [47]

    <Video A>

    Output exactly one of the following: - "<Video A>" - "<Video B>"

  44. [49]

    The original task is to identify all AIGC-specific artifacts that are clearly observable in a given AI-generated video

    If the response does not contain a clear final answer indicating which video is more realistic, output: "Invalid" Prompt Template for Task 3: Artifact Identification (AID) You are an answer extraction system. The original task is to identify all AIGC-specific artifacts that are clearly observable in a given AI-generated video. The candidate options are mu...

  45. [50]

    Ignore all reasoning, explanations, and intermedi- ate analysis

  46. [51]

    Focus only on the final conclusion

  47. [52]

    Extract only the selected option letters

  48. [53]

    If multiple options are selected, output them separated by commas (e.g., "A,C,E")

  49. [54]

    Do not output any additional text or explanation

  50. [55]

    Only output valid option letters that appear in the final answer

  51. [56]

    yes" if the video is AIGC, and

    If the response does not contain a clear final answer, output: "Invalid" B. Benchmark Details B.1. Representative Examples from Artifact-Bench In order to comprehensively convey the characteristics of tasks in Artifact-Bench, two representative examples are presented for each task, as illustrated in Figures 6, 7, and 8. B.2. Task Distribution Table 3 show...