pith. machine review for the scientific record.

arxiv: 2604.19193 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

How Far Are Video Models from True Multimodal Reasoning?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video models · multimodal reasoning · CLVG-Bench · video generation · benchmark · physical simulation · logical reasoning

The pith

State-of-the-art video models handle basic understanding but fail on logically grounded and interactive video generation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLVG-Bench, a new evaluation framework with over 1,000 manually annotated examples across physical simulation, logical reasoning, and interactive contexts, paired with an Adaptive Video Evaluator that aligns with human judgment. It tests zero-shot reasoning in video generation models and finds that leading systems like Seedance 2.0 perform adequately on straightforward understanding subtasks yet drop to under 25 percent success on logically grounded generation and near zero on interactive generation. The work argues this gap reveals multimodal reasoning and physical grounding as the main remaining bottlenecks. By releasing the benchmark and evaluator, the authors aim to give researchers concrete, interpretable feedback for improving general-purpose video models.

Core claim

While state-of-the-art video models demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short on logically grounded and interactive generation tasks, achieving success rates below 25 percent and approximately zero percent, respectively.

What carries the argument

CLVG-Bench, a Context Learning in Video Generation evaluation framework comprising more than 1,000 manually annotated metadata items across 6 categories and 47 subcategories, together with the Adaptive Video Evaluator, which supplies textual feedback aligned with human perception.
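
To make the framework's shape concrete, here is a minimal sketch in Python of what one CLVG-Bench item and a zero-shot evaluation pass could look like. The field names and the model.generate / evaluator.judge interfaces are hypothetical, inferred from the abstract and figure captions rather than taken from the released code.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical schema inferred from the paper's description: each of the
    # 1,000+ items carries a textual context plus optional image/video/audio
    # references, filed under one of 6 categories and 47 subcategories.
    @dataclass
    class CLVGItem:
        item_id: str
        category: str                                    # e.g. "physical_simulation"
        subcategory: str                                 # one of the 47 task types
        context: str                                     # textual instruction/context
        images: List[str] = field(default_factory=list)  # reference image paths
        video: Optional[str] = None                      # reference video path
        audio: Optional[str] = None                      # reference audio path

    def evaluate_model(model, evaluator, items: List[CLVGItem]) -> dict:
        """Zero-shot pass: generate one video per item, score it with the
        (assumed) evaluator, and aggregate success rates per category."""
        passed, total = {}, {}
        for item in items:
            video = model.generate(item.context, images=item.images,
                                   video=item.video, audio=item.audio)
            verdict = evaluator.judge(item, video)  # assumed: success flag + text feedback
            total[item.category] = total.get(item.category, 0) + 1
            passed[item.category] = passed.get(item.category, 0) + int(verdict.success)
        return {cat: passed[cat] / n for cat, n in total.items()}

Under this reading, the headline numbers are simply the per-category outputs of such a loop: below 0.25 for logically grounded generation and near 0.0 for interactive generation.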

Load-bearing premise

The manually created tasks in CLVG-Bench and the Adaptive Video Evaluator capture true multimodal reasoning in a way that matches human expert judgment without systematic bias or gaps.

What would settle it

A new set of video generation tasks requiring interactive physical reasoning that current models solve at high rates, with human raters independently scoring the outputs as coherent.

Figures

Figures reproduced from arXiv: 2604.19193 by Daoan Zhang, Dezhi YU, Jianhui Wei, Jie Tan, Songtao Jiang, Wei Xu, Xiaotian Zhang, Yan Zhang, Yichen Li, Yuan Wang, Ziyi Chen, Zuozhu Liu.

Figure 1
Figure 1. Overview of CLVG-Bench. Our framework evaluates video models via Context Learning in Video Generation across arbitrary modalities (Text, Image, Audio, Video). It systematically probes reasoning capabilities across 6 categories and 47 subcategories, providing a roadmap for general-purpose video models. view at source ↗
Figure 2
Figure 2. An environment for testing the multi-turn interactive generation capabilities of video models, where the model is expected to infer and predict a suitable video script based on the history context and user feedback. view at source ↗
Figure 3
Figure 3. The top tier represents previous evaluation methodologies: (a) fragmented evaluation, (b) human-aligned reward models, and (c) unified fine-grained evaluation. The lower tier illustrates the pipeline of the Adaptive Video Evaluator: during the optimization phase, D_train and D_val are used to explore and verify prompt candidates, aggregating instance-level confusion matrices across the validation set. A sketch of this optimization loop follows the figure list. view at source ↗
Figure 4
Figure 4. (a) Qualitative comparison of videos generated from the original instructions and their rephrased counterparts. (b) Per-turn success rate in the multi-turn interaction setting. (c) Overall case pass rate (%). view at source ↗
Figure 5
Figure 5. Case study of AVE-driven prompt refinement in the physical simulation task. This example illustrates the process of enhancing a basic initial prompt using AVE; the optimized system prompt contains generalizable task-solving experience that can be shared across different models, enabling zero-cost cross-model adaptation with no additional optimization required. view at source ↗
Figure 6
Figure 6. Cases of each subcategory in CLVG-Bench (Part 1). view at source ↗
Figure 7
Figure 7. Cases of each subcategory in CLVG-Bench (Part 2). view at source ↗
Figure 8
Figure 8. Cases of each subcategory in CLVG-Bench (Part 3). view at source ↗
Figure 9
Figure 9. Cases of each subcategory in CLVG-Bench (Part 4). view at source ↗
Figure 10
Figure 10. Cases of each subcategory in CLVG-Bench (Part 5). view at source ↗
Figure 11
Figure 11. Optimized prompt for physical simulation evaluation. view at source ↗
Figure 12
Figure 12. Optimized prompt for perception evaluation. view at source ↗
Figure 13
Figure 13. Optimized prompt for prompt following evaluation. view at source ↗
Figure 14
Figure 14. The optimizer meta prompt. view at source ↗
Figure 15
Figure 15. Prompt for open-ended text matching. view at source ↗
Figure 16
Figure 16. Video metadata dimensions. view at source ↗
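
Figure 3's caption is the only algorithmic detail the page surfaces for AVE: an optimization phase that explores evaluator-prompt candidates on a training split and verifies them on a validation split by aggregating instance-level confusion matrices. The sketch below reconstructs that loop under stated assumptions; the propose_candidates helper, the judge interface, and accuracy on D_val as the selection rule are guesses, not the released implementation.

    from typing import Callable, List, Tuple

    Example = Tuple[object, bool]  # (generated video, binary human label)

    def confusion_counts(prompt: str,
                         judge: Callable[[str, object], bool],
                         data: List[Example]) -> Tuple[int, int, int, int]:
        """Instance-level confusion matrix of an evaluator prompt vs. human labels."""
        tp = fp = fn = tn = 0
        for video, human_label in data:
            pred = judge(prompt, video)  # evaluator verdict under this system prompt
            if pred and human_label:
                tp += 1
            elif pred and not human_label:
                fp += 1
            elif not pred and human_label:
                fn += 1
            else:
                tn += 1
        return tp, fp, fn, tn

    def optimize_prompt(seed_prompt: str,
                        propose_candidates: Callable[[str, List[Example]], List[str]],
                        judge: Callable[[str, object], bool],
                        d_train: List[Example],
                        d_val: List[Example],
                        rounds: int = 3) -> str:
        """Explore candidates on D_train, verify on D_val, keep the best."""
        def val_accuracy(prompt: str) -> float:
            tp, fp, fn, tn = confusion_counts(prompt, judge, d_val)
            return (tp + tn) / max(tp + fp + fn + tn, 1)

        best = seed_prompt
        for _ in range(rounds):
            # propose_candidates is a hypothetical LLM-driven rewriting step
            candidates = [best] + propose_candidates(best, d_train)
            best = max(candidates, key=val_accuracy)
        return best

Figure 5's claim that the optimized system prompt transfers across models with no further optimization would, under this reading, amount to reusing the returned prompt with a different judge backend.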
read the original abstract

Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models' zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short with logically grounded and interactive generation tasks (achieving success rates <25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedbacks and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CLVG-Bench, a new evaluation framework with over 1,000 manually annotated metadata items across 6 categories and 47 subcategories to test video models' zero-shot multimodal reasoning via context learning in video generation tasks, covering physical simulation, logical reasoning, and interactive contexts. It proposes an Adaptive Video Evaluator (AVE) that claims to align with human expert perception using minimal annotations while providing interpretable textual feedback. Experiments on SOTA models such as Seedance 2.0 show competence on understanding and some reasoning subtasks but substantial shortfalls on logically grounded generation (<25% success) and interactive generation (~0% success), identifying multimodal reasoning and physical grounding as critical bottlenecks. The benchmark and code are released.

Significance. If the benchmark tasks are well-constructed and AVE produces reliable human-aligned labels, the work would offer a more rigorous probe of true multimodal reasoning than existing fragmented benchmarks, with actionable insights and a roadmap for video model improvement. The release of data and code supports reproducibility. However, the significance hinges on unverified aspects of the new evaluator and annotation process.

major comments (2)
  1. [Abstract, §4 (Experiments)] The headline results (SOTA models <25% success on logical generation tasks and ~0% on interactive generation) depend entirely on AVE producing accurate success/failure labels and feedback. The abstract states that AVE 'aligns with human expert perception using minimal annotations', yet the paper reports no quantitative validation (correlation coefficients, inter-rater reliability, or ablation studies on annotation volume) for the hardest categories, such as physical simulation and logical reasoning. Without this, the performance gaps cannot be confidently attributed to reasoning bottlenecks rather than evaluator artifacts.
  2. [§3 (CLVG-Bench construction)] The description of task construction, annotation guidelines, and how the 1,000+ items were selected to ensure they require 'true multimodal reasoning' (as opposed to pattern matching or superficial cues) is insufficient. This is load-bearing for the central claim, as the benchmark is newly invented and the results are benchmark-specific.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction would benefit from explicit comparison to prior video reasoning benchmarks (e.g., specific metrics or task types they fail to cover) to better motivate the new design.
  2. [§3.2] Clarify the exact definition of 'success rate' used by AVE (e.g., binary threshold on textual feedback or multi-criteria scoring) and whether it was applied uniformly across all 47 subcategories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details and validation as outlined.

read point-by-point responses
  1. Referee: [Abstract, §4 (Experiments)] The headline results (SOTA models <25% success on logical generation tasks and ~0% on interactive generation) depend entirely on AVE producing accurate success/failure labels and feedback. The abstract states that AVE 'aligns with human expert perception using minimal annotations', yet the paper reports no quantitative validation (correlation coefficients, inter-rater reliability, or ablation studies on annotation volume) for the hardest categories, such as physical simulation and logical reasoning. Without this, the performance gaps cannot be confidently attributed to reasoning bottlenecks rather than evaluator artifacts.

    Authors: We agree that the absence of quantitative validation metrics for AVE in the current manuscript is a limitation that weakens confidence in attributing the observed performance gaps specifically to multimodal reasoning shortfalls. The manuscript describes AVE's design and qualitative alignment but does not report correlation coefficients, inter-rater reliability, or annotation-volume ablations for the most challenging categories. In the revised manuscript, we will add a new subsection in §4 (and update the abstract if needed) presenting these quantitative results, including Pearson/Spearman correlations with human experts, Fleiss' kappa or similar inter-rater scores, and ablation curves showing performance stability under varying annotation budgets. These analyses were conducted during benchmark development and will be reported so that readers can assess evaluator reliability directly; a sketch of such agreement computations follows these responses. revision: yes

  2. Referee: [§3 (CLVG-Bench construction)] The description of task construction, annotation guidelines, and how the 1,000+ items were selected to ensure they require 'true multimodal reasoning' (as opposed to pattern matching or superficial cues) is insufficient. This is load-bearing for the central claim, as the benchmark is newly invented and the results are benchmark-specific.

    Authors: We concur that §3 currently provides only a high-level overview of task construction and selection, which is insufficient to fully substantiate that the 1,000+ items probe genuine multimodal reasoning rather than superficial cues. In the revision, we will substantially expand §3 to include: (i) the full annotation guidelines provided to expert annotators, (ii) concrete examples from each major category showing how prompts were engineered to require integration of visual context, physical simulation, and logical inference (with counter-examples of rejected superficial variants), and (iii) quantitative selection statistics (e.g., distribution across subcategories, rejection rates for cue-based items, and diversity metrics). This expanded description will make the benchmark's design more transparent and reproducible. revision: yes
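
The validation promised in response 1 is standard to compute. The sketch below shows, on placeholder numbers, how correlation with human scores and chance-corrected agreement could be reported using SciPy and scikit-learn; Cohen's kappa stands in for the Fleiss' kappa named above, which generalizes the same idea to more than two raters. None of these arrays are the paper's data.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr
    from sklearn.metrics import cohen_kappa_score

    # Placeholder scores: continuous quality ratings from AVE and one human
    # expert on the same five videos, plus binary verdicts from two raters.
    ave_scores = np.array([0.9, 0.2, 0.7, 0.1, 0.6])
    human_scores = np.array([0.8, 0.3, 0.6, 0.2, 0.7])
    rater_a = [1, 0, 1, 0, 1]
    rater_b = [1, 0, 1, 1, 1]

    r, _ = pearsonr(ave_scores, human_scores)      # linear agreement
    rho, _ = spearmanr(ave_scores, human_scores)   # rank agreement
    kappa = cohen_kappa_score(rater_a, rater_b)    # chance-corrected agreement

    print(f"Pearson r={r:.2f}, Spearman rho={rho:.2f}, Cohen's kappa={kappa:.2f}")

An annotation-budget ablation would then repeat these measurements while subsampling the human-labeled set, reporting how quickly the agreement curves flatten.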

Circularity Check

0 steps flagged

No circularity in evaluation framework or claims

full rationale

The paper presents an empirical evaluation introducing CLVG-Bench (new manually annotated dataset) and AVE (new evaluator) to measure video model performance on reasoning tasks. No derivation chain, equations, or fitted parameters are claimed; results are direct measurements of model success rates on the benchmark. No self-citations, self-definitions, or reductions of outputs to inputs by construction are present in the abstract or described methodology. The central finding (low performance on logical/interactive generation) is a reported observation, not a prediction forced by prior fits or renamings. This is a standard non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the assumption that the new benchmark tasks represent true multimodal reasoning and that the evaluator aligns with human judgment; these assumptions are introduced without external validation, and no prior benchmarks are cited in the abstract for comparison.

axioms (2)
  • domain assumption: CLVG-Bench tasks and subcategories comprehensively cover complex multimodal reasoning scenarios, including physical simulation, logical reasoning, and interactive contexts.
    Invoked in the design and in claims about what the benchmark measures.
  • domain assumption: The Adaptive Video Evaluator produces assessments that align with human expert perception using minimal annotations.
    Central to the claim of rigorous and scalable assessment.
invented entities (2)
  • CLVG-Bench (no independent evidence)
    purpose: Evaluation framework for probing zero-shot multimodal reasoning in video models via context learning in generation.
    Newly proposed benchmark with manual annotations across 6 categories and 47 subcategories.
  • Adaptive Video Evaluator (AVE) (no independent evidence)
    purpose: Automated evaluator that aligns with human perception and delivers interpretable textual feedback.
    Newly proposed method to enable scalable assessment of the benchmark.

pith-pipeline@v0.9.0 · 5581 in / 1510 out tokens · 39463 ms · 2026-05-10T02:41:12.991059+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 52 canonical work pages · 20 internal anchors

  1. [1] Agrawal, L.A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M.J., Jiang, M., Potts, C., Sen, K., Dimakis, A.G., Stoica, I., Klein, D., Zaharia, M., Khattab, O.: GEPA: Reflective prompt evolution can outperform reinforcement learning (2025), https://arxiv.org/abs/2507.19457

  2. [2] Andrew, J.D.: Concepts in film theory. Oxford University Press (1984)

  3. [3] Bansal, H., Lin, Z., Xie, T., Zong, Z., Yarom, M., Bitton, Y., Jiang, C., Sun, Y., Chang, K.W., Grover, A.: VideoPhy: Evaluating physical commonsense for video generation (2024), https://arxiv.org/abs/2406.03520

  4. [4] Bansal, H., Peng, C., Bitton, Y., Goldenberg, R., Grover, A., Chang, K.W.: VideoPhy-2: A challenging action-centric physical commonsense evaluation in video generation (2025), https://arxiv.org/abs/2503.06800

  5. [5] Bordwell, D., Thompson, K., Smith, J.: Film art: An introduction, vol. 7. McGraw-Hill, New York (2008)

  6. [6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)

  7. [7] Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2015)

  8. [8] Cai, Y., Cai, S., Shi, Y., Xu, Z., Chen, L., Qin, Y., Tan, X., Li, G., Li, Z., Lin, H., Mao, Y., Li, K., Sun, X.: Training-free group relative policy optimization (2025), https://arxiv.org/abs/2510.08191

  9. [9] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers (2021), https://arxiv.org/abs/2104.14294

  10. [10] Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)

  11. [11] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  12. [12] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2758–2766 (2015)

  13. [13] Eisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D'Amour, A., Dvijotham, D., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., Shaw, P., Berant, J.: Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking (2024), https://arxiv.org/abs/2312.09244

  14. [14] Fang, Y., Zhu, L., Lu, Y., Wang, Y., Molchanov, P., Kautz, J., Cho, J., Pavone, M., Han, S., Yin, H.: VILA²: VILA augmented VILA. arXiv preprint arXiv:2407.17453 (2024)

  15. [15] Google: Gemini 3 Pro model card. Tech. rep., Google DeepMind (2025), https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, model release: November 2025

  16. [16] Google: Veo 3 launch. https://cloud.google.com/blog/products/ai-machine-learning/veo-3-fast-available-for-everyone-on-vertex-ai (2025), accessed: March 3, 2026

  17. [17] Guo, M.H., Chu, X., Yang, Q., Mo, Z.H., Shen, Y., Li, P.L., Lin, X., Zhang, J., Chen, X.S., Zhang, Y., et al.: RBench-V: A primary assessment for visual reasoning models with multi-modal outputs. arXiv preprint arXiv:2505.16770 (2025)

  18. [19] HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifschitz, D., Moshe, D., Porat, E., et al.: LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026)

  19. [20] Han, H., Li, S., Chen, J., Yuan, Y., Wu, Y., Leong, C.T., Du, H., Fu, J., Li, Y., Zhang, J., Zhang, C., Li, L.J., Ni, Y.: Video-Bench: Human-aligned video generation benchmark (2025), https://arxiv.org/abs/2504.04907

  20. [21] He, H., Wang, J., Zhang, J., Xue, Z., Bu, X., Yang, Q., Wen, S., Xie, L.: OpenVE-3M: A large-scale high-quality dataset for instruction-guided video editing. arXiv preprint arXiv:2512.07826 (2025)

  21. [22] He, X., Jiang, D., Nie, P., Liu, M., Jiang, Z., Su, M., Ma, W., Lin, J., Ye, C., Lu, Y., Wu, K., Schneider, B., Do, Q.D., Li, Z., Jia, Y., Zhang, Y., Cheng, G., Wang, H., Zhou, W., Lin, Q., Zhang, Y., Zhang, G., Huang, W., Chen, W.: VideoScore2: Think before you score in generative video evaluation (2025), https://arxiv.org/abs/2509.22799

  22. [23] He, X., Jiang, D., Zhang, G., Ku, M., Soni, A., Siu, S., Chen, H., Chandra, A., Jiang, Z., Arulraj, A., Wang, K., Do, Q.D., Ni, Y., Lyu, B., Narsupalli, Y., Fan, R., Lyu, Z., Lin, Y., Chen, W.: VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation (2024), https://arxiv.org/abs/2406.15252

  23. [24] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding (2021), https://arxiv.org/abs/2009.03300

  24. [25] Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: A reference-free evaluation metric for image captioning (2022), https://arxiv.org/abs/2104.08718

  25. [26] Hu, H., Chan, K.C., Su, Y.C., Chen, W., Li, Y., Sohn, K., Zhao, Y., Ben, X., Gong, B., Cohen, W., et al.: Instruct-Imagen: Image generation with multi-modal instruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4754–4763 (2024)

  26. [27] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  27. [28] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  28. [29] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  29. [30] Ji, P., Xiao, C., Tai, H., Huo, M.: T2VBench: Benchmarking temporal dynamics for text-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5325–5335 (2024)

  30. [31] Jiang, S., Wang, Y., Song, S., Hu, T., Zhou, C., Pu, B., Zhang, Y., Yang, Z., Feng, Y., Zhou, J.T., et al.: Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding (2025), https://arxiv.org/abs/2510.08668

  31. [32] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: VACE: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

  32. [33] Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: VACE: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

  33. [34] Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2901–2910 (2017)

  34. [35] Ju, X., Ye, W., Liu, Q., Wang, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Xu, Q.: FullDiT: Multi-task video generative foundation model with full attention. arXiv preprint arXiv:2503.19907 (2025)

  35. [36] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35, 22199–22213 (2022)

  36. [38] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  37. [39] Ku, M., Wei, C., Ren, W., Yang, H., Chen, W.: AnyV2V: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468 (2024)

  38. [40] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

  39. [41] Li, M., Xie, C., Wu, Y., Zhang, L., Wang, M.: FiVE-Bench: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16672–16681 (2025)

  40. [42] Liao, C., Liu, L., Wang, X., Luo, Z., Zhang, X., Zhao, W., Wu, J., Li, L., Tian, Z., Huang, W.: Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472 (2025)

  41. [44] Liu, S.: Zero-shot voice conversion with diffusion transformers. arXiv preprint arXiv:2411.09943 (2024)

  42. [45] Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: EvalCrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440 (2023)

  43. [46] Liu, Y., Li, L., Ren, S., Gao, R., Li, S., Chen, S., Sun, X., Hou, L.: FETV: A benchmark for fine-grained evaluation of open-domain text-to-video generation (2023), https://arxiv.org/abs/2311.01813

  44. [48] Meng, F., Liao, J., Tan, X., Shao, W., Lu, Q., Zhang, K., Cheng, Y., Li, D., Qiao, Y., Luo, P.: Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363 (2024)

  45. [49] OpenAI: Sora: A video generation model. https://openai.com/zh-Hans-CN/index/sora-2/ (2025), accessed: March 4, 2026

  46. [50] PixVerse: PixVerse: AI-powered image and video editing platform. https://app.pixverse.ai/ (2023), accessed: March 3, 2026

  47. [51] Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.Y., Chuang, C.Y., et al.: Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720 (2024)

  48. [52] Art of Problem Solving (AoPS): AIME problems and solutions. AoPS Wiki, https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, accessed: 2026-03-01

  49. [53] Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: GPQA: A graduate-level Google-proof Q&A benchmark (2023), https://arxiv.org/abs/2311.12022

  50. [55] Ren, Z., Wei, Y., Yu, X., Luo, G., Zhao, Y., Kang, B., Feng, J., Jin, X.: VideoWorld 2: Learning transferable knowledge from real-world videos. arXiv preprint arXiv:2602.10102 (2026)

  51. [56] Seed, B.: Seed2.0 model card: Towards intelligence frontier for real-world complexity. Tech. rep. (2026)

  52. [57] Seedance, T., Chen, H., Chen, S., Chen, X., Chen, Y., Chen, Y., Chen, Z., Cheng, F., Cheng, T., Cheng, X., et al.: Seedance 1.5 Pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507 (2025)

  53. [58] Sun, K., Huang, K., Liu, X., Wu, Y., Xu, Z., Li, Z., Liu, X.: T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8406–8416 (2025)

  54. [59] Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Emu: Generative pretraining in multimodality. In: The Twelfth International Conference on Learning Representations (2024)

  55. [60] Sun, S., Liang, X., Fan, S., Gao, W., Gao, W.: VE-Bench: Subjective-aligned benchmark suite for text-driven video editing quality assessment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 7105–7113 (2025)

  56. [61] Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)

  57. [62] Team, K., Chen, J., Ci, Y., Du, X., Feng, Z., Gai, K., Guo, S., Han, F., He, J., He, K., et al.: Kling-Omni technical report. arXiv preprint arXiv:2512.16776 (2025)

  58. [63] Team, T.H.F.M.: HunyuanVideo 1.5 technical report (2025), https://arxiv.org/abs/2511.18870

  59. [64] Vidu: Vidu: AI-powered video generation platform. https://www.vidu.cn/ (2024), accessed: 2026-03-03

  60. [65] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  61. [66] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., Zhang, S.: ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023)

  62. [68] Wang, M., Wang, R., Lin, J., Ji, R., Wiedemer, T., Gao, Q., Luo, D., Qian, Y., Huang, L., Hong, Z., et al.: A very big video reasoning suite. arXiv preprint arXiv:2602.20159 (2026)

  63. [69] Wang, S., Pei, M., Sun, L., Deng, C., Li, Y., Shao, K., Tian, Z., Zhang, H., Wang, J.: SpatialViz-Bench: A cognitively-grounded benchmark for diagnosing spatial visualization in MLLMs. In: The Fourteenth International Conference on Learning Representations (2025)

  64. [70] Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  65. [71] Wang, Y., Liu, J., Gao, S., Feng, B., Tang, Z., Gai, X., Wu, J., Liu, Z.: V2T-CoT: From vision to text chain-of-thought for medical reasoning and diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 658–668. Springer (2025)

  66. [73] Wei, C., Liu, Q., Ye, Z., Wang, Q., Wang, X., Wan, P., Gai, K., Chen, W.: UniVideo: Unified understanding, generation, and editing for videos. arXiv preprint arXiv:2510.08377 (2025)

  67. [74] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)

  68. [76] Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)

  69. [77] Wu, J.Z., Fang, G., Fu, D.J., Kanakagiri, V.A.R., Iandola, F., Keutzer, K., Hsu, W., Dong, Z., Shou, M.Z.: VEditBench: Holistic benchmark for text-guided video editing (2026), https://arxiv.org/abs/2602.21835

  70. [78] Wu, J., Zhang, X., Yuan, H., Zhang, X., Huang, T., He, C., Deng, C., Zhang, R., Wu, Y., Long, M.: Visual generation unlocks human-like reasoning through multimodal world models. arXiv preprint arXiv:2601.19834 (2026)

  71. [79] Wu, R., Chen, L., Yang, T., Guo, C., Li, C., Zhang, X.: LAMP: Learn a motion pattern for few-shot video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7089–7098 (2024)

  72. [80] Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. In: The Thirteenth International Conference on Learning Representations (2025)

  73. [81] Xu, J., Huang, Y., Cheng, J., Yang, Y., Xu, J., Wang, Y., Duan, W., Yang, S., Jin, Q., Li, S., Teng, J., Yang, Z., Zheng, W., Liu, X., Ding, M., Zhang, X., Gu, X., Huang, S., Huang, M., Tang, J., Dong, Y.: VisionReward: Fine-grained multi-dimensional human preference learning for image and video generation (2024), https://arxiv.org/abs/2412.21059

  74. [82] Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: A large video description dataset for bridging video and language. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5288–5296 (2016)

  75. [83] Yang, Y., Fan, K., Sun, S., Li, H., Zeng, A., Han, F., Zhai, W., Liu, W., Cao, Y., Zha, Z.J.: VideoGen-Eval: Agent-based system for video generation evaluation (2025), https://arxiv.org/abs/2503.23452

  76. [84] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., Salakhutdinov, R., Manning, C.D.: HotpotQA: A dataset for diverse, explainable multi-hop question answering (2018), https://arxiv.org/abs/1809.09600

  77. [85] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  78. [88] Ye, Z., He, X., Liu, Q., Wang, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Chen, Q., Luo, W.: UNIC: Unified in-context video editing. arXiv preprint arXiv:2506.04216 (2025)

  79. [89] Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., Zou, J.: TextGrad: Automatic "differentiation" via text (2024), https://arxiv.org/abs/2406.07496

  80. [90] Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

Showing first 80 references.