arxiv: 2606.02522 · v1 · pith:4EF5OEKPnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Pith reviewed 2026-06-28 14:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video MLLMs · temporal fidelity · momentary visual events · benchmark · video question answering · multimodal models · temporal reasoning

The pith

Current video MLLMs fail to capture brief but decisive visual events that determine many practical answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Moment-Video, a benchmark of 1000 video-QA pairs that tests whether models can notice, count, describe or reason about short-lived visual events lasting only a few frames. It claims these events are often skipped by sparse sampling or lost in token compression, and that language reasoning cannot recover them reliably. Evaluations across 33 models show the best result at 39.6 percent accuracy and most open-source models below 25 percent. A sympathetic reader would care because many real questions hinge on such transient evidence rather than persistent objects or overall scene context.

Core claim

Moment-Video demonstrates that video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence, with the strongest model reaching only 39.6 percent overall accuracy across tasks that require attention to localized, sampling-sensitive events.

What carries the argument

The Moment-Video benchmark, which grounds each of its 1000 questions in a localized, visually observable, and sampling-sensitive event across four task types.

If this is right

  • Denser frame sampling raises accuracy for some models but does not close the performance gap.
  • Longer videos increase the difficulty of temporal localization.
  • Proprietary models outperform open-source ones but none reach reliable understanding of momentary events.
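
The video-length implication can be checked with a length-stratified accuracy table. The following is a minimal sketch, assuming per-question records of the form `(duration_seconds, is_correct)`; the record layout, the bin edges, and the function name `accuracy_by_duration` are illustrative assumptions, not part of the paper's released tooling.

```python
from collections import defaultdict


def accuracy_by_duration(records, bin_edges):
    """Group (duration_seconds, is_correct) records into duration bins.

    bin_edges is an ascending list of inclusive upper bounds in seconds;
    anything longer than the last edge falls into an open-ended final bin.
    Both the record layout and the bin edges are hypothetical.
    """
    totals = defaultdict(lambda: [0, 0])  # label -> [num_correct, num_total]
    for duration, correct in records:
        label = next(
            (f"<= {edge}s" for edge in bin_edges if duration <= edge),
            f"> {bin_edges[-1]}s",
        )
        totals[label][0] += int(bool(correct))
        totals[label][1] += 1
    return {label: hits / n for label, (hits, n) in totals.items()}


if __name__ == "__main__":
    # Toy records, not benchmark data.
    toy = [(10, True), (20, False), (90, False), (200, True), (400, False)]
    print(accuracy_by_duration(toy, [30, 120]))
```

If accuracy falls steadily from the shortest bin to the longest, that supports the length claim more directly than a single aggregate number would.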

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model designs may need new methods for preserving short-duration signals beyond current sampling or compression approaches.
  • The benchmark could serve as a diagnostic tool for comparing temporal handling across different video architectures.

Load-bearing premise

Each question truly requires attention to a transient visual event that cannot be answered from persistent objects, global context, or language priors.

What would settle it

A model that scores near ceiling on the benchmark while using only sparse frames sampled away from the key events would indicate the questions do not require momentary visual evidence.
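
This falsification test can be sketched as a sampler that deliberately avoids the annotated event windows. The sketch assumes each question comes with event spans as `(start_frame, end_frame)` pairs, end exclusive; whether the benchmark releases such annotations is not stated, and `frames_avoiding_events` and `margin` are hypothetical names.

```python
def frames_avoiding_events(total_frames, num_samples, events, margin=0):
    """Pick evenly spread frame indices that never touch an event window.

    events: list of (start_frame, end_frame) pairs, end exclusive (assumed format).
    margin: extra frames blocked on each side of every event (assumed knob).
    """
    blocked = set()
    for start, end in events:
        blocked.update(range(max(0, start - margin), min(total_frames, end + margin)))
    allowed = [f for f in range(total_frames) if f not in blocked]
    if not allowed or num_samples <= 0:
        return []
    num_samples = min(num_samples, len(allowed))
    step = len(allowed) / num_samples
    # Take the midpoint of each equal-sized chunk of the allowed frames.
    return [allowed[int(step * i + step / 2)] for i in range(num_samples)]


if __name__ == "__main__":
    # Toy case: a 300-frame clip with one 6-frame event and a 10-frame margin.
    print(frames_avoiding_events(300, 8, [(100, 106)], margin=10))
```

A model that stays near ceiling when fed only these frames would be answering without the momentary evidence.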

Original abstract

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
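
The sparse-sampling failure mode described in the abstract can be made concrete with a small calculation. The following is a rough sketch, assuming midpoint-uniform frame sampling and a single event of fixed length placed at every possible start position; the sampling scheme and the function names are illustrative assumptions, not the paper's evaluation code.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Evenly spaced indices: the midpoint of each of num_samples equal segments."""
    if num_samples <= 0 or total_frames <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def hit_rate(total_frames, num_samples, event_len):
    """Fraction of possible event positions that contain at least one sampled frame."""
    if not 0 < event_len <= total_frames:
        raise ValueError("event_len must be between 1 and total_frames")
    sampled = set(uniform_frame_indices(total_frames, num_samples))
    starts = range(total_frames - event_len + 1)
    hits = sum(
        any((start + offset) in sampled for offset in range(event_len))
        for start in starts
    )
    return hits / len(starts)


if __name__ == "__main__":
    # Toy numbers: a 60 s clip at 30 fps (1800 frames) with a 5-frame event.
    for budget in (16, 64, 256, 1800):
        print(budget, round(hit_rate(1800, budget, 5), 4))
```

Under these toy numbers, a 16-frame budget catches the event in roughly 4.5% of positions, which illustrates why an answer can be lost before any language-side reasoning starts.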

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Moment-Video, a benchmark of 1,000 human-verified video-QA pairs across 7 domains and 25 subcategories, designed to diagnose video MLLMs' handling of momentary visual events (localized actions or state changes lasting only a few frames) via four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Each pair is claimed to require models to notice transient evidence rather than rely on persistent objects, global context, or language priors. Evaluation of 33 models shows Seed-2.0-Pro at 39.6% overall accuracy (most open-source models <25%), with diagnostics indicating denser sampling helps but does not close the gap and longer videos increase localization challenges; the central conclusion is that current video MLLMs lack temporally faithful representations.

Significance. If the sampling-sensitivity and non-recoverability claims hold, the benchmark provides a targeted diagnostic for a previously underexplored failure mode in video MLLMs, with potential to drive improvements in frame sampling, visual token compression, and temporal aggregation. The scale, human verification, and multi-task coverage strengthen its utility as an evaluation tool beyond existing long-form video benchmarks.

major comments (3)
  1. [Section 3] Benchmark construction (Section 3): the claim that all 1,000 pairs are verifiably sampling-sensitive and non-recoverable from language priors or static frames rests on human verification, but the manuscript provides no details on verification protocol, inter-annotator agreement, or concrete tests (e.g., model performance on single-frame or text-only ablations) used to confirm this property for each subcategory.
  2. [Section 4] Evaluation setup (Section 4): the reported accuracies and diagnostic findings on frame sampling density lack specification of the exact sampling rates, token budgets, and prompting templates applied uniformly across the 33 models, making it impossible to isolate whether failures stem from temporal fidelity or from implementation choices.
  3. [Section 5] Error analysis (Section 5): the diagnostic claim that longer videos introduce stronger temporal-localization challenges is not supported by per-video-length breakdowns or statistical tests; aggregate accuracy alone does not establish this as load-bearing for the central temporal-fidelity conclusion.
minor comments (2)
  1. [Table 1] Table 1 or equivalent: clarify the exact distribution of the 1,000 pairs across the 25 subcategories and four task types to allow readers to assess balance.
  2. [Figure 2] Figure 2 or equivalent: the visualization of model performance gaps would benefit from error bars or per-task breakdowns to improve interpretability.
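
The error bars requested above can be approximated with a Wilson score interval on each reported accuracy. This is a minimal sketch, assuming the headline 39.6% is a plain fraction over all 1,000 questions (396 correct); if the paper averages across task types instead, the counts would differ.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Assumed count: 39.6% of 1,000 questions read as 396 correct answers.
    low, high = wilson_interval(396, 1000)
    print(f"{low:.3f} to {high:.3f}")
```

With 1,000 items the interval spans about three percentage points on either side, so gaps of a point or two between models should be read cautiously.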

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of the benchmark and evaluation. We address each major comment below and will revise the manuscript to incorporate the suggested details.

Point-by-point responses
  1. Referee: [Section 3] Benchmark construction (Section 3): the claim that all 1,000 pairs are verifiably sampling-sensitive and non-recoverable from language priors or static frames rests on human verification, but the manuscript provides no details on verification protocol, inter-annotator agreement, or concrete tests (e.g., model performance on single-frame or text-only ablations) used to confirm this property for each subcategory.

    Authors: We agree that the current manuscript lacks sufficient detail on the verification process. In the revised version, we will add a new subsection in Section 3 that describes the full human verification protocol, including annotator instructions, number of annotators per item, inter-annotator agreement (Cohen's kappa), and the decision criteria for sampling sensitivity. We will also report single-frame and text-only ablation results across subcategories to empirically confirm that each QA pair requires transient visual evidence.
    revision: yes

  2. Referee: [Section 4] Evaluation setup (Section 4): the reported accuracies and diagnostic findings on frame sampling density lack specification of the exact sampling rates, token budgets, and prompting templates applied uniformly across the 33 models, making it impossible to isolate whether failures stem from temporal fidelity or from implementation choices.

    Authors: We concur that precise implementation details are required for reproducibility and to isolate temporal effects. The revision will specify the exact sampling rates (e.g., uniform sampling at fixed FPS values), per-model visual token budgets, and the standardized prompting templates used for all 33 models. These additions will be placed in Section 4 and the appendix.
    revision: yes

  3. Referee: [Section 5] Error analysis (Section 5): the diagnostic claim that longer videos introduce stronger temporal-localization challenges is not supported by per-video-length breakdowns or statistical tests; aggregate accuracy alone does not establish this as load-bearing for the central temporal-fidelity conclusion.

    Authors: We acknowledge that the manuscript currently presents only aggregate observations without length-stratified breakdowns or statistical support. In the revision, we will add per-video-length accuracy tables (binned by duration), correlation coefficients between video length and accuracy, and appropriate statistical tests to substantiate the claim that longer videos increase localization difficulty.
    revision: yes
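
The inter-annotator agreement statistic named in the first response, Cohen's kappa, can be computed as follows. This is a minimal sketch for two annotators, assuming each assigns one categorical label per item; the "keep"/"drop" labels are hypothetical and do not come from the paper's protocol.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal rate per label.
    expected = sum(
        freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels: does the question depend on a transient event?
    annotator_1 = ["keep", "keep", "drop", "drop"]
    annotator_2 = ["keep", "drop", "drop", "drop"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most candidate questions receive the same label.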

Circularity Check

0 steps flagged

No significant circularity

The paper introduces a benchmark dataset (Moment-Video) consisting of 1,000 human-verified video-QA pairs and performs direct empirical evaluation of 33 external MLLMs on it. No mathematical derivations, parameter fittings, or predictive claims are present; the central claim that models lack temporally faithful representations follows from observed low accuracies (e.g., best model at 39.6%) rather than any self-referential construction. All load-bearing elements (question grounding, sampling sensitivity, human verification) are externally validated and independent of the reported results. No self-citation chains or ansatzes reduce the findings to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the constructed questions isolate momentary events without language or global context shortcuts; no free parameters or invented entities are introduced.

axioms (1)
  • [domain assumption] Human verification ensures questions require transient evidence rather than persistent objects or language priors.
    The abstract states that the QA pairs are human-verified and that each question is grounded in a localized, sampling-sensitive event.

Open-ended LLM-as-Judge prompt (paper Figure 10, as extracted)

  • Judge meaning, not wording.
  • Accept paraphrases, synonyms, abbreviations, and equivalent naming variants.
  • Mark as consistent if the model answer fully covers the reference answer’s meaning, even with extra harmless details.
  • Mark as inconsistent if it misses a key point, contradicts the reference answer, or changes important facts, including entity, action, order, quantity, identity, or existence.
  • Mark as inconsistent if the model answer is too vague to support the same meaning.
  • If uncertain, choose consistent only when a reasonable reader would conclude full semantic coverage.

Return JSON only, with no markdown and no extra text: {"is consistent": true/false, "reason": "one-sentence explanation"}
Reference answer: {reference answer}
Model answer: {model answer}

Figure 10: Open-ended LLM-as-Judge prompt used to evaluate semantic co...
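
The judge prompt above asks for a JSON-only reply, so scoring reduces to parsing one boolean per answer. The following is a minimal sketch, assuming the reply is a single JSON object; because the extracted key reads "is consistent" and the original may have been "is_consistent", the parser accepts both spellings. The function names are hypothetical.

```python
import json

# Key spellings are an assumption: the extracted text shows "is consistent",
# which may be "is_consistent" in the paper's actual prompt.
VERDICT_KEYS = ("is consistent", "is_consistent")


def parse_judge_verdict(raw):
    """Return (verdict, reason) from a judge reply that is a JSON object."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    for key in VERDICT_KEYS:
        if key in data:
            verdict = data[key]
            break
    else:
        raise KeyError("judge reply has no consistency field")
    if not isinstance(verdict, bool):
        raise ValueError("consistency field must be true or false")
    return verdict, str(data.get("reason", ""))


def judged_accuracy(raw_replies):
    """Share of replies judged consistent with their reference answers."""
    verdicts = [parse_judge_verdict(raw)[0] for raw in raw_replies]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


if __name__ == "__main__":
    replies = [
        '{"is consistent": true, "reason": "same event"}',
        '{"is_consistent": false, "reason": "wrong count"}',
    ]
    print(judged_accuracy(replies))
```

Rejecting non-boolean verdicts keeps a malformed judge reply from being silently counted as correct.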