Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

arxiv: 2606.02522 · v1 · pith:4EF5OEKPnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-28 14:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video MLLMs · temporal fidelity · momentary visual events · benchmark · video question answering · multimodal models · temporal reasoning

The pith

Current video MLLMs fail to capture brief but decisive visual events that determine many practical answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Moment-Video, a benchmark of 1,000 video-QA pairs that tests whether models can notice, count, describe, or reason about short-lived visual events lasting only a few frames. It claims these events are often skipped by sparse frame sampling or lost in visual-token compression, and that language-side reasoning cannot reliably recover them. Evaluations across 33 models put the best result at 39.6 percent accuracy and most open-source models below 25 percent. A sympathetic reader would care because many real questions hinge on such transient evidence rather than on persistent objects or overall scene context.

Core claim

Moment-Video demonstrates that video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence, with the strongest model reaching only 39.6 percent overall accuracy across tasks that require attention to localized, sampling-sensitive events.

What carries the argument

The Moment-Video benchmark, which grounds each of its 1,000 questions in a localized, visually observable, and sampling-sensitive event across four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning.

If this is right

Denser frame sampling raises accuracy for some models but does not close the performance gap.
Longer videos increase the difficulty of temporal localization.
Proprietary models outperform open-source ones but none reach reliable understanding of momentary events.
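
The sampling effect behind the first point can be illustrated with a small sketch. It is not the paper's evaluation code: the function names, the midpoint-based uniform sampler, and the numbers (a 900-frame clip with an event on frames 452 to 456) are all assumptions chosen for illustration.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices (segment midpoints). Hypothetical sampler."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def sampled_event_frames(indices: list[int], event_start: int, event_end: int) -> list[int]:
    """Return the sampled indices that fall inside the inclusive event window."""
    return [i for i in indices if event_start <= i <= event_end]


if __name__ == "__main__":
    # Illustrative numbers only: 30 s at 30 fps, event lasting 5 frames.
    total, start, end = 900, 452, 456
    for budget in (8, 64, 256):
        hits = sampled_event_frames(uniform_frame_indices(total, budget), start, end)
        print(budget, "frames sampled ->", hits if hits else "event missed")
```

With these made-up numbers, budgets of 8 and 64 frames miss the event entirely and 256 frames catch a single frame of it. That is consistent with the paper's observation that denser sampling helps only partly, although the paper's own diagnostics may use different samplers.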

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future model designs may need new methods for preserving short-duration signals beyond current sampling or compression approaches.
The benchmark could serve as a diagnostic tool for comparing temporal handling across different video architectures.

Load-bearing premise

Each question truly requires attention to a transient visual event that cannot be answered from persistent objects, global context, or language priors.

What would settle it

A model that scores near ceiling on the benchmark while using only sparse frames sampled away from the key events would indicate the questions do not require momentary visual evidence.
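
One way to set up such a control is to sample frames only from outside the annotated event window. The sketch below assumes per-question event boundaries are available as frame indices, which the summary above does not confirm; `sample_outside_event` is a hypothetical helper, not part of any released toolkit.

```python
def sample_outside_event(
    total_frames: int, num_samples: int, event_start: int, event_end: int
) -> list[int]:
    """Evenly sample frames while excluding the inclusive event window.

    Hypothetical control condition: if a model still answers correctly from
    these frames, the question probably does not need the momentary evidence.
    """
    candidates = [i for i in range(total_frames) if not (event_start <= i <= event_end)]
    if not candidates:
        raise ValueError("the event window covers the whole video")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    num_samples = min(num_samples, len(candidates))
    step = len(candidates) / num_samples
    # Take the midpoint of each equal-sized slice of the remaining frames.
    return [candidates[int(step * k + step / 2)] for k in range(num_samples)]


if __name__ == "__main__":
    # Illustrative numbers only: 100 frames, event on frames 40..49.
    print(sample_outside_event(100, 10, 40, 49))
```

Comparing accuracy on these event-free frames against accuracy on normally sampled frames would give a direct check of the load-bearing premise.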

Original abstract

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Moment-Video gives a clear empirical signal on video MLLMs missing brief events, backed by broad model testing but resting on the strength of its human verification.

The main thing here is that the authors built Moment-Video, a set of 1,000 human-verified video-QA pairs that target short, localized events models often skip. They split it into four task types and 25 subcategories across seven domains, then ran 33 models. The best result is 39.6% from Seed-2.0-Pro; most open-source models sit below 25%. Denser sampling narrows the gap for some models but does not close it, and longer videos make the problem worse.

What the paper does well is isolate a practical failure mode that general video benchmarks tend to blur. The focus on sampling-sensitive evidence, plus the scale of the evaluation, makes the performance numbers useful as a diagnostic. The task breakdown into occurrence, counting, description, and reasoning also gives readers concrete categories to think about.

The soft spot is the verification step. The claim that every pair requires the transient visual evidence and cannot be recovered from language priors or static frames depends on how carefully the humans checked that. The paper states they did this verification, but without more on the exact criteria or sample questions it is hard to judge how strict it was. That is the main place where additional detail would strengthen the work.

This is for researchers working on video MLLMs or temporal modeling who want a targeted test set. A reader running their own models on video understanding tasks would find the numbers and the sampling analysis directly relevant.

I would send it to peer review. The evaluation covers enough models and the gap is consistent enough that referees should see the full paper.

Referee Report

3 major / 2 minor

Summary. The paper introduces Moment-Video, a benchmark of 1,000 human-verified video-QA pairs across 7 domains and 25 subcategories, designed to diagnose video MLLMs' handling of momentary visual events (localized actions or state changes lasting only a few frames) via four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. Each pair is claimed to require models to notice transient evidence rather than rely on persistent objects, global context, or language priors. Evaluation of 33 models shows Seed-2.0-Pro at 39.6% overall accuracy (most open-source models <25%), with diagnostics indicating denser sampling helps but does not close the gap and longer videos increase localization challenges; the central conclusion is that current video MLLMs lack temporally faithful representations.

Significance. If the sampling-sensitivity and non-recoverability claims hold, the benchmark provides a targeted diagnostic for a previously underexplored failure mode in video MLLMs, with potential to drive improvements in frame sampling, visual token compression, and temporal aggregation. The scale, human verification, and multi-task coverage strengthen its utility as an evaluation tool beyond existing long-form video benchmarks.

major comments (3)

Benchmark construction (Section 3): the claim that all 1,000 pairs are verifiably sampling-sensitive and non-recoverable from language priors or static frames rests on human verification, but the manuscript provides no details on the verification protocol, inter-annotator agreement, or concrete tests (e.g., model performance on single-frame or text-only ablations) used to confirm this property for each subcategory.
Evaluation setup (Section 4): the reported accuracies and the diagnostic findings on frame-sampling density do not specify the exact sampling rates, token budgets, and prompting templates applied across the 33 models, making it impossible to tell whether failures stem from temporal fidelity or from implementation choices.
Error analysis (Section 5): the diagnostic claim that longer videos introduce stronger temporal-localization challenges is not supported by per-video-length breakdowns or statistical tests; aggregate accuracy alone does not establish this as load-bearing for the central temporal-fidelity conclusion.
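
The per-video-length breakdown requested in the last comment could look like the following sketch. The record format, the bin edges, and the sample data are assumptions made for illustration; none of them come from the paper.

```python
from bisect import bisect_left


def accuracy_by_duration(records, upper_edges):
    """Group (duration_seconds, is_correct) records into duration bins.

    upper_edges must be ascending; a duration d lands in the first bin whose
    upper edge is >= d, and anything above the last edge goes in a final bin.
    Returns one (count, accuracy) pair per bin; accuracy is None for empty bins.
    """
    counts = [[0, 0] for _ in range(len(upper_edges) + 1)]
    for duration, correct in records:
        b = bisect_left(upper_edges, duration)
        counts[b][0] += 1
        counts[b][1] += int(bool(correct))
    return [(n, hits / n if n else None) for n, hits in counts]


if __name__ == "__main__":
    # Made-up records, not results from the paper.
    demo = [(20, True), (45, False), (60, True), (200, False), (400, False)]
    for (n, acc), label in zip(accuracy_by_duration(demo, [60, 300]),
                               ["<=60s", "60-300s", ">300s"]):
        print(label, n, acc)
```

A table of this shape, reported per model and accompanied by a significance test, would let readers check whether accuracy really falls as videos get longer.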

minor comments (2)

Table 1 or equivalent: clarify the exact distribution of the 1,000 pairs across the 25 subcategories and four task types so that readers can assess balance.
Figure 2 or equivalent: the visualization of model performance gaps would benefit from error bars or per-task breakdowns to improve interpretability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of the benchmark and evaluation. We address each major comment below and will revise the manuscript to incorporate the suggested details.

Referee: Benchmark construction (Section 3): the claim that all 1,000 pairs are verifiably sampling-sensitive and non-recoverable from language priors or static frames rests on human verification, but the manuscript provides no details on verification protocol, inter-annotator agreement, or concrete tests (e.g., model performance on single-frame or text-only ablations) used to confirm this property for each subcategory.

Authors: We agree that the current manuscript lacks sufficient detail on the verification process. In the revised version, we will add a new subsection in Section 3 that describes the full human verification protocol, including annotator instructions, number of annotators per item, inter-annotator agreement (Cohen's kappa), and the decision criteria for sampling sensitivity. We will also report single-frame and text-only ablation results across subcategories to empirically confirm that each QA pair requires transient visual evidence.
revision: yes

Referee: Evaluation setup (Section 4): the reported accuracies and diagnostic findings on frame sampling density lack specification of the exact sampling rates, token budgets, and prompting templates applied uniformly across the 33 models, making it impossible to isolate whether failures stem from temporal fidelity or from implementation choices.

Authors: We concur that precise implementation details are required for reproducibility and to isolate temporal effects. The revision will specify the exact sampling rates (e.g., uniform sampling at fixed FPS values), per-model visual token budgets, and the standardized prompting templates used for all 33 models. These additions will be placed in Section 4 and the appendix.
revision: yes

Referee: Error analysis (Section 5): the diagnostic claim that longer videos introduce stronger temporal-localization challenges is not supported by per-video-length breakdowns or statistical tests; aggregate accuracy alone does not establish this as load-bearing for the central temporal-fidelity conclusion.

Authors: We acknowledge that the manuscript currently presents only aggregate observations without length-stratified breakdowns or statistical support. In the revision, we will add per-video-length accuracy tables (binned by duration), correlation coefficients between video length and accuracy, and appropriate statistical tests to substantiate the claim that longer videos increase localization difficulty.
revision: yes
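
The inter-annotator agreement statistic named in the first response, Cohen's kappa, can be computed for two annotators as in the sketch below. The label values and the example data are hypothetical; the page does not describe the annotation scheme actually used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label rates.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts on whether four QA pairs are sampling-sensitive.
    annotator_1 = ["keep", "keep", "drop", "keep"]
    annotator_2 = ["keep", "drop", "drop", "keep"]
    print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Kappa corrects raw agreement for chance, which matters here because most candidate pairs are presumably kept and raw agreement alone would look inflated.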

Circularity Check

0 steps flagged

No significant circularity

The paper introduces a benchmark dataset (Moment-Video) consisting of 1,000 human-verified video-QA pairs and performs direct empirical evaluation of 33 external MLLMs on it. No mathematical derivations, parameter fittings, or predictive claims are present; the central claim that models lack temporally faithful representations follows from observed low accuracies (e.g., best model at 39.6%) rather than any self-referential construction. All load-bearing elements (question grounding, sampling sensitivity, human verification) are externally validated and independent of the reported results. No self-citation chains or ansatzes reduce the findings to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the constructed questions isolate momentary events without language or global context shortcuts; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Human verification ensures questions require transient evidence rather than persistent objects or language priors
Abstract states each question is human-verified to be grounded in localized, sampling-sensitive events.

pith-pipeline@v0.9.1-grok · 5863 in / 1214 out tokens · 27936 ms · 2026-06-28T14:47:52.168334+00:00 · methodology

Appendix A: Benchmark Data Sources

Moment-Video is constructed from diverse video sources to cover momentary visual event across both real-world and vir...

Figure 10: Open-ended LLM-as-Judge prompt used to evaluate semantic co...

The recoverable part of the prompt reads:

Judge meaning, not wording
Accept paraphrases, synonyms, abbreviations, and equivalent naming variants
Mark as consistent if the model answer fully covers the reference answer's meaning, even with extra harmless details
Mark as inconsistent if it misses a key point, contradicts the reference answer, or changes important facts, including entity, action, order, quantity, identity, or existence
Mark as inconsistent if the model answer is too vague to support the same meaning

If uncertain, choose consistent only when a reasonable reader would conclude full semantic coverage. Return JSON only, with no markdown and no extra text: {"is consistent": true/false, "reason": "one-sentence explanation"}
Reference answer: {reference answer}
Model answer: {model answer}
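
A judge reply in the JSON shape requested by this prompt could be parsed and scored as in the sketch below. The key name "is consistent" is copied from the extracted prompt text and may have lost punctuation during extraction, so it is exposed as a parameter; the function names are hypothetical and this is not the authors' evaluation script.

```python
import json


def parse_judge_verdict(raw: str, key: str = "is consistent") -> tuple[bool, str]:
    """Parse one judge reply of the form {"<key>": true/false, "reason": "..."}.

    The default key is an assumption taken from the extracted prompt text.
    """
    data = json.loads(raw)
    if not isinstance(data, dict) or key not in data:
        raise ValueError(f"judge reply has no {key!r} field")
    verdict = data[key]
    if not isinstance(verdict, bool):
        raise ValueError("verdict must be a JSON boolean")
    return verdict, str(data.get("reason", ""))


def judged_accuracy(raw_replies: list[str], key: str = "is consistent") -> float:
    """Fraction of replies judged consistent with the reference answer."""
    if not raw_replies:
        raise ValueError("no judge replies to score")
    verdicts = [parse_judge_verdict(r, key)[0] for r in raw_replies]
    return sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    replies = [
        '{"is consistent": true, "reason": "same count"}',
        '{"is consistent": false, "reason": "wrong order of actions"}',
    ]
    print(judged_accuracy(replies))  # 0.5
```

Rejecting non-boolean verdicts keeps a malformed reply from silently counting as correct or incorrect.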

[1] [1]

Qwen3-VL Technical Report

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Geet al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen3.5-Omni Technical Report

Q. Team, “Qwen3. 5-omni technical report,”arXiv preprint arXiv:2604.15804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] X. L. Ng, K. E. Ong, Q. Zheng, Y. Ni, S. Y. Yeo, and J. Liu, “Animal kingdom: A large and diverse dataset for animal behavior understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19 023–19 034.

[4] B. Zhang, K. Li, Z. Cheng, Z. Hu, Y. Yuan, G. Chen, S. Leng, Y. Jiang, H. Zhang, X. Li et al., “VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding,” arXiv preprint arXiv:2501.13106, 2025.

[5] W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan et al., “GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning,” arXiv preprint arXiv:2507.01006, 2025.

[6] K. Team, A. Du, B. Yin, B. Xing, B. Qu, B. Wang, C. Chen, C. Zhang, C. Du, C. Wei et al., “Kimi-VL technical report,” arXiv preprint arXiv:2504.07491, 2025.

[7] N.-E. Sun, Y.-C. Lin, S.-P. Chuang, T.-H. Hsu, D.-R. Yu, H.-Y. Chung, and T.-U. İk, “Tracknetv2: Efficient shuttlecock tracking network,” in 2020 International Conference on Pervasive Artificial Intelligence (ICPAI). IEEE, 2020, pp. 86–91.

[8] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.

[9] S. M. Safdarnejad, X. Liu, L. Udpa, B. Andrus, J. Wood, and D. Craven, “Sports videos in the wild (svw): A video dataset for sports analysis,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1. IEEE, 2015, pp. 1–7.

[10] Y. Li, L. Chen, R. He, Z. Wang, G. Wu, and L. Wang, “Multisports: A multi-person video dataset of spatio-temporally localized sports actions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13 536–13 545.

[11] B. Sun, J. Zhao, X. Chen, X. Wei, and Q. Hou, “LLaVA-Octopus: Unlocking instruction-driven adaptive projector fusion for video understanding,” arXiv preprint arXiv:2501.05067, 2025.

[12] K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-R1: Reinforcing video reasoning in MLLMs,” arXiv preprint arXiv:2503.21776, 2025.

[13] X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y. He, Y. Wang, Y. Qiao, Y. Wang, and L. Wang, “VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning,” arXiv preprint arXiv:2504.06958, 2025.

[14] Z. Yan, X. Li, Y. He, Z. Yue, X. Zeng, Y. Wang, Y. Qiao, L. Wang, and Y. Wang, “VideoChat-R1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception,” arXiv preprint arXiv:2509.21100, 2025.

[15] K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo et al., “Mvbench: A comprehensive multi-modal video understanding benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22 195–22 206.

[16] C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 24 108–24 118.

[17] C. Fu, H. Yuan, Y. Dong, Y.-F. Zhang, Y. Shen, X. Hu, X. Li, J. Su, C. Long, X. Xie et al., “Video-MME-v2: Towards the next stage in benchmarks for comprehensive video understanding,” arXiv preprint arXiv:2604.05015, 2026.

[18] W. Hong, Y. Cheng, Z. Yang, W. Wang, L. Wang, X. Gu, S. Huang, Y. Dong, and J. Tang, “Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 8450–8460.

[19] C. Tu, L. Zhang, P. Chen, P. Ye, X. Zeng, W. Cheng, G. Yu, and T. Chen, “Favor-bench: A comprehensive benchmark for fine-grained video motion understanding,” arXiv preprint arXiv:2503.14935, 2025.

[20] W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, M. Ding, X. Gu, S. Huang, B. Xu et al., “Lvbench: An extreme long video understanding benchmark,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 22 958–22 967.

[21] H. Wu, D. Li, B. Chen, and J. Li, “Longvideobench: A benchmark for long-context interleaved video-language understanding,” in Advances in Neural Information Processing Systems (NeurIPS), A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 28 828–28 857. [Online]. Available: ht...

[22] J. Zhou, Y. Shu, B. Zhao, B. Wu, Z. Liang, S. Xiao, M. Qin, X. Yang, Y. Xiong, B. Zhang, T. Huang, and Z. Liu, “Mlvu: Benchmarking multi-task long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 13 691–13 701.

[23] K. Chandrasegaran, A. Gupta, L. M. Hadzic, T. Kota, J. He, C. Eyzaguirre, Z. Durante, M. Li, J. Wu, and L. Fei-Fei, “Hourvideo: 1-hour video-language understanding,” in Advances in Neural Information Processing Systems (NeurIPS), A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024,...

[24] C.-H. Yeh, C. Wang, S. Tong, T.-Y. Cheng, R. Wang, T. Chu, Y. Zhai, Y. Chen, S. Gao, and Y. Ma, “Seeing from another perspective: Evaluating multi-view understanding in mllms,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 40, no. 14, 2026, pp. 12 000–12 008.

[25] J. Li, J. Wang, M. Tan, H. Wang, C. Yan, L. Shi, J. Cai, X. Jiang, and Y. Hu, “Crossvid: A comprehensive benchmark for evaluating cross-video reasoning in multimodal large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 40, no. 8, 2026, pp. 6244–6252.

[26] Y. Liu, K. Ouyang, H. Wu, Y. Liu, L. Sui, X. Li, Y. Zhong, Y. Charles, X. Zhou, and X. Sun, “Videoreasonbench: Can mllms perform vision-centric complex video reasoning?” 2026. [Online]. Available: https://arxiv.org/abs/2505.23359

[27] K. Hu, P. Wu, F. Pu, W. Xiao, Y. Zhang, X. Yue, B. Li, and Z. Liu, “Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos,” 2025. [Online]. Available: https://arxiv.org/abs/2501.13826

[28] Y. Zhao, H. Zhang, L. Xie, T. Hu, G. Gan, Y. Long, Z. Hu, W. Chen, C. Li, Z. Xu et al., “Mmvu: Measuring expert-level multi-discipline video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 8475–8489.

[29] Z. Zhang, W. Dou, L. Peng, H. Pan, U. Bagci, and B. Gong, “Videoads for fast-paced video understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 21 812–21 821.

[30] K. Mangalam, R. Akshulakov, and J. Malik, “Egoschema: A diagnostic benchmark for very long-form video language understanding,” in Advances in Neural Information Processing Systems (NeurIPS), A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 46 212–46 244. [Online]. Available: https://proc...

[31] Q. Fang, Y. Zhou, S. Guo, S. Zhang, and Y. Feng, “Llama-omni 2: Llm-based real-time spoken chatbot with autoregressive streaming speech synthesis,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 18 617–18 629.

[32] A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie et al., “GLM-5: from vibe coding to agentic engineering,” arXiv preprint arXiv:2602.15763, 2026.

[33] J. Li, S. Li, Q. Lian, P. Li, X. Chen, and Y. Zhou, “Toward deep representation learning for event-enhanced visual autonomous perception: The eap dataset,” IEEE Transactions on Robotics, vol. 42, pp. 1643–1661, 2026.

[34] F. Di Paco, L. Burattini, R. Gabbrielli, L. Landi, F. Marcelloni, L. Marrazzini, M. Palumbo, and M. Pirozzi, “Aisafety: An ai-based smart system for enhancing operator safety in production processes,” Safety Science, vol. 199, p. 107201, 2026.

[35] Google DeepMind, “Introducing our most intelligent model yet. with state-of-the-art reasoning to help you learn, build, and plan anything,” Google Blog, 2025. [Online]. Available: https://deepmind.google/models/gemini/

[36] ByteDance Seed Team, “Seed2.0 model card: Towards intelligence frontier for real-world complexity,” Feb. 2026, model card. [Online]. Available: https://seed.bytedance.com/zh/seed2

[37] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram et al., “OpenAI GPT-5 system card,” arXiv preprint arXiv:2601.03267, 2025.

[38] “Mimo-v2.5,” https://huggingface.co/collections/XiaomiMiMo/mimo-v25, 2026

[39] Qwen Team, “Qwen3.5: Towards native multimodal agents,” February 2026. [Online]. Available: https://qwen.ai/blog?id=qwen3.5

[40] Qwen Team, “Qwen3.6-35B-A3B: Agentic coding power, now open to all,” April 2026. [Online]. Available: https://qwen.ai/blog?id=qwen3.6-35b-a3b

[41] Qwen Team, “Qwen3.6-27B: Flagship-level coding in a 27B dense model,” April 2026. [Online]. Available: https://qwen.ai/blog?id=qwen3.6-27b

[42] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao et al., “InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,” arXiv preprint arXiv:2508.18265, 2025.

[43] Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “LLaVA-Video: Video instruction tuning with synthetic data,” arXiv preprint arXiv:2410.02713, 2024.

[44] M. AI, “Kimi k2.6 tech blog: Advancing open-source coding,” https://www.kimi.com/blog/kimi-k2-6, 2026

[45] X. Zhao, P. Zhang, K. Tang, H. Li, Z. Zhang, G. Zhai, J. Yan, H. Yang, X. Yang, and H. Duan, “Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing,” 2025.

[46] D. Chen, Y. Huang, S. Wu, J. Tang, L. Chen, Y. Bai, Z. He, C. Wang, H. Zhou, Y. Li et al., “Gui-world: A dataset for gui-oriented multimodal llm-based agents,” arXiv preprint arXiv:2406.10819, 2024.

[47] H. G. Hunt, “Dataset of photographed lightning events attaching to and around the brixton tower, johannesburg, south africa for the 2015-2016 thunderstorm season,” Data in Brief, vol. 30, p. 105630, 2020.

[48] B. AI, “Egocentric-10k,” 2025. [Online]. Available: https://huggingface.co/datasets/builddotai/Egocentric-10K

[49] Gemma Team, Google DeepMind, “Gemma 4: Our most capable open models to date,” https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/, Apr. 2026

[50] B. Yang, B. Wen, B. Ding, C. Liu, C. Chu, C. Song, C. Rao, C. Yi, D. Li, D. Zang et al., “Kwai keye-vl 1.5 technical report,” arXiv preprint arXiv:2509.01563, 2025.

[51] C. Fu, H. Lin, X. Wang, Y.-F. Zhang, Y. Shen, X. Liu, Y. Li, Z. Long, H. Gao, K. Li et al., “VITA-1.5: Towards GPT-4o level real-time vision and speech interaction,” arXiv preprint arXiv:2501.01957, 2025.

Appendix A Benchmark Data Sources

Moment-Video is constructed from diverse video sources to cover momentary visual events across both real-world and vir...

Judge meaning, not wording.
Accept paraphrases, synonyms, abbreviations, and equivalent naming variants.
Mark as consistent if the model answer fully covers the reference answer’s meaning, even with extra harmless details.
Mark as inconsistent if it misses a key point, contradicts the reference answer, or changes important facts, including entity, action, order, quantity, identity, or existence.
Mark as inconsistent if the model answer is too vague to support the same meaning.

If uncertain, choose consistent only when a reasonable reader would conclude full semantic coverage.

Return JSON only, with no markdown and no extra text: {"is consistent": true/false, "reason": "one-sentence explanation"}

Reference answer: {reference answer}
Model answer: {model answer}

Figure 10: Open-ended LLM-as-Judge prompt used to evaluate semantic co...
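The judging rules in the prompt above could be wired into an evaluation script roughly as follows. This is a minimal sketch, not the authors' code: the function names `build_judge_prompt`, `parse_verdict`, and `judged_accuracy` are hypothetical, the template only reassembles the fragments visible in the extracted prompt, and treating malformed judge replies as incorrect is an assumption. The extracted text shows the JSON key as "is consistent"; an underscore may have been lost in extraction, so the parser accepts both spellings.

```python
import json

# Rules copied from the judge prompt text above. The surrounding template is
# a hypothetical reassembly, not the paper's exact prompt.
JUDGE_RULES = (
    "Judge meaning, not wording.",
    "Accept paraphrases, synonyms, abbreviations, and equivalent naming variants.",
    "Mark as consistent if the model answer fully covers the reference "
    "answer's meaning, even with extra harmless details.",
    "Mark as inconsistent if it misses a key point, contradicts the reference "
    "answer, or changes important facts, including entity, action, order, "
    "quantity, identity, or existence.",
    "Mark as inconsistent if the model answer is too vague to support the "
    "same meaning.",
)

# Assumption: accept the key with a space (as extracted) or an underscore.
VERDICT_KEYS = ("is consistent", "is_consistent")


def build_judge_prompt(reference_answer: str, model_answer: str) -> str:
    """Assemble a judge prompt from the rules and the two answers."""
    rules = "\n".join(JUDGE_RULES)
    return (
        f"{rules}\n"
        "If uncertain, choose consistent only when a reasonable reader "
        "would conclude full semantic coverage.\n"
        "Return JSON only, with no markdown and no extra text: "
        '{"is consistent": true/false, "reason": "one-sentence explanation"}\n'
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}"
    )


def parse_verdict(raw_reply: str) -> bool:
    """Parse the judge's JSON reply; raise ValueError if it is malformed."""
    try:
        payload = json.loads(raw_reply.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {raw_reply!r}") from exc
    if not isinstance(payload, dict):
        raise ValueError("judge reply must be a JSON object")
    for key in VERDICT_KEYS:
        if isinstance(payload.get(key), bool):
            return payload[key]
    raise ValueError("judge reply has no boolean verdict field")


def judged_accuracy(raw_replies: list[str]) -> float:
    """Fraction of replies judged consistent.

    Assumption: a malformed reply is counted as inconsistent.
    """
    if not raw_replies:
        return 0.0
    correct = 0
    for reply in raw_replies:
        try:
            correct += int(parse_verdict(reply))
        except ValueError:
            pass
    return correct / len(raw_replies)
```

Parsing strictly and rejecting anything that is not a JSON object with a boolean verdict mirrors the prompt's "JSON only" requirement, so a judge that adds markdown or commentary is surfaced as an error rather than silently scored.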