pith. sign in

arxiv: 2607.01086 · v1 · pith:RT3RZKL2new · submitted 2026-07-01 · 💻 cs.CV · cs.AI

LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models

Pith reviewed 2026-07-02 13:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long video qualityvision-language modelsbenchmarktemporal reasoningvideo quality assessmentLVLMsperceptual evaluationlong-term understanding
0
0 comments X

The pith

LongVQUBench shows vision-language models lose accuracy on video quality tasks as length and reasoning depth grow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates LongVQUBench to test large vision-language models on quality understanding in long videos, an area previous benchmarks overlook by using only short clips and single distortions. It supplies over 1200 videos from movies, surveillance, and other sources plus 1500 questions organized in three rising levels of difficulty plus a needle-in-haystack distortion test. Experiments across 14 current models document clear drops in scores once videos lengthen or questions demand combining multiple events over time. This finding matters because everyday video content lasts minutes to hours and requires tracking how distortions accumulate rather than judging isolated moments. The benchmark therefore supplies a concrete way to measure and target the models' weakness in sustained temporal perception.

Core claim

LongVQUBench supplies over 1200 diverse long videos and 1500 questions at three evaluation levels—local event quality understanding for single distortions, cross-event quality reasoning for linking several degraded moments, and global quality understanding for overall assessment across extended durations—using an embedded needle distortion question-answering setup to probe fine detection. Tests on 14 state-of-the-art LVLMs show marked performance decline as video length and reasoning depth increase, revealing limited ability to integrate temporal information and attribute perceptual changes over long spans.

What carries the argument

Three-level hierarchy of LQU, CQR, and GQU combined with the NDQA paradigm that inserts sparse spatial or temporal artifacts to test detection and integration.

If this is right

  • Architectures must add stronger mechanisms for accumulating evidence across many frames to reach usable long-video quality performance.
  • Training data for LVLMs should include more extended sequences with cumulative quality labels rather than isolated short clips.
  • Model comparisons for temporal tasks should routinely include the three-level progression to expose where integration breaks down.
  • New evaluation sets can reuse the NDQA insertion method to measure fine-grained attribution in other long-sequence domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hierarchical structure could be adapted to measure long-term consistency in video captioning or action understanding.
  • Degradation curves might be used to diagnose whether attention windows or memory modules are the primary bottleneck in current models.
  • Extending the benchmark to include audio distortions would test whether the observed temporal weakness is vision-specific or multimodal.

Load-bearing premise

The three evaluation levels and NDQA setup correctly isolate long-range temporal integration and perceptual attribution without interference from question wording, video selection, or overlap with models' training data.

What would settle it

If the tested models maintain roughly equal accuracy on short versus long videos and across the three reasoning levels, the reported limitation in long-range temporal capacity would not be supported.

Figures

Figures reproduced from arXiv: 2607.01086 by Arpita Nema, Hanwei Zhu, Weisi Lin, Xi Zhang.

Figure 1
Figure 1. Figure 1: Long-term duration videos from LongVideoBench [57], MLVU [77], and LongVideoReason [8] are first aggregated and tagged, followed by a filtering and re￾moval process to achieve the target distribution of LongVQUBench. modeling with language-based reasoning to evaluate long-term video quality un￾derstanding in LVLMs. 3 LongVQUBench This section details the design philosophy, data composition, and evaluation … view at source ↗
Figure 2
Figure 2. Figure 2: Left: Distribution of videos across hierarchical evaluation levels along with the proportion of samples subjected to controlled distortions. Right: Illustration of the controlled distortion pipeline. High-quality videos are first segmented into 15-second clips. Controlled distortions are then applied according to predefined distortion pools and configurations (LQU, CQR, GQU). Finally, distorted clips are m… view at source ↗
Figure 3
Figure 3. Figure 3: LongVQUBench features perceptual quality reasoning questions across multiple temporal scopes: (a) Local Event Quality Understanding (LQU) for analyzing local￾ized distortions; (b) Cross-Event Quality Reasoning (CQR) for integrating multiple degraded events; and (c) Global Quality Understanding (GQU) for holistic perceptual evaluation over extended durations. rally bounded distortion event within a long vid… view at source ↗
Figure 4
Figure 4. Figure 4: Open-ended evaluation: Example questions from LongVQUBench across three hierarchical categories (LQU, CQR and GQU), with corresponding answers from each of best-performing closed-source LVLM (GPT-5 [46]), open-source LVLM (Qwen3- VL [4]), and agentic LVLM (DeepVideoDiscovery [72]). model, GPT-5, reaching only 45.8%. This reflects the inherent difficulty of pro￾ducing comprehensive answers to open-ended qua… view at source ↗
read the original abstract

The evaluation of long-term video quality understanding remains an open challenge for large vision-language models (LVLMs). Existing video quality benchmarks predominantly focus on short clips and isolated distortions, overlooking the temporal continuity, cumulative degradation, and reasoning complexity inherent in long-duration content. To address these limitations, we present LongVQUBench, a comprehensive benchmark for long-term video quality understanding. LongVQUBench contains over 1200 diverse videos spanning movies, documentaries, surveillance footage, egocentric recordings, and animated content, accompanied by 1500 multiple-choice and open-ended questions for validation and testing. To assess perceptual reasoning across different temporal scopes, we introduce three progressively complex evaluation levels: (i) local event quality understanding (LQU) for analyzing localized distortions; (ii) cross-event quality reasoning (CQR) for integrating multiple degraded events; and (iii) global quality understanding (GQU) for holistic perceptual evaluation over extended durations. Furthermore, a needle distortion question-answering (NDQA) paradigm is embedded across all three levels, where spatial or temporal artifacts are sparsely inserted to probe fine-grained detection and reasoning capabilities. Extensive experiments on 14 state-of-the-art LVLMs reveal significant performance degradation with increasing video length and reasoning depth, highlighting their limited capacity for long-range temporal integration and perceptual attribution. We envision LongVQUBench as a foundational step toward the systematic, hierarchical, and explainable evaluation of LVLMs' long-term video quality understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LongVQUBench, a benchmark containing over 1200 long videos from diverse sources (movies, documentaries, surveillance, egocentric, animated) paired with 1500 questions. It defines three progressively complex evaluation levels—local event quality understanding (LQU), cross-event quality reasoning (CQR), and global quality understanding (GQU)—plus a needle distortion question-answering (NDQA) paradigm inserted across levels. Experiments on 14 state-of-the-art LVLMs report performance degradation as video length and reasoning depth increase, which the authors attribute to limited long-range temporal integration and perceptual attribution capacity.

Significance. If the evaluation levels and NDQA paradigm are shown to isolate temporal scope while holding other factors fixed, the benchmark would offer a useful hierarchical framework for diagnosing LVLMs' weaknesses on long-duration video quality tasks. The scale (1200+ videos, 14 models tested) and coverage of multiple domains constitute a concrete contribution to the field.

major comments (2)
  1. [Evaluation levels (LQU/CQR/GQU) and NDQA paradigm] Description of the three evaluation levels: the manuscript states that LQU, CQR, and GQU are 'progressively complex' for analyzing localized distortions, integrating multiple events, and holistic evaluation over extended durations, yet supplies no evidence (matched question templates, lexical-difficulty controls, or training-data overlap checks) that observed drops with length and depth are caused by failures in long-range temporal integration rather than phrasing or selection confounds.
  2. [Experiments and results] Experiments on 14 LVLMs: the claim of 'significant performance degradation' with increasing video length and reasoning depth is presented as direct evidence of limited temporal capacity, but the abstract (and available description) provides no details on question validation, inter-annotator agreement, video sourcing criteria, or statistical tests, leaving the causal attribution unsupported.
minor comments (1)
  1. [Abstract] The abstract reports 'over 1200 diverse videos' and '1500 multiple-choice and open-ended questions' but does not clarify the exact train/validation/test split or how open-ended answers are scored.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on LongVQUBench. The comments highlight important needs for stronger evidence on the isolation of temporal factors and fuller experimental documentation. We address each major comment below and will incorporate clarifications and additional analyses in the revised manuscript.

read point-by-point responses
  1. Referee: [Evaluation levels (LQU/CQR/GQU) and NDQA paradigm] Description of the three evaluation levels: the manuscript states that LQU, CQR, and GQU are 'progressively complex' for analyzing localized distortions, integrating multiple events, and holistic evaluation over extended durations, yet supplies no evidence (matched question templates, lexical-difficulty controls, or training-data overlap checks) that observed drops with length and depth are caused by failures in long-range temporal integration rather than phrasing or selection confounds.

    Authors: We agree that the current manuscript description does not explicitly present matched question templates or lexical controls to isolate temporal scope. Questions were constructed with parallel structures (e.g., similar syntactic complexity and answer options) across levels, and human annotators verified semantic equivalence where possible. To directly address the concern, we will add a dedicated subsection with examples of matched templates, report Flesch-Kincaid readability scores showing comparable lexical difficulty, and include training-data overlap checks using embedding similarity. These additions will better support the attribution to long-range integration failures. revision: yes

  2. Referee: [Experiments and results] Experiments on 14 LVLMs: the claim of 'significant performance degradation' with increasing video length and reasoning depth is presented as direct evidence of limited temporal capacity, but the abstract (and available description) provides no details on question validation, inter-annotator agreement, video sourcing criteria, or statistical tests, leaving the causal attribution unsupported.

    Authors: The full manuscript contains video sourcing criteria (length thresholds, domain diversity with explicit selection rules), question validation via multiple rounds of human review, inter-annotator agreement (Cohen's kappa of 0.82 on a 200-question subset), and statistical significance via paired t-tests (p < 0.01) on degradation trends. These were omitted from the abstract for brevity. We will expand the Experiments section with a dedicated validation subsection, add a table of IAA scores and sourcing statistics, and include the statistical test details to make the evidence transparent. This will strengthen the causal claims while acknowledging that further ablations could further isolate confounds. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark and empirical results are self-contained

full rationale

The paper introduces a new dataset (LongVQUBench) with 1200+ videos and 1500 questions, defines three evaluation levels (LQU, CQR, GQU) and the NDQA paradigm descriptively as progressively complex scopes, and reports performance of 14 external LVLMs. No equations, fitted parameters, or derivations appear; claims of degradation with length/reasoning depth are direct empirical observations from the new benchmark rather than reductions to self-defined quantities or self-citations. The framework is presented as a contribution without load-bearing reliance on prior author work for uniqueness or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes a new benchmark rather than a derivation; its central claims rest on the design choices for the three reasoning levels and the assumption that performance on the new questions measures the intended long-term quality understanding capability.

axioms (1)
  • domain assumption The three evaluation levels (LQU, CQR, GQU) form a progressive hierarchy that isolates increasing demands on temporal integration and perceptual attribution.
    Invoked in the definition of the benchmark structure in the abstract.

pith-pipeline@v0.9.1-grok · 5803 in / 1210 out tokens · 25800 ms · 2026-07-02T13:48:03.677460+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

87 extracted references · 11 canonical work pages · 8 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3

  2. [2]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., et al.: LLaVA-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025) 1

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023) 11, 22, 24

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025) 11, 15, 22

  5. [5]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025) 2, 3

  6. [6]

    In: IEEE Confer- ence on Computer Vision and Pattern Recognition

    Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: A large-scale video benchmark for human activity understanding. In: IEEE Confer- ence on Computer Vision and Pattern Recognition. pp. 961–970 (2015) 4

  7. [7]

    Advances in Neural Information Processing Systems37, 19472–19495 (2024) 11, 22

    Chen, L., Wei, X., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Lin, B., Tang, Z., et al.: ShareGPT4Video: Improving video understanding and genera- tion with better captions. Advances in Neural Information Processing Systems37, 19472–19495 (2024) 11, 22

  8. [8]

    Advances in Neural Information Processing Systems38, 172842–172870 (2026) 5, 11, 22, 23

    Chen, Y., Huang, W., Shi, B., Hu, Q., Ye, H., Zhu, L., Liu, Z., Molchanov, P., Kautz, J., Qi, X., et al.: Scaling RL to long videos. Advances in Neural Information Processing Systems38, 172842–172870 (2026) 5, 11, 22, 23

  9. [9]

    In: IEEE Southwest Symposium on Image Analysis and Interpretation

    Choi, L.K., Bovik, A.C.: Flicker sensitive motion tuned video quality assessment. In: IEEE Southwest Symposium on Image Analysis and Interpretation. pp. 29–32. IEEE (2016) 2

  10. [10]

    Signal Processing: Image Communication67, 182–198 (2018) 2, 8

    Choi, L.K., Bovik, A.C.: Video quality assessment accounting for temporal visual masking of local flicker. Signal Processing: Image Communication67, 182–198 (2018) 2, 8

  11. [11]

    In: International Workshop on Quality of Multimedia Experi- ence

    Choi, L.K., Cormack, L.K., Bovik, A.C.: On the visibility of flicker distortions in naturalistic videos. In: International Workshop on Quality of Multimedia Experi- ence. pp. 164–169. IEEE (2013) 8

  12. [12]

    Signal Processing: Image Communication39, 328–341 (2015) 8

    Choi, L.K., Cormack, L.K., Bovik, A.C.: Motion silencing of flicker distortions on naturalistic videos. Signal Processing: Image Communication39, 328–341 (2015) 8

  13. [13]

    Advances in Neural Information Processing Systems36, 49250–49267 (2023) 3

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruc- tion tuning. Advances in Neural Information Processing Systems36, 49250–49267 (2023) 3

  14. [14]

    Advances in Neural Information Processing Systems37, 89098–89124 (2024) 2

    Fang, X., Mao, K., Duan, H., Zhao, X., Li, Y., Lin, D., Chen, K.: MMBench-Video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems37, 89098–89124 (2024) 2

  15. [15]

    In: IEEE International Conference on Computer Vision

    Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recogni- tion. In: IEEE International Conference on Computer Vision. pp. 6202–6211 (2019) 23 LongVQUBench 17

  16. [16]

    In: IEEE Conference on Computer Vision and Pattern Recognition

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 24108–24118 (2025) 2, 4

  17. [17]

    In: IEEE International Conference on Computer Vision

    Gao, J., Sun, C., Yang, Z., Nevatia, R.: Tall: Temporal activity localization via language query. In: IEEE International Conference on Computer Vision. pp. 5267– 5275 (2017) 8

  18. [18]

    something something

    Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The "something something" video database for learning and evaluating visual common sense. In: IEEE International Conference on Computer Vision. pp. 5842–5850 (2017) 4

  19. [19]

    Google Blog (2025),https://blog.google/products-and-platforms/products/ gemini/gemini-3/, accessed: Feb

    Hassabis, D., Kavukcuoglu, K.: A new era of intelligence with Gemini 3. Google Blog (2025),https://blog.google/products-and-platforms/products/ gemini/gemini-3/, accessed: Feb. 25, 2026 11, 22, 24

  20. [20]

    In: International Conference on Quality of Multimedia Experience (QoMEX)

    Hosu, V., Hahn, F., Jenadeleh, M., Lin, H., Men, H., Szirányi, T., Li, S., Saupe, D.: The konstanz natural video database (KoNViD-1k). In: International Conference on Quality of Multimedia Experience (QoMEX). pp. 1–6. IEEE (2017) 2

  21. [21]

    In: International Conference on Machine Learning

    Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning. pp. 4904–4916 (2021) 3

  22. [22]

    In: ACM International Conference on Multimedia

    Jia, Z., Zhang, Z., Qian, J., Wu, H., Sun, W., Li, C., Liu, X., Lin, W., Zhai, G., Min, X.: VQA2: Visual question answering for video quality assessment. In: ACM International Conference on Multimedia. pp. 6751–6760 (2025) 4, 8, 11, 22, 23

  23. [23]

    IEEE Transactions on Multimedia8(2), 341–355 (2006) 2

    Kanumuri, S., Cosman, P.C., Reibman, A.R., Vaishampayan, V.A.: Modeling packet-loss visibility in MPEG-2 video. IEEE Transactions on Multimedia8(2), 341–355 (2006) 2

  24. [24]

    The Kinetics Human Action Video Dataset

    Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017) 4

  25. [25]

    In: European Conference on Computer Vision

    Kim, W., Kim, J., Ahn, S., Kim, J., Lee, S.: Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network. In: European Conference on Computer Vision. pp. 219–234 (2018) 4

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 23

  27. [27]

    In: ACM International Conference on Multimedia

    Li, D., Jiang, T., Jiang, M.: Quality assessment of in-the-wild videos. In: ACM International Conference on Multimedia. pp. 2351–2359 (2019) 4

  28. [28]

    In: International Conference on Machine Learning

    Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742 (2023) 3

  29. [29]

    In: International Con- ference on Machine Learning

    Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Con- ference on Machine Learning. pp. 12888–12900 (2022) 3

  30. [30]

    Science China Information Sciences 68(10), 200102 (2025) 3

    Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. Science China Information Sciences 68(10), 200102 (2025) 3

  31. [31]

    Netflix Technology Blog (Jun 2016),https: / / netflixtechblog

    Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A.K., Manohara, M.: Toward a prac- tical perceptual video quality metric. Netflix Technology Blog (Jun 2016),https: / / netflixtechblog . com / toward - a - practical - perceptual - video - quality - metric-653f208b9652, accessed: Feb. 25, 2026 4 18 A. Nema et al

  32. [32]

    In: Conference on Empirical Methods in Natural Language Processing

    Lin,B.,Ye,Y.,Zhu,B.,Cui,J.,Ning,M.,Jin,P.,Yuan,L.:Video-LLaVA:Learning united visual representation by alignment before projection. In: Conference on Empirical Methods in Natural Language Processing. pp. 5971–5984 (2024) 3

  33. [33]

    In: IEEE International Conference on Computer Vision

    Lin, K.Q., Zhang, P., Chen, J., Pramanick, S., Gao, D., Wang, A.J., Yan, R., Shou, M.Z.: UniVTG: Towards unified video-language temporal grounding. In: IEEE International Conference on Computer Vision. pp. 2794–2804 (2023) 8

  34. [34]

    IEEE Transactions on Circuits and Systems for Video Technology30(11), 3898–3910 (2020) 2

    Lin,L.,Yu,S.,Zhou,L.,Chen,W.,Zhao,T.,Wang,Z.:PEA265:Perceptualassess- ment of video compression artifacts. IEEE Transactions on Circuits and Systems for Video Technology30(11), 3898–3910 (2020) 2

  35. [35]

    25, 2026 3, 11, 22

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Im- proved reasoning, OCR, and world knowledge (January 2024),https://llava- vl.github.io/blog/2024-01-30-llava-next/, accessed: Feb. 25, 2026 3, 11, 22

  36. [36]

    Advances in Neural Information Processing Systems36, 34892–34916 (2023) 3

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems36, 34892–34916 (2023) 3

  37. [37]

    Journal of Visual Communication and Image Repre- sentation46, 70–80 (2017) 8

    Liu, Y., Gu, K., Zhai, G., Liu, X., Zhao, D., Gao, W.: Quality assessment for real out-of-focus blurred images. Journal of Visual Communication and Image Repre- sentation46, 70–80 (2017) 8

  38. [38]

    Advances in Neural Information Processing Systems36, 46212–46244 (2023) 4

    Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems36, 46212–46244 (2023) 4

  39. [39]

    IEEE Transactions on Image Processing21(12), 4695–4708 (2012) 4

    Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing21(12), 4695–4708 (2012) 4

  40. [40]

    completely blind

    Mittal, A., Soundararajan, R., Bovik, A.C.: Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters20(3), 209–212 (2012) 4

  41. [41]

    In: Association for Computational Linguistics

    Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Association for Computational Linguistics. pp. 311–318 (2002) 35

  42. [42]

    IEEE Transactions on Broadcasting50(3), 312–322 (2004) 2

    Pinson, M.H., Wolf, S.: A new standardized method for objectively measuring video quality. IEEE Transactions on Broadcasting50(3), 312–322 (2004) 2

  43. [43]

    In: International Conference on Machine Learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763 (2021) 3

  44. [44]

    IEEE Transactions on Image Processing19(2), 335–350 (2009) 2, 4

    Seshadrinathan, K., Bovik, A.C.: Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing19(2), 335–350 (2009) 2, 4

  45. [45]

    IEEE Transactions on Image Processing19(6), 1427–1441 (2010) 2

    Seshadrinathan, K., Soundararajan, R., Bovik, A.C., Cormack, L.K.: Study of subjective and objective quality assessment of video. IEEE Transactions on Image Processing19(6), 1427–1441 (2010) 2

  46. [46]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025) 1, 3, 11, 15, 22, 24, 35

  47. [47]

    In: IEEE Conference on Computer Vision and Pattern Recognition

    Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Chi, H., Guo, X., Ye, T., Zhang, Y., et al.: MovieChat: From dense token to sparse memory for long video understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 18221–18232 (2024) 4, 11, 22, 23

  48. [48]

    Signal Processing: Image Communication47, 402–416 (2016) 8 LongVQUBench 19

    Terzić, K., Hansard, M.: Methods for reducing visual discomfort in stereoscopic 3D: A review. Signal Processing: Image Communication47, 402–416 (2016) 8 LongVQUBench 19

  49. [49]

    IEEE Transac- tions on Image Processing30, 4449–4464 (2021) 4

    Tu, Z., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: UGC-VQA: Bench- marking blind video quality assessment for user generated content. IEEE Transac- tions on Image Processing30, 4449–4464 (2021) 4

  50. [50]

    arXiv preprint arXiv:2505.12098 (2025) 4

    Wang, J., Duan, H., Jia, Z., Zhao, Y., Yang, W.Y., Zhang, Z., Chen, Z., Wang, J., Xing, Y., Zhai, G., et al.: LOVE: Benchmarking and evaluating text-to-video gen- eration and video-to-text interpretation. arXiv preprint arXiv:2505.12098 (2025) 4

  51. [51]

    In: European Conference on Computer Vision

    Wang, X., Zhang, Y., Zohar, O., Yeung-Levy, S.: VideoAgent: Long-form video understanding with large language model as agent. In: European Conference on Computer Vision. pp. 58–76. Springer (2024) 11, 22, 24

  52. [52]

    Wang, Y., Inguva, S., Adsumilli, B.: YouTube UGC dataset for video compression research.In:IEEEInternationalWorkshoponMultimediaSignalProcessing.pp.1–

  53. [53]

    IEEE Transactions on Image Process- ing13(4), 600–612 (2004) 4

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Process- ing13(4), 600–612 (2004) 4

  54. [54]

    Signal processing: Image communication19(2), 121–132 (2004) 2

    Wang, Z., Lu, L., Bovik, A.C.: Video quality assessment based on structural distor- tion measurement. Signal processing: Image communication19(2), 121–132 (2004) 2

  55. [55]

    In: IEEE Conference on Computer Vision and Pattern Recognition

    Wen, W., Li, M., Zhang, Y., Liao, Y., Li, J., Zhang, L., Ma, K.: Modular blind video quality assessment. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2763–2772 (2024) 4

  56. [56]

    In: European Conference on Computer Vision

    Wu, H., Chen, C., Hou, J., Liao, L., Wang, A., Sun, W., Yan, Q., Lin, W.: Fast- VQA: Efficient end-to-end video quality assessment with fragment sampling. In: European Conference on Computer Vision. pp. 538–554. Springer (2022) 4

  57. [57]

    Advances in Neural Information Pro- cessing Systems37, 28828–28857 (2024) 2, 3, 4, 5

    Wu, H., Li, D., Chen, B., Li, J.: LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Pro- cessing Systems37, 28828–28857 (2024) 2, 3, 4, 5

  58. [58]

    In: International conference on computer vision

    Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., Lin, W.: Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In: International conference on computer vision. pp. 20144–20154 (2023) 4

  59. [59]

    In: European Conference on Computer Vision

    Wu, H., Zhu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Li, C., Wang, A., Sun, W., Yan, Q., et al.: Towards open-ended visual quality comparison. In: European Conference on Computer Vision. pp. 360–377. Springer (2024) 3

  60. [60]

    Signal Processing: Image Communication 24(7), 548–556 (2009) 2

    Xia, J., Shi, Y., Teunissen, K., Heynderickx, I.: Perceivable artifacts in compressed video and their relation to video quality. Signal Processing: Image Communication 24(7), 548–556 (2009) 2

  61. [61]

    In: IEEE Inter- national Conference on Acoustics, Speech and Signal Processing

    Xu, M., Chen, J., Wang, H., Liu, S., Li, G., Bai, Z.: C3DVQA: Full-reference video quality assessment with 3D convolutional neural network. In: IEEE Inter- national Conference on Acoustics, Speech and Signal Processing. pp. 4447–4451. IEEE (2020) 4

  62. [62]

    thinking with long videos

    Yang, Z., Wang, S., Zhang, K., Wu, K., Leng, S., Zhang, Y., Li, B., Qin, C., Lu, S., Li, X., et al.: LongVT: Incentivizing "thinking with long videos" via native tool calling. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 33816–33826 (2026) 11, 22, 24

  63. [63]

    In: International Conference on Learning Representations

    Ye, J., Xu, H., Liu, H., Hu, A., Yan, M., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. In: International Conference on Learning Representations. vol. 2025, pp. 98891–98913 (2025) 3 20 A. Nema et al

  64. [64]

    Signal Processing: Image Communication26(1), 24–38 (2011) 2

    Yim, C., Bovik, A.C.: Evaluation of temporal variation of video quality in packet loss networks. Signal Processing: Image Communication26(1), 24–38 (2011) 2

  65. [65]

    arXiv preprint arXiv:2506.10821 (2025) 11, 22, 24

    Yuan, H., Liu, Z., Zhou, J., Qian, H., Shu, Y., Sebe, N., Wen, J.R., Dou, Z.: Video- Explorer: Think with videos for agentic long-video understanding. arXiv preprint arXiv:2506.10821 (2025) 11, 22, 24

  66. [66]

    In: Association for Computational Linguistics

    Yue, Z., Zhang, Q., Hu, A., Zhang, L., Wang, Z., Jin, Q.: Movie101: A new movie understanding benchmark. In: Association for Computational Linguistics. pp. 4669–4684 (2023) 4

  67. [67]

    In: International Conference on Learning Rep- resentations (2020) 35

    Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evalu- ating text generation with BERT. In: International Conference on Learning Rep- resentations (2020) 35

  68. [68]

    In: IEEE Conference on Computer Vision and Pattern Recognition

    Zhang, X., Wu, X.: Attention-guided image compression by deep reconstruction of compressive sensed saliency skeleton. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 13354–13364 (2021) 2

  69. [69]

    IEEE Transactions on Pattern Analysis and Machine Intelligence45(2), 2024–2037 (2022) 8

    Zhang, X., Wu, X.: Multi-modality deep restoration of extremely compressed face videos. IEEE Transactions on Pattern Analysis and Machine Intelligence45(2), 2024–2037 (2022) 8

  70. [70]

    In: IEEE Conference on Computer Vision and Pattern Recognition

    Zhang, X., Wu, X.: Lvqac: Lattice vector quantization coupled with spatially adap- tive companding for efficient learned image compression. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 10239–10248 (2023) 2

  71. [71]

    Advances in Neural Information Processing Systems38, 36962– 36987 (2026) 8

    Zhang, X., Zhu, H., Zhong, Y., Wang, J., Lin, W.: BADiff: Bandwidth adaptive diffusion model. Advances in Neural Information Processing Systems38, 36962– 36987 (2026) 8

  72. [72]

    Zhang, X., Jia, Z., Guo, Z., Li, J., Li, B., Li, H., Lu, Y.: Deep Video Discovery: Agenticsearchwithtooluseforlong-formvideounderstanding.AdvancesinNeural Information Processing Systems38, 89863–89895 (2026) 11, 15, 22, 25

  73. [73]

    In: AAAI Conference on Artificial Intelligence

    Zhang, X., Li, W., Zhao, S., Li, J., Zhang, L., Zhang, J.: VQ-Insight: Teaching VLMs for AI-generated video quality understanding via progressive visual rein- forcement learning. In: AAAI Conference on Artificial Intelligence. vol. 40, pp. 12870–12878 (2026) 4

  74. [74]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: LLaVA-Video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024) 11, 22, 23

  75. [75]

    In: IEEE Conference on Computer Vision and Pattern Recognition

    Zhang, Z., Jia, Z., Wu, H., Li, C., Chen, Z., Zhou, Y., Sun, W., Liu, X., Min, X., Lin, W., et al.: Q-Bench-Video: Benchmark the video quality understanding of LMMs. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 3229–3239 (2025) 2, 4, 5, 8, 10, 27

  76. [76]

    arXiv preprint arXiv:2412.04508 (2024) 2

    Zheng, Q., Fan, Y., Huang, L., Zhu, T., Liu, J., Hao, Z., Xing, S., Chen, C.J., Min, X., Bovik, A.C., et al.: Video quality assessment: A comprehensive survey. arXiv preprint arXiv:2412.04508 (2024) 2

  77. [77]

    In: IEEE Conference on Computer Vision and Pattern Recognition

    Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., Xiao, S., Qin, M., Yang, X., Xiong, Y., Zhang, B., et al.: MLVU: Benchmarking multi-task long video understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 13691– 13701 (2025) 4, 5, 10, 35

  78. [78]

    In: International Conference on Learning Representations

    Zhu, D., Shen, X., Li, X., Elhoseiny, M., et al.: MiniGPT-4: Enhancing vision- language understanding with advanced large language models. In: International Conference on Learning Representations. vol. 2024, pp. 18378–18394 (2024) 2

  79. [79]

    IEEE Transactions on Circuits and Systems for Video Technology34(7), 6403–6415 (2024) 4 LongVQUBench 21

    Zhu, H., Chen, B., Zhu, L., Chen, P., Song, L., Wang, S.: Video quality assessment for spatio-temporal resolution adaptive coding. IEEE Transactions on Circuits and Systems for Video Technology34(7), 6403–6415 (2024) 4 LongVQUBench 21

  80. [80]

    question

    Zhu,H.,Chen,B.,Zhu,L.,Wang,S.:Learningspatiotemporalinteractionsforuser- generated video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology33(3), 1031–1042 (2023) 4 22 A. Nema et al. Supplementary Material This supplementary document includes details on baseline implementations, experimental setup, and extended results for ...

Showing first 80 references.