LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models
Pith reviewed 2026-07-02 13:48 UTC · model grok-4.3
The pith
LongVQUBench shows vision-language models lose accuracy on video quality tasks as length and reasoning depth grow.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongVQUBench supplies over 1200 diverse long videos and 1500 questions at three evaluation levels—local event quality understanding for single distortions, cross-event quality reasoning for linking several degraded moments, and global quality understanding for overall assessment across extended durations—using an embedded needle distortion question-answering setup to probe fine detection. Tests on 14 state-of-the-art LVLMs show marked performance decline as video length and reasoning depth increase, revealing limited ability to integrate temporal information and attribute perceptual changes over long spans.
What carries the argument
Three-level hierarchy of LQU, CQR, and GQU combined with the NDQA paradigm that inserts sparse spatial or temporal artifacts to test detection and integration.
If this is right
- Architectures must add stronger mechanisms for accumulating evidence across many frames to reach usable long-video quality performance.
- Training data for LVLMs should include more extended sequences with cumulative quality labels rather than isolated short clips.
- Model comparisons for temporal tasks should routinely include the three-level progression to expose where integration breaks down.
- New evaluation sets can reuse the NDQA insertion method to measure fine-grained attribution in other long-sequence domains.
Where Pith is reading between the lines
- The same hierarchical structure could be adapted to measure long-term consistency in video captioning or action understanding.
- Degradation curves might be used to diagnose whether attention windows or memory modules are the primary bottleneck in current models.
- Extending the benchmark to include audio distortions would test whether the observed temporal weakness is vision-specific or multimodal.
Load-bearing premise
The three evaluation levels and NDQA setup correctly isolate long-range temporal integration and perceptual attribution without interference from question wording, video selection, or overlap with models' training data.
What would settle it
If the tested models maintain roughly equal accuracy on short versus long videos and across the three reasoning levels, the reported limitation in long-range temporal capacity would not be supported.
Figures
read the original abstract
The evaluation of long-term video quality understanding remains an open challenge for large vision-language models (LVLMs). Existing video quality benchmarks predominantly focus on short clips and isolated distortions, overlooking the temporal continuity, cumulative degradation, and reasoning complexity inherent in long-duration content. To address these limitations, we present LongVQUBench, a comprehensive benchmark for long-term video quality understanding. LongVQUBench contains over 1200 diverse videos spanning movies, documentaries, surveillance footage, egocentric recordings, and animated content, accompanied by 1500 multiple-choice and open-ended questions for validation and testing. To assess perceptual reasoning across different temporal scopes, we introduce three progressively complex evaluation levels: (i) local event quality understanding (LQU) for analyzing localized distortions; (ii) cross-event quality reasoning (CQR) for integrating multiple degraded events; and (iii) global quality understanding (GQU) for holistic perceptual evaluation over extended durations. Furthermore, a needle distortion question-answering (NDQA) paradigm is embedded across all three levels, where spatial or temporal artifacts are sparsely inserted to probe fine-grained detection and reasoning capabilities. Extensive experiments on 14 state-of-the-art LVLMs reveal significant performance degradation with increasing video length and reasoning depth, highlighting their limited capacity for long-range temporal integration and perceptual attribution. We envision LongVQUBench as a foundational step toward the systematic, hierarchical, and explainable evaluation of LVLMs' long-term video quality understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongVQUBench, a benchmark containing over 1200 long videos from diverse sources (movies, documentaries, surveillance, egocentric, animated) paired with 1500 questions. It defines three progressively complex evaluation levels—local event quality understanding (LQU), cross-event quality reasoning (CQR), and global quality understanding (GQU)—plus a needle distortion question-answering (NDQA) paradigm inserted across levels. Experiments on 14 state-of-the-art LVLMs report performance degradation as video length and reasoning depth increase, which the authors attribute to limited long-range temporal integration and perceptual attribution capacity.
Significance. If the evaluation levels and NDQA paradigm are shown to isolate temporal scope while holding other factors fixed, the benchmark would offer a useful hierarchical framework for diagnosing LVLMs' weaknesses on long-duration video quality tasks. The scale (1200+ videos, 14 models tested) and coverage of multiple domains constitute a concrete contribution to the field.
major comments (2)
- [Evaluation levels (LQU/CQR/GQU) and NDQA paradigm] Description of the three evaluation levels: the manuscript states that LQU, CQR, and GQU are 'progressively complex' for analyzing localized distortions, integrating multiple events, and holistic evaluation over extended durations, yet supplies no evidence (matched question templates, lexical-difficulty controls, or training-data overlap checks) that observed drops with length and depth are caused by failures in long-range temporal integration rather than phrasing or selection confounds.
- [Experiments and results] Experiments on 14 LVLMs: the claim of 'significant performance degradation' with increasing video length and reasoning depth is presented as direct evidence of limited temporal capacity, but the abstract (and available description) provides no details on question validation, inter-annotator agreement, video sourcing criteria, or statistical tests, leaving the causal attribution unsupported.
minor comments (1)
- [Abstract] The abstract reports 'over 1200 diverse videos' and '1500 multiple-choice and open-ended questions' but does not clarify the exact train/validation/test split or how open-ended answers are scored.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on LongVQUBench. The comments highlight important needs for stronger evidence on the isolation of temporal factors and fuller experimental documentation. We address each major comment below and will incorporate clarifications and additional analyses in the revised manuscript.
read point-by-point responses
-
Referee: [Evaluation levels (LQU/CQR/GQU) and NDQA paradigm] Description of the three evaluation levels: the manuscript states that LQU, CQR, and GQU are 'progressively complex' for analyzing localized distortions, integrating multiple events, and holistic evaluation over extended durations, yet supplies no evidence (matched question templates, lexical-difficulty controls, or training-data overlap checks) that observed drops with length and depth are caused by failures in long-range temporal integration rather than phrasing or selection confounds.
Authors: We agree that the current manuscript description does not explicitly present matched question templates or lexical controls to isolate temporal scope. Questions were constructed with parallel structures (e.g., similar syntactic complexity and answer options) across levels, and human annotators verified semantic equivalence where possible. To directly address the concern, we will add a dedicated subsection with examples of matched templates, report Flesch-Kincaid readability scores showing comparable lexical difficulty, and include training-data overlap checks using embedding similarity. These additions will better support the attribution to long-range integration failures. revision: yes
-
Referee: [Experiments and results] Experiments on 14 LVLMs: the claim of 'significant performance degradation' with increasing video length and reasoning depth is presented as direct evidence of limited temporal capacity, but the abstract (and available description) provides no details on question validation, inter-annotator agreement, video sourcing criteria, or statistical tests, leaving the causal attribution unsupported.
Authors: The full manuscript contains video sourcing criteria (length thresholds, domain diversity with explicit selection rules), question validation via multiple rounds of human review, inter-annotator agreement (Cohen's kappa of 0.82 on a 200-question subset), and statistical significance via paired t-tests (p < 0.01) on degradation trends. These were omitted from the abstract for brevity. We will expand the Experiments section with a dedicated validation subsection, add a table of IAA scores and sourcing statistics, and include the statistical test details to make the evidence transparent. This will strengthen the causal claims while acknowledging that further ablations could further isolate confounds. revision: yes
Circularity Check
No circularity: benchmark and empirical results are self-contained
full rationale
The paper introduces a new dataset (LongVQUBench) with 1200+ videos and 1500 questions, defines three evaluation levels (LQU, CQR, GQU) and the NDQA paradigm descriptively as progressively complex scopes, and reports performance of 14 external LVLMs. No equations, fitted parameters, or derivations appear; claims of degradation with length/reasoning depth are direct empirical observations from the new benchmark rather than reductions to self-defined quantities or self-citations. The framework is presented as a contribution without load-bearing reliance on prior author work for uniqueness or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The three evaluation levels (LQU, CQR, GQU) form a progressive hierarchy that isolates increasing demands on temporal integration and perceptual attribution.
Reference graph
Works this paper leans on
-
[1]
Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3
Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3
2022
-
[2]
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., et al.: LLaVA-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025) 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023) 11, 22, 24
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025) 11, 15, 22
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025) 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
In: IEEE Confer- ence on Computer Vision and Pattern Recognition
Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: A large-scale video benchmark for human activity understanding. In: IEEE Confer- ence on Computer Vision and Pattern Recognition. pp. 961–970 (2015) 4
2015
-
[7]
Advances in Neural Information Processing Systems37, 19472–19495 (2024) 11, 22
Chen, L., Wei, X., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Lin, B., Tang, Z., et al.: ShareGPT4Video: Improving video understanding and genera- tion with better captions. Advances in Neural Information Processing Systems37, 19472–19495 (2024) 11, 22
2024
-
[8]
Advances in Neural Information Processing Systems38, 172842–172870 (2026) 5, 11, 22, 23
Chen, Y., Huang, W., Shi, B., Hu, Q., Ye, H., Zhu, L., Liu, Z., Molchanov, P., Kautz, J., Qi, X., et al.: Scaling RL to long videos. Advances in Neural Information Processing Systems38, 172842–172870 (2026) 5, 11, 22, 23
2026
-
[9]
In: IEEE Southwest Symposium on Image Analysis and Interpretation
Choi, L.K., Bovik, A.C.: Flicker sensitive motion tuned video quality assessment. In: IEEE Southwest Symposium on Image Analysis and Interpretation. pp. 29–32. IEEE (2016) 2
2016
-
[10]
Signal Processing: Image Communication67, 182–198 (2018) 2, 8
Choi, L.K., Bovik, A.C.: Video quality assessment accounting for temporal visual masking of local flicker. Signal Processing: Image Communication67, 182–198 (2018) 2, 8
2018
-
[11]
In: International Workshop on Quality of Multimedia Experi- ence
Choi, L.K., Cormack, L.K., Bovik, A.C.: On the visibility of flicker distortions in naturalistic videos. In: International Workshop on Quality of Multimedia Experi- ence. pp. 164–169. IEEE (2013) 8
2013
-
[12]
Signal Processing: Image Communication39, 328–341 (2015) 8
Choi, L.K., Cormack, L.K., Bovik, A.C.: Motion silencing of flicker distortions on naturalistic videos. Signal Processing: Image Communication39, 328–341 (2015) 8
2015
-
[13]
Advances in Neural Information Processing Systems36, 49250–49267 (2023) 3
Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruc- tion tuning. Advances in Neural Information Processing Systems36, 49250–49267 (2023) 3
2023
-
[14]
Advances in Neural Information Processing Systems37, 89098–89124 (2024) 2
Fang, X., Mao, K., Duan, H., Zhao, X., Li, Y., Lin, D., Chen, K.: MMBench-Video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems37, 89098–89124 (2024) 2
2024
-
[15]
In: IEEE International Conference on Computer Vision
Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recogni- tion. In: IEEE International Conference on Computer Vision. pp. 6202–6211 (2019) 23 LongVQUBench 17
2019
-
[16]
In: IEEE Conference on Computer Vision and Pattern Recognition
Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 24108–24118 (2025) 2, 4
2025
-
[17]
In: IEEE International Conference on Computer Vision
Gao, J., Sun, C., Yang, Z., Nevatia, R.: Tall: Temporal activity localization via language query. In: IEEE International Conference on Computer Vision. pp. 5267– 5275 (2017) 8
2017
-
[18]
something something
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The "something something" video database for learning and evaluating visual common sense. In: IEEE International Conference on Computer Vision. pp. 5842–5850 (2017) 4
2017
-
[19]
Google Blog (2025),https://blog.google/products-and-platforms/products/ gemini/gemini-3/, accessed: Feb
Hassabis, D., Kavukcuoglu, K.: A new era of intelligence with Gemini 3. Google Blog (2025),https://blog.google/products-and-platforms/products/ gemini/gemini-3/, accessed: Feb. 25, 2026 11, 22, 24
2025
-
[20]
In: International Conference on Quality of Multimedia Experience (QoMEX)
Hosu, V., Hahn, F., Jenadeleh, M., Lin, H., Men, H., Szirányi, T., Li, S., Saupe, D.: The konstanz natural video database (KoNViD-1k). In: International Conference on Quality of Multimedia Experience (QoMEX). pp. 1–6. IEEE (2017) 2
2017
-
[21]
In: International Conference on Machine Learning
Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning. pp. 4904–4916 (2021) 3
2021
-
[22]
In: ACM International Conference on Multimedia
Jia, Z., Zhang, Z., Qian, J., Wu, H., Sun, W., Li, C., Liu, X., Lin, W., Zhai, G., Min, X.: VQA2: Visual question answering for video quality assessment. In: ACM International Conference on Multimedia. pp. 6751–6760 (2025) 4, 8, 11, 22, 23
2025
-
[23]
IEEE Transactions on Multimedia8(2), 341–355 (2006) 2
Kanumuri, S., Cosman, P.C., Reibman, A.R., Vaishampayan, V.A.: Modeling packet-loss visibility in MPEG-2 video. IEEE Transactions on Multimedia8(2), 341–355 (2006) 2
2006
-
[24]
The Kinetics Human Action Video Dataset
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017) 4
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[25]
In: European Conference on Computer Vision
Kim, W., Kim, J., Ahn, S., Kim, J., Lee, S.: Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network. In: European Conference on Computer Vision. pp. 219–234 (2018) 4
2018
-
[26]
LLaVA-OneVision: Easy Visual Task Transfer
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 23
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
In: ACM International Conference on Multimedia
Li, D., Jiang, T., Jiang, M.: Quality assessment of in-the-wild videos. In: ACM International Conference on Multimedia. pp. 2351–2359 (2019) 4
2019
-
[28]
In: International Conference on Machine Learning
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730–19742 (2023) 3
2023
-
[29]
In: International Con- ference on Machine Learning
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Con- ference on Machine Learning. pp. 12888–12900 (2022) 3
2022
-
[30]
Science China Information Sciences 68(10), 200102 (2025) 3
Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. Science China Information Sciences 68(10), 200102 (2025) 3
2025
-
[31]
Netflix Technology Blog (Jun 2016),https: / / netflixtechblog
Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A.K., Manohara, M.: Toward a prac- tical perceptual video quality metric. Netflix Technology Blog (Jun 2016),https: / / netflixtechblog . com / toward - a - practical - perceptual - video - quality - metric-653f208b9652, accessed: Feb. 25, 2026 4 18 A. Nema et al
2016
-
[32]
In: Conference on Empirical Methods in Natural Language Processing
Lin,B.,Ye,Y.,Zhu,B.,Cui,J.,Ning,M.,Jin,P.,Yuan,L.:Video-LLaVA:Learning united visual representation by alignment before projection. In: Conference on Empirical Methods in Natural Language Processing. pp. 5971–5984 (2024) 3
2024
-
[33]
In: IEEE International Conference on Computer Vision
Lin, K.Q., Zhang, P., Chen, J., Pramanick, S., Gao, D., Wang, A.J., Yan, R., Shou, M.Z.: UniVTG: Towards unified video-language temporal grounding. In: IEEE International Conference on Computer Vision. pp. 2794–2804 (2023) 8
2023
-
[34]
IEEE Transactions on Circuits and Systems for Video Technology30(11), 3898–3910 (2020) 2
Lin,L.,Yu,S.,Zhou,L.,Chen,W.,Zhao,T.,Wang,Z.:PEA265:Perceptualassess- ment of video compression artifacts. IEEE Transactions on Circuits and Systems for Video Technology30(11), 3898–3910 (2020) 2
2020
-
[35]
25, 2026 3, 11, 22
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Im- proved reasoning, OCR, and world knowledge (January 2024),https://llava- vl.github.io/blog/2024-01-30-llava-next/, accessed: Feb. 25, 2026 3, 11, 22
2024
-
[36]
Advances in Neural Information Processing Systems36, 34892–34916 (2023) 3
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems36, 34892–34916 (2023) 3
2023
-
[37]
Journal of Visual Communication and Image Repre- sentation46, 70–80 (2017) 8
Liu, Y., Gu, K., Zhai, G., Liu, X., Zhao, D., Gao, W.: Quality assessment for real out-of-focus blurred images. Journal of Visual Communication and Image Repre- sentation46, 70–80 (2017) 8
2017
-
[38]
Advances in Neural Information Processing Systems36, 46212–46244 (2023) 4
Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems36, 46212–46244 (2023) 4
2023
-
[39]
IEEE Transactions on Image Processing21(12), 4695–4708 (2012) 4
Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing21(12), 4695–4708 (2012) 4
2012
-
[40]
completely blind
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters20(3), 209–212 (2012) 4
2012
-
[41]
In: Association for Computational Linguistics
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Association for Computational Linguistics. pp. 311–318 (2002) 35
2002
-
[42]
IEEE Transactions on Broadcasting50(3), 312–322 (2004) 2
Pinson, M.H., Wolf, S.: A new standardized method for objectively measuring video quality. IEEE Transactions on Broadcasting50(3), 312–322 (2004) 2
2004
-
[43]
In: International Conference on Machine Learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763 (2021) 3
2021
-
[44]
IEEE Transactions on Image Processing19(2), 335–350 (2009) 2, 4
Seshadrinathan, K., Bovik, A.C.: Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing19(2), 335–350 (2009) 2, 4
2009
-
[45]
IEEE Transactions on Image Processing19(6), 1427–1441 (2010) 2
Seshadrinathan, K., Soundararajan, R., Bovik, A.C., Cormack, L.K.: Study of subjective and objective quality assessment of video. IEEE Transactions on Image Processing19(6), 1427–1441 (2010) 2
2010
-
[46]
Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025) 1, 3, 11, 15, 22, 24, 35
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
In: IEEE Conference on Computer Vision and Pattern Recognition
Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Chi, H., Guo, X., Ye, T., Zhang, Y., et al.: MovieChat: From dense token to sparse memory for long video understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 18221–18232 (2024) 4, 11, 22, 23
2024
-
[48]
Signal Processing: Image Communication47, 402–416 (2016) 8 LongVQUBench 19
Terzić, K., Hansard, M.: Methods for reducing visual discomfort in stereoscopic 3D: A review. Signal Processing: Image Communication47, 402–416 (2016) 8 LongVQUBench 19
2016
-
[49]
IEEE Transac- tions on Image Processing30, 4449–4464 (2021) 4
Tu, Z., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: UGC-VQA: Bench- marking blind video quality assessment for user generated content. IEEE Transac- tions on Image Processing30, 4449–4464 (2021) 4
2021
-
[50]
arXiv preprint arXiv:2505.12098 (2025) 4
Wang, J., Duan, H., Jia, Z., Zhao, Y., Yang, W.Y., Zhang, Z., Chen, Z., Wang, J., Xing, Y., Zhai, G., et al.: LOVE: Benchmarking and evaluating text-to-video gen- eration and video-to-text interpretation. arXiv preprint arXiv:2505.12098 (2025) 4
-
[51]
In: European Conference on Computer Vision
Wang, X., Zhang, Y., Zohar, O., Yeung-Levy, S.: VideoAgent: Long-form video understanding with large language model as agent. In: European Conference on Computer Vision. pp. 58–76. Springer (2024) 11, 22, 24
2024
-
[52]
Wang, Y., Inguva, S., Adsumilli, B.: YouTube UGC dataset for video compression research.In:IEEEInternationalWorkshoponMultimediaSignalProcessing.pp.1–
-
[53]
IEEE Transactions on Image Process- ing13(4), 600–612 (2004) 4
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Process- ing13(4), 600–612 (2004) 4
2004
-
[54]
Signal processing: Image communication19(2), 121–132 (2004) 2
Wang, Z., Lu, L., Bovik, A.C.: Video quality assessment based on structural distor- tion measurement. Signal processing: Image communication19(2), 121–132 (2004) 2
2004
-
[55]
In: IEEE Conference on Computer Vision and Pattern Recognition
Wen, W., Li, M., Zhang, Y., Liao, Y., Li, J., Zhang, L., Ma, K.: Modular blind video quality assessment. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2763–2772 (2024) 4
2024
-
[56]
In: European Conference on Computer Vision
Wu, H., Chen, C., Hou, J., Liao, L., Wang, A., Sun, W., Yan, Q., Lin, W.: Fast- VQA: Efficient end-to-end video quality assessment with fragment sampling. In: European Conference on Computer Vision. pp. 538–554. Springer (2022) 4
2022
-
[57]
Advances in Neural Information Pro- cessing Systems37, 28828–28857 (2024) 2, 3, 4, 5
Wu, H., Li, D., Chen, B., Li, J.: LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Pro- cessing Systems37, 28828–28857 (2024) 2, 3, 4, 5
2024
-
[58]
In: International conference on computer vision
Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., Lin, W.: Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In: International conference on computer vision. pp. 20144–20154 (2023) 4
2023
-
[59]
In: European Conference on Computer Vision
Wu, H., Zhu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Li, C., Wang, A., Sun, W., Yan, Q., et al.: Towards open-ended visual quality comparison. In: European Conference on Computer Vision. pp. 360–377. Springer (2024) 3
2024
-
[60]
Signal Processing: Image Communication 24(7), 548–556 (2009) 2
Xia, J., Shi, Y., Teunissen, K., Heynderickx, I.: Perceivable artifacts in compressed video and their relation to video quality. Signal Processing: Image Communication 24(7), 548–556 (2009) 2
2009
-
[61]
In: IEEE Inter- national Conference on Acoustics, Speech and Signal Processing
Xu, M., Chen, J., Wang, H., Liu, S., Li, G., Bai, Z.: C3DVQA: Full-reference video quality assessment with 3D convolutional neural network. In: IEEE Inter- national Conference on Acoustics, Speech and Signal Processing. pp. 4447–4451. IEEE (2020) 4
2020
-
[62]
thinking with long videos
Yang, Z., Wang, S., Zhang, K., Wu, K., Leng, S., Zhang, Y., Li, B., Qin, C., Lu, S., Li, X., et al.: LongVT: Incentivizing "thinking with long videos" via native tool calling. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 33816–33826 (2026) 11, 22, 24
2026
-
[63]
In: International Conference on Learning Representations
Ye, J., Xu, H., Liu, H., Hu, A., Yan, M., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. In: International Conference on Learning Representations. vol. 2025, pp. 98891–98913 (2025) 3 20 A. Nema et al
2025
-
[64]
Signal Processing: Image Communication26(1), 24–38 (2011) 2
Yim, C., Bovik, A.C.: Evaluation of temporal variation of video quality in packet loss networks. Signal Processing: Image Communication26(1), 24–38 (2011) 2
2011
-
[65]
arXiv preprint arXiv:2506.10821 (2025) 11, 22, 24
Yuan, H., Liu, Z., Zhou, J., Qian, H., Shu, Y., Sebe, N., Wen, J.R., Dou, Z.: Video- Explorer: Think with videos for agentic long-video understanding. arXiv preprint arXiv:2506.10821 (2025) 11, 22, 24
-
[66]
In: Association for Computational Linguistics
Yue, Z., Zhang, Q., Hu, A., Zhang, L., Wang, Z., Jin, Q.: Movie101: A new movie understanding benchmark. In: Association for Computational Linguistics. pp. 4669–4684 (2023) 4
2023
-
[67]
In: International Conference on Learning Rep- resentations (2020) 35
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evalu- ating text generation with BERT. In: International Conference on Learning Rep- resentations (2020) 35
2020
-
[68]
In: IEEE Conference on Computer Vision and Pattern Recognition
Zhang, X., Wu, X.: Attention-guided image compression by deep reconstruction of compressive sensed saliency skeleton. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 13354–13364 (2021) 2
2021
-
[69]
IEEE Transactions on Pattern Analysis and Machine Intelligence45(2), 2024–2037 (2022) 8
Zhang, X., Wu, X.: Multi-modality deep restoration of extremely compressed face videos. IEEE Transactions on Pattern Analysis and Machine Intelligence45(2), 2024–2037 (2022) 8
2024
-
[70]
In: IEEE Conference on Computer Vision and Pattern Recognition
Zhang, X., Wu, X.: Lvqac: Lattice vector quantization coupled with spatially adap- tive companding for efficient learned image compression. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 10239–10248 (2023) 2
2023
-
[71]
Advances in Neural Information Processing Systems38, 36962– 36987 (2026) 8
Zhang, X., Zhu, H., Zhong, Y., Wang, J., Lin, W.: BADiff: Bandwidth adaptive diffusion model. Advances in Neural Information Processing Systems38, 36962– 36987 (2026) 8
2026
-
[72]
Zhang, X., Jia, Z., Guo, Z., Li, J., Li, B., Li, H., Lu, Y.: Deep Video Discovery: Agenticsearchwithtooluseforlong-formvideounderstanding.AdvancesinNeural Information Processing Systems38, 89863–89895 (2026) 11, 15, 22, 25
2026
-
[73]
In: AAAI Conference on Artificial Intelligence
Zhang, X., Li, W., Zhao, S., Li, J., Zhang, L., Zhang, J.: VQ-Insight: Teaching VLMs for AI-generated video quality understanding via progressive visual rein- forcement learning. In: AAAI Conference on Artificial Intelligence. vol. 40, pp. 12870–12878 (2026) 4
2026
-
[74]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: LLaVA-Video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024) 11, 22, 23
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[75]
In: IEEE Conference on Computer Vision and Pattern Recognition
Zhang, Z., Jia, Z., Wu, H., Li, C., Chen, Z., Zhou, Y., Sun, W., Liu, X., Min, X., Lin, W., et al.: Q-Bench-Video: Benchmark the video quality understanding of LMMs. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 3229–3239 (2025) 2, 4, 5, 8, 10, 27
2025
-
[76]
arXiv preprint arXiv:2412.04508 (2024) 2
Zheng, Q., Fan, Y., Huang, L., Zhu, T., Liu, J., Hao, Z., Xing, S., Chen, C.J., Min, X., Bovik, A.C., et al.: Video quality assessment: A comprehensive survey. arXiv preprint arXiv:2412.04508 (2024) 2
-
[77]
In: IEEE Conference on Computer Vision and Pattern Recognition
Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., Xiao, S., Qin, M., Yang, X., Xiong, Y., Zhang, B., et al.: MLVU: Benchmarking multi-task long video understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 13691– 13701 (2025) 4, 5, 10, 35
2025
-
[78]
In: International Conference on Learning Representations
Zhu, D., Shen, X., Li, X., Elhoseiny, M., et al.: MiniGPT-4: Enhancing vision- language understanding with advanced large language models. In: International Conference on Learning Representations. vol. 2024, pp. 18378–18394 (2024) 2
2024
-
[79]
IEEE Transactions on Circuits and Systems for Video Technology34(7), 6403–6415 (2024) 4 LongVQUBench 21
Zhu, H., Chen, B., Zhu, L., Chen, P., Song, L., Wang, S.: Video quality assessment for spatio-temporal resolution adaptive coding. IEEE Transactions on Circuits and Systems for Video Technology34(7), 6403–6415 (2024) 4 LongVQUBench 21
2024
-
[80]
question
Zhu,H.,Chen,B.,Zhu,L.,Wang,S.:Learningspatiotemporalinteractionsforuser- generated video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology33(3), 1031–1042 (2023) 4 22 A. Nema et al. Supplementary Material This supplementary document includes details on baseline implementations, experimental setup, and extended results for ...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.