pith. sign in

arxiv: 2605.21130 · v1 · pith:D2M446GKnew · submitted 2026-05-20 · 💻 cs.CV

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

Pith reviewed 2026-05-21 04:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords video quality assessmentpairwise comparisonmargin reasoninglarge multimodal modelsgeneralizationrelative evaluation
0
0 comments X

The pith

Pairwise margin reasoning lets video quality models generalize by comparing videos directly instead of assigning absolute scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that predicting absolute quality scores for videos ties models to dataset-specific rating habits and calibration biases, hurting performance on new domains. By instead having the model compare pairs of videos and reason about their perceptual differences, VersusQ focuses purely on relative quality. It predicts a signed continuous margin that indicates which video is better and by how much. This is optimized using Margin-Coupled GRPO that aligns the reasoning process with the numerical margin prediction. Experiments show this leads to better results on various benchmarks and stronger generalization.

Core claim

VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference, with Margin-Coupled GRPO jointly optimizing rollout-based relational reasoning and continuous margin regression.

What carries the argument

Pairwise margin reasoning, where the model directly compares two videos to output a signed continuous margin representing preference and difference magnitude.

Load-bearing premise

That focusing on perceptual differences through direct comparisons avoids the biases from dataset-specific rating habits and score distributions.

What would settle it

A new video quality benchmark with entirely different rating protocols where VersusQ shows no better cross-domain performance than absolute scoring baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.21130 by Binxin Yang, Hubery Yin, Jiexuan Zhang, Qiang Xu, Shibei Meng, Yuan Liu, Zhengyao Lv.

Figure 1
Figure 1. Figure 1: Two comparisons motivating the proposed design. Left: under a gradient-based visualization protocol [13], pointwise scoring (e.g., VQ-Insight [14]) concentrates on a few positions, while VersusQ attends more broadly to scene-level evidence. Right: NTP (next-token prediction) directly predicts numeric score tokens and can collapse to repeated outputs, while Hybrid combines reasoning with margin regression t… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed VersusQ pipeline. VersusQ warm-starts a motion-aware VLM with rationale supervision and MSE margin regression from the designated <reg> hidden state. MC-GRPO then uses score and format rewards for reasoning, with an auxiliary regression gradient for calibration. At inference, sparse margins are aggregated into an anchor-free leaderboard by zero-mean least squares. coarsens fine-gra… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative case study of VersusQ on video pairs. The model first produces comparison￾oriented reasoning and then predicts a continuous regression margin aligned with the ground-truth quality difference. Training recipe. The lower panel studies how reasoning supervision and reinforcement optimiza￾tion affect the margin-based model. Reasoning supervision improves transfer by encouraging the model to ground … view at source ↗
Figure 4
Figure 4. Figure 4: Convergence of sparse-graph leaderboard reconstruction. Curves are computed from prefixes of the sampled comparison-edge stream on the seven reported benchmarks. A method is marked stable for each metric once the corresponding SRCC or PLCC is within 0.005 of its final value, where SRCC denotes Spearman rank correlation coefficient and PLCC denotes Pearson linear correlation coefficient. LSQ denotes least-s… view at source ↗
Figure 5
Figure 5. Figure 5: Additional qualitative case study. The case illustrates comparison-oriented reasoning for a video pair, followed by a continuous regression margin aligned with the ground-truth quality difference. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional qualitative case study. The case provides another example of comparison￾oriented reasoning for a video pair, followed by a continuous regression margin aligned with the ground-truth quality difference. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
read the original abstract

Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose \textbf{VersusQ}, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes VersusQ, a pairwise margin reasoning framework for video quality assessment using large multimodal models. It claims that absolute score prediction mixes perceptual quality with dataset-specific calibration biases from annotation protocols and rating habits, whereas relative comparisons focus purely on perceptual differences. The method performs LMM-based pairwise comparisons to predict signed continuous margins capturing both preference and degree of difference, optimized via Margin-Coupled GRPO that jointly handles rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks are reported to achieve state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous scenarios.

Significance. If the central claim holds without hidden dependencies on absolute scores, the approach could meaningfully advance generalizable VQA by reducing domain-specific biases, enabling better transfer across benchmarks with differing rating distributions. This would be particularly valuable for applications involving unseen video domains or heterogeneous evaluation protocols.

major comments (1)
  1. [Abstract and §3] Abstract and §3 (method description): The claim that the framework is 'driven entirely by direct comparisons' and alleviates absolute-scale calibration bias is load-bearing for the cross-domain generalization results. However, a continuous margin regressor requires explicit target margins for each pair. Standard VQA practice derives these as signed differences of mean opinion scores (MOS) or normalized variants from the training benchmarks; these targets embed annotation protocols, rater biases, and score distributions, reintroducing the calibration information the method claims to sidestep. Please specify exactly how ground-truth margins are constructed and demonstrate that they do not rely on absolute scores from the same datasets used for training.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address the major comment point by point below, with plans to revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): The claim that the framework is 'driven entirely by direct comparisons' and alleviates absolute-scale calibration bias is load-bearing for the cross-domain generalization results. However, a continuous margin regressor requires explicit target margins for each pair. Standard VQA practice derives these as signed differences of mean opinion scores (MOS) or normalized variants from the training benchmarks; these targets embed annotation protocols, rater biases, and score distributions, reintroducing the calibration information the method claims to sidestep. Please specify exactly how ground-truth margins are constructed and demonstrate that they do not rely on absolute scores from the same datasets used for training.

    Authors: We appreciate the referee's careful reading and agree that explicit clarification is needed. In the revised manuscript, we will add the following details to §3: ground-truth margins for training are constructed as m = (MOS_v1 - MOS_v2) / σ_dataset, where σ_dataset is the standard deviation of MOS values in the training benchmark (or a normalized range variant). This construction does rely on absolute MOS from the training datasets. However, the framework remains driven by direct comparisons because the LMM never receives or predicts an absolute score for any individual video; supervision and inference are strictly pairwise. We argue this still reduces calibration bias relative to pointwise methods, as the model learns perceptual difference magnitudes rather than dataset-specific absolute scales. To demonstrate, we will add cross-domain experiments in the revision training on one benchmark and testing on another with substantially shifted MOS distributions, where VersusQ outperforms pointwise baselines. We will also include the exact margin formula and a short ablation on normalization choices. revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; pairwise framework presented as independent proposal without self-referential reduction

full rationale

The provided abstract and context contain no equations, no self-citations invoked as uniqueness theorems, and no fitted parameters renamed as predictions. The central argument—that relative margin reasoning sidesteps absolute-scale bias—is a methodological claim justified by the framework design itself rather than by reducing to dataset MOS values by construction. No step matches the enumerated circularity patterns; the derivation remains self-contained against external benchmarks and experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, training details, or modeling choices; ledger entries cannot be populated.

pith-pipeline@v0.9.0 · 5750 in / 996 out tokens · 24968 ms · 2026-05-21T04:52:35.507015+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 7 internal anchors

  1. [1]

    Perceptual video quality assessment: A survey.Science China Information Sciences, 67(11):211301, 2024

    Xiongkuo Min, Huiyu Duan, Wei Sun, Yucheng Zhu, and Guangtao Zhai. Perceptual video quality assessment: A survey.Science China Information Sciences, 67(11):211301, 2024

  2. [2]

    Blind image quality assessment: A survey on blind image quality assessment.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5): 5833–5857, 2023

    Guangtao Zhai and Xiongkuo Min. Blind image quality assessment: A survey on blind image quality assessment.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5): 5833–5857, 2023

  3. [3]

    Q-align: Teaching lmms for visual scoring via discrete text-defined levels

    Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtao Zhai, and Weisi Lin. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  4. [4]

    Teaching large language models to regress accurate image quality scores using score distribution

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  5. [5]

    Q-insight: Understanding image quality via visual reinforcement learning

    Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, and Jian Zhang. Q-insight: Understanding image quality via visual reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  6. [6]

    Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank

    Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  7. [7]

    Vqathinker: Exploring generalizable and explainable video quality assessment via reinforcement learning

    Chunyi Cao, Wei Sun, Zicheng Zhang, Yingjie Jia, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. Vqathinker: Exploring generalizable and explainable video quality assessment via reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

  8. [8]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  9. [9]

    Learning to rank using gradient descent

    Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Svore. Learning to rank using gradient descent. InProceedings of the 22nd International Conference on Machine Learning (ICML), pages 89–96, 2005

  10. [10]

    Learning to rank for information retrieval.Foundations and Trends in Information Retrieval, 3(3):225–331, 2009

    Tie-Yan Liu. Learning to rank for information retrieval.Foundations and Trends in Information Retrieval, 3(3):225–331, 2009

  11. [11]

    Rankiqa: Learning from rankings for no-reference image quality assessment

    Xialei Liu, Joost van de Weijer, and Andrew D Bagdanov. Rankiqa: Learning from rankings for no-reference image quality assessment. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1040–1049, 2017

  12. [12]

    Adaptive image quality assessment via teaching large multimodal models to compare

    Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, and Shiqi Wang. Adaptive image quality assessment via teaching large multimodal models to compare. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  13. [13]

    Mllms know where to look: Training-free perception of small visual details with multimodal llms

    Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. InInternational Conference on Learning Representations (ICLR), 2025

  14. [14]

    Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning

    Xuanyu Zhang, Weiqi Li, Shijie Zhao, Junlin Li, Li Zhang, and Jian Zhang. Vq-insight: Teaching vlms for ai-generated video quality understanding via progressive visual reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026

  15. [15]

    Statistical ranking and combina- torial Hodge theory.Mathematical Programming, 127(1):203–244, 2011

    Xiaoye Jiang, Lek-Heng Lim, Yuan Yao, and Yinyu Ye. Statistical ranking and combina- torial Hodge theory.Mathematical Programming, 127(1):203–244, 2011. doi: 10.1007/ s10107-010-0419-x

  16. [16]

    Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4): 600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4): 600–612, 2004. 10

  17. [17]

    Motion-based perceptual quality assessment of video.IEEE Transactions on Image Processing, 19(2):335–350, 2010

    Kalpana Seshadrinathan, Rajiv Soundararajan, Alan C Bovik, and Lawrence K Cormack. Motion-based perceptual quality assessment of video.IEEE Transactions on Image Processing, 19(2):335–350, 2010

  18. [18]

    Toward a practical perceptual video quality metric

    Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy, and Megha Manohara. Toward a practical perceptual video quality metric. InThe Netflix Tech Blog, 2016

  19. [19]

    Video quality assessment by reduced reference spatio-temporal entropic differencing.IEEE Transactions on Circuits and Systems for Video Technology, 23(4):684–694, 2013

    Rajiv Soundararajan and Alan C Bovik. Video quality assessment by reduced reference spatio-temporal entropic differencing.IEEE Transactions on Circuits and Systems for Video Technology, 23(4):684–694, 2013

  20. [20]

    Convolutional neural networks for no-reference image quality assessment

    Le Kang, Peng Ye, Yi Li, and David Doermann. Convolutional neural networks for no-reference image quality assessment. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1733–1740, 2014

  21. [21]

    Blind image quality assessment using a deep bilinear convolutional neural network.IEEE Transactions on Circuits and Systems for Video Technology, 30(1):36–47, 2020

    Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang. Blind image quality assessment using a deep bilinear convolutional neural network.IEEE Transactions on Circuits and Systems for Video Technology, 30(1):36–47, 2020

  22. [22]

    Quality assessment of in-the-wild videos

    Dingquan Li, Tingting Jiang, and Ming Jiang. Quality assessment of in-the-wild videos. In Proceedings of the 27th ACM International Conference on Multimedia (ACM MM), pages 2351–2359, 2019

  23. [23]

    Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling

    Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling. InProceedings of the European Conference on Computer Vision (ECCV), 2022

  24. [24]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  25. [25]

    KVQ: Boosting video quality assessment via saliency-guided local perception

    Yunpeng Qu, Kun Yuan, Qizhi Xie, Ming Sun, Chao Zhou, and Jian Wang. KVQ: Boosting video quality assessment via saliency-guided local perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  27. [27]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  28. [28]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

  29. [29]

    Vqa 2: Visual question answering for video quality assessment

    Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, and Xiongkuo Min. Vqa 2: Visual question answering for video quality assessment. InProceedings of the 33rd ACM International Conference on Multimedia (ACM MM), 2025

  30. [30]

    Rea- soning as representation: Rethinking visual reinforcement learning in image quality assessment

    Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang, Tianfan Xue, and Jian Zhang. Rea- soning as representation: Rethinking visual reinforcement learning in image quality assessment. InProceedings of the International Conference on Learning Representations (ICLR), 2026

  31. [31]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

  32. [32]

    Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022. 11

  33. [34]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  34. [35]

    Video-r1: Reinforcing video reasoning in mllms

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  35. [36]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, and Xiangyu Yue. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

  36. [37]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  37. [38]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  38. [39]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zhiquan Sun, Yifan Li, Zhi Wang, Haotian Guo, Peng Yuan, Nan Shen, Fan Yang, and Zheng Yu. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  39. [40]

    R1-onevision: Advanc- ing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.12478, 2025

    Jiahui Yang, Hao Dong, Sen Wang, Yi Wang, and Guansong Xia. R1-onevision: Advanc- ing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.12478, 2025

  40. [41]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Peng Liu, Kai Luo, Jiayuan Zheng, Enze Xie, and Zhenguo Li. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

  41. [42]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6202–6211, 2019

  42. [43]

    Gemini 3 pro model evaluation, 2025

    Google DeepMind. Gemini 3 pro model evaluation, 2025. https://storage.googleapis. com/deepmind-media/gemini/gemini_3_pro_model_evaluation.pdf

  43. [44]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  44. [45]

    Patch-vq: ‘patching up’ the video quality problem

    Zhenqiang Ying, Maitreya Manivasagam, Hasnain Mannan, Eero P Simoncelli, and Alan C Bovik. Patch-vq: ‘patching up’ the video quality problem. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14019–14029, 2021

  45. [46]

    The konstanz natural video database (konvid-1k)

    Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tamás Szirányi, Shujun Li, and Dietmar Saupe. The konstanz natural video database (konvid-1k). In2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–6. IEEE, 2017

  46. [47]

    Zeina Sinno and Alan C. Bovik. Large-scale study of perceptual video quality.IEEE Transac- tions on Image Processing, 28(2):612–627, 2019. doi: 10.1109/TIP.2018.2869673

  47. [48]

    Xiangxu Yu, Zhengzhong Tu, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C. Bovik. Perceptual quality assessment of ugc gaming videos.arXiv preprint arXiv:2204.00128, 2022

  48. [49]

    Avc, hevc, vp9, avs2 or av1? a comparative study of state-of-the-art video encoders on 4k videos

    Zhuoran Li, Zhengfang Duanmu, Wentao Liu, and Zhou Wang. Avc, hevc, vp9, avs2 or av1? a comparative study of state-of-the-art video encoders on 4k videos. In16th International Conference on Image Analysis and Recognition (ICIAR), 2019. 12

  49. [50]

    Vdpve: Vqa dataset for perceptual video enhancement

    Yixuan Gao, Yuqin Cao, Tengchuan Kou, Wei Sun, Yunlong Dong, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. Vdpve: Vqa dataset for perceptual video enhancement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1474–1483, 2023

  50. [51]

    completely blind

    Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer.IEEE Signal Processing Letters, 20(3):209–212, 2013

  51. [52]

    Saad, and Alan C

    Anish Mittal, Michele A. Saad, and Alan C. Bovik. A completely blind video integrity oracle. IEEE Transactions on Image Processing, 25(1):289–300, 2016. doi: 10.1109/TIP.2015.2502725

  52. [53]

    Channappayya

    Parimala Kancharla and Sumohana S. Channappayya. Completely blind quality assessment of user generated video content.IEEE Transactions on Image Processing, 31:263–274, 2022. doi: 10.1109/TIP.2021.3130541

  53. [54]

    Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023

  54. [55]

    A deep learning based no-reference quality assessment model for ugc videos

    Wei Sun, Xiongkuo Min, Wei Lu, and Guangtao Zhai. A deep learning based no-reference quality assessment model for ugc videos. InProceedings of the 30th ACM International Conference on Multimedia (ACM MM), pages 856–865, 2022

  55. [56]

    Analysis of video quality datasets via design of minimalistic video quality models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, and Kede Ma. Analysis of video quality datasets via design of minimalistic video quality models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  56. [57]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  57. [58]

    Zero: Memory optimiza- tions toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimiza- tions toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

  58. [59]

    Video 1 is better

    Arpad E. Elo.The Rating of Chessplayers, Past and Present. Arco Publishing, 1978. 13 A Sparse Graph Analysis Sparse graph construction.For a benchmark with N videos, the evaluation graph is an undirected sparse comparison graph G= (V,E) , where each vertex is a test video and each queried unordered pair becomes one edge. The edge orientation is assigned o...