DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment
Pith reviewed 2026-05-10 16:03 UTC · model grok-4.3
The pith
A frozen multimodal LLM supplies a perceptual prior for video quality that a small residual branch can calibrate to new mean opinion score targets without full retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DPC-VQA treats a frozen pretrained MLLM as a fixed provider of base quality estimates and perceptual features, then trains a separate lightweight calibration branch to output a residual that shifts those estimates into the desired mean-opinion-score space. The resulting system reports accuracy competitive with fully retrained end-to-end models while using under two percent of their trainable parameters, and it remains effective when only twenty percent of the usual labeled scores are available.
What carries the argument
Frozen MLLM perceptual prior plus lightweight residual calibration branch that predicts a corrective offset for target MOS alignment.
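Stated concretely, the decoupling is an additive decomposition: the final score is the frozen model's base estimate plus a learned offset. The sketch below is a minimal PyTorch rendering under assumed details; the paper does not publish its architecture, so ResidualCalibrator, the feature dimension, and the pooling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the decoupled scoring pattern (assumed architecture):
# a frozen base score plus a small trainable residual head.
import torch
import torch.nn as nn

class ResidualCalibrator(nn.Module):
    """Small trainable head that shifts a frozen base score into target MOS space."""

    def __init__(self, feat_dim: int = 4096, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, base_score: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # base_score: (B,) quality estimate from the frozen MLLM
        # feats:      (B, feat_dim) pooled perceptual features, also frozen
        residual = self.mlp(feats).squeeze(-1)  # (B,) learned correction
        return base_score.detach() + residual   # only the MLP receives gradients
```

Only the head is optimized against target MOS labels; detaching the base score is what keeps the trainable footprint small and the MLLM untouched.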
Load-bearing premise
The frozen multimodal model already extracts perceptual features general enough that a small additional network can learn the remaining domain-specific shift without any updates to the base parameters.
What would settle it
On a new video domain, train the calibration branch on the usual split and check whether its final accuracy falls substantially below that of a fully fine-tuned MLLM baseline; if the gap is large and persists even after giving the calibration branch more capacity or labels, the decoupling premise is refuted.
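One way to score that test is sketched below; pred_calibrated and pred_finetuned stand in for the two systems' predictions on the held-out split, and are synthetic here only so the snippet runs.

```python
# Scoring the decisive test: compare the calibrated frozen model against a
# fully fine-tuned baseline on the same held-out split, using the standard
# VQA correlation metrics. The arrays here are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlations(pred, mos):
    """Return (PLCC, SRCC) of predictions against mean opinion scores."""
    return pearsonr(pred, mos)[0], spearmanr(pred, mos)[0]

rng = np.random.default_rng(0)
mos = rng.uniform(1.0, 5.0, 200)                    # stand-in MOS labels
pred_calibrated = mos + rng.normal(0.0, 0.40, 200)  # frozen MLLM + residual head
pred_finetuned = mos + rng.normal(0.0, 0.35, 200)   # fully fine-tuned MLLM

plcc_cal, srcc_cal = correlations(pred_calibrated, mos)
plcc_ft, srcc_ft = correlations(pred_finetuned, mos)
# A large, persistent gap in favor of fine-tuning would refute the decoupling
# premise; near-parity supports it.
print(f"SRCC gap (fine-tuned minus calibrated): {srcc_ft - srcc_cal:+.3f}")
```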
Original abstract
Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DPC-VQA, a decoupling framework for video quality assessment in which a frozen multimodal large language model supplies a base quality estimate and perceptual prior while a lightweight calibration branch predicts a residual correction to adapt to target mean opinion score (MOS) distributions. The central claim is that this design achieves competitive performance on user-generated content (UGC) and AI-generated content (AIGC) benchmarks while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of the available MOS labels.
Significance. If the empirical support holds, the work offers a practical route to efficient domain adaptation of large pretrained models for perceptual tasks, substantially lowering both compute and annotation costs. The explicit separation of a general perceptual prior from a small residual adapter is a clean conceptual contribution that could generalize beyond VQA.
major comments (3)
- [Abstract and §4 (Experiments)] The claim that the frozen MLLM already supplies a 'useful perceptual prior' requires quantitative support. Please report the correlation (e.g., PLCC, SRCC) and absolute error of the base MLLM output alone (obtained via prompting or feature extraction) on the target UGC and AIGC test sets, both before and after the calibration branch is applied. Without these numbers it is impossible to determine whether the residual branch is performing a modest correction or largely compensating for domain shift.
- [§3.2 (Calibration Branch)] The statement that the branch uses '<2% of the trainable parameters' is load-bearing for the efficiency argument. Provide the exact parameter count of the calibration network, the total parameter count of the frozen MLLM, and the precise architectural choices (layers, hidden dimensions, input features) so that the 2% figure can be independently verified.
- [§4.3 (Reduced-label experiments)] The claim that the method 'remains effective with only 20% of MOS labels' is central. Clarify the sampling procedure for the 20% subset, whether the same split is used across all baselines, and whether results are averaged over multiple random subsets with standard deviation. If the calibration branch is trained on a small labeled set while the base MLLM is frozen, the risk that the residual largely memorizes the limited labels rather than learning a general correction should be quantified.
minor comments (3)
- [§2 (Related Work)] The positioning relative to prior MLLM-based VQA methods (e.g., those that fine-tune the full model) would be clearer if a brief table contrasted parameter counts, label requirements, and reported PLCC/SRCC on the same benchmarks.
- [Figure 1] The schematic would benefit from explicit visual distinction (color or hatching) between frozen and trainable modules, and from indicating the exact input to the calibration branch (raw features, the base quality score, or both).
- [Throughout] Ensure that all acronyms (MLLM, VQA, MOS, UGC, AIGC, PLCC, SRCC) are defined at first use and that the term 'residual calibration' is used consistently rather than interchangeably with 'correction'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our decoupling framework. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.
Point-by-point responses
Referee: [Abstract and §4 (Experiments)] The claim that the frozen MLLM already supplies a 'useful perceptual prior' requires quantitative support. Please report the correlation (e.g., PLCC, SRCC) and absolute error of the base MLLM output alone (obtained via prompting or feature extraction) on the target UGC and AIGC test sets, both before and after the calibration branch is applied. Without these numbers it is impossible to determine whether the residual branch is performing a modest correction or largely compensating for domain shift.
Authors: We agree that explicit quantification of the frozen MLLM's standalone performance is necessary to substantiate the perceptual prior claim. In the revised manuscript, we will report PLCC, SRCC, and MAE for the base MLLM output (via direct prompting or feature extraction) on the UGC and AIGC test sets. These will be shown both before and after the calibration branch is applied, allowing readers to assess the magnitude of the residual correction.
Revision: yes
Referee: [§3.2 (Calibration Branch)] The statement that the branch uses '<2% of the trainable parameters' is load-bearing for the efficiency argument. Provide the exact parameter count of the calibration network, the total parameter count of the frozen MLLM, and the precise architectural choices (layers, hidden dimensions, input features) so that the 2% figure can be independently verified.
Authors: We will revise §3.2 to include the exact parameter counts for the calibration network and the frozen MLLM, together with a complete description of the architectural choices, including the number of layers, hidden dimensions, and the specific input features extracted from the MLLM. This will enable independent verification of the efficiency claim.
Revision: yes
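For reference, the requested check is a few lines of code once those counts are public; backbone and head below are placeholders, not the paper's actual components.

```python
# Verifying the <2% efficiency claim: trainable parameters as a fraction of
# the total (frozen backbone plus trainable head). Modules are placeholders.
import torch.nn as nn

def trainable_fraction(backbone: nn.Module, head: nn.Module) -> float:
    """Fraction of all parameters that actually receive gradients."""
    trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in backbone.parameters())
    return trainable / (trainable + frozen)

# For a roughly 7B-parameter frozen MLLM, any head under ~140M trainable
# parameters keeps this fraction below 0.02, i.e. the claimed <2%.
```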
Referee: [§4.3 (Reduced-label experiments)] The claim that the method 'remains effective with only 20% of MOS labels' is central. Clarify the sampling procedure for the 20% subset, whether the same split is used across all baselines, and whether results are averaged over multiple random subsets with standard deviation. If the calibration branch is trained on a small labeled set while the base MLLM is frozen, the risk that the residual largely memorizes the limited labels rather than learning a general correction should be quantified.
Authors: We will clarify in the revision that the 20% subset was obtained via random sampling and that the identical split was used for all baselines. We will also report results averaged over multiple random subsets with standard deviations. To address the memorization concern, we will add an analysis comparing the calibration branch against a simple memorization baseline on held-out labels, confirming that the lightweight branch learns a general correction rather than overfitting to the limited annotations.
Revision: partial
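The sampling protocol the referee asks for is easy to pin down. The sketch below assumes a hypothetical train_and_eval routine that trains only the calibration branch on a given subset of training IDs and returns SRCC on the fixed test split; both the routine and the seed set are illustrative, not the authors' protocol.

```python
# Reduced-label protocol: draw several independent 20% subsets of the
# training labels, retrain only the calibration branch on each, and report
# mean +/- std on a fixed test split. `train_and_eval` is hypothetical.
import numpy as np

def low_label_protocol(train_ids, train_and_eval, frac=0.2, seeds=(0, 1, 2, 3, 4)):
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        subset = rng.choice(train_ids, size=int(frac * len(train_ids)),
                            replace=False)
        scores.append(train_and_eval(subset))  # calibrator trained on subset only
    return float(np.mean(scores)), float(np.std(scores))

# mean_srcc, std_srcc = low_label_protocol(ids, train_and_eval)
# print(f"SRCC with 20% labels: {mean_srcc:.3f} +/- {std_srcc:.3f}")
```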
Circularity Check
No circularity in the proposed decoupling framework
Full rationale
The paper presents an engineering framework that freezes a pretrained MLLM to supply a base perceptual prior and trains an independent lightweight residual calibration branch on target MOS data. No equations, predictions, or derivations are shown that reduce by construction to fitted inputs or self-citations; the calibration step is explicitly trained separately and evaluated on held-out benchmarks. The approach is self-contained with external empirical validation on UGC and AIGC datasets, satisfying the criteria for a non-circular method proposal.
Axiom & Free-Parameter Ledger
free parameters (1)
- calibration branch parameters
axioms (1)
- domain assumption: the pretrained MLLM supplies a useful general perceptual prior for VQA
Reference graph
Works this paper leans on
- [1] Pengfei Chen, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. 2021. Unsupervised Curriculum Domain Adaptation for No-Reference Video Quality Assessment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 5178–5187.
- [2] Pengfei Chen, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. 2022. Contrastive Self-Supervised Pre-Training for Video Quality Assessment. IEEE Transactions on Image Processing 31 (2022), 458–471. doi:10.1109/TIP.2021.3130536
- [3] Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman, Radu Timofte, Chenlong He, Qi Zheng, Ruoxi Zhu, Zhengzhong Tu, Haiqiang Wang, Xiangguang Chen, Wenhui Meng, Xiang Pan, Huiying Shi, Han Zhu, Xiaozhong Xu, Lei Sun, Zhenzhong Chen, Shan Liu, Zicheng Zhang, Haoning Wu, Yingjie Zhou, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Wei Sun, Yuqin Cao, ...
- [4] Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, and Guangtao Zhai. 2025. FineVQ: Fine-Grained User Generated Content Video Quality Assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3206–3217. doi:10.1109/CVPR52734.2025.00305
- [5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast Networks for Video Recognition. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 6201–6210. doi:10.1109/ICCV.2019.00630
- [6] Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, and Guangtao Zhai. 2025. LMM-VQA: Advancing Video Quality Assessment With Large Multimodal Models. IEEE Transactions on Circuits and Systems for Video Technology 35, 11 (2025), 11083–11096. doi:10.1109/TCSVT.2025.3571788
- [7] Jari Korhonen. 2019. Two-Level Approach for No-Reference Consumer Video Quality Assessment. IEEE Transactions on Image Processing 28, 12 (2019), 5923–. doi:10.1109/TIP.2019.2923051
- [9] Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. 2024. Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment. In Proceedings of the 32nd ACM International Conference on Multimedia. 7793–7802.
- [10] Bowen Li, Weixia Zhang, Meng Tian, Guangtao Zhai, and Xianpei Wang. 2022. Blindly Assess Quality of In-the-Wild Videos via Quality-Aware Pre-Training and Motion Perception. IEEE Transactions on Circuits and Systems for Video Technology 32, 9 (2022), 5944–5958. doi:10.1109/TCSVT.2022.3164467
- [11] Dingquan Li, Tingting Jiang, and Ming Jiang. 2019. Quality Assessment of In-the-Wild Videos. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM ’19). Association for Computing Machinery, New York, NY, USA, 2351–2359. doi:10.1145/3343031.3351028
- [12] Xudong Li, Zihao Huang, Yan Zhang, Yunhang Shen, Ke Li, Xiawu Zheng, Liujuan Cao, and Rongrong Ji. 2025. Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10442–10452.
- [13]
- [14] Tian Liang, Jing Huang, Ming Kong, Luyuan Chen, and Qiang Zhu. 2024. Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 26855–26865.
- [15] Pavan C. Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C. Bovik. 2022. Image Quality Assessment Using Contrastive Learning. IEEE Transactions on Image Processing 31 (2022), 4149–4161. doi:10.1109/TIP.2022.3181496
- [16] Pavan C. Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C. Bovik. 2023. CONVIQT: Contrastive Video Quality Estimator. IEEE Transactions on Image Processing 32 (2023), 5138–5152.
- [17] Shankhanil Mitra and Rajiv Soundararajan. 2022. Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content. In Proceedings of the 30th ACM International Conference on Multimedia. 1914–1924.
- [18] Shankhanil Mitra and Rajiv Soundararajan. 2024. Knowledge Guided Semi-Supervised Learning for Quality Assessment of User Generated Videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 4251–4260.
- [19] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. 2012. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Transactions on Image Processing 21, 12 (2012), 4695–4708.
- [20] Zelu Qi, Ping Shi, Chaoyang Zhang, Shuqi Wang, Fei Zhao, Da Pan, and Zefeng Ying. 2025. Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 1493–1502.
- [21] Yunpeng Qu, Kun Yuan, Qizhi Xie, Ming Sun, Chao Zhou, and Jian Wang. 2025. KVQ: Boosting Video Quality Assessment via Saliency-Guided Local Perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2150–2160.
- [22] Wei Sun, Xiongkuo Min, Wei Lu, and Guangtao Zhai. 2022. A Deep Learning Based No-Reference Quality Assessment Model for UGC Videos. In Proceedings of the 30th ACM International Conference on Multimedia. 856–865.
- [23] Zhengzhong Tu, Xiangxu Yu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C. Bovik. 2021. RAPIQUE: Rapid and Accurate Video Quality Prediction of User Generated Content. IEEE Open Journal of Signal Processing 2 (2021), 425–440. doi:10.1109/OJSP.2021.3090333
- [24] Jiarui Wang, Huiyu Duan, Guangtao Zhai, Juntong Wang, and Xiongkuo Min. 2025. AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18869–18880. doi:10.1109/CVPR52734.2025.01758
- [25] Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference.
- [26] Yilin Wang, Junjie Ke, Hossein Talebi, Joong Gon Yim, Neil Birkbeck, Balu Adsumilli, Peyman Milanfar, and Feng Yang. 2021. Rich Features for Perceptual Quality Assessment of UGC Videos. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13430–13439. doi:10.1109/CVPR46437.2021.01323
- [27] Wen Wen, Mu Li, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang, and Kede Ma. 2024. Modular Blind Video Quality Assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2763–2772.
- [29] Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, and Yilin Wang. 2025. CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment. CoRR abs/2505.16025 (2025). doi:10.48550/arXiv.2505.16025
- [30] Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. 2022. FAST-VQA: Efficient End-to-End Video Quality Assessment with Fragment Sampling. In Computer Vision – ECCV 2022. Springer Nature Switzerland, Cham, 538–554.
- [31] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. 2023. Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 20087–20097. doi:10.1109/ICCV51070.2023.01843
- [32] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtao Zhai, and Weisi Lin. 2024. Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels. In Forty-first International Conference on Machine Learning, ICML 2024. 54015–54029.
- [33] Jiaer Xia, Bingkui Tong, Yuhang Zang, Rui Shao, and Kaiyang Zhou. 2025. Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 208–217.
- [34] Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik. 2021. Patch-VQ: ’Patching Up’ the Video Quality Problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14019–14029.
- [36] Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, and Yansong Tang. 2024. PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2835–2845. doi:10.1109/CVPR52733.2024.00274
- [37] Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer, and Ismail Ben Ayed. 2025. Realistic Test-Time Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 25103–25112.
- [38] Weixia Zhang, Bingkun Zheng, Junlin Chen, and Zhihua Wang. 2025. Multi-Dimensional Quality Assessment for UGC Videos via Modular Multi-Modal Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 1557–1566.
- [39] Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazheng Xu, Liping Jing, and Peng Cui. 2025. On the Out-Of-Distribution Generalization of Large Multimodal Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10315–10326.
- [40] Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, and Guangtao Zhai. 2025. Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3229–. doi:10.1109/CVPR52734.2025.00307
- [42] Zhichao Zhang, Wei Sun, Xinyue Li, Yunhao Li, Qihang Ge, Jun Jia, Zicheng Zhang, Zhongpeng Ji, Fengyu Sun, Shangling Jui, et al. 2025. Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric. In Proceedings of the 33rd ACM International Conference on Multimedia. 6771–6780.
- [43] Zhichao Zhang, Wei Sun, Xinyue Li, Jun Jia, Xiongkuo Min, Zicheng Zhang, Chunyi Li, Zijian Chen, Puyi Wang, Fengyu Sun, Shangling Jui, and Guangtao Zhai. 2025. Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model. ACM Trans. Multimedia Comput. Comm...
- [44]
- [45] Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Jiao, and Song J... 2025.