DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment
Pith reviewed 2026-05-10 16:03 UTC · model grok-4.3
The pith
A frozen multimodal LLM supplies a perceptual prior for video quality that a small residual branch can calibrate to new mean opinion score targets without full retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DPC-VQA treats a frozen pretrained MLLM as a fixed provider of base quality estimates and perceptual features, then trains a separate lightweight calibration branch to output a residual that shifts those estimates into the desired mean-opinion-score space. The resulting system reports accuracy competitive with fully retrained end-to-end models while using under two percent of their trainable parameters, and it remains effective when only twenty percent of the usual labeled scores are available.
What carries the argument
Frozen MLLM perceptual prior plus lightweight residual calibration branch that predicts a corrective offset for target MOS alignment.
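Stated concretely, the decoupling is an additive decomposition: the final score is the frozen model's base estimate plus a learned offset. The sketch below is a minimal PyTorch rendering under assumed details; the paper does not publish its architecture, so ResidualCalibrator, the feature dimension, and the pooling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the decoupled scoring pattern (assumed architecture):
# a frozen base score plus a small trainable residual head.
import torch
import torch.nn as nn

class ResidualCalibrator(nn.Module):
    """Small trainable head that shifts a frozen base score into target MOS space."""

    def __init__(self, feat_dim: int = 4096, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, base_score: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # base_score: (B,) quality estimate from the frozen MLLM
        # feats:      (B, feat_dim) pooled perceptual features, also frozen
        residual = self.mlp(feats).squeeze(-1)  # (B,) learned correction
        return base_score.detach() + residual   # only the MLP receives gradients
```

Only the head is optimized against target MOS labels; detaching the base score is what keeps the trainable footprint small and the MLLM untouched.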
Load-bearing premise
The frozen multimodal model already extracts perceptual features general enough that a small additional network can learn the remaining domain-specific shift without any updates to the base parameters.
What would settle it
On a new video domain, train the calibration branch on the usual split and check whether its final accuracy falls substantially below that of a fully fine-tuned MLLM baseline; if the gap is large and persists even after giving the calibration branch more capacity or labels, the decoupling premise is refuted.
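One way to score that test is sketched below; pred_calibrated and pred_finetuned stand in for the two systems' predictions on the held-out split, and are synthetic here only so the snippet runs.

```python
# Scoring the decisive test: compare the calibrated frozen model against a
# fully fine-tuned baseline on the same held-out split, using the standard
# VQA correlation metrics. The arrays here are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlations(pred, mos):
    """Return (PLCC, SRCC) of predictions against mean opinion scores."""
    return pearsonr(pred, mos)[0], spearmanr(pred, mos)[0]

rng = np.random.default_rng(0)
mos = rng.uniform(1.0, 5.0, 200)                    # stand-in MOS labels
pred_calibrated = mos + rng.normal(0.0, 0.40, 200)  # frozen MLLM + residual head
pred_finetuned = mos + rng.normal(0.0, 0.35, 200)   # fully fine-tuned MLLM

plcc_cal, srcc_cal = correlations(pred_calibrated, mos)
plcc_ft, srcc_ft = correlations(pred_finetuned, mos)
# A large, persistent gap in favor of fine-tuning would refute the decoupling
# premise; near-parity supports it.
print(f"SRCC gap (fine-tuned minus calibrated): {srcc_ft - srcc_cal:+.3f}")
```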
Original abstract
Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DPC-VQA, a decoupling framework for video quality assessment in which a frozen multimodal large language model supplies a base quality estimate and perceptual prior while a lightweight calibration branch predicts a residual correction to adapt to target mean opinion score (MOS) distributions. The central claim is that this design achieves competitive performance on user-generated content (UGC) and AI-generated content (AIGC) benchmarks while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of the available MOS labels.
Significance. If the empirical support holds, the work offers a practical route to efficient domain adaptation of large pretrained models for perceptual tasks, substantially lowering both compute and annotation costs. The explicit separation of a general perceptual prior from a small residual adapter is a clean conceptual contribution that could generalize beyond VQA.
major comments (3)
- [Abstract and §4 (Experiments)] The claim that the frozen MLLM already supplies a 'useful perceptual prior' requires quantitative support. Please report the correlation (e.g., PLCC, SRCC) and absolute error of the base MLLM output alone (obtained via prompting or feature extraction) on the target UGC and AIGC test sets, both before and after the calibration branch is applied. Without these numbers it is impossible to determine whether the residual branch is performing a modest correction or largely compensating for domain shift.
- [§3.2 (Calibration Branch)] The statement that the branch uses '<2% of the trainable parameters' is load-bearing for the efficiency argument. Provide the exact parameter count of the calibration network, the total parameter count of the frozen MLLM, and the precise architectural choices (layers, hidden dimensions, input features) so that the 2% figure can be independently verified.
- [§4.3 (Reduced-label experiments)] The claim that the method 'remains effective with only 20% of MOS labels' is central. Clarify the sampling procedure for the 20% subset, whether the same split is used across all baselines, and whether results are averaged over multiple random subsets with standard deviation. If the calibration branch is trained on a small labeled set while the base MLLM is frozen, the risk that the residual largely memorizes the limited labels rather than learning a general correction should be quantified.
minor comments (3)
- [§2 (Related Work)] The positioning relative to prior MLLM-based VQA methods (e.g., those that fine-tune the full model) would be clearer if a brief table contrasted parameter counts, label requirements, and reported PLCC/SRCC on the same benchmarks.
- [Figure 1] The schematic would benefit from explicit visual distinction (color or hatching) between frozen and trainable modules, and from indicating the exact input to the calibration branch (raw features, the base quality score, or both).
- [Throughout] Ensure that all acronyms (MLLM, VQA, MOS, UGC, AIGC, PLCC, SRCC) are defined at first use and that the term 'residual calibration' is used consistently rather than interchangeably with 'correction'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of our decoupling framework. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.
Point-by-point responses
Referee: [Abstract and §4 (Experiments)] The claim that the frozen MLLM already supplies a 'useful perceptual prior' requires quantitative support. Please report the correlation (e.g., PLCC, SRCC) and absolute error of the base MLLM output alone (obtained via prompting or feature extraction) on the target UGC and AIGC test sets, both before and after the calibration branch is applied. Without these numbers it is impossible to determine whether the residual branch is performing a modest correction or largely compensating for domain shift.
Authors: We agree that explicit quantification of the frozen MLLM's standalone performance is necessary to substantiate the perceptual prior claim. In the revised manuscript, we will report PLCC, SRCC, and MAE for the base MLLM output (via direct prompting or feature extraction) on the UGC and AIGC test sets. These will be shown both before and after the calibration branch is applied, allowing readers to assess the magnitude of the residual correction.
Revision: yes
Referee: [§3.2 (Calibration Branch)] The statement that the branch uses '<2% of the trainable parameters' is load-bearing for the efficiency argument. Provide the exact parameter count of the calibration network, the total parameter count of the frozen MLLM, and the precise architectural choices (layers, hidden dimensions, input features) so that the 2% figure can be independently verified.
Authors: We will revise §3.2 to include the exact parameter counts for the calibration network and the frozen MLLM, together with a complete description of the architectural choices, including the number of layers, hidden dimensions, and the specific input features extracted from the MLLM. This will enable independent verification of the efficiency claim.
Revision: yes
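For reference, the requested check is a few lines of code once those counts are public; backbone and head below are placeholders, not the paper's actual components.

```python
# Verifying the <2% efficiency claim: trainable parameters as a fraction of
# the total (frozen backbone plus trainable head). Modules are placeholders.
import torch.nn as nn

def trainable_fraction(backbone: nn.Module, head: nn.Module) -> float:
    """Fraction of all parameters that actually receive gradients."""
    trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in backbone.parameters())
    return trainable / (trainable + frozen)

# For a roughly 7B-parameter frozen MLLM, any head under ~140M trainable
# parameters keeps this fraction below 0.02, i.e. the claimed <2%.
```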
Referee: [§4.3 (Reduced-label experiments)] The claim that the method 'remains effective with only 20% of MOS labels' is central. Clarify the sampling procedure for the 20% subset, whether the same split is used across all baselines, and whether results are averaged over multiple random subsets with standard deviation. If the calibration branch is trained on a small labeled set while the base MLLM is frozen, the risk that the residual largely memorizes the limited labels rather than learning a general correction should be quantified.
Authors: We will clarify in the revision that the 20% subset was obtained via random sampling and that the identical split was used for all baselines. We will also report results averaged over multiple random subsets with standard deviations. To address the memorization concern, we will add an analysis comparing the calibration branch against a simple memorization baseline on held-out labels, confirming that the lightweight branch learns a general correction rather than overfitting to the limited annotations.
Revision: partial
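The sampling protocol the referee asks for is easy to pin down. The sketch below assumes a hypothetical train_and_eval routine that trains only the calibration branch on a given subset of training IDs and returns SRCC on the fixed test split; both the routine and the seed set are illustrative, not the authors' protocol.

```python
# Reduced-label protocol: draw several independent 20% subsets of the
# training labels, retrain only the calibration branch on each, and report
# mean +/- std on a fixed test split. `train_and_eval` is hypothetical.
import numpy as np

def low_label_protocol(train_ids, train_and_eval, frac=0.2, seeds=(0, 1, 2, 3, 4)):
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        subset = rng.choice(train_ids, size=int(frac * len(train_ids)),
                            replace=False)
        scores.append(train_and_eval(subset))  # calibrator trained on subset only
    return float(np.mean(scores)), float(np.std(scores))

# mean_srcc, std_srcc = low_label_protocol(ids, train_and_eval)
# print(f"SRCC with 20% labels: {mean_srcc:.3f} +/- {std_srcc:.3f}")
```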
Circularity Check
No circularity in the proposed decoupling framework
Full rationale
The paper presents an engineering framework that freezes a pretrained MLLM to supply a base perceptual prior and trains an independent lightweight residual calibration branch on target MOS data. No equations, predictions, or derivations are shown that reduce by construction to fitted inputs or self-citations; the calibration step is explicitly trained separately and evaluated on held-out benchmarks. The approach is self-contained with external empirical validation on UGC and AIGC datasets, satisfying the criteria for a non-circular method proposal.
Axiom & Free-Parameter Ledger
free parameters (1)
- calibration branch parameters
axioms (1)
- domain assumption: the pretrained MLLM supplies a useful general perceptual prior for VQA
Reference graph
Works this paper leans on
- [1] Pengfei Chen, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. 2021. Unsupervised Curriculum Domain Adaptation for No-Reference Video Quality Assessment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 5178–5187.
- [2] Pengfei Chen, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. 2022. Contrastive Self-Supervised Pre-Training for Video Quality Assessment. IEEE Transactions on Image Processing 31 (2022), 458–471. doi:10.1109/TIP.2021.3130536
- [3] Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman, Radu Timofte, Chenlong He, Qi Zheng, Ruoxi Zhu, Zhengzhong Tu, Haiqiang Wang, Xiangguang Chen, Wenhui Meng, Xiang Pan, Huiying Shi, Han Zhu, Xiaozhong Xu, Lei Sun, Zhenzhong Chen, Shan Liu, Zicheng Zhang, Haoning Wu, Yingjie Zhou, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Wei Sun, Yuqin Cao, ...
- [4] Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, and Guangtao Zhai. 2025. FineVQ: Fine-Grained User Generated Content Video Quality Assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3206–3217. doi:10.1109/CVPR52734.2025.00305
- [5] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast Networks for Video Recognition. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 6201–6210. doi:10.1109/ICCV.2019.00630
- [6] Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, and Guangtao Zhai. 2025. LMM-VQA: Advancing Video Quality Assessment With Large Multimodal Models. IEEE Transactions on Circuits and Systems for Video Technology 35, 11 (2025), 11083–11096. doi:10.1109/TCSVT.2025.3571788
- [7] Jari Korhonen. 2019. Two-Level Approach for No-Reference Consumer Video Quality Assessment. IEEE Transactions on Image Processing 28, 12 (2019), 5923–. doi:10.1109/TIP.2019.2923051
- [9] Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. 2024. Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment. In Proceedings of the 32nd ACM International Conference on Multimedia. 7793–7802.
- [10] Bowen Li, Weixia Zhang, Meng Tian, Guangtao Zhai, and Xianpei Wang. 2022. Blindly Assess Quality of In-the-Wild Videos via Quality-Aware Pre-Training and Motion Perception. IEEE Transactions on Circuits and Systems for Video Technology 32, 9 (2022), 5944–5958. doi:10.1109/TCSVT.2022.3164467
- [11] Dingquan Li, Tingting Jiang, and Ming Jiang. 2019. Quality Assessment of In-the-Wild Videos. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM ’19). Association for Computing Machinery, New York, NY, USA, 2351–2359. doi:10.1145/3343031.3351028
- [12] Xudong Li, Zihao Huang, Yan Zhang, Yunhang Shen, Ke Li, Xiawu Zheng, Liujuan Cao, and Rongrong Ji. 2025. Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10442–10452.
- [13]
- [14] Tian Liang, Jing Huang, Ming Kong, Luyuan Chen, and Qiang Zhu. 2024. Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 26855–26865.
- [15] Pavan C. Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C. Bovik. 2022. Image Quality Assessment Using Contrastive Learning. IEEE Transactions on Image Processing 31 (2022), 4149–4161. doi:10.1109/TIP.2022.3181496
- [16] Pavan C. Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, and Alan C. Bovik. 2023. CONVIQT: Contrastive Video Quality Estimator. IEEE Transactions on Image Processing 32 (2023), 5138–5152.
- [17] Shankhanil Mitra and Rajiv Soundararajan. 2022. Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content. In Proceedings of the 30th ACM International Conference on Multimedia. 1914–1924.
- [18] Shankhanil Mitra and Rajiv Soundararajan. 2024. Knowledge Guided Semi-Supervised Learning for Quality Assessment of User Generated Videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 4251–4260.
- [19] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. 2012. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Transactions on Image Processing 21, 12 (2012), 4695–4708.
- [20] Zelu Qi, Ping Shi, Chaoyang Zhang, Shuqi Wang, Fei Zhao, Da Pan, and Zefeng Ying. 2025. Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 1493–1502.
- [21] Yunpeng Qu, Kun Yuan, Qizhi Xie, Ming Sun, Chao Zhou, and Jian Wang. 2025. KVQ: Boosting Video Quality Assessment via Saliency-Guided Local Perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2150–2160.
- [22] Wei Sun, Xiongkuo Min, Wei Lu, and Guangtao Zhai. 2022. A Deep Learning Based No-Reference Quality Assessment Model for UGC Videos. In Proceedings of the 30th ACM International Conference on Multimedia. 856–865.
- [23] Zhengzhong Tu, Xiangxu Yu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C. Bovik. 2021. RAPIQUE: Rapid and Accurate Video Quality Prediction of User Generated Content. IEEE Open Journal of Signal Processing 2 (2021), 425–440. doi:10.1109/OJSP.2021.3090333
- [24] Jiarui Wang, Huiyu Duan, Guangtao Zhai, Juntong Wang, and Xiongkuo Min. 2025. AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18869–18880. doi:10.1109/CVPR52734.2025.01758
- [25] Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference.
- [26] Yilin Wang, Junjie Ke, Hossein Talebi, Joong Gon Yim, Neil Birkbeck, Balu Adsumilli, Peyman Milanfar, and Feng Yang. 2021. Rich Features for Perceptual Quality Assessment of UGC Videos. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13430–13439. doi:10.1109/CVPR46437.2021.01323
- [27] Wen Wen, Mu Li, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang, and Kede Ma. 2024. Modular Blind Video Quality Assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2763–2772.
- [29] Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, and Yilin Wang. 2025. CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment. CoRR abs/2505.16025 (2025). doi:10.48550/arXiv.2505.16025
- [30] Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. 2022. FAST-VQA: Efficient End-to-End Video Quality Assessment with Fragment Sampling. In Computer Vision – ECCV 2022. Springer Nature Switzerland, Cham, 538–554.
- [31] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. 2023. Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 20087–20097. doi:10.1109/ICCV51070.2023.01843
- [32] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtao Zhai, and Weisi Lin. 2024. Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels. In Forty-first International Conference on Machine Learning, ICML 2024. 54015–54029.
- [33] Jiaer Xia, Bingkui Tong, Yuhang Zang, Rui Shao, and Kaiyang Zhou. 2025. Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 208–217.
- [34] Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik. 2021. Patch-VQ: ’Patching Up’ the Video Quality Problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14019–14029.
- [36] Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, and Yansong Tang. 2024. PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2835–2845. doi:10.1109/CVPR52733.2024.00274
- [37] Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer, and Ismail Ben Ayed. 2025. Realistic Test-Time Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 25103–25112.
- [38] Weixia Zhang, Bingkun Zheng, Junlin Chen, and Zhihua Wang. 2025. Multi-Dimensional Quality Assessment for UGC Videos via Modular Multi-Modal Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 1557–1566.
- [39] Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazheng Xu, Liping Jing, and Peng Cui. 2025. On the Out-Of-Distribution Generalization of Large Multimodal Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10315–10326.
- [40] Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, and Guangtao Zhai. 2025. Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3229–. doi:10.1109/CVPR52734.2025.00307
- [42] Zhichao Zhang, Wei Sun, Xinyue Li, Yunhao Li, Qihang Ge, Jun Jia, Zicheng Zhang, Zhongpeng Ji, Fengyu Sun, Shangling Jui, et al. 2025. Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric. In Proceedings of the 33rd ACM International Conference on Multimedia. 6771–6780.
- [43] Zhichao Zhang, Wei Sun, Xinyue Li, Jun Jia, Xiongkuo Min, Zicheng Zhang, Chunyi Li, Zijian Chen, Puyi Wang, Fengyu Sun, Shangling Jui, and Guangtao Zhai. 2025. Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model. ACM Trans. Multimedia Comput. Comm...
- [44]
- [45] Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Jiao, and Song J... 2025.