How Accurate are Video Quality Models for Diffusion-Based Video Super-Resolution?

Alexander Raake; Benjamin Herb; Rakesh Rao Ramachandra Rao; Steve G\"oring

arxiv: 2605.25940 · v1 · pith:XBXHNUOWnew · submitted 2026-05-25 · 📡 eess.IV · cs.CV

How Accurate are Video Quality Models for Diffusion-Based Video Super-Resolution?

Benjamin Herb , Steve G\"oring , Alexander Raake , Rakesh Rao Ramachandra Rao This is my paper

Pith reviewed 2026-06-29 19:13 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords video quality assessmentvideo super-resolutiondiffusion modelssubjective testingfull-reference metricsno-reference metricsperceptual qualitygenerative video

0 comments

The pith

No tested video quality model reaches sufficient accuracy to replace subjective testing for diffusion-based video super-resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares predictions from six full-reference and no-reference video quality models against human subjective ratings for six upscaling methods applied to compressed and uncompressed low-resolution videos. CNN-based full-reference models such as LPIPS, DISTS, and CVQA-FR produce higher correlations than conventional or no-reference models, yet all models still misjudge specific artifacts like excessive sharpness or spatial inconsistencies. The authors conclude that current objective models cannot stand in for human viewers when evaluating this class of generative upscaling. The work matters because diffusion methods create degradations outside the training distributions of most existing metrics. The reference, degraded, and upscaled sequences plus the ratings and scores are released as open data.

Core claim

When six upscaling methods (Lanczos, Rhea, SCST, DOVE, SeedVR2, Starlight Mini) are applied to AV1- and DCVC-RT-compressed as well as uncompressed low-resolution content and displayed on UHD-1/4K screens, none of the tested full- or no-reference quality models achieve correlation levels high enough to replace subjective testing. CNN-based full-reference models outperform others but still overestimate the overly sharp output of SCST and fail on the spatial inconsistencies produced by Starlight Mini.

What carries the argument

Direct comparison of objective model scores against subjective human ratings collected for within-sequence performance on diffusion-based VSR outputs.

If this is right

CNN-based full-reference models exhibit significantly higher correlation than conventional full-reference or no-reference models on these outputs.
Most models overestimate the sharpness produced by SCST.
VMAF underperforms mainly because of spatial inconsistencies introduced by Starlight Mini.
Subjective testing must remain part of the evaluation pipeline for diffusion-based video super-resolution.
The released dataset of videos, ratings, and model scores can be used to benchmark future metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models trained on traditional compression or scaling artifacts will likely need retraining or new features to handle the distinct error patterns of diffusion generators.
The open dataset provides a concrete test bed for developing or fine-tuning metrics targeted at generative video restoration.
The same evaluation gap may appear when applying current quality models to other diffusion-based video tasks such as frame interpolation or denoising.

Load-bearing premise

The collected subjective ratings serve as reliable ground truth for perceptual quality across the tested upscaling methods, viewing conditions, and content types.

What would settle it

A new subjective test on the same or similar diffusion VSR sequences in which at least one of the tested models reaches correlation coefficients high enough to be considered a substitute for human ratings.

Figures

Figures reproduced from arXiv: 2605.25940 by Alexander Raake, Benjamin Herb, Rakesh Rao Ramachandra Rao, Steve G\"oring.

**Figure 1.** Figure 1: Overall Processing Pipeline. Spatial and temporal complexity calculated using the video complexity analyzer (VCA) [1] [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Example crops from the Daydreamer sequence using 360p AV1 source encodings. add a prepend frame option. The results (see [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: SOS [36] Analysis [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Overall results and results per codec / setting with 95% CI Intervals [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 6.** Figure 6: Upsampling methods results per degraded video, with the best [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗

**Figure 8.** Figure 8: Prediction difference between Lanczos / SeedVR2 and each method. [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗

**Figure 7.** Figure 7: Spearman correlation difference (×100) without each method / source compared to the overall within-sequence result. High ∆SRCC point toward a metric failing to integrate a given method / source into the overall result. PSNR PSNR-HVS SSIM MS-SSIM SSIMULACRA 2 Butteraugli VMAF VMAF (neg) CVVDP PIEAPP LPIPS (AlexNet) LPIPS (VGG) DISTS CVQA-FR CVQA-FR-MS BRISQUE NIQE MUSIQ CLIP-IQA+ MDTVSFA UVQ UVQ-1.5 CVQA-NR… view at source ↗

**Figure 9.** Figure 9: MOS compared to the metric results. Each metric axis is [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗

read the original abstract

Recent video super-resolution (VSR) approaches use deep neural networks to enhance low-quality input videos and recover visual detail, with diffusion-based methods in particular showing promising results. In this paper, we investigate whether existing video quality models can be used to assess the performance of these diffusion-based VSR methods, by comparing model predictions with results from a subjective test. The study compares six upscaling methods (Lanczos, Rhea, SCST, DOVE, SeedVR2, Starlight Mini) applied to both compressed (AV1 and DCVC-RT) and uncompressed low-resolution videos considering the play-out on a UHD-1/4K screen. A range of full- and no-reference quality models are used to assess their applicability to this new type of quality degradation, focusing on within-sequence performance. The results highlight that CNN-based full-reference models, such as LPIPS, DISTS, and CVQA-FR show significantly higher correlation coefficients than both conventional full- as well as the tested no-reference models. Most overestimate the overly sharp results of SCST, with VMAF mainly failing due to spatial inconsistencies introduced by Starlight Mini. None of the tested video quality models reach sufficient accuracy so as to replace complementary subjective testing. The reference, degraded and upscaled videos, as well as the user ratings and model scores are made available with the paper at https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-VSR as open data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

No tested quality model is accurate enough to replace subjective tests for diffusion VSR, and the open data release is the main practical value.

read the letter

The key point is that none of the video quality models they tested correlate well enough with human ratings to stand in for subjective testing on diffusion-based super-resolution outputs. They compared full- and no-reference models against ratings for six upscalers (including SeedVR2 and Starlight Mini) on both compressed AV1 and uncompressed low-res sources, played back on 4K. CNN-based full-reference models like LPIPS, DISTS, and CVQA-FR came out ahead of VMAF and the no-reference options, but still fell short overall. The GitHub release of the videos, ratings, and scores is the clearest strength; that kind of open resource lets others verify the numbers without relying on the paper alone.

The soft spot is the subjective test itself. The abstract gives no participant count, inter-rater agreement numbers, or checks for bias and consistency. If the full paper does not supply those, the claim that models are inadequate rests on unverified ground truth, exactly as the stress-test note flags. The phrase "sufficient accuracy" also stays vague without a stated correlation threshold.

This is for people who evaluate or deploy VSR tools in practice and need to know where objective metrics break. It deserves peer review because the empirical question is straightforward and the data contribution is real, even if the subjective methodology section needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical comparison of full-reference and no-reference video quality assessment (VQA) models against subjective ratings collected for six upscaling methods (including diffusion-based VSR approaches such as DOVE, SeedVR2, and Starlight Mini) applied to both compressed (AV1, DCVC-RT) and uncompressed low-resolution content, viewed on UHD-1/4K displays. It reports that CNN-based FR models (LPIPS, DISTS, CVQA-FR) achieve the highest correlations but still overestimate sharpness artifacts and fail to capture spatial inconsistencies, concluding that no tested model reaches sufficient accuracy to replace subjective testing. The reference, degraded, and enhanced sequences plus ratings and model scores are released as open data.

Significance. If the subjective ground truth is reliable, the work provides a useful benchmark showing that current VQA models are inadequate for diffusion-induced artifacts in VSR, particularly spatial inconsistencies and over-sharpening. The public release of the full dataset (videos, ratings, and scores) is a clear strength that enables future reproducibility and model development.

major comments (2)

[Subjective test / results sections] Subjective test description (likely §3 or §4): the manuscript provides no quantitative indicators of data reliability such as number of participants, inter-rater agreement (e.g., ICC or Cronbach’s alpha), outlier screening criteria, or comparison against anchor conditions. Because the central claim—that no VQA model is accurate enough to replace subjective testing—rests entirely on treating the collected ratings as authoritative ground truth, the absence of these checks leaves the reported correlation gaps (even for LPIPS/DISTS) open to reinterpretation as possible artifacts of noisy or biased subjective data.
[Results / discussion] Results and discussion (likely §5): the claim that “none of the tested video quality models reach sufficient accuracy” is presented without explicit numerical thresholds for what constitutes “sufficient” (e.g., minimum Pearson/Spearman correlation or RMSE relative to subjective variance). Without such criteria or statistical significance tests comparing model performance to the subjective variability, it is difficult to judge whether the observed gaps are practically meaningful or merely within the noise of the subjective data.

minor comments (2)

[Abstract / methods] The abstract states that the study focuses on “within-sequence performance,” but the manuscript does not define this term or clarify how it differs from standard across-sequence evaluation.
[Tables / figures] Table or figure captions listing the six upscaling methods should explicitly note which are diffusion-based to help readers map the results to the title.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of the subjective methodology and the interpretation of results.

read point-by-point responses

Referee: Subjective test description lacks quantitative indicators of data reliability such as number of participants, inter-rater agreement (ICC or Cronbach’s alpha), outlier screening criteria, or comparison against anchor conditions. This questions treating the ratings as authoritative ground truth.

Authors: We agree these reliability metrics are essential. The manuscript text does not report them explicitly, although the raw ratings are released publicly. In revision we will add a dedicated subsection in the subjective test description reporting participant count, inter-rater agreement (ICC), outlier screening, and any anchor conditions used, thereby reinforcing the validity of the ground-truth ratings. revision: yes
Referee: The claim that none of the tested models reach sufficient accuracy lacks explicit numerical thresholds (e.g., minimum Pearson/Spearman correlation or RMSE) and statistical significance tests against subjective variability.

Authors: We concur that explicit criteria would improve rigor. In the revised discussion we will define sufficient accuracy by reference to common VQA literature thresholds and will add statistical comparisons of model performance relative to the observed variance in subjective scores, clarifying whether the reported gaps are practically meaningful. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with external ground truth

full rationale

The paper conducts an empirical study comparing objective video quality model outputs to results from a new subjective test on six upscaling methods (including diffusion-based VSR) under specified viewing conditions. No derivations, equations, fitted parameters, or self-citation chains are present that would reduce any claim to its own inputs by construction. The central conclusion (insufficient model accuracy to replace subjective testing) rests on direct correlation measurements against the collected ratings and the released open dataset, which constitutes independent external benchmarking rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical evaluation study; it relies on the domain assumption that subjective ratings constitute ground truth but introduces no free parameters, new entities, or mathematical axioms beyond standard statistical comparison.

axioms (1)

domain assumption Subjective ratings collected under the described viewing conditions represent true perceptual quality for the tested content and methods.
The entire comparison of model accuracy rests on treating the subjective test as the reference benchmark.

pith-pipeline@v0.9.1-grok · 5812 in / 1180 out tokens · 26869 ms · 2026-06-29T19:13:07.127408+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 2 canonical work pages

[1]

VCA: Video Complexity Analyzer

V . V . Menon, C. Feldmann, H. Amirpour, M. Ghanbari, and C. Timmerer. “VCA: Video Complexity Analyzer”. In:Proc 13th ACM Multimed. Syst. Conf. Athlone Ireland: ACM, June 14, 2022, pp. 259–264

2022
[2]

A Survey of Deep Learning Video Super-Resolution

A. A. Baniya, T.-K. Lee, P. W. Eklund, and S. Aryal. “A Survey of Deep Learning Video Super-Resolution”. In:IEEE Trans. Emerg. Top. Comput. Intell.8.4 (Aug. 2024), pp. 2655– 2676

2024
[3]

Upscale-A- Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

S. Zhou, P. Yang, J. Wang, Y . Luo, and C. C. Loy. “Upscale-A- Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution”. In:Conf. Comput. Vis. Pattern Recognit.Seattle, USA: IEEE, June 16, 2024, pp. 2535–2545

2024
[4]

Self-Supervised Con- trolNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

S. Shi, J. Xu, L. Lu, Z. Li, and K. Hu. “Self-Supervised Con- trolNet with Spatio-Temporal Mamba for Real-world Video Super-resolution”. In:Conf. Comput. Vis. Pattern Recognit. Nashville, USA: IEEE, June 10, 2025, pp. 7385–7395

2025
[5]

SeedVR: Seeding Infinity in Diffusion Trans- former Towards Generic Video Restoration

J. Wang et al. “SeedVR: Seeding Infinity in Diffusion Trans- former Towards Generic Video Restoration”. In:Conf. Com- put. Vis. Pattern Recognit.Nashville, USA: IEEE, June 10, 2025, pp. 2161–2172

2025
[6]

Chen et al.DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Z. Chen et al.DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution. May 22, 2025. arXiv: 2505.16239. Pre-published

work page arXiv 2025
[7]

Wang et al.SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

J. Wang et al.SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training. June 5, 2025. arXiv: 2506.05301. Pre-published

work page arXiv 2025
[8]

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”. In:Conf Comput. Vis. Pattern Recognit. Salt Lake City, UT: IEEE, June 2018, pp. 586–595

2018
[9]

Exploring CLIP for Assessing the Look and Feel of Images

J. Wang, K. C. Chan, and C. C. Loy. “Exploring CLIP for Assessing the Look and Feel of Images”. In:AAAI37.2 (June 26, 2023), pp. 2555–2563

2023
[10]

Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspec- tives

H. Wu et al. “Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspec- tives”. In:Int Conf Comput. Vis.Paris, France: IEEE, Oct. 1, 2023, pp. 20087–20097

2023
[11]

Image Quality Assessment: Unifying Structure and Texture Similar- ity

K. Ding, K. Ma, S. Wang, and E. P. Simoncelli. “Image Quality Assessment: Unifying Structure and Texture Similar- ity”. In:IEEE Trans. Pattern Anal. Mach. Intell.(May 2020), pp. 2567–2581

2020
[12]

MUSIQ: Multi-scale Image Quality Transformer

J. Ke, Q. Wang, Y . Wang, P. Milanfar, and F. Yang. “MUSIQ: Multi-scale Image Quality Transformer”. In:Int. Conf. Com- put. Vis.Montreal, QC, Canada: IEEE, Oct. 2021, pp. 5128– 5137

2021
[13]

Making a “Completely Blind

A. Mittal, R. Soundararajan, and A. C. Bovik. “Making a “Completely Blind” Image Quality Analyzer”. In:IEEE Signal Process. Lett.20.3 (Mar. 2013), pp. 209–212

2013
[14]

Neighbourhood Representative Sampling for Efficient End-to-End Video Quality Assessment

H. Wu et al. “Neighbourhood Representative Sampling for Efficient End-to-End Video Quality Assessment”. In:Trans. Pattern Anal. Mach. Intell.45.12 (Dec. 2023), pp. 15185– 15202

2023
[15]

A Comparative Study of Super-Resolution Algorithms for Video Streaming Appli- cation

X. He, Y . Qiao, B. Lee, and Y . Ye. “A Comparative Study of Super-Resolution Algorithms for Video Streaming Appli- cation”. In:Multimed. Tools Appl.83.14 (Oct. 13, 2023), pp. 43493–43512

2023
[16]

AIM 2024 Challenge on Video Super- Resolution Quality Assessment: Methods and Results

I. Molodetskikh et al. “AIM 2024 Challenge on Video Super- Resolution Quality Assessment: Methods and Results”. In: Comput. Vis. – ECCV 2024 Workshop. Berlin, Heidelberg: Springer-Verlag, 2025, pp. 160–177

2024
[17]

VSRQAD: Video Super-Resolution Quality Assess- ment Dataset and Benchmark

A. Borisov, E. Bogatyrev, K. Abud, I. Molodetskikh, and D. Vatolin. “VSRQAD: Video Super-Resolution Quality Assess- ment Dataset and Benchmark”. In:IEEE Access14 (2026), pp. 60229–60251

2026
[18]

A VT-VQDB-UHD-1: A Large Scale Video Quality Database for UHD-1

R. R. Ramachandra Rao, S. Göring, W. Robitza, B. Feiten, and A. Raake. “A VT-VQDB-UHD-1: A Large Scale Video Quality Database for UHD-1”. In:Int. Symp. Multimed.San Diego, USA: IEEE, Dec. 2019, pp. 17–177

2019
[19]

Towards Practical Real-Time Neural Video Compression

Z. Jia et al. “Towards Practical Real-Time Neural Video Compression”. In:Proc. Comput. Vis. Pattern Recognit. Conf. June 2025, pp. 12543–12552

2025
[20]

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

T.-S. Chen et al. “Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers”. In:Conf. Comput. Vis. Pattern Recognit.Seattle, USA: IEEE, June 16, 2024, pp. 13320–13331

2024
[21]

Cloudinary, 2022.URL: https://github.com/cloudinary/ssimulacra2 (visited on 01/09/2026)

Jon Sneyers.SSIMULACRA 2 - Structural SIMilarity Unveil- ing Local And Compression Related Artifacts. Cloudinary, 2022.URL: https://github.com/cloudinary/ssimulacra2 (visited on 01/09/2026)

2022
[22]

Google, 2016.URL: https : / / github.com/google/butteraugli (visited on 01/07/2026)

Jyrki Alakuijala.Butteraugli - A Tool for Measuring Perceived Differences between Images. Google, 2016.URL: https : / / github.com/google/butteraugli (visited on 01/07/2026)

2016
[23]

Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression

M. Jenadeleh et al. “Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression”. In: 17th Int. Conf. Qual. Multimed. Exp.Madrid, Spain: IEEE, Sept. 30, 2025, pp. 1–7

2025
[24]

ColorVideoVDP: A Visual Difference Predictor for Image, Video and Display Distortions

R. K. Mantiuk, P. Hanji, M. Ashraf, Y . Asano, and A. Chapiro. “ColorVideoVDP: A Visual Difference Predictor for Image, Video and Display Distortions”. In:ACM Trans. Graph.43.4 (July 19, 2024), pp. 1–20

2024
[25]

PieAPP: Percep- tual Image-Error Assessment Through Pairwise Preference

E. Prashnani, H. Cai, Y . Mostofi, and P. Sen. “PieAPP: Percep- tual Image-Error Assessment Through Pairwise Preference”. In:Conf. Comput. Vis. Pattern Recognit.Salt Lake City, UT: IEEE, June 2018, pp. 1808–1817

2018
[26]

Deep Learning Based Full-Reference and No-Reference Quality Assessment Models for Compressed UGC Videos

W. Sun, T. Wang, X. Min, F. Yi, and G. Zhai. “Deep Learning Based Full-Reference and No-Reference Quality Assessment Models for Compressed UGC Videos”. In:Int Conf Multimed. Expo Workshop. 2021, pp. 1–6

2021
[27]

No-Reference Image Quality Assessment in the Spatial Domain

A. Mittal, A. K. Moorthy, and A. C. Bovik. “No-Reference Image Quality Assessment in the Spatial Domain”. In:IEEE Trans. on Image Process.21.12 (Dec. 2012), pp. 4695–4708

2012
[28]

FAST-VQA: Efficient End-to-End Video Qual- ity Assessment with Fragment Sampling

H. Wu et al. “FAST-VQA: Efficient End-to-End Video Qual- ity Assessment with Fragment Sampling”. In:Comput. Vis. ECCV. Cham: Springer Nature Switzerland, 2022, pp. 538– 554

2022
[29]

Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training

D. Li, T. Jiang, and M. Jiang. “Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training”. In:Int. J. Comput. Vis.129.4 (Apr. 2021), pp. 1238–1257

2021
[30]

Rich Features for Perceptual Quality Assess- ment of UGC Videos

Y . Wang et al. “Rich Features for Perceptual Quality Assess- ment of UGC Videos”. In:Conf. Comput. Vis. Pattern Recog- nit.Nashville, USA: IEEE, June 2021, pp. 13430–13439

2021
[31]

Towards Explainable In-the-Wild Video Qual- ity Assessment: A Database and a Language-Prompted Ap- proach

H. Wu et al. “Towards Explainable In-the-Wild Video Qual- ity Assessment: A Database and a Language-Prompted Ap- proach”. In:Proc. 31st ACM Int. Conf. Multimed.Ottawa ON Canada: ACM, Oct. 26, 2023, pp. 1045–1054

2023
[32]

Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

H. Wu et al. “Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels”. In:Proc. 41st Int. Conf. Mach. Learn.V ol. 235. Vienna, Austria: JMLR.org, July 21, 2024, pp. 54015–54029

2024
[33]

COVER: A Comprehensive Video Quality Eval- uator

C. He et al. “COVER: A Comprehensive Video Quality Eval- uator”. In:Conf. Comput. Vis. Pattern Recognit. Workshop. Seattle, USA: IEEE, June 17, 2024, pp. 5799–5809

2024
[34]

ITU-T.P .910: Subjective Video Quality Assessment Methods for Multimedia Applications. Oct. 2023

2023
[35]

A Vrate V oyager: An Open Source Online Testing Platform

S. Göring, R. R. Ramachandra Rao, S. Fremerey, and A. Raake. “A Vrate V oyager: An Open Source Online Testing Platform”. In:23rd Int. Workshop Multimed. Signal Process. Tampere, Finland: IEEE, Oct. 6, 2021, pp. 1–6

2021
[36]

SOS: The MOS Is Not Enough!

T. Hossfeld, R. Schatz, and S. Egger. “SOS: The MOS Is Not Enough!” In:3rd Int. Workshop Qual. Multimed. Exp. Mechelen, Belgium: IEEE, Sept. 2011, pp. 131–136

2011
[37]

A Large-Scale Evaluation of Subject Rating Behaviour in Visual Quality Assessment Studies

R. R. R. Rao, S. Göring, S. Fremerey, D. Keller, and A. Raake. “A Large-Scale Evaluation of Subject Rating Behaviour in Visual Quality Assessment Studies”. In:17th Int. Workshop Qual. Multimed. Exp.(Madrid, Spain). IEEE, 2025, pp. 1–7

2025
[38]

Averaging Cor- relations: Expected Values and Bias in Combined Pearson Rs and Fisher’s z Transformations

D. M. Corey, W. P. Dunlap, and M. J. Burke. “Averaging Cor- relations: Expected Values and Bias in Combined Pearson Rs and Fisher’s z Transformations”. In:The Journal of General Psychology125.3 (July 1998), pp. 245–261

1998
[39]

Comparing Correlated Correlation Coefficients

X.-L. Meng, R. Rosenthal, and D. B. Rubin. “Comparing Correlated Correlation Coefficients”. In:Psychol. Bull.111 (1992), pp. 172–175

1992
[40]

Netflix, Inc., 2016.URL: https : / / github

Netflix.VMAF - Video Multi-Method Assessment Fusion. Netflix, Inc., 2016.URL: https : / / github. com / Netflix / vmaf (visited on 01/09/2026)

2016
[41]

ITU-T.P .1401: Methods, Metrics and Procedures for Statis- tical Evaluation, Qualification and Comparison of Objective Quality Prediction Models. Jan. 2020

2020

[1] [1]

VCA: Video Complexity Analyzer

V . V . Menon, C. Feldmann, H. Amirpour, M. Ghanbari, and C. Timmerer. “VCA: Video Complexity Analyzer”. In:Proc 13th ACM Multimed. Syst. Conf. Athlone Ireland: ACM, June 14, 2022, pp. 259–264

2022

[2] [2]

A Survey of Deep Learning Video Super-Resolution

A. A. Baniya, T.-K. Lee, P. W. Eklund, and S. Aryal. “A Survey of Deep Learning Video Super-Resolution”. In:IEEE Trans. Emerg. Top. Comput. Intell.8.4 (Aug. 2024), pp. 2655– 2676

2024

[3] [3]

Upscale-A- Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

S. Zhou, P. Yang, J. Wang, Y . Luo, and C. C. Loy. “Upscale-A- Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution”. In:Conf. Comput. Vis. Pattern Recognit.Seattle, USA: IEEE, June 16, 2024, pp. 2535–2545

2024

[4] [4]

Self-Supervised Con- trolNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

S. Shi, J. Xu, L. Lu, Z. Li, and K. Hu. “Self-Supervised Con- trolNet with Spatio-Temporal Mamba for Real-world Video Super-resolution”. In:Conf. Comput. Vis. Pattern Recognit. Nashville, USA: IEEE, June 10, 2025, pp. 7385–7395

2025

[5] [5]

SeedVR: Seeding Infinity in Diffusion Trans- former Towards Generic Video Restoration

J. Wang et al. “SeedVR: Seeding Infinity in Diffusion Trans- former Towards Generic Video Restoration”. In:Conf. Com- put. Vis. Pattern Recognit.Nashville, USA: IEEE, June 10, 2025, pp. 2161–2172

2025

[6] [6]

Chen et al.DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Z. Chen et al.DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution. May 22, 2025. arXiv: 2505.16239. Pre-published

work page arXiv 2025

[7] [7]

Wang et al.SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

J. Wang et al.SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training. June 5, 2025. arXiv: 2506.05301. Pre-published

work page arXiv 2025

[8] [8]

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”. In:Conf Comput. Vis. Pattern Recognit. Salt Lake City, UT: IEEE, June 2018, pp. 586–595

2018

[9] [9]

Exploring CLIP for Assessing the Look and Feel of Images

J. Wang, K. C. Chan, and C. C. Loy. “Exploring CLIP for Assessing the Look and Feel of Images”. In:AAAI37.2 (June 26, 2023), pp. 2555–2563

2023

[10] [10]

Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspec- tives

H. Wu et al. “Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspec- tives”. In:Int Conf Comput. Vis.Paris, France: IEEE, Oct. 1, 2023, pp. 20087–20097

2023

[11] [11]

Image Quality Assessment: Unifying Structure and Texture Similar- ity

K. Ding, K. Ma, S. Wang, and E. P. Simoncelli. “Image Quality Assessment: Unifying Structure and Texture Similar- ity”. In:IEEE Trans. Pattern Anal. Mach. Intell.(May 2020), pp. 2567–2581

2020

[12] [12]

MUSIQ: Multi-scale Image Quality Transformer

J. Ke, Q. Wang, Y . Wang, P. Milanfar, and F. Yang. “MUSIQ: Multi-scale Image Quality Transformer”. In:Int. Conf. Com- put. Vis.Montreal, QC, Canada: IEEE, Oct. 2021, pp. 5128– 5137

2021

[13] [13]

Making a “Completely Blind

A. Mittal, R. Soundararajan, and A. C. Bovik. “Making a “Completely Blind” Image Quality Analyzer”. In:IEEE Signal Process. Lett.20.3 (Mar. 2013), pp. 209–212

2013

[14] [14]

Neighbourhood Representative Sampling for Efficient End-to-End Video Quality Assessment

H. Wu et al. “Neighbourhood Representative Sampling for Efficient End-to-End Video Quality Assessment”. In:Trans. Pattern Anal. Mach. Intell.45.12 (Dec. 2023), pp. 15185– 15202

2023

[15] [15]

A Comparative Study of Super-Resolution Algorithms for Video Streaming Appli- cation

X. He, Y . Qiao, B. Lee, and Y . Ye. “A Comparative Study of Super-Resolution Algorithms for Video Streaming Appli- cation”. In:Multimed. Tools Appl.83.14 (Oct. 13, 2023), pp. 43493–43512

2023

[16] [16]

AIM 2024 Challenge on Video Super- Resolution Quality Assessment: Methods and Results

I. Molodetskikh et al. “AIM 2024 Challenge on Video Super- Resolution Quality Assessment: Methods and Results”. In: Comput. Vis. – ECCV 2024 Workshop. Berlin, Heidelberg: Springer-Verlag, 2025, pp. 160–177

2024

[17] [17]

VSRQAD: Video Super-Resolution Quality Assess- ment Dataset and Benchmark

A. Borisov, E. Bogatyrev, K. Abud, I. Molodetskikh, and D. Vatolin. “VSRQAD: Video Super-Resolution Quality Assess- ment Dataset and Benchmark”. In:IEEE Access14 (2026), pp. 60229–60251

2026

[18] [18]

A VT-VQDB-UHD-1: A Large Scale Video Quality Database for UHD-1

R. R. Ramachandra Rao, S. Göring, W. Robitza, B. Feiten, and A. Raake. “A VT-VQDB-UHD-1: A Large Scale Video Quality Database for UHD-1”. In:Int. Symp. Multimed.San Diego, USA: IEEE, Dec. 2019, pp. 17–177

2019

[19] [19]

Towards Practical Real-Time Neural Video Compression

Z. Jia et al. “Towards Practical Real-Time Neural Video Compression”. In:Proc. Comput. Vis. Pattern Recognit. Conf. June 2025, pp. 12543–12552

2025

[20] [20]

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

T.-S. Chen et al. “Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers”. In:Conf. Comput. Vis. Pattern Recognit.Seattle, USA: IEEE, June 16, 2024, pp. 13320–13331

2024

[21] [21]

Cloudinary, 2022.URL: https://github.com/cloudinary/ssimulacra2 (visited on 01/09/2026)

Jon Sneyers.SSIMULACRA 2 - Structural SIMilarity Unveil- ing Local And Compression Related Artifacts. Cloudinary, 2022.URL: https://github.com/cloudinary/ssimulacra2 (visited on 01/09/2026)

2022

[22] [22]

Google, 2016.URL: https : / / github.com/google/butteraugli (visited on 01/07/2026)

Jyrki Alakuijala.Butteraugli - A Tool for Measuring Perceived Differences between Images. Google, 2016.URL: https : / / github.com/google/butteraugli (visited on 01/07/2026)

2016

[23] [23]

Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression

M. Jenadeleh et al. “Subjective Visual Quality Assessment for High-Fidelity Learning-Based Image Compression”. In: 17th Int. Conf. Qual. Multimed. Exp.Madrid, Spain: IEEE, Sept. 30, 2025, pp. 1–7

2025

[24] [24]

ColorVideoVDP: A Visual Difference Predictor for Image, Video and Display Distortions

R. K. Mantiuk, P. Hanji, M. Ashraf, Y . Asano, and A. Chapiro. “ColorVideoVDP: A Visual Difference Predictor for Image, Video and Display Distortions”. In:ACM Trans. Graph.43.4 (July 19, 2024), pp. 1–20

2024

[25] [25]

PieAPP: Percep- tual Image-Error Assessment Through Pairwise Preference

E. Prashnani, H. Cai, Y . Mostofi, and P. Sen. “PieAPP: Percep- tual Image-Error Assessment Through Pairwise Preference”. In:Conf. Comput. Vis. Pattern Recognit.Salt Lake City, UT: IEEE, June 2018, pp. 1808–1817

2018

[26] [26]

Deep Learning Based Full-Reference and No-Reference Quality Assessment Models for Compressed UGC Videos

W. Sun, T. Wang, X. Min, F. Yi, and G. Zhai. “Deep Learning Based Full-Reference and No-Reference Quality Assessment Models for Compressed UGC Videos”. In:Int Conf Multimed. Expo Workshop. 2021, pp. 1–6

2021

[27] [27]

No-Reference Image Quality Assessment in the Spatial Domain

A. Mittal, A. K. Moorthy, and A. C. Bovik. “No-Reference Image Quality Assessment in the Spatial Domain”. In:IEEE Trans. on Image Process.21.12 (Dec. 2012), pp. 4695–4708

2012

[28] [28]

FAST-VQA: Efficient End-to-End Video Qual- ity Assessment with Fragment Sampling

H. Wu et al. “FAST-VQA: Efficient End-to-End Video Qual- ity Assessment with Fragment Sampling”. In:Comput. Vis. ECCV. Cham: Springer Nature Switzerland, 2022, pp. 538– 554

2022

[29] [29]

Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training

D. Li, T. Jiang, and M. Jiang. “Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training”. In:Int. J. Comput. Vis.129.4 (Apr. 2021), pp. 1238–1257

2021

[30] [30]

Rich Features for Perceptual Quality Assess- ment of UGC Videos

Y . Wang et al. “Rich Features for Perceptual Quality Assess- ment of UGC Videos”. In:Conf. Comput. Vis. Pattern Recog- nit.Nashville, USA: IEEE, June 2021, pp. 13430–13439

2021

[31] [31]

Towards Explainable In-the-Wild Video Qual- ity Assessment: A Database and a Language-Prompted Ap- proach

H. Wu et al. “Towards Explainable In-the-Wild Video Qual- ity Assessment: A Database and a Language-Prompted Ap- proach”. In:Proc. 31st ACM Int. Conf. Multimed.Ottawa ON Canada: ACM, Oct. 26, 2023, pp. 1045–1054

2023

[32] [32]

Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

H. Wu et al. “Q-ALIGN: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels”. In:Proc. 41st Int. Conf. Mach. Learn.V ol. 235. Vienna, Austria: JMLR.org, July 21, 2024, pp. 54015–54029

2024

[33] [33]

COVER: A Comprehensive Video Quality Eval- uator

C. He et al. “COVER: A Comprehensive Video Quality Eval- uator”. In:Conf. Comput. Vis. Pattern Recognit. Workshop. Seattle, USA: IEEE, June 17, 2024, pp. 5799–5809

2024

[34] [34]

ITU-T.P .910: Subjective Video Quality Assessment Methods for Multimedia Applications. Oct. 2023

2023

[35] [35]

A Vrate V oyager: An Open Source Online Testing Platform

S. Göring, R. R. Ramachandra Rao, S. Fremerey, and A. Raake. “A Vrate V oyager: An Open Source Online Testing Platform”. In:23rd Int. Workshop Multimed. Signal Process. Tampere, Finland: IEEE, Oct. 6, 2021, pp. 1–6

2021

[36] [36]

SOS: The MOS Is Not Enough!

T. Hossfeld, R. Schatz, and S. Egger. “SOS: The MOS Is Not Enough!” In:3rd Int. Workshop Qual. Multimed. Exp. Mechelen, Belgium: IEEE, Sept. 2011, pp. 131–136

2011

[37] [37]

A Large-Scale Evaluation of Subject Rating Behaviour in Visual Quality Assessment Studies

R. R. R. Rao, S. Göring, S. Fremerey, D. Keller, and A. Raake. “A Large-Scale Evaluation of Subject Rating Behaviour in Visual Quality Assessment Studies”. In:17th Int. Workshop Qual. Multimed. Exp.(Madrid, Spain). IEEE, 2025, pp. 1–7

2025

[38] [38]

Averaging Cor- relations: Expected Values and Bias in Combined Pearson Rs and Fisher’s z Transformations

D. M. Corey, W. P. Dunlap, and M. J. Burke. “Averaging Cor- relations: Expected Values and Bias in Combined Pearson Rs and Fisher’s z Transformations”. In:The Journal of General Psychology125.3 (July 1998), pp. 245–261

1998

[39] [39]

Comparing Correlated Correlation Coefficients

X.-L. Meng, R. Rosenthal, and D. B. Rubin. “Comparing Correlated Correlation Coefficients”. In:Psychol. Bull.111 (1992), pp. 172–175

1992

[40] [40]

Netflix, Inc., 2016.URL: https : / / github

Netflix.VMAF - Video Multi-Method Assessment Fusion. Netflix, Inc., 2016.URL: https : / / github. com / Netflix / vmaf (visited on 01/09/2026)

2016

[41] [41]

ITU-T.P .1401: Methods, Metrics and Procedures for Statis- tical Evaluation, Qualification and Comparison of Objective Quality Prediction Models. Jan. 2020

2020