arxiv: 2604.24123 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs

Jiayi Wang , Lichun Zhang , Xiaoqi Zhuang , Jiaqi Zhang , Lu Yu , Yin Zhao

Authors on Pith no claims yet

Pith reviewed 2026-05-08 04:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords video quality assessmentvideo quality metricneural video codecsfeature distancegeneralizationHDR videoSDR videoperceptual quality

0 comments

The pith

FDIM measures video quality through distances in a hybrid space of deep multi-scale features and hand-crafted features, producing scores that align with human ratings across codecs never seen during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video compression now includes neural codecs whose artifacts vary with content and differ from those of older block-based methods, so quality metrics must handle both kinds without retraining. FDIM computes quality by comparing reference and compressed videos in feature spaces that combine learned deep representations at multiple scales with stable hand-crafted cues. The model is trained once on a large collection of videos encoded by a mix of traditional and neural codecs. When tested on separate collections that contain codecs and content outside the training set, the resulting scores track subjective human judgments closely for both standard and high dynamic range video. This approach matters because it lets researchers compare and refine any new codec using the same automatic metric instead of running fresh human studies for each format or architecture.

Core claim

FDIM employs a hybrid architecture that integrates deep and hand-crafted features. The deep feature component learns multi-scale representations to capture distortions ranging from structural and textural fidelity degradation to high-level semantic deviations, while the hand-crafted feature component provides stable complementary cues to improve overall generalization. We trained FDIM on a large-scale subjective quality assessment dataset consisting of over 16k video sequences encoded by traditional block-based hybrid video codecs and end-to-end perceptually optimized neural video codecs. Extensive experiments on ten SDR/HDR VQA datasets containing diverse, previously unseen codecs show that

What carries the argument

Hybrid feature-distance calculation that combines multi-scale deep representations with hand-crafted descriptors to quantify perceptual differences between a reference video and its compressed version.

If this is right

The same trained model can be applied directly to evaluate quality for any new neural video codec without retraining or fine-tuning.
FDIM produces usable scores for both standard dynamic range and high dynamic range video without separate versions of the metric.
Codec developers can use FDIM scores to guide optimization loops and benchmark competing algorithms on equal footing.
The method reduces the frequency of large-scale subjective tests when exploring new compression techniques or content types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the hybrid distance remains stable, FDIM could be embedded inside encoders to adjust parameters on the fly for better perceptual results.
The same distance-based idea might transfer to quality assessment for still images or volumetric content by swapping in appropriate feature extractors.
Real-world streaming systems that switch between multiple codecs could adopt FDIM to monitor delivered quality uniformly.
Further tests on generative compression methods that produce artifacts outside the current training distribution would clarify the limits of the generalization.

Load-bearing premise

The specific deep and hand-crafted features chosen and the distances learned from the training videos will continue to match human perception when applied to entirely new codec designs and video contents.

What would settle it

Apply FDIM to videos encoded by a newly developed neural codec architecture absent from both the training set and the ten evaluation datasets, collect fresh subjective ratings from human viewers, and check whether the correlation between FDIM scores and those ratings falls substantially below the levels reported on the original test sets.

Figures

Figures reproduced from arXiv: 2604.24123 by Jiaqi Zhang, Jiayi Wang, Lichun Zhang, Lu Yu, Xiaoqi Zhuang, Yin Zhao.

**Figure 1.** Figure 1: Illustration of characteristic artifacts introduced by neural video codecs in comparison with traditional codecs. For each sequence, frames are uniformly view at source ↗

**Figure 2.** Figure 2: Architecture of FDIM. The framework consists of a deep feature component and a complementary hand-crafted feature component based on VMAF view at source ↗

**Figure 3.** Figure 3: SROCC performance comparison across different dataset-codec view at source ↗

**Figure 4.** Figure 4: Performance of FDIM(Deep) with different scales of training data view at source ↗

**Figure 5.** Figure 5: Visualization of multi-scale features H˜ (s,t) . For each scale, feature maps are averaged along the channel dimension, normalized to [0, 1], and upsampled to the input resolution for overlay visualization. We report two scores for FDIM: the raw output score and a linearly mapped score using the dataset MOS scale, which facilitates observing the deviation from MOS. 0 10 20 30 40 50 Inference Time (s) on 10… view at source ↗

**Figure 6.** Figure 6: Comparison of model performance (average SROCC on five SDR view at source ↗

read the original abstract

Video technology is advancing toward Ultra High Definition (UHD) and High Dynamic Range (HDR), which intensifies the need for higher compression efficiency for these high-specification videos. Beyond advances in traditional codecs, neural video codecs (NVCs) have attracted significant research attention and have evolved rapidly over the past few years. The coding artifacts of NVCs often exhibit content-varying and generative characteristics, which differ from those of conventional codecs and are challenging for traditional video quality assessment (VQA) methods to capture. Therefore, VQA metrics are required to generalize across different codecs, content types, and dynamic ranges to better support video codec research and evaluation. In this paper, we propose FDIM, a feature-distance-based generic video quality metric for both traditional and neural video codecs across SDR and HDR formats. FDIM employs a hybrid architecture that integrates deep and hand-crafted features. The deep feature component learns multi-scale representations to capture distortions ranging from structural and textural fidelity degradation to high-level semantic deviations, while the hand-crafted feature component provides stable complementary cues to improve overall generalization. We trained FDIM on a large-scale subjective quality assessment dataset (DCVQA) consisting of over 16k video sequences encoded by traditional block-based hybrid video codecs and end-to-end perceptually optimized neural video codecs. Extensive experiments on ten SDR/HDR VQA datasets containing diverse, previously unseen codecs demonstrate that FDIM achieves strong generalization and high correlation with subjective assessment. The source code for FDIM and the DCVQA validation set will be released at https://github.com/MCL-ZJU/FDIM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FDIM is a practical hybrid VQA metric that combines deep multi-scale features with hand-crafted cues and shows decent generalization across traditional and neural codecs on held-out data.

read the letter

FDIM is a new feature-distance video quality metric built to handle both block-based codecs and neural ones, plus SDR and HDR content. The core idea is a hybrid setup where multi-scale deep features catch structural, textural, and semantic distortions while hand-crafted cues add stability for better cross-codec performance. They trained it on the new DCVQA set of over 16k sequences that mixes both codec families, then tested on ten other SDR/HDR datasets with previously unseen codecs. The abstract and experiments claim high subjective correlations and solid generalization, which is the main practical takeaway. Releasing the code and validation set is also a clear positive for anyone who wants to check or extend the work. The experiments are broad enough to give the generalization claim some weight, and the focus on neural codec artifacts matches a real need in current compression research. The main soft spot is that FDIM is still supervised on human scores, so its numbers depend on how well DCVQA represents future codecs and content. Without seeing the exact correlation values or ablations, it is hard to tell how much the hand-crafted features actually add beyond the deep ones alone. If the training set has biases in artifact types or subjective distributions, performance on truly novel codecs could drop, though the scale of the ten-dataset test helps. This paper is aimed at codec developers and VQA researchers who need metrics that do not collapse when moving to neural compression. Readers working on evaluation pipelines will get the most out of the multi-dataset comparisons. It has enough concrete experiments and a timely problem to deserve a serious referee rather than a desk reject. I would send it for peer review so the training details, fusion method, and exact performance numbers can be checked properly.

Referee Report

2 major / 4 minor

Summary. The manuscript proposes FDIM, a hybrid feature-distance-based video quality metric for traditional and neural video codecs across SDR and HDR. It combines multi-scale deep features (capturing structural, textural, and semantic distortions) with hand-crafted features for complementary cues, trained supervised on the large DCVQA dataset (>16k sequences from block-based and end-to-end neural codecs). Extensive tests on ten held-out SDR/HDR VQA datasets with previously unseen codecs are reported to show strong generalization and high correlation with subjective scores; code and DCVQA validation set will be released.

Significance. If the reported generalization holds under scrutiny, FDIM would address a timely need for versatile VQA tools that handle generative artifacts from neural codecs alongside conventional ones. The hybrid architecture, scale of the DCVQA training set, and planned public release of code plus validation data are concrete strengths that support reproducibility and further research in codec evaluation.

major comments (2)

[§4] §4 (Experiments and Results): The central generalization claim relies on performance across ten held-out datasets, yet the manuscript provides no ablation isolating the contribution of the hand-crafted feature component versus deep features alone (or versus a pure deep baseline). Without these controls it remains unclear whether the hybrid design is load-bearing for the reported gains on unseen codecs.
[§3.2] §3.2 (Hybrid Feature Fusion): The fusion mechanism for deep and hand-crafted distances is described at a high level but lacks explicit equations or pseudocode for the weighting scheme and training objective. This makes it difficult to assess whether the reported robustness stems from the architecture or from post-hoc tuning on DCVQA subjective scores.

minor comments (4)

[Abstract] Abstract: Key quantitative results (e.g., average PLCC/SROCC across the ten datasets) are omitted; adding the headline numbers would immediately strengthen the summary of the generalization claim.
[§2] §2 (Related Work): Several recent neural codec VQA papers are cited, but the discussion does not explicitly contrast FDIM's feature-distance approach with recent no-reference or full-reference neural metrics; a short comparative table would improve context.
[Figure 3] Figure 3 and Table 4: Axis labels and legend entries are too small for readability in print; increasing font size and adding error bars or confidence intervals on correlation plots would aid interpretation.
[§5] §5 (Conclusion): The limitations paragraph mentions content dependence but does not discuss potential failure modes on extreme HDR content or very low-bitrate neural codecs; expanding this would be useful.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the timely need for versatile VQA metrics, and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly to strengthen clarity and evidence.

read point-by-point responses

Referee: [§4] §4 (Experiments and Results): The central generalization claim relies on performance across ten held-out datasets, yet the manuscript provides no ablation isolating the contribution of the hand-crafted feature component versus deep features alone (or versus a pure deep baseline). Without these controls it remains unclear whether the hybrid design is load-bearing for the reported gains on unseen codecs.

Authors: We agree that an explicit ablation would strengthen the evidence for the hybrid design. The original manuscript reported the full FDIM performance on the ten held-out datasets but did not include component-wise ablations. In the revised version we will add a new ablation subsection (in §4) that compares (i) the complete hybrid FDIM, (ii) a deep-features-only variant, and (iii) a hand-crafted-features-only baseline, all trained and evaluated under identical conditions on the same held-out sets. This will directly quantify the incremental benefit of the hybrid fusion for generalization across unseen codecs. revision: yes
Referee: [§3.2] §3.2 (Hybrid Feature Fusion): The fusion mechanism for deep and hand-crafted distances is described at a high level but lacks explicit equations or pseudocode for the weighting scheme and training objective. This makes it difficult to assess whether the reported robustness stems from the architecture or from post-hoc tuning on DCVQA subjective scores.

Authors: We acknowledge that the description in §3.2 was kept at a high level. In the revised manuscript we will expand this section with (a) explicit equations defining the per-scale deep-feature distances, the hand-crafted distance terms, and the learned fusion weights; (b) the composite training objective (including the regression loss and any regularization terms); and (c) pseudocode for the end-to-end training and inference pipeline. These additions will make the architecture and training procedure fully reproducible and will clarify that the weighting is learned jointly rather than tuned post-hoc. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents FDIM as a supervised hybrid feature-distance metric trained on the DCVQA subjective dataset and evaluated for generalization on ten held-out SDR/HDR datasets with unseen codecs. No derivation chain, equations, or first-principles claims are advanced that reduce by construction to the training inputs or prior self-citations. The performance claims rest on empirical cross-dataset correlations rather than any self-definitional, fitted-prediction, or uniqueness-imported tautology. This is the expected non-circular outcome for a data-driven VQA metric paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on fitted model parameters from supervised training and the domain assumption that feature-space distances serve as a proxy for perceived quality; no new physical entities are postulated.

free parameters (2)

deep feature extractor parameters
Weights of the multi-scale deep network trained end-to-end on DCVQA subjective scores to minimize prediction error.
hybrid fusion weights
Parameters that combine deep and hand-crafted feature distances, fitted during training.

axioms (1)

domain assumption Distances in learned deep feature spaces and hand-crafted feature spaces correlate monotonically with human-perceived video quality degradation.
Invoked as the justification for using feature distance as the quality metric instead of pixel-level or other direct measures.

pith-pipeline@v0.9.0 · 5601 in / 1306 out tokens · 34859 ms · 2026-05-08T04:49:52.218031+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 3 canonical work pages

[1]

Overview of the high efficiency video coding (HEVC) standard,

G. J. Sullivan, J.-R. Ohm, W. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649– 1668, 2012

2012
[2]

Overview of the versatile video coding (VVC) standard and its applications,

B. Bross, Y .-K. Wang, Y . Ye, S. Liu, J. Chen, G. J. Sullivan, and J.- R. Ohm, “Overview of the versatile video coding (VVC) standard and its applications,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736–3764, 2021

2021
[3]

Algorithm description for neural network-based video coding (nnvc- 6.0),

“Algorithm description for neural network-based video coding (nnvc- 6.0),” Joint Video Experts Team (JVET) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC 29 AE2019-v2, 2023

2023
[4]

DVC: An end-to-end deep video compression framework,

G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An end-to-end deep video compression framework,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 006–11 015

2019
[5]

FVC: A new framework towards deep video compression in feature space,

Z. Hu, W. Xu, J. Hu, S. Liu, W. Yang, and W. Lin, “FVC: A new framework towards deep video compression in feature space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1502–1511

2021
[6]

AlphaVC: High-performance and efficient learned video compression,

Y . Shi, Y . Ge, J. Wang, and J. Mao, “AlphaVC: High-performance and efficient learned video compression,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 616–631

2022
[7]

Towards practical real-time neural video compression,

Z. Jia, B. Li, J. Li, W. Xie, L. Qi, H. Li, and Y . Lu, “Towards practical real-time neural video compression,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 12 543–12 552

2025
[8]

Deep hierarchies in the primate visual cortex: What can we learn for computer vision?

N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez-Sanchez, and L. Wiskott, “Deep hierarchies in the primate visual cortex: What can we learn for computer vision?”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1847–1871, 2013

2013
[9]

Image quality assessment: From error visibility to structural similarity,

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004

2004
[10]

Image information and visual quality,

H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006

2006
[11]

FSIM: A feature similarity index for image quality assessment,

L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,”IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011

2011
[12]

The perception-distortion tradeoff,

Y . Blau and T. Michaeli, “The perception-distortion tradeoff,” inPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

2018
[13]

Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment,

J. Kim, H. Zeng, D. Ghadiyaram, S. Lee, L. Zhang, and A. C. Bovik, “Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment,”IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 130–141, 2017

2017
[14]

Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network,

W. Kim, J. Kim, S. Ahn, J. Kim, and S. Lee, “Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network,” inEuropean Conference on Computer Vision, 2018, pp. 219–234

2018
[15]

Perceptual quality assessment of internet videos,

J. Xu, J. Li, X. Zhou, W. Zhou, B. Wang, and Z. Chen, “Perceptual quality assessment of internet videos,” inProceedings of the ACM International Conference on Multimedia, 2020, pp. 1248–1257

2020
[16]

Attentions help CNNs see better: Attention-based hybrid image quality assessment network,

S. Laoet al., “Attentions help CNNs see better: Attention-based hybrid image quality assessment network,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022, pp. 1140–1149

2022
[17]

StarVQA: Space- time attention for video quality assessment,

F. Xing, Y .-G. Wang, H. Wang, L. Li, and G. Zhu, “StarVQA: Space- time attention for video quality assessment,” in2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 2326–2330

2022
[18]

DisCoVQA: Temporal distortion-content transformers for video quality assessment,

H. Wu, C. Chen, L. Liao, J. Hou, W. Sun, Q. Yan, and W. Lin, “DisCoVQA: Temporal distortion-content transformers for video quality assessment,” 2022

2022
[19]

TOPIQ: A top-down approach from semantics to distortions for image quality assessment,

J. Chenet al., “TOPIQ: A top-down approach from semantics to distortions for image quality assessment,”IEEE Transactions on Image Processing, vol. 33, pp. 2404–2418, 2024

2024
[20]

A study of subjective and objective quality assessment of HDR videos,

Z. Shang, J. P. Ebenezer, A. K. Venkataramanan, Y . Wu, H. Wei, S. Sethuraman, and A. C. Bovik, “A study of subjective and objective quality assessment of HDR videos,”IEEE Trans. Image Process., vol. 33, pp. 42–57, 2024

2024
[21]

2304.13625 , archivePrefix=

R. K. Mantiuk, D. Hammou, and P. Hanji, “Hdr-vdp-3: A multi-metric for predicting image differences, quality and contrast distortions in high dynamic range and regular content,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13625

work page arXiv 2023
[22]

ColorVideoVDP: A visual difference predictor for image, video and display distortions,

R. K. Mantiuket al., “ColorVideoVDP: A visual difference predictor for image, video and display distortions,”ACM Transactions on Graphics, vol. 40, no. 4, 2021

2021
[23]

Pu21: A novel perceptually uniform encoding for adapting existing quality metrics for hdr,

R. K. Mantiuk and M. Azimi, “Pu21: A novel perceptually uniform encoding for adapting existing quality metrics for hdr,” in2021 Picture Coding Symposium (PCS), 2021, pp. 1–5

2021
[24]

Mean squared error: Love it or leave it? A new look at signal fidelity measures,

Z. Wang and A. C. Bovik, “Mean squared error: Love it or leave it? A new look at signal fidelity measures,”IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009

2009
[25]

Multiscale structural similarity for image quality assessment,

Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” inProceedings of the 37th Asilomar Conference on Signals, Systems and Computers, vol. 2, 2003, pp. 1398–1402

2003
[26]

A new standardized method for objectively measuring video quality,

M. H. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,”IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312–322, 2004. 13

2004
[27]

Motion tuned spatio-temporal quality assessment of natural videos,

K. Seshadrinathan and A. C. Bovik, “Motion tuned spatio-temporal quality assessment of natural videos,”IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, 2010

2010
[28]

Toward a practical perceptual video quality metric,

Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, “Toward a practical perceptual video quality metric,” Netflix Technology Blog, 2016

2016
[29]

Spatiotemporal feature integra- tion and model fusion for full reference video quality assessment,

C. G. Bampis, Z. Li, and A. C. Bovik, “Spatiotemporal feature integra- tion and model fusion for full reference video quality assessment,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2256–2270, 2019

2019
[30]

Enhancing VMAF through new feature integration and model combi- nation,

F. Zhang, A. Katsenou, C. Bampis, L. Krasula, Z. Li, and D. Bull, “Enhancing VMAF through new feature integration and model combi- nation,” in2021 Picture Coding Symposium (PCS), 2021, pp. 1–5

2021
[31]

High frame rate video quality assessment using VMAF and entropic differences,

P. C. Madhusudana, N. Birkbeck, Y . Wang, B. Adsumilli, and A. C. Bovik, “High frame rate video quality assessment using VMAF and entropic differences,” in2021 Picture Coding Symposium (PCS), 2021, pp. 1–5

2021
[32]

Perceptual video quality prediction emphasizing chroma distortions,

L.-H. Chen, C. G. Bampis, Z. Li, J. Sole, and A. C. Bovik, “Perceptual video quality prediction emphasizing chroma distortions,”IEEE Trans- actions on Image Processing, vol. 30, pp. 1408–1422, 2021

2021
[33]

One transform to compute them all: Efficient fusion-based full-reference video quality assessment,

A. K. Venkataramanan, C. Stejerean, I. Katsavounidis, and A. C. Bovik, “One transform to compute them all: Efficient fusion-based full-reference video quality assessment,”IEEE Transactions on Image Processing, vol. 33, pp. 509–524, 2024

2024
[34]

Quality assessment of in-the-wild videos,

D. Li, T. Jiang, and M. Jiang, “Quality assessment of in-the-wild videos,” inProceedings of the ACM International Conference on Multimedia, 2019, pp. 2351–2359

2019
[35]

dipiq: Blind image quality assessment by learning-to-rank discriminable image pairs,

K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao, “dipiq: Blind image quality assessment by learning-to-rank discriminable image pairs,”IEEE Transactions on Image Processing, vol. 26, no. 8, p. 3951–3964, 2017

2017
[36]

RankDVQA: Deep VQA based on ranking-inspired hybrid training,

C. Feng, D. Danier, F. Zhang, and D. Bull, “RankDVQA: Deep VQA based on ranking-inspired hybrid training,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 1648–1658

2024
[37]

Conviqt: Contrastive video quality estimator,

P. C. Madhusudana, N. Birkbeck, Y . Wang, B. Adsumilli, and A. C. Bovik, “Conviqt: Contrastive video quality estimator,”IEEE Transac- tions on Image Processing, vol. 32, pp. 5138–5152, 2023

2023
[38]

arXiv preprint arXiv:2312.17090 (2023)

H. Wu, Z. Zhang, W. Zhang, C. Chen, C. Li, L. Liao, A. Wang, E. Zhang, W. Sun, Q. Yan, X. Min, G. Zhai, and W. Lin, “Q-Align: Teaching lmms for visual scoring via discrete text-defined levels,”arXiv preprint arXiv:2312.17090, 2023, equal Contribution by Wu, Haoning and Zhang, Zicheng. Corresponding Authors: Zhai, Guangtao and Lin, Weisi

work page arXiv 2023
[39]

Lmm-vqa: Advancing video quality assessment with large multimodal models,

Q. Ge, W. Sun, Y . Zhang, Y . Li, Z. Ji, F. Sun, S. Jui, X. Min, and G. Zhai, “Lmm-vqa: Advancing video quality assessment with large multimodal models,” 2024. [Online]. Available: https://arxiv.org/abs/2408.14008

work page arXiv 2024
[40]

Overview of intelligent video coding: from model-based to learning- based approaches,

S. Ma, J. Gao, R. Wang, J. Chang, Q. Mao, Z. Huang, and C. Jia, “Overview of intelligent video coding: from model-based to learning- based approaches,”Visual Intelligence, vol. 1, no. 1, p. 15, 2023

2023
[41]

The unreasonable effectiveness of deep features as a perceptual metric,

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595

2018
[42]

Full reference video quality assessment for machine learning- based video codecs,

A. Majeedi, B. Naderi, Y . Hosseinkashi, J. Cho, R. A. Martinez, and R. Cutler, “Full reference video quality assessment for machine learning- based video codecs,” 2023

2023
[43]

Evaluating video quality metrics for neural and traditional codecs using 4k/uhd-1 videos,

B. Herb, R. R. R. Rao, S. G ¨oring, and A. Raake, “Evaluating video quality metrics for neural and traditional codecs using 4k/uhd-1 videos,” inProc. Picture Coding Symposium (PCS), 2025

2025
[44]

A technical overview of A V1,

J. Han, B. Li, D. Mukherjee, C.-H. Chiang, A. Grange, C. Chen, H. Su, S. Parker, S. Deng, U. Joshi, Y . Chen, Y . Wang, P. Wilkins, Y . Xu, and J. Bankoski, “A technical overview of A V1,”Proceedings of the IEEE, vol. 109, no. 9, pp. 1435–1462, 2021

2021
[45]

Overview of the versatile video coding (VVC) standard and its applications,

B. Bross, Y . Wang, Y . Ye, S. Liu, J. Chen, G. J. Sullivan, and J. Ohm, “Overview of the versatile video coding (VVC) standard and its applications,”IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 10, pp. 3736–3764, 2021

2021
[46]

Neural video compression with feature modulation,

J. Li, B. Li, and Y . Lu, “Neural video compression with feature modulation,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26 099–26 108

2024
[47]

Subjective visual quality assessment for high-fidelity learning-based image compression,

M. Jenadeleh, J. Sneyers, P. Jia, S. Mohammadi, J. Ascenso, and D. Saupe, “Subjective visual quality assessment for high-fidelity learning-based image compression,” inProc. QoMEX, 2025

2025
[48]

Jpeg ai: The first international standard for image coding based on an end-to-end learning-based approach,

E. Alshina, J. Ascenso, and T. Ebrahimi, “Jpeg ai: The first international standard for image coding based on an end-to-end learning-based approach,”IEEE MultiMedia, vol. 31, no. 4, pp. 60–69, 2024

2024
[49]

A funque approach to the quality assessment of compressed hdr videos,

A. K. Venkataramanan, C. Stejerean, I. Katsavounidis, and A. C. Bovik, “A funque approach to the quality assessment of compressed hdr videos,” in2024 Picture Coding Symposium (PCS), 2024, pp. 1–5

2024
[50]

Adapting pretrained networks for image quality assessment on high dynamic range displays,

A. Chubarau, H. Yoo, T. Akhavan, and J. Clark, “Adapting pretrained networks for image quality assessment on high dynamic range displays,” 2024

2024
[51]

HIDRO-VQA: High dynamic range oracle for video quality assessment,

S. Saini, A. Saha, and A. C. Bovik, “HIDRO-VQA: High dynamic range oracle for video quality assessment,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Work- shops, January 2024, pp. 469–479

2024
[52]

Compressedvqa-hdr: Generalized full-reference and no-reference qual- ity assessment models for compressed high dynamic range videos,

W. Sun, L. Cao, K. Fu, D. Zhu, J. Jia, M. Hu, X. Min, and G. Zhai, “Compressedvqa-hdr: Generalized full-reference and no-reference qual- ity assessment models for compressed high dynamic range videos,” in 2025 IEEE International Conference on Multimedia and Expo Work- shops (ICMEW), 2025, pp. 1–6

2025
[53]

HDR-VDP-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions,

R. Mantiuk, K. J. Kim, A. G. Rempel, and W. Heidrich, “HDR-VDP-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions,”ACM Transactions on Graphics, vol. 30, no. 4, pp. 1–14, 2011

2011
[54]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

2016
[55]

Imagenet large scale visual recognition challenge,

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernsteinet al., “Imagenet large scale visual recognition challenge,”International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015

2015
[56]

Deformable convolutional networks,

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei, “Deformable convolutional networks,” inProceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773

2017
[57]

Seven challenges in image quality assessment: Past, present, and future research,

D. M. Chandler, “Seven challenges in image quality assessment: Past, present, and future research,”ISRN Signal Processing, vol. 2013, pp. 1–53, 2013

2013
[58]

CBAM: Convolutional block attention module,

S. Woo, J. Park, J.-Y . Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” inEuropean Conference on Computer Vision, 2018, pp. 3–19

2018
[59]

Call for technical proposals for the avs objective video quality assess- ment standard,

“Call for technical proposals for the avs objective video quality assess- ment standard,” A VS N3856, 2024

2024
[60]

Consumer Digital Video Library,

Institute for Telecommunication Sciences, “Consumer Digital Video Library,” https://www.cdvl.org/
[61]

Xiph Video Codec Test Media,

Xiph.Org Foundation, “Xiph Video Codec Test Media,” https://media. xiph.org/
[62]

IVP Subjective Quality Video Database,

Chinese University of Hong Kong, “IVP Subjective Quality Video Database,” https://ivp.ee.cuhk.edu.hk/research/database/subjective/index. shtml
[63]

Towards perceptually optimized end-to-end adaptive video streaming,

C. Bampis, Z. Li, I. Katsavounidis, T.-Y . Huang, C. Ekanadham, and A. Bovik, “Towards perceptually optimized end-to-end adaptive video streaming,” 2018

2018
[64]

Gamingvideoset: A dataset for gaming video streaming applications,

N. Barman, S. Zadtootaghaj, S. Schmidt, M. G. Martini, and S. M ¨oller, “Gamingvideoset: A dataset for gaming video streaming applications,” in2018 16th Annual Workshop on Network and Systems Support for Games (NetGames), 2018, pp. 1–6

2018
[65]

Update on a perceptually optimized learning-based video codec for cvqm

“Update on a perceptually optimized learning-based video codec for cvqm.” ISO/IEC JTC 1/SC 29/AG 5 m69717, 2024

2024
[66]

Generative adversarial networks,

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial networks,” 2014

2014
[67]

A statistical evaluation of recent full reference image quality assessment algorithms,

H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,”IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, 2006

2006
[68]

Image quality assess- ment: Unifying structure and texture similarity,

K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assess- ment: Unifying structure and texture similarity,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2567– 2581, 2022

2022
[69]

VTM and HM common test conditions and evaluation procedures for HDR/WCG video

A. Segall, E. Franc ¸ois, W. Husak, S. Iwamura, and D. Rusanovskyy, “VTM and HM common test conditions and evaluation procedures for HDR/WCG video.” Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29 JVET-AC2011, 2023

2023