pith. machine review for the scientific record.

arxiv: 2604.19159 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.LG

Recognition: unknown

MSDS: Deep Structural Similarity with Multiscale Representation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:50 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords multiscale representation · deep structural similarity · image quality assessment · perceptual similarity · deep features · pyramid levels

The pith

Multiscale representation improves deep structural similarity by treating spatial scale as an independent factor in perceptual models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most deep-feature perceptual similarity models operate at a single spatial scale and implicitly assume this fixed resolution captures all necessary structural information. The paper isolates spatial scale as a separate variable by extending DeepSSIM into MSDS, which runs the similarity computation independently at each level of an image pyramid and then combines the per-scale scores using a small set of learnable global weights. Experiments across standard benchmark datasets for image quality assessment show consistent gains over the single-scale version, with the added steps introducing negligible extra computation. A sympathetic reader would conclude that scale information contributes meaningfully to how well these models match human visual judgments.

Core claim

The paper introduces MSDS as a minimal multiscale extension of DeepSSIM. It decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
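The review text does not reproduce the fusion rule, but under the description above it plausibly reduces to a weighted sum over pyramid levels (notation assumed here, not quoted from the paper):

\[
\mathrm{MSDS}(x, y) \;=\; \sum_{s=1}^{S} w_s \,\mathrm{DeepSSIM}\!\big(x^{(s)}, y^{(s)}\big), \qquad w_s \ge 0,
\]

where \(x^{(s)}\) denotes level \(s\) of the image pyramid and \(w_1, \dots, w_S\) are the learnable global weights; whether the weights are normalized (e.g., \(\sum_s w_s = 1\)) is a further assumption.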

What carries the argument

Independent DeepSSIM computation at each pyramid level followed by fusion with learnable global weights
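Read mechanically, that machinery is a thin wrapper around the base metric. Below is a minimal PyTorch sketch, with deep_ssim standing in for the paper's per-scale similarity (not reproduced here) and the average-pool pyramid and softmax normalization chosen for illustration rather than taken from the paper:

```python
# Minimal sketch of the fusion mechanism, assuming `deep_ssim` is any
# callable mapping a (reference, distorted) batch to one score per image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleFusion(nn.Module):
    def __init__(self, deep_ssim, num_scales=4):
        super().__init__()
        self.deep_ssim = deep_ssim
        self.num_scales = num_scales
        # One learnable global scalar per pyramid level -- the only new
        # parameters, consistent with the negligible-overhead claim.
        self.logits = nn.Parameter(torch.zeros(num_scales))

    def pyramid(self, img):
        # One plausible pyramid: repeated 2x average-pool downsampling.
        levels = [img]
        for _ in range(self.num_scales - 1):
            img = F.avg_pool2d(img, kernel_size=2)
            levels.append(img)
        return levels

    def forward(self, ref, dist):
        # Compute the base similarity independently at each scale ...
        scores = torch.stack(
            [self.deep_ssim(r, d)
             for r, d in zip(self.pyramid(ref), self.pyramid(dist))],
            dim=-1,
        )  # shape: (batch, num_scales)
        # ... then fuse the per-scale scores with normalized global weights.
        weights = torch.softmax(self.logits, dim=0)
        return (scores * weights).sum(dim=-1)
```

With a backbone-based deep_ssim in hand, MultiscaleFusion(deep_ssim)(ref, dist) yields one fused score per image pair, and only the num_scales logits are trained. Softmax normalization is one convenient way to keep the fused score on the base metric's scale; the paper may instead use unconstrained weights.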

If this is right

  • Deep perceptual similarity models gain measurable accuracy by accounting for spatial scale explicitly.
  • The added multiscale processing requires only a small number of extra parameters and negligible runtime.
  • Global-weight fusion suffices to integrate information across scales without needing more elaborate cross-scale mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar minimal multiscale extensions could be tested on other deep-feature similarity metrics to check for comparable gains.
  • Image quality assessment pipelines that already use deep features might see routine benefits from incorporating pyramid-level processing.
  • The approach leaves room for future variants that learn scale-specific weights instead of a single global set.

Load-bearing premise

The improvements stem cleanly from adding multiple scales rather than from the choice of pyramid construction or the fitting of the fusion weights themselves.

What would settle it

Repeating the experiments using a different pyramid construction method or replacing the learnable weights with fixed equal weights, and finding that the performance gains disappear.
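One concrete form of the equal-weight control, as a minimal sketch reusing the MultiscaleFusion module sketched earlier: pin the fusion logits to a constant (softmax then yields uniform weights) and compare rank correlation with human opinion scores against the learned-weight run. Here loader and mos are assumed stand-ins for a benchmark's image pairs and mean opinion scores.

```python
# Equal-weight ablation: if the gain survives uniform fusion weights, it is
# attributable to the scales themselves rather than to weight fitting.
import torch
from scipy.stats import spearmanr

with torch.no_grad():
    msds.logits.fill_(0.0)  # equal logits -> softmax gives uniform weights
    preds = torch.cat([msds(ref, dist) for ref, dist in loader])

srcc_uniform = spearmanr(preds.cpu().numpy(), mos).correlation
```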

Figures

Figures reproduced from arXiv: 2604.19159 by Bin Liu, Danling Kang, Keke Zhang, Tiesong Zhao, Weiling Chen, Xue-Hua Chen.

Figure 1. Motivation of multiscale deep structural similarity. Single-scale DeepSSIM assumes a fixed spatial scale is sufficient, introducing scale bias under frequency-diverse distortions. MSDS explicitly models deep structural similarity across multiple scales and integrates them for scale-aware perceptual quality assessment. While multiscale modeling has been extensively studied in traditional IQA, such as MS-SSI…
Figure 2. Overall architecture of the proposed MSDS framework. Given a reference image and a distorted image, Gaussian pyramids are first constructed to generate multiscale representations. At each scale, deep structural similarity is independently computed using the original DeepSSIM formulation. A small set of learnable global weights is then used to fuse the scale-wise similarity scores, yielding the final qualit…
Figure 3. Per-scale feature responses for five representative distortions.
Original abstract

Deep-feature-based perceptual similarity models have demonstrated strong alignment with human visual perception in Image Quality Assessment (IQA). However, most existing approaches operate at a single spatial scale, implicitly assuming that structural similarity at a fixed resolution is sufficient. The role of spatial scale in deep-feature similarity modeling thus remains insufficiently understood. In this letter, we isolate spatial scale as an independent factor using a minimal multiscale extension of DeepSSIM, referred to as Deep Structural Similarity with Multiscale Representation (MSDS). The proposed framework decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MSDS, a minimal multiscale extension of DeepSSIM for deep perceptual similarity in image quality assessment. DeepSSIM is computed independently at each level of a spatial pyramid and the per-level scores are fused via a lightweight set of learnable global weights. Experiments on multiple IQA benchmark datasets are reported to yield consistent, statistically significant gains over the single-scale baseline while adding negligible complexity, thereby empirically confirming spatial scale as a non-negligible factor.

Significance. If the empirical demonstration survives controls for the learnable fusion weights, the result would usefully highlight that single-scale deep-feature similarity is insufficient and that a simple multiscale aggregation improves perceptual alignment. The contribution is modest in scope (a minimal testbed rather than a new architecture) but could still influence subsequent IQA model design if the isolation of scale is shown to be clean.

major comments (2)
  1. [Abstract and §3] The claim that the framework 'isolates spatial scale as an independent factor' is not supported by the described procedure. Because the fusion weights are learnable parameters fitted to the same IQA datasets used for evaluation, any observed improvement may reflect dataset-specific calibration rather than the contribution of multiscale structure per se. The single-scale baseline has no equivalent degrees of freedom, so the comparison is unbalanced. No controls (uniform weights, frozen cross-dataset weights, or parameter-matched baselines) are described.
  2. [Abstract and Experiments] The assertion of 'statistically significant improvements' is made without any numerical results, dataset names, p-values, or details of the statistical test. In the absence of these data it is impossible to assess whether the gains are practically meaningful or whether they survive multiple-comparison correction across the reported benchmarks.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one concrete quantitative result (e.g., mean PLCC or SRCC delta on a named dataset).
  2. [§3] Clarify the exact pyramid construction (Gaussian, Laplacian, or feature-space downsampling) and whether the backbone feature extractor is shared or duplicated across scales (one standard Gaussian construction is sketched after this report).
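On minor comment 2: the Figure 2 caption does say 'Gaussian pyramids', though the construction is not specified. A minimal sketch of one standard Gaussian construction (a Burt–Adelson 5-tap binomial kernel with blur-then-subsample; both choices are assumptions, not the paper's confirmed implementation):

```python
# One plausible Gaussian pyramid for a (B, C, H, W) image tensor; the 5-tap
# binomial kernel approximates a Gaussian low-pass filter.
import torch
import torch.nn.functional as F

def gaussian_pyramid(img, levels=4):
    k1d = torch.tensor([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    kernel = torch.outer(k1d, k1d).repeat(img.shape[1], 1, 1, 1)  # depthwise
    pyr = [img]
    for _ in range(levels - 1):
        blurred = F.conv2d(img, kernel.to(img), padding=2, groups=img.shape[1])
        img = blurred[..., ::2, ::2]  # low-pass filter, then subsample by 2
        pyr.append(img)
    return pyr
```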

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract and §3] The claim that the framework 'isolates spatial scale as an independent factor' is not supported by the described procedure. Because the fusion weights are learnable parameters fitted to the same IQA datasets used for evaluation, any observed improvement may reflect dataset-specific calibration rather than the contribution of multiscale structure per se. The single-scale baseline has no equivalent degrees of freedom, so the comparison is unbalanced. No controls (uniform weights, frozen cross-dataset weights, or parameter-matched baselines) are described.

    Authors: We agree that the learnable fusion weights (a small set of global scalars, one per pyramid level) are optimized on the same data used for evaluation, which introduces a potential confound with dataset-specific calibration. The single-scale DeepSSIM baseline indeed has fewer degrees of freedom. To better isolate the contribution of multiscale structure, we will add the following controls in the revised Experiments section: (1) results using fixed uniform weights (equal averaging across scales), (2) cross-dataset transfer where weights are learned on one benchmark and frozen for evaluation on the others, and (3) a brief comparison against a parameter-matched single-scale variant if a suitable proxy can be constructed. These additions will allow readers to assess whether the observed gains persist beyond calibration effects (a sketch of control (2) appears after these responses). revision: yes

  2. Referee: [Abstract and Experiments] The assertion of 'statistically significant improvements' is made without any numerical results, dataset names, p-values, or details of the statistical test. In the absence of these data it is impossible to assess whether the gains are practically meaningful or whether they survive multiple-comparison correction across the reported benchmarks.

    Authors: We acknowledge that the abstract and the current Experiments write-up do not explicitly list the numerical deltas, exact dataset names, p-values, or the statistical procedure (e.g., paired t-test on PLCC/SROCC with Bonferroni correction). The full tables in the manuscript contain the per-dataset scores, but the significance claims are not sufficiently detailed in the text. In the revision we will (a) expand the abstract to name the primary benchmarks and report the range of improvements, and (b) add a dedicated paragraph in §4 that lists the datasets, the exact p-values obtained, the test used, and confirmation that multiple-comparison correction was applied. This will make the statistical evidence transparent and verifiable (a sketch of the testing procedure also follows these responses). revision: yes
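Response 1's control (2) amounts to a standard transfer protocol. A minimal sketch, assuming the MultiscaleFusion module sketched earlier and hypothetical train_weights / evaluate / benchmarks helpers (none of these names come from the paper):

```python
# Cross-dataset control: fit the fusion weights on one benchmark, freeze
# them, and evaluate on the others. Helper names here are placeholders.
from scipy.stats import spearmanr

train_weights(msds, benchmarks["TID2013"])  # fit fusion logits on one dataset
for p in msds.parameters():
    p.requires_grad_(False)                 # freeze for transfer evaluation

for name in ("LIVE", "CSIQ", "KADID-10k"):
    preds, mos = evaluate(msds, benchmarks[name])
    print(name, spearmanr(preds, mos).correlation)
```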
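Response 2 names a concrete procedure (a paired test on correlation scores with multiple-comparison correction). A minimal sketch of that test, with srcc_per_split as an assumed table of per-split SRCC values for each model; the dataset names and data are placeholders, not reported results:

```python
# Paired t-test on per-split SRCC values with a Bonferroni-corrected
# threshold across the benchmarks being compared.
from scipy.stats import ttest_rel

datasets = ["LIVE", "CSIQ", "TID2013", "KADID-10k"]  # assumed benchmark names
alpha = 0.05 / len(datasets)                          # Bonferroni correction

for name in datasets:
    t, p = ttest_rel(srcc_per_split[name]["msds"],
                     srcc_per_split[name]["deepssim"])
    print(f"{name}: t={t:.2f}, p={p:.4f}, significant={p < alpha}")
```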

Circularity Check

0 steps flagged

No circularity: empirical extension validated on external benchmarks

Full rationale

The paper proposes a minimal multiscale extension to DeepSSIM by computing the metric independently at each pyramid level and fusing the scores via a small set of learnable global weights. It then reports statistically significant gains over the single-scale baseline on standard IQA datasets. No first-principles derivation, uniqueness theorem, or closed-form prediction is offered whose output is forced by construction to equal its inputs. The central claim is an empirical observation of improvement under the stated architecture; the learnable weights constitute additional degrees of freedom whose effect is measured against an external benchmark rather than being renamed as a prediction. Because the evaluation relies on held-out dataset performance rather than tautological reduction, the derivation chain (such as it exists) is self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The model introduces a small number of learnable fusion weights fitted to IQA data and relies on standard assumptions about deep features and pyramid decompositions; no new physical entities are postulated.

free parameters (1)
  • learnable global weights
    A lightweight set of weights used to fuse DeepSSIM scores computed at different pyramid levels; these are adjusted during training on IQA data.
axioms (2)
  • domain assumption Deep features extracted from a pre-trained network provide a valid basis for perceptual similarity
    The entire construction inherits this premise from the base DeepSSIM model.
  • domain assumption An image pyramid decomposition isolates independent spatial-scale information
    The decoupling step assumes that separate computation at each pyramid level cleanly separates scale effects.

pith-pipeline@v0.9.0 · 5467 in / 1328 out tokens · 55530 ms · 2026-05-10T02:50:56.258345+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 28 canonical work pages

  1. K. Zhang, W. Chen, T. Zhao, and Z. Wang, “Structural Similarity in Deep Features: Unified Image Quality Assessment Robust to Geometrically Disparate Reference,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2025, doi: 10.1109/TPAMI.2025.3627285.

  2. Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, 2003, pp. 1398–1402, doi: 10.1109/ACSSC.2003.1292216.

  3. Z. Wang and A. C. Bovik, Modern Image Quality Assessment, 1st ed. Morgan & Claypool Publishers, 2006.

  4. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004, doi: 10.1109/TIP.2003.819861.

  5. L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A Feature Similarity Index for Image Quality Assessment,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011, doi: 10.1109/TIP.2011.2109730.

  6. W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient Magnitude Similarity Deviation: A Highly Efficient Perceptual Image Quality Index,” IEEE Trans. Image Process., vol. 23, no. 2, pp. 684–695, Feb. 2014, doi: 10.1109/TIP.2013.2293423.

  7. H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Trans. Image Process., vol. 15, no. 2, pp. 430–444, Feb. 2006, doi: 10.1109/TIP.2005.859378.

  8. L. Ma, S. Li, and K. N. Ngan, “Visual Horizontal Effect for Image Quality Assessment,” IEEE Signal Process. Lett., vol. 17, no. 7, pp. 627–630, Jul. 2010, doi: 10.1109/LSP.2010.2048726.

  9. X. Zhang, X. Feng, W. Wang, and W. Xue, “Edge Strength Similarity for Image Quality Assessment,” IEEE Signal Process. Lett., vol. 20, no. 4, pp. 319–322, Apr. 2013, doi: 10.1109/LSP.2013.2244081.

  10. M. Oszust, “Decision Fusion for Image Quality Assessment using an Optimization Approach,” IEEE Signal Process. Lett., vol. 23, no. 1, pp. 65–69, 2016, doi: 10.1109/LSP.2015.2500819.

  11. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” arXiv:1603.08155, 2016, doi: 10.48550/arXiv.1603.08155.

  12. K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image Quality Assessment: Unifying Structure and Texture Similarity,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2020, doi: 10.1109/TPAMI.2020.3045810.

  13. K. Ding, R. Zhong, Z. Wang, Y. Yu, and Y. Fang, “Adaptive Structure and Texture Similarity Metric for Image Quality Assessment and Optimization,” IEEE Trans. Multimedia, vol. 26, pp. 5398–5409, 2024, doi: 10.1109/TMM.2023.3333208.

  14. P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, “Image Quality Assessment Using Contrastive Learning,” IEEE Trans. Image Process., vol. 31, pp. 4149–4161, 2022, doi: 10.1109/TIP.2022.3181496.

  15. J. Yan, Z. Liu, Z. Wang, Y. Fang, and H. Liu, “Towards Scalable and Efficient Full-Reference Omnidirectional Image Quality Assessment,” IEEE Signal Process. Lett., vol. 32, pp. 2459–2463, 2025, doi: 10.1109/LSP.2025.3569458.

  16. U. Cogalan, M. Bemana, K. Myszkowski, H.-P. Seidel, and C. Groth, “MILO: A Lightweight Perceptual Quality Metric for Image and Latent-Space Optimization,” ACM Trans. Graph., vol. 44, no. 6, pp. 1–11, Dec. 2025, doi: 10.1145/3763340.

  17. A. Gushchin, D. S. Vatolin, and A. Antsiferova, “BiRQA: Bidirectional Robust Quality Assessment for Images,” arXiv:2602.20351, 2026, doi: 10.48550/arXiv.2602.20351.

  18. L. Tang, Y. Han, L. Yuan, and G. Zhai, “FsPN: Blind Image Quality Assessment Based on Feature-Selected Pyramid Network,” IEEE Signal Process. Lett., vol. 32, pp. 1–5, 2025, doi: 10.1109/LSP.2024.3475912.

  19. M. M. R. Mithila and M. C. Q. Farias, “MS-SCANet: A Multiscale Transformer-Based Architecture with Dual Attention for No-Reference Image Quality Assessment,” in ICASSP 2025, Hyderabad, India, Apr. 2025, pp. 1–5, doi: 10.1109/ICASSP49660.2025.10887759.

  20. S. Saini, R.-L. Liao, Y. Ye, and A. Bovik, “LGDM: Latent Guidance in Diffusion Models for Perceptual Evaluations,” in Proc. Int. Conf. Mach. Learn. (ICML), Vancouver, BC, Canada, 2025.

  21. E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, “Pyramid Methods in Image Processing,” RCA Engineer, 1983.

  22. H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, Nov. 2006, doi: 10.1109/TIP.2006.881959.

  23. D. M. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” J. Electron. Imaging, vol. 19, no. 1, p. 011006, Jan. 2010, doi: 10.1117/1.3267105.

  24. N. Ponomarenko et al., “Color image database TID2013: Peculiarities and preliminary results,” in European Workshop on Visual Information Processing (EUVIP), Paris, France, 2013, pp. 106–111, doi: 10.1109/EUVIP.2013.6623960.

  25. H. Lin, V. Hosu, and D. Saupe, “KADID-10k: A Large-scale Artificially Distorted IQA Database,” in 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), Berlin, Germany, Jun. 2019, pp. 1–3, doi: 10.1109/QoMEX.2019.8743252.

  26. G. Jinjin, C. Haoming, C. Haoyu, Y. Xiaoxing, J. S. Ren, and D. Chao, “PIPAL: A Large-Scale Image Quality Assessment Dataset for Perceptual Image Restoration,” in Computer Vision – ECCV 2020, LNCS vol. 12356, A. Vedaldi et al., Eds., Springer, Cham, 2020, pp. 633–651, doi: 10.1007/978-3-030-58621-8_37.

  27. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, Jun. 2018, pp. 586–595, doi: 10.1109/CVPR.2018.00068.

  28. E. Prashnani, H. Cai, Y. Mostofi, and P. Sen, “PieAPP: Perceptual Image-Error Assessment Through Pairwise Preference,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, Jun. 2018, pp. 1808–1817, doi: 10.1109/CVPR.2018.00194.

  29. B. Chen, H. Zhu, L. Zhu, S. Wang, J. Pan, and S. Wang, “Debiased Mapping for Full-Reference Image Quality Assessment,” IEEE Trans. Multimedia, vol. 27, pp. 2638–2649, 2025, doi: 10.1109/TMM.2025.3535280.

  30. S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018, doi: 10.1109/TIP.2017.2760518.

  31. W. Xian, M. Zhou, B. Fang, T. Xiang, W. Jia, and B. Chen, “Perceptual Quality Analysis in Deep Domains Using Structure Separation and High-Order Moments,” IEEE Trans. Multimedia, vol. 26, pp. 2219–2234, 2024, doi: 10.1109/TMM.2023.3293730.