MSDS: Deep Structural Similarity with Multiscale Representation
Pith reviewed 2026-05-10 02:50 UTC · model grok-4.3
The pith
Multiscale representation improves deep structural similarity by treating spatial scale as an independent factor in perceptual models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces MSDS as a minimal multiscale extension of DeepSSIM. It decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
What carries the argument
Independent DeepSSIM computation at each pyramid level followed by fusion with learnable global weights
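As a minimal sketch of that pipeline: the average-pooling pyramid, the toy per-level score standing in for DeepSSIM, and the softmax parameterization of the global weights are all assumptions here, since the review fixes none of these details.

```python
import numpy as np

def downsample(img):
    # One pyramid step via 2x2 average pooling; the paper does not specify
    # its pyramid construction (Gaussian vs. Laplacian), so this is a stand-in.
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def build_pyramid(img, levels):
    # Finest level first, each subsequent level at half resolution.
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

def toy_score(ref, dist):
    # Placeholder for the per-level DeepSSIM score: an SSIM-style ratio of
    # global statistics, NOT the deep-feature metric used in the paper.
    c1, c2 = 1e-4, 9e-4
    mu_r, mu_d = ref.mean(), dist.mean()
    cov = ((ref - mu_r) * (dist - mu_d)).mean()
    return ((2 * mu_r * mu_d + c1) * (2 * cov + c2)) / (
        (mu_r ** 2 + mu_d ** 2 + c1) * (ref.var() + dist.var() + c2))

def msds(ref, dist, logits):
    # Per-level scores fused with global weights; a softmax over learnable
    # logits is one plausible reading of the paper's "lightweight set of
    # learnable global weights".
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()
    levels = len(logits)
    scores = [toy_score(r, d) for r, d in
              zip(build_pyramid(ref, levels), build_pyramid(dist, levels))]
    return float(np.dot(w, scores))
```

Note that for identical inputs every per-level score is 1, so the fused score is 1 regardless of the weights, matching the behavior expected of a similarity index.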
If this is right
- Deep perceptual similarity models gain measurable accuracy by accounting for spatial scale explicitly.
- The added multiscale processing requires only a small number of extra parameters and negligible runtime.
- Global-weight fusion suffices to integrate information across scales without needing more elaborate cross-scale mechanisms.
Where Pith is reading between the lines
- Similar minimal multiscale extensions could be tested on other deep-feature similarity metrics to check for comparable gains.
- Image quality assessment pipelines that already use deep features might see routine benefits from incorporating pyramid-level processing.
- The approach leaves room for future variants that learn scale-specific weights instead of a single global set.
Load-bearing premise
The improvements stem cleanly from adding multiple scales rather than from the choice of pyramid construction or the fitting of the fusion weights themselves.
What would settle it
Repeating the experiments using a different pyramid construction method or replacing the learnable weights with fixed equal weights, and finding that the performance gains disappear.
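The fixed-equal-weights half of that control is straightforward to state; the `fuse` helper and the per-level values below are hypothetical illustrations, not scores from the paper.

```python
import numpy as np

def fuse(scores, weights=None):
    # Fuse per-level DeepSSIM scores; with no weights given this is the
    # fixed equal-weight control, otherwise a normalized weighted average.
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights / weights.sum(), scores))

# Hypothetical per-level scores for one image pair (coarser levels lower):
level_scores = [0.92, 0.88, 0.81, 0.74]
uniform = fuse(level_scores)                        # equal-weight control
learned = fuse(level_scores, [0.4, 0.3, 0.2, 0.1])  # illustrative learned weights
```

If the benchmark gains persist under the `uniform` variant, the multiscale structure itself is doing the work; if they vanish, the learnable weights were carrying the result.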
original abstract
Deep-feature-based perceptual similarity models have demonstrated strong alignment with human visual perception in Image Quality Assessment (IQA). However, most existing approaches operate at a single spatial scale, implicitly assuming that structural similarity at a fixed resolution is sufficient. The role of spatial scale in deep-feature similarity modeling thus remains insufficiently understood. In this letter, we isolate spatial scale as an independent factor using a minimal multiscale extension of DeepSSIM, referred to as Deep Structural Similarity with Multiscale Representation (MSDS). The proposed framework decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MSDS, a minimal multiscale extension of DeepSSIM for deep perceptual similarity in image quality assessment. DeepSSIM is computed independently at each level of a spatial pyramid and the per-level scores are fused via a lightweight set of learnable global weights. Experiments on multiple IQA benchmark datasets are reported to yield consistent, statistically significant gains over the single-scale baseline while adding negligible complexity, thereby empirically confirming spatial scale as a non-negligible factor.
Significance. If the empirical demonstration survives controls for the learnable fusion weights, the result would usefully highlight that single-scale deep-feature similarity is insufficient and that a simple multiscale aggregation improves perceptual alignment. The contribution is modest in scope (a minimal testbed rather than a new architecture) but could still influence subsequent IQA model design if the isolation of scale is shown to be clean.
major comments (2)
- [Abstract and §3] The claim that the framework 'isolates spatial scale as an independent factor' is not supported by the described procedure. Because the fusion weights are learnable parameters fitted to the same IQA datasets used for evaluation, any observed improvement may reflect dataset-specific calibration rather than the contribution of multiscale structure per se. The single-scale baseline has no equivalent degrees of freedom, so the comparison is unbalanced. No controls (uniform weights, frozen cross-dataset weights, or parameter-matched baselines) are described.
- [Abstract and Experiments] The assertion of 'statistically significant improvements' is made without any numerical results, dataset names, p-values, or details of the statistical test. In the absence of these data it is impossible to assess whether the gains are practically meaningful or whether they survive multiple-comparison correction across the reported benchmarks.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one concrete quantitative result (e.g., mean PLCC or SRCC delta on a named dataset).
- [§3] Clarify the exact pyramid construction (Gaussian, Laplacian, or feature-space downsampling) and whether the backbone feature extractor is shared or duplicated across scales.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation of our results.
point-by-point responses
Referee: [Abstract and §3] The claim that the framework 'isolates spatial scale as an independent factor' is not supported by the described procedure. Because the fusion weights are learnable parameters fitted to the same IQA datasets used for evaluation, any observed improvement may reflect dataset-specific calibration rather than the contribution of multiscale structure per se. The single-scale baseline has no equivalent degrees of freedom, so the comparison is unbalanced. No controls (uniform weights, frozen cross-dataset weights, or parameter-matched baselines) are described.
Authors: We agree that the learnable fusion weights (a small set of global scalars, one per pyramid level) are optimized on the same data used for evaluation, which introduces a potential confound with dataset-specific calibration. The single-scale DeepSSIM baseline indeed has fewer degrees of freedom. To better isolate the contribution of multiscale structure, we will add the following controls in the revised Experiments section: (1) results using fixed uniform weights (equal averaging across scales), (2) cross-dataset transfer where weights are learned on one benchmark and frozen for evaluation on the others, and (3) a brief comparison against a parameter-matched single-scale variant if a suitable proxy can be constructed. These additions will allow readers to assess whether the observed gains persist beyond calibration effects. revision: yes
Referee: [Abstract and Experiments] The assertion of 'statistically significant improvements' is made without any numerical results, dataset names, p-values, or details of the statistical test. In the absence of these data it is impossible to assess whether the gains are practically meaningful or whether they survive multiple-comparison correction across the reported benchmarks.
Authors: We acknowledge that the abstract and the current Experiments write-up do not explicitly list the numerical deltas, exact dataset names, p-values, or the statistical procedure (e.g., paired t-test on PLCC/SROCC with Bonferroni correction). The full tables in the manuscript contain the per-dataset scores, but the significance claims are not sufficiently detailed in the text. In the revision we will (a) expand the abstract to name the primary benchmarks and report the range of improvements, and (b) add a dedicated paragraph in §4 that lists the datasets, the exact p-values obtained, the test used, and confirmation that multiple-comparison correction was applied. This will make the statistical evidence transparent and verifiable. revision: yes
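The significance procedure promised in the response could look as follows; a distribution-free sign-flip permutation test stands in for the paired t-test the authors mention, and all deltas and dataset names below are invented for illustration.

```python
import numpy as np

def paired_signflip_pvalue(deltas, n_perm=10000, seed=0):
    # Two-sided sign-flip permutation test on paired score differences:
    # under H0 the deltas are symmetric about zero, so randomly flipping
    # their signs samples the null distribution of the mean delta.
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    observed = abs(deltas.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(deltas)))
    perm_means = np.abs((signs * deltas).mean(axis=1))
    # Add-one smoothing keeps the estimate strictly positive.
    return float((1 + np.sum(perm_means >= observed)) / (n_perm + 1))

# Hypothetical per-split SROCC deltas (MSDS minus single-scale baseline):
dataset_deltas = {
    "dataset_A": [0.012, 0.009, 0.015, 0.011, 0.008],
    "dataset_B": [0.007, 0.010, 0.006, 0.009, 0.012],
    "dataset_C": [0.004, 0.006, 0.003, 0.008, 0.005],
}
m = len(dataset_deltas)  # Bonferroni factor: one comparison per dataset
adjusted = {name: min(1.0, m * paired_signflip_pvalue(d))
            for name, d in dataset_deltas.items()}
```

Reporting the Bonferroni-adjusted p-value per dataset, as sketched here, would directly answer the referee's multiple-comparison concern.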
Circularity Check
No circularity: empirical extension validated on external benchmarks
full rationale
The paper proposes a minimal multiscale extension to DeepSSIM by computing the metric independently at each pyramid level and fusing the scores via a small set of learnable global weights. It then reports statistically significant gains over the single-scale baseline on standard IQA datasets. No first-principles derivation, uniqueness theorem, or closed-form prediction is offered whose output is forced by construction to equal its inputs. The central claim is an empirical observation of improvement under the stated architecture; the learnable weights constitute additional degrees of freedom whose effect is measured against an external benchmark rather than being renamed as a prediction. Because the evaluation relies on held-out dataset performance rather than tautological reduction, the derivation chain (such as it exists) is self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable global weights
axioms (2)
- domain assumption: Deep features extracted from a pre-trained network provide a valid basis for perceptual similarity
- domain assumption: An image pyramid decomposition isolates independent spatial-scale information
Reference graph
Works this paper leans on
- [1] K. Zhang, W. Chen, T. Zhao, and Z. Wang, “Structural Similarity in Deep Features: Unified Image Quality Assessment Robust to Geometrically Disparate Reference,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2025, doi: 10.1109/TPAMI.2025.3627285
- [2] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. Asilomar Conf. Signals, Syst. Comput., Pacific Grove, CA, USA, 2003, pp. 1398–1402, doi: 10.1109/ACSSC.2003.1292216
- [3] Z. Wang and A. C. Bovik, Modern Image Quality Assessment, 1st ed. Morgan & Claypool Publishers, 2006
- [4] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004, doi: 10.1109/TIP.2003.819861
- [5] Lin Zhang, Lei Zhang, Xuanqin Mou, and D. Zhang, “FSIM: A Feature Similarity Index for Image Quality Assessment,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011, doi: 10.1109/TIP.2011.2109730
- [6] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient Magnitude Similarity Deviation: A Highly Efficient Perceptual Image Quality Index,” IEEE Trans. Image Process., vol. 23, no. 2, pp. 684–695, Feb. 2014, doi: 10.1109/TIP.2013.2293423
- [7] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Trans. Image Process., vol. 15, no. 2, pp. 430–444, Feb. 2006, doi: 10.1109/TIP.2005.859378
- [8] L. Ma, S. Li, and K. N. Ngan, “Visual Horizontal Effect for Image Quality Assessment,” IEEE Signal Process. Lett., vol. 17, no. 7, pp. 627–630, Jul. 2010, doi: 10.1109/LSP.2010.2048726
- [9] X. Zhang, X. Feng, W. Wang, and W. Xue, “Edge Strength Similarity for Image Quality Assessment,” IEEE Signal Process. Lett., vol. 20, no. 4, pp. 319–322, Apr. 2013, doi: 10.1109/LSP.2013.2244081
- [10] M. Oszust, “Decision Fusion for Image Quality Assessment using an Optimization Approach,” IEEE Signal Process. Lett., vol. 23, no. 1, pp. 65–69, 2016, doi: 10.1109/LSP.2015.2500819
- [11] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” arXiv:1603.08155, 2016, doi: 10.48550/arXiv.1603.08155
- [12] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image Quality Assessment: Unifying Structure and Texture Similarity,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2020, doi: 10.1109/TPAMI.2020.3045810
- [13] K. Ding, R. Zhong, Z. Wang, Y. Yu, and Y. Fang, “Adaptive Structure and Texture Similarity Metric for Image Quality Assessment and Optimization,” IEEE Trans. Multimedia, vol. 26, pp. 5398–5409, 2024, doi: 10.1109/TMM.2023.3333208
- [14] P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, “Image Quality Assessment Using Contrastive Learning,” IEEE Trans. Image Process., vol. 31, pp. 4149–4161, 2022, doi: 10.1109/TIP.2022.3181496
- [15] J. Yan, Z. Liu, Z. Wang, Y. Fang, and H. Liu, “Towards Scalable and Efficient Full-Reference Omnidirectional Image Quality Assessment,” IEEE Signal Process. Lett., vol. 32, pp. 2459–2463, 2025, doi: 10.1109/LSP.2025.3569458
- [16] U. Cogalan, M. Bemana, K. Myszkowski, H.-P. Seidel, and C. Groth, “MILO: A Lightweight Perceptual Quality Metric for Image and Latent-Space Optimization,” ACM Trans. Graph., vol. 44, no. 6, pp. 1–11, Dec. 2025, doi: 10.1145/3763340
- [17] A. Gushchin, D. S. Vatolin, and A. Antsiferova, “BiRQA: Bidirectional Robust Quality Assessment for Images,” arXiv:2602.20351, 2026, doi: 10.48550/arXiv.2602.20351
- [18] L. Tang, Y. Han, L. Yuan, and G. Zhai, “FsPN: Blind Image Quality Assessment Based on Feature-Selected Pyramid Network,” IEEE Signal Process. Lett., vol. 32, pp. 1–5, 2025, doi: 10.1109/LSP.2024.3475912
- [19] M. M. R. Mithila and M. C. Q. Farias, “MS-SCANet: A Multiscale Transformer-Based Architecture with Dual Attention for No-Reference Image Quality Assessment,” in ICASSP 2025, Hyderabad, India, Apr. 2025, pp. 1–5, doi: 10.1109/ICASSP49660.2025.10887759
- [20] S. Saini, R.-L. Liao, Y. Ye, and A. Bovik, “LGDM: Latent Guidance in Diffusion Models for Perceptual Evaluations,” in Proc. Int. Conf. Mach. Learn. (ICML), Vancouver, BC, Canada, 2025
- [21] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, “Pyramid Methods in Image Processing,” RCA Engineer, 1983
- [22] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3440–3451, Nov. 2006, doi: 10.1109/TIP.2006.881959
- [23] D. M. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” J. Electron. Imaging, vol. 19, no. 1, p. 011006, Jan. 2010, doi: 10.1117/1.3267105
- [24] N. Ponomarenko et al., “Color image database TID2013: Peculiarities and preliminary results,” in European Workshop on Visual Information Processing (EUVIP), Paris, France, 2013, pp. 106–111, doi: 10.1109/EUVIP.2013.6623960
- [25] H. Lin, V. Hosu, and D. Saupe, “KADID-10k: A Large-scale Artificially Distorted IQA Database,” in 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), Berlin, Germany, Jun. 2019, pp. 1–3, doi: 10.1109/QoMEX.2019.8743252
- [26] G. Jinjin, C. Haoming, C. Haoyu, Y. Xiaoxing, J. S. Ren, and D. Chao, “PIPAL: A Large-Scale Image Quality Assessment Dataset for Perceptual Image Restoration,” in Computer Vision – ECCV 2020, LNCS vol. 12356, A. Vedaldi et al., Eds., Springer, Cham, 2020, pp. 633–651, doi: 10.1007/978-3-030-58621-8_37
- [27] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, Jun. 2018, pp. 586–595, doi: 10.1109/CVPR.2018.00068
- [28] E. Prashnani, H. Cai, Y. Mostofi, and P. Sen, “PieAPP: Perceptual Image-Error Assessment Through Pairwise Preference,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, Jun. 2018, pp. 1808–1817, doi: 10.1109/CVPR.2018.00194
- [29] B. Chen, H. Zhu, L. Zhu, S. Wang, J. Pan, and S. Wang, “Debiased Mapping for Full-Reference Image Quality Assessment,” IEEE Trans. Multimedia, vol. 27, pp. 2638–2649, 2025, doi: 10.1109/TMM.2025.3535280
- [30] S. Bosse, D. Maniry, K.-R. Muller, T. Wiegand, and W. Samek, “Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018, doi: 10.1109/TIP.2017.2760518
- [31] W. Xian, M. Zhou, B. Fang, T. Xiang, W. Jia, and B. Chen, “Perceptual Quality Analysis in Deep Domains Using Structure Separation and High-Order Moments,” IEEE Trans. Multimedia, vol. 26, pp. 2219–2234, 2024, doi: 10.1109/TMM.2023.3293730