Pith · machine review for the scientific record

arxiv: 2604.03225 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

VOSR: A Vision-Only Generative Model for Image Super-Resolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords image super-resolution · generative models · vision-only training · diffusion models · classifier-free guidance · image restoration · one-step distillation · perceptual quality

The pith

A vision-only generative model achieves high-quality image super-resolution without any text-to-image pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether generative super-resolution must start from large text-to-image diffusion models pretrained on web-scale data. It introduces VOSR, which trains a diffusion model from scratch using only visual inputs: a pretrained vision encoder supplies semantic and spatial features directly from the low-resolution image, while a new restoration-oriented guidance strategy replaces the usual unconditional branch to keep the output anchored to the input. The resulting multi-step model is distilled into a fast one-step version. On both synthetic and real-world benchmarks, VOSR matches or exceeds the perceptual quality of text-to-image baselines, produces fewer hallucinations, and requires less than one-tenth the training compute. This shows that multimodal pretraining is not required for competitive generative restoration.

Core claim

VOSR shows that a generative super-resolution model trained purely on visual data can match the perceptual quality and efficiency of text-to-image diffusion methods. It does so by extracting spatially grounded semantic features from the low-resolution input via a pretrained vision encoder, and by replacing standard classifier-free guidance with a restoration-oriented strategy that preserves weak LR anchors. After distillation to one step, the model delivers competitive results on synthetic and real-world benchmarks at under one-tenth the training cost of representative T2I-based approaches.

What carries the argument

The VOSR framework: features from a pretrained vision encoder applied to the LR input serve as visual semantic guidance, paired with a restoration-oriented guidance strategy that substitutes for the unconditional branch in classifier-free guidance to maintain structural fidelity during training from scratch.
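The guidance substitution can be pictured schematically. In standard classifier-free guidance the sampler extrapolates away from an unconditional prediction; a restoration-oriented variant could instead extrapolate away from a prediction tied to a weak LR anchor, keeping the output close to the input. The function names and the anchoring rule below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cfg_guidance(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance: extrapolate away from the
    unconditional branch toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def restoration_guidance(eps_cond, eps_anchor, scale):
    """Hypothetical restoration-oriented variant: the unconditional
    branch is replaced by a prediction conditioned only on a weak LR
    anchor, so extrapolation stays tethered to the input image."""
    return eps_anchor + scale * (eps_cond - eps_anchor)

# Toy noise predictions for a 4x4 "latent"
rng = np.random.default_rng(0)
eps_cond = rng.normal(size=(4, 4))
eps_uncond = rng.normal(size=(4, 4))
eps_anchor = 0.5 * eps_cond + 0.5 * eps_uncond  # stand-in weak anchor

out = restoration_guidance(eps_cond, eps_anchor, scale=3.0)
```

At scale 1 either rule reduces to the conditional prediction; the two differ only in what the scale extrapolates away from, which is exactly where the paper locates the fragility of the standard unconditional branch.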

If this is right

  • High-quality generative super-resolution becomes possible without access to massive multimodal pretraining datasets or models.
  • Training costs for such models drop by more than a factor of ten while perceptual quality and structural faithfulness remain at least as high.
  • The distilled one-step version retains the quality gains, enabling efficient inference on both synthetic and real-world images.
  • Fewer hallucinations appear because the model stays anchored to the input rather than freely generating from text priors.
  • The same visual-guidance and restoration-oriented strategy can be applied to other image-restoration tasks that currently rely on text-to-image backbones.
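The one-step distillation mentioned above can be pictured generically: run the multi-step teacher to a clean estimate, then regress a one-step student onto that target. The toy denoiser and the MSE objective below are placeholders, not the paper's distillation loss.

```python
import numpy as np

def teacher_multistep(x_t, steps, denoise):
    """Toy multi-step sampler: repeatedly move partway toward the
    denoiser's estimate (stand-in for a diffusion teacher)."""
    x = x_t
    for _ in range(steps):
        x = x + 0.5 * (denoise(x) - x)
    return x

def distill_loss(student_out, teacher_out):
    """Regression-style distillation: match the teacher's final
    output in a single student step (MSE as a placeholder)."""
    return float(np.mean((student_out - teacher_out) ** 2))

denoise = lambda x: 0.9 * x      # toy denoiser
x_t = np.ones((8, 8))
target = teacher_multistep(x_t, steps=4, denoise=denoise)
student = lambda x: 0.8 * x      # untrained one-step student
loss = distill_loss(student(x_t), target)
```

The student never sees the intermediate steps, only the teacher's endpoint, which is why a successful distillation preserves the multi-step model's perceptual behavior at one-step cost.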

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Vision-only training may lower the barrier for researchers without large-scale text-image compute resources to develop competitive generative restoration models.
  • The approach invites direct tests on whether the same encoder-plus-guidance pattern improves other conditional generation tasks such as inpainting or deblurring.
  • If the vision encoder features prove sufficient across domains, future models could drop text conditioning entirely for restoration while retaining generative flexibility.
  • Longer training or larger vision encoders might close any remaining quality gaps on the hardest real-world cases.

Load-bearing premise

A pretrained vision encoder can extract semantically rich and spatially accurate features from the low-resolution input that are sufficient to replace the semantic prior normally supplied by a text-to-image model.
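This premise amounts to a simple conditioning path: patchify the LR image and embed each patch with a frozen encoder, yielding a spatial grid of semantic tokens for the generator to attend to. The encoder below is a stand-in (a fixed random projection), not DINO or the paper's actual model.

```python
import numpy as np

def patchify(img, patch):
    """Split an HxWxC image into non-overlapping (patch x patch) tokens."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    img = img[: gh * patch, : gw * patch]
    tokens = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tokens.reshape(gh * gw, patch * patch * c)

class FrozenEncoder:
    """Stand-in for a pretrained vision encoder (e.g., a DINO-style
    ViT): a fixed random linear projection of patch tokens."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))

    def __call__(self, tokens):
        return tokens @ self.w  # (num_patches, out_dim) semantic grid

lr = np.random.default_rng(1).random((64, 64, 3))  # toy LR image
tokens = patchify(lr, patch=8)                     # 64 spatial tokens
enc = FrozenEncoder(in_dim=8 * 8 * 3, out_dim=32)
cond = enc(tokens)                                 # spatially grounded features
```

Because each token keeps its grid position, the conditioning is spatially grounded by construction; whether the features are semantically rich enough to stand in for a T2I prior is precisely what the premise asserts.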

What would settle it

A head-to-head evaluation on a real-world benchmark (for instance, one built with the Real-ESRGAN degradation pipeline) in which the VOSR one-step model produces measurably higher LPIPS error or visibly more structural hallucinations than a representative T2I-based method would falsify the claim of competitive or superior performance.

Figures

Figures reproduced from arXiv: 2604.03225 by Jixin Zhao, Lei Zhang, Lingchen Sun, Rongyuan Wu, Shihao Wang, Xiangtao Kong, Zhengqiang Zhang.

Figure 1: Comparison of VOSR with existing generative SR methods in terms of …
Figure 2: Overview of VOSR. (a) Framework overview. Given an LR image, VOSR builds two complementary conditions from the …
Figure 3: Multi-step (top) and one-step (bottom) SR visual comparison on RealDeg …
Figure 4: Effect of guidance scale on VOSR-1.4B-ms. As the scale …
Figure 5: Thumbnail montage of the ScreenSR benchmark. The selected 130 GT images cover diverse scenarios, including indoor and …
Figure 6: User study results in the multi-step and one-step settings.
Figure 7: Additional visual comparisons of multi-step (1st, 3rd and 5th) and one-step (2nd, 4th and 6th) SR results. Compared with …
Original abstract

Most of the recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, despite that SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results, for the first time, show that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes VOSR, a vision-only generative framework for image super-resolution. It extracts semantically rich features from low-resolution inputs via a pretrained vision encoder and replaces standard classifier-free guidance with a restoration-oriented strategy that preserves weak LR anchors. A multi-step model is trained from scratch on visual data only and then distilled to a one-step model. The work claims that VOSR achieves competitive or superior perceptual quality and fewer hallucinations than T2I-adapted baselines on synthetic and real-world benchmarks while requiring less than one-tenth the training cost, thereby demonstrating that high-quality generative SR is possible without multimodal pretraining.

Significance. If the empirical claims hold under rigorous verification, the result would be significant for the field: it provides the first demonstration that a purely vision-based generative SR model trained from scratch can match or exceed the perceptual performance of methods built on large T2I diffusion backbones. The public release of code and models at the cited GitHub repository is a clear strength that supports reproducibility. The work also highlights a practical design choice (restoration-oriented guidance) that may generalize to other restoration tasks where multimodal priors are unavailable or undesirable.

major comments (3)
  1. [Method] The central claim that a standard pretrained vision encoder supplies spatially grounded semantics sufficient to replace T2I generative priors is load-bearing yet rests on an untested assumption. The method section does not include an ablation that isolates the contribution of the vision-encoder features versus the guidance modification, nor does it quantify how much semantic richness is actually transferred to the dense generative task.
  2. [Experiments] §4 (Experiments): the reported gains in perceptual quality and reduced hallucinations are presented without full details on data splits, baseline re-implementations, or hyper-parameter search procedures. This leaves open the possibility that the comparisons contain post-hoc choices that affect the central claim of outperforming T2I-based methods.
  3. [Method] The restoration-oriented guidance is described as preserving LR anchors, but no quantitative analysis (e.g., drift metrics or structure-preservation scores) is provided to show that the modified guidance actually prevents mode collapse or over-smoothing relative to standard CFG on the same backbone.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'for the first time' should be qualified with a precise citation to the closest prior vision-only generative SR attempts so readers can immediately assess novelty.
  2. [Method] The one-step distillation procedure is mentioned only briefly; a short paragraph or diagram clarifying the distillation loss and how it preserves the multi-step model's perceptual advantages would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript to incorporate additional ablations, experimental details, and quantitative analyses where appropriate. These changes strengthen the presentation of our claims without altering the core contributions.

Point-by-point responses
  1. Referee: [Method] The central claim that a standard pretrained vision encoder supplies spatially grounded semantics sufficient to replace T2I generative priors is load-bearing yet rests on an untested assumption. The method section does not include an ablation that isolates the contribution of the vision-encoder features versus the guidance modification, nor does it quantify how much semantic richness is actually transferred to the dense generative task.

    Authors: We agree that an explicit ablation isolating the vision encoder contribution would provide stronger support. In the revised manuscript, we have added Section 3.4 with a new ablation study comparing: (i) the full VOSR model, (ii) a variant using only restoration-oriented guidance (no vision encoder), and (iii) a variant with vision encoder features but standard CFG. We also report quantitative metrics including average cosine similarity between vision-encoder features and intermediate generative features, as well as downstream task performance (e.g., semantic segmentation accuracy on generated outputs) to demonstrate the transfer of spatially grounded semantics. These results confirm the vision encoder's role in replacing T2I priors. revision: yes

  2. Referee: [Experiments] §4 (Experiments): the reported gains in perceptual quality and reduced hallucinations are presented without full details on data splits, baseline re-implementations, or hyper-parameter search procedures. This leaves open the possibility that the comparisons contain post-hoc choices that affect the central claim of outperforming T2I-based methods.

    Authors: We acknowledge the need for greater transparency. The revised Section 4.1 now includes: complete specifications of all training/validation/test splits for synthetic (e.g., DIV2K, Flickr2K) and real-world benchmarks; detailed re-implementation protocols for T2I baselines (including exact adaptation steps, training iterations, and any hyper-parameter adjustments made to match our evaluation setup); and the full hyper-parameter search ranges with final selected values for VOSR and all baselines. We also added a note confirming that all methods were evaluated under identical protocols, with code for reproduction released in the GitHub repository. revision: yes

  3. Referee: [Method] The restoration-oriented guidance is described as preserving LR anchors, but no quantitative analysis (e.g., drift metrics or structure-preservation scores) is provided to show that the modified guidance actually prevents mode collapse or over-smoothing relative to standard CFG on the same backbone.

    Authors: We thank the referee for this suggestion. In the revised manuscript, we have added quantitative validation in Section 3.2 and the experiments: we report LPIPS-based perceptual drift scores, edge preservation metrics (Sobel gradient similarity), and a diversity index (standard deviation across 10 stochastic samples) to compare restoration-oriented guidance against standard CFG on the identical VOSR backbone. These metrics demonstrate reduced over-smoothing and mode collapse, with specific numerical improvements listed in a new table. The analysis directly supports that the modified guidance better preserves LR anchors. revision: yes
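The checks named in this (simulated) response can be sketched concretely. The edge-preservation and diversity metrics below are illustrative stand-ins — Sobel-based cosine similarity between gradient maps, and per-pixel standard deviation across stochastic samples — not the paper's evaluation protocol.

```python
import numpy as np

def sobel_mag(img):
    """Gradient magnitude via 3x3 Sobel filters (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            win = img[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)

def edge_similarity(a, b, eps=1e-8):
    """Cosine similarity of Sobel edge maps: a crude
    structure-preservation score (1.0 = identical edges)."""
    ga, gb = sobel_mag(a).ravel(), sobel_mag(b).ravel()
    return float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb) + eps))

def diversity_index(samples):
    """Per-pixel std across stochastic samples, averaged: near zero
    indicates mode collapse."""
    return float(np.stack(samples).std(axis=0).mean())

rng = np.random.default_rng(0)
ref = rng.random((16, 16))
sim_self = edge_similarity(ref, ref)  # identical images score ~1
div = diversity_index([ref + 0.01 * rng.normal(size=ref.shape)
                       for _ in range(10)])
```

Comparing such scores for restoration-oriented guidance versus standard CFG on the same backbone is the shape of the experiment the referee asked for.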

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper proposes an empirical framework (pretrained vision encoder for semantic features + restoration-oriented guidance replacing the unconditional branch in classifier-free guidance) and validates it through training from scratch plus distillation, with results measured against external T2I-based baselines on synthetic and real-world benchmarks. No mathematical derivation, fitted parameter, or self-citation chain is load-bearing; the central claim that vision-only training suffices is supported by direct performance comparisons rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes standard diffusion training dynamics and that a frozen pretrained vision encoder supplies adequate semantic conditioning; no new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption A pretrained vision encoder extracts semantically rich and spatially grounded features from low-resolution inputs that are sufficient for high-quality generative restoration.
    Invoked in the first paragraph of the abstract as the source of visual semantic guidance.
  • domain assumption Standard unconditional classifier-free guidance is ill-suited to restoration models trained from scratch.
    Stated explicitly when introducing the restoration-oriented guidance replacement.

pith-pipeline@v0.9.0 · 5594 in / 1496 out tokens · 35830 ms · 2026-05-13T20:29:28.909307+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 10 internal anchors

  1. [1]

    ai / stable - diffusion

    Stability.ai.https : / / stability . ai / stable - diffusion. 1

  2. [2]

    Dream- clear: High-capacity real-world image restoration with privacy-safe dataset curation.Advances in Neural Informa- tion Processing Systems, 37:55443–55469, 2024

    Yuang Ai, Xiaoqiang Zhou, Huaibo Huang, Xiaotian Han, Zhengyu Chen, Quanzeng You, and Hongxia Yang. Dream- clear: High-capacity real-world image restoration with privacy-safe dataset curation.Advances in Neural Informa- tion Processing Systems, 37:55443–55469, 2024. 3

  3. [3]

    Toward real-world single image super-resolution: A new benchmark and a new model

    Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019. 2, 6, 14

  4. [4]

    Adversarial diffusion compression for real-world image super-resolution.arXiv preprint arXiv:2411.13383, 2024

    Bin Chen, Gehui Li, Rongyuan Wu, Xindong Zhang, Jie Chen, Jian Zhang, and Lei Zhang. Adversarial diffusion compression for real-world image super-resolution.arXiv preprint arXiv:2411.13383, 2024. 3, 6

  5. [5]

    Topiq: A top-down approach from semantics to distortions for image quality assessment.IEEE Transactions on Image Processing,

    Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin. Topiq: A top-down approach from semantics to distortions for image quality assessment.IEEE Transactions on Image Processing,

  6. [6]

    To- ward generalized image quality assessment: Relaxing the perfect reference quality assumption.arXiv preprint arXiv:2503.11221, 2025

    Du Chen, Tianhe Wu, Kede Ma, and Lei Zhang. To- ward generalized image quality assessment: Relaxing the perfect reference quality assumption.arXiv preprint arXiv:2503.11221, 2025. 6

  7. [7]

    Faithd- iff: Unleashing diffusion priors for faithful image super- resolution

    Junyang Chen, Jinshan Pan, and Jiangxin Dong. Faithd- iff: Unleashing diffusion priors for faithful image super- resolution. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 28188–28197, 2025. 1, 3, 5, 8

  8. [8]

    Activating more pixels in image super- resolution transformer

    Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super- resolution transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22367–22377, 2023. 1, 3

  9. [9]

    Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020. 6

  10. [10]

    Learning a deep convolutional network for image super-resolution

    Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pages 184–199. Springer,

  11. [11]

    Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution

    Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, and Changqing Zou. Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23174–23184, 2025. 3

  12. [12]

    Dit4sr: Taming diffusion trans- former for real-world image super-resolution.arXiv preprint arXiv:2503.23580, 2025

    Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy S Ren, Chun-Le Guo, and Chongyi Li. Dit4sr: Taming diffusion trans- former for real-world image super-resolution.arXiv preprint arXiv:2503.23580, 2025. 1, 3, 6

  13. [13]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024. 1, 3

  14. [14]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024. 3, 6, 12

  15. [15]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step genera- tive modeling.arXiv preprint arXiv:2505.13447, 2025. 3

  16. [16]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 2, 4

  17. [17]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 3

  18. [18]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021. 6

  19. [19]

    Photo- realistic single image super-resolution using a generative ad- versarial network

    Christian Ledig, Lucas Theis, Ferenc Husz´ar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo- realistic single image super-resolution using a generative ad- versarial network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690,

  20. [20]

    Srdiff: Single image super-resolution with diffusion probabilistic models

    Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022. 1, 2, 3, 5

  21. [21]

    Lsdir: A large scale dataset for image restoration

    Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Deman- dolx, et al. Lsdir: A large scale dataset for image restoration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1775–1787, 2023. 6

  22. [22]

    Swinir: Image restoration us- ing swin transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration us- ing swin transformer. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 1833–1844,

  23. [23]

    Diff- bir: Toward blind image restoration with generative diffusion prior

    Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. Diff- bir: Toward blind image restoration with generative diffusion prior. InEuropean Conference on Computer Vision, pages 430–448. Springer, 2024. 1

  24. [24]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024. 3

  25. [25]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 4, 6

  26. [26]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- 9 els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1, 3

  27. [27]

    Xpsr: Cross-modal priors for diffusion-based image super-resolution

    Yunpeng Qu, Kun Yuan, Kai Zhao, Qizhi Xie, Jinhua Hao, Ming Sun, and Chao Zhou. Xpsr: Cross-modal priors for diffusion-based image super-resolution. InEuropean Con- ference on Computer Vision, pages 285–303. Springer, 2024. 3

  28. [28]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

  29. [29]

    Image super-resolution via iterative refinement.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713– 4726, 2022

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Sal- imans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713– 4726, 2022. 1, 2, 3, 4, 5

  30. [30]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 4

  31. [31]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions.arXiv preprint arXiv:2011.13456, 2020. 1, 3

  32. [32]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.arXiv preprint arXiv:2303.01469, 2023. 3

  33. [33]

    Improving the stability of dif- fusion models for content consistent super-resolution.arXiv preprint arXiv:2401.00877, 2023

    Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hong- wei Yong, and Lei Zhang. Improving the stability of dif- fusion models for content consistent super-resolution.arXiv preprint arXiv:2401.00877, 2023. 1, 3

  34. [34]

    Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach.arXiv preprint arXiv:2412.03017, 2024

    Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach.arXiv preprint arXiv:2412.03017, 2024. 3, 6, 15

  35. [35]

    Any-step generation via n-th order re- cursive consistent velocity field estimation

    Peng Sun and Tao Lin. Any-step generation via n-th order re- cursive consistent velocity field estimation. InInternational Conference on Learning Representations, 2026. 6, 12

  36. [36]

    Instantcharacter: Personalize any characters with a scalable diffusion transformer frame- work.arXiv preprint arXiv:2504.12395, 2025

    Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Hao- fan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, et al. Instantcharacter: Personalize any characters with a scalable diffusion transformer frame- work.arXiv preprint arXiv:2504.12395, 2025. 3

  37. [37]

    Exploiting diffusion prior for real-world image super-resolution.arXiv preprint arXiv:2305.07015, 2023

    Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution.arXiv preprint arXiv:2305.07015, 2023. 1, 3, 6, 15

  38. [38]

    Esrgan: En- hanced super-resolution generative adversarial networks

    Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: En- hanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018. 1, 3

  39. [39]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 1905–1914,

  40. [40]

    Sinsr: Diffusion-based image super- resolution in a single step.arXiv preprint arXiv:2311.14760,

    Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. Sinsr: Diffusion-based image super- resolution in a single step.arXiv preprint arXiv:2311.14760,

  41. [41]

    Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 6

  42. [42]

    Component divide- and-conquer for real-world image super-resolution

    Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixi- ang Ye, Wangmeng Zuo, and Liang Lin. Component divide- and-conquer for real-world image super-resolution. InCom- puter Vision–ECCV 2020: 16th European Conference, Glas- gow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pages 101–117. Springer, 2020. 14

  43. [43]

    Qwen-image technical report,

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, De- qing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

  44. [44]

    One-step effective diffusion network for real-world image super-resolution.Advances in Neural Information Process- ing Systems, 37:92529–92553, 2024

    Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution.Advances in Neural Information Process- ing Systems, 37:92529–92553, 2024. 3, 6, 15

  45. [45]

    Seesr: Towards semantics- aware real-world image super-resolution

    Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics- aware real-world image super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25456–25467, 2024. 1, 3, 5, 6, 8, 15

  46. [46] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022. 6

  47. [47] Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, Qinglin Lu, and Jing Liao. Effectmaker: Unifying reasoning and generation for customized visual effect creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. 3

  48. [48] Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. arXiv preprint arXiv:2308.14469, 2023. 1, 3, 6, 15

  49. [49] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025. 4, 6

  50. [50] Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Yuhui Wu, and Lei Zhang. Fine-structure preserved real-world image super-resolution via transfer VAE training. arXiv preprint arXiv:2507.20291, 2025. 3

  51. [51] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024. 3

  52. [52] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting. arXiv preprint arXiv:2307.12348, 2023. 1, 2, 3, 4, 5, 6, 15

  53. [53] Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inversion. arXiv preprint arXiv:2412.09013, 2024. 3

  54. [54] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021. 3

  55. [55] Lin Zhang, Lei Zhang, and Alan C Bovik. A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing, 24(8):2579–2591, 2015. 6

  56. [56] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 3

  57. [57] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. 3, 6

  58. [58] Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super-resolution. In European Conference on Computer Vision, pages 649–667. Springer, 2022. 1, 3

  59. [59] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018. 1, 3

  60. [60] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018. 3

A. Appendix

This appendix presents distillation details, ScreenSR benchmark details, training settings, ablation studies, user study r...