Pith · machine review for the scientific record

arxiv: 2604.03225 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

VOSR: A Vision-Only Generative Model for Image Super-Resolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords image super-resolution · generative models · vision-only training · diffusion models · classifier-free guidance · image restoration · one-step distillation · perceptual quality

The pith

A vision-only generative model achieves high-quality image super-resolution without any text-to-image pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether generative super-resolution must start from large text-to-image diffusion models pretrained on web-scale data. It introduces VOSR, which trains a diffusion model from scratch using only visual inputs: a pretrained vision encoder supplies semantic and spatial features directly from the low-resolution image, while a new restoration-oriented guidance strategy replaces the usual unconditional branch to keep the output anchored to the input. The resulting multi-step model is distilled into a fast one-step version. On both synthetic and real-world benchmarks, VOSR matches or exceeds the perceptual quality of text-to-image baselines, produces fewer hallucinations, and requires less than one-tenth the training compute. This shows that multimodal pretraining is not required for competitive generative restoration.

Core claim

VOSR shows that a generative super-resolution model trained purely on visual data can match the perceptual quality and efficiency of text-to-image diffusion methods. It does so by extracting spatially grounded semantic features from the low-resolution input via a pretrained vision encoder, and by replacing standard classifier-free guidance with a restoration-oriented strategy that preserves weak LR anchors. After distillation to one step, the model delivers competitive results on synthetic and real-world benchmarks at under one-tenth the training cost of representative T2I-based approaches.

What carries the argument

The VOSR framework: features from a pretrained vision encoder applied to the LR input serve as visual semantic guidance, paired with a restoration-oriented guidance strategy that substitutes for the unconditional branch in classifier-free guidance to maintain structural fidelity during training from scratch.
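The guidance substitution can be pictured schematically. In standard classifier-free guidance the sampler extrapolates away from an unconditional prediction; a restoration-oriented variant could instead extrapolate away from a prediction tied to a weak LR anchor, keeping the output close to the input. The function names and the anchoring rule below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cfg_guidance(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance: extrapolate away from the
    unconditional branch toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def restoration_guidance(eps_cond, eps_anchor, scale):
    """Hypothetical restoration-oriented variant: the unconditional
    branch is replaced by a prediction conditioned only on a weak LR
    anchor, so extrapolation stays tethered to the input image."""
    return eps_anchor + scale * (eps_cond - eps_anchor)

# Toy noise predictions for a 4x4 "latent"
rng = np.random.default_rng(0)
eps_cond = rng.normal(size=(4, 4))
eps_uncond = rng.normal(size=(4, 4))
eps_anchor = 0.5 * eps_cond + 0.5 * eps_uncond  # stand-in weak anchor

out = restoration_guidance(eps_cond, eps_anchor, scale=3.0)
```

At scale 1 either rule reduces to the conditional prediction; the two differ only in what the scale extrapolates away from, which is exactly where the paper locates the fragility of the standard unconditional branch.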

If this is right

  • High-quality generative super-resolution becomes possible without access to massive multimodal pretraining datasets or models.
  • Training costs for such models drop by more than a factor of ten while perceptual quality and structural faithfulness remain at least as high.
  • The distilled one-step version retains the quality gains, enabling efficient inference on both synthetic and real-world images.
  • Fewer hallucinations appear because the model stays anchored to the input rather than freely generating from text priors.
  • The same visual-guidance and restoration-oriented strategy can be applied to other image-restoration tasks that currently rely on text-to-image backbones.
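The one-step distillation mentioned above can be pictured generically: run the multi-step teacher to a clean estimate, then regress a one-step student onto that target. The toy denoiser and the MSE objective below are placeholders, not the paper's distillation loss.

```python
import numpy as np

def teacher_multistep(x_t, steps, denoise):
    """Toy multi-step sampler: repeatedly move partway toward the
    denoiser's estimate (stand-in for a diffusion teacher)."""
    x = x_t
    for _ in range(steps):
        x = x + 0.5 * (denoise(x) - x)
    return x

def distill_loss(student_out, teacher_out):
    """Regression-style distillation: match the teacher's final
    output in a single student step (MSE as a placeholder)."""
    return float(np.mean((student_out - teacher_out) ** 2))

denoise = lambda x: 0.9 * x      # toy denoiser
x_t = np.ones((8, 8))
target = teacher_multistep(x_t, steps=4, denoise=denoise)
student = lambda x: 0.8 * x      # untrained one-step student
loss = distill_loss(student(x_t), target)
```

The student never sees the intermediate steps, only the teacher's endpoint, which is why a successful distillation preserves the multi-step model's perceptual behavior at one-step cost.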

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Vision-only training may lower the barrier for researchers without large-scale text-image compute resources to develop competitive generative restoration models.
  • The approach invites direct tests on whether the same encoder-plus-guidance pattern improves other conditional generation tasks such as inpainting or deblurring.
  • If the vision encoder features prove sufficient across domains, future models could drop text conditioning entirely for restoration while retaining generative flexibility.
  • Longer training or larger vision encoders might close any remaining quality gaps on the hardest real-world cases.

Load-bearing premise

A pretrained vision encoder can extract semantically rich and spatially accurate features from the low-resolution input that are sufficient to replace the semantic prior normally supplied by a text-to-image model.
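This premise amounts to a simple conditioning path: patchify the LR image and embed each patch with a frozen encoder, yielding a spatial grid of semantic tokens for the generator to attend to. The encoder below is a stand-in (a fixed random projection), not DINO or the paper's actual model.

```python
import numpy as np

def patchify(img, patch):
    """Split an HxWxC image into non-overlapping (patch x patch) tokens."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    img = img[: gh * patch, : gw * patch]
    tokens = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return tokens.reshape(gh * gw, patch * patch * c)

class FrozenEncoder:
    """Stand-in for a pretrained vision encoder (e.g., a DINO-style
    ViT): a fixed random linear projection of patch tokens."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))

    def __call__(self, tokens):
        return tokens @ self.w  # (num_patches, out_dim) semantic grid

lr = np.random.default_rng(1).random((64, 64, 3))  # toy LR image
tokens = patchify(lr, patch=8)                     # 64 spatial tokens
enc = FrozenEncoder(in_dim=8 * 8 * 3, out_dim=32)
cond = enc(tokens)                                 # spatially grounded features
```

Because each token keeps its grid position, the conditioning is spatially grounded by construction; whether the features are semantically rich enough to stand in for a T2I prior is precisely what the premise asserts.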

What would settle it

A head-to-head evaluation on a real-world benchmark (for instance, one built with the Real-ESRGAN degradation pipeline) in which the VOSR one-step model produces measurably higher LPIPS error or visibly more structural hallucinations than a representative T2I-based method would falsify the claim of competitive or superior performance.

Figures

Figures reproduced from arXiv: 2604.03225 by Jixin Zhao, Lei Zhang, Lingchen Sun, Rongyuan Wu, Shihao Wang, Xiangtao Kong, Zhengqiang Zhang.

Figure 1: Comparison of VOSR with existing generative SR methods in terms of …
Figure 2: Overview of VOSR. (a) Framework overview. Given an LR image, VOSR builds two complementary conditions from the …
Figure 3: Multi-step (top) and one-step (bottom) SR visual comparison on RealDeg …
Figure 4: Effect of guidance scale on VOSR-1.4B-ms. As the scale …
Figure 5: Thumbnail montage of the ScreenSR benchmark. The selected 130 GT images cover diverse scenarios, including indoor and …
Figure 6: User study results in the multi-step and one-step settings.
Figure 7: Additional visual comparisons of multi-step (1st, 3rd and 5th) and one-step (2nd, 4th and 6th) SR results. Compared with …
Original abstract

Most of the recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, despite that SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results, for the first time, show that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes VOSR, a vision-only generative framework for image super-resolution. It extracts semantically rich features from low-resolution inputs via a pretrained vision encoder and replaces standard classifier-free guidance with a restoration-oriented strategy that preserves weak LR anchors. A multi-step model is trained from scratch on visual data only and then distilled to a one-step model. The work claims that VOSR achieves competitive or superior perceptual quality and fewer hallucinations than T2I-adapted baselines on synthetic and real-world benchmarks while requiring less than one-tenth the training cost, thereby demonstrating that high-quality generative SR is possible without multimodal pretraining.

Significance. If the empirical claims hold under rigorous verification, the result would be significant for the field: it provides the first demonstration that a purely vision-based generative SR model trained from scratch can match or exceed the perceptual performance of methods built on large T2I diffusion backbones. The public release of code and models at the cited GitHub repository is a clear strength that supports reproducibility. The work also highlights a practical design choice (restoration-oriented guidance) that may generalize to other restoration tasks where multimodal priors are unavailable or undesirable.

major comments (3)
  1. [Method] The central claim that a standard pretrained vision encoder supplies spatially grounded semantics sufficient to replace T2I generative priors is load-bearing yet rests on an untested assumption. The method section does not include an ablation that isolates the contribution of the vision-encoder features versus the guidance modification, nor does it quantify how much semantic richness is actually transferred to the dense generative task.
  2. [Experiments] §4 (Experiments): the reported gains in perceptual quality and reduced hallucinations are presented without full details on data splits, baseline re-implementations, or hyper-parameter search procedures. This leaves open the possibility that the comparisons contain post-hoc choices that affect the central claim of outperforming T2I-based methods.
  3. [Method] The restoration-oriented guidance is described as preserving LR anchors, but no quantitative analysis (e.g., drift metrics or structure-preservation scores) is provided to show that the modified guidance actually prevents mode collapse or over-smoothing relative to standard CFG on the same backbone.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'for the first time' should be qualified with a precise citation to the closest prior vision-only generative SR attempts so readers can immediately assess novelty.
  2. [Method] The one-step distillation procedure is mentioned only briefly; a short paragraph or diagram clarifying the distillation loss and how it preserves the multi-step model's perceptual advantages would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript to incorporate additional ablations, experimental details, and quantitative analyses where appropriate. These changes strengthen the presentation of our claims without altering the core contributions.

Point-by-point responses
  1. Referee: [Method] The central claim that a standard pretrained vision encoder supplies spatially grounded semantics sufficient to replace T2I generative priors is load-bearing yet rests on an untested assumption. The method section does not include an ablation that isolates the contribution of the vision-encoder features versus the guidance modification, nor does it quantify how much semantic richness is actually transferred to the dense generative task.

    Authors: We agree that an explicit ablation isolating the vision encoder contribution would provide stronger support. In the revised manuscript, we have added Section 3.4 with a new ablation study comparing: (i) the full VOSR model, (ii) a variant using only restoration-oriented guidance (no vision encoder), and (iii) a variant with vision encoder features but standard CFG. We also report quantitative metrics including average cosine similarity between vision-encoder features and intermediate generative features, as well as downstream task performance (e.g., semantic segmentation accuracy on generated outputs) to demonstrate the transfer of spatially grounded semantics. These results confirm the vision encoder's role in replacing T2I priors. revision: yes

  2. Referee: [Experiments] §4 (Experiments): the reported gains in perceptual quality and reduced hallucinations are presented without full details on data splits, baseline re-implementations, or hyper-parameter search procedures. This leaves open the possibility that the comparisons contain post-hoc choices that affect the central claim of outperforming T2I-based methods.

    Authors: We acknowledge the need for greater transparency. The revised Section 4.1 now includes: complete specifications of all training/validation/test splits for synthetic (e.g., DIV2K, Flickr2K) and real-world benchmarks; detailed re-implementation protocols for T2I baselines (including exact adaptation steps, training iterations, and any hyper-parameter adjustments made to match our evaluation setup); and the full hyper-parameter search ranges with final selected values for VOSR and all baselines. We also added a note confirming that all methods were evaluated under identical protocols, with code for reproduction released in the GitHub repository. revision: yes

  3. Referee: [Method] The restoration-oriented guidance is described as preserving LR anchors, but no quantitative analysis (e.g., drift metrics or structure-preservation scores) is provided to show that the modified guidance actually prevents mode collapse or over-smoothing relative to standard CFG on the same backbone.

    Authors: We thank the referee for this suggestion. In the revised manuscript, we have added quantitative validation in Section 3.2 and the experiments: we report LPIPS-based perceptual drift scores, edge preservation metrics (Sobel gradient similarity), and a diversity index (standard deviation across 10 stochastic samples) to compare restoration-oriented guidance against standard CFG on the identical VOSR backbone. These metrics demonstrate reduced over-smoothing and mode collapse, with specific numerical improvements listed in a new table. The analysis directly supports that the modified guidance better preserves LR anchors. revision: yes
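The checks named in this (simulated) response can be sketched concretely. The edge-preservation and diversity metrics below are illustrative stand-ins — Sobel-based cosine similarity between gradient maps, and per-pixel standard deviation across stochastic samples — not the paper's evaluation protocol.

```python
import numpy as np

def sobel_mag(img):
    """Gradient magnitude via 3x3 Sobel filters (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            win = img[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)

def edge_similarity(a, b, eps=1e-8):
    """Cosine similarity of Sobel edge maps: a crude
    structure-preservation score (1.0 = identical edges)."""
    ga, gb = sobel_mag(a).ravel(), sobel_mag(b).ravel()
    return float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb) + eps))

def diversity_index(samples):
    """Per-pixel std across stochastic samples, averaged: near zero
    indicates mode collapse."""
    return float(np.stack(samples).std(axis=0).mean())

rng = np.random.default_rng(0)
ref = rng.random((16, 16))
sim_self = edge_similarity(ref, ref)  # identical images score ~1
div = diversity_index([ref + 0.01 * rng.normal(size=ref.shape)
                       for _ in range(10)])
```

Comparing such scores for restoration-oriented guidance versus standard CFG on the same backbone is the shape of the experiment the referee asked for.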

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper proposes an empirical framework (pretrained vision encoder for semantic features + restoration-oriented guidance replacing the unconditional branch in classifier-free guidance) and validates it through training from scratch plus distillation, with results measured against external T2I-based baselines on synthetic and real-world benchmarks. No mathematical derivation, fitted parameter, or self-citation chain is load-bearing; the central claim that vision-only training suffices is supported by direct performance comparisons rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach assumes standard diffusion training dynamics and that a frozen pretrained vision encoder supplies adequate semantic conditioning; no new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption A pretrained vision encoder extracts semantically rich and spatially grounded features from low-resolution inputs that are sufficient for high-quality generative restoration.
    Invoked in the first paragraph of the abstract as the source of visual semantic guidance.
  • domain assumption Standard unconditional classifier-free guidance is ill-suited to restoration models trained from scratch.
    Stated explicitly when introducing the restoration-oriented guidance replacement.

pith-pipeline@v0.9.0 · 5594 in / 1496 out tokens · 35830 ms · 2026-05-13T20:29:28.909307+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 10 internal anchors

  1. [1]

    ai / stable - diffusion

    Stability.ai.https : / / stability . ai / stable - diffusion. 1

  2. [2]

    Dream- clear: High-capacity real-world image restoration with privacy-safe dataset curation.Advances in Neural Informa- tion Processing Systems, 37:55443–55469, 2024

    Yuang Ai, Xiaoqiang Zhou, Huaibo Huang, Xiaotian Han, Zhengyu Chen, Quanzeng You, and Hongxia Yang. Dream- clear: High-capacity real-world image restoration with privacy-safe dataset curation.Advances in Neural Informa- tion Processing Systems, 37:55443–55469, 2024. 3

  3. [3]

    Toward real-world single image super-resolution: A new benchmark and a new model

    Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019. 2, 6, 14

  4. [4]

    Adversarial diffusion compression for real-world image super-resolution.arXiv preprint arXiv:2411.13383, 2024

    Bin Chen, Gehui Li, Rongyuan Wu, Xindong Zhang, Jie Chen, Jian Zhang, and Lei Zhang. Adversarial diffusion compression for real-world image super-resolution.arXiv preprint arXiv:2411.13383, 2024. 3, 6

  5. [5]

    Topiq: A top-down approach from semantics to distortions for image quality assessment.IEEE Transactions on Image Processing,

    Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin. Topiq: A top-down approach from semantics to distortions for image quality assessment.IEEE Transactions on Image Processing,

  6. [6]

    To- ward generalized image quality assessment: Relaxing the perfect reference quality assumption.arXiv preprint arXiv:2503.11221, 2025

    Du Chen, Tianhe Wu, Kede Ma, and Lei Zhang. To- ward generalized image quality assessment: Relaxing the perfect reference quality assumption.arXiv preprint arXiv:2503.11221, 2025. 6

  7. [7]

    Faithd- iff: Unleashing diffusion priors for faithful image super- resolution

    Junyang Chen, Jinshan Pan, and Jiangxin Dong. Faithd- iff: Unleashing diffusion priors for faithful image super- resolution. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 28188–28197, 2025. 1, 3, 5, 8

  8. [8]

    Activating more pixels in image super- resolution transformer

    Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super- resolution transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22367–22377, 2023. 1, 3

  9. [9]

    Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE transactions on pattern analysis and ma- chine intelligence, 44(5):2567–2581, 2020. 6

  10. [10]

    Learning a deep convolutional network for image super-resolution

    Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pages 184–199. Springer,

  11. [11]

    Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution

    Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, and Changqing Zou. Tsd-sr: One-step diffusion with target score distillation for real-world image super-resolution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23174–23184, 2025. 3

  12. [12]

    Dit4sr: Taming diffusion trans- former for real-world image super-resolution.arXiv preprint arXiv:2503.23580, 2025

    Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy S Ren, Chun-Le Guo, and Chongyi Li. Dit4sr: Taming diffusion trans- former for real-world image super-resolution.arXiv preprint arXiv:2503.23580, 2025. 1, 3, 6

  13. [13]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024. 1, 3

  14. [14]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024. 3, 6, 12

  15. [15]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step genera- tive modeling.arXiv preprint arXiv:2505.13447, 2025. 3

  16. [16]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 2, 4

  17. [17]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 3

  18. [18]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021. 6

  19. [19]

    Photo- realistic single image super-resolution using a generative ad- versarial network

    Christian Ledig, Lucas Theis, Ferenc Husz´ar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo- realistic single image super-resolution using a generative ad- versarial network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690,

  20. [20]

    Srdiff: Single image super-resolution with diffusion probabilistic models

    Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022. 1, 2, 3, 5

  21. [21]

    Lsdir: A large scale dataset for image restoration

    Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Deman- dolx, et al. Lsdir: A large scale dataset for image restoration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1775–1787, 2023. 6

  22. [22]

    Swinir: Image restoration us- ing swin transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration us- ing swin transformer. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 1833–1844,

  23. [23]

    Diff- bir: Toward blind image restoration with generative diffusion prior

    Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. Diff- bir: Toward blind image restoration with generative diffusion prior. InEuropean Conference on Computer Vision, pages 430–448. Springer, 2024. 1

  24. [24]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models.arXiv preprint arXiv:2410.11081, 2024. 3

  25. [25]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 4, 6

  26. [26]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- 9 els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1, 3

  27. [27]

    Xpsr: Cross-modal priors for diffusion-based image super-resolution

    Yunpeng Qu, Kun Yuan, Kai Zhao, Qizhi Xie, Jinhua Hao, Ming Sun, and Chao Zhou. Xpsr: Cross-modal priors for diffusion-based image super-resolution. InEuropean Con- ference on Computer Vision, pages 285–303. Springer, 2024. 3

  28. [28]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

  29. [29]

    Image super-resolution via iterative refinement.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713– 4726, 2022

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Sal- imans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713– 4726, 2022. 1, 2, 3, 4, 5

  30. [30]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 4

  31. [31]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions.arXiv preprint arXiv:2011.13456, 2020. 1, 3

  32. [32]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.arXiv preprint arXiv:2303.01469, 2023. 3

  33. [33]

    Improving the stability of dif- fusion models for content consistent super-resolution.arXiv preprint arXiv:2401.00877, 2023

    Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hong- wei Yong, and Lei Zhang. Improving the stability of dif- fusion models for content consistent super-resolution.arXiv preprint arXiv:2401.00877, 2023. 1, 3

  34. [34]

    Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach.arXiv preprint arXiv:2412.03017, 2024

    Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach.arXiv preprint arXiv:2412.03017, 2024. 3, 6, 15

  35. [35]

    Any-step generation via n-th order re- cursive consistent velocity field estimation

    Peng Sun and Tao Lin. Any-step generation via n-th order re- cursive consistent velocity field estimation. InInternational Conference on Learning Representations, 2026. 6, 12

  36. [36]

    Instantcharacter: Personalize any characters with a scalable diffusion transformer frame- work.arXiv preprint arXiv:2504.12395, 2025

    Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Hao- fan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, et al. Instantcharacter: Personalize any characters with a scalable diffusion transformer frame- work.arXiv preprint arXiv:2504.12395, 2025. 3

  37. [37]

    Exploiting diffusion prior for real-world image super-resolution.arXiv preprint arXiv:2305.07015, 2023

    Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution.arXiv preprint arXiv:2305.07015, 2023. 1, 3, 6, 15

  38. [38]

    Esrgan: En- hanced super-resolution generative adversarial networks

    Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: En- hanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018. 1, 3

  39. [39]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 1905–1914,

  40. [40]

    Sinsr: Diffusion-based image super- resolution in a single step.arXiv preprint arXiv:2311.14760,

    Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. Sinsr: Diffusion-based image super- resolution in a single step.arXiv preprint arXiv:2311.14760,

  41. [41]

    Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 6

  42. [42]

    Component divide- and-conquer for real-world image super-resolution

    Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixi- ang Ye, Wangmeng Zuo, and Liang Lin. Component divide- and-conquer for real-world image super-resolution. InCom- puter Vision–ECCV 2020: 16th European Conference, Glas- gow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pages 101–117. Springer, 2020. 14

  43. [43]

    Qwen-image technical report,

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, De- qing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

  44. [44]

    One-step effective diffusion network for real-world image super-resolution.Advances in Neural Information Process- ing Systems, 37:92529–92553, 2024

    Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution.Advances in Neural Information Process- ing Systems, 37:92529–92553, 2024. 3, 6, 15

  45. [45]

    Seesr: Towards semantics- aware real-world image super-resolution

    Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics- aware real-world image super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25456–25467, 2024. 1, 3, 5, 6, 8, 15

  46. [46] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022. 6

  47. [47] Shiyuan Yang, Ruihuang Li, Jiale Tao, Shuai Shao, Qinglin Lu, and Jing Liao. Effectmaker: Unifying reasoning and generation for customized visual effect creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. 3

  48. [48] Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. arXiv preprint arXiv:2308.14469, 2023. 1, 3, 6, 15

  49. [49] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025. 4, 6

  50. [50] Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Yuhui Wu, and Lei Zhang. Fine-structure preserved real-world image super-resolution via transfer VAE training. arXiv preprint arXiv:2507.20291, 2025. 3

  51. [51] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024. 3

  52. [52] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting. arXiv preprint arXiv:2307.12348, 2023. 1, 2, 3, 4, 5, 6, 15

  53. [53] Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inversion. arXiv preprint arXiv:2412.09013, 2024. 3

  54. [54] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021. 3

  55. [55] Lin Zhang, Lei Zhang, and Alan C Bovik. A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing, 24(8):2579–2591, 2015. 6

  56. [56] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 3

  57. [57] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. 3, 6

  58. [58] Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super-resolution. In European Conference on Computer Vision, pages 649–667. Springer, 2022. 1, 3

  59. [59] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018. 1, 3

  60. [60] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018. 3

A. Appendix

This appendix presents distillation details, ScreenSR benchmark details, training settings, ablation studies, user study r...