Recognition: 2 theorem links · Lean Theorem
VOSR: A Vision-Only Generative Model for Image Super-Resolution
Pith reviewed 2026-05-13 20:29 UTC · model grok-4.3
The pith
A vision-only generative model achieves high-quality image super-resolution without any text-to-image pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VOSR shows that a generative super-resolution model trained purely on visual data can match the perceptual quality and efficiency of text-to-image diffusion methods. It does so by extracting spatially grounded semantic features from the low-resolution input with a pretrained vision encoder and by replacing standard classifier-free guidance with a restoration-oriented strategy that preserves weak LR anchors. After distillation to one step, the model delivers competitive results on synthetic and real-world benchmarks at under one-tenth the training cost of representative T2I-based approaches.
What carries the argument
The VOSR framework: features from a pretrained vision encoder applied to the LR input serve as visual semantic guidance, paired with a restoration-oriented guidance strategy that substitutes for the unconditional branch in classifier-free guidance to maintain structural fidelity during training from scratch.
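A minimal sketch of the guidance change described above, assuming a generic noise- or velocity-prediction network `model`. The weak-LR-anchor branch below is a hypothetical stand-in for the paper's restoration-oriented guidance, whose exact formulation is not reproduced here; standard classifier-free guidance extrapolates away from a fully unconditional prediction, while the restoration-oriented variant keeps a weakly LR-conditioned branch so the extrapolation sharpens semantics without drifting off the input structure.

```python
# Hedged sketch only: the actual VOSR guidance rule may differ.
import torch


def cfg(model, x_t, t, cond, null_cond, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: extrapolate the conditional
    prediction away from a fully unconditional one."""
    v_cond = model(x_t, t, cond)
    v_uncond = model(x_t, t, null_cond)
    return v_uncond + scale * (v_cond - v_uncond)


def restoration_guidance(model, x_t, t, full_cond, weak_lr_cond, scale: float) -> torch.Tensor:
    """Restoration-oriented variant (assumed form): the negative branch
    is not unconditional but keeps a weak LR anchor, e.g. a heavily
    degraded conditioning signal, so guidance stays tied to the input."""
    v_full = model(x_t, t, full_cond)
    v_weak = model(x_t, t, weak_lr_cond)
    return v_weak + scale * (v_full - v_weak)
```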
If this is right
- High-quality generative super-resolution becomes possible without access to massive multimodal pretraining datasets or models.
- Training costs for such models drop by more than a factor of ten while perceptual quality and structural faithfulness remain at least as high.
- The distilled one-step version retains the quality gains, enabling efficient inference on both synthetic and real-world images.
- Fewer hallucinations appear because the model stays anchored to the input rather than freely generating from text priors.
- The same visual-guidance and restoration-oriented strategy can be applied to other image-restoration tasks that currently rely on text-to-image backbones.
Where Pith is reading between the lines
- Vision-only training may lower the barrier for researchers without large-scale text-image compute resources to develop competitive generative restoration models.
- The approach invites direct tests on whether the same encoder-plus-guidance pattern improves other conditional generation tasks such as inpainting or deblurring.
- If the vision encoder features prove sufficient across domains, future models could drop text conditioning entirely for restoration while retaining generative flexibility.
- Longer training or larger vision encoders might close any remaining quality gaps on the hardest real-world cases.
Load-bearing premise
A pretrained vision encoder can extract semantically rich and spatially accurate features from the low-resolution input that are sufficient to replace the semantic prior normally supplied by a text-to-image model.
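A minimal sketch of what this premise asks of the encoder, assuming DINOv2 (which the paper cites) as the pretrained vision encoder; the encoder VOSR actually uses, and how its features are injected into the generator, may differ.

```python
# Hedged sketch: dense semantic features from the LR input via DINOv2.
import torch
import torch.nn.functional as F

encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()


@torch.no_grad()
def lr_semantic_features(lr_image: torch.Tensor, size: int = 224) -> torch.Tensor:
    """lr_image: (B, 3, h, w) in [0, 1]. Returns (B, N, C) patch tokens,
    spatially ordered, that could condition the SR network (e.g. via
    cross-attention)."""
    x = F.interpolate(lr_image, size=(size, size), mode="bicubic", align_corners=False)
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)  # ImageNet stats
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    out = encoder.forward_features((x - mean) / std)
    return out["x_norm_patchtokens"]
```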
What would settle it
A head-to-head evaluation on a real-world benchmark such as RealSR in which the VOSR one-step model produces measurably higher LPIPS error or visibly more structural hallucinations than a representative T2I-based method would falsify the claim of competitive or superior performance.
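A minimal sketch of that falsification test, assuming paired LR/HR data; `vosr_one_step`, `t2i_baseline`, and the benchmark loader are placeholders, and the `lpips` package is one standard implementation of the metric named above.

```python
# Hedged sketch: paired LPIPS on a real-world benchmark, lower is better.
import torch
import lpips  # pip install lpips

metric = lpips.LPIPS(net="alex")


@torch.no_grad()
def mean_paired_lpips(sr_model, pairs) -> torch.Tensor:
    """pairs yields (lr, hr) tensors in [-1, 1]; returns the mean LPIPS
    between sr_model(lr) and the ground-truth hr."""
    scores = [metric(sr_model(lr), hr).mean() for lr, hr in pairs]
    return torch.stack(scores).mean()

# The competitiveness claim would fail if mean_paired_lpips(vosr_one_step, bench)
# were clearly worse than mean_paired_lpips(t2i_baseline, bench) on the same split.
```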
Original abstract
Most of the recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, despite that SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results, for the first time, show that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VOSR, a vision-only generative framework for image super-resolution. It extracts semantically rich features from low-resolution inputs via a pretrained vision encoder and replaces standard classifier-free guidance with a restoration-oriented strategy that preserves weak LR anchors. A multi-step model is trained from scratch on visual data only and then distilled to a one-step model. The work claims that VOSR achieves competitive or superior perceptual quality and fewer hallucinations than T2I-adapted baselines on synthetic and real-world benchmarks while requiring less than one-tenth the training cost, thereby demonstrating that high-quality generative SR is possible without multimodal pretraining.
Significance. If the empirical claims hold under rigorous verification, the result would be significant for the field: it provides the first demonstration that a purely vision-based generative SR model trained from scratch can match or exceed the perceptual performance of methods built on large T2I diffusion backbones. The public release of code and models at the cited GitHub repository is a clear strength that supports reproducibility. The work also highlights a practical design choice (restoration-oriented guidance) that may generalize to other restoration tasks where multimodal priors are unavailable or undesirable.
major comments (3)
- [Method] The central claim that a standard pretrained vision encoder supplies spatially grounded semantics sufficient to replace T2I generative priors is load-bearing yet rests on an untested assumption. The method section does not include an ablation that isolates the contribution of the vision-encoder features versus the guidance modification, nor does it quantify how much semantic richness is actually transferred to the dense generative task.
- [Experiments] §4 (Experiments): the reported gains in perceptual quality and reduced hallucinations are presented without full details on data splits, baseline re-implementations, or hyper-parameter search procedures. This leaves open the possibility that the comparisons contain post-hoc choices that affect the central claim of outperforming T2I-based methods.
- [Method] The restoration-oriented guidance is described as preserving LR anchors, but no quantitative analysis (e.g., drift metrics or structure-preservation scores) is provided to show that the modified guidance actually prevents mode collapse or over-smoothing relative to standard CFG on the same backbone.
minor comments (2)
- [Abstract] Abstract: the phrase 'for the first time' should be qualified with a precise citation to the closest prior vision-only generative SR attempts so readers can immediately assess novelty.
- [Method] The one-step distillation procedure is mentioned only briefly; a short paragraph or diagram clarifying the distillation loss and how it preserves the multi-step model's perceptual advantages would improve clarity (a generic sketch is given below).
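One way such a clarification could look, as a minimal sketch under stated assumptions: the paper only briefly describes its distillation, so the loss below is a generic output-matching objective (L1 plus LPIPS against the frozen multi-step teacher), not VOSR's actual procedure, which may instead use score- or distribution-matching distillation.

```python
# Hedged sketch of generic one-step output-matching distillation.
import torch
import torch.nn.functional as F
import lpips

perceptual = lpips.LPIPS(net="alex")


def distill_step(student, teacher_sample_fn, lr, optimizer, lam: float = 1.0) -> float:
    """One optimization step. student(lr) -> HR in a single pass;
    teacher_sample_fn(lr) -> HR via the frozen multi-step sampler.
    All images are assumed to be in [-1, 1]."""
    with torch.no_grad():
        target = teacher_sample_fn(lr)
    pred = student(lr)
    loss = F.l1_loss(pred, target) + lam * perceptual(pred, target).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```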
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript to incorporate additional ablations, experimental details, and quantitative analyses where appropriate. These changes strengthen the presentation of our claims without altering the core contributions.
Point-by-point responses
Referee: [Method] The central claim that a standard pretrained vision encoder supplies spatially grounded semantics sufficient to replace T2I generative priors is load-bearing yet rests on an untested assumption. The method section does not include an ablation that isolates the contribution of the vision-encoder features versus the guidance modification, nor does it quantify how much semantic richness is actually transferred to the dense generative task.
Authors: We agree that an explicit ablation isolating the vision encoder contribution would provide stronger support. In the revised manuscript, we have added Section 3.4 with a new ablation study comparing: (i) the full VOSR model, (ii) a variant using only restoration-oriented guidance (no vision encoder), and (iii) a variant with vision encoder features but standard CFG. We also report quantitative metrics including average cosine similarity between vision-encoder features and intermediate generative features, as well as downstream task performance (e.g., semantic segmentation accuracy on generated outputs) to demonstrate the transfer of spatially grounded semantics. These results confirm the vision encoder's role in replacing T2I priors. revision: yes
Referee: [Experiments] §4 (Experiments): the reported gains in perceptual quality and reduced hallucinations are presented without full details on data splits, baseline re-implementations, or hyper-parameter search procedures. This leaves open the possibility that the comparisons contain post-hoc choices that affect the central claim of outperforming T2I-based methods.
Authors: We acknowledge the need for greater transparency. The revised Section 4.1 now includes: complete specifications of all training/validation/test splits for synthetic (e.g., DIV2K, Flickr2K) and real-world benchmarks; detailed re-implementation protocols for T2I baselines (including exact adaptation steps, training iterations, and any hyper-parameter adjustments made to match our evaluation setup); and the full hyper-parameter search ranges with final selected values for VOSR and all baselines. We also added a note confirming that all methods were evaluated under identical protocols, with code for reproduction released in the GitHub repository. revision: yes
Referee: [Method] The restoration-oriented guidance is described as preserving LR anchors, but no quantitative analysis (e.g., drift metrics or structure-preservation scores) is provided to show that the modified guidance actually prevents mode collapse or over-smoothing relative to standard CFG on the same backbone.
Authors: We thank the referee for this suggestion. In the revised manuscript, we have added quantitative validation in Section 3.2 and the experiments: we report LPIPS-based perceptual drift scores, edge preservation metrics (Sobel gradient similarity), and a diversity index (standard deviation across 10 stochastic samples) to compare restoration-oriented guidance against standard CFG on the identical VOSR backbone. These metrics demonstrate reduced over-smoothing and mode collapse, with specific numerical improvements listed in a new table. The analysis directly supports that the modified guidance better preserves LR anchors. revision: yes
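A minimal sketch of the diagnostics named in this response, under assumed definitions (LPIPS drift from the upsampled LR anchor, Sobel-gradient similarity, and per-pixel standard deviation across stochastic samples); the exact definitions used in the revision may differ.

```python
# Hedged sketch of guidance diagnostics; metric definitions are assumptions.
import torch
import torch.nn.functional as F
import lpips

_lpips = lpips.LPIPS(net="alex")


def perceptual_drift(sr: torch.Tensor, lr_up: torch.Tensor) -> torch.Tensor:
    """LPIPS between the SR output and the bicubically upsampled LR
    anchor; both (B, 3, H, W) in [-1, 1]. Higher means more drift."""
    return _lpips(sr, lr_up).mean()


def sobel_edge_similarity(sr: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of Sobel gradient magnitudes (grayscale)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)

    def grad_mag(img: torch.Tensor) -> torch.Tensor:
        g = img.mean(dim=1, keepdim=True)  # to grayscale
        gx, gy = F.conv2d(g, kx, padding=1), F.conv2d(g, ky, padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

    a, b = grad_mag(sr).flatten(1), grad_mag(ref).flatten(1)
    return F.cosine_similarity(a, b, dim=1).mean()


def diversity_index(samples: torch.Tensor) -> torch.Tensor:
    """samples: (K, B, 3, H, W), K stochastic outputs for the same input;
    mean per-pixel standard deviation across the K samples."""
    return samples.std(dim=0).mean()
```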
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper proposes an empirical framework (pretrained vision encoder for semantic features + restoration-oriented guidance replacing the unconditional branch in classifier-free guidance) and validates it through training from scratch plus distillation, with results measured against external T2I-based baselines on synthetic and real-world benchmarks. No mathematical derivation, fitted parameter, or self-citation chain is load-bearing; the central claim that vision-only training suffices is supported by direct performance comparisons rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: A pretrained vision encoder extracts semantically rich and spatially grounded features from low-resolution inputs that are sufficient for high-quality generative restoration.
- Domain assumption: Standard unconditional classifier-free guidance is ill-suited to restoration models trained from scratch.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance... replace it with a restoration-oriented guidance strategy that preserves weak LR anchors."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.