pith. sign in

arxiv: 2505.18600 · v3 · submitted 2025-05-24 · 💻 cs.CV · cs.AI· cs.LG

Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Pith reviewed 2026-05-19 13:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords super-resolutionextreme magnificationdiffusion modelsvision-language modelsautoregressive generationpreference alignmentimage scaling
0
0 comments X

The pith

A standard 4x diffusion super-resolution model chained through intermediate scales reaches beyond 256x enlargement while preserving perceptual quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Chain-of-Zoom as a way to extend single-image super-resolution far past the scales on which existing models were trained. It decomposes the task into an autoregressive sequence of smaller zoom steps that reuse the same backbone model, each guided by text prompts from a vision-language model. Those prompts are further tuned with preference optimization so they stay aligned with human judgment even as image details fade at high magnifications. The result is extreme enlargement without retraining any component. Readers would care because current models simply fail when asked to produce usable output at these levels.

Core claim

Factorizing extreme super-resolution into an autoregressive chain of intermediate scale states, each augmented by multi-scale-aware text prompts generated by a vision-language model and aligned via Generalized Reward Policy Optimization, lets a standard 4x diffusion model produce images beyond 256x with high perceptual quality and fidelity.

What carries the argument

Chain-of-Zoom (CoZ), an autoregressive factorization that decomposes the conditional probability of large-scale enlargement into repeated applications of a fixed backbone SR model plus VLM-generated prompts at each step.

If this is right

  • A pretrained 4x diffusion model can be reused directly for 256x or greater enlargement without any new training.
  • Perceptual quality and fidelity hold at extreme magnifications when guidance is supplied at every intermediate scale.
  • Multi-scale-aware text prompts compensate for the loss of visual detail that occurs at high zoom factors.
  • Preference alignment of the prompt generator improves output alignment with human judgments across the chain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stepwise decomposition could be tested on other generative tasks where direct large jumps currently fail, such as video frame synthesis.
  • Reducing the number of zoom steps while keeping prompt quality high would be a direct next measurement to assess efficiency.
  • The approach implies that preference-tuned language guidance can substitute for missing visual information in any iterative image process.

Load-bearing premise

Errors do not accumulate across the sequence of intermediate scales and the vision-language prompts remain accurate rather than hallucinated once most original visual cues have disappeared.

What would settle it

A side-by-side comparison at 256x showing whether output images contain accumulating artifacts or lose fidelity relative to the sequence of lower-scale intermediates.

Figures

Figures reproduced from arXiv: 2505.18600 by Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye.

Figure 1
Figure 1. Figure 1: Extreme super-resolution of photorealistic images by CoZ with up to 64 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) Conventional SR. When an SR backbone trained for a fixed up-scale factor (e.g., 4×) is pushed to much larger magnifications beyond its training regime, blur and artifacts are produced. (b) Chain-of-Zoom (ours). Starting from an LR input, a pretrained VLM generates a descriptive prompt, which—together with the image—is fed to the same SR backbone to yield the next HR scale-state. This prompt-and-upscale… view at source ↗
Figure 3
Figure 3. Figure 3: Significance of proposed multi-scale-aware prompts: (a) Null prompt: coarse structure is retained, but high-frequency details are smoothed out. (b) DAPE prompt: inserting text from a degradation-aware prompt extractor (DAPE) helps, yet the images lack intricate detail at large magnifications. (c) VLM-generated prompts (ours): multi-scale prompts extracted by a VLM steer the SR backbone to synthesize realis… view at source ↗
Figure 4
Figure 4. Figure 4: GRPO Training Framework. At every zoom step, multi-scale image crops are fed to the base VLM, which generates candidate prompts after perceiving input images. A critic VLM scores the prompt for semantic quality, while phrase-exclusion and repetition penalties enforce conciseness and relevance. The weighted sum of these rewards forms the GRPO signal that iteratively fine-tunes the base VLM, steering it towa… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative Results. For each input image, super-resolution is performed on different magnifications with various methods: (a) Nearest neighbor interpolation; (b) One-step direct SR with the backbone SR model; (c-e) Variants of CoZ with different text prompts. The CoZ framework shows significantly better performance at large magnifications. Furthermore, using CoZ with VLM prompts assists the SR model in ge… view at source ↗
Figure 6
Figure 6. Figure 6: Reward graphs of using InternVL2.5-8B as the critic VLM, evaluated on a validation set. Values for Critic Reward, Phrase Exclusion Reward, Repetition Penalty, and Total Reward increase throughout the training process. 0 2500 5000 7500 10000 Step 0.60 0.65 0.70 0.75 0.80 0.85 0.90 Critic Reward 0 2500 5000 7500 10000 Step 0.95 0.96 0.97 0.98 0.99 1.00 Phrase Exclusion Reward 0 2500 5000 7500 10000 Step 0.12… view at source ↗
Figure 7
Figure 7. Figure 7: Reward graphs of using Qwen2.5-VL-7B-Instruct as the critic VLM, evaluated on a validation set. Values for Critic Reward, Phrase Exclusion Reward, Repetition Penalty, and Total Reward increase throughout the training process [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: RLHF training with GRPO assists the prompt-extraction VLM in creating meaningful [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example questions used for the MOS test. (Left) [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: (a) Mean opinion scores for image generation. (b) Mean opinion scores for text generation. [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative results for performing CoZ with the open-source OSEDiff (leveraging Stable [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Extreme super-resolution of photorealistic images by CoZ up to [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Extreme super-resolution of photorealistic images by CoZ up to [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Extreme super-resolution of photorealistic images by CoZ up to [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Extreme super-resolution of photorealistic images by CoZ up to [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Extreme super-resolution of photorealistic images by CoZ up to [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗
read the original abstract

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes extreme single-image super-resolution into an autoregressive chain of intermediate 4x scale steps using a fixed backbone diffusion SR model augmented by multi-scale-aware VLM prompts. The prompt extractor is fine-tuned via Generalized Reward Policy Optimization (GRPO) with a critic VLM to align outputs with human preferences. The central claim is that this enables a standard 4x model to achieve beyond 256x enlargement with high perceptual quality and fidelity without retraining on extreme scales.

Significance. If the experimental claims are substantiated with rigorous metrics, the work would be significant for computer vision as it offers a practical, training-free route to extreme magnifications by leveraging existing backbones and prompt-based guidance. The GRPO alignment step and the autoregressive decomposition are constructive ideas that could generalize beyond SR. The approach directly targets the scalability limitation of current SISR models.

major comments (2)
  1. Abstract: The central claim of attaining beyond 256x enlargement with high perceptual quality and fidelity is supported only by qualitative descriptions; no quantitative metrics (e.g., PSNR, LPIPS, or perceptual scores at 256x), error analysis, or ablations on chain length versus prompt quality are provided, leaving the load-bearing experimental validation unverified.
  2. Method description (autoregressive chaining): The framework assumes that stochastic hallucinations and artifacts from early 4x steps do not compound across the approximately four iterations needed for 256x, yet no analysis, bounds, or ablation on error propagation is given. GRPO alignment is applied only to the prompt extractor, not the SR chain itself, so later prompts may become uninformative when visual cues degrade.
minor comments (2)
  1. The abstract and introduction could more clearly distinguish the contributions of the chaining procedure from those of the GRPO-tuned prompt extractor.
  2. Notation for scale states and prompt conditioning could be formalized with a diagram or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the significance of our work. We address each of the major comments in detail below.

read point-by-point responses
  1. Referee: Abstract: The central claim of attaining beyond 256x enlargement with high perceptual quality and fidelity is supported only by qualitative descriptions; no quantitative metrics (e.g., PSNR, LPIPS, or perceptual scores at 256x), error analysis, or ablations on chain length versus prompt quality are provided, leaving the load-bearing experimental validation unverified.

    Authors: We acknowledge that the abstract and some sections rely primarily on qualitative results for the extreme 256x magnification to demonstrate the framework's capability. The manuscript does include quantitative metrics like PSNR and LPIPS for scales up to 16x-32x where ground truth is available, along with perceptual user studies. For 256x on real images, reference-based metrics are inherently limited without high-resolution ground truth. In the revised version, we will add more detailed ablations on chain length and prompt quality, and include additional perceptual scores where feasible. We will also clarify the experimental setup in the abstract. revision: yes

  2. Referee: Method description (autoregressive chaining): The framework assumes that stochastic hallucinations and artifacts from early 4x steps do not compound across the approximately four iterations needed for 256x, yet no analysis, bounds, or ablation on error propagation is given. GRPO alignment is applied only to the prompt extractor, not the SR chain itself, so later prompts may become uninformative when visual cues degrade.

    Authors: This is a valid concern regarding potential error accumulation in the autoregressive chain. Our design mitigates this through the use of multi-scale-aware VLM prompts that adapt to the current resolution state, providing contextual guidance even when low-level details are sparse. The GRPO optimization ensures that the prompt extractor generates preference-aligned prompts that help steer the SR model towards high-quality outputs at each step. While we do not provide theoretical bounds in the current manuscript, we will include an empirical ablation study on error propagation by varying the number of zoom steps and analyzing intermediate outputs in the revision. We will also discuss how the prompt extractor maintains informativeness. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces Chain-of-Zoom as a procedural wrapper around an existing 4x diffusion SR backbone, factorizing extreme upscaling into an autoregressive sequence of intermediate steps augmented by VLM-generated prompts that are separately fine-tuned via GRPO. No load-bearing equation or claim reduces by construction to a fitted parameter, self-citation, or renamed input; the autoregressive decomposition is presented as an engineering factorization whose validity is tested empirically rather than asserted tautologically. The framework remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work to force its central result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; ledger is therefore sparse. The method implicitly assumes that intermediate-scale states remain conditionally independent enough for the chain to be tractable and that VLM-generated prompts supply useful information beyond what the image itself provides at high zoom.

axioms (1)
  • domain assumption Autoregressive decomposition of the conditional probability over scale states remains tractable and stable
    Stated in abstract as the core factorization that allows reuse of a 4x backbone without retraining.

pith-pipeline@v0.9.0 · 5723 in / 1218 out tokens · 23622 ms · 2026-05-19T13:27:53.747496+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution

    cs.CV 2026-02 unverdicted novelty 7.0

    Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.

  2. GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance

    cs.CV 2026-05 unverdicted novelty 5.0

    GaussianZoom enables high-fidelity extreme zoom-in 3D rendering from low-res inputs via an iterative framework combining geometry-consistent modeling, depth-based super-resolution, VLM detail synthesis, and an expanda...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    Ntire 2017 challenge on single image super-resolution: Dataset and study

    Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 126–135, 2017

  2. [2]

    Imaging intracellular fluorescent proteins at nanometer resolution.science, 313(5793):1642–1645, 2006

    Eric Betzig, George H Patterson, Rachid Sougrat, O Wolf Lindwasser, Scott Olenych, Juan S Bonifacino, Michael W Davidson, Jennifer Lippincott-Schwartz, and Harald F Hess. Imaging intracellular fluorescent proteins at nanometer resolution.science, 313(5793):1642–1645, 2006

  3. [3]

    Any-resolution training for high-resolution image synthesis

    Lucy Chai, Michael Gharbi, Eli Shechtman, Phillip Isola, and Richard Zhang. Any-resolution training for high-resolution image synthesis. InEuropean conference on computer vision, pages 170–188. Springer, 2022

  4. [4]

    Learning continuous image representation with local implicit image function

    Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8628–8638, 2021

  5. [5]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  6. [6]

    Improving diffusion models for inverse problems using manifold constraints

    Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URL https: //openreview.net/forum?id=nJJjv0JDJju

  7. [7]

    Diffusion posterior sampling for general noisy inverse problems

    Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. InInternational Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=OnD9zGAGT0k

  8. [8]

    Prompt-tuning latent diffusion models for inverse problems.arXiv preprint arXiv:2310.01110, 2023

    Hyungjin Chung, Jong Chul Ye, Peyman Milanfar, and Mauricio Delbracio. Prompt-tuning latent diffusion models for inverse problems.arXiv preprint arXiv:2310.01110, 2023

  9. [9]

    Decomposed diffusion sampler for accelerating large-scale inverse problems

    Hyungjin Chung, Suhyeon Lee, and Jong Chul Ye. Decomposed diffusion sampler for accelerating large-scale inverse problems. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=DsEhqQtfAG

  10. [10]

    Pixel recursive super resolution

    Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. InProceedings of the IEEE international conference on computer vision, pages 5439–5448, 2017

  11. [11]

    Learning a deep convolutional network for image super-resolution

    Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pages 184–199. Springer, 2014

  12. [12]

    Image super-resolution using deep convolutional networks.IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015

    Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks.IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015

  13. [13]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024

  14. [14]

    Example-based super-resolution.IEEE Computer graphics and Applications, 22(2):56–65, 2002

    William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-resolution.IEEE Computer graphics and Applications, 22(2):56–65, 2002. 11

  15. [15]

    Div8k: Diverse 8k resolution image dataset

    Shuhang Gu, Andreas Lugmayr, Martin Danelljan, Manuel Fritsche, Julien Lamour, and Radu Timofte. Div8k: Diverse 8k resolution image dataset. In2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512–3516. IEEE, 2019

  16. [16]

    Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation

    Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. InEuropean Conference on Computer Vision, pages 39–55. Springer, 2024

  17. [17]

    Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23 (47):1–33, 2022

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23 (47):1–33, 2022

  18. [18]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019

  19. [19]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021

  20. [20]

    Cubic convolution interpolation for digital image processing.IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 2003

    Robert Keys. Cubic convolution interpolation for digital image processing.IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 2003

  21. [21]

    Regularization by texts for latent diffusion inverse solvers.arXiv preprint arXiv:2311.15658, 2023

    Jeongsol Kim, Geon Yeong Park, Hyungjin Chung, and Jong Chul Ye. Regularization by texts for latent diffusion inverse solvers.arXiv preprint arXiv:2311.15658, 2023

  22. [22]

    FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems, March 2025

    Jeongsol Kim, Bryan Sangwoo Kim, and Jong Chul Ye. Flowdps: Flow-driven posterior sampling for inverse problems.arXiv preprint arXiv:2503.08136, 2025

  23. [23]

    Photo-realistic single image super- resolution using a generative adversarial network

    Christian Ledig, Lucas Theis, Ferenc Husz’ar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super- resolution using a generative adversarial network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017

  24. [24]

    Lsdir: A large scale dataset for image restoration

    Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, et al. Lsdir: A large scale dataset for image restoration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1775–1787, 2023

  25. [25]

    Enhanced deep residual networks for single image super-resolution

    Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017

  26. [26]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

  27. [27]

    Pulse: Self-supervised photo upsampling via latent space exploration of generative models

    Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 2437–2445, 2020

  28. [28]

    Zoomed in, diffused out: Towards local degradation-aware multi-diffusion for extreme image super-resolution.arXiv preprint arXiv:2411.12072, 2024

    Brian B Moser, Stanislav Frolov, Tobias C Nauen, Federico Raue, and Andreas Dengel. Zoomed in, diffused out: Towards local degradation-aware multi-diffusion for extreme image super-resolution.arXiv preprint arXiv:2411.12072, 2024

  29. [29]

    Multi-input cardiac image super-resolution using convolutional neural networks

    Ozan Oktay, Wenjia Bai, Matthew Lee, Ricardo Guerrero, Konstantinos Kamnitsas, Jose Caballero, Antonio de Marvao, Stuart Cook, Declan O’Regan, and Daniel Rueckert. Multi-input cardiac image super-resolution using convolutional neural networks. InMedical Image Computing and Computer- Assisted Intervention-MICCAI 2016: 19th International Conference, Athens,...

  30. [30]

    Pan and H

    Zhenyu Pan and Han Liu. Metaspatial: Reinforcing 3d spatial reasoning in vlms for the metaverse.arXiv preprint arXiv:2503.18470, 2025

  31. [31]

    MR image reconstruction from highly undersampled k-space data by dictionary learning.IEEE transactions on medical imaging, 30(5):1028–1041, 2010

    Saiprasad Ravishankar and Yoram Bresler. MR image reconstruction from highly undersampled k-space data by dictionary learning.IEEE transactions on medical imaging, 30(5):1028–1041, 2010

  32. [32]

    Saharia, J

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement.arXiv preprint arXiv:2104.07636, 2021. 12

  33. [33]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  35. [35]

    zero-shot

    Assaf Shocher, Nadav Cohen, and Michal Irani. “zero-shot” super-resolution using deep internal learning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3118–3126, 2018

  36. [36]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang- Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf.arXiv preprint arXiv:2309.14525, 2023

  37. [37]

    Qwen2.5-vl, January 2025

    Qwen Team. Qwen2.5-vl, January 2025. URLhttps://qwenlm.github.io/blog/qwen2.5-vl/

  38. [38]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

  39. [39]

    Conditional image generation with pixelcnn decoders.Advances in neural information processing systems, 29, 2016

    Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders.Advances in neural information processing systems, 29, 2016

  40. [40]

    Pixel recurrent neural networks

    Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International conference on machine learning, pages 1747–1756. PMLR, 2016

  41. [41]

    Lena Wagner, Lukas Liebel, and Marco Körner. Deep residual learning for single-image super-resolution of multi-spectral satellite imagery.ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 4:189–196, 2019

  42. [42]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

  43. [43]

    Exploiting diffusion prior for real-world image super-resolution.International Journal of Computer Vision, 132(12): 5929–5949, 2024

    Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution.International Journal of Computer Vision, 132(12): 5929–5949, 2024

  44. [44]

    A comprehensive review on deep learning based remote sensing image super-resolution methods.Earth-Science Reviews, 232:104110, 2022

    Peijuan Wang, Bulent Bayram, and Elif Sertel. A comprehensive review on deep learning based remote sensing image super-resolution methods.Earth-Science Reviews, 232:104110, 2022

  45. [45]

    Generative powers of ten

    Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steven M Seitz, Ira Kemelmacher-Shlizerman, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, and Aleksander Holynski. Generative powers of ten. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7173–7182, 2024

  46. [46]

    Deep learning for image super-resolution: A survey.IEEE transactions on pattern analysis and machine intelligence, 43(10):3365–3387, 2020

    Zhihao Wang, Jian Chen, and Steven CH Hoi. Deep learning for image super-resolution: A survey.IEEE transactions on pattern analysis and machine intelligence, 43(10):3365–3387, 2020

  47. [47]

    Learning images across scales using adversarial training.ACM Transactions on Graphics, 43(4):131, 2024

    Krzysztof Wolski, Adarsh Djeacoumar, Alireza Javanmardi, Hans-Peter Seidel, Christian Theobalt, Guil- laume Cordonnier, Karol Myszkowski, George Drettakis, Xingang Pan, and Thomas Leimkühler. Learning images across scales using adversarial training.ACM Transactions on Graphics, 43(4):131, 2024

  48. [48]

    One-step effective diffusion network for real-world image super-resolution.Advances in Neural Information Processing Systems, 37:92529–92553, 2024

    Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution.Advances in Neural Information Processing Systems, 37:92529–92553, 2024

  49. [49]

    Seesr: Towards semantics-aware real-world image super-resolution

    Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics-aware real-world image super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25456–25467, 2024

  50. [50]

    ArXiv preprint abs/2410.02712 (2024)

    Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. Llava-critic: Learning to evaluate multimodal models.arXiv preprint arXiv:2410.02712, 2024

  51. [51]

    Image super-resolution via sparse representation

    Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE transactions on image processing, 19(11):2861–2873, 2010. 13

  52. [52]

    Maniqa: Multi-dimension attention network for no-reference image quality assessment

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1191–1200, 2022

  53. [53]

    Zoomldm: Latent diffusion model for multi-scale image generation.arXiv preprint arXiv:2411.16969, 2024

    Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis, Prateek Prasanna, Rajarsi R Gupta, Joel Saltz, and Dimitris Samaras. Zoomldm: Latent diffusion model for multi-scale image generation.arXiv preprint arXiv:2411.16969, 2024

  54. [54]

    Demystifying Long Chain-of-Thought Reasoning in LLMs

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of- thought reasoning in llms, 2025. URLhttps://arxiv.org/abs/2502.03373

  55. [55]

    Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild

    Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25669–25680, 2024

  56. [56]

    Degradation- guided one-step image super-resolution with diffusion priors.arXiv preprint arXiv:2409.17058, 2024

    Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, and Xiaochun Cao. Degradation-guided one-step image super-resolution with diffusion priors.arXiv preprint arXiv:2409.17058, 2024

  57. [57]

    A feature-enriched completely blind image quality evaluator

    Lin Zhang, Lei Zhang, and Alan C Bovik. A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing, 24(8):2579–2591, 2015

  58. [58]

    Image super-resolution using very deep residual channel attention networks

    Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. InProceedings of the European conference on computer vision (ECCV), pages 286–301, 2018

  59. [59]

    Swift: a scalable lightweight infrastructure for fine-tuning

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025. 14 A Proofs Proposition 1.Given a sequence of scale-states xi tha...

  60. [60]

    A given LR image is resized to the target resolution and V AE encoding is done in tiles, allowing encoding to be performed even in settings of limited GPU memory

  61. [61]

    For latent sizes of 64×64, we find overlaps of 16 to work sufficiently well

    The encoded (low-resolution) latent is tiled into overlapping patches. For latent sizes of 64×64, we find overlaps of 16 to work sufficiently well

  62. [62]

    Note that this step requires multiple passes of the VLM, a computational bottleneck to be solved by future work

    Each low-resolution patch of 64×64 passes through the super-resolution network to become high-resolution patches, each guided by patch-specific prompts generated by the prompt- extractor VLM. Note that this step requires multiple passes of the VLM, a computational bottleneck to be solved by future work

  63. [63]

    The output high-resolution patches are multiplied by Gaussian weights in overlapping regions for smooth transposition between patches, and then combined to create the final high-resolution image

  64. [64]

    19 I Additional Qualitative Results Additional qualitative results of extreme super-resolution by CoZ are provided below

    The whole process is repeated asscale autoregressionto achieve higher resolutions. 19 I Additional Qualitative Results Additional qualitative results of extreme super-resolution by CoZ are provided below. Figure 12: Extreme super-resolution of photorealistic images by CoZ up to64×magnification. 20 Figure 13: Extreme super-resolution of photorealistic imag...