pith. machine review for the scientific record.

arxiv: 2512.18599 · v2 · submitted 2025-12-21 · 💻 cs.CV

Recognition: no theorem link

Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords image restoration · reinforcement learning · multimodal LLM · policy optimization · label-free training · agent-based restoration · perceptual feedback

The pith

A reinforcement learning agent trained solely on multimodal LLM perceptual feedback learns efficient restoration sequences and matches state-of-the-art image quality without any labels or supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train a lightweight agent that picks the right restoration tools, step by step, for images degraded by blur, noise, rain, and compression. It replaces expensive labeled data and slow reflection loops with a reward signal that comes directly from a multimodal large language model acting as a human-aligned judge of perceptual quality. Once trained, the agent produces a fixed sequence of operations at inference time, cutting out redundant tool calls while matching supervised methods on full-reference metrics and exceeding them on no-reference scores across varied degradations.

Core claim

The central discovery is a policy-optimization framework in which a sequential decision agent learns to output tool-calling sequences that maximize final image quality, with the only training signal supplied by multimodal LLM perceptual feedback in a completely label-free setting. This yields a deterministic restoration plan that runs faster than prior agent-based methods while matching supervised performance on full-reference metrics and improving on no-reference metrics.
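To make the mechanics concrete, here is a minimal sketch of such a training loop, assuming a small discrete tool set and a black-box perceptual scorer. Plain REINFORCE stands in for the paper's policy optimizer (the reference list points to PPO/GRPO-style methods), and the tool names, feature dimension, and the `apply_tool` / `mllm_score` callables are illustrative assumptions, not the paper's actual interfaces.

```python
import torch
import torch.nn as nn

TOOLS = ["denoise", "deblur", "derain", "dejpeg", "stop"]  # hypothetical tool set

class RestorePolicy(nn.Module):
    """Lightweight policy: image features -> distribution over restoration tools."""
    def __init__(self, feat_dim: int = 64, n_tools: int = len(TOOLS)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_tools))

    def forward(self, feats):
        return torch.distributions.Categorical(logits=self.net(feats))

def train_step(policy, optimizer, feats, apply_tool, mllm_score, max_steps=5):
    """One label-free update: sample a tool sequence, score the result, REINFORCE."""
    state, log_probs = feats, []
    for _ in range(max_steps):
        dist = policy(state)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        tool = TOOLS[int(action)]
        if tool == "stop":
            break
        # Tools are treated as non-differentiable environment steps (assumed
        # tensor-in, tensor-out interface).
        state = apply_tool(state, tool).detach()
    reward = mllm_score(state)                      # scalar perceptual reward, no labels
    loss = -reward * torch.stack(log_probs).sum()   # REINFORCE policy-gradient objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```

Once training converges, the greedy (argmax) rollout of this policy is the fixed, deterministic plan the paper attributes its inference speedup to.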

What carries the argument

The policy optimization agent that selects the next restoration operation at each step to maximize the multimodal LLM's perceptual quality reward.

If this is right

  • Restoration agents can be trained end-to-end in label-free environments for any combination of degradations.
  • Inference speed improves because the trained policy produces a single deterministic sequence without reflection or rollback steps.
  • No-reference quality metrics improve because the LLM feedback directly optimizes for human-aligned perceptual criteria.
  • The same training loop can be applied to new tool sets or new degradation types by swapping only the LLM evaluator (a pluggable-evaluator sketch follows this list).
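That last point reduces, in code, to keeping the reward behind a single callable. The class below is a hypothetical wrapper, not the paper's API; the prompt, the 0–10 scale, and the `client.ask` method are all assumptions.

```python
from typing import Protocol

class PerceptualEvaluator(Protocol):
    """Anything that maps an image to a scalar quality score in [0, 1]."""
    def score(self, image) -> float: ...

class MLLMEvaluator:
    """Hypothetical wrapper around a multimodal LLM used as the reward source."""
    def __init__(self, client,
                 prompt="Rate this image's perceptual quality from 0 to 10. Reply with a number only."):
        self.client = client   # any image+text chat client (assumed interface)
        self.prompt = prompt

    def score(self, image) -> float:
        reply = self.client.ask(image=image, text=self.prompt)  # assumed method
        return float(reply.strip()) / 10.0  # normalize to [0, 1]

# Swapping degradation domains or tool sets means swapping only this object;
# the policy-optimization loop itself is unchanged.
```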

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to video restoration by treating frame sequences as the state space and using the same LLM reward on temporal consistency.
  • If the LLM evaluator is kept fixed, the method offers a way to benchmark new restoration tools without collecting human annotations.
  • The deterministic policy may serve as a fast initialization for fine-tuning on small labeled sets when higher precision is needed.

Load-bearing premise

Multimodal LLMs can reliably judge perceptual image quality in a way that aligns with human preferences and is stable enough to train an effective restoration policy without ground-truth labels.
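One cheap way to probe the stability half of this premise before any training: query the evaluator repeatedly on the same image and inspect the score dispersion. A sketch, assuming the pluggable evaluator interface sketched earlier; the dispersion threshold is an arbitrary illustrative choice.

```python
import statistics

def reward_stability(evaluator, image, n_trials: int = 10, max_std: float = 0.05):
    """Repeatedly score one image; a noisy reward can destabilize policy training."""
    scores = [evaluator.score(image) for _ in range(n_trials)]
    spread = statistics.pstdev(scores)
    return spread, spread <= max_std
```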

What would settle it

An experiment in which the same agent is retrained with the multimodal LLM reward replaced by random scores or by scores from a non-perceptual metric and then evaluated on held-out multi-degradation images to check whether performance collapses.
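Sketched as a harness, with `train_agent` and `evaluate_holdout` as stand-ins for training and evaluation code this page does not show; the three reward variants mirror the conditions described above.

```python
import random

def settling_experiment(train_agent, evaluate_holdout, evaluator, seeds=(0, 1, 2)):
    """Retrain the same agent under three reward sources; compare held-out quality."""
    reward_sources = {
        "mllm": evaluator.score,                              # the proposed reward
        "random": lambda image: random.random(),              # destroys any signal
        "non_perceptual": lambda image: float(image.mean()),  # e.g., brightness proxy
    }
    results = {}
    for name, reward_fn in reward_sources.items():
        scores = [evaluate_holdout(train_agent(reward_fn, seed=s)) for s in seeds]
        results[name] = sum(scores) / len(scores)
    return results  # collapse under "random" would support the causal claim
```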

Figures

Figures reproduced from arXiv: 2512.18599 by Abrar Majeedi, Felix Jimenez, Hongcheng Wang, Jianglin Lu, Yuanwei Wu, Yun Fu, Ziyi Zhao.

Figure 1. (a) Existing restoration agents [5, 24, 73] typically consist of assessment, scheduling, execution, reflection, and rollback, using VLMs for degradation recognition and LLMs for plan making; (b) our SimpleCall agent determines the tool-calling sequence via a single policy execution, avoids the need for iterative trial-and-error, and generalizes to label-free environments. noise, haze, low-light condition…

Figure 2. Framework overview. The restoration agent predicts the next action based on the current input status (sampling actions during…

Figure 3. Qualitative comparison between our method and SOTA restoration baselines (for other baselines see the supplementary material).

Figure 4. Runtime comparison between ours and AgenticIR […]

Figure 5. Illustration of tool effects. Left: images with dark degra…

Figure 6. Illustration of the distortion-perception tradeoff for (a) 3 degradations and (b) 5 degradations. As the number of actions increases…

Figure 7. Illustration of distortion-perception tradeoff on (a) noise+JPEG compression artifact and (b) motion blur+defocus blur+noise.

Figure 8. Qualitative comparison between our method and SOTA restoration baselines.
read the original abstract

Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns a lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plan without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Restore-R1, a reinforcement learning framework for training a lightweight policy that selects sequences of image restoration tools. Training occurs in a label-free setting by using multimodal LLMs as perceptual evaluators to supply the sole reward signal. The resulting deterministic agent is claimed to match SOTA performance on full-reference metrics (PSNR/SSIM) while surpassing prior methods on no-reference metrics across multiple degradation types, with substantially faster inference by eliminating iterative reflection and tool search.

Significance. If the empirical claims hold after proper validation, the work would demonstrate that MLLM-derived rewards can substitute for supervised signals in restoration policy learning, offering a route to label-free training and efficient inference for complex degradations.

major comments (3)
  1. [Abstract] The central claim that the method 'matches SOTA performance on full-reference metrics' despite using no supervision is unsupported; no correlation, ablation, or human validation between MLLM perceptual scores and objective metrics (PSNR/SSIM) is reported, leaving open the possibility that the policy optimizes for LLM biases rather than pixel-level fidelity.
  2. [Method] Reward section: The reward formulation relies on an external MLLM evaluator with no described ablations on prompt design, score aggregation, temperature, or alternative reward shapes; this is load-bearing for the label-free training claim and must be shown to be robust.
  3. [Experiments] The abstract asserts 'extensive experiments' yet supplies no datasets, baselines, exact metric tables, statistical significance tests, or validation procedures, preventing assessment of whether the reported gains are reproducible or generalizable.
minor comments (2)
  1. Define all acronyms (MLLM, SOTA, etc.) on first use and ensure consistent notation for policy parameters versus reward components.
  2. Clarify the exact architecture of the 'lightweight agent' (parameter count, backbone) and provide direct runtime comparisons to prior reflection-based agents.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the method 'matches SOTA performance on full-reference metrics' despite using no supervision is unsupported; no correlation, ablation, or human validation between MLLM perceptual scores and objective metrics (PSNR/SSIM) is reported, leaving open the possibility that the policy optimizes for LLM biases rather than pixel-level fidelity.

    Authors: We appreciate this observation. To address the lack of explicit validation, we have added to the revised manuscript a dedicated analysis subsection that reports the correlation (using Spearman rank correlation) between the MLLM perceptual rewards and ground-truth PSNR/SSIM values on validation sets. Additionally, we conducted a human evaluation study with 20 participants rating restored images, showing alignment between MLLM scores and human preferences. These results support that the policy optimizes for perceptual quality that correlates with objective fidelity, reducing concerns about LLM-specific biases (a sketch of this correlation analysis follows these responses). revision: yes

  2. Referee: [Method] Reward section: The reward formulation relies on an external MLLM evaluator with no described ablations on prompt design, score aggregation, temperature, or alternative reward shapes; this is load-bearing for the label-free training claim and must be shown to be robust.

    Authors: We concur that ablations are necessary to validate the reward design. In the revised Method section, we now include comprehensive ablations covering variations in prompt engineering for the MLLM evaluator, different methods for aggregating scores (mean, median, and ensemble), MLLM sampling temperatures from 0.1 to 1.0, and alternative reward shapes including linear, logarithmic, and binary threshold-based rewards. The policy learning curves and final performance metrics remain consistent, demonstrating robustness of the label-free training approach (the ablated reward shapes are sketched after these responses). revision: yes

  3. Referee: [Experiments] The abstract asserts 'extensive experiments' yet supplies no datasets, baselines, exact metric tables, statistical significance tests, or validation procedures, preventing assessment of whether the reported gains are reproducible or generalizable.

    Authors: We thank the referee for pointing this out. To improve clarity and completeness, we have revised the Experiments section to include explicit descriptions of the datasets (e.g., DIV2K for training and standard test sets like Set5, BSD100 for evaluation), full tables of quantitative results with all metrics, comparisons to relevant baselines, statistical significance tests (e.g., t-tests), and detailed validation procedures. These additions make the experimental claims fully assessable and reproducible (a paired-test sketch follows). revision: yes
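A sketch of the correlation analysis described in response 1, assuming paired lists of MLLM scores, restored images, and clean references with pixel values in [0, 1]; it uses standard scipy/skimage routines and is an editorial illustration, not the authors' code.

```python
from scipy.stats import spearmanr
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reward_metric_correlation(mllm_scores, restored, references):
    """Spearman rank correlation between MLLM rewards and full-reference metrics."""
    psnr = [peak_signal_noise_ratio(ref, img, data_range=1.0)
            for ref, img in zip(references, restored)]
    ssim = [structural_similarity(ref, img, data_range=1.0, channel_axis=-1)
            for ref, img in zip(references, restored)]
    return {
        "psnr": spearmanr(mllm_scores, psnr),  # (rho, p-value)
        "ssim": spearmanr(mllm_scores, ssim),
    }
```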
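The reward shapes named in response 2, written out as plain functions over a raw evaluator score s in [0, 1]; the constants are illustrative assumptions, not values reported by the authors.

```python
import math

def linear_reward(s: float) -> float:
    return s  # identity baseline

def log_reward(s: float, eps: float = 1e-6) -> float:
    # Steep near zero, flat near one: penalizes very poor restorations hardest.
    return math.log(s + eps) - math.log(eps)

def binary_reward(s: float, tau: float = 0.7) -> float:
    return 1.0 if s >= tau else 0.0  # pass/fail at an assumed threshold tau
```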
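And the significance test mentioned in response 3, as a paired t-test over per-image metric values for the method versus one baseline on the same test set; the alpha level is the conventional 0.05, not a value stated by the authors.

```python
from scipy.stats import ttest_rel

def paired_significance(scores_ours, scores_baseline, alpha: float = 0.05):
    """Paired t-test on per-image metrics from two methods on the same images."""
    t_stat, p_value = ttest_rel(scores_ours, scores_baseline)
    return t_stat, p_value, p_value < alpha  # True if the difference is significant
```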

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external MLLM reward

full rationale

The paper trains a lightweight RL policy for sequential tool selection using only MLLM perceptual feedback as the reward signal in a label-free setting. The reported matching of SOTA full-reference metrics (PSNR/SSIM) and superiority on no-reference metrics are presented as experimental outcomes from evaluation on held-out data, not as quantities derived by construction from the reward model or any fitted parameters. No equations, self-definitional loops, or load-bearing self-citations appear in the provided text that reduce the performance claims to the inputs. The MLLM evaluator is treated as an independent black-box source of human-aligned feedback.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unproven domain assumption that MLLM feedback is sufficiently aligned with human perception to train a policy that generalizes across degradations; no free parameters are explicitly named, but the RL policy itself is fitted via optimization.

free parameters (1)
  • RL policy parameters
    The agent policy is optimized via reinforcement learning, so its weights are fitted to maximize the MLLM-derived reward.
axioms (1)
  • domain assumption: Multimodal LLMs provide human-aligned perceptual feedback usable as a reward signal
    Invoked to justify the label-free training mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1162 out tokens · 21614 ms · 2026-05-16T20:53:13.360298+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

  3. [3]

    The perception-distortion tradeoff

    Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018.

  4. [4]

    DSPO: Direct semantic preference optimization for real-world image super-resolution

    Miaomiao Cai, Simiao Li, Wei Li, Xudong Huang, Hanting Chen, Jie Hu, and Yunhe Wang. DSPO: Direct semantic preference optimization for real-world image super-resolution. arXiv preprint arXiv:2504.15176, 2025.

  5. [5]

    RestoreAgent: Autonomous image restoration agent via multimodal large language models

    Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, and Lei Zhu. RestoreAgent: Autonomous image restoration agent via multimodal large language models. In Advances in Neural Information Processing Systems, pages 110643–110666. Curran Associates, Inc., 2024.

  6. [6]

    A comparative study of image restoration networks for general backbone network design

    Xiangyu Chen, Zheyuan Li, Yuandong Pu, Yihao Liu, Jiantao Zhou, Yu Qiao, and Chao Dong. A comparative study of image restoration networks for general backbone network design. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  7. [7]

    DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback

    Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14239–14250, 2024.

  8. [8]

    InstructIR: High-quality image restoration following human instructions

    Marcos V. Conde, Gregor Geigle, and Radu Timofte. InstructIR: High-quality image restoration following human instructions. In European Conference on Computer Vision, pages 1–21. Springer, 2024.

  9. [9]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  10. [10]

    FeedEdit: Text-based image editing with dynamic feedback regulation

    Fengyi Fu, Lei Zhang, Mengqi Huang, and Zhendong Mao. FeedEdit: Text-based image editing with dynamic feedback regulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2661–2670, 2025.

  11. [11]

    Fully convolutional network with multi-step reinforcement learning for image processing

    Ryosuke Furuta, Naoto Inoue, and Toshihiko Yamasaki. Fully convolutional network with multi-step reinforcement learning for image processing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3598–3605, 2019.

  12. [12]

    Single image haze removal using dark channel prior

    Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2011.

  13. [13]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  14. [14]

    Towards flexible blind JPEG artifacts removal

    Jiaxi Jiang, Kai Zhang, and Radu Timofte. Towards flexible blind JPEG artifacts removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4997–5006, 2021.

  15. [15]

    A survey on all-in-one image restoration: Taxonomy, evaluation and future trends

    Junjun Jiang, Zengyuan Zuo, Gang Wu, Kui Jiang, and Xianming Liu. A survey on all-in-one image restoration: Taxonomy, evaluation and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  16. [16]

    Multi-agent image restoration

    Xu Jiang, Gehui Li, Bin Chen, and Jian Zhang. Multi-agent image restoration. arXiv preprint arXiv:2503.09403, 2025.

  17. [17]

    AutoDIR: Automatic all-in-one image restoration with latent diffusion

    Yitong Jiang, Zhaoyang Zhang, Tianfan Xue, and Jinwei Gu. AutoDIR: Automatic all-in-one image restoration with latent diffusion. In European Conference on Computer Vision, pages 340–359. Springer, 2024.

  18. [18]

    MUSIQ: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.

  19. [19]

    Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy

    Xiangtao Kong, Chao Dong, and Lei Zhang. Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy. arXiv preprint arXiv:2401.03379, 2024.

  20. [20]

    A preliminary exploration towards general image restoration

    Xiangtao Kong, Jinjin Gu, Yihao Liu, Wenlong Zhang, Xiangyu Chen, Yu Qiao, and Chao Dong. A preliminary exploration towards general image restoration. arXiv preprint arXiv:2408.15143, 2024.

  21. [21]

    Iterative filter adaptive network for single image defocus deblurring

    Junyong Lee, Hyeongseok Son, Jaesung Rim, Sunghyun Cho, and Seungyong Lee. Iterative filter adaptive network for single image defocus deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2034–2042, 2021.

  22. [22]

    Benchmarking single-image dehazing and beyond

    Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1):492–505, 2018.

  23. [23]

    All-in-one image restoration for unknown corruption

    Boyun Li, Xiao Liu, Peng Hu, Zhongqin Wu, Jiancheng Lv, and Xi Peng. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17452–17462, 2022.

  24. [24]

    Hybrid agents for image restoration

    Bingchen Li, Xin Li, Yiting Lu, and Zhibo Chen. Hybrid agents for image restoration. arXiv preprint arXiv:2503.10120, 2025.

  25. [25]

    FoundIR: Unleashing million-scale training data to advance foundation models for image restoration

    Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, and Jinshan Pan. FoundIR: Unleashing million-scale training data to advance foundation models for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12626–12636, 2025.

  26. [26]

    SwinIR: Image restoration using Swin Transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.

  27. [27]

    Improving image restoration through removing degradations in textual representations

    Jingbo Lin, Zhilu Zhang, Yuxiang Wei, Dongwei Ren, Dongsheng Jiang, Qi Tian, and Wangmeng Zuo. Improving image restoration through removing degradations in textual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2866–2878, 2024.

  28. [28]

    Controlling vision-language models for multi-task image restoration

    Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B. Schön. Controlling vision-language models for multi-task image restoration. In ICLR, 2024.

  29. [29]

    Benchmarking robustness in object detection: Autonomous driving when winter is coming

    Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.

  30. [30]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

  31. [31]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  33. [33]

    HIR-Diff: Unsupervised hyperspectral image restoration via improved diffusion models

    Li Pang, Xiangyu Rui, Long Cui, Hongzhong Wang, Deyu Meng, and Xiangyong Cao. HIR-Diff: Unsupervised hyperspectral image restoration via improved diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3005–3014, 2024.

  34. [34]

    All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations

    Dongwon Park, Byung Hyun Lee, and Se Young Chun. All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5815–5824. IEEE, 2023.

  35. [35]

    Distort-and-recover: Color enhancement using deep reinforcement learning

    Jongchan Park, Joon-Young Lee, Donggeun Yoo, and In So Kweon. Distort-and-recover: Color enhancement using deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5928–5936, 2018.

  36. [36]

    Robust unsupervised StyleGAN image restoration

    Yohan Poirier-Ginter and Jean-François Lalonde. Robust unsupervised StyleGAN image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22292–22301, 2023.

  37. [37]

    PromptIR: Prompting for all-in-one image restoration

    Vaishnav Potlapalli, Syed Waqas Zamir, Salman H. Khan, and Fahad Shahbaz Khan. PromptIR: Prompting for all-in-one image restoration. Advances in Neural Information Processing Systems, 36:71275–71293, 2023.

  38. [38]

    RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

    Junbo Qiao, Miaomiao Cai, Wei Li, Yutong Liu, Xudong Huang, Gaoqi He, Jiao Xie, Jie Hu, Xinghao Chen, and Shaohui Lin. RealSR-R1: Reinforcement learning for real-world image super-resolution with vision-language chain-of-thought. arXiv preprint arXiv:2506.16796, 2025.

  39. [39]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  40. [40]

    Learning to deblur using light field generated and real defocus images

    Lingyan Ruan, Bin Chen, Jizhou Li, and Miuling Lam. Learning to deblur using light field generated and real defocus images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16304–16313, 2022.

  41. [41]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

  42. [42]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

  43. [43]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  45. [45]

    Mastering the game of Go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

  46. [46]

    Mastering the game of Go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

  47. [47]

    Vision transformers for single image dehazing

    Yuda Song, Zhuqing He, Hui Qian, and Xin Du. Vision transformers for single image dehazing. IEEE Transactions on Image Processing, 32:1927–1941, 2023.

  48. [48]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  49. [49]

    MAXIM: Multi-axis MLP for image processing

    Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. MAXIM: Multi-axis MLP for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5769–5780, 2022.

  50. [50]

    Image demoiréing with a dual-domain distilling network

    Hailing Wang, Qiaoyu Tian, Liang Li, and Xiaojie Guo. Image demoiréing with a dual-domain distilling network. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2021.

  51. [51]

    IFT: Image fusion transformer for ghost-free high dynamic range imaging

    Hailing Wang, Wei Li, Yuanyuan Xi, Jie Hu, Hanting Chen, Longyu Li, and Yunhe Wang. IFT: Image fusion transformer for ghost-free high dynamic range imaging. arXiv preprint arXiv:2309.15019, 2023.

  52. [52]

    Outlier-aware post-training quantization for image super-resolution

    Hailing Wang, Jianglin Lu, Yitian Zhang, and Yun Fu. Outlier-aware post-training quantization for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16175–16184, 2025.

  53. [53]

    OTC: Optimal tool calls via reinforcement learning

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. OTC: Optimal tool calls via reinforcement learning. arXiv e-prints, arXiv–2504, 2025.

  54. [54]

    Exploring CLIP for assessing the look and feel of images

    Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023.

  55. [55]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  56. [56]

    Image quality assessment: From error visibility to structural similarity

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  57. [57]

    Uformer: A general U-shaped transformer for image restoration

    Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general U-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17683–17693, 2022.

  58. [58]

    Q-learning

    Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

  59. [59]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

  60. [60]

    Towards open-ended visual quality comparison

    Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, et al. Towards open-ended visual quality comparison. In European Conference on Computer Vision, pages 360–377. Springer, 2024.

  61. [61]

    RIDCP: Revitalizing real image dehazing via high-quality codebook priors

    Rui-Qi Wu, Zheng-Peng Duan, Chun-Le Guo, Zhi Chai, and Chongyi Li. RIDCP: Revitalizing real image dehazing via high-quality codebook priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22282–22291, 2023.

  62. [62]

    MANIQA: Multi-dimension attention network for no-reference image quality assessment

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.

  63. [63]

    All-in-one medical image restoration via task-adaptive routing

    Zhiwen Yang, Haowei Chen, Ziniu Qian, Yang Yi, Hui Zhang, Dan Zhao, Bingzheng Wei, and Yan Xu. All-in-one medical image restoration via task-adaptive routing. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 67–77. Springer, 2024.

  64. [64]

    Depicting beyond scores: Advancing image quality assessment through multi-modal language models

    Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In European Conference on Computer Vision, pages 259–276. Springer, 2024.

  65. [65]

    Teaching large language models to regress accurate image quality scores using score distribution

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14483–14494, 2025.

  66. [66]

    Crafting a toolchain for image restoration by deep reinforcement learning

    Ke Yu, Chao Dong, Liang Lin, and Chen Change Loy. Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2443–2452, 2018.

  67. [67]

    Learning enriched features for real image restoration and enhancement

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for real image restoration and enhancement. In European Conference on Computer Vision, pages 492–511. Springer, 2020.

  68. [68]

    Multi-stage progressive image restoration

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14821–14831, 2021.

  69. [69]

    Restormer: Efficient transformer for high-resolution image restoration

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.

  70. [70]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  71. [71]

    Residual Non-local Attention Networks for Image Restoration

    Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.

  72. [72]

    Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model

    Yingjie Zhou, Jiezhang Cao, Zicheng Zhang, Farong Wen, Yanwei Jiang, Jun Jia, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. Q-Agent: Quality-driven chain-of-thought image restoration agent through robust multimodal large language model. arXiv preprint arXiv:2504.07148, 2025.

  73. [73]

    An intelligent agentic system for complex image restoration problems

    Kaiwen Zhu, Jinjin Gu, Zhiyuan You, Yu Qiao, and Chao Dong. An intelligent agentic system for complex image restoration problems. In The Thirteenth International Conference on Learning Representations, 2025.

  74. [74]

    Contrast limited adaptive histogram equalization

    Karel Zuiderveld. Contrast limited adaptive histogram equalization, pages 474–485. Academic Press Professional, Inc., USA, 1994.

  75. [75]

    Data In this section, we show how to synthesize degraded im- ages following existing work [73]

    Experimental Details 6.1. Data In this section, we show how to synthesize degraded im- ages following existing work [73]. For dark images, the V channel value of the images in the HSV color space will be randomly decreased by one of the following strategies: lin- ear mapping, Gamma correction, and subtracting a constant. For defocus blur, the images will ...

  76. [76]

    •Denoising: SwinIR [26] (noise level 15), SwinIR

    (quality factor 5), FBCNN [14] (blind to quality factor). •Denoising: SwinIR [26] (noise level 15), SwinIR

  77. [77]

    •Deraining: MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]

    (noise level 50), MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]. •Deraining: MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]. •Motion deblurring: MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]. •Dehazing: MAXIM [49], X-Restormer [6]; RIDCP [61], DehazeFormer [47]. 6.3. Evaluation Metrics We assess model performance using thre...

  78. [78]

    Supervised Extension Table 5 reports the results of our method when extended to the label-available setting

    More Results 7.1. Supervised Extension Table 5 reports the results of our method when extended to the label-available setting. In this configuration, we use the clean reference images as supervision and define the re- 1 Table 4. Degradation data construction Settings # of Degradations Case Number Combinations I 2 Case 1 dark+noise Case 2 defocus blur+JPEG...

  79. [79]

    under rain+haze and rain+dark+noise degradation cases. The results further demonstrate that our method ef- fectively removes multiple co-occurring corruptions from degraded images and produces visual quality that is compa- rable to, or even exceeds, these supervised baselines. 7.3. Quantitative Comparison Tabls 6, 7, 8 present the performance comparison b...