pith. machine review for the scientific record.

arxiv: 2512.18599 · v2 · submitted 2025-12-21 · 💻 cs.CV

Recognition: no theorem link

Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 20:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords image restoration · reinforcement learning · multimodal LLM · policy optimization · label-free training · agent-based restoration · perceptual feedback

The pith

A reinforcement learning agent trained solely on multimodal LLM perceptual feedback learns efficient restoration sequences and matches state-of-the-art image quality without any labels or supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train a lightweight agent that picks the right restoration tools, step by step, for images degraded by blur, noise, rain, and compression. It replaces expensive labeled data and slow reflection loops with a reward signal that comes directly from a multimodal large language model acting as a human-aligned judge of perceptual quality. Once trained, the agent produces a fixed sequence of operations at inference time, cutting out redundant tool calls while matching supervised methods on full-reference metrics and exceeding them on no-reference scores across varied degradations.

Core claim

The central discovery is a policy-optimization framework in which a sequential decision agent learns to output tool-calling sequences that maximize final image quality, with the only training signal supplied by multimodal LLM perceptual feedback in a completely label-free setting. This yields a deterministic restoration plan that runs faster than prior agent-based methods while matching supervised performance on full-reference metrics and improving on no-reference metrics.
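To make the mechanics concrete, here is a minimal sketch of such a training loop, assuming a small discrete tool set and a black-box perceptual scorer. Plain REINFORCE stands in for the paper's policy optimizer (the reference list points to PPO/GRPO-style methods), and the tool names, feature dimension, and the `apply_tool` / `mllm_score` callables are illustrative assumptions, not the paper's actual interfaces.

```python
import torch
import torch.nn as nn

TOOLS = ["denoise", "deblur", "derain", "dejpeg", "stop"]  # hypothetical tool set

class RestorePolicy(nn.Module):
    """Lightweight policy: image features -> distribution over restoration tools."""
    def __init__(self, feat_dim: int = 64, n_tools: int = len(TOOLS)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_tools))

    def forward(self, feats):
        return torch.distributions.Categorical(logits=self.net(feats))

def train_step(policy, optimizer, feats, apply_tool, mllm_score, max_steps=5):
    """One label-free update: sample a tool sequence, score the result, REINFORCE."""
    state, log_probs = feats, []
    for _ in range(max_steps):
        dist = policy(state)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        tool = TOOLS[int(action)]
        if tool == "stop":
            break
        # Tools are treated as non-differentiable environment steps (assumed
        # tensor-in, tensor-out interface).
        state = apply_tool(state, tool).detach()
    reward = mllm_score(state)                      # scalar perceptual reward, no labels
    loss = -reward * torch.stack(log_probs).sum()   # REINFORCE policy-gradient objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```

Once training converges, the greedy (argmax) rollout of this policy is the fixed, deterministic plan the paper attributes its inference speedup to.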

What carries the argument

The policy optimization agent that selects the next restoration operation at each step to maximize the multimodal LLM's perceptual quality reward.

If this is right

  • Restoration agents can be trained end-to-end in label-free environments for any combination of degradations.
  • Inference speed improves because the trained policy produces a single deterministic sequence without reflection or rollback steps.
  • No-reference quality metrics improve because the LLM feedback directly optimizes for human-aligned perceptual criteria.
  • The same training loop can be applied to new tool sets or new degradation types by swapping only the LLM evaluator (a pluggable-evaluator sketch follows this list).
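That last point reduces, in code, to keeping the reward behind a single callable. The class below is a hypothetical wrapper, not the paper's API; the prompt, the 0–10 scale, and the `client.ask` method are all assumptions.

```python
from typing import Protocol

class PerceptualEvaluator(Protocol):
    """Anything that maps an image to a scalar quality score in [0, 1]."""
    def score(self, image) -> float: ...

class MLLMEvaluator:
    """Hypothetical wrapper around a multimodal LLM used as the reward source."""
    def __init__(self, client,
                 prompt="Rate this image's perceptual quality from 0 to 10. Reply with a number only."):
        self.client = client   # any image+text chat client (assumed interface)
        self.prompt = prompt

    def score(self, image) -> float:
        reply = self.client.ask(image=image, text=self.prompt)  # assumed method
        return float(reply.strip()) / 10.0  # normalize to [0, 1]

# Swapping degradation domains or tool sets means swapping only this object;
# the policy-optimization loop itself is unchanged.
```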

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to video restoration by treating frame sequences as the state space and using the same LLM reward on temporal consistency.
  • If the LLM evaluator is kept fixed, the method offers a way to benchmark new restoration tools without collecting human annotations.
  • The deterministic policy may serve as a fast initialization for fine-tuning on small labeled sets when higher precision is needed.

Load-bearing premise

Multimodal LLMs can reliably judge perceptual image quality in a way that aligns with human preferences and is stable enough to train an effective restoration policy without ground-truth labels.
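One cheap way to probe the stability half of this premise before any training: query the evaluator repeatedly on the same image and inspect the score dispersion. A sketch, assuming the pluggable evaluator interface sketched earlier; the dispersion threshold is an arbitrary illustrative choice.

```python
import statistics

def reward_stability(evaluator, image, n_trials: int = 10, max_std: float = 0.05):
    """Repeatedly score one image; a noisy reward can destabilize policy training."""
    scores = [evaluator.score(image) for _ in range(n_trials)]
    spread = statistics.pstdev(scores)
    return spread, spread <= max_std
```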

What would settle it

An experiment in which the same agent is retrained with the multimodal LLM reward replaced by random scores or by scores from a non-perceptual metric and then evaluated on held-out multi-degradation images to check whether performance collapses.
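Sketched as a harness, with `train_agent` and `evaluate_holdout` as stand-ins for training and evaluation code this page does not show; the three reward variants mirror the conditions described above.

```python
import random

def settling_experiment(train_agent, evaluate_holdout, evaluator, seeds=(0, 1, 2)):
    """Retrain the same agent under three reward sources; compare held-out quality."""
    reward_sources = {
        "mllm": evaluator.score,                              # the proposed reward
        "random": lambda image: random.random(),              # destroys any signal
        "non_perceptual": lambda image: float(image.mean()),  # e.g., brightness proxy
    }
    results = {}
    for name, reward_fn in reward_sources.items():
        scores = [evaluate_holdout(train_agent(reward_fn, seed=s)) for s in seeds]
        results[name] = sum(scores) / len(scores)
    return results  # collapse under "random" would support the causal claim
```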

Figures

Figures reproduced from arXiv: 2512.18599 by Abrar Majeedi, Felix Jimenez, Hongcheng Wang, Jianglin Lu, Yuanwei Wu, Yun Fu, Ziyi Zhao.

Figure 1. (a) Existing restoration agents [5, 24, 73] typically consist of assessment, scheduling, execution, reflection, and rollback, using VLMs for degradation recognition and LLMs for plan making; (b) our SimpleCall agent determines the tool-calling sequence via a single policy execution, avoids the need for iterative trial-and-error, and generalizes to label-free environments. noise, haze, low-light condition…

Figure 2. Framework overview. The restoration agent predicts the next action based on the current input status (sampling actions during…

Figure 3. Qualitative comparison between our method and SOTA restoration baselines (for other baselines see the supplementary material).

Figure 4. Runtime comparison between ours and AgenticIR […]

Figure 5. Illustration of tool effects. Left: images with dark degra…

Figure 6. Illustration of the distortion-perception tradeoff for (a) 3 degradations and (b) 5 degradations. As the number of actions increases…

Figure 7. Illustration of distortion-perception tradeoff on (a) noise+JPEG compression artifact and (b) motion blur+defocus blur+noise.

Figure 8. Qualitative comparison between our method and SOTA restoration baselines.
read the original abstract

Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns a lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plan without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Restore-R1, a reinforcement learning framework for training a lightweight policy that selects sequences of image restoration tools. Training occurs in a label-free setting by using multimodal LLMs as perceptual evaluators to supply the sole reward signal. The resulting deterministic agent is claimed to match SOTA performance on full-reference metrics (PSNR/SSIM) while surpassing prior methods on no-reference metrics across multiple degradation types, with substantially faster inference by eliminating iterative reflection and tool search.

Significance. If the empirical claims hold after proper validation, the work would demonstrate that MLLM-derived rewards can substitute for supervised signals in restoration policy learning, offering a route to label-free training and efficient inference for complex degradations.

major comments (3)
  1. [Abstract] The central claim that the method 'matches SOTA performance on full-reference metrics' despite using no supervision is unsupported; no correlation, ablation, or human validation between MLLM perceptual scores and objective metrics (PSNR/SSIM) is reported, leaving open the possibility that the policy optimizes for LLM biases rather than pixel-level fidelity.
  2. [Method] Reward section: The reward formulation relies on an external MLLM evaluator with no described ablations on prompt design, score aggregation, temperature, or alternative reward shapes; this is load-bearing for the label-free training claim and must be shown to be robust.
  3. [Experiments] The abstract asserts 'extensive experiments' yet supplies no datasets, baselines, exact metric tables, statistical significance tests, or validation procedures, preventing assessment of whether the reported gains are reproducible or generalizable.
minor comments (2)
  1. Define all acronyms (MLLM, SOTA, etc.) on first use and ensure consistent notation for policy parameters versus reward components.
  2. Clarify the exact architecture of the 'lightweight agent' (parameter count, backbone) and provide direct runtime comparisons to prior reflection-based agents.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the method 'matches SOTA performance on full-reference metrics' despite using no supervision is unsupported; no correlation, ablation, or human validation between MLLM perceptual scores and objective metrics (PSNR/SSIM) is reported, leaving open the possibility that the policy optimizes for LLM biases rather than pixel-level fidelity.

    Authors: We appreciate this observation. To address the lack of explicit validation, we have added to the revised manuscript a dedicated analysis subsection that reports the correlation (using Spearman rank correlation) between the MLLM perceptual rewards and ground-truth PSNR/SSIM values on validation sets. Additionally, we conducted a human evaluation study with 20 participants rating restored images, showing alignment between MLLM scores and human preferences. These results support that the policy optimizes for perceptual quality that correlates with objective fidelity, reducing concerns about LLM-specific biases (a sketch of this correlation analysis follows these responses). revision: yes

  2. Referee: [Method] Reward section: The reward formulation relies on an external MLLM evaluator with no described ablations on prompt design, score aggregation, temperature, or alternative reward shapes; this is load-bearing for the label-free training claim and must be shown to be robust.

    Authors: We concur that ablations are necessary to validate the reward design. In the revised Method section, we now include comprehensive ablations covering variations in prompt engineering for the MLLM evaluator, different methods for aggregating scores (mean, median, and ensemble), MLLM sampling temperatures from 0.1 to 1.0, and alternative reward shapes including linear, logarithmic, and binary threshold-based rewards. The policy learning curves and final performance metrics remain consistent, demonstrating robustness of the label-free training approach (the ablated reward shapes are sketched after these responses). revision: yes

  3. Referee: [Experiments] The abstract asserts 'extensive experiments' yet supplies no datasets, baselines, exact metric tables, statistical significance tests, or validation procedures, preventing assessment of whether the reported gains are reproducible or generalizable.

    Authors: We thank the referee for pointing this out. To improve clarity and completeness, we have revised the Experiments section to include explicit descriptions of the datasets (e.g., DIV2K for training and standard test sets like Set5, BSD100 for evaluation), full tables of quantitative results with all metrics, comparisons to relevant baselines, statistical significance tests (e.g., t-tests), and detailed validation procedures. These additions make the experimental claims fully assessable and reproducible (a paired-test sketch follows). revision: yes
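A sketch of the correlation analysis described in response 1, assuming paired lists of MLLM scores, restored images, and clean references with pixel values in [0, 1]; it uses standard scipy/skimage routines and is an editorial illustration, not the authors' code.

```python
from scipy.stats import spearmanr
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reward_metric_correlation(mllm_scores, restored, references):
    """Spearman rank correlation between MLLM rewards and full-reference metrics."""
    psnr = [peak_signal_noise_ratio(ref, img, data_range=1.0)
            for ref, img in zip(references, restored)]
    ssim = [structural_similarity(ref, img, data_range=1.0, channel_axis=-1)
            for ref, img in zip(references, restored)]
    return {
        "psnr": spearmanr(mllm_scores, psnr),  # (rho, p-value)
        "ssim": spearmanr(mllm_scores, ssim),
    }
```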
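The reward shapes named in response 2, written out as plain functions over a raw evaluator score s in [0, 1]; the constants are illustrative assumptions, not values reported by the authors.

```python
import math

def linear_reward(s: float) -> float:
    return s  # identity baseline

def log_reward(s: float, eps: float = 1e-6) -> float:
    # Steep near zero, flat near one: penalizes very poor restorations hardest.
    return math.log(s + eps) - math.log(eps)

def binary_reward(s: float, tau: float = 0.7) -> float:
    return 1.0 if s >= tau else 0.0  # pass/fail at an assumed threshold tau
```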
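And the significance test mentioned in response 3, as a paired t-test over per-image metric values for the method versus one baseline on the same test set; the alpha level is the conventional 0.05, not a value stated by the authors.

```python
from scipy.stats import ttest_rel

def paired_significance(scores_ours, scores_baseline, alpha: float = 0.05):
    """Paired t-test on per-image metrics from two methods on the same images."""
    t_stat, p_value = ttest_rel(scores_ours, scores_baseline)
    return t_stat, p_value, p_value < alpha  # True if the difference is significant
```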

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external MLLM reward

full rationale

The paper trains a lightweight RL policy for sequential tool selection using only MLLM perceptual feedback as the reward signal in a label-free setting. The reported matching of SOTA full-reference metrics (PSNR/SSIM) and superiority on no-reference metrics are presented as experimental outcomes from evaluation on held-out data, not as quantities derived by construction from the reward model or any fitted parameters. No equations, self-definitional loops, or load-bearing self-citations appear in the provided text that reduce the performance claims to the inputs. The MLLM evaluator is treated as an independent black-box source of human-aligned feedback.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unproven domain assumption that MLLM feedback is sufficiently aligned with human perception to train a policy that generalizes across degradations; no free parameters are explicitly named, but the RL policy itself is fitted via optimization.

free parameters (1)
  • RL policy parameters
    The agent policy is optimized via reinforcement learning, so its weights are fitted to maximize the MLLM-derived reward.
axioms (1)
  • domain assumption: Multimodal LLMs provide human-aligned perceptual feedback usable as a reward signal
    Invoked to justify the label-free training mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1162 out tokens · 21614 ms · 2026-05-16T20:53:13.360298+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

  3. [3]

    The perception-distortion tradeoff

    Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018.

  4. [4]

    DSPO: Direct semantic preference optimization for real-world image super-resolution

    Miaomiao Cai, Simiao Li, Wei Li, Xudong Huang, Hanting Chen, Jie Hu, and Yunhe Wang. DSPO: Direct semantic preference optimization for real-world image super-resolution. arXiv preprint arXiv:2504.15176, 2025.

  5. [5]

    RestoreAgent: Autonomous image restoration agent via multimodal large language models

    Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, and Lei Zhu. RestoreAgent: Autonomous image restoration agent via multimodal large language models. In Advances in Neural Information Processing Systems, pages 110643–110666. Curran Associates, Inc., 2024.

  6. [6]

    A comparative study of image restoration networks for general backbone network design

    Xiangyu Chen, Zheyuan Li, Yuandong Pu, Yihao Liu, Jiantao Zhou, Yu Qiao, and Chao Dong. A comparative study of image restoration networks for general backbone network design. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  7. [7]

    DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback

    Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14239–14250, 2024.

  8. [8]

    InstructIR: High-quality image restoration following human instructions

    Marcos V. Conde, Gregor Geigle, and Radu Timofte. InstructIR: High-quality image restoration following human instructions. In European Conference on Computer Vision, pages 1–21. Springer, 2024.

  9. [9]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  10. [10]

    FeedEdit: Text-based image editing with dynamic feedback regulation

    Fengyi Fu, Lei Zhang, Mengqi Huang, and Zhendong Mao. FeedEdit: Text-based image editing with dynamic feedback regulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2661–2670, 2025.

  11. [11]

    Fully convolutional network with multi-step reinforcement learning for image processing

    Ryosuke Furuta, Naoto Inoue, and Toshihiko Yamasaki. Fully convolutional network with multi-step reinforcement learning for image processing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3598–3605, 2019.

  12. [12]

    Single image haze removal using dark channel prior

    Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2011.

  13. [13]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  14. [14]

    Towards flexible blind JPEG artifacts removal

    Jiaxi Jiang, Kai Zhang, and Radu Timofte. Towards flexible blind JPEG artifacts removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4997–5006, 2021.

  15. [15]

    A survey on all-in-one image restoration: Taxonomy, evaluation and future trends

    Junjun Jiang, Zengyuan Zuo, Gang Wu, Kui Jiang, and Xianming Liu. A survey on all-in-one image restoration: Taxonomy, evaluation and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  16. [16]

    Multi-agent image restoration

    Xu Jiang, Gehui Li, Bin Chen, and Jian Zhang. Multi-agent image restoration. arXiv preprint arXiv:2503.09403, 2025.

  17. [17]

    AutoDIR: Automatic all-in-one image restoration with latent diffusion

    Yitong Jiang, Zhaoyang Zhang, Tianfan Xue, and Jinwei Gu. AutoDIR: Automatic all-in-one image restoration with latent diffusion. In European Conference on Computer Vision, pages 340–359. Springer, 2024.

  18. [18]

    MUSIQ: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.

  19. [19]

    Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy

    Xiangtao Kong, Chao Dong, and Lei Zhang. Towards effective multiple-in-one image restoration: A sequential and prompt learning strategy. arXiv preprint arXiv:2401.03379, 2024.

  20. [20]

    A preliminary exploration towards general image restoration

    Xiangtao Kong, Jinjin Gu, Yihao Liu, Wenlong Zhang, Xiangyu Chen, Yu Qiao, and Chao Dong. A preliminary exploration towards general image restoration. arXiv preprint arXiv:2408.15143, 2024.

  21. [21]

    Iterative filter adaptive network for single image defocus deblurring

    Junyong Lee, Hyeongseok Son, Jaesung Rim, Sunghyun Cho, and Seungyong Lee. Iterative filter adaptive network for single image defocus deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2034–2042, 2021.

  22. [22]

    Benchmarking single-image dehazing and beyond

    Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1):492–505, 2018.

  23. [23]

    All-in-one image restoration for unknown corruption

    Boyun Li, Xiao Liu, Peng Hu, Zhongqin Wu, Jiancheng Lv, and Xi Peng. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17452–17462, 2022.

  24. [24]

    Hybrid agents for image restoration

    Bingchen Li, Xin Li, Yiting Lu, and Zhibo Chen. Hybrid agents for image restoration. arXiv preprint arXiv:2503.10120, 2025.

  25. [25]

    FoundIR: Unleashing million-scale training data to advance foundation models for image restoration

    Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, and Jinshan Pan. FoundIR: Unleashing million-scale training data to advance foundation models for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12626–12636, 2025.

  26. [26]

    SwinIR: Image restoration using Swin Transformer

    Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.

  27. [27]

    Improving image restoration through removing degradations in textual representations

    Jingbo Lin, Zhilu Zhang, Yuxiang Wei, Dongwei Ren, Dongsheng Jiang, Qi Tian, and Wangmeng Zuo. Improving image restoration through removing degradations in textual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2866–2878, 2024.

  28. [28]

    Controlling vision-language models for multi-task image restoration

    Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B. Schön. Controlling vision-language models for multi-task image restoration. In ICLR, 2024.

  29. [29]

    Benchmarking robustness in object detection: Autonomous driving when winter is coming

    Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.

  30. [30]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

  31. [31]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  33. [33]

    HIR-Diff: Unsupervised hyperspectral image restoration via improved diffusion models

    Li Pang, Xiangyu Rui, Long Cui, Hongzhong Wang, Deyu Meng, and Xiangyong Cao. HIR-Diff: Unsupervised hyperspectral image restoration via improved diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3005–3014, 2024.

  34. [34]

    All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations

    Dongwon Park, Byung Hyun Lee, and Se Young Chun. All-in-one image restoration for unknown degradations using adaptive discriminative filters for specific degradations. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5815–5824. IEEE, 2023.

  35. [35]

    Distort-and-recover: Color enhancement using deep reinforcement learning

    Jongchan Park, Joon-Young Lee, Donggeun Yoo, and In So Kweon. Distort-and-recover: Color enhancement using deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5928–5936, 2018.

  36. [36]

    Robust unsupervised StyleGAN image restoration

    Yohan Poirier-Ginter and Jean-François Lalonde. Robust unsupervised StyleGAN image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22292–22301, 2023.

  37. [37]

    PromptIR: Prompting for all-in-one image restoration

    Vaishnav Potlapalli, Syed Waqas Zamir, Salman H. Khan, and Fahad Shahbaz Khan. PromptIR: Prompting for all-in-one image restoration. Advances in Neural Information Processing Systems, 36:71275–71293, 2023.

  38. [38]

    RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

    Junbo Qiao, Miaomiao Cai, Wei Li, Yutong Liu, Xudong Huang, Gaoqi He, Jiao Xie, Jie Hu, Xinghao Chen, and Shaohui Lin. RealSR-R1: Reinforcement learning for real-world image super-resolution with vision-language chain-of-thought. arXiv preprint arXiv:2506.16796, 2025.

  39. [39]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  40. [40]

    Learning to deblur using light field generated and real defocus images

    Lingyan Ruan, Bin Chen, Jizhou Li, and Miuling Lam. Learning to deblur using light field generated and real defocus images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16304–16313, 2022.

  41. [41]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

  42. [42]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

  43. [43]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  45. [45]

    Mastering the game of Go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

  46. [46]

    Mastering the game of Go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

  47. [47]

    Vision transformers for single image dehazing

    Yuda Song, Zhuqing He, Hui Qian, and Xin Du. Vision transformers for single image dehazing. IEEE Transactions on Image Processing, 32:1927–1941, 2023.

  48. [48]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  49. [49]

    MAXIM: Multi-axis MLP for image processing

    Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. MAXIM: Multi-axis MLP for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5769–5780, 2022.

  50. [50]

    Image demoiréing with a dual-domain distilling network

    Hailing Wang, Qiaoyu Tian, Liang Li, and Xiaojie Guo. Image demoiréing with a dual-domain distilling network. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2021.

  51. [51]

    IFT: Image fusion transformer for ghost-free high dynamic range imaging

    Hailing Wang, Wei Li, Yuanyuan Xi, Jie Hu, Hanting Chen, Longyu Li, and Yunhe Wang. IFT: Image fusion transformer for ghost-free high dynamic range imaging. arXiv preprint arXiv:2309.15019, 2023.

  52. [52]

    Outlier-aware post-training quantization for image super-resolution

    Hailing Wang, Jianglin Lu, Yitian Zhang, and Yun Fu. Outlier-aware post-training quantization for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16175–16184, 2025.

  53. [53]

    OTC: Optimal tool calls via reinforcement learning

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. OTC: Optimal tool calls via reinforcement learning. arXiv e-prints, arXiv–2504, 2025.

  54. [54]

    Exploring CLIP for assessing the look and feel of images

    Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023.

  55. [55]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  56. [56]

    Image quality assessment: From error visibility to structural similarity

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  57. [57]

    Uformer: A general U-shaped transformer for image restoration

    Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general U-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17683–17693, 2022.

  58. [58]

    Q-learning

    Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

  59. [59]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

  60. [60]

    Towards open-ended visual quality comparison

    Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, et al. Towards open-ended visual quality comparison. In European Conference on Computer Vision, pages 360–377. Springer, 2024.

  61. [61]

    RIDCP: Revitalizing real image dehazing via high-quality codebook priors

    Rui-Qi Wu, Zheng-Peng Duan, Chun-Le Guo, Zhi Chai, and Chongyi Li. RIDCP: Revitalizing real image dehazing via high-quality codebook priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22282–22291, 2023.

  62. [62]

    MANIQA: Multi-dimension attention network for no-reference image quality assessment

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.

  63. [63]

    All-in-one medical image restoration via task-adaptive routing

    Zhiwen Yang, Haowei Chen, Ziniu Qian, Yang Yi, Hui Zhang, Dan Zhao, Bingzheng Wei, and Yan Xu. All-in-one medical image restoration via task-adaptive routing. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 67–77. Springer, 2024.

  64. [64]

    Depicting beyond scores: Advancing image quality assessment through multi-modal language models

    Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In European Conference on Computer Vision, pages 259–276. Springer, 2024.

  65. [65]

    Teaching large language models to regress accurate image quality scores using score distribution

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14483–14494, 2025.

  66. [66]

    Crafting a toolchain for image restoration by deep reinforcement learning

    Ke Yu, Chao Dong, Liang Lin, and Chen Change Loy. Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2443–2452, 2018.

  67. [67]

    Learning enriched features for real image restoration and enhancement

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for real image restoration and enhancement. In European Conference on Computer Vision, pages 492–511. Springer, 2020.

  68. [68]

    Multi-stage progressive image restoration

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14821–14831, 2021.

  69. [69]

    Restormer: Efficient transformer for high-resolution image restoration

    Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.

  70. [70]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  71. [71]

    Residual Non-local Attention Networks for Image Restoration

    Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.

  72. [72]

    Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model

    Yingjie Zhou, Jiezhang Cao, Zicheng Zhang, Farong Wen, Yanwei Jiang, Jun Jia, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. Q-Agent: Quality-driven chain-of-thought image restoration agent through robust multimodal large language model. arXiv preprint arXiv:2504.07148, 2025.

  73. [73]

    An intelligent agentic system for complex image restoration problems

    Kaiwen Zhu, Jinjin Gu, Zhiyuan You, Yu Qiao, and Chao Dong. An intelligent agentic system for complex image restoration problems. In The Thirteenth International Conference on Learning Representations, 2025.

  74. [74]

    Contrast limited adaptive histogram equalization

    Karel Zuiderveld. Contrast limited adaptive histogram equalization, pages 474–485. Academic Press Professional, Inc., USA, 1994.

  75. [75]

    Data In this section, we show how to synthesize degraded im- ages following existing work [73]

    Experimental Details 6.1. Data In this section, we show how to synthesize degraded im- ages following existing work [73]. For dark images, the V channel value of the images in the HSV color space will be randomly decreased by one of the following strategies: lin- ear mapping, Gamma correction, and subtracting a constant. For defocus blur, the images will ...

  76. [76]

    •Denoising: SwinIR [26] (noise level 15), SwinIR

    (quality factor 5), FBCNN [14] (blind to quality factor). •Denoising: SwinIR [26] (noise level 15), SwinIR

  77. [77]

    •Deraining: MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]

    (noise level 50), MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]. •Deraining: MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]. •Motion deblurring: MAXIM [49], MPRNet [68], Restormer [69], X-Restormer [6]. •Dehazing: MAXIM [49], X-Restormer [6]; RIDCP [61], DehazeFormer [47]. 6.3. Evaluation Metrics We assess model performance using thre...

  78. [78]

    Supervised Extension Table 5 reports the results of our method when extended to the label-available setting

    More Results 7.1. Supervised Extension Table 5 reports the results of our method when extended to the label-available setting. In this configuration, we use the clean reference images as supervision and define the re- 1 Table 4. Degradation data construction Settings # of Degradations Case Number Combinations I 2 Case 1 dark+noise Case 2 defocus blur+JPEG...

  79. [79]

    under rain+haze and rain+dark+noise degradation cases. The results further demonstrate that our method ef- fectively removes multiple co-occurring corruptions from degraded images and produces visual quality that is compa- rable to, or even exceeds, these supervised baselines. 7.3. Quantitative Comparison Tabls 6, 7, 8 present the performance comparison b...