DiTTo: Scalable Order-aware All-in-One Image Restoration Agent
Pith reviewed 2026-06-28 23:14 UTC · model grok-4.3
The pith
DiTTo trains an order-aware image restoration agent with linear-cost simulator data and plug-and-play expert addition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiTTo targets the efficiency and extensibility bottlenecks of agent-based image restoration with two components. The DiTTo Simulator reduces ORTD construction to O(N^D) simulator calls per image, where N^D is the number of degradation types in the universe D (the superscript is a label, not an exponent), by combining ∪S-IR single-step restoration simulation with AiO-IQA per-action quality prediction. The DiTTo Agent is trained by SFT on the generated trajectories, followed by Order-aware Restoration Alignment (ORA), which aligns degradation identification, restoration-action ordering, and output format along independent axes. Together these are claimed to enable plug-and-play extensibility when new restoration-experts are added.
What carries the argument
The DiTTo Simulator, which combines single-step restoration-action simulation (∪S-IR) and per-action quality prediction (AiO-IQA) to produce order-aware training trajectories at linear cost.
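To make the cost gap concrete, the sketch below counts calls under one assumed accounting. The quadratic case is modeled as a greedy search that runs every remaining expert at each step (N + (N-1) + ... + 1 calls); the linear case is modeled as one simulator query per degradation type. Both function names and both accounting schemes are illustrative assumptions; the paper summary only states the O((N^D)^2) and O(N^D) bounds.

```python
def greedy_search_calls(n_degradations: int) -> int:
    """Expert calls for an assumed greedy search: at each step, run every
    remaining expert once and keep the best one."""
    return sum(range(1, n_degradations + 1))  # N + (N-1) + ... + 1


def simulator_calls(n_degradations: int) -> int:
    """Simulator calls if each degradation type is queried once
    (assumed accounting for the linear bound)."""
    return n_degradations


# Print both counts side by side for a few pool sizes.
for n in (2, 5, 10, 20):
    print(n, greedy_search_calls(n), simulator_calls(n))
```

Under this accounting the ratio between the two counts is (N + 1) / 2, so the saving grows with the size of the degradation universe.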
If this is right
- Training data construction for the agent scales linearly rather than quadratically with the number of degradation types.
- A new restoration expert can be added by updating only the lightweight ORA stage without retraining the full agent.
- The resulting agent reaches state-of-the-art multi-degradation restoration quality on MiO-100 among prior agent-based methods.
- Order-aware scheduling improves final quality when degradations interact.
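The last point depends on restoration steps not commuting. The toy below is not taken from the paper; it uses two hypothetical stand-in experts on a 1D signal (a gain-and-clip "brighten" step and a 3-tap mean "denoise" step) to show that the same two actions applied in different orders can leave different residual error.

```python
def brighten(signal, gain=4.0):
    """Toy low-light expert (hypothetical): apply gain, then clip to [0, 1]."""
    return [min(1.0, max(0.0, gain * v)) for v in signal]


def denoise(signal):
    """Toy denoising expert (hypothetical): 3-tap mean with edge replication."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [sum(padded[i:i + 3]) / 3.0 for i in range(len(signal))]


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


clean = [0.8] * 5
noise = [0.0, 0.1, 0.0, -0.1, 0.0]
# Degrade: darken by 4x, then add noise.
degraded = [0.25 * c + n for c, n in zip(clean, noise)]

mse_brighten_first = mse(denoise(brighten(degraded)), clean)
mse_denoise_first = mse(brighten(denoise(degraded)), clean)
print(mse_brighten_first, mse_denoise_first)
```

The two orders disagree here only because clipping is nonlinear; with purely linear steps they would commute. Which order wins is specific to this toy and says nothing about real experts.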
Where Pith is reading between the lines
- The separation of concerns in ORA could let the agent adapt to changing expert pools over time without repeated full training.
- Similar linear-cost simulation of action sequences might reduce data-generation expense in other vision tasks that involve ordered operations.
- Direct measurement of how well simulator-predicted quality ranks match actual quality rankings on held-out real images would test the core generalization premise.
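The ranking check in the last bullet can be sketched with a Spearman rank correlation between simulator-predicted quality and measured quality. The implementation below is a plain-Python sketch with average ranks for ties; the example scores are made up for illustration and are not results from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores for four candidate orderings of one image.
predicted_quality = [0.61, 0.72, 0.55, 0.80]  # hypothetical simulator scores
measured_psnr = [24.1, 26.3, 23.0, 27.5]      # hypothetical executed-chain PSNR
print(spearman(predicted_quality, measured_psnr))
```

A value near 1 across held-out images would support the premise that simulator rankings track real rankings; values near 0 would undercut it.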
Load-bearing premise
The simulator's single-step simulations and quality predictions generate trajectories accurate enough that an agent trained on them generalizes to real multi-degraded images.
What would settle it
If agents trained solely on simulator-generated trajectories produce lower restoration quality than agents trained on fully enumerated real trajectories when both are tested on the same set of real multi-degraded images, the claim that the reduced-cost data suffices would be refuted.
Original abstract
Real-world images rarely suffer from a single degradation, and the order in which degradations are removed substantially affects the final restoration quality, motivating agent-based image restoration (IR), where a vision-language model schedules a pool of pre-built restoration-experts. However, existing training-based agents require $\mathcal{O}((N^{\mathbf{D}})^{2})$ restoration-expert calls per image to construct the Optimal Restoration-action Trajectory Dataset (ORTD), where $N^{\mathbf{D}}$ denotes the number of degradation types in the universe $\mathbf{D}$, and couple agent training to a fixed restoration-expert pool, preventing extension to newly introduced restoration-experts without full retraining. To overcome these efficiency and extensibility bottlenecks, we propose DiTTo, a novel order-aware image restoration agent framework consisting of the DiTTo Simulator and the DiTTo Agent. The DiTTo Simulator combines ∪S-IR for single-step restoration-action simulation and AiO-IQA for per-action quality prediction, reducing ORTD construction to $\mathcal{O}(N^{\mathbf{D}})$ simulator calls per image; the DiTTo Agent is trained by SFT on the simulator-generated ORTD, followed by Order-aware Restoration Alignment (ORA) that aligns degradation identification, restoration-action-ordering, and output format along independent axes. This enables plug-and-play scalable extensibility: adding a new restoration-expert requires updating only the lightweight ORA stage. On the MiO-100 evaluation set with up to five concurrent degradations, our DiTTo Agent achieves state-of-the-art multi-degradation restoration quality among previous agent-based IR methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiTTo, an agent-based framework for order-aware all-in-one image restoration. It consists of the DiTTo Simulator, which uses ∪S-IR single-step simulation combined with AiO-IQA per-action quality prediction to construct the Optimal Restoration-action Trajectory Dataset (ORTD) in O(N^D) calls per image (down from O((N^D)^2)), and the DiTTo Agent, trained via supervised fine-tuning on simulator-generated ORTD followed by Order-aware Restoration Alignment (ORA) for degradation identification, ordering, and output format. The framework claims plug-and-play extensibility to new restoration experts and state-of-the-art multi-degradation restoration quality on the MiO-100 set (up to five concurrent degradations) among prior agent-based IR methods.
Significance. If the simulator's trajectories prove representative, the O(N^D) reduction and ORA-based extensibility would meaningfully lower the barrier to training scalable agents for real-world multi-degradation restoration without retraining on every new expert; the explicit separation of simulation from agent training is a clear engineering contribution.
Major comments (3)
- [Abstract] Abstract: the central SOTA claim on MiO-100 among agent-based methods rests on the DiTTo Simulator generating ORTD trajectories whose rankings align with real full-sequence restoration quality, yet no quantitative validation, error analysis, or correlation between simulator-predicted order rankings and ground-truth PSNR/SSIM after executing the full ordered chains is reported.
- [Abstract] Abstract / Method description: the reduction of ORTD construction to O(N^D) via single-step ∪S-IR simulation plus AiO-IQA implicitly assumes that (a) single-step restorations compose sufficiently linearly to rank multi-degradation orders and (b) per-action AiO-IQA scores predict final restored-image metrics after the complete sequence; no empirical test of these assumptions on held-out images with ≥3 concurrent degradations is described.
- [Abstract] Abstract: the claim of 'plug-and-play scalable extensibility' via lightweight ORA updates is presented without any ablation showing that adding a new restoration-expert actually preserves or improves performance on MiO-100 without full retraining of the SFT stage.
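The empirical test asked for in the second comment can be sketched as a top-1 agreement check: for each held-out case, execute every ordering of the required actions, find the best one, and compare it with the order the cheap method predicts. The sketch below is an assumption about how such a check could be run, not the authors' protocol; `predict_order`, `run_chain`, the toy quality function, and the action names are hypothetical stand-ins.

```python
from itertools import permutations


def best_order_by_execution(actions, run_chain):
    """Exhaustively score every ordering and return the best one."""
    return max(permutations(actions), key=run_chain)


def top1_agreement(cases, predict_order, run_chain):
    """Fraction of cases where the predicted order is the exhaustive best."""
    hits = 0
    for actions in cases:
        if tuple(predict_order(actions)) == best_order_by_execution(actions, run_chain):
            hits += 1
    return hits / len(cases)


# Toy stand-in for "run the full chain and measure quality": earlier
# positions count more, so the best order is by descending weight.
WEIGHTS = {"dehaze": 3, "denoise": 2, "super_resolve": 1}


def toy_quality(order):
    return sum(WEIGHTS[a] * (len(order) - i) for i, a in enumerate(order))


def alphabetical(actions):
    """Toy predictor that happens to match the toy quality function."""
    return sorted(actions)


def reverse_alphabetical(actions):
    """Toy predictor that always disagrees with the toy quality function."""
    return sorted(actions, reverse=True)


cases = [("denoise", "dehaze", "super_resolve"), ("super_resolve", "denoise")]
print(top1_agreement(cases, alphabetical, toy_quality))
print(top1_agreement(cases, reverse_alphabetical, toy_quality))
```

Exhaustive execution costs k! chains for k concurrent degradations, which is 120 at the five-degradation ceiling of MiO-100, so a check of this kind is feasible on a held-out subset even though it is too expensive for dataset construction.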
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger empirical support of the simulator assumptions and extensibility claims. We address each major comment below and will incorporate the requested validations and ablations in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: the central SOTA claim on MiO-100 among agent-based methods rests on the DiTTo Simulator generating ORTD trajectories whose rankings align with real full-sequence restoration quality, yet no quantitative validation, error analysis, or correlation between simulator-predicted order rankings and ground-truth PSNR/SSIM after executing the full ordered chains is reported.
  Authors: We agree that explicit quantitative validation of the alignment between simulator rankings and real full-sequence metrics would strengthen the SOTA claim. In the revised manuscript we will add a dedicated validation subsection reporting error analysis together with correlation coefficients (Pearson and Spearman) between simulator-predicted order rankings and ground-truth PSNR/SSIM obtained by executing the complete ordered trajectories on held-out images.
  Revision: yes
- Referee: [Abstract] Abstract / Method description: the reduction of ORTD construction to O(N^D) via single-step ∪S-IR simulation plus AiO-IQA implicitly assumes that (a) single-step restorations compose sufficiently linearly to rank multi-degradation orders and (b) per-action AiO-IQA scores predict final restored-image metrics after the complete sequence; no empirical test of these assumptions on held-out images with ≥3 concurrent degradations is described.
  Authors: We acknowledge that the composition assumptions require direct empirical testing, especially for images with three or more degradations. The revision will include new experiments on held-out images with ≥3 concurrent degradations that quantify ranking accuracy under the linearity assumption and the predictive correlation of per-action AiO-IQA scores with final full-sequence metrics.
  Revision: yes
- Referee: [Abstract] Abstract: the claim of 'plug-and-play scalable extensibility' via lightweight ORA updates is presented without any ablation showing that adding a new restoration-expert actually preserves or improves performance on MiO-100 without full retraining of the SFT stage.
  Authors: We agree that an ablation study is necessary to substantiate the plug-and-play claim. The revised manuscript will add an ablation that measures MiO-100 performance when a new restoration expert is introduced using only the lightweight ORA stage versus full SFT retraining, demonstrating that performance is preserved or improved without retraining the SFT component.
  Revision: yes
Circularity Check
No circularity; empirical method with independent evaluation
Full rationale
The paper proposes DiTTo Simulator (∪S-IR + AiO-IQA) to generate ORTD at reduced cost, then trains the DiTTo Agent via SFT + ORA and reports empirical SOTA on MiO-100. No equations, derivations, or self-citations reduce the central performance claim to a quantity defined by the method itself; the result is obtained by running the trained agent on held-out images rather than by construction from fitted inputs or prior self-work. The simulator approximation is an engineering choice whose validity is externally testable via correlation with real PSNR/SSIM, not a self-referential loop.
DiTTo Agent (excerpt from the paper's appendix)
We use greedy decoding at inference for structured-JSON parse stability.
F.4 Stage 2 ORA (Order-aware Restoration Alignment)
Objective. ORA is a DPO-style objective applied to the decomposed planning axes (DP, OR, Tool) introduced in the main paper. Let π_θ and π_ref be the policy and reference models, and let (y_c, y_r) be a chosen/rejected response pair sha...
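For readers unfamiliar with the term "DPO-style", the sketch below computes the standard DPO loss for a single chosen/rejected pair from sequence log-probabilities under the policy and the frozen reference model. It is the generic loss only: the excerpt above is truncated before it says how ORA applies the objective to the DP, OR, and Tool axes, so nothing here should be read as the paper's exact formulation. The function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a sequence log-probability: logp_* under the policy,
    ref_logp_* under the frozen reference model. beta=0.1 is an assumed
    default, not a value taken from the paper.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy equal to the reference: the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# The loss is lower when the policy favors the chosen response.
print(dpo_loss(-1.0, -9.0, -5.0, -5.0), dpo_loss(-9.0, -1.0, -5.0, -5.0))
```

A per-axis variant would presumably build separate chosen/rejected pairs for each axis and combine the resulting losses, but that combination is not specified in the text available here.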