arxiv: 2605.30915 · v2 · pith:5GO43TOBnew · submitted 2026-05-29 · 💻 cs.CV

DiTTo: Scalable Order-aware All-in-One Image Restoration Agent

Pith reviewed 2026-06-28 23:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords image restoration · multi-degradation restoration · agent-based image restoration · order-aware restoration · restoration simulator · all-in-one image restoration · plug-and-play extensibility

The pith

DiTTo trains an order-aware image restoration agent with linear-cost simulator data and plug-and-play expert addition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-world images often carry several degradations whose removal order affects final quality, yet building training data for agents that choose the right sequence has required a number of restoration-expert calls that grows quadratically with the number of degradation types. The paper's simulator combines single-step restoration simulation with per-action quality prediction to generate the needed Optimal Restoration-action Trajectory Dataset (ORTD) with only a linear number of simulator calls per image. An agent is then trained on this data through supervised fine-tuning (SFT), followed by a separate Order-aware Restoration Alignment (ORA) step that treats degradation identification, ordering, and output format as independent axes. The paper reports state-of-the-art quality among agent-based methods on the MiO-100 benchmark for images with up to five concurrent degradations. A reader would care because linear scaling and modular alignment remove the main obstacles to handling larger degradation sets and evolving pools of restoration models.

Core claim

DiTTo addresses the efficiency and extensibility bottlenecks of agent-based image restoration with two components. The DiTTo Simulator reduces ORTD construction to O(N^D) simulator calls per image, where N^D is the number of degradation types in the universe D (a count, not an exponent), by combining ∪S-IR single-step restoration simulation with AiO-IQA per-action quality prediction. The DiTTo Agent is trained by SFT on the generated trajectories and then by Order-aware Restoration Alignment (ORA), which aligns degradation identification, restoration-action ordering, and output format along independent axes. Because of this separation, adding a new restoration-expert requires updating only the lightweight ORA stage, which is what the paper calls plug-and-play scalable extensibility.

What carries the argument

The DiTTo Simulator, which combines single-step restoration-action simulation (∪S-IR) and per-action quality prediction (AiO-IQA) to produce order-aware training trajectories at linear cost.
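
To make the cost contrast concrete, here is a minimal counting sketch, not the paper's algorithm. It assumes a greedy step-wise search in which every remaining restoration-expert is run once per step and the best result is kept, so an image with n degradations costs n + (n-1) + ... + 1 expert calls. It also assumes, hypothetically, that the simulator scores all remaining candidate actions in a single call per step. Both function names are illustrative.

```python
def greedy_expert_calls(n: int) -> int:
    """Expert calls when each step runs every remaining expert once (assumed scheme)."""
    return sum(remaining for remaining in range(n, 0, -1))


def simulator_calls(n: int) -> int:
    """Simulator calls if one call per step scores all candidates (assumed scheme)."""
    return n


# Print the two counts side by side for 1 to 5 concurrent degradations.
for n in range(1, 6):
    print(n, greedy_expert_calls(n), simulator_calls(n))
```

Under these assumptions the first count grows as n(n+1)/2 and the second as n, which matches the quadratic-versus-linear contrast the page describes.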

If this is right

  • Training data construction for the agent scales linearly rather than quadratically with the number of degradation types.
  • A new restoration expert can be added by updating only the lightweight ORA stage without retraining the full agent.
  • The resulting agent reaches state-of-the-art multi-degradation restoration quality on MiO-100 among prior agent-based methods.
  • Order-aware scheduling improves final quality when degradations interact.
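
The point that order matters can be illustrated with a toy sketch. The two operations below are stand-ins chosen for illustration, not restoration-experts from the paper: a clipped gain and a moving average do not commute, so the two possible orders give different outputs on the same input.

```python
from itertools import permutations


def brighten(signal):
    """Apply a gain of 2 and clip to [0, 1]; the clipping makes this step non-linear."""
    return [min(1.0, 2.0 * x) for x in signal]


def smooth(signal):
    """Three-tap moving average with edge replication."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0 for i in range(len(signal))]


# Hypothetical action pool; names are illustrative only.
ACTIONS = {"brighten": brighten, "smooth": smooth}


def run(order, signal):
    """Apply the named actions to the signal in the given order."""
    for name in order:
        signal = ACTIONS[name](signal)
    return signal


signal = [0.1, 0.9, 0.1, 0.9]
results = {order: run(order, signal) for order in permutations(ACTIONS)}
for order, out in results.items():
    print(order, [round(v, 3) for v in out])
```

Enumerating every order this way needs j! runs for j actions, which is why a scheduling agent is attractive once more than a few degradations are present.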

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of concerns in ORA could let the agent adapt to changing expert pools over time without repeated full training.
  • Similar linear-cost simulation of action sequences might reduce data-generation expense in other vision tasks that involve ordered operations.
  • Direct measurement of how well simulator-predicted quality ranks match actual quality rankings on held-out real images would test the core generalization premise.
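
The rank-agreement check in the last bullet could be run along the following lines. This is a sketch with made-up numbers: `simulator_scores` and `measured_psnr` are hypothetical per-order values, not results from the paper, and Spearman correlation is computed as Pearson correlation on average ranks.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up numbers for illustration: one entry per candidate restoration order.
simulator_scores = [0.62, 0.71, 0.55, 0.80]
measured_psnr = [24.1, 25.3, 23.0, 26.7]
print(spearman(simulator_scores, measured_psnr))
```

A value near 1 would mean the simulator ranks candidate orders the same way the real restoration chains do; a low value on held-out images would undercut the load-bearing premise.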

Load-bearing premise

The simulator's single-step simulations and quality predictions generate trajectories accurate enough that an agent trained on them generalizes to real multi-degraded images.

What would settle it

If agents trained solely on simulator-generated trajectories produce lower restoration quality than agents trained on fully enumerated real trajectories when both are tested on the same set of real multi-degraded images, the claim that the reduced-cost data suffices would be refuted.

Figures

Figures reproduced from arXiv: 2605.30915 by Jihyong Oh, Seungho Choi.

Figure 1: All-in-One (AiO) image restoration (IR) quality. (a) …
Figure 2.
Figure 3: Overview of the DiTTo Simulator, which constructs D^{DiTTo}_{ORTD} without exhaustive real restoration-expert calls. The simulator consists of two modules: ∪S-IR, instantiated as the single-degradation restoration simulator S_θ, and AiO-IQA, instantiated as the IQA-based scoring model f_ψ. ∪S-IR is first trained to approximate the next restored image-state induced by a candidate restoration-action identifier ρ ∈ …
Figure 4: Qualitative comparison on multi-degraded images shows that DiTTo Agent more effectively removes mixed degradations while preserving natural textures and semantic details.
Figure 5: Degradation process (left) and restoration process (right) for two instances sharing the type set D^δ = {D_1, D_2, D_{N^D}} but using different degradation-orderings δ, so they are distinct instances. The degradation side is indexed by i^D ∈ {0, …, j} (counts degradation-actions applied); the restoration side is indexed by i^R ∈ {0, …, j} (counts degradations still present), so the restoration-action-t…
Figure 6: An ORTD example with j=3. (a) The simulator-generated optimal restoration-action-trajectory (Ĩ^{δ,*}_{i^R})_{i^R=3}^{0} in D^{DiTTo}_{ORTD}. (b) The corresponding agent response, decomposed into DP (Degradation Perception-Reasoning), OR (Order-aware Restoration), and Tool (JSON-based tool call) axes used in ORA.
Figure 7: Additional qualitative comparisons on multi-degraded inputs with j ∈ {2, 3} concurrent degradations. Columns: Input, 4KAgent, JarvisIR, DiTTo Agent, ⋆DiTTo Agent.
Figure 8: Additional qualitative comparisons on multi-degraded inputs with j ∈ {3, 4, 5} concurrent degradations.
Original abstract

Real-world images rarely suffer from a single degradation, and the order in which degradations are removed substantially affects the final restoration quality, motivating agent-based image restoration (IR), where a vision-language model schedules a pool of pre-built restoration-experts. However, existing training-based agents require $\mathcal{O}((N^{\mathbf{D}})^{2})$ restoration-expert calls per image to construct the Optimal Restoration-action Trajectory Dataset (ORTD), where $N^{\mathbf{D}}$ denotes the number of degradation types in the universe $\mathbf{D}$, and couple agent training to a fixed restoration-expert pool, preventing extension to newly introduced restoration-experts without full retraining. To overcome these efficiency and extensibility bottlenecks, we propose \textbf{DiTTo}, a novel order-aware image restoration agent framework consisting of the DiTTo Simulator and the DiTTo Agent. The DiTTo Simulator combines $\cup$S-IR for single-step restoration-action simulation and AiO-IQA for per-action quality prediction, reducing ORTD construction to $\mathcal{O}(N^{\mathbf{D}})$ simulator calls per image; the DiTTo Agent is trained by SFT on the simulator-generated ORTD, followed by \textbf{Order-aware Restoration Alignment (ORA)} that aligns degradation identification, restoration-action-ordering, and output format along independent axes. This enables \textbf{plug-and-play scalable extensibility}: adding a new restoration-expert requires updating only the lightweight ORA stage. On the MiO-100 evaluation set with up to five concurrent degradations, our DiTTo Agent achieves state-of-the-art multi-degradation restoration quality among previous agent-based IR methods.
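
The paper's appendix describes ORA as a DPO-style objective over the DP, OR, and Tool axes, with a policy model, a reference model, and chosen/rejected response pairs; the exact ORA objective is not reproduced on this page. The sketch below therefore shows only the generic direct preference optimization loss for one pair. The per-axis summation in `ora_style_loss`, the function names, and the default `beta` are assumptions made for illustration.

```python
from math import exp, log, log1p


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one chosen/rejected pair, from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) in a numerically stable softplus form.
    return max(-margin, 0.0) + log1p(exp(-abs(margin)))


def ora_style_loss(pairs_by_axis, beta=0.1):
    """Sum of per-axis DPO terms; summing over axes is an assumption, not the paper's formula."""
    return sum(dpo_loss(*pair, beta=beta) for pair in pairs_by_axis.values())


# Hypothetical log-probabilities (chosen, rejected, ref chosen, ref rejected) per axis.
pairs = {
    "DP": (-4.0, -6.0, -5.0, -5.0),
    "OR": (-3.0, -3.5, -3.2, -3.1),
    "Tool": (-1.0, -2.0, -1.0, -2.0),
}
print(ora_style_loss(pairs))
```

When the policy and the reference assign the same log-probabilities, the margin is zero and the per-pair loss equals log 2; the loss falls below that as the policy prefers the chosen response more strongly than the reference does.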

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces DiTTo, an agent-based framework for order-aware all-in-one image restoration. It consists of the DiTTo Simulator, which uses ∪S-IR single-step simulation combined with AiO-IQA per-action quality prediction to construct the Optimal Restoration-action Trajectory Dataset (ORTD) in O(N^D) calls per image (down from O((N^D)^2)), and the DiTTo Agent, trained via supervised fine-tuning on simulator-generated ORTD followed by Order-aware Restoration Alignment (ORA) for degradation identification, ordering, and output format. The framework claims plug-and-play extensibility to new restoration experts and state-of-the-art multi-degradation restoration quality on the MiO-100 set (up to five concurrent degradations) among prior agent-based IR methods.

Significance. If the simulator's trajectories prove representative, the O(N^D) reduction and ORA-based extensibility would meaningfully lower the barrier to training scalable agents for real-world multi-degradation restoration without retraining on every new expert; the explicit separation of simulation from agent training is a clear engineering contribution.

major comments (3)
  1. [Abstract] Abstract: the central SOTA claim on MiO-100 among agent-based methods rests on the DiTTo Simulator generating ORTD trajectories whose rankings align with real full-sequence restoration quality, yet no quantitative validation, error analysis, or correlation between simulator-predicted order rankings and ground-truth PSNR/SSIM after executing the full ordered chains is reported.
  2. [Abstract] Abstract / Method description: the reduction of ORTD construction to O(N^D) via single-step ∪S-IR simulation plus AiO-IQA implicitly assumes that (a) single-step restorations compose sufficiently linearly to rank multi-degradation orders and (b) per-action AiO-IQA scores predict final restored-image metrics after the complete sequence; no empirical test of these assumptions on held-out images with ≥3 concurrent degradations is described.
  3. [Abstract] Abstract: the claim of 'plug-and-play scalable extensibility' via lightweight ORA updates is presented without any ablation showing that adding a new restoration-expert actually preserves or improves performance on MiO-100 without full retraining of the SFT stage.
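
The ground-truth metric the referee asks for can be computed directly once the full ordered chain has been executed. A minimal sketch of PSNR on flattened pixel sequences follows; it assumes intensities scaled to [0, `max_value`], and the function name and list-based input are illustrative choices rather than anything specified by the paper.

```python
from math import log10


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * log10(max_value ** 2 / mse)


# Hypothetical clean image and the output of one full restoration chain.
clean = [0.0, 0.5, 1.0, 0.5]
restored = [0.1, 0.5, 0.9, 0.5]
print(psnr(clean, restored))
```

Computing this value for every candidate order on held-out images gives the ground-truth ranking against which the simulator's predicted ranking would be compared.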

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support of the simulator assumptions and extensibility claims. We address each major comment below and will incorporate the requested validations and ablations in the revised manuscript.

  1. Referee: [Abstract] Abstract: the central SOTA claim on MiO-100 among agent-based methods rests on the DiTTo Simulator generating ORTD trajectories whose rankings align with real full-sequence restoration quality, yet no quantitative validation, error analysis, or correlation between simulator-predicted order rankings and ground-truth PSNR/SSIM after executing the full ordered chains is reported.

    Authors: We agree that explicit quantitative validation of the alignment between simulator rankings and real full-sequence metrics would strengthen the SOTA claim. In the revised manuscript we will add a dedicated validation subsection reporting error analysis together with correlation coefficients (Pearson and Spearman) between simulator-predicted order rankings and ground-truth PSNR/SSIM obtained by executing the complete ordered trajectories on held-out images. revision: yes

  2. Referee: [Abstract] Abstract / Method description: the reduction of ORTD construction to O(N^D) via single-step ∪S-IR simulation plus AiO-IQA implicitly assumes that (a) single-step restorations compose sufficiently linearly to rank multi-degradation orders and (b) per-action AiO-IQA scores predict final restored-image metrics after the complete sequence; no empirical test of these assumptions on held-out images with ≥3 concurrent degradations is described.

    Authors: We acknowledge that the composition assumptions require direct empirical testing, especially for images with three or more degradations. The revision will include new experiments on held-out images with ≥3 concurrent degradations that quantify ranking accuracy under the linearity assumption and the predictive correlation of per-action AiO-IQA scores with final full-sequence metrics. revision: yes

  3. Referee: [Abstract] Abstract: the claim of 'plug-and-play scalable extensibility' via lightweight ORA updates is presented without any ablation showing that adding a new restoration-expert actually preserves or improves performance on MiO-100 without full retraining of the SFT stage.

    Authors: We agree that an ablation study is necessary to substantiate the plug-and-play claim. The revised manuscript will add an ablation that measures MiO-100 performance when a new restoration expert is introduced using only the lightweight ORA stage versus full SFT retraining, demonstrating that performance is preserved or improved without retraining the SFT component. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with independent evaluation

The paper proposes DiTTo Simulator (∪S-IR + AiO-IQA) to generate ORTD at reduced cost, then trains the DiTTo Agent via SFT + ORA and reports empirical SOTA on MiO-100. No equations, derivations, or self-citations reduce the central performance claim to a quantity defined by the method itself; the result is obtained by running the trained agent on held-out images rather than by construction from fitted inputs or prior self-work. The simulator approximation is an engineering choice whose validity is externally testable via correlation with real PSNR/SSIM, not a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone supplies no concrete free parameters, axioms, or invented entities; the ledger is left empty pending the full text.

pith-pipeline@v0.9.1-grok · 5835 in / 1012 out tokens · 23292 ms · 2026-06-28T23:14:50.806849+00:00 · methodology
