
arxiv: 1907.08488 · v1 · pith:H4JBGZGUnew · submitted 2019-07-19 · 🧮 math.OC · cs.LG · eess.IV

An Optimal Control Approach to Early Stopping Variational Methods for Image Restoration

Pith reviewed 2026-05-24 19:00 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · eess.IV
keywords optimal control · early stopping · variational methods · image restoration · image denoising · image deblurring · gradient flow

The pith

Learning an optimal stopping time via optimal control improves variational gradient flows for image restoration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Variational methods for image processing typically reach their best quality when the gradient flow is stopped early instead of running to a stationary point. This occurs because of an inherent tradeoff between the error from incomplete optimization and the error from the model itself not perfectly matching the data. The paper treats the stopping time itself as the variable to optimize and learns it directly from training data by casting the problem as an optimal control task. The resulting schemes run efficiently and achieve results competitive with existing methods on denoising and deblurring.
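The early-stopping tradeoff can be reproduced on a toy problem. The sketch below is not the paper's model: it assumes a 1D signal, a quadratic smoothness energy R(x) = 0.5 * sum (x[i+1] - x[i])^2 with periodic boundaries, and an explicit Euler discretization of its gradient flow. The signal, noise level, step size and step count are all illustrative choices. The error to the clean signal first falls as noise is smoothed away, then rises as the flow drifts toward the constant stationary point.

```python
import numpy as np

def heat_step(x, tau=0.2):
    """One explicit Euler step of x' = Laplacian(x) with periodic boundaries."""
    return x + tau * (np.roll(x, 1) + np.roll(x, -1) - 2.0 * x)

rng = np.random.default_rng(0)
n = 128
grid = np.arange(n)
clean = np.sin(2.0 * np.pi * grid / n)          # toy ground truth (assumption)
noisy = clean + 0.1 * rng.standard_normal(n)    # toy degraded input (assumption)

n_steps = 2000
x = noisy.copy()
mse = [float(np.mean((x - clean) ** 2))]        # error before any flow step
for _ in range(n_steps):
    x = heat_step(x)
    mse.append(float(np.mean((x - clean) ** 2)))

best_step = int(np.argmin(mse))
print(f"MSE at start:      {mse[0]:.5f}")
print(f"MSE at best step:  {mse[best_step]:.5f} (step {best_step})")
print(f"MSE at final step: {mse[-1]:.5f}")
```

In this toy setting the best step lies strictly between the start and the end of the flow, which is the behaviour the stopping time is meant to exploit.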

Core claim

By introducing an optimal stopping time into the gradient flow process and learning it from data by means of an optimal control approach, we obtain highly efficient numerical schemes that achieve competitive results for image denoising and image deblurring. A nonlinear spectral analysis of the gradient of the learned regularizer gives enlightening insights about the different regularization properties.
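To illustrate what a spectral analysis of a regularizer's gradient looks at, the sketch below uses a hand-picked quadratic regularizer rather than the learned one, so the eigenproblem grad R(v) = lam * v is linear and its eigenfunctions are known in closed form (discrete cosines). The regularizer, the step size `tau`, and the reading of a "contrast factor" as the per-step scaling 1 - tau * lam are assumptions made for illustration only.

```python
import numpy as np

def grad_R(x):
    """Gradient of R(x) = 0.5 * sum_i (x[i+1] - x[i])**2 with periodic boundaries."""
    return -(np.roll(x, 1) + np.roll(x, -1) - 2.0 * x)

n, k, tau = 64, 3, 0.2                            # illustrative sizes (assumptions)
v = np.cos(2.0 * np.pi * k * np.arange(n) / n)    # candidate eigenfunction
lam = 2.0 - 2.0 * np.cos(2.0 * np.pi * k / n)     # its eigenvalue

# Eigenpair check: grad_R(v) should equal lam * v up to rounding.
residual = float(np.linalg.norm(grad_R(v) - lam * v))

# One explicit gradient step keeps the shape of v and rescales its amplitude.
stepped = v - tau * grad_R(v)
contrast_factor = 1.0 - tau * lam

print(f"eigen-residual:  {residual:.2e}")
print(f"contrast factor: {contrast_factor:.4f}")
```

A factor close to 1 means the pattern survives many steps almost unchanged, while a smaller factor means its contrast decays quickly. For a learned, nonlinear regularizer the eigenpairs have to be computed numerically.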

What carries the argument

Optimal stopping time learned from data via an optimal control formulation of the gradient flow.
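A minimal sketch of learning a single stopping time from paired data, under strong simplifying assumptions: the state equation is the linear flow x' = Laplacian(x) on periodic 1D signals, so x(T) is available in closed form via the FFT, and the training loss is the mean of 0.5 * ||x(T) - y||^2. By the chain rule its derivative with respect to T is the mean of <x(T) - y, x'(T)>, and a stationary T makes this vanish. The paper's own optimality conditions, which involve a learned regularizer and adjoint states, are not reproduced here. All names and data below are hypothetical.

```python
import numpy as np

n = 128
# Eigenvalues of the periodic negative Laplacian for each FFT frequency.
lam = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.fft.fftfreq(n))

def flow(x0, T):
    """Exact state at time T of x' = Laplacian(x), x(0) = x0 (periodic, last axis)."""
    return np.real(np.fft.ifft(np.exp(-lam * T) * np.fft.fft(x0, axis=-1), axis=-1))

def laplacian(x):
    return np.roll(x, 1, axis=-1) + np.roll(x, -1, axis=-1) - 2.0 * x

def loss(T, x0, y):
    """Mean over the training set of 0.5 * ||x(T) - y||^2."""
    return 0.5 * float(np.mean(np.sum((flow(x0, T) - y) ** 2, axis=-1)))

def dloss_dT(T, x0, y):
    """Chain rule: d/dT of the loss is the mean of <x(T) - y, x'(T)>."""
    xT = flow(x0, T)
    return float(np.mean(np.sum((xT - y) * laplacian(xT), axis=-1)))

# Toy training pairs (assumption): low-frequency sines plus Gaussian noise.
rng = np.random.default_rng(0)
grid = np.arange(n)
freqs = rng.integers(1, 4, size=16)
phases = rng.uniform(0.0, 2.0 * np.pi, size=16)
y = np.sin(2.0 * np.pi * freqs[:, None] * grid[None, :] / n + phases[:, None])
x0 = y + 0.1 * rng.standard_normal(y.shape)

# Bisection on the first-order condition dloss/dT = 0 over a bracket [0, 50].
lo, hi = 0.0, 50.0
assert dloss_dT(lo, x0, y) < 0.0 < dloss_dT(hi, x0, y)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if dloss_dT(mid, x0, y) < 0.0:
        lo = mid
    else:
        hi = mid
T_star = 0.5 * (lo + hi)
print(f"learned stopping time T* = {T_star:.3f}")
```

Bisection is used only because the toy problem has a single scalar unknown. A setting where the regularizer is learned jointly with T would need a gradient-based scheme instead.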

If this is right

  • The learned stopping time produces competitive numerical results on image denoising and deblurring tasks.
  • Nonlinear spectral analysis of the learned regularizer reveals its distinct regularization properties.
  • The formulation remains valid even when the regularizer itself is learned from data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single learned stopping time may simplify parameter tuning compared with methods that adjust multiple hyperparameters separately.
  • The same optimal-control framing could be applied to other inverse problems where early stopping improves practical performance.

Load-bearing premise

The tradeoff between optimization and modelling errors in variational models can be captured and optimized by learning a single stopping time via an optimal control formulation from data.

What would settle it

If a fixed stopping time or a standard early-stopping heuristic matches or exceeds the restoration quality of the learned stopping-time scheme on held-out denoising or deblurring test images, the central claim is falsified.
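Such a comparison needs a quality measure. The figure captions refer to PSNR, so the sketch below gives the standard PSNR definition for intensities in [0, peak] and a small helper that averages the PSNR gap between two schemes over a held-out set. The helper name `mean_psnr_gap` and the toy arrays are hypothetical.

```python
import numpy as np

def psnr(restored, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, peak]."""
    diff = np.asarray(restored, float) - np.asarray(reference, float)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def mean_psnr_gap(restored_a, restored_b, references):
    """Average of PSNR(A) - PSNR(B) over a set of images."""
    gaps = [psnr(a, r) - psnr(b, r)
            for a, b, r in zip(restored_a, restored_b, references)]
    return float(np.mean(gaps))

# Toy usage (assumption): constant images with a known offset error.
reference = np.full((8, 8), 0.5)
restored_a = reference + 0.01    # stands in for the learned stopping time
restored_b = reference + 0.10    # stands in for a baseline stopping rule
gap = mean_psnr_gap([restored_a], [restored_b], [reference])
print(f"PSNR A: {psnr(restored_a, reference):.2f} dB")
print(f"PSNR B: {psnr(restored_b, reference):.2f} dB")
print(f"mean gap (A - B): {gap:.2f} dB")
```

With A standing for the learned scheme and B for the baseline, a gap at or below zero on held-out images would be the outcome that counts against the central claim.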

Figures

Figures reproduced from arXiv: 1907.08488 by Alexander Effland, Erich Kobler, Karl Kunisch, Thomas Pock.

Figure 1: Contour plot of the peak signal-to-noise ratio depending on the num…
Figure 2: Image sequence with globally best PSNR value. Left to right: input…
Figure 3: Schematic drawing of optimal trajectory (black curve) as well as…
Figure 4: Left: Trajectories of the state equation for…
Figure 5: Plots of the average PSNR value across the test set (first and third…
Figure 9 (accompanying text): In the case of denoising, for too small T we still observe noisy images, whereas for too large T local image patterns are smoothed out. For image deblurring, images computed with too small values of T remain blurry, while for…
Figure 6: Average change of consecutive convolution kernels (solid blue) and…
Figure 7: Plots of the energies (first and third plot) and first order conditions…
Figure 8: From left to right: ground truth image, noisy input image (…
Figure 9: From left to right: ground truth image, blurry input image (…
Figure 10: Band plots of the energies (blue plots) and first order conditions…
Figure 11: Triplets of 7 × 7-kernels (top), potential functions ρ (middle) and activation functions φ (bottom) learned for image denoising.
Figure 12: Triplets of 7 × 7-kernels (top), potential functions ρ (middle) and activation functions φ (bottom) learned for image deblurring.
… which shows that the regularizer has a tendency to decrease the contrast. Formula (34) also reveals that eigenfunctions corresponding to contrast factors close to 1 are preserved over several iterations. In summary, the learned regularizer has a tendency to reduce the contrast…
Figure 13: Nv = 64 eigenpairs for image denoising, where all eigenfunctions have the resolution 127×127 and the intensity of each eigenfunction is adjusted to [0, 1].
Figure 14: Nv = 64 eigenpairs for image deblurring, where all eigenfunctions have the resolution 127×127 and the intensity of each eigenfunction is adjusted to [0, 1].

Original abstract

We investigate a well-known phenomenon of variational approaches in image processing, where typically the best image quality is achieved when the gradient flow process is stopped before converging to a stationary point. This paradox originates from a tradeoff between optimization and modelling errors of the underlying variational model and holds true even if deep learning methods are used to learn highly expressive regularizers from data. In this paper, we take advantage of this paradox and introduce an optimal stopping time into the gradient flow process, which in turn is learned from data by means of an optimal control approach. As a result, we obtain highly efficient numerical schemes that achieve competitive results for image denoising and image deblurring. A nonlinear spectral analysis of the gradient of the learned regularizer gives enlightening insights about the different regularization properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes learning a single stopping time T for the gradient flow of variational image restoration models via an optimal control formulation. This exploits the known tradeoff between optimization error and modeling error (the early-stopping paradox), even when the regularizer is learned from data. The resulting schemes are applied to denoising and deblurring and are claimed to be highly efficient while achieving competitive results; a nonlinear spectral analysis of the learned regularizer is provided for interpretability.

Significance. If the central construction holds, the work supplies a principled, data-driven mechanism for selecting stopping times in variational flows and demonstrates that optimal-control ideas can be used to turn the early-stopping paradox into a practical advantage. The spectral analysis of the learned regularizer is a concrete strength that may aid interpretability. The approach sits at the intersection of optimal control, variational methods, and learning, which is timely for the math.OC community.

major comments (2)
  1. [§3 and §4] §3 (optimal-control formulation) and §4 (experiments): the method learns and deploys a single scalar stopping time T for the entire test set. No analysis is supplied showing the variation of per-image optimal stopping times (e.g., histograms or standard deviation of T* across the training images or across noise levels). If this variation is large, the single-T compromise undermines the claim that the formulation directly exploits the paradox to produce competitive, generalizable schemes.
  2. [§4] §4 (experimental protocol): because T is learned from data, the manuscript must demonstrate that the reported competitive results are obtained on held-out test images after T has been fixed on a disjoint training set. No explicit statement of the train/test split, cross-validation procedure, or independent validation of T appears; without it the evaluation risks circularity and the competitiveness claim cannot be verified.
minor comments (1)
  1. [Abstract and §4] The abstract states that the schemes are “highly efficient” yet the manuscript does not report wall-clock times, iteration counts, or flop counts relative to full convergence or to standard early-stopping heuristics; adding a small efficiency table would strengthen the efficiency claim.
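The variation analysis requested in the first major comment could be sketched as follows. This is a toy stand-in, not the paper's experiment: it assumes 1D sine signals, Gaussian noise at three illustrative levels, and an explicit heat flow in place of the learned gradient flow. For each signal it finds the step with the smallest error by exhaustive search, then reports mean and standard deviation per noise level.

```python
import numpy as np

def best_stopping_step(noisy, clean, tau=0.2, max_steps=400):
    """Index of the explicit heat-flow step with the smallest error to `clean`."""
    x = noisy.copy()
    errors = [float(np.sum((x - clean) ** 2))]
    for _ in range(max_steps):
        x = x + tau * (np.roll(x, 1) + np.roll(x, -1) - 2.0 * x)
        errors.append(float(np.sum((x - clean) ** 2)))
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
n = 128
grid = np.arange(n)
per_image_steps = {}
for sigma in (0.05, 0.1, 0.2):                  # illustrative noise levels
    steps = []
    for _ in range(32):
        k = int(rng.integers(1, 4))             # random low frequency
        phase = rng.uniform(0.0, 2.0 * np.pi)
        clean = np.sin(2.0 * np.pi * k * grid / n + phase)
        noisy = clean + sigma * rng.standard_normal(n)
        steps.append(best_stopping_step(noisy, clean))
    per_image_steps[sigma] = steps
    print(f"sigma={sigma}: mean step {np.mean(steps):.1f}, std {np.std(steps):.1f}")
```

A spread that is large relative to the mean would indicate that one global stopping time is a noticeable compromise. `np.histogram(steps)` would give the histogram counts directly.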

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and the recognition of the work's timeliness at the intersection of optimal control and variational methods. We address the two major comments point by point below, proposing revisions to strengthen the manuscript where the concerns are valid.

  1. Referee: [§3 and §4] §3 (optimal-control formulation) and §4 (experiments): the method learns and deploys a single scalar stopping time T for the entire test set. No analysis is supplied showing the variation of per-image optimal stopping times (e.g., histograms or standard deviation of T* across the training images or across noise levels). If this variation is large, the single-T compromise undermines the claim that the formulation directly exploits the paradox to produce competitive, generalizable schemes.

    Authors: We agree that quantifying the variation of per-image optimal stopping times would strengthen the paper and allow readers to evaluate the single-T compromise directly. Although the formulation is designed to learn one global T (as stated in §3), we will add in the revision histograms of per-image T* values obtained by solving the optimal-control problem independently on each training image, together with mean, standard deviation, and dependence on noise level. This analysis will be placed in §4. We maintain that the global-T results remain competitive even if variation exists, because the optimal-control objective explicitly balances the early-stopping tradeoff across the training distribution; the added figures will make this transparent rather than undermine the claim. revision: yes

  2. Referee: [§4] §4 (experimental protocol): because T is learned from data, the manuscript must demonstrate that the reported competitive results are obtained on held-out test images after T has been fixed on a disjoint training set. No explicit statement of the train/test split, cross-validation procedure, or independent validation of T appears; without it the evaluation risks circularity and the competitiveness claim cannot be verified.

    Authors: We accept that the experimental protocol description in §4 lacks sufficient explicitness on data partitioning. The underlying experiments follow the conventional splits of the BSDS500 and Set12/Set14 datasets (training subset for learning the regularizer and T, disjoint test subset for final evaluation), but this was not stated clearly. In the revised manuscript we will insert a concise paragraph at the beginning of §4 that specifies the exact train/test division, confirms that T is learned solely on the training portion and then frozen, and states that all quantitative and visual results are reported on the held-out test images. This removes any ambiguity about circularity. revision: yes
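The protocol described in the second response (learn on a training subset, freeze, evaluate on a disjoint test subset) can be sketched on toy data. The signals, the 30/10 split, the candidate grid of step counts, and the heat flow standing in for the learned scheme are all assumptions. The actual dataset partition is the one named in the response and is not reproduced here.

```python
import numpy as np

def run_flow(x0, n_steps, tau=0.2):
    """Apply n_steps explicit heat-flow steps along the last axis (periodic)."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x + tau * (np.roll(x, 1, axis=-1) + np.roll(x, -1, axis=-1) - 2.0 * x)
    return x

def mean_error(x0, y, n_steps):
    """Mean squared L2 error over a set of signals after n_steps of flow."""
    return float(np.mean(np.sum((run_flow(x0, n_steps) - y) ** 2, axis=-1)))

rng = np.random.default_rng(0)
n, n_images = 128, 40
grid = np.arange(n)
k = rng.integers(1, 4, size=(n_images, 1))
phase = rng.uniform(0.0, 2.0 * np.pi, size=(n_images, 1))
clean = np.sin(2.0 * np.pi * k * grid[None, :] / n + phase)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

# Disjoint split: the stopping step is chosen on train_idx only.
perm = rng.permutation(n_images)
train_idx, test_idx = perm[:30], perm[30:]

candidates = list(range(0, 101))
train_errors = [mean_error(noisy[train_idx], clean[train_idx], s) for s in candidates]
learned_steps = candidates[int(np.argmin(train_errors))]   # frozen from here on

# Evaluation touches only held-out signals and the frozen step count.
test_error = mean_error(noisy[test_idx], clean[test_idx], learned_steps)
baseline_error = mean_error(noisy[test_idx], clean[test_idx], 0)
print(f"learned steps: {learned_steps}")
print(f"held-out error: {test_error:.4f} (no flow: {baseline_error:.4f})")
```

Keeping the index sets explicit makes the disjointness easy to check and to report alongside the results.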

Circularity Check

0 steps flagged

No circularity: optimal control formulation for stopping time is a standard data-driven method


The paper formulates early stopping of gradient flow as an optimal control problem whose solution (stopping time T) is learned from data to balance optimization and modeling error. This is a conventional supervised learning setup whose output (restored images on test data) is not equivalent to the training inputs by construction. No self-definitional steps, fitted-input-called-prediction reductions, or load-bearing self-citations are identifiable from the abstract or description; the central claim rests on the external validity of the learned T rather than re-deriving the input data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the learning procedure itself is the main unstated modeling choice.


