pith. sign in

arxiv: 1907.00103 · v1 · pith:KW6Z2VHYnew · submitted 2019-06-28 · 💻 cs.LG · stat.ML

Learning Effective Loss Functions Efficiently

Pith reviewed 2026-05-25 13:19 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords loss function learninganytime algorithmvalidation errorhyperparameter tuningmachine learning optimizationmeta-learning
0
0 comments X

The pith

An anytime algorithm learns loss functions that minimize validation error while guaranteeing asymptotic optimality in the worst case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the task of finding a loss function such that minimizing it on training data produces a model with low validation error. Although finding the optimal loss function is NP-hard, the work introduces an anytime algorithm that improves its output over time, remains asymptotically optimal no matter the difficulty, and runs efficiently when the problem matches an idealized easy case. This method tunes loss hyperparameters orders of magnitude faster than prior approaches and can discover new loss functions automatically while a model is being trained.

Core claim

We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized 'easy' case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.

What carries the argument

The anytime algorithm that iteratively searches over candidate loss functions to reduce validation error, with built-in worst-case optimality guarantees.

If this is right

  • Loss function hyperparameters can be tuned orders of magnitude faster than existing methods.
  • Novel loss functions can be discovered automatically while training proceeds.
  • The search process remains useful even when the space of possible loss functions is large, because the algorithm is anytime.
  • Models trained with the learned losses achieve lower validation error than those trained with hand-designed losses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same search strategy could be adapted to other meta-learning tasks where the objective is itself an optimization problem.
  • If the easy case corresponds to smooth dependence of validation error on loss parameters, the method may work especially well when the base model is convex or nearly convex.
  • Combining the discrete search with gradient-based fine-tuning of loss parameters could reduce the time to reach good solutions further.

Load-bearing premise

There exists an idealized easy case in which the algorithm is provably efficient, and real-world instances are close enough to this case for the method to remain practical despite general NP-hardness.

What would settle it

On standard image or text classification benchmarks, the algorithm fails to reach lower validation error than random search or grid search within the same computational budget.

Figures

Figures reproduced from arXiv: 1907.00103 by Matthew Streeter.

Figure 1
Figure 1. Figure 1: Comparison of three quadratic loss functions, on a one-dimensional minimization problem. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of algorithms for tuning regularization hyperparameters. Each curve is the [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Evolution of training, validation, and test log loss when learning a regularizer online [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized "easy" case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an anytime algorithm for learning loss functions that minimize a validation error metric when optimized over training data. It claims the algorithm is asymptotically optimal in the worst case for this NP-hard problem and provably efficient in an idealized 'easy' case. Experiments demonstrate orders-of-magnitude speedups in loss hyperparameter tuning compared to state-of-the-art methods and the ability to discover novel effective losses during training.

Significance. If the theoretical claims hold and align with the experimental regimes, the work would offer a principled approach to loss function design with both worst-case guarantees and practical efficiency, potentially impacting automated machine learning and meta-learning by reducing reliance on manual or grid-search-based loss tuning.

major comments (2)
  1. [Abstract] Abstract and theory section: The idealized 'easy' case for which provable efficiency is claimed is not defined or characterized (e.g., no conditions on the loss family, data distribution, or validation metric are stated), preventing verification of whether the hyperparameter-tuning and on-the-fly learning experiments fall inside this case or rely only on the anytime property whose worst-case complexity remains exponential.
  2. [Experiments] Experiments section: The reported speedups are presented without explicit checks or discussion confirming that the tested loss families and tasks satisfy the (unspecified) conditions of the 'easy' case; if they do not, the empirical results do not substantiate the claimed provable efficiency and rest solely on observed anytime behavior.
minor comments (2)
  1. Notation for the validation error metric and the loss parameterization is introduced without a dedicated preliminary section, making the problem formalization harder to follow on first reading.
  2. [Abstract] The abstract states 'orders of magnitude faster' without referencing the specific baselines, datasets, or quantitative factors shown in the experimental tables or figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the two major comments below and will revise the manuscript to improve clarity on the idealized 'easy' case.

read point-by-point responses
  1. Referee: [Abstract] Abstract and theory section: The idealized 'easy' case for which provable efficiency is claimed is not defined or characterized (e.g., no conditions on the loss family, data distribution, or validation metric are stated), preventing verification of whether the hyperparameter-tuning and on-the-fly learning experiments fall inside this case or rely only on the anytime property whose worst-case complexity remains exponential.

    Authors: We agree that an explicit characterization of the 'easy' case is needed for verification. The theory section describes this case as the regime in which the validation metric admits efficient optimization (e.g., when it is convex in the model parameters and the loss family consists of differentiable functions whose gradients can be computed in closed form under i.i.d. data). However, these conditions are stated informally. We will revise the abstract to include a one-sentence summary of the conditions and add a formal subsection in the theory section that lists the precise assumptions on the loss family, data distribution, and validation metric. This revision will also clarify the relationship to the anytime property, whose worst-case exponential complexity is already noted in the manuscript. revision: yes

  2. Referee: [Experiments] Experiments section: The reported speedups are presented without explicit checks or discussion confirming that the tested loss families and tasks satisfy the (unspecified) conditions of the 'easy' case; if they do not, the empirical results do not substantiate the claimed provable efficiency and rest solely on observed anytime behavior.

    Authors: We accept that the experiments section lacks explicit verification against the 'easy' case conditions. In the revised manuscript we will add a dedicated paragraph (or subsection) that checks each experimental setting against the conditions listed in the new theory subsection (e.g., convexity of the validation metric and differentiability of the loss family). For any experiment that falls outside the 'easy' case we will explicitly attribute the observed speedups to the anytime property rather than to the provable-efficiency guarantee. This will ensure the empirical claims are properly aligned with the theoretical statements. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic claims are independent of inputs

full rationale

The paper presents an anytime algorithm for loss function learning with stated worst-case asymptotic optimality and efficiency in an idealized easy case. No equations, fitted parameters, or self-citations are shown that would make any claimed prediction or result equivalent to its own inputs by construction. The NP-hardness acknowledgment and experimental claims stand as separate from the theoretical statements, with no self-definitional loops or renamed known results evident from the abstract or description. The derivation chain is self-contained as an algorithmic contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Ledger is minimal because only the abstract is available for review.

axioms (1)
  • domain assumption The problem of learning an optimal loss function is NP-hard
    Stated directly in the abstract as background for the algorithmic contribution.

pith-pipeline@v0.9.0 · 5605 in / 1193 out tokens · 45555 ms · 2026-05-25T13:19:51.805975+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    Gradient-based optimization of hyperparameters

    Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8): 1889–1900, 2000

  2. [2]

    Solving linear programs in the current matrix multiplication time

    Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM Symposium on the Theory of Computing (STOC), 2019

  3. [3]

    CVXPY: A python-embedded modeling language for convex optimization

    Steven Diamond and Stephen Boyd. CVXPY: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016

  4. [4]

    DeCAF: A deep convolutional activation feature for generic visual recognition

    Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, pages 647–655, 2014

  5. [5]

    Adaptive subgradient methods for online learning and stochastic optimization

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011

  6. [6]

    Learning to teach

    Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In Inter- national Conference on Learning Representations, 2018. URL https://openreview.net/ pdf?id=HJewuJWCZ

  7. [7]

    Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop, 2004

  8. [8]

    A statistical view of some chemometrics regression tools

    Ildiko E Frank and Jerome H Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993

  9. [9]

    Penalized regressions: the bridge versus the lasso

    Wenjiang J Fu. Penalized regressions: the bridge versus the lasso. Journal of computational and graphical statistics, 7(3):397–416, 1998

  10. [10]

    Ridge regression: Biased estimation for nonorthogonal problems

    Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970

  11. [11]

    Multi-class texture analysis in colorectal cancer histology

    Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6:27988, 2016

  12. [12]

    Asymptotically optimal regularization in smooth parametric models

    Percy S Liang, Guillaume Bouchard, Francis R Bach, and Michael I Jordan. Asymptotically optimal regularization in smooth parametric models. In Advances in Neural Information Processing Systems, pages 1132–1140, 2009

  13. [13]

    Self- tuning networks: Bilevel optimization of hyperparameters using structured best-response func- tions

    Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, and Roger Grosse. Self- tuning networks: Bilevel optimization of hyperparameters using structured best-response func- tions. 2019. URL https://openreview.net/pdf?id=r1eEG20qKQ

  14. [14]

    A survey of algorithms and analysis for adaptive online learning

    H Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017

  15. [15]

    O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V . Jawahar. Cats and dogs. InIEEE Conference on Computer Vision and Pattern Recognition, 2012

  16. [16]

    Hyperparameter optimization with approximate gradient

    Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746. PMLR, 2016

  17. [17]

    ImageNet Large Scale Visual Recognition Challenge,

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. 9

  18. [18]

    Practical Bayesian optimization of machine learning algorithms

    Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012

  19. [19]

    Learning optimal linear regularizers

    Matthew Streeter. Learning optimal linear regularizers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research , pages 5996–6004, Long Beach, California, USA, 09–15 Jun 2019. PMLR

  20. [20]

    Flowers, January 2019

    The TensorFlow Team. Flowers, January 2019. URL http://download.tensorflow.org/ example_images/flower_photos.tgz

  21. [21]

    Regression shrinkage and selection via the lasso

    Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996

  22. [22]

    Bayesian optimization with gradients

    Jian Wu, Matthias Poloczek, Andrew G Wilson, and Peter Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pages 5267–5278, 2017

  23. [23]

    Learning to teach with dynamic loss functions

    Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems, pages 6466–6477, 2018

  24. [24]

    Regularization and variable selection via the elastic net

    Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. Appendix A Proofs We now present a formal proof of Theorem 1. Theorem 1. Minimizing (5) is NP-hard, even in the special case when Θ = [0 , 1]n, F = Rk, φ(θ) =θ, ˜e(θ) is a ...