Learning Effective Loss Functions Efficiently

Matthew Streeter

arxiv: 1907.00103 · v1 · pith:KW6Z2VHYnew · submitted 2019-06-28 · 💻 cs.LG · stat.ML

Learning Effective Loss Functions Efficiently

Matthew Streeter This is my paper

Pith reviewed 2026-05-25 13:19 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords loss function learninganytime algorithmvalidation errorhyperparameter tuningmachine learning optimizationmeta-learning

0 comments

The pith

An anytime algorithm learns loss functions that minimize validation error while guaranteeing asymptotic optimality in the worst case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the task of finding a loss function such that minimizing it on training data produces a model with low validation error. Although finding the optimal loss function is NP-hard, the work introduces an anytime algorithm that improves its output over time, remains asymptotically optimal no matter the difficulty, and runs efficiently when the problem matches an idealized easy case. This method tunes loss hyperparameters orders of magnitude faster than prior approaches and can discover new loss functions automatically while a model is being trained.

Core claim

We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized 'easy' case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.

What carries the argument

The anytime algorithm that iteratively searches over candidate loss functions to reduce validation error, with built-in worst-case optimality guarantees.

If this is right

Loss function hyperparameters can be tuned orders of magnitude faster than existing methods.
Novel loss functions can be discovered automatically while training proceeds.
The search process remains useful even when the space of possible loss functions is large, because the algorithm is anytime.
Models trained with the learned losses achieve lower validation error than those trained with hand-designed losses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same search strategy could be adapted to other meta-learning tasks where the objective is itself an optimization problem.
If the easy case corresponds to smooth dependence of validation error on loss parameters, the method may work especially well when the base model is convex or nearly convex.
Combining the discrete search with gradient-based fine-tuning of loss parameters could reduce the time to reach good solutions further.

Load-bearing premise

There exists an idealized easy case in which the algorithm is provably efficient, and real-world instances are close enough to this case for the method to remain practical despite general NP-hardness.

What would settle it

On standard image or text classification benchmarks, the algorithm fails to reach lower validation error than random search or grid search within the same computational budget.

Figures

Figures reproduced from arXiv: 1907.00103 by Matthew Streeter.

**Figure 2.** Figure 2: Comparison of algorithms for tuning regularization hyperparameters. Each curve is the [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Evolution of training, validation, and test log loss when learning a regularizer online [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized "easy" case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Streeter gives an anytime algorithm for loss function search that claims worst-case asymptotic optimality plus efficiency in an easy case, but the easy case is undefined so the theory-experiment link stays loose.

read the letter

The core contribution is an anytime algorithm for learning loss functions that is asymptotically optimal in the worst case for an NP-hard problem and provably efficient in some idealized easy case. Experiments reportedly show orders-of-magnitude speedups over prior methods for hyperparameter tuning and the ability to discover effective losses during training itself. That combination of anytime behavior, worst-case guarantee, and practical speed claims is what stands out as new here. The work also supplies both theoretical framing and empirical results on real tuning tasks, which is more than many loss-learning papers manage. The soft spot is exactly the one the stress-test flags: the easy case is never characterized in the abstract, so it is impossible to know whether the reported speedups fall inside the regime where the efficiency proof applies or whether they are just empirical behavior of an algorithm whose worst-case cost remains exponential. If the easy case turns out to be narrow or artificial, the practical claims rest only on the experiments without theoretical backing. The citation pattern and problem setup look standard for this area, with no obvious circularity. This paper is aimed at people working on automated loss design or AutoML pipelines who need faster search methods. A serious referee should see it to check whether the proofs are tight and whether the easy case is defined in a way that covers the experiments; the claims are substantial enough to warrant that review even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an anytime algorithm for learning loss functions that minimize a validation error metric when optimized over training data. It claims the algorithm is asymptotically optimal in the worst case for this NP-hard problem and provably efficient in an idealized 'easy' case. Experiments demonstrate orders-of-magnitude speedups in loss hyperparameter tuning compared to state-of-the-art methods and the ability to discover novel effective losses during training.

Significance. If the theoretical claims hold and align with the experimental regimes, the work would offer a principled approach to loss function design with both worst-case guarantees and practical efficiency, potentially impacting automated machine learning and meta-learning by reducing reliance on manual or grid-search-based loss tuning.

major comments (2)

[Abstract] Abstract and theory section: The idealized 'easy' case for which provable efficiency is claimed is not defined or characterized (e.g., no conditions on the loss family, data distribution, or validation metric are stated), preventing verification of whether the hyperparameter-tuning and on-the-fly learning experiments fall inside this case or rely only on the anytime property whose worst-case complexity remains exponential.
[Experiments] Experiments section: The reported speedups are presented without explicit checks or discussion confirming that the tested loss families and tasks satisfy the (unspecified) conditions of the 'easy' case; if they do not, the empirical results do not substantiate the claimed provable efficiency and rest solely on observed anytime behavior.

minor comments (2)

Notation for the validation error metric and the loss parameterization is introduced without a dedicated preliminary section, making the problem formalization harder to follow on first reading.
[Abstract] The abstract states 'orders of magnitude faster' without referencing the specific baselines, datasets, or quantitative factors shown in the experimental tables or figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the two major comments below and will revise the manuscript to improve clarity on the idealized 'easy' case.

read point-by-point responses

Referee: [Abstract] Abstract and theory section: The idealized 'easy' case for which provable efficiency is claimed is not defined or characterized (e.g., no conditions on the loss family, data distribution, or validation metric are stated), preventing verification of whether the hyperparameter-tuning and on-the-fly learning experiments fall inside this case or rely only on the anytime property whose worst-case complexity remains exponential.

Authors: We agree that an explicit characterization of the 'easy' case is needed for verification. The theory section describes this case as the regime in which the validation metric admits efficient optimization (e.g., when it is convex in the model parameters and the loss family consists of differentiable functions whose gradients can be computed in closed form under i.i.d. data). However, these conditions are stated informally. We will revise the abstract to include a one-sentence summary of the conditions and add a formal subsection in the theory section that lists the precise assumptions on the loss family, data distribution, and validation metric. This revision will also clarify the relationship to the anytime property, whose worst-case exponential complexity is already noted in the manuscript. revision: yes
Referee: [Experiments] Experiments section: The reported speedups are presented without explicit checks or discussion confirming that the tested loss families and tasks satisfy the (unspecified) conditions of the 'easy' case; if they do not, the empirical results do not substantiate the claimed provable efficiency and rest solely on observed anytime behavior.

Authors: We accept that the experiments section lacks explicit verification against the 'easy' case conditions. In the revised manuscript we will add a dedicated paragraph (or subsection) that checks each experimental setting against the conditions listed in the new theory subsection (e.g., convexity of the validation metric and differentiability of the loss family). For any experiment that falls outside the 'easy' case we will explicitly attribute the observed speedups to the anytime property rather than to the provable-efficiency guarantee. This will ensure the empirical claims are properly aligned with the theoretical statements. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic claims are independent of inputs

full rationale

The paper presents an anytime algorithm for loss function learning with stated worst-case asymptotic optimality and efficiency in an idealized easy case. No equations, fitted parameters, or self-citations are shown that would make any claimed prediction or result equivalent to its own inputs by construction. The NP-hardness acknowledgment and experimental claims stand as separate from the theoretical statements, with no self-definitional loops or renamed known results evident from the abstract or description. The derivation chain is self-contained as an algorithmic contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Ledger is minimal because only the abstract is available for review.

axioms (1)

domain assumption The problem of learning an optimal loss function is NP-hard
Stated directly in the abstract as background for the algorithmic contribution.

pith-pipeline@v0.9.0 · 5605 in / 1193 out tokens · 45555 ms · 2026-05-25T13:19:51.805975+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

Gradient-based optimization of hyperparameters

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8): 1889–1900, 2000

work page 1900
[2]

Solving linear programs in the current matrix multiplication time

Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM Symposium on the Theory of Computing (STOC), 2019

work page 2019
[3]

CVXPY: A python-embedded modeling language for convex optimization

Steven Diamond and Stephen Boyd. CVXPY: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016

work page 2016
[4]

DeCAF: A deep convolutional activation feature for generic visual recognition

Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, pages 647–655, 2014

work page 2014
[5]

Adaptive subgradient methods for online learning and stochastic optimization

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011

work page 2011
[6]

Learning to teach

Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In Inter- national Conference on Learning Representations, 2018. URL https://openreview.net/ pdf?id=HJewuJWCZ

work page 2018
[7]

Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop, 2004

work page 2004
[8]

A statistical view of some chemometrics regression tools

Ildiko E Frank and Jerome H Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993

work page 1993
[9]

Penalized regressions: the bridge versus the lasso

Wenjiang J Fu. Penalized regressions: the bridge versus the lasso. Journal of computational and graphical statistics, 7(3):397–416, 1998

work page 1998
[10]

Ridge regression: Biased estimation for nonorthogonal problems

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970

work page 1970
[11]

Multi-class texture analysis in colorectal cancer histology

Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientiﬁc reports, 6:27988, 2016

work page 2016
[12]

Asymptotically optimal regularization in smooth parametric models

Percy S Liang, Guillaume Bouchard, Francis R Bach, and Michael I Jordan. Asymptotically optimal regularization in smooth parametric models. In Advances in Neural Information Processing Systems, pages 1132–1140, 2009

work page 2009
[13]

Self- tuning networks: Bilevel optimization of hyperparameters using structured best-response func- tions

Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, and Roger Grosse. Self- tuning networks: Bilevel optimization of hyperparameters using structured best-response func- tions. 2019. URL https://openreview.net/pdf?id=r1eEG20qKQ

work page 2019
[14]

A survey of algorithms and analysis for adaptive online learning

H Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017

work page 2017
[15]

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V . Jawahar. Cats and dogs. InIEEE Conference on Computer Vision and Pattern Recognition, 2012

work page 2012
[16]

Hyperparameter optimization with approximate gradient

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746. PMLR, 2016

work page 2016
[17]

ImageNet Large Scale Visual Recognition Challenge,

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. 9

work page doi:10.1007/s11263-015-0816-y 2015
[18]

Practical Bayesian optimization of machine learning algorithms

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012

work page 2012
[19]

Learning optimal linear regularizers

Matthew Streeter. Learning optimal linear regularizers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research , pages 5996–6004, Long Beach, California, USA, 09–15 Jun 2019. PMLR

work page 2019
[20]

Flowers, January 2019

The TensorFlow Team. Flowers, January 2019. URL http://download.tensorflow.org/ example_images/flower_photos.tgz

work page 2019
[21]

Regression shrinkage and selection via the lasso

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996

work page 1996
[22]

Bayesian optimization with gradients

Jian Wu, Matthias Poloczek, Andrew G Wilson, and Peter Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pages 5267–5278, 2017

work page 2017
[23]

Learning to teach with dynamic loss functions

Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems, pages 6466–6477, 2018

work page 2018
[24]

Regularization and variable selection via the elastic net

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. Appendix A Proofs We now present a formal proof of Theorem 1. Theorem 1. Minimizing (5) is NP-hard, even in the special case when Θ = [0 , 1]n, F = Rk, φ(θ) =θ, ˜e(θ) is a ...

work page 2005

[1] [1]

Gradient-based optimization of hyperparameters

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8): 1889–1900, 2000

work page 1900

[2] [2]

Solving linear programs in the current matrix multiplication time

Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM Symposium on the Theory of Computing (STOC), 2019

work page 2019

[3] [3]

CVXPY: A python-embedded modeling language for convex optimization

Steven Diamond and Stephen Boyd. CVXPY: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016

work page 2016

[4] [4]

DeCAF: A deep convolutional activation feature for generic visual recognition

Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, pages 647–655, 2014

work page 2014

[5] [5]

Adaptive subgradient methods for online learning and stochastic optimization

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011

work page 2011

[6] [6]

Learning to teach

Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In Inter- national Conference on Learning Representations, 2018. URL https://openreview.net/ pdf?id=HJewuJWCZ

work page 2018

[7] [7]

Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop, 2004

work page 2004

[8] [8]

A statistical view of some chemometrics regression tools

Ildiko E Frank and Jerome H Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993

work page 1993

[9] [9]

Penalized regressions: the bridge versus the lasso

Wenjiang J Fu. Penalized regressions: the bridge versus the lasso. Journal of computational and graphical statistics, 7(3):397–416, 1998

work page 1998

[10] [10]

Ridge regression: Biased estimation for nonorthogonal problems

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970

work page 1970

[11] [11]

Multi-class texture analysis in colorectal cancer histology

Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientiﬁc reports, 6:27988, 2016

work page 2016

[12] [12]

Asymptotically optimal regularization in smooth parametric models

Percy S Liang, Guillaume Bouchard, Francis R Bach, and Michael I Jordan. Asymptotically optimal regularization in smooth parametric models. In Advances in Neural Information Processing Systems, pages 1132–1140, 2009

work page 2009

[13] [13]

Self- tuning networks: Bilevel optimization of hyperparameters using structured best-response func- tions

Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, and Roger Grosse. Self- tuning networks: Bilevel optimization of hyperparameters using structured best-response func- tions. 2019. URL https://openreview.net/pdf?id=r1eEG20qKQ

work page 2019

[14] [14]

A survey of algorithms and analysis for adaptive online learning

H Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017

work page 2017

[15] [15]

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V . Jawahar. Cats and dogs. InIEEE Conference on Computer Vision and Pattern Recognition, 2012

work page 2012

[16] [16]

Hyperparameter optimization with approximate gradient

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746. PMLR, 2016

work page 2016

[17] [17]

ImageNet Large Scale Visual Recognition Challenge,

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. 9

work page doi:10.1007/s11263-015-0816-y 2015

[18] [18]

Practical Bayesian optimization of machine learning algorithms

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012

work page 2012

[19] [19]

Learning optimal linear regularizers

Matthew Streeter. Learning optimal linear regularizers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research , pages 5996–6004, Long Beach, California, USA, 09–15 Jun 2019. PMLR

work page 2019

[20] [20]

Flowers, January 2019

The TensorFlow Team. Flowers, January 2019. URL http://download.tensorflow.org/ example_images/flower_photos.tgz

work page 2019

[21] [21]

Regression shrinkage and selection via the lasso

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996

work page 1996

[22] [22]

Bayesian optimization with gradients

Jian Wu, Matthias Poloczek, Andrew G Wilson, and Peter Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pages 5267–5278, 2017

work page 2017

[23] [23]

Learning to teach with dynamic loss functions

Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems, pages 6466–6477, 2018

work page 2018

[24] [24]

Regularization and variable selection via the elastic net

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. Appendix A Proofs We now present a formal proof of Theorem 1. Theorem 1. Minimizing (5) is NP-hard, even in the special case when Θ = [0 , 1]n, F = Rk, φ(θ) =θ, ˜e(θ) is a ...

work page 2005