Learning Effective Loss Functions Efficiently
Pith reviewed 2026-05-25 13:19 UTC · model grok-4.3
The pith
An anytime algorithm learns loss functions that minimize validation error while guaranteeing asymptotic optimality in the worst case.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized 'easy' case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.
What carries the argument
The anytime algorithm that iteratively searches over candidate loss functions to reduce validation error, with built-in worst-case optimality guarantees.
If this is right
- Loss function hyperparameters can be tuned orders of magnitude faster than existing methods.
- Novel loss functions can be discovered automatically while training proceeds.
- The search process remains useful even when the space of possible loss functions is large, because the algorithm is anytime.
- Models trained with the learned losses achieve lower validation error than those trained with hand-designed losses.
Where Pith is reading between the lines
- The same search strategy could be adapted to other meta-learning tasks where the objective is itself an optimization problem.
- If the easy case corresponds to smooth dependence of validation error on loss parameters, the method may work especially well when the base model is convex or nearly convex.
- Combining the discrete search with gradient-based fine-tuning of loss parameters could reduce the time to reach good solutions further.
Load-bearing premise
There exists an idealized easy case in which the algorithm is provably efficient, and real-world instances are close enough to this case for the method to remain practical despite general NP-hardness.
What would settle it
On standard image or text classification benchmarks, the algorithm fails to reach lower validation error than random search or grid search within the same computational budget.
Figures
read the original abstract
We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized "easy" case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an anytime algorithm for learning loss functions that minimize a validation error metric when optimized over training data. It claims the algorithm is asymptotically optimal in the worst case for this NP-hard problem and provably efficient in an idealized 'easy' case. Experiments demonstrate orders-of-magnitude speedups in loss hyperparameter tuning compared to state-of-the-art methods and the ability to discover novel effective losses during training.
Significance. If the theoretical claims hold and align with the experimental regimes, the work would offer a principled approach to loss function design with both worst-case guarantees and practical efficiency, potentially impacting automated machine learning and meta-learning by reducing reliance on manual or grid-search-based loss tuning.
major comments (2)
- [Abstract] Abstract and theory section: The idealized 'easy' case for which provable efficiency is claimed is not defined or characterized (e.g., no conditions on the loss family, data distribution, or validation metric are stated), preventing verification of whether the hyperparameter-tuning and on-the-fly learning experiments fall inside this case or rely only on the anytime property whose worst-case complexity remains exponential.
- [Experiments] Experiments section: The reported speedups are presented without explicit checks or discussion confirming that the tested loss families and tasks satisfy the (unspecified) conditions of the 'easy' case; if they do not, the empirical results do not substantiate the claimed provable efficiency and rest solely on observed anytime behavior.
minor comments (2)
- Notation for the validation error metric and the loss parameterization is introduced without a dedicated preliminary section, making the problem formalization harder to follow on first reading.
- [Abstract] The abstract states 'orders of magnitude faster' without referencing the specific baselines, datasets, or quantitative factors shown in the experimental tables or figures.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. We address the two major comments below and will revise the manuscript to improve clarity on the idealized 'easy' case.
read point-by-point responses
-
Referee: [Abstract] Abstract and theory section: The idealized 'easy' case for which provable efficiency is claimed is not defined or characterized (e.g., no conditions on the loss family, data distribution, or validation metric are stated), preventing verification of whether the hyperparameter-tuning and on-the-fly learning experiments fall inside this case or rely only on the anytime property whose worst-case complexity remains exponential.
Authors: We agree that an explicit characterization of the 'easy' case is needed for verification. The theory section describes this case as the regime in which the validation metric admits efficient optimization (e.g., when it is convex in the model parameters and the loss family consists of differentiable functions whose gradients can be computed in closed form under i.i.d. data). However, these conditions are stated informally. We will revise the abstract to include a one-sentence summary of the conditions and add a formal subsection in the theory section that lists the precise assumptions on the loss family, data distribution, and validation metric. This revision will also clarify the relationship to the anytime property, whose worst-case exponential complexity is already noted in the manuscript. revision: yes
-
Referee: [Experiments] Experiments section: The reported speedups are presented without explicit checks or discussion confirming that the tested loss families and tasks satisfy the (unspecified) conditions of the 'easy' case; if they do not, the empirical results do not substantiate the claimed provable efficiency and rest solely on observed anytime behavior.
Authors: We accept that the experiments section lacks explicit verification against the 'easy' case conditions. In the revised manuscript we will add a dedicated paragraph (or subsection) that checks each experimental setting against the conditions listed in the new theory subsection (e.g., convexity of the validation metric and differentiability of the loss family). For any experiment that falls outside the 'easy' case we will explicitly attribute the observed speedups to the anytime property rather than to the provable-efficiency guarantee. This will ensure the empirical claims are properly aligned with the theoretical statements. revision: yes
Circularity Check
No circularity: algorithmic claims are independent of inputs
full rationale
The paper presents an anytime algorithm for loss function learning with stated worst-case asymptotic optimality and efficiency in an idealized easy case. No equations, fitted parameters, or self-citations are shown that would make any claimed prediction or result equivalent to its own inputs by construction. The NP-hardness acknowledgment and experimental claims stand as separate from the theoretical statements, with no self-definitional loops or renamed known results evident from the abstract or description. The derivation chain is self-contained as an algorithmic contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The problem of learning an optimal loss function is NP-hard
Reference graph
Works this paper leans on
-
[1]
Gradient-based optimization of hyperparameters
Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8): 1889–1900, 2000
work page 1900
-
[2]
Solving linear programs in the current matrix multiplication time
Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM Symposium on the Theory of Computing (STOC), 2019
work page 2019
-
[3]
CVXPY: A python-embedded modeling language for convex optimization
Steven Diamond and Stephen Boyd. CVXPY: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016
work page 2016
-
[4]
DeCAF: A deep convolutional activation feature for generic visual recognition
Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, pages 647–655, 2014
work page 2014
-
[5]
Adaptive subgradient methods for online learning and stochastic optimization
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011
work page 2011
-
[6]
Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In Inter- national Conference on Learning Representations, 2018. URL https://openreview.net/ pdf?id=HJewuJWCZ
work page 2018
-
[7]
Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Pattern Recognition Workshop, 2004
work page 2004
-
[8]
A statistical view of some chemometrics regression tools
Ildiko E Frank and Jerome H Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993
work page 1993
-
[9]
Penalized regressions: the bridge versus the lasso
Wenjiang J Fu. Penalized regressions: the bridge versus the lasso. Journal of computational and graphical statistics, 7(3):397–416, 1998
work page 1998
-
[10]
Ridge regression: Biased estimation for nonorthogonal problems
Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970
work page 1970
-
[11]
Multi-class texture analysis in colorectal cancer histology
Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6:27988, 2016
work page 2016
-
[12]
Asymptotically optimal regularization in smooth parametric models
Percy S Liang, Guillaume Bouchard, Francis R Bach, and Michael I Jordan. Asymptotically optimal regularization in smooth parametric models. In Advances in Neural Information Processing Systems, pages 1132–1140, 2009
work page 2009
-
[13]
Matthew MacKay, Paul Vicol, Jonathan Lorraine, David Duvenaud, and Roger Grosse. Self- tuning networks: Bilevel optimization of hyperparameters using structured best-response func- tions. 2019. URL https://openreview.net/pdf?id=r1eEG20qKQ
work page 2019
-
[14]
A survey of algorithms and analysis for adaptive online learning
H Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017
work page 2017
-
[15]
O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V . Jawahar. Cats and dogs. InIEEE Conference on Computer Vision and Pattern Recognition, 2012
work page 2012
-
[16]
Hyperparameter optimization with approximate gradient
Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746. PMLR, 2016
work page 2016
-
[17]
ImageNet Large Scale Visual Recognition Challenge,
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. 9
-
[18]
Practical Bayesian optimization of machine learning algorithms
Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012
work page 2012
-
[19]
Learning optimal linear regularizers
Matthew Streeter. Learning optimal linear regularizers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research , pages 5996–6004, Long Beach, California, USA, 09–15 Jun 2019. PMLR
work page 2019
-
[20]
The TensorFlow Team. Flowers, January 2019. URL http://download.tensorflow.org/ example_images/flower_photos.tgz
work page 2019
-
[21]
Regression shrinkage and selection via the lasso
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996
work page 1996
-
[22]
Bayesian optimization with gradients
Jian Wu, Matthias Poloczek, Andrew G Wilson, and Peter Frazier. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pages 5267–5278, 2017
work page 2017
-
[23]
Learning to teach with dynamic loss functions
Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems, pages 6466–6477, 2018
work page 2018
-
[24]
Regularization and variable selection via the elastic net
Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. Appendix A Proofs We now present a formal proof of Theorem 1. Theorem 1. Minimizing (5) is NP-hard, even in the special case when Θ = [0 , 1]n, F = Rk, φ(θ) =θ, ˜e(θ) is a ...
work page 2005
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.