arxiv: 2509.04477 · v3 · pith:FOQSA6FCnew · submitted 2025-08-30 · 🧮 math.OC · cs.LG

Universal Representation of Generalized Convex Functions and their Gradients

Moeen Nehzati This is my paper

Pith reviewed 2026-05-18 19:30 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords generalized convex functionsuniversal approximationdifferentiable layerbilevel optimizationoptimal transportoptimal auctionsconvex parameterization

0 comments

The pith

A differentiable layer with convex parameters universally approximates generalized convex functions and their gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a differentiable layer whose parameters are constrained to a convex set. It establishes through two theorems that this layer and the gradient it produces can approximate any generalized convex function and its gradient to any desired accuracy. This structure is useful because many optimization tasks involve nested or bilevel objectives that reduce to ordinary single-level problems once a suitable approximation for the inner generalized convex function is available. The authors illustrate the approach on learning optimal transport maps for general cost functions and on learning multi-good optimal auctions, both of which become solvable by standard first-order methods.

Core claim

The paper shows that a differentiable layer equipped with a convex parameter space, together with the gradient of that layer, are universal approximators for generalized convex functions and their gradients. Theorems 5.1 and 5.2 state the approximation guarantees. Because the layer is differentiable and its parameters live in a convex domain, bilevel or min-max formulations that contain generalized convex functions can be rewritten as single-level problems that standard gradient-based optimizers can solve directly. The same construction is applied to optimal transport with arbitrary costs and to multi-item auction design.

What carries the argument

The differentiable layer whose parameters are restricted to a convex set, which acts as a universal building block for representing generalized convex functions.

If this is right

Optimal transport maps for arbitrary cost functions can be learned by reducing the problem to a single-level optimization.
Optimal auction mechanisms for multiple goods can be learned without retaining a bilevel structure.
Any bilevel problem whose inner objective is a generalized convex function becomes solvable by first-order methods once the layer is substituted.
The same layer can be inserted into larger neural architectures whenever convexity or generalized convexity must be preserved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The layer could be tested on other nested problems such as Stackelberg games or hyperparameter optimization to see whether the single-level reduction remains stable.
One could measure how the number of parameters needed for a given approximation accuracy scales with dimension, which the paper does not quantify.
The convex-parameter construction might be combined with existing convex neural-network layers to enforce additional structural constraints.

Load-bearing premise

The specific parameterization of the layer is expressive enough to represent every generalized convex function to arbitrary accuracy.

What would settle it

The existence of even one generalized convex function (or its gradient) that cannot be approximated arbitrarily closely by the layer would refute the universal approximation theorems.

Figures

Figures reproduced from arXiv: 2509.04477 by Moeen Nehzati.

**Figure 2.** Figure 2: Mechanism Found for The Single Item Case: In the single item case the optimal mechanism is [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Mechanism Found for The Two Item Case: The mechanism is neither a posted price for both items [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

read the original abstract

A wide range of optimization problems can often be written in terms of generalized convex functions (GCFs). When this structure is present, it can convert certain nested bilevel objectives into single-level problems amenable to standard first-order optimization methods. We provide a new differentiable layer with a convex parameter space and show (Theorems 5.1 and 5.2) that it and its gradient are universal approximators for GCFs and their gradients. We demonstrate how this parameterization can be leveraged in practice by (i) learning optimal transport maps with general cost functions and (ii) learning optimal auctions of multiple goods. In both these cases, we show how our layer can be used to convert the existing bilevel or min-max formulations into single-level problems that can be solved efficiently with first-order methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a convex-parameterized differentiable layer with theorems claiming universal approximation for generalized convex functions and their gradients, which could simplify some bilevel problems, but the density of the parameterization is the part that needs the most checking.

read the letter

The main takeaway is that the authors have built a differentiable layer with a convex parameter space and proved in Theorems 5.1 and 5.2 that both the layer and its gradient can approximate any generalized convex function to arbitrary accuracy. If those theorems are solid, this gives a concrete tool for turning bilevel problems in optimal transport and auction design into single-level problems that first-order methods can handle. What is actually new here is the specific form of the layer and the fact that it comes with universal approximation guarantees for the gradients too, not just the functions. The applications to learning OT maps with general costs and multi-good auctions are straightforward and illustrate the point without overclaiming. The work does a good job of connecting the math to practical uses in optimization and ML. The motivation is clear and the setup avoids unnecessary complexity. The soft spots are in the universal approximation part. The stress-test note is on point: the parameterization needs to have a dense image in the space of GCFs, and it's not obvious that a convex parameter space with whatever features they use will recover arbitrary supporting hyperplanes or handle the gradient control properly. Without the proofs in front of me it's hard to see exactly how they close that gap, and the abstract gives no hint of numerical validation either. That makes the soundness harder to judge right now. This is for people in mathematical optimization who work on structured problems like these. Someone looking for new ways to embed convexity into neural layers would get something out of it. I think it deserves a serious referee. The claims are precise and the applications are relevant enough to warrant a look. Recommendation: Send it for peer review, but flag the need for explicit proof details and at least basic experiments to back up the approximation claims.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces a differentiable layer whose parameters lie in a convex set and claims, via Theorems 5.1 and 5.2, that both the layer and its gradient are universal approximators for generalized convex functions (GCFs) and their gradients. The layer is then used to convert bilevel or min-max formulations arising in optimal transport with general costs and in multi-good optimal auctions into single-level problems that can be solved by first-order methods.

Significance. If the density of the parameterized family in the space of GCFs (and their gradients) is rigorously established, the work would supply a concrete, differentiable tool for exploiting generalized convexity in optimization, thereby simplifying certain nested problems. The explicit convex parameterization and the two applications constitute the main strengths; no machine-checked proofs or reproducible code are provided.

major comments (1)

[Theorems 5.1 and 5.2] Theorems 5.1 and 5.2: the central claim requires showing that the map from the convex parameter space to the space of GCFs is dense (in the topology of uniform convergence on compacts for both the function and its gradient). The construction must therefore recover, in the limit, arbitrary supporting hyperplanes and subgradients; if the layer is built from a fixed finite-dimensional feature map or otherwise restricts admissible directions/curvatures, density fails for GCFs whose epigraphs demand more flexible supporting structures. The gradient-universality statement adds the further requirement that derivative approximation be controlled, which does not follow automatically from function-value density.

minor comments (1)

[Abstract] The abstract and introduction should clarify the precise topology in which universality is claimed and state any compactness or boundedness assumptions on the domain.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the key requirements for establishing universal approximation. We address the concerns on Theorems 5.1 and 5.2 below, clarifying how the convex parameterization recovers arbitrary supporting structures and why gradient approximation follows.

read point-by-point responses

Referee: Theorems 5.1 and 5.2: the central claim requires showing that the map from the convex parameter space to the space of GCFs is dense (in the topology of uniform convergence on compacts for both the function and its gradient). The construction must therefore recover, in the limit, arbitrary supporting hyperplanes and subgradients; if the layer is built from a fixed finite-dimensional feature map or otherwise restricts admissible directions/curvatures, density fails for GCFs whose epigraphs demand more flexible supporting structures. The gradient-universality statement adds the further requirement that derivative approximation be controlled, which does not follow automatically from function-value density.

Authors: We agree that density in the uniform-on-compacts topology for both the function and its gradient is the correct requirement. In the proofs of Theorems 5.1 and 5.2, the convex parameter space is constructed precisely so that convex combinations of elementary GCFs (each defined by a supporting hyperplane) can approximate any target GCF. Because the admissible directions and curvatures are not restricted to a fixed finite set—the parameters can vary continuously over an unbounded convex set—the construction recovers arbitrary supporting hyperplanes and subgradients in the limit. For the gradient claim, uniform approximation of the function on compacts, combined with the fact that GCFs are locally Lipschitz and the layer is differentiable, implies uniform convergence of the gradients on compacts by standard arguments on the stability of subgradients under uniform limits. We will add a short clarifying paragraph after each theorem statement to make these steps explicit. revision: partial

Circularity Check

0 steps flagged

No circularity: universal approximation theorems are independent of inputs

full rationale

The paper claims Theorems 5.1 and 5.2 prove that a new differentiable layer with convex parameter space and its gradient are universal approximators for generalized convex functions. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain. The density of the parameterization in the space of GCFs is asserted as a theorem result rather than smuggled in via prior work or renaming. The derivation chain for the layer construction and approximation property is presented as self-contained against external benchmarks, with no evidence that any prediction equals its input by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full parameterization details, any free parameters inside the layer, and background assumptions are not visible. The layer construction itself appears to rest on a domain assumption about representability of GCFs.

axioms (1)

domain assumption Generalized convex functions admit a differentiable parameterization whose parameters lie in a convex set and that is dense in the space of all such functions.
This assumption underpins the claim that the layer is a universal approximator.

pith-pipeline@v0.9.0 · 5656 in / 1254 out tokens · 41751 ms · 2026-05-18T19:30:07.576161+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorems 5.1 and 5.2 establish that the proposed differentiable layer and its gradient are universal approximators for generalized convex functions and their gradients.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

finitely Y-convex functions are dense in the space of Y-convex functions: F CY (X) = CY (X)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

[1]

Abstract convex analysis,

I. Singer, “Abstract convex analysis,” (No Title), 1997

work page 1997
[2]

Multilayer feedforward networks are universal approximators,

K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neu- ral networks, vol. 2, no. 5, pp. 359–366, 1989

work page 1989
[3]

Approximation theory of the mlp model in neural networks,

A. Pinkus, “Approximation theory of the mlp model in neural networks,” Acta numerica , vol. 8, pp. 143–195, 1999

work page 1999
[4]

Why Deep Neural Networks for Function Approximation?

S. Liang and R. Srikant, “Why deep neural net- works for function approximation?” arXiv preprint arXiv:1610.04161, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Deep network approximation for smooth functions,

J. Lu, Z. Shen, H. Yang, and S. Zhang, “Deep network approximation for smooth functions,” SIAM Journal on Mathematical Analysis , vol. 53, no. 5, pp. 5465–5506, 2021

work page 2021
[6]

Backpropaga- tion applied to handwritten zip code recognition,

Y . LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropaga- tion applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989

work page 1989
[7]

Universal approximations of invariant maps by neural networks,

D. Yarotsky, “Universal approximations of invariant maps by neural networks,” Constructive Approximation, vol. 55, no. 1, pp. 407–474, 2022

work page 2022
[8]

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

M. M. Bronstein, J. Bruna, T. Cohen, and P. Veli ˇckovi´c, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv preprint arXiv:2104.13478, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Near-optimal max-affine estimators for convex regression,

G. Bal ´azs, A. Gy ¨orgy, and C. Szepesv ´ari, “Near-optimal max-affine estimators for convex regression,” in Artificial Intelligence and Statistics . PMLR, 2015, pp. 56–64

work page 2015
[10]

Log-sum- exp neural networks and posynomial models for convex and log-log-convex data,

G. C. Calafiore, S. Gaubert, and C. Possieri, “Log-sum- exp neural networks and posynomial models for convex and log-log-convex data,” IEEE transactions on neural networks and learning systems , vol. 31, no. 3, pp. 827– 838, 2019

work page 2019
[11]

Parameterized convex univer- sal approximators for decision-making problems,

J. Kim and Y . Kim, “Parameterized convex univer- sal approximators for decision-making problems,” IEEE Transactions on Neural Networks and Learning Systems , vol. 35, no. 2, pp. 2448–2459, 2022

work page 2022
[12]

The groupmax neural network approximation of convex functions,

X. Warin, “The groupmax neural network approximation of convex functions,” IEEE Transactions on Neural Net- works and Learning Systems , 2023

work page 2023
[13]

Input convex neural networks,

B. Amos, L. Xu, and J. Z. Kolter, “Input convex neural networks,” in International conference on machine learn- ing. PMLR, 2017, pp. 146–155

work page 2017
[14]

Convex piecewise-linear fitting,

A. Magnani and S. P. Boyd, “Convex piecewise-linear fitting,” Optimization and Engineering , vol. 10, pp. 1–17, 2009

work page 2009
[15]

On approximating ∇f with neural networks,

S. Saremi, “On approximating ∇f with neural networks,” arXiv preprint arXiv:1910.12744 , 2019

work page arXiv 1910
[16]

Gradient networks,

S. Chaudhari, S. Pranav, and J. M. Moura, “Gradient networks,”IEEE Transactions on Signal Processing, 2024

work page 2024
[17]

Input con- vex gradient networks,

J. Richter-Powell, J. Lorraine, and B. Amos, “Input con- vex gradient networks,” arXiv preprint arXiv:2111.12187, 2021

work page arXiv 2021
[18]

Jacnet: Learning functions with structured jacobians,

J. Lorraine and S. Hossain, “Jacnet: Learning functions with structured jacobians,” arXiv preprint arXiv:2408.13237, 2024

work page arXiv 2024
[19]

Neural ordinary differential equations,

R. T. Chen, Y . Rubanova, J. Bettencourt, and D. K. Duve- naud, “Neural ordinary differential equations,” Advances in neural information processing systems , vol. 31, 2018

work page 2018
[20]

Optimal Control Via Neural Networks: A Convex Approach

Y . Chen, Y . Shi, and B. Zhang, “Optimal control via neural networks: A convex approach,” arXiv preprint arXiv:1805.11835, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

Potential flows: Universal probability distributions with optimal transport and convex optimization,

C.-W. Huang, R. T. Chen, C. Tsirigotis, and A. Courville, “Potential flows: Universal probability distributions with optimal transport and convex optimization,” arXiv preprint arXiv:2012.05942, 2020

work page arXiv 2012
[22]

Optimal transport mapping via input convex neural networks,

A. Makkuva, A. Taghvaei, S. Oh, and J. Lee, “Optimal transport mapping via input convex neural networks,” in International Conference on Machine Learning . PMLR, 2020, pp. 6672–6681

work page 2020
[23]

Optimizing functionals on the space of probabilities with input convex neural networks,

D. Alvarez-Melis, Y . Schiff, and Y . Mroueh, “Optimizing functionals on the space of probabilities with input convex neural networks,” arXiv preprint arXiv:2106.00774, 2021

work page arXiv 2021
[24]

M. L. van De Vel, Theory of convex structures. Elsevier, 1993, vol. 50

work page 1993
[25]

D. E. Pallaschke and S. Rolewicz, Foundations of math- ematical optimization: convex analysis without linearity . Springer Science & Business Media, 2013, vol. 388

work page 2013
[26]

A. M. Rubinov, Abstract convexity and global optimiza- tion. Springer Science & Business Media, 2013, vol. 44

work page 2013
[27]

Reinforcement mechanism design: With applications to dynamic pricing in sponsored search auctions,

W. Shen, B. Peng, H. Liu, M. Zhang, R. Qian, Y . Hong, Z. Guo, Z. Ding, P. Lu, and P. Tang, “Reinforcement mechanism design: With applications to dynamic pricing in sponsored search auctions,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 02, 2020, pp. 2236–2243

work page 2020
[28]

Robust pricing in dynamic mechanism design,

Y . Deng, S. Lahaie, and V . Mirrokni, “Robust pricing in dynamic mechanism design,” in Proceedings of the 37th International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13– 18 Jul 2020, pp. 2494–2503. [Online]. Available: https://proceedings.mlr.press/v119/deng20d.html

work page 2020
[29]

Reducing mechanism design to algorithm design via ma- chine learning,

M.-F. Balcan, A. Blum, J. D. Hartline, and Y . Mansour, “Reducing mechanism design to algorithm design via ma- chine learning,”Journal of Computer and System Sciences, vol. 74, no. 8, pp. 1245–1270, 2008

work page 2008
[30]

An exploration in the theory of optimum income taxation,

J. A. Mirrlees, “An exploration in the theory of optimum income taxation,” The review of economic studies, vol. 38, no. 2, pp. 175–208, 1971

work page 1971
[31]

Job market signaling,

M. Spence, “Job market signaling,” in Uncertainty in economics. Elsevier, 1978, pp. 281–306

work page 1978
[32]

Polar factorization and monotone rearrange- ment of vector-valued functions,

Y . Brenier, “Polar factorization and monotone rearrange- ment of vector-valued functions,”Communications on pure and applied mathematics , vol. 44, no. 4, pp. 375–417, 1991

work page 1991
[33]

Notes on optimal transportation,

I. Ekeland, “Notes on optimal transportation,” Economic Theory, pp. 437–459, 2010

work page 2010
[34]

Cannarsa and C

P. Cannarsa and C. Sinestrari, Semiconcave Func- tions, Hamilton—Jacobi Equations, and Optimal Control . Springer, 2004. 13

work page 2004
[35]

Multiproduct nonlinear pricing,

M. Armstrong, “Multiproduct nonlinear pricing,” Econo- metrica: Journal of the Econometric Society , pp. 51–75, 1996

work page 1996
[36]

Ironing, sweeping, and multi- dimensional screening,

J.-C. Rochet and P. Chon ´e, “Ironing, sweeping, and multi- dimensional screening,”Econometrica, pp. 783–826, 1998

work page 1998
[37]

Bundling as an opti- mal selling mechanism for a multiple-good monopolist,

A. M. Manelli and D. R. Vincent, “Bundling as an opti- mal selling mechanism for a multiple-good monopolist,” Journal of Economic Theory , vol. 127, no. 1, pp. 1–35, 2006

work page 2006
[38]

Multidimensional mechanism design: Revenue maximization and the multiple-good monopoly,

——, “Multidimensional mechanism design: Revenue maximization and the multiple-good monopoly,” Journal of Economic theory , vol. 137, no. 1, pp. 153–185, 2007

work page 2007
[39]

Duality and optimality of auctions for uniform distributions,

Y . Giannakopoulos and E. Koutsoupias, “Duality and optimality of auctions for uniform distributions,” in Pro- ceedings of the fifteenth ACM conference on Economics and computation, 2014, pp. 259–276

work page 2014
[40]

Generalized permu- tahedra and optimal auctions,

M. Joswig, M. Klimm, and S. Spitz, “Generalized permu- tahedra and optimal auctions,” SIAM Journal on Applied Algebra and Geometry , vol. 6, no. 4, pp. 711–739, 2022. 14

work page 2022