Nesterov. Lectures on Convex Optimization, volume 137 of Springer Optimization and Its Applications
15 Pith papers cite this work, alongside 713 external citations. Polarity classification is still indexing.
citing papers explorer
- A First-Order Mean Field Control Analysis of Transformer Layers under Cross-Entropy Training
  Transformer residual layers are approximated as an explicit Euler scheme for a controlled hidden-state flow whose mean-field limit is a first-order transport control problem with Pontryagin terminal condition given by the softmax residual.
- An optimal first-order method for smooth and strongly convex composite optimization and its stationary limit
  Prox-ITEM achieves the minimax-optimal distance-to-solution rate among span-based first-order methods for smooth strongly convex composite problems, with Prox-TMM as its stationary limit matching TMM rates.
- Finite-size general security for relativistic phase shift keying via variable-length quantum key distribution
  The paper proves finite-size general security for relativistic phase shift keying (RPSK) achieving secret key rates beyond 12 dB with 10^5 signals via entropy accumulation, Rényi leftover hashing, and conic optimization.
- In-Context Positive-Unlabeled Learning
  PUICL is a transformer pretrained on synthetic PU data from structural causal models that solves positive-unlabeled classification via in-context learning without gradient updates or fitting.
- Intrinsic effective sample size for manifold-valued Markov chain Monte Carlo via kernel discrepancy
  An intrinsic effective sample size for manifold MCMC is defined via kernel discrepancy as the number of independent draws yielding equivalent expected squared discrepancy to the target.
- Profile Likelihood Inference for Anisotropic Hyperbolic Wrapped Normal Models on Hyperbolic Space
  The profile maximum likelihood estimator for the location in anisotropic hyperbolic wrapped normal models is strongly consistent, asymptotically normal, and attains the Hájek-Le Cam minimax lower bound under squared geodesic loss.
- A posteriori error bounds for finite element approximations of time-dependent mean field games
  Derives reliable and efficient a posteriori error estimators for a general class of stabilized finite element methods applied to time-dependent mean field games, with an improved version for specific mass-lumping and affine-preserving stabilizations.
- A General Recipe for Parameter-Free Nonconvex Optimization via Higher-Order Regularization
  A general framework for parameter-free smooth nonconvex optimization via higher-order regularization yields algorithms with optimal complexity bounds without prior parameter knowledge.
- Nyström Approximation on Manifolds
  Introduces Riemannian Nyström approximation via subspace projections and Haar-Grassmann sketching for tangent operators, plus a randomized Newton method, tested on SPD and Grassmann manifolds.
- Implications of weak convergence rates of Markov transition kernels
  Weak convergence rates of Markov transition kernels imply variance convergence bounds for Lipschitz functions and chi-squared divergence bounds under reversibility with Lipschitz initial densities.
- Inexact subgradient algorithm with a non-asymptotic convergence guarantee for copositive programming problems
  An inexact subgradient algorithm achieves O(ε^{-2}) iteration complexity for ε-accurate solutions to copositive programs while allowing inexact solves of NP-hard quadratic subproblems and providing a sufficient condition for non-complete positivity.
- Binno: A 1st-order method for Bi-level Nonconvex Nonsmooth Optimization for Matrix Factorizations
  Binno is a proximal-gradient first-order algorithm for nonconvex nonsmooth bi-level optimization, shown on sparse low-rank matrix factorization and regularized market-clearing problems with reported gains over baselines.
- Regularized Model Predictive Control via Contractivity and Implicit Lur'e Analysis
  Develops multiplier-based contraction framework and LMI conditions for stability of regularized MPC interpreted as implicit Lur'e systems across three classes of regularizers.
- A survey of trust-region radius update mechanisms. Part I: First-order analysis
  This paper isolates admissibility conditions for trust-region radius updates that guarantee first-order stationarity and O(ε^{-2}) complexity, verifies them across five mechanism classes, and extends prior frameworks with new convergence results under linear Hessian growth.
- Convergence and non-asymptotic error analysis for kinetic Langevin samplers using the exact harmonic Langevin integrator
  Novel splitting scheme for kinetic Langevin sampling with exact harmonic integrator yields L2-Wasserstein convergence rates matching continuous dynamics and non-asymptotic error bounds for strongly log-concave targets.