Adam: A Method for Stochastic Optimization
Recognition: 4 theorem links
Pith reviewed 2026-05-09 01:36 UTC · model claude-opus-4-7
The pith
Adam sets per-parameter step sizes from bias-corrected running averages of the gradient and its square, giving a robust default optimizer for noisy, high-dimensional problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes Adam, a first-order stochastic optimizer that maintains two exponential moving averages per parameter — one of the gradient (first moment) and one of the squared gradient (second raw moment) — and uses their ratio, with an explicit bias correction for the zero-initialization of those averages, to set a per-parameter step size. The authors argue this combines the sparse-gradient handling of AdaGrad with the non-stationarity handling of RMSProp, while the effective per-step move in parameter space stays approximately bounded by the user-chosen stepsize α, giving the method a built-in trust-region feel. They claim a single set of defaults (α=0.001, β₁=0.9, β₂=0.999, ε=1e-8) works well across the tested problems with little or no tuning.
What carries the argument
The bias-corrected ratio m̂_t / √v̂_t, where m_t and v_t are exponential moving averages of g_t and g_t², and the corrections m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ) undo the zero-initialization bias. This ratio is gradient-scale invariant, behaves like a per-coordinate signal-to-noise ratio that automatically anneals near optima, and bounds the per-step parameter move by roughly α — turning the stepsize hyperparameter into something close to a trust-region radius.
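A minimal sketch of the update this describes, in NumPy; the function and variable names are ours, not the paper's, and ε is added outside the square root as in Algorithm 1.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter vector theta with gradient grad.

    m and v are the exponential moving averages of the gradient and the
    squared gradient; t is the 1-based step count. Illustrative sketch only.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2      # second raw-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # undo zero-initialization bias
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```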
If this is right
- A practitioner can train a wide range of deep models with the same optimizer and the same defaults, removing learning-rate tuning as a first-order concern.
Where Pith is reading between the lines
- Editorial: the regret proof's telescoping step requires √v̂_t/α_t to be non-decreasing along each coordinate, which is not generally true; later work has constructed simple convex counterexamples on which Adam diverges, so the O(√T) bound as stated should be read as suggestive rather than airtight, even though the empirical recipe survives unchanged.
Load-bearing premise
The regret proof leans on the assumption that a certain quantity (√v̂_t/α_t) grows monotonically along every coordinate as training proceeds; the paper uses this without justification, and the bound only holds where that monotonicity actually holds.
What would settle it
Run Adam with the recommended defaults against well-tuned SGD-with-momentum, AdaGrad, and RMSProp on the same suite of problems (MNIST logistic regression and MLP, IMDB bag-of-words logistic regression, CIFAR-10 convnet, and a variational autoencoder). If Adam fails to match or beat them on training loss within the same wall-clock budget, or if removing the bias-correction terms does not visibly destabilize training when β₂ is close to 1, the central practical claim fails. For the regret claim, a convex online sequence on which Adam's iterates do not satisfy R(T)=O(√T) would falsify the theorem as stated.
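A hedged sketch of the bias-correction ablation named above: the same update with the correction terms optionally removed, so the β₂-near-1 destabilization claim can be checked. The names and the toggle are ours; this is not the paper's experimental code.

```python
import numpy as np

def adam_like_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, bias_correct=True):
    """Adam update with the bias-correction terms optionally removed.

    With bias_correct=False the zero-initialized averages are not rescaled;
    since (1 - beta1**t) / sqrt(1 - beta2**t) > 1 for small t when beta2 is
    near 1, the early effective steps are inflated, which is the instability
    the paper attributes to uncorrected RMSProp-style updates.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    if bias_correct:
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```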
Original abstract
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Adam, a first-order stochastic optimizer that maintains exponential moving averages of the gradient (m_t) and the squared gradient (v_t), applies bias-correction for the zero-initialization, and updates parameters by θ_t ← θ_{t-1} − α · m̂_t / (√v̂_t + ε). The authors motivate the update via a signal-to-noise interpretation, derive the bias-correction from the EMA recurrence, prove an O(√T) regret bound in the online convex setting (Theorem 4.1), present an L_∞-norm variant (AdaMax), and report experiments on logistic regression (MNIST, IMDB-BoW), MLPs (MNIST, with and without dropout), CNNs (CIFAR-10), and a VAE. Default hyperparameters (α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸) are recommended and shown to be competitive with or better than SGD+Nesterov, AdaGrad, RMSProp, AdaDelta, and SFO.
Significance. If the algorithmic and empirical claims hold, Adam offers a practically important contribution: a simple, memory-light, scale-invariant adaptive optimizer with intuitive hyperparameters that performs robustly across convex and non-convex deep learning workloads. The bias-correction derivation in §3 is clean and useful in its own right (it cleanly explains an effect that earlier RMSProp-with-momentum variants get wrong for β₂ near 1), and the SNR/effective-step discussion in §2.1 gives a usable mental model for setting α. The AdaMax derivation (§7.1) is elegant and yields a particularly simple update with a tighter step bound |Δ_t| ≤ α. The empirical comparisons span enough model classes (logistic regression, fully-connected nets with/without dropout, CNNs, VAE) to support the robustness claim, and the bias-correction ablation in §6.4 is a genuinely informative experiment. The theoretical contribution (Theorem 4.1) is partial — see major comments — but the algorithmic and empirical case is strong.
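For the AdaMax variant discussed above, a minimal sketch of the update under our naming; u replaces √v̂_t with an exponentially weighted infinity norm and needs no bias correction (the paper suggests a somewhat larger default stepsize for this variant; the value here is illustrative, as is the tiny floor on u).

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999):
    """One AdaMax update: the infinity-norm variant of Adam.

    u tracks an exponentially weighted maximum of |g| and stands in for
    sqrt(v_hat); only the first moment needs bias correction. Sketch only.
    """
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))     # no bias correction required
    # the 1e-12 floor is ours, to guard the all-zero-gradient start; the paper has no eps here
    theta = theta - (alpha / (1 - beta1 ** t)) * m / np.maximum(u, 1e-12)
    return theta, m, u
```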
major comments (4)
- [§4 / §10.1, Theorem 4.1 / 10.5] The regret proof contains a load-bearing step that is not justified. In the displayed bound at the top of p. 14, the sum ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) is replaced by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This telescoping is valid only if √v̂_{t,i}/α_t is non-decreasing in t for every coordinate i. With α_t = α/√t, the quantity is √(t·v̂_{t,i})/α, and since v̂_t is a bias-corrected EMA of g_t² it can strictly decrease whenever a coordinate sees a small gradient following a large one. The authors should either (i) state and justify a monotonicity assumption on v̂_t, (ii) carry through the proof with the absolute value of the increment (which changes the bound), or (iii) restrict the theorem to a class of sequences for which the monotonicity holds. As written, the bound is not established for general bounded convex sequences, and a one-dimensional counterexample with a single large gradient followed by small ones already exhibits the failure (a numerical sketch follows this list of comments).
- [§4, Theorem 4.1 statement] The hypothesis β₁²/√β₂ < 1 is stated but its role should be made explicit in the main text — it is used in Lemma 10.4 to bound an arithmetic-geometric series. With the recommended defaults β₁=0.9, β₂=0.999 one has β₁²/√β₂ ≈ 0.811, so the assumption is satisfied at defaults; however readers tuning β₁ upward (a common practice with momentum) can violate it. Please flag this in §4 alongside the theorem so that the regime of validity is clear.
- [§6.3, Figure 3] The CNN experiment reports that v̂_t 'vanishes to zeros after a few epochs and is dominated by the ε in algorithm 1', and that consequently 'Adagrad converges much slower than others' while Adam shows only 'marginal improvement over SGD with momentum'. This is an interesting and honest observation, but it slightly undercuts the central claim that adaptive second-moment scaling is the source of Adam's advantage. It would strengthen the paper to (a) report what fraction of coordinates have √v̂_t < ε at the cited epochs, and (b) show an ablation in which ε is varied, so readers can tell whether Adam in this regime is effectively SGD-with-momentum + a small constant preconditioner or whether the second moment still contributes.
- [§5, Related work / RMSProp comparison] The claim that lack of bias-correction in RMSProp 'leads to very large stepsizes and often divergence' for β₂ near 1 is supported by the VAE experiment in §6.4, but the comparison fixes architecture and dataset. Since this is one of the paper's main differentiators from RMSProp, a second setting (e.g., the MLP+dropout or CNN tasks already in the paper) showing the same effect would make the case substantially more robust.
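The numerical sketch promised in the first major comment: our illustration, not the paper's, showing that √(t·v̂_t) can strictly decrease when a single large gradient is followed by small ones, so the quantity the telescoping step needs to be non-decreasing is not non-decreasing in general.

```python
import numpy as np

# Our illustration: a large gradient followed by small ones makes
# sqrt(t * v_hat_t) -- the quantity whose monotonicity the telescoping
# step relies on (with alpha_t = alpha / sqrt(t)) -- strictly decrease.
beta2 = 0.999
grads = [1.0, 0.01, 0.01, 0.01]
v = 0.0
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** t)
    print(t, f"{np.sqrt(t * v_hat):.6f}")   # 1.000000, 0.999800, 0.999600, ...
```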
minor comments (8)
- [Algorithm 1] The placement of ε outside the square root (√v̂_t + ε) versus inside (√(v̂_t + ε)) matters in practice and differs across implementations. Please state explicitly which convention is used and whether the analysis is affected (a short numerical contrast follows this list of comments).
- [§2.1] The two cases for the step bound, |Δ_t| ≤ α·(1−β₁)/√(1−β₂) versus |Δ_t| ≤ α, would be clearer with a one-line derivation rather than asserted. Currently the reader has to reconstruct the algebra.
- [§3, Eq. (4)] The term ζ is introduced and immediately argued to be small for stationary or slowly-varying gradients, but is not formally bounded. A short remark giving an explicit bound in terms of the variation of E[g_t²] would tighten the derivation.
- [§4] The decay schedule β_{1,t} = β₁·λ^{t−1} with λ very close to 1 is required for the proof but is not used in any of the experiments (which appear to use constant β₁=0.9). Please clarify whether the empirical performance corresponds to a regime covered by the theorem.
- [§7.1, Eq. (12)] It would be helpful to note that u_t = max(β₂·u_{t−1}, |g_t|) corresponds to a max over an exponentially-weighted history and therefore does not require bias correction, as briefly stated; an explicit derivation showing E[u_t] in the stationary case would parallel §3.
- [Lemma 10.3 proof] The inductive step uses the inequality √(a − b) ≤ √a − b/(2√a) which requires a ≥ b ≥ 0; this is fine but worth stating, since a = ∥g_{1:T,i}∥² and b = g_{T,i}² satisfy it by construction.
- [§6] The phrase 'searched over a dense grid' for hyperparameters of the baselines is not specific. Listing the grids (at minimum for α and momentum) in an appendix would improve reproducibility.
- [Typos] Several minor typos: 'theoratical' (§6.1, twice), 'BoW feature Logistic Regression' axis label, 'Initalization' (§7.2). 'β₁' appears where 'β₂' is meant in the sentence following Eq. (4) ('the exponential decay rate β₁ can be chosen…' — context is the second moment).
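The numerical contrast referenced in the first minor comment: our illustration of how far the two ε conventions can diverge once a coordinate's second-moment estimate has decayed far below ε², as in the §6.3 CNN discussion (the values are hypothetical).

```python
import numpy as np

alpha, eps = 1e-3, 1e-8
m_hat, v_hat = 1e-6, 1e-16      # hypothetical late-training values for one coordinate

step_outside = alpha * m_hat / (np.sqrt(v_hat) + eps)   # Algorithm 1 convention: sqrt(v_hat) + eps
step_inside = alpha * m_hat / np.sqrt(v_hat + eps)       # alternative convention: sqrt(v_hat + eps)
print(step_outside, step_inside)   # ~5e-2 versus ~1e-5: a factor of about 5000
```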
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report. The most substantive point — the unstated monotonicity assumption underlying the telescoping step in the regret proof — is correct, and we will revise Theorem 4.1 and its proof to state the assumption explicitly rather than leaving it implicit. We also agree to flag the β₁²/√β₂ < 1 hypothesis prominently in §4, to add quantitative support to the §6.3 discussion of v̂_t vanishing on CNNs (including an ε ablation), and to broaden the bias-correction comparison in §6.4 beyond the VAE setting. None of these revisions affect the algorithm itself, the bias-correction derivation in §3, the SNR discussion in §2.1, the AdaMax derivation in §7.1, or the empirical conclusions; they sharpen the theoretical statement and strengthen the empirical case. A point-by-point response follows.
Point-by-point responses
-
Referee: The telescoping step in the regret proof (top of p. 14) replaces ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This is only valid if √v̂_{t,i}/α_t is non-decreasing in t per coordinate, which need not hold for a bias-corrected EMA of g_t² when a small gradient follows a large one.
Authors: We agree that this step requires an additional assumption that we did not state explicitly. The telescoping is valid when √(t·v̂_{t,i}) is non-decreasing in t for each coordinate, which is not guaranteed by a bias-corrected EMA of g_t² in general. We will revise §4 and §10.1 in two ways: (i) we will explicitly add the assumption that √(t·v̂_{t,i})/α is non-decreasing in t for all i (equivalently, that t·v̂_{t,i} is non-decreasing), and flag that this is what makes the telescoping well-defined; and (ii) we will note the alternative route in which the increment is replaced by its absolute value, which yields a weaker but unconditional bound. We thank the referee for catching this — the assumption is implicit in our derivation but should be made part of the theorem statement, and we will add a sentence describing the regime in which it is reasonable (sufficiently slowly-varying second-moment estimates) and acknowledging that pathological sequences can violate it. We do not claim a fix for the general non-monotone case in this revision. revision: yes
-
Referee: The hypothesis β₁²/√β₂ < 1 should be flagged in the main text alongside Theorem 4.1, since users tuning β₁ upward can violate it (defaults satisfy it: 0.9²/√0.999 ≈ 0.811).
Authors: We agree. We will add a short remark in §4 immediately after the theorem statement noting (i) the role of this assumption — it is used in Lemma 10.4 to bound an arithmetic-geometric series via γ = β₁²/√β₂ < 1 — (ii) that the recommended defaults β₁=0.9, β₂=0.999 give γ ≈ 0.811 and so satisfy it comfortably, and (iii) that practitioners increasing β₁ (e.g. β₁ ≥ 0.95 with default β₂) should check the inequality. We will also include a one-line worked example so the regime of validity is unambiguous. revision: yes
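The one-line worked example promised above might read as follows (our arithmetic; the second pair is an illustrative violation, not a recommended setting):

```latex
\gamma = \frac{\beta_1^2}{\sqrt{\beta_2}} = \frac{0.9^2}{\sqrt{0.999}}
       \approx \frac{0.81}{0.9995} \approx 0.81 < 1,
\qquad\text{whereas}\qquad
\frac{0.95^2}{\sqrt{0.8}} \approx \frac{0.9025}{0.894} \approx 1.009 > 1 .
```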
-
Referee: In the CNN experiment (§6.3), the authors note v̂_t vanishes to near-zero so the update is dominated by ε, which somewhat undercuts the claim that adaptive second-moment scaling drives Adam's advantage. Please report what fraction of coordinates have √v̂_t < ε at the cited epochs, and add an ablation varying ε.
Authors: This is a fair point and we agree the §6.3 discussion would benefit from quantitative support. In the revision we will add (a) a measurement, taken from the same CIFAR-10 run, of the fraction of coordinates with √v̂_t below ε (and below 10ε, 100ε) as a function of epoch, and (b) an ablation varying ε ∈ {10⁻⁴,10⁻⁶,10⁻⁸,10⁻¹⁰} to expose how much of Adam's behavior in this regime is attributable to the second-moment term versus an effectively constant preconditioner combined with the first-moment term. We will not retract the broader claim — on the logistic, MLP, and VAE experiments the second moment plainly contributes — but we will explicitly state that on CNNs of this size much of Adam's benefit over plain SGD with momentum comes from per-layer scale adaptation early in training and from the first-moment term, and that the improvement margin over well-tuned SGD+momentum is correspondingly modest. This nuance is consistent with what is already written in §6.3 but will be made quantitative. revision: yes
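A sketch of the measurement promised in (a), under our naming: given the bias-corrected second-moment estimates gathered at some epoch, report the fraction of coordinates falling below ε and small multiples of it.

```python
import numpy as np

def fraction_below(v_hat, eps=1e-8, multipliers=(1, 10, 100)):
    """Fraction of coordinates with sqrt(v_hat) below eps, 10*eps, 100*eps.

    v_hat: flat array of bias-corrected second-moment estimates taken from a
    training checkpoint. Helper name and thresholds are ours, not the paper's.
    """
    root = np.sqrt(np.asarray(v_hat))
    return {m: float(np.mean(root < m * eps)) for m in multipliers}
```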
-
Referee: The claim that absent bias-correction RMSProp diverges for β₂ near 1 is supported only by the VAE experiment (§6.4); a second setting would substantially strengthen the differentiator from RMSProp.
Authors: We accept this. The bias-correction-vs-no-correction comparison is a central claim and one experiment is thinner than it should be. For the revision we will add a sweep over β₂ ∈ {0.99, 0.999, 0.9999} and α ∈ [10⁻⁵,10⁻¹], with and without the bias-correction terms, on the MLP+dropout MNIST setting from §6.2 (and, if space permits, on the CNN setting from §6.3). We expect — based on the analysis in §3, where the (1−β₂^t) factor is largest precisely when β₂ is near 1 — to reproduce the same instability pattern observed in §6.4. The resulting figure will be added as a panel to Figure 4 or as a new figure in §6.4. revision: yes
- We do not have a proof of the O(√T) regret bound that dispenses with the monotonicity assumption on √(t·v̂_{t,i})/α. The revised theorem will therefore be conditional on this assumption; an unconditional bound for general bounded convex sequences with bias-corrected v̂_t is left to future work.
Circularity Check
No meaningful circularity: Adam's algorithm, bias correction, and empirical claims stand on independent content; the proof gap flagged by the reader is a correctness/soundness issue, not a circular derivation.
full rationale
Walking the derivation chain:
- (1) §2 Algorithm: defines Adam by EMAs of g and g². No claim is being "derived" from itself — the update rule is a definition.
- (2) §3 Bias correction: derives E[v_t] = E[g_t²]·(1−β₂^t) + ζ from the EMA recursion (Eq. 1–4). The (1−β₂^t) divisor is then read off this expectation. This is a straightforward algebraic identity, not a circular fit; nothing is fitted to data and then re-predicted.
- (3) §2.1 SNR / effective stepsize bounds: |Δ_t| ≤ α-style bounds follow from the algebra of m̂_t/√v̂_t. Independent content.
- (4) §4 / §10 Convergence: Theorem 4.1 derives an O(√T) regret bound from stated assumptions (bounded gradients, bounded iterate distance, β₁²/√β₂ < 1). The reader's concern is that the telescoping step in the proof of Theorem 10.5 implicitly assumes √(t·v̂_{t,i})/α is monotone non-decreasing — a soundness gap later exploited by Reddi et al. (2018). That is a *correctness* problem, not a circularity problem: the bound is not "the input renamed as the output" — it is an attempted proof from external assumptions that turns out to have an unjustified inequality. No quantity is fitted to the regret and then claimed as a prediction of the regret; no self-citation is load-bearing (the proof cites Zinkevich 2003's framework, not the authors' own prior work).
- (5) §5 Related work / §6 Experiments: comparisons to AdaGrad/RMSProp/SGD use independently implemented baselines on standard datasets (MNIST, IMDB, CIFAR-10). No fitted-input-as-prediction pattern.
- (6) §7 AdaMax: derived as the p→∞ limit of an L_p generalization (Eq. 6–12). Algebraic limit, not circular.
There is essentially no self-citation load: the references are to Duchi, Tieleman & Hinton, Zeiler, Sohl-Dickstein, Zinkevich, etc. The Kingma & Welling (2013) self-cite is only used to specify the VAE architecture used as a *test problem* in §6.4 — it is not load-bearing for any theoretical claim. Conclusion: the paper's core claims are self-contained against external benchmarks and standard online-convex machinery. The Theorem 4.1 issue is a real bug in the proof (correctly diagnosed by the reader), but it is a missing-step / unjustified-monotonicity flaw, not a circular derivation. Score: 1 (one minor self-citation, not load-bearing).
Lean theorems connected to this paper
- Foundation/PhiForcing.lean, Foundation/DimensionForcing.lean · theorems phi_equation, dimension_forced · link: unclear · "Good default settings for the tested machine learning problems are α = 0.001, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸."
- Cost/FunctionalEquation.lean · theorem washburn_uniqueness_aczel · link: unclear · "The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t) where the hyper-parameters β₁, β₂ ∈ [0,1) control the exponential decay rates of these moving averages."
- Foundation/DAlembert/Inevitability.lean · theorem bilinear_family_forced · link: unclear · "R(T) ≤ D²/(2α(1−β₁)) Σ_i √(T·v̂_{T,i}) + α(1+β₁)G_∞/((1−β₁)√(1−β₂)(1−γ)²) Σ_i ‖g_{1:T,i}‖₂ + ..."
- Foundation/LawOfExistence.lean · theorem law_of_existence · link: unclear · "We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement."
Forward citations
Cited by 60 Pith papers
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models
In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to gener...
-
Convergent Stochastic Training of Attention and Understanding LoRA
Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
-
SLayerGen: a Crystal Generative Model for all Space and Layer Groups
SLayerGen generates crystals invariant to any space or layer group via autoregressive lattice and Wyckoff sampling plus equivariant diffusion, achieving gains over bulk models on diperiodic materials after correcting ...
-
Random test functions, $H^{-1}$ norm equivalence, and stochastic variational physics-informed neural networks
H^{-1} norm equivalence to expected squared evaluations on domain-dependent random test functions enables SV-PINNs that recover accurate solutions to challenging second-order elliptic PDEs faster than standard PINNs.
-
A Parameter-Free First-Order Algorithm for Non-Convex Optimization with $\tilde{\mkern1mu O}(\epsilon^{-5/3})$ Global Rate
PF-AGD is the first parameter-free deterministic accelerated first-order method with Õ(ε^{-5/3} log(1/ε)) complexity for smooth non-convex optimization.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...
-
Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions
Qvine uses vine copula-inspired quantum circuit structures to achieve linear or quadratic depth scaling for loading high-dimensional distributions with high approximation quality.
-
Neural Spectral Bias and Conformal Correlators I: Introduction and Applications
Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.
-
MMGait: Towards Multi-Modal Gait Recognition
MMGait provides a new multi-sensor gait dataset and OmniGait baseline to support single-modal, cross-modal, and unified multi-modal person identification from walking patterns.
-
Proton Structure from Neural Simulation-Based Inference at the LHC
Neural simulation-based inference on unbinned top-quark pair data at 13 TeV yields improved gluon PDF precision over traditional binned analyses while incorporating experimental and theoretical uncertainties.
-
Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate
Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.
-
CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification
The paper introduces the CMCC-ReID task, constructs the SYSU-CMCC benchmark dataset, and proposes the PIA network with disentangling and prototype modules that outperforms prior methods on combined modality and clothi...
-
Traces of Helium Detected in Type Ic Supernova 2014L
Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Offline Reinforcement Learning with Implicit Q-Learning
IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
-
Passage Re-ranking with BERT
Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
-
Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
-
Adaptive Computation Time for Recurrent Neural Networks
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
-
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
-
NICE: Non-linear Independent Components Estimation
NICE learns a composition of invertible neural-network layers that transform data into independent latent variables, enabling exact log-likelihood training and sampling for density estimation.
-
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
-
SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation
SEMIR replaces dense voxel computation with a learned topology-preserving graph minor that supports exact decoding and GNN-based inference for small-structure segmentation in large medical images.
-
Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale
Local neural operators on 3x3x3 patches, composed via Schwarz iteration, solve large-scale nonlinear elasticity on arbitrary geometries without domain-specific retraining.
-
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
-
Delightful Gradients Accelerate Corner Escape
Delightful Policy Gradient removes exponential corner trapping in softmax policy optimization for bandits and tabular MDPs, achieving logarithmic escape times and global O(1/t) convergence.
-
AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers
AccLock extracts user-specific features from in-ear ballistocardiogram signals via a disentanglement model and Siamese network to achieve average FAR of 3.13% and FRR of 2.99% in tests with 33 participants.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
Bin Latent Transformer (BiLT): A shift-invariant autoencoder for calibration-free spectral unmixing of turbid media
The BiLT autoencoder recovers absorption and scattering spectra from integrating sphere data with high accuracy while remaining robust to wavelength shifts up to 10 bands and generalizing to different instrument line ...
-
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
Asymmetric Langevin Unlearning uses public data to suppress unlearning noise costs by O(1/n_pub²), enabling practical mass unlearning with preserved utility under distribution mismatch.
-
Variational predictive resampling
Variational predictive resampling uses sequential imputation from variational predictives to generate samples whose distribution converges to the exact Bayesian posterior in Gaussian models and improves dependence cap...
-
Rank Is Not Capacity: Spectral Occupancy for Latent Graph Models
Spectra defines and controls effective capacity in graph embeddings via the Shannon effective rank of a trace-normalized kernel spectrum, making capacity a post-fit property rather than a pre-training hyperparameter.
-
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
-
Fixed-Point Neural Optimal Transport without Implicit Differentiation
A single-network fixed-point formulation for neural optimal transport eliminates adversarial min-max optimization and implicit differentiation while enforcing dual feasibility exactly.
-
LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapt...
-
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
-
Chebyshev Center-Based Direction Selection for Multi-Objective Optimization and Training PINNs
Update direction selection for PINN training is cast as a Chebyshev-center problem in the dual cone, yielding an efficient dual formulation with nonconvex convergence guarantees and automatic recovery of scale robustn...
-
End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor
An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.
-
Accelerating 3D Non-LTE Synthesis with Graph Neural Networks
Graph neural networks can approximate full 3D non-LTE Ca II populations in solar models with correlations above 0.99 and extreme computational efficiency.
-
Constitutive Priors for Inverse Design
A framework learns constitutive priors from noisy data to enable PDE-constrained inverse design of elastic networks using latent variables, homotopy continuation, Chamfer distance matching, and neural smoothness constraints.
-
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
-
Optimality of Sub-network Laplace Approximations: New Results and Methods
Sub-network Laplace approximations always underestimate full-model predictive variance, and two new gradient-based and greedy selection rules provide theoretically grounded improvements.
-
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
-
Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow
Spatio-Temporal MeanFlow adapts MeanFlow to PDEs by replacing the generative velocity field with the physical operator and extending the integral constraint to the spatio-temporal domain, yielding a unified solver for...
-
HairGPT: Strand-as-Language Autoregressive Modeling for Realistic 3D Hairstyle Synthesis
HairGPT reframes 3D hairstyle synthesis as dual-decoupled autoregressive strand sequence modeling with geometric tokenization for semantic control and rare style generation.
-
The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
-
NeuralBench: A Unifying Framework to Benchmark NeuroAI Models
NeuralBench is a new benchmarking framework for neuroAI models on EEG data that finds foundation models only marginally outperform task-specific ones while many tasks like cognitive decoding stay highly challenging.
-
Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
Retrieval from motion datasets combined with LLM task parsing and reward-guided noise initialization enables training-free diffusion optimization to satisfy severe spatiotemporal constraints in human motion generation.
-
Adaptive Domain Decomposition Physics-Informed Neural Networks for Traffic State Estimation with Sparse Sensor Data
ADD-PINN adaptively decomposes the spatial domain based on PINN residuals and a shock indicator to improve offline traffic state estimation under the LWR model, outperforming baselines in most sparse-sensor cases whil...
-
Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks
Self-supervised monocular depth estimation improves in low-texture regions by using distance transforms on jointly estimated pre-semantic contours to create more informative loss signals.
-
What Cohort INRs Encode and Where to Freeze Them
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes unifies gaze estimation for mixed reality by jointly predicting 3D eye parameters, segmentation, optical and visual axes, and depth maps from monocular or binocular inputs, supported by a new large-scale mult...
-
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
-
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
-
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
-
Path-Coupled Bellman Flows for Distributional Reinforcement Learning
Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
-
IntentGrasp: A Comprehensive Benchmark for Intent Understanding
IntentGrasp benchmark demonstrates that LLMs have low intent understanding capabilities, with most models underperforming random guessing on a challenging subset, but Intentional Fine-Tuning provides large improvements.
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
Reference graph
Works this paper leans on
-
[1]
Natural gradient works efficiently in learning
Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2): 251--276, 1998
1998
-
[2]
Recent advances in deep learning for speech research at microsoft
Deng, Li, Li, Jinyu, Huang, Jui-Ting, Yao, Kaisheng, Yu, Dong, Seide, Frank, Seltzer, Michael, Zweig, Geoff, He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at microsoft. ICASSP 2013, 2013
2013
-
[3]
Adaptive subgradient methods for online learning and stochastic optimization
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121--2159, 2011
2011
-
[4]
Generating sequences with recurrent neural networks
Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013
-
[5]
Speech recognition with deep recurrent neural networks
Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.\ 6645--6649. IEEE, 2013
2013
-
[6]
Reducing the dimensionality of data with neural networks
Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504--507, 2006
2006
-
[7]
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups
Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6): 82--97, 2012a
2012
-
[8]
Improving neural networks by preventing co-adaptation of feature detectors
Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012b
-
[9]
Auto-Encoding Variational Bayes
Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes . In The 2nd International Conference on Learning Representations (ICLR), 2013
2013
-
[10]
Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp.\ 1097--1105, 2012
2012
-
[11]
Learning word vectors for sentiment analysis
Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp.\ 142--150. Association for Computational Linguistics, 2011
2011
-
[12]
Non-asymptotic analysis of stochastic approximation algorithms for machine learning
Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp.\ 451--459, 2011
2011
-
[13]
Revisiting natural gradient for deep networks
Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013
-
[14]
Acceleration of stochastic approximation by averaging
Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4): 838--855, 1992
1992
-
[15]
A fast natural newton method
Roux, Nicolas L and Fitzgibbon, Andrew W. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp.\ 623--630, 2010
2010
-
[16]
Efficient estimations from a slowly convergent robbins-monro process
Ruppert, David. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988
1988
-
[17]
No more pesky learning rates
Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. arXiv preprint arXiv:1206.1106, 2012
-
[18]
Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods
Sohl-Dickstein, Jascha, Poole, Ben, and Ganguli, Surya. Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp.\ 604--612, 2014
2014
-
[19]
On the importance of initialization and momentum in deep learning
Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp.\ 1139--1147, 2013
2013
-
[20]
Lecture 6.5 - RMSProp: Neural Networks for Machine Learning
Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, 2012
2012
-
[21]
Fast dropout training
Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp.\ 118--126, 2013
2013
-
[22]
Adadelta: an adaptive learning rate method
Zeiler, Matthew D. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012
-
[23]
Online convex programming and generalized infinitesimal gradient ascent
Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. 2003
2003