Adam: A Method for Stochastic Optimization
Pith reviewed 2026-05-09 01:36 UTC · model claude-opus-4-7
The pith
Adam sets per-parameter step sizes from bias-corrected running averages of the gradient and its square, giving a robust default optimizer for noisy, high-dimensional problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes Adam, a first-order stochastic optimizer that maintains two exponential moving averages per parameter — one of the gradient (first moment) and one of the squared gradient (second raw moment) — and uses their ratio, with an explicit bias correction for the zero-initialization of those averages, to set a per-parameter step size. The authors argue this combines the sparse-gradient handling of AdaGrad with the non-stationarity handling of RMSProp, while the effective per-step move in parameter space stays approximately bounded by the user-chosen stepsize α, giving the method a built-in trust-region feel. They claim a single set of defaults (α=0.001, β₁=0.9, β₂=0.999, ε=1e-8) w
What carries the argument
The bias-corrected ratio m̂_t / √v̂_t, where m_t and v_t are exponential moving averages of g_t and g_t², and the corrections m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ) undo the zero-initialization bias. This ratio is gradient-scale invariant, behaves like a per-coordinate signal-to-noise ratio that automatically anneals near optima, and bounds the per-step parameter move by roughly α — turning the stepsize hyperparameter into something close to a trust-region radius.
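The update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' reference code: the function name `adam_step` and the toy objective 0.5·‖θ‖² are hypothetical, and ε is added outside the square root, following the update rule quoted in the referee summary.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (hypothetical helper); t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # EMA of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # EMA of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)              # undo the zero-initialization bias
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (an assumption for illustration): f(theta) = 0.5 * ||theta||^2,
# whose gradient is theta itself.
theta0 = np.array([1.0, -2.0, 3.0])
grad0 = theta0.copy()
zeros = np.zeros_like(theta0)

theta1, m1, v1 = adam_step(theta0, grad0, zeros, zeros, t=1)
first_move = np.abs(theta1 - theta0)

# Rescaling the gradient by 1000 leaves the first step essentially unchanged.
theta1_scaled, _, _ = adam_step(theta0, 1000.0 * grad0, zeros, zeros, t=1)
```

On the first step the corrected averages reduce to m̂ = g and v̂ = g², so every coordinate moves by almost exactly α whatever the gradient's magnitude. This is the scale invariance and the "roughly α" step size that the paragraph above refers to.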
If this is right
- A practitioner can train a wide range of deep models with the same optimizer and the same defaults, removing learning-rate tuning as a first-order concern.
Where Pith is reading between the lines
- Editorial: the regret proof's telescoping step requires √v̂_t/α_t to be non-decreasing along each coordinate, which is not generally true; later work has constructed simple convex counterexamples on which Adam diverges, so the O(√T) bound as stated should be read as suggestive rather than airtight, even though the empirical recipe survives unchanged.
Load-bearing premise
The regret proof leans on a quantity that grows monotonically along every coordinate as training proceeds; the paper asserts this without justification, and the bound only holds where that monotonicity actually holds.
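A small numerical check makes the monotonicity issue concrete. The sketch below tracks √(t·v̂_t), the quantity that results from √v̂_t/α_t with α_t = α/√t (up to the constant 1/α), on a hypothetical scalar gradient sequence in which one large gradient is followed by small ones. The helper name `sqrt_t_vhat` and the chosen sequences are assumptions for illustration only.

```python
import numpy as np

def sqrt_t_vhat(grads, beta2=0.999):
    """Return sqrt(t * v_hat_t) for t = 1..T on a scalar gradient sequence."""
    v = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g   # EMA of the squared gradient
        v_hat = v / (1.0 - beta2 ** t)          # bias-corrected estimate
        out.append(np.sqrt(t * v_hat))
    return np.array(out)

# Hypothetical sequence: one large gradient, then small ones.
spiky = sqrt_t_vhat([10.0, 0.01, 0.01, 0.01])
spiky_increments = np.diff(spiky)

# Constant gradients for comparison: v_hat_t stays at g^2, so sqrt(t * v_hat_t) grows like sqrt(t).
flat = sqrt_t_vhat([1.0] * 5)
flat_increments = np.diff(flat)
```

On the spiky sequence every increment is negative, so the terms the proof sums are not all non-negative and the telescoping argument does not go through as written. On the constant sequence the increments are positive, which is the regime in which the bound is justified.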
What would settle it
Run Adam with the recommended defaults against well-tuned SGD-with-momentum, AdaGrad, and RMSProp on the same suite of problems (MNIST logistic regression and MLP, IMDB bag-of-words logistic regression, CIFAR-10 convnet, and a variational autoencoder). If Adam fails to match or beat them on training loss within the same wall-clock budget, or if removing the bias-correction terms does not visibly destabilize training when β₂ is close to 1, the central practical claim fails. For the regret claim, a convex online sequence on which Adam's iterates do not satisfy R(T)=O(√T) would falsify the theore
Original abstract
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Adam, a first-order stochastic optimizer that maintains exponential moving averages of the gradient (m_t) and the squared gradient (v_t), applies bias-correction for the zero-initialization, and updates parameters by θ_t ← θ_{t-1} − α · m̂_t / (√v̂_t + ε). The authors motivate the update via a signal-to-noise interpretation, derive the bias-correction from the EMA recurrence, prove an O(√T) regret bound in the online convex setting (Theorem 4.1), present an L_∞-norm variant (AdaMax), and report experiments on logistic regression (MNIST, IMDB-BoW), MLPs (MNIST, with and without dropout), CNNs (CIFAR-10), and a VAE. Default hyperparameters (α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸) are recommended and shown to be competitive with or better than SGD+Nesterov, AdaGrad, RMSProp, AdaDelta, and SFO.
Significance. If the algorithmic and empirical claims hold, Adam offers a practically important contribution: a simple, memory-light, scale-invariant adaptive optimizer with intuitive hyperparameters that performs robustly across convex and non-convex deep learning workloads. The bias-correction derivation in §3 is clean and useful in its own right (it cleanly explains an effect that earlier RMSProp-with-momentum variants get wrong for β₂ near 1), and the SNR/effective-step discussion in §2.1 gives a usable mental model for setting α. The AdaMax derivation (§7.1) is elegant and yields a particularly simple update with a tighter step bound |Δ_t| ≤ α. The empirical comparisons span enough model classes (logistic regression, fully-connected nets with/without dropout, CNNs, VAE) to support the robustness claim, and the bias-correction ablation in §6.4 is a genuinely informative experiment. The theoretical contribution (Theorem 4.1) is partial — see major comments — but the algorithmic and empirical case is strong.
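For readers who want to see how simple the AdaMax update is, here is a minimal NumPy sketch. It assumes the recurrence u_t = max(β₂·u_{t−1}, |g_t|) quoted elsewhere in this report, together with a parameter update θ ← θ − (α/(1−β₁^t))·m_t/u_t and a default α = 0.002; the last two are stated from memory of the paper's AdaMax algorithm and should be treated as assumptions. The name `adamax_step` is hypothetical.

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999):
    """One AdaMax update (hypothetical helper); t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad      # EMA of the gradient
    u = np.maximum(beta2 * u, np.abs(grad))   # exponentially weighted infinity norm
    # Only the first moment is bias-corrected; u needs no correction.
    # Note: this form has no eps, so a coordinate whose gradient has always
    # been exactly zero would divide by zero.
    theta = theta - (alpha / (1.0 - beta1 ** t)) * m / u
    return theta, m, u

theta0 = np.array([0.5, -1.5])
grad0 = np.array([4.0, -0.25])   # arbitrary nonzero gradient for illustration
zeros = np.zeros_like(theta0)

theta1, m1, u1 = adamax_step(theta0, grad0, zeros, zeros, t=1)
first_move = np.abs(theta1 - theta0)
```

On the first step m_t/(1−β₁) equals g and u_t equals |g|, so each coordinate moves by exactly α in magnitude, independent of the gradient scale.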
major comments (4)
- [§4 / §10.1, Theorem 4.1 / 10.5] The regret proof contains a load-bearing step that is not justified. In the displayed bound at the top of p. 14, the sum ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) is replaced by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This telescoping is valid only if √v̂_{t,i}/α_t is non-decreasing in t for every coordinate i. With α_t = α/√t, the quantity is √(t·v̂_{t,i})/α, and since v̂_t is a bias-corrected EMA of g_t² it can strictly decrease whenever a coordinate sees a small gradient following a large one. The authors should either (i) state and justify a monotonicity assumption on v̂_t, (ii) carry through the proof with the absolute value of the increment (which changes the bound), or (iii) restrict the theorem to a class of sequences for which the monotonicity holds. As written, the bound is not established for general bounded convex sequences, and a one-dimensional counterexamp
- [§4, Theorem 4.1 statement] The hypothesis β₁²/√β₂ < 1 is stated, but its role should be made explicit in the main text: it is used in Lemma 10.4 to bound an arithmetic-geometric series. With the recommended defaults β₁=0.9, β₂=0.999 one has β₁²/√β₂ ≈ 0.810, so the assumption is satisfied at defaults; however, the condition is equivalent to β₁⁴ < β₂, so readers who push β₁ very close to 1 or lower β₂ substantially can violate it. Please flag this in §4 alongside the theorem so that the regime of validity is clear.
- [§6.3, Figure 3] The CNN experiment reports that v̂_t 'vanishes to zeros after a few epochs and is dominated by the ε in algorithm 1', and that consequently 'Adagrad converges much slower than others' while Adam shows only 'marginal improvement over SGD with momentum'. This is an interesting and honest observation, but it slightly undercuts the central claim that adaptive second-moment scaling is the source of Adam's advantage. It would strengthen the paper to (a) report what fraction of coordinates have √v̂_t < ε at the cited epochs, and (b) show an ablation in which ε is varied, so readers can tell whether Adam in this regime is effectively SGD-with-momentum + a small constant preconditioner or whether the second moment still contributes.
- [§5, Related work / RMSProp comparison] The claim that lack of bias-correction in RMSProp 'leads to very large stepsizes and often divergence' for β₂ near 1 is supported by the VAE experiment in §6.4, but the comparison fixes architecture and dataset. Since this is one of the paper's main differentiators from RMSProp, a second setting (e.g., the MLP+dropout or CNN tasks already in the paper) showing the same effect would make the case substantially more robust.
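The hypothesis β₁²/√β₂ < 1 raised in the major comments is easy to check numerically. The sketch below is a plain arithmetic check, with hypothetical helper names `gamma` and `max_beta1`; it uses the fact that β₁²/√β₂ < 1 is equivalent to β₁ < β₂^(1/4).

```python
import math

def gamma(beta1, beta2):
    """The ratio beta1^2 / sqrt(beta2) that the theorem requires to be below 1."""
    return beta1 ** 2 / math.sqrt(beta2)

def max_beta1(beta2):
    """Supremum of beta1 values satisfying gamma < 1 for a given beta2."""
    return beta2 ** 0.25

g_default = gamma(0.9, 0.999)     # recommended defaults
g_095 = gamma(0.95, 0.999)        # a more aggressive momentum setting
g_extreme = gamma(0.9999, 0.999)  # beta1 pushed very close to 1
threshold = max_beta1(0.999)
```

With these inputs the ratio is about 0.810 at the defaults and about 0.903 at β₁ = 0.95, both inside the theorem's regime. At the default β₂ the condition only fails once β₁ exceeds roughly 0.99975, so in practice the hypothesis is more likely to be violated by lowering β₂ than by moderate increases in β₁.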
minor comments (8)
- [Algorithm 1] The placement of ε outside the square root (√v̂_t + ε) versus inside (√(v̂_t + ε)) matters in practice and differs across implementations. Please state explicitly which convention is used and whether the analysis is affected.
- [§2.1] The two cases for the step bound, |Δ_t| ≤ α·(1−β₁)/√(1−β₂) versus |Δ_t| ≤ α, would be clearer with a one-line derivation rather than asserted. Currently the reader has to reconstruct the algebra.
- [§3, Eq. (4)] The term ζ is introduced and immediately argued to be small for stationary or slowly-varying gradients, but is not formally bounded. A short remark giving an explicit bound in terms of the variation of E[g_t²] would tighten the derivation.
- [§4] The decay schedule β_{1,t} = β₁·λ^{t−1} with λ very close to 1 is required for the proof but is not used in any of the experiments (which appear to use constant β₁=0.9). Please clarify whether the empirical performance corresponds to a regime covered by the theorem.
- [§7.1, Eq. (12)] It would be helpful to note that u_t = max(β₂·u_{t−1}, |g_t|) corresponds to a max over an exponentially-weighted history and therefore does not require bias correction, as briefly stated; an explicit derivation showing E[u_t] in the stationary case would parallel §3.
- [Lemma 10.3 proof] The inductive step uses the inequality √(a − b) ≤ √a − b/(2√a) which requires a ≥ b ≥ 0; this is fine but worth stating, since a = ∥g_{1:T,i}∥² and b = g_{T,i}² satisfy it by construction.
- [§6] The phrase 'searched over a dense grid' for hyperparameters of the baselines is not specific. Listing the grids (at minimum for α and momentum) in an appendix would improve reproducibility.
- [Typos] Several minor typos: 'theoratical' (§6.1, twice), 'BoW feature Logistic Regression' axis label, 'Initalization' (§7.2). 'β₁' appears where 'β₂' is meant in the sentence following Eq. (4) ('the exponential decay rate β₁ can be chosen…' — context is the second moment).
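The ε-placement question in the first minor comment can be made concrete with a short comparison of the two denominators. This is an illustrative sketch only: the helper names and the sample v̂ values are assumptions, chosen to include the regime where v̂_t is far below ε, as reported for the CNN experiment.

```python
import numpy as np

def denom_outside(v_hat, eps=1e-8):
    """Denominator with eps added after the square root: sqrt(v_hat) + eps."""
    return np.sqrt(v_hat) + eps

def denom_inside(v_hat, eps=1e-8):
    """Denominator with eps added under the square root: sqrt(v_hat + eps)."""
    return np.sqrt(v_hat + eps)

# Hypothetical second-moment values: large, comparable to eps, and vanishing.
v_hat = np.array([1.0, 1e-8, 1e-20])
outside = denom_outside(v_hat)
inside = denom_inside(v_hat)
ratio = inside / outside   # how much smaller the step is with eps inside
```

For v̂ near 1 the two conventions agree to many digits. Once v̂ is far below ε, the inside convention floors the denominator near √ε = 10⁻⁴ while the outside convention floors it near ε = 10⁻⁸, so the resulting steps differ by several orders of magnitude for the same nominal ε.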
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report. The most substantive point — the unstated monotonicity assumption underlying the telescoping step in the regret proof — is correct, and we will revise Theorem 4.1 and its proof to state the assumption explicitly rather than leaving it implicit. We also agree to flag the β₁²/√β₂ < 1 hypothesis prominently in §4, to add quantitative support to the §6.3 discussion of v̂_t vanishing on CNNs (including an ε ablation), and to broaden the bias-correction comparison in §6.4 beyond the VAE setting. None of these revisions affect the algorithm itself, the bias-correction derivation in §3, the SNR discussion in §2.1, the AdaMax derivation in §7.1, or the empirical conclusions; they sharpen the theoretical statement and strengthen the empirical case. A point-by-point response follows.
-
Referee: The telescoping step in the regret proof (top of p. 14) replaces ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This is only valid if √v̂_{t,i}/α_t is non-decreasing in t per coordinate, which need not hold for a bias-corrected EMA of g_t² when a small gradient follows a large one.
Authors: We agree that this step requires an additional assumption that we did not state explicitly. The telescoping is valid when √(t·v̂_{t,i}) is non-decreasing in t for each coordinate, which is not guaranteed by a bias-corrected EMA of g_t² in general. We will revise §4 and §10.1 in two ways: (i) we will explicitly add the assumption that √(t·v̂_{t,i})/α is non-decreasing in t for all i (equivalently, that t·v̂_{t,i} is non-decreasing), and flag that this is what makes the telescoping well-defined; and (ii) we will note the alternative route in which the increment is replaced by its absolute value, which yields a weaker but unconditional bound. We thank the referee for catching this — the assumption is implicit in our derivation but should be made part of the theorem statement, and we will add a sentence describing the regime in which it is reasonable (sufficiently slowly-varying second-moment estimates) and acknowledging that pathological sequences can violate it. We do not claim a fix for the general non-monotone case in this revision. revision: yes
-
Referee: The hypothesis β₁²/√β₂ < 1 should be flagged in the main text alongside Theorem 4.1, since users tuning β₁ upward can violate it (defaults satisfy it: 0.9²/√0.999 ≈ 0.810).
Authors: We agree. We will add a short remark in §4 immediately after the theorem statement noting (i) the role of this assumption (it is used in Lemma 10.4 to bound an arithmetic-geometric series via γ = β₁²/√β₂ < 1), (ii) that the recommended defaults β₁=0.9, β₂=0.999 give γ ≈ 0.810 and so satisfy it comfortably, and (iii) that practitioners increasing β₁ (e.g. β₁ ≥ 0.95 with default β₂) should check the inequality. We will also include a one-line worked example so the regime of validity is unambiguous. revision: yes
-
Referee: In the CNN experiment (§6.3), the authors note v̂_t vanishes to near-zero so the update is dominated by ε, which somewhat undercuts the claim that adaptive second-moment scaling drives Adam's advantage. Please report what fraction of coordinates have √v̂_t < ε at the cited epochs, and add an ablation varying ε.
Authors: This is a fair point and we agree the §6.3 discussion would benefit from quantitative support. In the revision we will add (a) a measurement, taken from the same CIFAR-10 run, of the fraction of coordinates with √v̂_t below ε (and below 10ε, 100ε) as a function of epoch, and (b) an ablation varying ε ∈ {10⁻⁴,10⁻⁶,10⁻⁸,10⁻¹⁰} to expose how much of Adam's behavior in this regime is attributable to the second-moment term versus an effectively constant preconditioner combined with the first-moment term. We will not retract the broader claim — on the logistic, MLP, and VAE experiments the second moment plainly contributes — but we will explicitly state that on CNNs of this size much of Adam's benefit over plain SGD with momentum comes from per-layer scale adaptation early in training and from the first-moment term, and that the improvement margin over well-tuned SGD+momentum is correspondingly modest. This nuance is consistent with what is already written in §6.3 but will be made quantitative. revision: yes
-
Referee: The claim that absent bias-correction RMSProp diverges for β₂ near 1 is supported only by the VAE experiment (§6.4); a second setting would substantially strengthen the differentiator from RMSProp.
Authors: We accept this. The bias-correction-vs-no-correction comparison is a central claim and one experiment is thinner than it should be. For the revision we will add a sweep over β₂ ∈ {0.99, 0.999, 0.9999} and α ∈ [10⁻⁵,10⁻¹], with and without the bias-correction terms, on the MLP+dropout MNIST setting from §6.2 (and, if space permits, on the CNN setting from §6.3). We expect — based on the analysis in §3, where the (1−β₂^t) factor is largest precisely when β₂ is near 1 — to reproduce the same instability pattern observed in §6.4. The resulting figure will be added as a panel to Figure 4 or as a new figure in §6.4. revision: yes
- We do not have a proof of the O(√T) regret bound that dispenses with the monotonicity assumption on √(t·v̂_{t,i})/α. The revised theorem will therefore be conditional on this assumption; an unconditional bound for general bounded convex sequences with bias-corrected v̂_t is left to future work.
Circularity Check
No meaningful circularity: Adam's algorithm, bias correction, and empirical claims stand on independent content; the proof gap flagged by the reader is a correctness/soundness issue, not a circular derivation.
Full rationale
Walking the derivation chain:
(1) §2 Algorithm: defines Adam by EMAs of g and g². No claim is being "derived" from itself; the update rule is a definition.
(2) §3 Bias correction: derives E[v_t] = E[g_t²]·(1−β₂^t) + ζ from the EMA recursion (Eq. 1–4). The (1−β₂^t) divisor is then read off this expectation. This is a straightforward algebraic identity, not a circular fit; nothing is fitted to data and then re-predicted.
(3) §2.1 SNR / effective stepsize bounds: |Δ_t| ≤ α-style bounds follow from the algebra of m̂_t/√v̂_t. Independent content.
(4) §4 / §10 Convergence: Theorem 4.1 derives an O(√T) regret bound from stated assumptions (bounded gradients, bounded iterate distance, β₁²/√β₂ < 1). The reader's concern is that the telescoping step in the proof of Theorem 10.5 implicitly assumes √(t·v̂_{t,i})/α is monotone non-decreasing, a soundness gap later exploited by Reddi et al. (2018). That is a correctness problem, not a circularity problem: the bound is not "the input renamed as the output"; it is an attempted proof from external assumptions that turns out to have an unjustified inequality. No quantity is fitted to the regret and then claimed as a prediction of the regret; no self-citation is load-bearing (the proof cites Zinkevich 2003's framework, not the authors' own prior work).
(5) §5 Related work / §6 Experiments: comparisons to AdaGrad/RMSProp/SGD use independently implemented baselines on standard datasets (MNIST, IMDB, CIFAR-10). No fitted-input-as-prediction pattern.
(6) §7 AdaMax: derived as the p→∞ limit of an L_p generalization (Eq. 6–12). Algebraic limit, not circular.
There is essentially no self-citation load: the references are to Duchi, Tieleman & Hinton, Zeiler, Sohl-Dickstein, Zinkevich, etc. The Kingma & Welling (2013) self-cite is only used to specify the VAE architecture used as a test problem in §6.4; it is not load-bearing for any theoretical claim.
Conclusion: the paper's core claims are self-contained against external benchmarks and standard online-convex machinery. The Theorem 4.1 issue is a real bug in the proof (correctly diagnosed by the reader), but it is a missing-step / unjustified-monotonicity flaw, not a circular derivation. Score: 1 (one minor self-citation, not load-bearing).
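The bias-correction identity referenced in the rationale, E[v_t] = E[g_t²]·(1−β₂^t) + ζ, can be checked numerically in the simplest stationary case, a constant gradient, where ζ = 0. The sketch below is illustrative only; `second_moment_trace` and the constant gradient value are hypothetical.

```python
import numpy as np

def second_moment_trace(grads, beta2=0.999):
    """Return the raw EMA v_t and the bias-corrected v_hat_t for a scalar sequence."""
    v = 0.0
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
        raw.append(v)
        corrected.append(v / (1.0 - beta2 ** t))   # divide out the (1 - beta2^t) factor
    return np.array(raw), np.array(corrected)

g = 2.0                                   # constant gradient, so g^2 = 4
raw, corrected = second_moment_trace([g] * 10)
first_step_inflation = np.sqrt(corrected[0] / raw[0])
```

With a constant gradient the corrected estimate equals g² at every step, while the raw average starts at (1−β₂)·g² and is still under 1% of g² after ten steps. Dividing by the square root of the raw value instead of the corrected one would inflate the first step by 1/√(1−β₂), about 32 times at β₂ = 0.999, which is consistent with the large early step sizes the report attributes to uncorrected RMSProp-style updates.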
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Foundation/PhiForcing.lean, Foundation/DimensionForcing.lean · phi_equation, dimension_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Good default settings for the tested machine learning problems are α = 0.001, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.
- Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t) where the hyper-parameters β₁, β₂ ∈ [0,1) control the exponential decay rates of these moving averages.
- Foundation/DAlembert/Inevitability.lean · bilinear_family_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
R(T) ≤ D²/(2α(1−β₁)) Σ_i √(T·v̂_{T,i}) + α(1+β₁)G_∞/((1−β₁)√(1−β₂)(1−γ)²) Σ_i ‖g_{1:T,i}‖₂ + ...
- Foundation/LawOfExistence.lean · law_of_existence · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Canonical Regularisation of Wide Feature-Learning Neural Networks
Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
-
Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
Introduces generalized Lipschitz class and shows clipped AdamW outperforms SGD and AdaGrad for stochastic convex optimization under this and related assumptions.
-
ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
-
ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models
In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to gener...
-
Convergent Stochastic Training of Attention and Understanding LoRA
Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
-
SLayerGen: a Crystal Generative Model for all Space and Layer Groups
SLayerGen generates crystals invariant to any space or layer group via autoregressive lattice and Wyckoff sampling plus equivariant diffusion, achieving gains over bulk models on diperiodic materials after correcting ...
-
3DSS: 3D Surface Splatting for Inverse Rendering
3DSS is the first differentiable surface splatting renderer that recovers shape, spatially-varying BRDF materials, and HDR illumination from multi-view images via a coverage-based compositing model derived from recons...
-
Random test functions, $H^{-1}$ norm equivalence, and stochastic variational physics-informed neural networks
H^{-1} norm equivalence to expected squared evaluations on domain-dependent random test functions enables SV-PINNs that recover accurate solutions to challenging second-order elliptic PDEs faster than standard PINNs.
-
A Parameter-Free First-Order Algorithm for Non-Convex Optimization with $\tilde{\mkern1mu O}(\epsilon^{-5/3})$ Global Rate
PF-AGD is the first parameter-free deterministic accelerated first-order method with Õ(ε^{-5/3} log(1/ε)) complexity for smooth non-convex optimization.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...
-
Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions
Qvine uses vine copula-inspired quantum circuit structures to achieve linear or quadratic depth scaling for loading high-dimensional distributions with high approximation quality.
-
Neural Spectral Bias and Conformal Correlators I: Introduction and Applications
Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.
-
MMGait: Towards Multi-Modal Gait Recognition
MMGait provides a new multi-sensor gait dataset and OmniGait baseline to support single-modal, cross-modal, and unified multi-modal person identification from walking patterns.
-
Proton Structure from Neural Simulation-Based Inference at the LHC
Neural simulation-based inference on unbinned top-quark pair data at 13 TeV yields improved gluon PDF precision over traditional binned analyses while incorporating experimental and theoretical uncertainties.
-
Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate
Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.
-
CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification
The paper introduces the CMCC-ReID task, constructs the SYSU-CMCC benchmark dataset, and proposes the PIA network with disentangling and prototype modules that outperforms prior methods on combined modality and clothi...
-
Traces of Helium Detected in Type Ic Supernova 2014L
Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
-
Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
Adam attains a δ^{-1/2} high-probability rate while any SGD guarantee must incur at least δ^{-1} dependence.
-
Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
Adam achieves a δ^{-1/2} high-probability convergence rate while SGD requires at least δ^{-1} due to second-moment normalization, established via stopping-time/martingale analysis under bounded variance.
-
The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence
Contrastive learning evolves population measures on a fixed manifold into either a unique convex Gibbs equilibrium or a cross-coupled multimodal landscape containing a persistent negative symmetric divergence.
-
Automated discovery of heralded ballistic graph state generators for fusion-based photonic quantum computation
A two-pass optimization framework with polynomial-based simulation discovers heralded ballistic circuits for 3-5 qubit graph states achieving up to 7.5x higher success probabilities than fusion baselines, including fi...
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Locating and Editing Factual Associations in GPT
Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
-
Offline Reinforcement Learning with Implicit Q-Learning
IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
-
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
-
PathVQA: 30000+ Questions for Medical Visual Question Answering
PathVQA is the first public dataset of over 32,000 questions on nearly 5,000 pathology images for medical visual question answering.
-
Scaling Laws for Neural Language Models
Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.
-
MIPaaL: Mixed Integer Program as a Layer
MIPaaL differentiates through mixed integer programs via cutting planes to enable decision-focused learning for general MIPs, outperforming two-stage prediction-plus-optimization and LP-relaxation baselines on real-wo...
-
AGAN: Towards Automated Design of Generative Adversarial Networks
AGAN is the first neural architecture search method for GANs that discovers architectures outperforming state-of-the-art on CIFAR-10 unsupervised image generation and competitive on supervised tasks.
-
Passage Re-ranking with BERT
Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
-
Neural Ordinary Differential Equations
Neural networks are redefined as continuous dynamical systems by learning the derivative of the hidden state with a neural network and integrating it with an ODE solver.
-
Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
-
Adaptive Computation Time for Recurrent Neural Networks
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
-
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
-
NICE: Non-linear Independent Components Estimation
NICE learns a composition of invertible neural-network layers that transform data into independent latent variables, enabling exact log-likelihood training and sampling for density estimation.
-
Learning Through Noise: Why Subliminal Learning Works and When It Fails
Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
-
Valid and Expressive Copulas for Irregular Multivariate Time Series
CopFITi is the first marginalization-consistent copula for irregular multivariate time series, using normalizing flows for marginals and a Gaussian mixture copula for dependencies to reach new state-of-the-art joint d...
-
Non-normal spectral signatures of instability in neural network training dynamics
Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.
-
Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning
CRiSP uses neural-guided MCTS and curriculum learning to insert Clifford prefixes before parameterized rotations in VQAs, yielding mean 3.17x and max 45x gains in energy accuracy on 22-qubit QAOA benchmarks versus pri...
-
ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation
A wavelet-guided adaptive INR for DEMs achieves 66.25 dB PSNR on Swiss tiles with 3.2x fewer parameters than prior work, plus post-training compression to 1.23 bpp.
-
Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent
SMFP introduces a one-step generative policy class using MeanFlow to map noise to actions, providing a tractable entropy surrogate for unified off-policy mirror descent training that outperforms Gaussian and generativ...
-
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
-
Finite-Time Regret Analysis of Retry-Aware Bandits
ReMax achieves the first sublinear finite-time regret bound for Gaussian bandits with M=2 by deriving an expected-improvement balance condition for its optimal sampling distribution and separating saturation from unde...
-
ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization
ShapeBench is a new unified benchmark for aerodynamic shape optimization that shows optimizer performance varies substantially across different shape classes and problem setups.
-
k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics
Constructs k-inductive neural barrier certificates for partially unknown nonlinear dynamics by combining neural networks, a data-driven fundamental lemma from one trajectory, and CEGIS-SMT verification.
-
INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification
INSHAPE discovers instance-specific non-overlapping shapelets, models their temporal dependencies, and aggregates them bottom-up into population-level prototypes for improved accuracy and interpretability in time-seri...
-
Learning Orthonormal Bases for Function Spaces
Neural networks parameterize finite-rank generators for ODEs on the orthogonal Lie group, allowing optimization of orthonormal bases in function space with a universality result that rank-2 generators suffice for density.
-
Targeted Downstream-Agnostic Attack
Introduces Targeted Downstream-Agnostic Attack (TDAA) that uses a threat image as feature anchor and example-specific perturbations to achieve targeted attacks on unknown downstream tasks from pre-trained encoders.
-
Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
-
Reducing the upper bound for the Borsuk number in $\mathbb{R}^4$ to 8
Explicit partitions of variants of the truncated Lassak cover establish b(4) ≤ 8.
-
Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minim...
-
Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad
AdaGrad converges at a rate depending on the unknown tail index p for 4/3 < p ≤ 2 in non-convex optimization, with an algorithm-dependent lower bound and an improved rate for AdaGrad-Norm under a mild extra assumption...
-
Pointwise Generalization in Deep Neural Networks
Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.
Reference graph
Works this paper leans on
-
[1]
Natural gradient works efficiently in learning
Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2): 251–276, 1998
work page 1998
-
[2]
Recent advances in deep learning for speech research at Microsoft
Deng, Li, Li, Jinyu, Huang, Jui-Ting, Yao, Kaisheng, Yu, Dong, Seide, Frank, Seltzer, Michael, Zweig, Geoff, He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at Microsoft. ICASSP 2013, 2013
work page 2013
-
[3]
Adaptive subgradient methods for online learning and stochastic optimization
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121–2159, 2011
work page 2011
-
[4]
Generating Sequences With Recurrent Neural Networks
Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013
work page Pith review arXiv 2013
-
[5]
Speech recognition with deep recurrent neural networks
Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE, 2013
work page 2013
-
[6]
Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504–507, 2006
work page 2006
-
[7]
Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6): 82–97, 2012a
work page 2012
-
[8]
Improving neural networks by preventing co-adaptation of feature detectors
Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012b
work page Pith review arXiv 2012
-
[9]
Auto-Encoding Variational Bayes
Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In The 2nd International Conference on Learning Representations (ICLR), 2013
work page 2013
-
[10]
Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012
work page 2012
-
[11]
Learning word vectors for sentiment analysis
Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142–150. Association for Computational Linguistics, 2011
work page 2011
-
[12]
Non-asymptotic analysis of stochastic approximation algorithms for machine learning
Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011
work page 2011
-
[13]
Revisiting Natural Gradient for Deep Networks
Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013
work page Pith review arXiv 2013
-
[14]
Acceleration of stochastic approximation by averaging
Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4): 838–855, 1992
work page 1992
-
[15]
Roux, Nicolas L and Fitzgibbon, Andrew W. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 623–630, 2010
work page 2010
-
[16]
Efficient estimations from a slowly convergent Robbins-Monro process
Ruppert, David. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988
work page 1988
-
[17]
No more pesky learning rates
Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. arXiv preprint arXiv:1206.1106, 2012
-
[18]
Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods
Sohl-Dickstein, Jascha, Poole, Ben, and Ganguli, Surya. Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 604–612, 2014
work page 2014
-
[19]
On the importance of initialization and momentum in deep learning
Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013
work page 2013
-
[20]
Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, 2012
work page 2012
-
[21]
Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 118–126, 2013
work page 2013
-
[22]
ADADELTA: An Adaptive Learning Rate Method
Zeiler, Matthew D. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012
work page Pith review arXiv 2012
-
[23]
Online convex programming and generalized infinitesimal gradient ascent
Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. 2003
work page 2003