Adam: A Method for Stochastic Optimization

Diederik P. Kingma; Jimmy Ba

arxiv: 1412.6980 · v9 · submitted 2014-12-22 · 💻 cs.LG

Pith reviewed 2026-05-09 01:36 UTC · model claude-opus-4-7

classification 💻 cs.LG · MSC 90C15, 68T05, 90C25

keywords stochastic optimization · adaptive learning rate · first-order methods · moment estimation · bias correction · online convex optimization · deep learning · AdaMax

The pith

Adam sets per-parameter step sizes from bias-corrected running averages of the gradient and its square, giving a robust default optimizer for noisy, high-dimensional problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks: can a first-order optimizer with almost no tuning reliably train large, noisy, sparse-gradient problems? Its answer is Adam, which keeps two running averages per parameter — one of the gradient, one of the squared gradient — and divides the first by the square root of the second to set a per-coordinate step. A small but important detail is the explicit bias correction that undoes the zero-initialization of those averages, which matters most when the second-moment decay rate β₂ is set close to 1 to handle sparse gradients. The construction makes the effective step size invariant to gradient rescaling and approximately bounded by the user-chosen α, so α functions like a trust-region radius rather than a raw learning rate. Empirically the authors show that one default setting tracks or beats AdaGrad, RMSProp, SGD with Nesterov momentum, AdaDelta, and a quasi-Newton baseline on logistic regression, MLPs with dropout, and convnets, and they derive an infinity-norm variant (AdaMax) with an even cleaner update bound.

Core claim

The paper proposes Adam, a first-order stochastic optimizer that maintains two exponential moving averages per parameter — one of the gradient (first moment) and one of the squared gradient (second raw moment) — and uses their ratio, with an explicit bias correction for the zero-initialization of those averages, to set a per-parameter step size. The authors argue this combines the sparse-gradient handling of AdaGrad with the non-stationarity handling of RMSProp, while the effective per-step move in parameter space stays approximately bounded by the user-chosen stepsize α, giving the method a built-in trust-region feel. They claim a single set of defaults (α=0.001, β₁=0.9, β₂=0.999, ε=1e-8) w

What carries the argument

The bias-corrected ratio m̂_t / √v̂_t, where m_t and v_t are exponential moving averages of g_t and g_t², and the corrections m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ) undo the zero-initialization bias. This ratio is gradient-scale invariant, behaves like a per-coordinate signal-to-noise ratio that automatically anneals near optima, and bounds the per-step parameter move by roughly α — turning the stepsize hyperparameter into something close to a trust-region radius.
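
As a minimal sketch of the update described above, assuming plain Python lists in place of tensors and a hypothetical one-dimensional objective f(x) = x², the bias-corrected ratio can be written out as follows. The function name `adam_step` and the toy problem are illustrative and are not taken from the paper.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on plain Python lists; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # EMA of the gradient
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # EMA of the squared gradient
        m_hat = m_i / (1.0 - beta1 ** t)             # undo the zero-initialization bias
        v_hat = v_i / (1.0 - beta2 ** t)
        new_theta.append(th - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Hypothetical toy problem: minimize f(x) = x^2 starting from x = 1.0.
theta, m, v = [1.0], [0.0], [0.0]
first_theta = None
for t in range(1, 2001):
    grad = [2.0 * theta[0]]                          # gradient of x^2
    theta, m, v = adam_step(theta, grad, m, v, t)
    if t == 1:
        first_theta = theta[0]                       # keep the result of the first step
```

On the first step the corrected moments reduce to m̂₁ = g₁ and v̂₁ = g₁², so the move is very close to α in magnitude regardless of the gradient's scale, which is the scale invariance the passage refers to.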

If this is right

A practitioner can train a wide range of deep models with the same optimizer and the same defaults, removing learning-rate tuning as a first-order concern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Editorial: the regret proof's telescoping step requires √v̂_t/α_t to be non-decreasing along each coordinate, which is not generally true; later work has constructed simple convex counterexamples on which Adam diverges, so the O(√T) bound as stated should be read as suggestive rather than airtight, even though the empirical recipe survives unchanged.

Load-bearing premise

The regret proof leans on a quantity that grows monotonically along every coordinate as training proceeds; the paper asserts this without justification, and the bound only holds where that monotonicity actually holds.

What would settle it

Run Adam with the recommended defaults against well-tuned SGD-with-momentum, AdaGrad, and RMSProp on the same suite of problems (MNIST logistic regression and MLP, IMDB bag-of-words logistic regression, CIFAR-10 convnet, and a variational autoencoder). If Adam fails to match or beat them on training loss within the same wall-clock budget, or if removing the bias-correction terms does not visibly destabilize training when β₂ is close to 1, the central practical claim fails. For the regret claim, a convex online sequence on which Adam's iterates do not satisfy R(T)=O(√T) would falsify the theore

read the original abstract

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Adam: the algorithm and defaults are the contribution; the regret proof has a real gap, but it doesn't touch what made this paper matter.

read the letter

This is the Adam paper. You already know the practical story — bias-corrected EMAs of the first and second moments of the gradient, per-coordinate step sizes, defaults α=0.001, β₁=0.9, β₂=0.999, ε=1e-8 that have held up across a decade of deep learning. The reader's verdict is right and the stress-test note is right; let me just tell you which parts to trust.

What's actually new and good: the combination itself (momentum + RMSProp-style scaling) is incremental, but the bias-correction derivation in §3 is clean and matters — without it, β₂ near 1 produces enormous early steps, which they show empirically in §6.4. The §2.1 discussion of effective step size being roughly bounded by α (the "trust region" reading) is a genuinely useful piece of intuition and explains why the defaults transfer. §5 is honest about the relationship to RMSProp and AdaGrad rather than overclaiming. AdaMax in §7.1 falls out naturally from the L^p → L^∞ limit and is a nice aside. Experiments are standard for 2014 — MNIST logistic regression, MLPs, a small CIFAR ConvNet, a VAE — but they cover the cases that matter and the comparisons against AdaGrad, RMSProp, SGD+Nesterov, AdaDelta and SFO are fair.
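
To make the bias-correction point concrete, here is a small sketch of the first-step size relative to α, with and without the correction terms. It assumes ε is negligible and that both m and v start at zero; the helper name `first_step_ratio` is hypothetical. Without correction the ratio works out to (1−β₁)/√(1−β₂), which grows as β₂ approaches 1.

```python
import math

def first_step_ratio(beta1=0.9, beta2=0.999, bias_correction=True):
    """Return |delta_1| / alpha for a nonzero first gradient, ignoring eps."""
    g = 1.0                                  # any nonzero value gives the same ratio
    m = (1.0 - beta1) * g                    # m_1 starting from m_0 = 0
    v = (1.0 - beta2) * g * g                # v_1 starting from v_0 = 0
    if bias_correction:
        m /= (1.0 - beta1)                   # divide by 1 - beta1**1
        v /= (1.0 - beta2)                   # divide by 1 - beta2**1
    return abs(m) / math.sqrt(v)

# Compare corrected and uncorrected first steps for several beta2 values.
for beta2 in (0.99, 0.999, 0.9999):
    corrected = first_step_ratio(beta2=beta2)
    uncorrected = first_step_ratio(beta2=beta2, bias_correction=False)
    print(beta2, corrected, uncorrected)
```

Setting `beta1=0.0` in the uncorrected case approximates an RMSProp-style update without momentum, where the ratio becomes 1/√(1−β₂).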

The soft spot is exactly where the stress-test note puts it. Theorem 4.1's proof in §10.1 telescopes a sum of (√v̂_{t}/α_t − √v̂_{t−1}/α_{t−1}) terms weighted by (θ_t−θ*)² and bounds it by the final term. That step needs the increment to be non-negative coordinate-wise, which is not true: v̂_t is an EMA of g_t² and can shrink when a large gradient is followed by small ones. Reddi, Kale & Kumar (ICLR 2018) made this concrete with a 1-D convex example where Adam incurs Ω(T) regret, and AMSGrad's max-of-v fix is precisely what restores monotonicity. So the O(√T) bound as stated does not hold in the generality claimed. The algorithm, the bias-correction argument, and every empirical result are unaffected.
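
The non-monotonicity is easy to see numerically. The sketch below tracks √(t·v̂_t) for a single coordinate, which is proportional to √v̂_t/α_t under α_t = α/√t; the gradient sequence (one large gradient followed by small ones) and the helper name `scaled_second_moment` are hypothetical choices for illustration.

```python
import math

def scaled_second_moment(grads, beta2=0.999):
    """Return [sqrt(t * v_hat_t)] for t = 1..len(grads), for one coordinate."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g    # EMA of the squared gradient
        v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
        out.append(math.sqrt(t * v_hat))
    return out

# Hypothetical sequence: one large gradient, then five small ones.
seq = scaled_second_moment([10.0] + [0.1] * 5)
increments = [b - a for a, b in zip(seq, seq[1:])]   # negative entries break the telescoping step
```

For this sequence the increments come out negative, so the terms the proof treats as non-negative are not.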

Recommendation: accept, easily. The proof gap is real and worth flagging in any discussion, but it doesn't undermine the contribution that made this paper consequential. Bring it to reading group if you have anyone who hasn't worked through the bias-correction derivation or the AMSGrad follow-up — the pairing is instructive. Cite without hesitation; everyone does, and they should.

Referee Report

4 major / 8 minor

Summary. The paper introduces Adam, a first-order stochastic optimizer that maintains exponential moving averages of the gradient (m_t) and the squared gradient (v_t), applies bias-correction for the zero-initialization, and updates parameters by θ_t ← θ_{t-1} − α · m̂_t / (√v̂_t + ε). The authors motivate the update via a signal-to-noise interpretation, derive the bias-correction from the EMA recurrence, prove an O(√T) regret bound in the online convex setting (Theorem 4.1), present an L_∞-norm variant (AdaMax), and report experiments on logistic regression (MNIST, IMDB-BoW), MLPs (MNIST, with and without dropout), CNNs (CIFAR-10), and a VAE. Default hyperparameters (α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸) are recommended and shown to be competitive with or better than SGD+Nesterov, AdaGrad, RMSProp, AdaDelta, and SFO.

Significance. If the algorithmic and empirical claims hold, Adam offers a practically important contribution: a simple, memory-light, scale-invariant adaptive optimizer with intuitive hyperparameters that performs robustly across convex and non-convex deep learning workloads. The bias-correction derivation in §3 is clean and useful in its own right (it cleanly explains an effect that earlier RMSProp-with-momentum variants get wrong for β₂ near 1), and the SNR/effective-step discussion in §2.1 gives a usable mental model for setting α. The AdaMax derivation (§7.1) is elegant and yields a particularly simple update with a tighter step bound |Δ_t| ≤ α. The empirical comparisons span enough model classes (logistic regression, fully-connected nets with/without dropout, CNNs, VAE) to support the robustness claim, and the bias-correction ablation in §6.4 is a genuinely informative experiment. The theoretical contribution (Theorem 4.1) is partial — see major comments — but the algorithmic and empirical case is strong.
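
The AdaMax update mentioned above can be sketched in a few lines. This assumes the recurrence u_t = max(β₂·u_{t−1}, |g_t|) and the step θ_t ← θ_{t−1} − (α/(1−β₁ᵗ))·m_t/u_t; the function name `adamax_step`, the zero-gradient guard, and the default α = 0.002 are assumptions of this sketch rather than details confirmed by the text here.

```python
def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update on plain Python lists; t is the 1-based step count."""
    new_theta, new_m, new_u = [], [], []
    for th, g, m_i, u_i in zip(theta, grad, m, u):
        m_i = beta1 * m_i + (1.0 - beta1) * g    # EMA of the gradient
        u_i = max(beta2 * u_i, abs(g))           # exponentially weighted infinity norm
        if u_i > 0.0:                            # guard (an assumption): no move before any gradient is seen
            th = th - (alpha / (1.0 - beta1 ** t)) * m_i / u_i
        new_theta.append(th)
        new_m.append(m_i)
        new_u.append(u_i)
    return new_theta, new_m, new_u

# First step from theta = 1.0 with a hypothetical gradient of 3.0.
theta, m, u = adamax_step([1.0], [3.0], [0.0], [0.0], 1)
```

Only the first moment is bias-corrected; u_t is a running maximum rather than an average that starts at zero. On the first step with a nonzero gradient the move has magnitude α, since the corrected first moment equals g₁ and u₁ = |g₁|.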

major comments (4)

[§4 / §10.1, Theorem 4.1 / 10.5] The regret proof contains a load-bearing step that is not justified. In the displayed bound at the top of p. 14, the sum ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) is replaced by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This telescoping is valid only if √v̂_{t,i}/α_t is non-decreasing in t for every coordinate i. With α_t = α/√t, the quantity is √(t·v̂_{t,i})/α, and since v̂_t is a bias-corrected EMA of g_t² it can strictly decrease whenever a coordinate sees a small gradient following a large one. The authors should either (i) state and justify a monotonicity assumption on v̂_t, (ii) carry through the proof with the absolute value of the increment (which changes the bound), or (iii) restrict the theorem to a class of sequences for which the monotonicity holds. As written, the bound is not established for general bounded convex sequences, and a one-dimensional counterexamp
[§4, Theorem 4.1 statement] The hypothesis β₁²/√β₂ < 1 is stated but its role should be made explicit in the main text: it is used in Lemma 10.4 to bound an arithmetic-geometric series. With the recommended defaults β₁=0.9, β₂=0.999 one has β₁²/√β₂ ≈ 0.810, so the assumption is satisfied at defaults; however, readers who tune β₁ upward (a common practice with momentum) while lowering β₂ can violate it, since the condition fails once β₁ > β₂^{1/4}. Please flag this in §4 alongside the theorem so that the regime of validity is clear.
[§6.3, Figure 3] The CNN experiment reports that v̂_t 'vanishes to zeros after a few epochs and is dominated by the ε in algorithm 1', and that consequently 'Adagrad converges much slower than others' while Adam shows only 'marginal improvement over SGD with momentum'. This is an interesting and honest observation, but it slightly undercuts the central claim that adaptive second-moment scaling is the source of Adam's advantage. It would strengthen the paper to (a) report what fraction of coordinates have √v̂_t < ε at the cited epochs, and (b) show an ablation in which ε is varied, so readers can tell whether Adam in this regime is effectively SGD-with-momentum + a small constant preconditioner or whether the second moment still contributes.
[§5, Related work / RMSProp comparison] The claim that lack of bias-correction in RMSProp 'leads to very large stepsizes and often divergence' for β₂ near 1 is supported by the VAE experiment in §6.4, but the comparison fixes architecture and dataset. Since this is one of the paper's main differentiators from RMSProp, a second setting (e.g., the MLP+dropout or CNN tasks already in the paper) showing the same effect would make the case substantially more robust.
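
The hypothesis in the second major comment is cheap to check for any hyperparameter pair. A minimal sketch follows; the function name `gamma` and the non-default settings are hypothetical and chosen only to show one case on each side of the inequality.

```python
import math

def gamma(beta1, beta2):
    """gamma = beta1^2 / sqrt(beta2); Theorem 4.1 assumes gamma < 1."""
    return beta1 ** 2 / math.sqrt(beta2)

default_gamma = gamma(0.9, 0.999)     # recommended defaults
raised_beta1 = gamma(0.95, 0.999)     # beta1 tuned upward, default beta2
violating = gamma(0.99, 0.9)          # hypothetical setting chosen to break the hypothesis

for name, value in [("defaults", default_gamma),
                    ("beta1=0.95", raised_beta1),
                    ("beta1=0.99, beta2=0.9", violating)]:
    print(name, round(value, 4), "ok" if value < 1.0 else "violates gamma < 1")
```

The defaults give a value of about 0.810. With β₂ held at 0.999 the inequality continues to hold for β₁ = 0.95, while the combination of a high β₁ and a lowered β₂ pushes γ above 1.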

minor comments (8)

[Algorithm 1] The placement of ε outside the square root (√v̂_t + ε) versus inside (√(v̂_t + ε)) matters in practice and differs across implementations. Please state explicitly which convention is used and whether the analysis is affected.
[§2.1] The two cases for the step bound, |Δ_t| ≤ α·(1−β₁)/√(1−β₂) versus |Δ_t| ≤ α, would be clearer with a one-line derivation rather than asserted. Currently the reader has to reconstruct the algebra.
[§3, Eq. (4)] The term ζ is introduced and immediately argued to be small for stationary or slowly-varying gradients, but is not formally bounded. A short remark giving an explicit bound in terms of the variation of E[g_t²] would tighten the derivation.
[§4] The decay schedule β_{1,t} = β₁·λ^{t−1} with λ very close to 1 is required for the proof but is not used in any of the experiments (which appear to use constant β₁=0.9). Please clarify whether the empirical performance corresponds to a regime covered by the theorem.
[§7.1, Eq. (12)] It would be helpful to note that u_t = max(β₂·u_{t−1}, |g_t|) corresponds to a max over an exponentially-weighted history and therefore does not require bias correction, as briefly stated; an explicit derivation showing E[u_t] in the stationary case would parallel §3.
[Lemma 10.3 proof] The inductive step uses the inequality √(a − b) ≤ √a − b/(2√a) which requires a ≥ b ≥ 0; this is fine but worth stating, since a = ∥g_{1:T,i}∥² and b = g_{T,i}² satisfy it by construction.
[§6] The phrase 'searched over a dense grid' for hyperparameters of the baselines is not specific. Listing the grids (at minimum for α and momentum) in an appendix would improve reproducibility.
[Typos] Several minor typos: 'theoratical' (§6.1, twice), 'BoW feature Logistic Regression' axis label, 'Initalization' (§7.2). 'β₁' appears where 'β₂' is meant in the sentence following Eq. (4) ('the exponential decay rate β₁ can be chosen…' — context is the second moment).
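
The first minor comment, on where ε sits, can be illustrated with a short comparison. This is a sketch with hypothetical moment values; the two function names are illustrative and do not correspond to any particular library's API.

```python
import math

def step_eps_outside(m_hat, v_hat, alpha=0.001, eps=1e-8):
    """Step size with eps added after the square root."""
    return alpha * m_hat / (math.sqrt(v_hat) + eps)

def step_eps_inside(m_hat, v_hat, alpha=0.001, eps=1e-8):
    """Step size with eps added under the square root."""
    return alpha * m_hat / math.sqrt(v_hat + eps)

# Hypothetical moment estimates (m_hat, v_hat).
large = (1.0, 1.0)        # sqrt(v_hat) is far above eps
small = (1e-10, 1e-20)    # sqrt(v_hat) = 1e-10 is far below eps

out_large, in_large = step_eps_outside(*large), step_eps_inside(*large)
out_small, in_small = step_eps_outside(*small), step_eps_inside(*small)
```

When √v̂_t is well above ε the two conventions agree closely. When v̂_t is tiny, the inside convention divides by roughly √ε instead of ε, so with ε = 10⁻⁸ the resulting steps differ by several orders of magnitude.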

Simulated Authors' Rebuttal

4 responses · 1 unresolved

We thank the referee for a careful and constructive report. The most substantive point — the unstated monotonicity assumption underlying the telescoping step in the regret proof — is correct, and we will revise Theorem 4.1 and its proof to state the assumption explicitly rather than leaving it implicit. We also agree to flag the β₁²/√β₂ < 1 hypothesis prominently in §4, to add quantitative support to the §6.3 discussion of v̂_t vanishing on CNNs (including an ε ablation), and to broaden the bias-correction comparison in §6.4 beyond the VAE setting. None of these revisions affect the algorithm itself, the bias-correction derivation in §3, the SNR discussion in §2.1, the AdaMax derivation in §7.1, or the empirical conclusions; they sharpen the theoretical statement and strengthen the empirical case. A point-by-point response follows.

read point-by-point responses

Referee: The telescoping step in the regret proof (top of p. 14) replaces ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This is only valid if √v̂_{t,i}/α_t is non-decreasing in t per coordinate, which need not hold for a bias-corrected EMA of g_t² when a small gradient follows a large one.

Authors: We agree that this step requires an additional assumption that we did not state explicitly. The telescoping is valid when √(t·v̂_{t,i}) is non-decreasing in t for each coordinate, which is not guaranteed by a bias-corrected EMA of g_t² in general. We will revise §4 and §10.1 in two ways: (i) we will explicitly add the assumption that √(t·v̂_{t,i})/α is non-decreasing in t for all i (equivalently, that t·v̂_{t,i} is non-decreasing), and flag that this is what makes the telescoping well-defined; and (ii) we will note the alternative route in which the increment is replaced by its absolute value, which yields a weaker but unconditional bound. We thank the referee for catching this — the assumption is implicit in our derivation but should be made part of the theorem statement, and we will add a sentence describing the regime in which it is reasonable (sufficiently slowly-varying second-moment estimates) and acknowledging that pathological sequences can violate it. We do not claim a fix for the general non-monotone case in this revision. revision: yes
Referee: The hypothesis β₁²/√β₂ < 1 should be flagged in the main text alongside Theorem 4.1, since users tuning β₁ upward can violate it (defaults satisfy it: 0.9²/√0.999 ≈ 0.810).

Authors: We agree. We will add a short remark in §4 immediately after the theorem statement noting (i) the role of this assumption (it is used in Lemma 10.4 to bound an arithmetic-geometric series via γ = β₁²/√β₂ < 1), (ii) that the recommended defaults β₁=0.9, β₂=0.999 give γ ≈ 0.810 and so satisfy it comfortably, and (iii) that practitioners who increase β₁ or decrease β₂ should check the inequality, which fails once β₁ > β₂^{1/4} (about 0.99975 at the default β₂). We will also include a one-line worked example so the regime of validity is unambiguous. revision: yes
Referee: In the CNN experiment (§6.3), the authors note v̂_t vanishes to near-zero so the update is dominated by ε, which somewhat undercuts the claim that adaptive second-moment scaling drives Adam's advantage. Please report what fraction of coordinates have √v̂_t < ε at the cited epochs, and add an ablation varying ε.

Authors: This is a fair point and we agree the §6.3 discussion would benefit from quantitative support. In the revision we will add (a) a measurement, taken from the same CIFAR-10 run, of the fraction of coordinates with √v̂_t below ε (and below 10ε, 100ε) as a function of epoch, and (b) an ablation varying ε ∈ {10⁻⁴,10⁻⁶,10⁻⁸,10⁻¹⁰} to expose how much of Adam's behavior in this regime is attributable to the second-moment term versus an effectively constant preconditioner combined with the first-moment term. We will not retract the broader claim — on the logistic, MLP, and VAE experiments the second moment plainly contributes — but we will explicitly state that on CNNs of this size much of Adam's benefit over plain SGD with momentum comes from per-layer scale adaptation early in training and from the first-moment term, and that the improvement margin over well-tuned SGD+momentum is correspondingly modest. This nuance is consistent with what is already written in §6.3 but will be made quantitative. revision: yes
Referee: The claim that absent bias-correction RMSProp diverges for β₂ near 1 is supported only by the VAE experiment (§6.4); a second setting would substantially strengthen the differentiator from RMSProp.

Authors: We accept this. The bias-correction-vs-no-correction comparison is a central claim and one experiment is thinner than it should be. For the revision we will add a sweep over β₂ ∈ {0.99, 0.999, 0.9999} and α ∈ [10⁻⁵,10⁻¹], with and without the bias-correction terms, on the MLP+dropout MNIST setting from §6.2 (and, if space permits, on the CNN setting from §6.3). We expect — based on the analysis in §3, where the (1−β₂^t) factor is largest precisely when β₂ is near 1 — to reproduce the same instability pattern observed in §6.4. The resulting figure will be added as a panel to Figure 4 or as a new figure in §6.4. revision: yes

standing simulated objections not resolved

We do not have a proof of the O(√T) regret bound that dispenses with the monotonicity assumption on √(t·v̂_{t,i})/α. The revised theorem will therefore be conditional on this assumption; an unconditional bound for general bounded convex sequences with bias-corrected v̂_t is left to future work.

Circularity Check

0 steps flagged

No meaningful circularity: Adam's algorithm, bias correction, and empirical claims stand on independent content; the proof gap flagged by the reader is a correctness/soundness issue, not a circular derivation.

full rationale

Walking the derivation chain:
(1) §2 Algorithm: defines Adam by EMAs of g and g². No claim is being "derived" from itself; the update rule is a definition.
(2) §3 Bias correction: derives E[v_t] = E[g_t²]·(1−β_2^t) + ζ from the EMA recursion (Eq. 1–4). The (1−β_2^t) divisor is then read off this expectation. This is a straightforward algebraic identity, not a circular fit; nothing is fitted to data and then re-predicted.
(3) §2.1 SNR / effective stepsize bounds: |Δ_t| ≤ α-style bounds follow from the algebra of m̂_t/√v̂_t. Independent content.
(4) §4 / §10 Convergence: Theorem 4.1 derives an O(√T) regret bound from stated assumptions (bounded gradients, bounded iterate distance, β_1²/√β_2 < 1). The reader's concern is that the telescoping step in the proof of Theorem 10.5 implicitly assumes √(t·v̂_{t,i})/α is monotone non-decreasing, a soundness gap later exploited by Reddi et al. (2018). That is a *correctness* problem, not a circularity problem: the bound is not "the input renamed as the output"; it is an attempted proof from external assumptions that turns out to have an unjustified inequality. No quantity is fitted to the regret and then claimed as a prediction of the regret; no self-citation is load-bearing (the proof cites Zinkevich 2003's framework, not the authors' own prior work).
(5) §5 Related work / §6 Experiments: comparisons to AdaGrad/RMSProp/SGD use independently implemented baselines on standard datasets (MNIST, IMDB, CIFAR-10). No fitted-input-as-prediction pattern.
(6) §7 AdaMax: derived as the p→∞ limit of an L_p generalization (Eq. 6–12). Algebraic limit, not circular.
There is essentially no self-citation load: the references are to Duchi, Tieleman & Hinton, Zeiler, Sohl-Dickstein, Zinkevich, etc. The Kingma & Welling (2013) self-cite is only used to specify the VAE architecture used as a *test problem* in §6.4; it is not load-bearing for any theoretical claim.
Conclusion: the paper's core claims are self-contained against external benchmarks and standard online-convex machinery. The Theorem 4.1 issue is a real bug in the proof (correctly diagnosed by the reader), but it is a missing-step / unjustified-monotonicity flaw, not a circular derivation. Score: 1 (one minor self-citation, not load-bearing).
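
For reference, step (2) of the rationale can be written out explicitly. The following is a sketch of the bias-correction argument under the setup described there: v is initialized at zero, and ζ collects the error from any non-stationarity of E[g_i²]. The geometric-sum step is the only algebra involved.

```latex
\begin{align}
v_t &= (1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i}\, g_i^2, \\
\mathbb{E}[v_t] &= \mathbb{E}[g_t^2]\,(1-\beta_2)\sum_{i=1}^{t}\beta_2^{\,t-i} + \zeta
                 = \mathbb{E}[g_t^2]\,\bigl(1-\beta_2^{\,t}\bigr) + \zeta, \\
\hat v_t &= \frac{v_t}{1-\beta_2^{\,t}}.
\end{align}
```

The factor 1−β₂ᵗ is small for early t when β₂ is close to 1, which is why dividing it out matters most in that regime.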

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The algorithm itself introduces no postulated entities. Its hyperparameters (α, β₁, β₂, ε) are user-settable knobs, not free parameters fitted to make a derivation work, though the recommended defaults were chosen empirically. The convergence theorem leans on standard online-convex-optimization assumptions plus one assumption that turned out to be false in general.

pith-pipeline@v0.9.0 · 9516 in / 5575 out tokens · 82700 ms · 2026-05-09T01:36:22.993249+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/PhiForcing.lean, Foundation/DimensionForcing.lean · phi_equation, dimension_forced · unclear

Relation between the paper passage and the cited Recognition theorem.

Good default settings for the tested machine learning problems are α = 0.001, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.
Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t) where the hyper-parameters β₁, β₂ ∈ [0,1) control the exponential decay rates of these moving averages.
Foundation/DAlembert/Inevitability.lean · bilinear_family_forced · unclear

R(T) ≤ D²/(2α(1−β₁)) Σ_i √(T·v̂_{T,i}) + α(1+β₁)G_∞/((1−β₁)√(1−β₂)(1−γ)²) Σ_i ‖g_{1:T,i}‖₂ + ...
Foundation/LawOfExistence.lean · law_of_existence · unclear

We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Canonical Regularisation of Wide Feature-Learning Neural Networks
stat.ML 2026-05 unverdicted novelty 8.0

Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
Stochastic Non-Smooth Convex Optimization with Unbounded Gradients
math.OC 2026-05 unverdicted novelty 8.0

Introduces generalized Lipschitz class and shows clipped AdamW outperforms SGD and AdaGrad for stochastic convex optimization under this and related assumptions.
ENSEMBITS: an alphabet of protein conformational ensembles
cs.LG 2026-05 unverdicted novelty 8.0

Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
ENSEMBITS: an alphabet of protein conformational ensembles
cs.LG 2026-05 unverdicted novelty 8.0

Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models
cs.LG 2026-05 unverdicted novelty 8.0

In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to gener...
Convergent Stochastic Training of Attention and Understanding LoRA
cs.LG 2026-05 unverdicted novelty 8.0

Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
SLayerGen: a Crystal Generative Model for all Space and Layer Groups
cond-mat.mtrl-sci 2026-05 unverdicted novelty 8.0

SLayerGen generates crystals invariant to any space or layer group via autoregressive lattice and Wyckoff sampling plus equivariant diffusion, achieving gains over bulk models on diperiodic materials after correcting ...
3DSS: 3D Surface Splatting for Inverse Rendering
cs.GR 2026-05 unverdicted novelty 8.0

3DSS is the first differentiable surface splatting renderer that recovers shape, spatially-varying BRDF materials, and HDR illumination from multi-view images via a coverage-based compositing model derived from recons...
Random test functions, $H^{-1}$ norm equivalence, and stochastic variational physics-informed neural networks
math.NA 2026-05 unverdicted novelty 8.0

H^{-1} norm equivalence to expected squared evaluations on domain-dependent random test functions enables SV-PINNs that recover accurate solutions to challenging second-order elliptic PDEs faster than standard PINNs.
A Parameter-Free First-Order Algorithm for Non-Convex Optimization with $\tilde{\mkern1mu O}(\epsilon^{-5/3})$ Global Rate
math.OC 2026-05 conditional novelty 8.0

PF-AGD is the first parameter-free deterministic accelerated first-order method with Õ(ε^{-5/3} log(1/ε)) complexity for smooth non-convex optimization.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
cs.CR 2026-05 unverdicted novelty 8.0

STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...
Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions
quant-ph 2026-04 unverdicted novelty 8.0

Qvine uses vine copula-inspired quantum circuit structures to achieve linear or quadratic depth scaling for loading high-dimensional distributions with high approximation quality.
Neural Spectral Bias and Conformal Correlators I: Introduction and Applications
hep-th 2026-04 unverdicted novelty 8.0

Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.
MMGait: Towards Multi-Modal Gait Recognition
cs.CV 2026-04 conditional novelty 8.0

MMGait provides a new multi-sensor gait dataset and OmniGait baseline to support single-modal, cross-modal, and unified multi-modal person identification from walking patterns.
Proton Structure from Neural Simulation-Based Inference at the LHC
hep-ph 2026-04 unverdicted novelty 8.0

Neural simulation-based inference on unbinned top-quark pair data at 13 TeV yields improved gluon PDF precision over traditional binned analyses while incorporating experimental and theoretical uncertainties.
Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate
math.OC 2026-04 unverdicted novelty 8.0

Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.
CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification
cs.CV 2026-04 unverdicted novelty 8.0

The paper introduces the CMCC-ReID task, constructs the SYSU-CMCC benchmark dataset, and proposes the PIA network with disentangling and prototype modules that outperforms prior methods on combined modality and clothi...
Traces of Helium Detected in Type Ic Supernova 2014L
astro-ph.HE 2026-03 accept novelty 8.0

Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
cs.LG 2026-03 unverdicted novelty 8.0

Adam achieves a δ^{-1/2} high-probability convergence rate while SGD requires at least δ^{-1} due to second-moment normalization, established via stopping-time/martingale analysis under bounded variance.
The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence
cs.LG 2026-01 unverdicted novelty 8.0

Contrastive learning evolves population measures on a fixed manifold into either a unique convex Gibbs equilibrium or a cross-coupled multimodal landscape containing a persistent negative symmetric divergence.
Automated discovery of heralded ballistic graph state generators for fusion-based photonic quantum computation
quant-ph 2025-08 unverdicted novelty 8.0

A two-pass optimization framework with polynomial-based simulation discovers heralded ballistic circuits for 3-5 qubit graph states achieving up to 7.5x higher success probabilities than fusion baselines, including fi...
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
cs.LG 2024-07 conditional novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
cs.AI 2023-06 conditional novelty 8.0

LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
Discovering Latent Knowledge in Language Models Without Supervision
cs.CL 2022-12 conditional novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
cs.LG 2022-09 unverdicted novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
Locating and Editing Factual Associations in GPT
cs.CL 2022-02 accept novelty 8.0

Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
Offline Reinforcement Learning with Implicit Q-Learning
cs.LG 2021-10 unverdicted novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
cs.CV 2021-03 accept novelty 8.0

Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
PathVQA: 30000+ Questions for Medical Visual Question Answering
cs.CL 2020-03 accept novelty 8.0

PathVQA is the first public dataset of over 32,000 questions on nearly 5,000 pathology images for medical visual question answering.
Scaling Laws for Neural Language Models
cs.LG 2020-01 unverdicted novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.
MIPaaL: Mixed Integer Program as a Layer
cs.LG 2019-07 unverdicted novelty 8.0

MIPaaL differentiates through mixed integer programs via cutting planes to enable decision-focused learning for general MIPs, outperforming two-stage prediction-plus-optimization and LP-relaxation baselines on real-wo...
AGAN: Towards Automated Design of Generative Adversarial Networks
cs.LG 2019-06 unverdicted novelty 8.0

AGAN is the first neural architecture search method for GANs that discovers architectures outperforming state-of-the-art on CIFAR-10 unsupervised image generation and competitive on supervised tasks.
Passage Re-ranking with BERT
cs.IR 2019-01 unverdicted novelty 8.0

Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
Neural Ordinary Differential Equations
cs.LG 2018-06 accept novelty 8.0

Neural networks are redefined as continuous dynamical systems by learning the derivative of the hidden state with a neural network and integrating it with an ODE solver.
Density estimation using Real NVP
cs.LG 2016-05 accept novelty 8.0

Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
Adaptive Computation Time for Recurrent Neural Networks
cs.NE 2016-03 accept novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
cs.LG 2015-11 accept novelty 8.0

DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
NICE: Non-linear Independent Components Estimation
cs.LG 2014-10 accept novelty 8.0

NICE learns a composition of invertible neural-network layers that transform data into independent latent variables, enabling exact log-likelihood training and sampling for density estimation.
Learning Through Noise: Why Subliminal Learning Works and When It Fails
cs.LG 2026-05 unverdicted novelty 7.0

Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
Valid and Expressive Copulas for Irregular Multivariate Time Series
cs.LG 2026-05 unverdicted novelty 7.0

CopFITi is the first marginalization-consistent copula for irregular multivariate time series, using normalizing flows for marginals and a Gaussian mixture copula for dependencies to reach new state-of-the-art joint d...
Non-normal spectral signatures of instability in neural network training dynamics
cs.LG 2026-05 unverdicted novelty 7.0

Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.
Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning
quant-ph 2026-05 unverdicted novelty 7.0

CRiSP uses neural-guided MCTS and curriculum learning to insert Clifford prefixes before parameterized rotations in VQAs, yielding mean 3.17x and max 45x gains in energy accuracy on 22-qubit QAOA benchmarks versus pri...
ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation
cs.LG 2026-05 unverdicted novelty 7.0

A wavelet-guided adaptive INR for DEMs achieves 66.25 dB PSNR on Swiss tiles with 3.2x fewer parameters than prior work, plus post-training compression to 1.23 bpp.
Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent
cs.LG 2026-05 unverdicted novelty 7.0

SMFP introduces a one-step generative policy class using MeanFlow to map noise to actions, providing a tractable entropy surrogate for unified off-policy mirror descent training that outperforms Gaussian and generativ...
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
cs.LG 2026-05 unverdicted novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
Finite-Time Regret Analysis of Retry-Aware Bandits
cs.LG 2026-05 unverdicted novelty 7.0

ReMax achieves the first sublinear finite-time regret bound for Gaussian bandits with M=2 by deriving an expected-improvement balance condition for its optimal sampling distribution and separating saturation from unde...
ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization
cs.LG 2026-05 unverdicted novelty 7.0

ShapeBench is a new unified benchmark for aerodynamic shape optimization that shows optimizer performance varies substantially across different shape classes and problem setups.
k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics
eess.SY 2026-05 unverdicted novelty 7.0

Constructs k-inductive neural barrier certificates for partially unknown nonlinear dynamics by combining neural networks, a data-driven fundamental lemma from one trajectory, and CEGIS-SMT verification.
INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification
cs.LG 2026-05 unverdicted novelty 7.0

INSHAPE discovers instance-specific non-overlapping shapelets, models their temporal dependencies, and aggregates them bottom-up into population-level prototypes for improved accuracy and interpretability in time-seri...
Learning Orthonormal Bases for Function Spaces
cs.LG 2026-05 unverdicted novelty 7.0

Neural networks parameterize finite-rank generators for ODEs on the orthogonal Lie group, allowing optimization of orthonormal bases in function space with a universality result that rank-2 generators suffice for density.
Targeted Downstream-Agnostic Attack
cs.CV 2026-05 unverdicted novelty 7.0

Introduces Targeted Downstream-Agnostic Attack (TDAA) that uses a threat image as feature anchor and example-specific perturbations to achieve targeted attacks on unknown downstream tasks from pre-trained encoders.
Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
cs.LG 2026-05 unverdicted novelty 7.0

Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
Reducing the upper bound for the Borsuk number in $\mathbb{R}^4$ to 8
math.MG 2026-05 unverdicted novelty 7.0

Explicit partitions of variants of the truncated Lassak cover establish b(4) ≤ 8.
Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
eess.SP 2026-05 unverdicted novelty 7.0

NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minim...
Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad
math.OC 2026-05 unverdicted novelty 7.0

AdaGrad converges at a rate depending on the unknown tail index p for 4/3 < p ≤ 2 in non-convex optimization, with an algorithm-dependent lower bound and an improved rate for AdaGrad-Norm under a mild extra assumption...
Pointwise Generalization in Deep Neural Networks
cs.LG 2026-05 unverdicted novelty 7.0

Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 1531 Pith papers

[1] Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2): 251–276, 1998

[2] Deng, Li, Li, Jinyu, Huang, Jui-Ting, Yao, Kaisheng, Yu, Dong, Seide, Frank, Seltzer, Michael, Zweig, Geoff, He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at Microsoft. ICASSP 2013, 2013

[3] Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121–2159, 2011

[4] Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013

[5] Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE, 2013

[6] Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504–507, 2006

[7] Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6): 82–97, 2012a

[8] Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012b

[9] Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In The 2nd International Conference on Learning Representations (ICLR), 2013

[10] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012

[11] Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142–150. Association for Computational Linguistics, 2011

[12] Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011

[13] Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013

[14] Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4): 838–855, 1992

[15] Roux, Nicolas L and Fitzgibbon, Andrew W. A fast natural Newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 623–630, 2010

[16] Ruppert, David. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988

[17] Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. arXiv preprint arXiv:1206.1106, 2012

[18] Sohl-Dickstein, Jascha, Poole, Ben, and Ganguli, Surya. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 604–612, 2014

[19] Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013

[20] Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, 2012

[21] Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 118–126, 2013

[22] Zeiler, Matthew D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012

[23] Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. 2003
