Adam: A Method for Stochastic Optimization
Recognition: 4 theorem links
Pith reviewed 2026-05-09 01:36 UTC · model claude-opus-4-7
The pith
Adam sets per-parameter step sizes from bias-corrected running averages of the gradient and its square, giving a robust default optimizer for noisy, high-dimensional problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes Adam, a first-order stochastic optimizer that maintains two exponential moving averages per parameter — one of the gradient (first moment) and one of the squared gradient (second raw moment) — and uses their ratio, with an explicit bias correction for the zero-initialization of those averages, to set a per-parameter step size. The authors argue this combines the sparse-gradient handling of AdaGrad with the non-stationarity handling of RMSProp, while the effective per-step move in parameter space stays approximately bounded by the user-chosen stepsize α, giving the method a built-in trust-region feel. They claim a single set of defaults (α=0.001, β₁=0.9, β₂=0.999, ε=1e-8) works well across the tested problems with little or no tuning.
What carries the argument
The bias-corrected ratio m̂_t / √v̂_t, where m_t and v_t are exponential moving averages of g_t and g_t², and the corrections m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ) undo the zero-initialization bias. This ratio is gradient-scale invariant, behaves like a per-coordinate signal-to-noise ratio that automatically anneals near optima, and bounds the per-step parameter move by roughly α — turning the stepsize hyperparameter into something close to a trust-region radius.
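A minimal sketch of the update this describes, in NumPy; the function and variable names are ours, not the paper's, and ε is added outside the square root as in Algorithm 1.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter vector theta with gradient grad.

    m and v are the exponential moving averages of the gradient and the
    squared gradient; t is the 1-based step count. Illustrative sketch only.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2      # second raw-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # undo zero-initialization bias
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```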
If this is right
- A practitioner can train a wide range of deep models with the same optimizer and the same defaults, removing learning-rate tuning as a first-order concern.
Where Pith is reading between the lines
- Editorial: the regret proof's telescoping step requires √v̂_t/α_t to be non-decreasing along each coordinate, which is not generally true; later work has constructed simple convex counterexamples on which Adam diverges, so the O(√T) bound as stated should be read as suggestive rather than airtight, even though the empirical recipe survives unchanged.
Load-bearing premise
The regret proof leans on the assumption that a certain quantity (√v̂_t/α_t) grows monotonically along every coordinate as training proceeds; the paper uses this without justification, and the bound only holds where that monotonicity actually holds.
What would settle it
Run Adam with the recommended defaults against well-tuned SGD-with-momentum, AdaGrad, and RMSProp on the same suite of problems (MNIST logistic regression and MLP, IMDB bag-of-words logistic regression, CIFAR-10 convnet, and a variational autoencoder). If Adam fails to match or beat them on training loss within the same wall-clock budget, or if removing the bias-correction terms does not visibly destabilize training when β₂ is close to 1, the central practical claim fails. For the regret claim, a convex online sequence on which Adam's iterates do not satisfy R(T)=O(√T) would falsify the theorem as stated.
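A hedged sketch of the bias-correction ablation named above: the same update with the correction terms optionally removed, so the β₂-near-1 destabilization claim can be checked. The names and the toggle are ours; this is not the paper's experimental code.

```python
import numpy as np

def adam_like_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, bias_correct=True):
    """Adam update with the bias-correction terms optionally removed.

    With bias_correct=False the zero-initialized averages are not rescaled;
    since (1 - beta1**t) / sqrt(1 - beta2**t) > 1 for small t when beta2 is
    near 1, the early effective steps are inflated, which is the instability
    the paper attributes to uncorrected RMSProp-style updates.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    if bias_correct:
        m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```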
Original abstract
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Adam, a first-order stochastic optimizer that maintains exponential moving averages of the gradient (m_t) and the squared gradient (v_t), applies bias-correction for the zero-initialization, and updates parameters by θ_t ← θ_{t-1} − α · m̂_t / (√v̂_t + ε). The authors motivate the update via a signal-to-noise interpretation, derive the bias-correction from the EMA recurrence, prove an O(√T) regret bound in the online convex setting (Theorem 4.1), present an L_∞-norm variant (AdaMax), and report experiments on logistic regression (MNIST, IMDB-BoW), MLPs (MNIST, with and without dropout), CNNs (CIFAR-10), and a VAE. Default hyperparameters (α=10⁻³, β₁=0.9, β₂=0.999, ε=10⁻⁸) are recommended and shown to be competitive with or better than SGD+Nesterov, AdaGrad, RMSProp, AdaDelta, and SFO.
Significance. If the algorithmic and empirical claims hold, Adam offers a practically important contribution: a simple, memory-light, scale-invariant adaptive optimizer with intuitive hyperparameters that performs robustly across convex and non-convex deep learning workloads. The bias-correction derivation in §3 is clean and useful in its own right (it cleanly explains an effect that earlier RMSProp-with-momentum variants get wrong for β₂ near 1), and the SNR/effective-step discussion in §2.1 gives a usable mental model for setting α. The AdaMax derivation (§7.1) is elegant and yields a particularly simple update with a tighter step bound |Δ_t| ≤ α. The empirical comparisons span enough model classes (logistic regression, fully-connected nets with/without dropout, CNNs, VAE) to support the robustness claim, and the bias-correction ablation in §6.4 is a genuinely informative experiment. The theoretical contribution (Theorem 4.1) is partial — see major comments — but the algorithmic and empirical case is strong.
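For the AdaMax variant discussed above, a minimal sketch of the update under our naming; u replaces √v̂_t with an exponentially weighted infinity norm and needs no bias correction (the paper suggests a somewhat larger default stepsize for this variant; the value here is illustrative, as is the tiny floor on u).

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999):
    """One AdaMax update: the infinity-norm variant of Adam.

    u tracks an exponentially weighted maximum of |g| and stands in for
    sqrt(v_hat); only the first moment needs bias correction. Sketch only.
    """
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))     # no bias correction required
    # the 1e-12 floor is ours, to guard the all-zero-gradient start; the paper has no eps here
    theta = theta - (alpha / (1 - beta1 ** t)) * m / np.maximum(u, 1e-12)
    return theta, m, u
```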
major comments (4)
- [§4 / §10.1, Theorem 4.1 / 10.5] The regret proof contains a load-bearing step that is not justified. In the displayed bound at the top of p. 14, the sum ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) is replaced by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This telescoping is valid only if √v̂_{t,i}/α_t is non-decreasing in t for every coordinate i. With α_t = α/√t, the quantity is √(t·v̂_{t,i})/α, and since v̂_t is a bias-corrected EMA of g_t² it can strictly decrease whenever a coordinate sees a small gradient following a large one. The authors should either (i) state and justify a monotonicity assumption on v̂_t, (ii) carry through the proof with the absolute value of the increment (which changes the bound), or (iii) restrict the theorem to a class of sequences for which the monotonicity holds. As written, the bound is not established for general bounded convex sequences, and a one-dimensional counterexample with a single large gradient followed by small ones already exhibits the failure (a numerical sketch follows this list of comments).
- [§4, Theorem 4.1 statement] The hypothesis β₁²/√β₂ < 1 is stated but its role should be made explicit in the main text — it is used in Lemma 10.4 to bound an arithmetic-geometric series. With the recommended defaults β₁=0.9, β₂=0.999 one has β₁²/√β₂ ≈ 0.811, so the assumption is satisfied at defaults; however readers tuning β₁ upward (a common practice with momentum) can violate it. Please flag this in §4 alongside the theorem so that the regime of validity is clear.
- [§6.3, Figure 3] The CNN experiment reports that v̂_t 'vanishes to zeros after a few epochs and is dominated by the ε in algorithm 1', and that consequently 'Adagrad converges much slower than others' while Adam shows only 'marginal improvement over SGD with momentum'. This is an interesting and honest observation, but it slightly undercuts the central claim that adaptive second-moment scaling is the source of Adam's advantage. It would strengthen the paper to (a) report what fraction of coordinates have √v̂_t < ε at the cited epochs, and (b) show an ablation in which ε is varied, so readers can tell whether Adam in this regime is effectively SGD-with-momentum + a small constant preconditioner or whether the second moment still contributes.
- [§5, Related work / RMSProp comparison] The claim that lack of bias-correction in RMSProp 'leads to very large stepsizes and often divergence' for β₂ near 1 is supported by the VAE experiment in §6.4, but the comparison fixes architecture and dataset. Since this is one of the paper's main differentiators from RMSProp, a second setting (e.g., the MLP+dropout or CNN tasks already in the paper) showing the same effect would make the case substantially more robust.
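The numerical sketch promised in the first major comment: our illustration, not the paper's, showing that √(t·v̂_t) can strictly decrease when a single large gradient is followed by small ones, so the quantity the telescoping step needs to be non-decreasing is not non-decreasing in general.

```python
import numpy as np

# Our illustration: a large gradient followed by small ones makes
# sqrt(t * v_hat_t) -- the quantity whose monotonicity the telescoping
# step relies on (with alpha_t = alpha / sqrt(t)) -- strictly decrease.
beta2 = 0.999
grads = [1.0, 0.01, 0.01, 0.01]
v = 0.0
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = v / (1 - beta2 ** t)
    print(t, f"{np.sqrt(t * v_hat):.6f}")   # 1.000000, 0.999800, 0.999600, ...
```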
minor comments (8)
- [Algorithm 1] The placement of ε outside the square root (√v̂_t + ε) versus inside (√(v̂_t + ε)) matters in practice and differs across implementations. Please state explicitly which convention is used and whether the analysis is affected (a short numerical contrast follows this list of comments).
- [§2.1] The two cases for the step bound, |Δ_t| ≤ α·(1−β₁)/√(1−β₂) versus |Δ_t| ≤ α, would be clearer with a one-line derivation rather than asserted. Currently the reader has to reconstruct the algebra.
- [§3, Eq. (4)] The term ζ is introduced and immediately argued to be small for stationary or slowly-varying gradients, but is not formally bounded. A short remark giving an explicit bound in terms of the variation of E[g_t²] would tighten the derivation.
- [§4] The decay schedule β_{1,t} = β₁·λ^{t−1} with λ very close to 1 is required for the proof but is not used in any of the experiments (which appear to use constant β₁=0.9). Please clarify whether the empirical performance corresponds to a regime covered by the theorem.
- [§7.1, Eq. (12)] It would be helpful to note that u_t = max(β₂·u_{t−1}, |g_t|) corresponds to a max over an exponentially-weighted history and therefore does not require bias correction, as briefly stated; an explicit derivation showing E[u_t] in the stationary case would parallel §3.
- [Lemma 10.3 proof] The inductive step uses the inequality √(a − b) ≤ √a − b/(2√a) which requires a ≥ b ≥ 0; this is fine but worth stating, since a = ∥g_{1:T,i}∥² and b = g_{T,i}² satisfy it by construction.
- [§6] The phrase 'searched over a dense grid' for hyperparameters of the baselines is not specific. Listing the grids (at minimum for α and momentum) in an appendix would improve reproducibility.
- [Typos] Several minor typos: 'theoratical' (§6.1, twice), 'BoW feature Logistic Regression' axis label, 'Initalization' (§7.2). 'β₁' appears where 'β₂' is meant in the sentence following Eq. (4) ('the exponential decay rate β₁ can be chosen…' — context is the second moment).
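The numerical contrast referenced in the first minor comment: our illustration of how far the two ε conventions can diverge once a coordinate's second-moment estimate has decayed far below ε², as in the §6.3 CNN discussion (the values are hypothetical).

```python
import numpy as np

alpha, eps = 1e-3, 1e-8
m_hat, v_hat = 1e-6, 1e-16      # hypothetical late-training values for one coordinate

step_outside = alpha * m_hat / (np.sqrt(v_hat) + eps)   # Algorithm 1 convention: sqrt(v_hat) + eps
step_inside = alpha * m_hat / np.sqrt(v_hat + eps)       # alternative convention: sqrt(v_hat + eps)
print(step_outside, step_inside)   # ~5e-2 versus ~1e-5: a factor of about 5000
```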
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report. The most substantive point — the unstated monotonicity assumption underlying the telescoping step in the regret proof — is correct, and we will revise Theorem 4.1 and its proof to state the assumption explicitly rather than leaving it implicit. We also agree to flag the β₁²/√β₂ < 1 hypothesis prominently in §4, to add quantitative support to the §6.3 discussion of v̂_t vanishing on CNNs (including an ε ablation), and to broaden the bias-correction comparison in §6.4 beyond the VAE setting. None of these revisions affect the algorithm itself, the bias-correction derivation in §3, the SNR discussion in §2.1, the AdaMax derivation in §7.1, or the empirical conclusions; they sharpen the theoretical statement and strengthen the empirical case. A point-by-point response follows.
Point-by-point responses
-
Referee: The telescoping step in the regret proof (top of p. 14) replaces ∑_{t=2}^T (θ_{t,i}−θ*_i)² (√v̂_{t,i}/α_t − √v̂_{t−1,i}/α_{t−1}) by (D²/(2α(1−β₁))) ∑_i √(T v̂_{T,i}). This is only valid if √v̂_{t,i}/α_t is non-decreasing in t per coordinate, which need not hold for a bias-corrected EMA of g_t² when a small gradient follows a large one.
Authors: We agree that this step requires an additional assumption that we did not state explicitly. The telescoping is valid when √(t·v̂_{t,i}) is non-decreasing in t for each coordinate, which is not guaranteed by a bias-corrected EMA of g_t² in general. We will revise §4 and §10.1 in two ways: (i) we will explicitly add the assumption that √(t·v̂_{t,i})/α is non-decreasing in t for all i (equivalently, that t·v̂_{t,i} is non-decreasing), and flag that this is what makes the telescoping well-defined; and (ii) we will note the alternative route in which the increment is replaced by its absolute value, which yields a weaker but unconditional bound. We thank the referee for catching this — the assumption is implicit in our derivation but should be made part of the theorem statement, and we will add a sentence describing the regime in which it is reasonable (sufficiently slowly-varying second-moment estimates) and acknowledging that pathological sequences can violate it. We do not claim a fix for the general non-monotone case in this revision. revision: yes
-
Referee: The hypothesis β₁²/√β₂ < 1 should be flagged in the main text alongside Theorem 4.1, since users tuning β₁ upward can violate it (defaults satisfy it: 0.9²/√0.999 ≈ 0.811).
Authors: We agree. We will add a short remark in §4 immediately after the theorem statement noting (i) the role of this assumption — it is used in Lemma 10.4 to bound an arithmetic-geometric series via γ = β₁²/√β₂ < 1 — (ii) that the recommended defaults β₁=0.9, β₂=0.999 give γ ≈ 0.811 and so satisfy it comfortably, and (iii) that practitioners increasing β₁ (e.g. β₁ ≥ 0.95 with default β₂) should check the inequality. We will also include a one-line worked example so the regime of validity is unambiguous. revision: yes
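The one-line worked example promised above might read as follows (our arithmetic; the second pair is an illustrative violation, not a recommended setting):

```latex
\gamma = \frac{\beta_1^2}{\sqrt{\beta_2}} = \frac{0.9^2}{\sqrt{0.999}}
       \approx \frac{0.81}{0.9995} \approx 0.81 < 1,
\qquad\text{whereas}\qquad
\frac{0.95^2}{\sqrt{0.8}} \approx \frac{0.9025}{0.894} \approx 1.009 > 1 .
```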
-
Referee: In the CNN experiment (§6.3), the authors note v̂_t vanishes to near-zero so the update is dominated by ε, which somewhat undercuts the claim that adaptive second-moment scaling drives Adam's advantage. Please report what fraction of coordinates have √v̂_t < ε at the cited epochs, and add an ablation varying ε.
Authors: This is a fair point and we agree the §6.3 discussion would benefit from quantitative support. In the revision we will add (a) a measurement, taken from the same CIFAR-10 run, of the fraction of coordinates with √v̂_t below ε (and below 10ε, 100ε) as a function of epoch, and (b) an ablation varying ε ∈ {10⁻⁴,10⁻⁶,10⁻⁸,10⁻¹⁰} to expose how much of Adam's behavior in this regime is attributable to the second-moment term versus an effectively constant preconditioner combined with the first-moment term. We will not retract the broader claim — on the logistic, MLP, and VAE experiments the second moment plainly contributes — but we will explicitly state that on CNNs of this size much of Adam's benefit over plain SGD with momentum comes from per-layer scale adaptation early in training and from the first-moment term, and that the improvement margin over well-tuned SGD+momentum is correspondingly modest. This nuance is consistent with what is already written in §6.3 but will be made quantitative. revision: yes
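A sketch of the measurement promised in (a), under our naming: given the bias-corrected second-moment estimates gathered at some epoch, report the fraction of coordinates falling below ε and small multiples of it.

```python
import numpy as np

def fraction_below(v_hat, eps=1e-8, multipliers=(1, 10, 100)):
    """Fraction of coordinates with sqrt(v_hat) below eps, 10*eps, 100*eps.

    v_hat: flat array of bias-corrected second-moment estimates taken from a
    training checkpoint. Helper name and thresholds are ours, not the paper's.
    """
    root = np.sqrt(np.asarray(v_hat))
    return {m: float(np.mean(root < m * eps)) for m in multipliers}
```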
-
Referee: The claim that absent bias-correction RMSProp diverges for β₂ near 1 is supported only by the VAE experiment (§6.4); a second setting would substantially strengthen the differentiator from RMSProp.
Authors: We accept this. The bias-correction-vs-no-correction comparison is a central claim and one experiment is thinner than it should be. For the revision we will add a sweep over β₂ ∈ {0.99, 0.999, 0.9999} and α ∈ [10⁻⁵,10⁻¹], with and without the bias-correction terms, on the MLP+dropout MNIST setting from §6.2 (and, if space permits, on the CNN setting from §6.3). We expect — based on the analysis in §3, where the (1−β₂^t) factor is largest precisely when β₂ is near 1 — to reproduce the same instability pattern observed in §6.4. The resulting figure will be added as a panel to Figure 4 or as a new figure in §6.4. revision: yes
- We do not have a proof of the O(√T) regret bound that dispenses with the monotonicity assumption on √(t·v̂_{t,i})/α. The revised theorem will therefore be conditional on this assumption; an unconditional bound for general bounded convex sequences with bias-corrected v̂_t is left to future work.
Circularity Check
No meaningful circularity: Adam's algorithm, bias correction, and empirical claims stand on independent content; the proof gap flagged by the reader is a correctness/soundness issue, not a circular derivation.
full rationale
Walking the derivation chain:
- (1) §2 Algorithm: defines Adam by EMAs of g and g². No claim is being "derived" from itself — the update rule is a definition.
- (2) §3 Bias correction: derives E[v_t] = E[g_t²]·(1−β₂^t) + ζ from the EMA recursion (Eq. 1–4). The (1−β₂^t) divisor is then read off this expectation. This is a straightforward algebraic identity, not a circular fit; nothing is fitted to data and then re-predicted.
- (3) §2.1 SNR / effective stepsize bounds: |Δ_t| ≤ α-style bounds follow from the algebra of m̂_t/√v̂_t. Independent content.
- (4) §4 / §10 Convergence: Theorem 4.1 derives an O(√T) regret bound from stated assumptions (bounded gradients, bounded iterate distance, β₁²/√β₂ < 1). The reader's concern is that the telescoping step in the proof of Theorem 10.5 implicitly assumes √(t·v̂_{t,i})/α is monotone non-decreasing — a soundness gap later exploited by Reddi et al. (2018). That is a *correctness* problem, not a circularity problem: the bound is not "the input renamed as the output" — it is an attempted proof from external assumptions that turns out to have an unjustified inequality. No quantity is fitted to the regret and then claimed as a prediction of the regret; no self-citation is load-bearing (the proof cites Zinkevich 2003's framework, not the authors' own prior work).
- (5) §5 Related work / §6 Experiments: comparisons to AdaGrad/RMSProp/SGD use independently implemented baselines on standard datasets (MNIST, IMDB, CIFAR-10). No fitted-input-as-prediction pattern.
- (6) §7 AdaMax: derived as the p→∞ limit of an L_p generalization (Eq. 6–12). Algebraic limit, not circular.
There is essentially no self-citation load: the references are to Duchi, Tieleman & Hinton, Zeiler, Sohl-Dickstein, Zinkevich, etc. The Kingma & Welling (2013) self-cite is only used to specify the VAE architecture used as a *test problem* in §6.4 — it is not load-bearing for any theoretical claim. Conclusion: the paper's core claims are self-contained against external benchmarks and standard online-convex machinery. The Theorem 4.1 issue is a real bug in the proof (correctly diagnosed by the reader), but it is a missing-step / unjustified-monotonicity flaw, not a circular derivation. Score: 1 (one minor self-citation, not load-bearing).
Lean theorems connected to this paper
- Foundation/PhiForcing.lean, Foundation/DimensionForcing.lean · theorems phi_equation, dimension_forced · link: unclear · "Good default settings for the tested machine learning problems are α = 0.001, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸."
- Cost/FunctionalEquation.lean · theorem washburn_uniqueness_aczel · link: unclear · "The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t) where the hyper-parameters β₁, β₂ ∈ [0,1) control the exponential decay rates of these moving averages."
- Foundation/DAlembert/Inevitability.lean · theorem bilinear_family_forced · link: unclear · "R(T) ≤ D²/(2α(1−β₁)) Σ_i √(T·v̂_{T,i}) + α(1+β₁)G_∞/((1−β₁)√(1−β₂)(1−γ)²) Σ_i ‖g_{1:T,i}‖₂ + ..."
- Foundation/LawOfExistence.lean · theorem law_of_existence · link: unclear · "We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement."
Forward citations
Cited by 60 Pith papers
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models
In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to gener...
-
Convergent Stochastic Training of Attention and Understanding LoRA
Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
-
SLayerGen: a Crystal Generative Model for all Space and Layer Groups
SLayerGen generates crystals invariant to any space or layer group via autoregressive lattice and Wyckoff sampling plus equivariant diffusion, achieving gains over bulk models on diperiodic materials after correcting ...
-
Random test functions, $H^{-1}$ norm equivalence, and stochastic variational physics-informed neural networks
H^{-1} norm equivalence to expected squared evaluations on domain-dependent random test functions enables SV-PINNs that recover accurate solutions to challenging second-order elliptic PDEs faster than standard PINNs.
-
A Parameter-Free First-Order Algorithm for Non-Convex Optimization with $\tilde{\mkern1mu O}(\epsilon^{-5/3})$ Global Rate
PF-AGD is the first parameter-free deterministic accelerated first-order method with Õ(ε^{-5/3} log(1/ε)) complexity for smooth non-convex optimization.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...
-
Qvine: Vine Structured Quantum Circuits for Loading High Dimensional Distributions
Qvine uses vine copula-inspired quantum circuit structures to achieve linear or quadratic depth scaling for loading high-dimensional distributions with high approximation quality.
-
Neural Spectral Bias and Conformal Correlators I: Introduction and Applications
Neural networks optimized solely on crossing symmetry reconstruct CFT correlators from minimal input data to few-percent accuracy across generalized free fields, minimal models, Ising, N=4 SYM, and AdS diagrams.
-
MMGait: Towards Multi-Modal Gait Recognition
MMGait provides a new multi-sensor gait dataset and OmniGait baseline to support single-modal, cross-modal, and unified multi-modal person identification from walking patterns.
-
Proton Structure from Neural Simulation-Based Inference at the LHC
Neural simulation-based inference on unbinned top-quark pair data at 13 TeV yields improved gluon PDF precision over traditional binned analyses while incorporating experimental and theoretical uncertainties.
-
Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate
Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.
-
CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification
The paper introduces the CMCC-ReID task, constructs the SYSU-CMCC benchmark dataset, and proposes the PIA network with disentangling and prototype modules that outperforms prior methods on combined modality and clothi...
-
Traces of Helium Detected in Type Ic Supernova 2014L
Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Offline Reinforcement Learning with Implicit Q-Learning
IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
-
Passage Re-ranking with BERT
Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
-
Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
-
Adaptive Computation Time for Recurrent Neural Networks
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
-
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
-
NICE: Non-linear Independent Components Estimation
NICE learns a composition of invertible neural-network layers that transform data into independent latent variables, enabling exact log-likelihood training and sampling for density estimation.
-
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
-
SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation
SEMIR replaces dense voxel computation with a learned topology-preserving graph minor that supports exact decoding and GNN-based inference for small-structure segmentation in large medical images.
-
Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale
Local neural operators on 3x3x3 patches, composed via Schwarz iteration, solve large-scale nonlinear elasticity on arbitrary geometries without domain-specific retraining.
-
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
-
Delightful Gradients Accelerate Corner Escape
Delightful Policy Gradient removes exponential corner trapping in softmax policy optimization for bandits and tabular MDPs, achieving logarithmic escape times and global O(1/t) convergence.
-
AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers
AccLock extracts user-specific features from in-ear ballistocardiogram signals via a disentanglement model and Siamese network to achieve average FAR of 3.13% and FRR of 2.99% in tests with 33 participants.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
Bin Latent Transformer (BiLT): A shift-invariant autoencoder for calibration-free spectral unmixing of turbid media
The BiLT autoencoder recovers absorption and scattering spectra from integrating sphere data with high accuracy while remaining robust to wavelength shifts up to 10 bands and generalizing to different instrument line ...
-
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
Asymmetric Langevin Unlearning uses public data to suppress unlearning noise costs by O(1/n_pub²), enabling practical mass unlearning with preserved utility under distribution mismatch.
-
Variational predictive resampling
Variational predictive resampling uses sequential imputation from variational predictives to generate samples whose distribution converges to the exact Bayesian posterior in Gaussian models and improves dependence cap...
-
Rank Is Not Capacity: Spectral Occupancy for Latent Graph Models
Spectra defines and controls effective capacity in graph embeddings via the Shannon effective rank of a trace-normalized kernel spectrum, making capacity a post-fit property rather than a pre-training hyperparameter.
-
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
-
Fixed-Point Neural Optimal Transport without Implicit Differentiation
A single-network fixed-point formulation for neural optimal transport eliminates adversarial min-max optimization and implicit differentiation while enforcing dual feasibility exactly.
-
LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapt...
-
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
-
Chebyshev Center-Based Direction Selection for Multi-Objective Optimization and Training PINNs
Update direction selection for PINN training is cast as a Chebyshev-center problem in the dual cone, yielding an efficient dual formulation with nonconvex convergence guarantees and automatic recovery of scale robustn...
-
End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor
An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.
-
Accelerating 3D Non-LTE Synthesis with Graph Neural Networks
Graph neural networks can approximate full 3D non-LTE Ca II populations in solar models with correlations above 0.99 and extreme computational efficiency.
-
Constitutive Priors for Inverse Design
A framework learns constitutive priors from noisy data to enable PDE-constrained inverse design of elastic networks using latent variables, homotopy continuation, Chamfer distance matching, and neural smoothness constraints.
-
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
-
Optimality of Sub-network Laplace Approximations: New Results and Methods
Sub-network Laplace approximations always underestimate full-model predictive variance, and two new gradient-based and greedy selection rules provide theoretically grounded improvements.
-
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
-
Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow
Spatio-Temporal MeanFlow adapts MeanFlow to PDEs by replacing the generative velocity field with the physical operator and extending the integral constraint to the spatio-temporal domain, yielding a unified solver for...
-
HairGPT: Strand-as-Language Autoregressive Modeling for Realistic 3D Hairstyle Synthesis
HairGPT reframes 3D hairstyle synthesis as dual-decoupled autoregressive strand sequence modeling with geometric tokenization for semantic control and rare style generation.
-
The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
-
NeuralBench: A Unifying Framework to Benchmark NeuroAI Models
NeuralBench is a new benchmarking framework for neuroAI models on EEG data that finds foundation models only marginally outperform task-specific ones while many tasks like cognitive decoding stay highly challenging.
-
Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
Retrieval from motion datasets combined with LLM task parsing and reward-guided noise initialization enables training-free diffusion optimization to satisfy severe spatiotemporal constraints in human motion generation.
-
Adaptive Domain Decomposition Physics-Informed Neural Networks for Traffic State Estimation with Sparse Sensor Data
ADD-PINN adaptively decomposes the spatial domain based on PINN residuals and a shock indicator to improve offline traffic state estimation under the LWR model, outperforming baselines in most sparse-sensor cases whil...
-
Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks
Self-supervised monocular depth estimation improves in low-texture regions by using distance transforms on jointly estimated pre-semantic contours to create more informative loss signals.
-
What Cohort INRs Encode and Where to Freeze Them
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes unifies gaze estimation for mixed reality by jointly predicting 3D eye parameters, segmentation, optical and visual axes, and depth maps from monocular or binocular inputs, supported by a new large-scale mult...
-
TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
-
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
-
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
-
Path-Coupled Bellman Flows for Distributional Reinforcement Learning
Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
-
IntentGrasp: A Comprehensive Benchmark for Intent Understanding
IntentGrasp benchmark demonstrates that LLMs have low intent understanding capabilities, with most models underperforming random guessing on a challenging subset, but Intentional Fine-Tuning provides large improvements.
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
Reference graph
Works this paper leans on
-
[1]
Natural gradient works efficiently in learning
Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2): 251--276, 1998
1998
-
[2]
Recent advances in deep learning for speech research at microsoft
Deng, Li, Li, Jinyu, Huang, Jui-Ting, Yao, Kaisheng, Yu, Dong, Seide, Frank, Seltzer, Michael, Zweig, Geoff, He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at microsoft. ICASSP 2013, 2013
2013
-
[3]
Adaptive subgradient methods for online learning and stochastic optimization
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121--2159, 2011
2011
-
[4]
Generating sequences with recurrent neural networks
Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013
-
[5]
Speech recognition with deep recurrent neural networks
Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.\ 6645--6649. IEEE, 2013
2013
-
[6]
Reducing the dimensionality of data with neural networks
Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504--507, 2006
2006
-
[7]
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups
Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6): 82--97, 2012a
2012
-
[8]
Improving neural networks by preventing co-adaptation of feature detectors
Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012b
-
[9]
Auto-Encoding Variational Bayes
Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes . In The 2nd International Conference on Learning Representations (ICLR), 2013
2013
-
[10]
Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp.\ 1097--1105, 2012
2012
-
[11]
Learning word vectors for sentiment analysis
Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp.\ 142--150. Association for Computational Linguistics, 2011
2011
-
[12]
Non-asymptotic analysis of stochastic approximation algorithms for machine learning
Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp.\ 451--459, 2011
2011
-
[13]
Revisiting natural gradient for deep networks
Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013
-
[14]
Acceleration of stochastic approximation by averaging
Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4): 838--855, 1992
1992
-
[15]
A fast natural newton method
Roux, Nicolas L and Fitzgibbon, Andrew W. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp.\ 623--630, 2010
2010
-
[16]
Efficient estimations from a slowly convergent robbins-monro process
Ruppert, David. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988
1988
-
[17]
No more pesky learning rates
Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. arXiv preprint arXiv:1206.1106, 2012
-
[18]
Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods
Sohl-Dickstein, Jascha, Poole, Ben, and Ganguli, Surya. Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp.\ 604--612, 2014
2014
-
[19]
On the importance of initialization and momentum in deep learning
Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp.\ 1139--1147, 2013
2013
-
[20]
Lecture 6.5 - RMSProp: Neural Networks for Machine Learning
Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, 2012
2012
-
[21]
Fast dropout training
Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp.\ 118--126, 2013
2013
-
[22]
Adadelta: an adaptive learning rate method
Zeiler, Matthew D. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012
-
[23]
Online convex programming and generalized infinitesimal gradient ascent
Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. 2003
2003