pith. machine review for the scientific record.

arxiv: 1212.5701 · v1 · submitted 2012-12-22 · 💻 cs.LG

Recognition: unknown

ADADELTA: An Adaptive Learning Rate Method

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords method, gradient, learning rate, adadelta, descent, information, adaptive
0 comments
read the original abstract

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
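
The abstract above describes per-dimension adaptation using only first-order information and no manually tuned learning rate. The following is a minimal NumPy sketch of that style of update, with decaying accumulators of squared gradients and squared updates; the decay rho, the constant eps, and the toy objective are illustrative assumptions, not values taken from the abstract.

    import numpy as np

    def adadelta_step(x, grad, acc_g2, acc_dx2, rho=0.95, eps=1e-6):
        """One ADADELTA-style update for a parameter vector x and its gradient."""
        # Decaying accumulator of squared gradients (first-order information only).
        acc_g2 = rho * acc_g2 + (1.0 - rho) * grad ** 2
        # Per-dimension step: RMS of past updates divided by RMS of gradients.
        dx = -np.sqrt(acc_dx2 + eps) / np.sqrt(acc_g2 + eps) * grad
        # Decaying accumulator of squared updates.
        acc_dx2 = rho * acc_dx2 + (1.0 - rho) * dx ** 2
        return x + dx, acc_g2, acc_dx2

    # Toy usage: descend f(x) = 0.5 * ||x||^2, whose gradient is x.
    x = np.random.randn(5)
    acc_g2, acc_dx2 = np.zeros_like(x), np.zeros_like(x)
    for _ in range(1000):
        x, acc_g2, acc_dx2 = adadelta_step(x, grad=x, acc_g2=acc_g2, acc_dx2=acc_dx2)

Note that no global learning rate appears: the per-dimension step size is set entirely by the ratio of the two accumulators.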

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Neural Machine Translation of Rare Words with Subword Units

    cs.CL 2015-08 accept novelty 8.0

    Subword segmentation via byte pair encoding enables open-vocabulary neural machine translation and improves BLEU scores by 1.1 on English-German and 1.3 on English-Russian WMT 2015 tasks over dictionary back-off baselines.

  2. Neural Machine Translation by Jointly Learning to Align and Translate

    cs.CL 2014-09 accept novelty 8.0

    An attention-based encoder-decoder model achieves English-to-French translation performance comparable to phrase-based systems by automatically learning soft alignments.

  3. Adam: A Method for Stochastic Optimization

    cs.LG 2014-12 accept novelty 7.5

    A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes (a short sketch of this update follows the list).

  4. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}) (a sketch of spectral clipping follows the list).

  5. When Descent Is Too Stable: Event-Triggered Hamiltonian Learning to Optimize

    cs.LG 2026-05 unverdicted novelty 7.0

    SHAPE lifts gradient descent to an augmented phase space with a learned Hamiltonian vector field and event-triggered port updates to balance descent, exploitation, and exploration, improving best-so-far performance ov...

  6. Universal Adaptive Proximal Gradient Methods via Gradient Mapping Accumulation

    math.OC 2026-05 unverdicted novelty 7.0

    A universal adaptive proximal gradient method converges at rates matching standard proximal gradient methods up to logarithmic factors for three problem classes without requiring knowledge of problem parameters.

  7. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    cs.CL 2014-06 unverdicted novelty 7.0

    RNN Encoder-Decoder learns semantically meaningful phrase representations whose conditional probabilities improve statistical machine translation when added to log-linear models.

  8. VISTA: Decentralized Machine Learning in Adversary Dominated Environments

    cs.LG 2026-05 unverdicted novelty 6.0

    VISTA adaptively tunes consistency thresholds in decentralized SGD so that the system converges asymptotically like standard SGD even when adversaries dominate the worker pool.

  9. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  10. SGDR: Stochastic Gradient Descent with Warm Restarts

    cs.LG 2016-08 accept novelty 6.0

    SGDR uses periodic warm restarts of the learning rate in SGD to reach new state-of-the-art error rates of 3.14% on CIFAR-10 and 16.21% on CIFAR-100 (a sketch of the restart schedule follows the list).

  11. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  12. NeuroPlastic: A Plasticity-Modulated Optimizer for Biologically Inspired Learning Dynamics

    cs.LG 2026-04 unverdicted novelty 5.0

    NeuroPlastic is a gradient-based optimizer augmented with a multi-signal plasticity modulation mechanism that improves performance over standard updates on image classification tasks, especially in low-data regimes.

  13. Harmonizing MR Images Across 100+ Scanners: Multi-site Validation with Traveling Subjects and Real-world Protocols

    eess.IV 2026-04 conditional novelty 5.0

    HACA3+ improves upon HACA3 with better artifact encoding, attention mechanisms, and training on 100+ scanners, validated via traveling subjects for better downstream performance.
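
Item 3 above summarizes the Adam update; the following is an illustrative NumPy sketch of that rule. The step size, decay rates, and eps are the commonly used defaults and are assumptions here, not values stated in the summary.

    import numpy as np

    def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam-style update at step t (t starts at 1)."""
        m = beta1 * m + (1.0 - beta1) * grad          # moving average of the gradient
        v = beta2 * v + (1.0 - beta2) * grad ** 2     # moving average of its square
        m_hat = m / (1.0 - beta1 ** t)                # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v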
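
Item 4 describes clipping the leading singular values of matrix-valued gradients. A hedged sketch of one way to do that is below; the threshold tau and this exact formulation are illustrative assumptions, not the cited paper's precise rule.

    import numpy as np

    def spectral_clip(G, tau):
        """Cap the singular values of a gradient matrix G at tau."""
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        return (U * np.minimum(s, tau)) @ Vt          # reconstruct with the clipped spectrum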
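
Item 10 describes periodic warm restarts of the learning rate. The sketch below shows a cosine-annealed schedule within each restart period, which is the schedule SGDR uses between restarts; the specific lr_min, lr_max, and period length are illustrative assumptions.

    import numpy as np

    def sgdr_lr(t_cur, T_i, lr_min=0.0, lr_max=0.1):
        """Learning rate after t_cur steps inside a restart period of length T_i."""
        return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + np.cos(np.pi * t_cur / T_i))

    # At t_cur = 0 (just after a restart) the rate equals lr_max; it anneals to
    # lr_min by the end of the period, and the next restart resets t_cur to 0.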