Recognition: no theorem link
Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo
Pith reviewed 2026-05-12 02:26 UTC · model grok-4.3
The pith
Cosine-gated scaling of stale pseudo-gradients yields a DiLoCo outer optimizer whose convergence bound depends only on the decay rate α
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that modulating incoming pseudo-gradients with the age-dependent factor σ(τ) = γ(τ) e^{-α τ} before they update Adam moments produces a staleness-aware outer optimizer for DiLoCo that converges with a bound independent of τ_max.
What carries the argument
The scaling factor σ(τ) = γ(τ) e^{-α τ} applied to each pseudo-gradient, where γ(τ) is the cosine gate that smoothly zeros contributions beyond a cutoff age.
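The gate and decay combine in a few lines. The raised-cosine form of γ(τ) below is an assumed instantiation (the review states only that the gate smoothly zeroes contributions past a cutoff), and the default α and cutoff values are illustrative:

```python
import math

def sigma(tau: float, alpha: float = 0.1, tau_cut: float = 16.0) -> float:
    """Staleness scaling sigma(tau) = gamma(tau) * exp(-alpha * tau).

    gamma is modeled here as a raised-cosine gate that equals 1 at
    tau = 0 and falls smoothly to 0 at the cutoff age tau_cut
    (an assumed form, not stated explicitly in the review).
    """
    if tau >= tau_cut:
        return 0.0  # contributions older than the cutoff are zeroed
    gate = 0.5 * (1.0 + math.cos(math.pi * tau / tau_cut))
    return gate * math.exp(-alpha * tau)
```

Note that σ(0) = 1, so the scaling is a no-op for fresh pseudo-gradients, consistent with the claim that CGAD reduces to plain Adam at τ = 0.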
Load-bearing premise
The idealized gated-adaptive update studied in the convergence proof accurately models the dynamics of the complete CGAD algorithm when embedded in the actual DiLoCo outer loop.
What would settle it
Observe whether a 7B model trained under CGAD at delay τ=16 maintains loss below chance level; failure to do so would indicate the bound does not translate to the full practical setting.
Figures
Original abstract
Asynchronous DiLoCo systems may receive pseudo-gradients computed several outer rounds earlier, yet the standard Nesterov outer optimizer does not explicitly condition its update on per-update age. This can make the outer momentum buffer brittle under large controlled delays. We propose Cosine Gated Adam Decay (CGAD), a simple, drop-in, age-aware outer optimizer that scales each incoming pseudo-gradient by $\sigma(\tau) = \gamma(\tau) e^{-\alpha\tau}$ before it enters Adam's first- and second-moment buffers; the exponential models information decay and the cosine gate $\gamma(\tau)$ smoothly zeroes contributions past a chosen cutoff. CGAD reduces to plain Adam at $\tau=0$, adds two hyperparameters whose defaults transfer across scales, and extends to partial-sync schedulers via a per-fragment age-aware variant (PA-CGAD). For an idealized gated-adaptive update on smooth non convex objectives, we prove a non-asymptotic convergence bound whose staleness-bias term depends on $\alpha$ alone, rather than on the realized maximum delay $\tau_{\max}$; standard analyses of asynchronous momentum-SGD instead carry a $\tau_{\max}^2$ factor. Empirically, on Llama style language model pretraining at 25M, 1B, and 7B parameters, CGAD trains stably across the controlled delays we sweep. The cosine cutoff acts as scale insurance: the closest baseline, Adam Decay (CGAD without the cutoff), is competitive at 25M but its seed-to-seed $\sigma$ at $\tau=8$ grows 27x from 25M to 7B, pushing its single-shot risk (mean + $\sigma$) above the chance-level loss while CGAD's stays well below. The published Nesterov recipe is the least stable method on the full sweep.
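The update the abstract describes can be sketched as one outer step in which the σ(τ)-scaled pseudo-gradient, not the raw one, enters Adam's first- and second-moment buffers. This is a minimal per-parameter sketch under assumed hyperparameter defaults and an assumed raised-cosine gate, not the paper's exact implementation:

```python
import math

def cgad_outer_step(params, pseudo_grad, tau, state,
                    lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8,
                    alpha=0.1, tau_cut=16.0):
    """One CGAD outer update (a sketch; defaults are assumed, not the paper's).

    The incoming pseudo-gradient is scaled by sigma(tau) *before* it
    enters Adam's first- and second-moment buffers.
    """
    # sigma(tau) = gamma(tau) * exp(-alpha * tau); raised-cosine gate assumed
    gate = 0.0 if tau >= tau_cut else 0.5 * (1 + math.cos(math.pi * tau / tau_cut))
    s = gate * math.exp(-alpha * tau)

    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, pseudo_grad)):
        g = s * g  # staleness-aware scaling before moment accumulation
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params
```

At τ = 0 the scaling is 1 and the step is plain bias-corrected Adam; past the cutoff the pseudo-gradient contributes nothing, though the persistent buffers still decay toward zero.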
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Cosine-Gated Adam-Decay (CGAD), a drop-in outer optimizer for asynchronous DiLoCo that scales each incoming pseudo-gradient by σ(τ) = γ(τ) e^{-ατ} before it enters Adam's moment buffers, with the cosine gate γ(τ) providing a smooth cutoff. It proves a non-asymptotic convergence bound for an idealized gated-adaptive update on smooth non-convex objectives in which the staleness-bias term depends on α alone rather than on realized maximum delay τ_max. Empirically, CGAD is shown to train stably on Llama-style language model pretraining at 25M, 1B, and 7B scales across controlled delays, with the cosine cutoff acting as scale insurance relative to Adam Decay and the published Nesterov baseline.
Significance. If the idealized bound transfers to the implemented CGAD and the empirical stability claims are statistically supported, the method offers a practical, low-overhead way to stabilize outer optimization in decoupled DiLoCo without explicit dependence on τ_max. The non-asymptotic analysis for the idealized gated update and the reported hyperparameter transfer across three orders of magnitude in model size are concrete strengths that could inform asynchronous training practice.
major comments (2)
- [Abstract and theoretical analysis] Abstract and theoretical section: the non-asymptotic convergence bound is derived explicitly for an idealized gated-adaptive update that applies σ(τ) directly to the gradient before any moment accumulation. The actual CGAD inserts the scaling before Adam's m and v buffers and runs inside DiLoCo's decoupled outer loop with persistent state across outer steps; no analysis is given for the interaction of the exponential decay and cosine gate with the bias-corrected second-moment estimate or the parameter-server decoupling. This gap is load-bearing for the claim that the staleness-bias term depends on α alone in the deployed algorithm.
- [Empirical evaluation] Empirical evaluation: the abstract asserts stable training and transferability of the two hyperparameters across 25M–7B scales, yet provides no quantitative metrics, error bars, or statistical details for the experiments. The specific claim that Adam Decay's seed-to-seed σ at τ=8 grows 27× from 25M to 7B (pushing single-shot risk above chance-level loss) is presented without supporting tables or variance calculations, undermining assessment of the cosine cutoff's scale-insurance benefit.
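The gap flagged in the first major comment can be made concrete: the analyzed update scales the gradient of a single memoryless step, while the deployed update feeds the scaled gradient into persistent Adam moments that shape all later steps. A toy scalar sketch (bias correction omitted, hyperparameters assumed):

```python
def idealized_update(theta, g, s, lr=0.1):
    """Idealized gated-adaptive step analyzed in the proof:
    sigma scales the gradient of a single memoryless update."""
    return theta - lr * s * g

def deployed_update(theta, g, s, state, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8):
    """Deployed CGAD-style step: the sigma-scaled gradient enters
    persistent Adam moments, so a down-scaled stale gradient also
    alters future effective step sizes through m and v."""
    g = s * g
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    return theta - lr * state["m"] / (state["v"] ** 0.5 + eps)
```

Even on the first step the two rules produce different iterates, because the second moment normalizes the scaled gradient; with persistent state the divergence compounds across outer rounds, which is why the α-only bound does not automatically transfer.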
minor comments (2)
- Explicitly define the cosine cutoff function γ(τ) and the composite σ(τ) with all parameters in the main text rather than relying on the abstract.
- Clarify whether the partial-sync scheduler variant (PA-CGAD) inherits the same idealized convergence guarantee or requires a separate argument.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Abstract and theoretical analysis] Abstract and theoretical section: the non-asymptotic convergence bound is derived explicitly for an idealized gated-adaptive update that applies σ(τ) directly to the gradient before any moment accumulation. The actual CGAD inserts the scaling before Adam's m and v buffers and runs inside DiLoCo's decoupled outer loop with persistent state across outer steps; no analysis is given for the interaction of the exponential decay and cosine gate with the bias-corrected second-moment estimate or the parameter-server decoupling. This gap is load-bearing for the claim that the staleness-bias term depends on α alone in the deployed algorithm.
Authors: We agree that the convergence bound is stated for an idealized gated-adaptive update and does not analyze the full interactions present in the implemented CGAD (pre-Adam scaling, bias correction, and DiLoCo decoupling). The manuscript already qualifies the result as applying to the idealized case, but we will revise the theoretical section to more explicitly discuss this limitation and explain why the α-only dependence is expected to carry over approximately, as supported by the empirical results across scales. A full non-asymptotic analysis of the complete deployed algorithm lies beyond the scope of the current work. revision: partial
-
Referee: [Empirical evaluation] Empirical evaluation: the abstract asserts stable training and transferability of the two hyperparameters across 25M–7B scales, yet provides no quantitative metrics, error bars, or statistical details for the experiments. The specific claim that Adam Decay's seed-to-seed σ at τ=8 grows 27× from 25M to 7B (pushing single-shot risk above chance-level loss) is presented without supporting tables or variance calculations, undermining assessment of the cosine cutoff's scale-insurance benefit.
Authors: We accept that the abstract and main text would benefit from more explicit quantitative support. While the full manuscript contains experimental figures, we will add a dedicated table in the revised version that reports mean and standard deviation of final loss across seeds for each model scale and delay value. This table will directly document the reported 27× growth in seed-to-seed standard deviation for Adam Decay and the corresponding stability under CGAD, allowing readers to assess the scale-insurance claim with concrete statistics. revision: yes
- Deferred to future work: a complete non-asymptotic convergence analysis for the full CGAD implementation, including interactions with Adam bias correction and DiLoCo decoupling.
Circularity Check
No significant circularity; convergence bound is a standard first-principles derivation under explicit idealization
full rationale
The paper's central theoretical claim is a non-asymptotic convergence bound derived for an idealized gated-adaptive update on smooth non-convex objectives, with the staleness-bias term depending only on the hyperparameter α. This follows directly from standard analysis assumptions on the idealized update rule σ(τ) applied before moment buffers, without reducing to any fitted parameters, self-definitional equations, or load-bearing self-citations. The empirical results on Llama-style pretraining at multiple scales are presented separately and do not rely on the proof for their validity. No steps in the derivation chain match the enumerated circularity patterns; the idealized model is explicitly distinguished from the full CGAD + DiLoCo implementation.
Axiom & Free-Parameter Ledger
free parameters (2)
- α (exponential decay rate)
- cosine cutoff age
axioms (1)
- Domain assumption: the loss is smooth and non-convex
invented entities (1)
-
σ(τ) = γ(τ) e^{-ατ}
no independent evidence
Reference graph
Works this paper leans on
-
[1]
The Decoupled DiLoCo Team. Decoupled DiLoCo for Resilient Distributed Pre-training. arXiv:2604.21428, 2026
-
[2]
A. Douillard, Q. Feng, A. A. Rusu, R. Chhaparia, Y. Donchev, A. Kuncoro, M. Ranzato, A. Szlam, J. Shen. DiLoCo: Distributed Low-Communication Training of Language Models. ICML Workshop, 2024. arXiv:2311.08105
-
[3]
A. Douillard et al. Streaming DiLoCo with Overlapping Communication. arXiv:2501.18512, 2025
- [4]
-
[5]
S. Jaghouar, J. M. Ong, J. Hagemann. OpenDiLoCo. arXiv:2407.07852, 2024
-
[6]
A. Bhardwaj et al. Smoothing DiLoCo with Primal Averaging. arXiv:2512.17131, 2025
- [7]
- [8]
-
[9]
T. Ajanthan, S. Ramasinghe, G. Avraham, Y. Zuo, A. Long. Momentum Look-Ahead for Asynchronous Distributed Low-Communication Training. ICLR-W MCDC, 2025
- [10]
- [11]
-
[12]
K. Mishchenko, F. Bach, M. Even, B. Woodworth. Asynchronous SGD beats minibatch SGD under arbitrary delays. arXiv:2206.07638, 2022
-
[13]
S. U. Stich. Local SGD converges fast and communicates little.ICLR, 2019
- [14]
-
[15]
A. Koloskova, S. U. Stich, M. Jaggi. Sharper convergence guarantees for asynchronous SGD. arXiv:2206.08307, 2022
- [16]
- [17]
-
[18]
C. Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 2020
-
[19]
Y. Wang et al. FADAS: Federated Adaptive Asynchronous Optimization. arXiv:2407.18365, 2024
-
[20]
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063, 2024. arXiv:2104.09864
(Truncated extraction spillover from the paper's Appendix A, full proof of Theorem 1: by L-smoothness, F(θ_{t+1}) ≤ F(θ_t) + ⟨∇F(θ_t), θ_{t+1} − θ_t⟩ + (L/2)‖θ_{t+1} − θ_t‖²; the CGAD step is θ_{t+1} − θ_t = −η σ_t m̂_t/(√v̂_t + ε); by the Adam-ratio assumption, ‖m̂_t/(√v̂_t …)