
arxiv: 2506.05249 · v4 · submitted 2025-06-05 · 💻 cs.LG · math.OC

On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

Pith reviewed 2026-05-19 10:46 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords transformer · gradient descent · linear convergence · residual connections · self-attention · softmax · optimization · conditioning

The pith

Gradient descent converges linearly on single-layer Transformers with residual connections under suitable initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gradient descent on a complete single-layer Transformer, including self-attention, feedforward network, and residual connections, achieves linear convergence when weights start from an appropriate initialization. Convergence speed is controlled by the smallest and largest singular values of the matrix produced by the attention layer. Residual connections counteract the ill-conditioning that arises from the low-rank structure created by the softmax operation, which in turn supports more stable optimization. The same linear rate is shown to hold when the analysis is extended to multi-layer Transformers. A sympathetic reader would care because this supplies a concrete mechanism explaining why residual connections help training succeed in practice.

Core claim

We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization.

What carries the argument

The output matrix of the attention layer, whose singular values set the linear convergence rate of gradient descent; residual connections improve its conditioning to offset the low-rank effect of softmax.
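
The conditioning claim can be illustrated numerically. The following is a minimal sketch, not the paper's construction: the attention form `softmax(X W_q (X W_k)^T / sqrt(d)) X W_v`, the sizes `n` and `d`, the Gaussian input, and the weight scale 0.02 are all assumptions chosen for illustration. It compares the condition number (largest over smallest singular value) of the attention output with and without the residual term `X` added.

```python
import numpy as np

def softmax_rows(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def attention_output(X, W_q, W_k, W_v, residual):
    # Assumed single-head attention; the paper's exact parameterization may differ.
    d = X.shape[1]
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
    out = softmax_rows(scores) @ X @ W_v
    return X + out if residual else out

def condition_number(M):
    s = np.linalg.svd(M, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

rng = np.random.default_rng(0)
n, d = 16, 8                      # hypothetical sequence length and width
X = rng.standard_normal((n, d))   # hypothetical input tokens
W_q, W_k, W_v = (0.02 * rng.standard_normal((d, d)) for _ in range(3))

cond_plain = condition_number(attention_output(X, W_q, W_k, W_v, residual=False))
cond_res = condition_number(attention_output(X, W_q, W_k, W_v, residual=True))
print(f"no residual: {cond_plain:.1f}   with residual: {cond_res:.1f}")
```

With small weights the softmax is close to uniform, so the attention matrix is close to rank one and the output without the residual is badly conditioned. Adding `X` restores singular values comparable to those of the input.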

If this is right

  • Linear convergence holds for the structurally complete single-layer Transformer.
  • Residual connections directly reduce ill-conditioning caused by softmax.
  • The linear convergence guarantee carries over to multi-layer Transformers under matching initialization.
  • Empirical runs confirm that residual connections promote convergence stability as predicted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Removing residual connections would likely produce slower or unstable convergence because of worse matrix conditioning.
  • The singular-value condition could be used to design or verify initializations that keep training well-behaved.
  • The same conditioning argument might apply to other architectures that rely on low-rank operations similar to softmax.

Load-bearing premise

The proof depends on a specific appropriate initialization of the model weights that makes the singular-value bounds and conditioning improvement hold.

What would settle it

Train a single-layer Transformer from the paper's specified initialization and check whether the training loss decays geometrically (a straight line on a log-scale plot), with a per-step contraction factor consistent with the ratio of the maximum to minimum singular value of the attention-layer output matrix.
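
One way to run such a check is to fit a line through the logarithm of the loss and read off the per-step contraction factor. The sketch below uses a stand-in least-squares problem, not the paper's Transformer, because for that problem the worst-case factor is known in closed form from the singular values. The helper `fit_linear_rate`, the problem sizes, and the step size `1 / sigma_max**2` are assumptions made for illustration.

```python
import numpy as np

def fit_linear_rate(losses):
    """Per-step contraction factor from a least-squares line through log(loss)."""
    steps = np.arange(len(losses))
    slope, _ = np.polyfit(steps, np.log(losses), 1)
    return float(np.exp(slope))

# Stand-in problem (NOT the paper's model): least squares with a consistent
# target, so the optimal loss is exactly zero.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
w_star = rng.standard_normal(5)
b = A @ w_star

s = np.linalg.svd(A, compute_uv=False)
sigma_max, sigma_min = s[0], s[-1]
step = 1.0 / sigma_max**2         # step size 1/L for this quadratic

w = np.zeros(5)
losses = []
for _ in range(50):
    r = A @ w - b                 # residual at the current iterate
    losses.append(0.5 * float(r @ r))
    w -= step * (A.T @ r)         # plain gradient descent step

rate_hat = fit_linear_rate(losses)
# Worst-case per-step factor for this quadratic with step 1/sigma_max^2.
rate_bound = (1.0 - (sigma_min / sigma_max) ** 2) ** 2
print(f"fitted factor: {rate_hat:.4f}   worst-case factor: {rate_bound:.4f}")
```

For the Transformer experiment, `losses` would be replaced by the recorded training losses, and the comparison value would come from the paper's rate expression rather than from `rate_bound`.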

Original abstract

Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the convergence of gradient descent for a single-layer Transformer consisting of self-attention, feedforward network, and residual connections. It claims that, under an appropriate initialization, GD exhibits linear convergence whose rate is governed by the minimum and maximum singular values of the attention-layer output matrix. Residual connections are shown to mitigate the ill-conditioning induced by the low-rank structure of the softmax, and the linear-convergence result is extended to the multi-layer case with supporting experiments.

Significance. If the central claims hold, the work supplies one of the first end-to-end convergence analyses of a structurally complete Transformer block, explicitly linking residual connections to improved conditioning of the attention output. The explicit dependence of the rate on singular values of the attention matrix and the empirical corroboration constitute concrete, falsifiable predictions that could guide initialization practice and architectural analysis.

major comments (2)
  1. [Abstract and §3] The linear convergence rate is expressed directly in terms of the min/max singular values of the attention output matrix produced by the forward pass itself. This creates a circularity risk: the quantity used to bound convergence speed is defined by the same model whose training dynamics are being analyzed, and it is not shown that these singular values remain bounded away from zero after the first gradient step once the FFN and residual branches become active.
  2. [§2–3 (Initialization)] The central claim requires an 'appropriate initialization' that directly posits σ_min/σ_max ≳ constant > 0 for the post-softmax attention output. This premise is stated as the condition under which linear convergence holds but is not derived from standard random initialization (e.g., Gaussian with variance 1/d or 2/d) nor shown to be invariant under gradient updates. Without an explicit basin or invariance argument, the claimed conditioning benefit of residuals cannot be guaranteed to survive training.
minor comments (2)
  1. [Notation] Clarify whether the 'output matrix from the attention layer' is taken before or after the residual addition; the distinction affects both the singular-value bounds and the claimed conditioning improvement.
  2. [Experiments] The empirical section should report the observed singular-value ratio of the attention output at initialization and after a few steps, together with an ablation on initialization scale, to test the persistence of the theoretical assumption.
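
The initialization-scale ablation requested in the second minor comment could start from a sweep like the one below. It is a sketch under assumptions: the single-head attention form, the Gaussian input, the sizes, and the scale grid are hypothetical, and it reports the ratio σ_min/σ_max only at initialization. Tracking the ratio after gradient steps would need the paper's full model and loss, which are not reproduced here.

```python
import numpy as np

def softmax_rows(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sv_ratio(M):
    # sigma_min / sigma_max, a value in [0, 1]; larger means better conditioned.
    s = np.linalg.svd(M, compute_uv=False)
    return float(s[-1] / s[0])

def ratios_at_init(scale, n=16, d=8, seed=0):
    # Hypothetical setup: Gaussian tokens and Gaussian weights with std `scale`.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    W_q, W_k, W_v = (scale * rng.standard_normal((d, d)) for _ in range(3))
    attn = softmax_rows((X @ W_q) @ (X @ W_k).T / np.sqrt(d)) @ X @ W_v
    return sv_ratio(attn), sv_ratio(X + attn)

results = {}
for scale in (0.01, 0.1, 1.0):
    plain, res = ratios_at_init(scale)
    results[scale] = (plain, res)
    print(f"scale={scale:<5} no residual: {plain:.2e}   with residual: {res:.2e}")
```

Reporting both columns side by side also addresses the notation comment, since it makes explicit whether the ratio is measured before or after the residual addition.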

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, clarifying the scope of our assumptions and the role of residual connections. We are prepared to revise the manuscript to improve clarity on these points.

  1. Referee: [Abstract and §3] The linear convergence rate is expressed directly in terms of the min/max singular values of the attention output matrix produced by the forward pass itself. This creates a circularity risk: the quantity used to bound convergence speed is defined by the same model whose training dynamics are being analyzed, and it is not shown that these singular values remain bounded away from zero after the first gradient step once the FFN and residual branches become active.

    Authors: We acknowledge the referee's concern regarding potential circularity. Our analysis establishes linear convergence under the assumption that the iterates remain in a neighborhood of the initialization where the singular values of the attention output matrix are bounded away from zero and infinity. The residual connections are shown to mitigate the ill-conditioning induced by the softmax, thereby helping to preserve this neighborhood for the initial phase of training. We do not claim global invariance of the singular values under arbitrary updates; the result is local in nature. We will revise §3 to explicitly state the local basin assumption and add a brief discussion of how the residual branch contributes to staying within this region during early iterations.

    revision: partial

  2. Referee: [§2–3 (Initialization)] The central claim requires an 'appropriate initialization' that directly posits σ_min/σ_max ≳ constant > 0 for the post-softmax attention output. This premise is stated as the condition under which linear convergence holds but is not derived from standard random initialization (e.g., Gaussian with variance 1/d or 2/d) nor shown to be invariant under gradient updates. Without an explicit basin or invariance argument, the claimed conditioning benefit of residuals cannot be guaranteed to survive training.

    Authors: We agree that the 'appropriate initialization' is an explicit assumption rather than a derived property of standard Gaussian initializations. The manuscript does not prove that typical random initializations automatically satisfy the bounded condition number; instead, it shows that when this condition holds at initialization, the residual connections improve the conditioning of the attention output and enable the linear convergence guarantee. We will add a clarifying paragraph in §2 explaining that the initialization can be achieved by appropriate scaling of standard random weights (consistent with common practice) and that our experiments use such scaled initializations. A full invariance argument throughout training is not provided and would require additional technical machinery beyond the current scope.

    revision: yes

standing simulated objections not resolved
  • Proving that the singular values of the attention output matrix remain bounded away from zero and infinity for the entire training trajectory when starting from unmodified standard random initializations without additional assumptions on the basin of attraction.
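
What a basin argument would need to control can be stated with a standard perturbation bound for singular values (Weyl's inequality). The note below is a sketch in notation assumed here, not taken from the paper: M_0 is the attention-layer output at initialization, M_t the same matrix after t gradient steps, and κ(M) = σ_max(M)/σ_min(M), with σ_min(M_0) > 0.

```latex
% Assumed notation: M_0 = attention output at initialization,
% M_t = attention output after t steps, \kappa(M) = \sigma_{\max}(M)/\sigma_{\min}(M).
\begin{align*}
\sigma_{\min}(M_t) &\ge \sigma_{\min}(M_0) - \lVert M_t - M_0 \rVert_2 , \\
\sigma_{\max}(M_t) &\le \sigma_{\max}(M_0) + \lVert M_t - M_0 \rVert_2 .
\end{align*}
% Consequently, if \lVert M_t - M_0 \rVert_2 \le \tfrac{1}{2}\,\sigma_{\min}(M_0), then
\[
\kappa(M_t)
\;\le\;
\frac{\sigma_{\max}(M_0) + \tfrac{1}{2}\sigma_{\min}(M_0)}{\tfrac{1}{2}\sigma_{\min}(M_0)}
\;=\; 2\,\kappa(M_0) + 1 .
\]
```

Under this reading, the unresolved objection reduces to showing that the drift ‖M_t − M_0‖_2 stays below half of σ_min(M_0) along the whole trajectory, which is the step the rebuttal leaves open.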

Circularity Check

0 steps flagged

No significant circularity; conditional convergence analysis is self-contained


The paper establishes a linear convergence guarantee for gradient descent on a single-layer (and extended multi-layer) Transformer under an explicitly stated 'appropriate initialization' assumption. The convergence rate is expressed in terms of the min/max singular values of the attention-layer output matrix, which is the standard way optimization analyses parameterize the effective strong-convexity or PL constant along the trajectory; this does not reduce the claimed result to a tautology or to a fitted quantity renamed as a prediction. Residual connections are shown to counteract the rank deficiency induced by softmax, improving the conditioning ratio—an algebraic relationship derived from the forward-pass structure rather than smuggled in by self-citation or ansatz. No load-bearing step equates the target convergence statement to its own inputs by construction, and the initialization condition is treated as a premise rather than derived from the dynamics being analyzed. The derivation therefore remains independent of the quantities it bounds.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard optimization assumptions for linear convergence of gradient descent and on domain properties of the softmax and residual addition; no new entities are postulated.

free parameters (1)
  • initialization scale and distribution
    Appropriate initialization is required for the singular-value bounds and linear rate to hold; its precise form is not derived from first principles.
axioms (2)
  • [standard math] Gradient descent exhibits linear convergence when the loss satisfies certain smoothness and strong-convexity-like conditions around the initialization.
    Invoked to obtain the linear rate from the singular-value bounds.
  • [domain assumption] The attention output matrix has low-rank structure induced by the softmax operation.
    Used to explain the source of ill-conditioning that residuals are claimed to ameliorate.
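
For reference, one common form of the first axiom is the Polyak-Łojasiewicz (PL) condition combined with smoothness. The derivation below is the textbook argument, given as a sketch; whether the paper uses exactly this condition is an assumption. The symbols f (training loss), f^* (its infimum), L (smoothness constant), and μ (PL constant) are introduced here for illustration, and the paper's own constants in terms of singular values are not reproduced.

```latex
% Assumed symbols: f = training loss, f^* = \inf f,
% L = smoothness constant, \mu = PL constant, step size 1/L.
\begin{align*}
x_{k+1} &= x_k - \tfrac{1}{L}\,\nabla f(x_k)
  && \text{(gradient descent step)} \\
f(x_{k+1}) &\le f(x_k) - \tfrac{1}{2L}\,\lVert \nabla f(x_k) \rVert^2
  && \text{($L$-smoothness)} \\
\lVert \nabla f(x_k) \rVert^2 &\ge 2\mu\,\bigl(f(x_k) - f^*\bigr)
  && \text{(PL inequality)} \\
f(x_{k+1}) - f^* &\le \Bigl(1 - \tfrac{\mu}{L}\Bigr)\bigl(f(x_k) - f^*\bigr)
  && \text{(combine the two)}
\end{align*}
```

The contraction factor 1 − μ/L is where a conditioning ratio enters: a larger μ relative to L gives faster linear convergence, and both inequalities only need to hold along the iterates actually visited.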

pith-pipeline@v0.9.0 · 5747 in / 1539 out tokens · 51642 ms · 2026-05-19T10:46:18.469571+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Adapt: In-Context Learning Beyond Stationarity

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.