arxiv: 2605.18855 · v1 · pith:WX22IF2Onew · submitted 2026-05-13 · 💻 cs.LG · cs.CV

Delta Attention Residuals

Pith reviewed 2026-05-20 20:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords delta attention residuals · attention residuals · residual connections · transformer models · language modeling · cross-layer routing · model scaling

The pith

Attending over layer deltas rather than cumulative states produces higher-contrast routing in attention residuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard attention residuals still suffer from redundant cumulative hidden states that cause low-contrast attention weights in deeper layers. Delta Attention Residuals instead route the changes introduced by each sublayer, which are more diverse and lead to sharper attention distributions. This change improves validation perplexity across model sizes from 220 million to 7.6 billion parameters. The approach also allows converting existing pretrained models to the new residual style through fine-tuning.
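
The redundancy argument can be illustrated with a toy residual stream. The sketch below is not the paper's code: it assumes each layer adds an independent Gaussian update, and the helper names (`toy_states`, `mean_pairwise_cosine`) are hypothetical. Under that assumption, cumulative states share most of their content and show high pairwise cosine similarity, while the deltas are close to orthogonal.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(sims) / len(sims)


def toy_states(num_layers=12, dim=64, seed=0):
    """Toy stream h_{i+1} = h_i + v_i with independent random v_i (an assumption)."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    states, deltas = [h], []
    for _ in range(num_layers):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h = [a + b for a, b in zip(h, v)]
        states.append(h)
        deltas.append(v)
    return states, deltas


states, deltas = toy_states()
cum_sim = mean_pairwise_cosine(states)    # cumulative states: highly similar
delta_sim = mean_pairwise_cosine(deltas)  # deltas: close to orthogonal
print(f"cumulative: {cum_sim:.2f}, deltas: {delta_sim:.2f}")
```

Real sublayer outputs are not independent, so the gap in a trained model may be smaller or larger than in this toy setting.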

Core claim

By replacing cumulative hidden states with deltas, defined as the difference between consecutive layer outputs, attention residuals reach a maximum attention weight of around 0.6 instead of 0.2. This lets deeper layers pick out informative states from previous layers more selectively and reduces routing collapse in deep networks.

What carries the argument

Delta representations, computed as a sublayer's output minus its input (v_i = h_{i+1} - h_i), which serve as the sources for the attention computation in the residual connection.
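
A minimal sketch of the delta computation and of softmax routing over delta sources, in plain Python. The dot-product scoring in `route` is an assumption made for illustration; the paper's actual query, key, and normalization choices are not described on this page, and all names here are hypothetical.

```python
import math


def deltas_from_states(states):
    """v_i = h_{i+1} - h_i for each pair of consecutive hidden states."""
    return [[b - a for a, b in zip(h0, h1)] for h0, h1 in zip(states, states[1:])]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def route(query, sources):
    """Hypothetical depth-wise routing: dot-product scores, softmax, weighted sum."""
    scores = [sum(q * s for q, s in zip(query, src)) for src in sources]
    weights = softmax(scores)
    mixed = [
        sum(w * src[d] for w, src in zip(weights, sources))
        for d in range(len(query))
    ]
    return weights, mixed


# Four toy hidden states in two dimensions (illustrative values only).
states = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.5, 2.0]]
deltas = deltas_from_states(states)
weights, mixed = route([0.0, 1.0], deltas)

# Deltas telescope: h_0 plus the sum of all deltas recovers the last state.
recovered = [h0 + sum(col) for h0, col in zip(states[0], zip(*deltas))]
print(deltas, weights, recovered)
```

The telescoping check at the end is why attending over deltas loses no information relative to cumulative states: the cumulative state is always recoverable as a sum of sources.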

If this is right

  • Delta Attention Residuals deliver consistent validation perplexity reductions of 1.7 to 8.2 percent compared to both standard additive residuals and cumulative attention residuals.
  • Higher contrast in attention weights enables more effective selective routing of information across layers at both sublayer and block levels.
  • Pretrained transformer checkpoints can be adapted to Delta Attention Residuals using ordinary fine-tuning procedures without starting from scratch.
  • The performance gains hold across the full range of tested scales from 220M to 7.6B parameters.
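
To put the reported range in loss terms, assume perplexity is the exponential of the mean cross-entropy in nats (the usual convention, though this page does not state it). A relative perplexity reduction r then corresponds to a loss gap of -ln(1 - r). The function names below are illustrative.

```python
import math


def relative_ppl_reduction(ppl_base, ppl_new):
    """Fractional perplexity reduction of ppl_new relative to ppl_base."""
    return (ppl_base - ppl_new) / ppl_base


def loss_gap_for_reduction(r):
    """Loss gap in nats implied by a relative perplexity reduction r,
    assuming perplexity = exp(mean cross-entropy)."""
    return -math.log(1.0 - r)


low = loss_gap_for_reduction(0.017)   # lower end of the reported range
high = loss_gap_for_reduction(0.082)  # upper end of the reported range
print(f"{low:.4f} to {high:.4f} nats")
```

Under that assumption, the 1.7 to 8.2 percent range corresponds to loss gaps of roughly 0.017 to 0.086 nats.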

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Delta-based routing might be combined with other layer-wise adaptation techniques to further improve efficiency in very deep models.
  • Similar delta concepts could be tested in non-transformer architectures where residual connections are used, to see if redundancy issues appear there as well.
  • If the diversity of deltas persists, this method could support scaling to models much larger than 7.6B while maintaining selective information flow.

Load-bearing premise

The structural diversity of delta representations remains the main driver of higher-contrast attention and continues to provide benefits without causing new training instabilities at scales beyond those tested.

What would settle it

Measure the attention-weight entropy or maximum weight in the deepest layers of a model trained with Delta Attention Residuals. If the max weight falls below 0.3 and the distribution is as close to uniform as that of cumulative attention residuals, the claimed advantage would be falsified.
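
The check described above could be scripted roughly as follows. This is a sketch under stated assumptions: the routing weights for one layer are available as a list of probabilities, the 0.3 threshold is taken from the text, and the two example distributions are invented for illustration rather than measured.

```python
import math


def entropy(weights):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def normalized_entropy(weights):
    """Entropy divided by its maximum, log(n); 1.0 means exactly uniform."""
    return entropy(weights) / math.log(len(weights))


def looks_collapsed(weights, max_threshold=0.3):
    """True when no single source receives at least max_threshold of the mass."""
    return max(weights) < max_threshold


# Hypothetical routing distributions over six sources.
diffuse = [0.2, 0.2, 0.15, 0.15, 0.15, 0.15]
sharp = [0.6, 0.1, 0.1, 0.1, 0.05, 0.05]

for name, w in [("diffuse", diffuse), ("sharp", sharp)]:
    print(name, max(w), round(normalized_entropy(w), 3), looks_collapsed(w))
```

Reporting normalized entropy alongside the max weight is a design choice here: the max weight alone cannot distinguish one dominant source from two or three sharing most of the mass.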

Figures

Figures reproduced from arXiv: 2605.18855 by Cheng Luo, Junjie Hu, Zefan Cai.

Figure 1: Source redundancy in cross-layer routing (Qwen3-0.6B, L=28). (a) Per-layer routing sharpness: AttnRes routing degrades to max weight ∼0.2 in deep layers, while Delta Block maintains sharp routing (∼0.6). (b) Routing quality: Delta Block achieves 1.8× higher average max weight (0.62 vs. 0.35). (c) Training loss: Delta Block (green) consistently outperforms AttnRes (red). This raises a critical yet underexpl…
Figure 2: Architecture comparison. (a) Standard Residuals: uniform additive accumulation with fixed unit coefficients. (b) Attention Residuals [Kimi, 2025]: learned softmax attention (weights w, aggregation a) over cumulative hidden states, but source redundancy degrades routing to max weight ∼0.2 in deep layers. (c) Delta Attention Residuals (ours): attends over per-sublayer delta outputs, maintaining sharp routing…
Figure 3: Delta Attention Residuals pseudocode. depth_route computes additive softmax routing over delta sources. Delta Block stores the embedding on first call and appends block deltas (change since last source); Delta AttnRes appends all sublayer outputs directly. every sublayer’s contribution as a distinct source, ensuring that no intermediate computation is lost to aggregation. 3. Safe initialization. At initial…
Figure 4: Routing analysis at Qwen3-0.6B scale (L=28, N=28). (a) Delta Block maintains sharp routing (max weight ∼0.6–1.0) while AttnRes degrades to ∼0.2 in deep layers. (b) Delta Block achieves 1.8× higher average max weight (0.62 vs. 0.35). (c) Training loss: Delta Block (green) consistently outperforms AttnRes (red). Scaling up: 8B parameters. We next scale to a Qwen3-8B-sized model (d=4096, L=36, 7.57B params) t…
Figure 5: Routing analysis (Qwen3-0.6B fine-tuned on FineWeb-Edu). (a) Per-layer routing sharpness: Delta Block maintains high max attention weight (∼0.87) throughout depth, while AttnRes degrades from 0.7 to 0.3. (b) Average routing quality: Delta Block achieves 1.8× higher max weight (0.87 vs. 0.49). (c) Validation loss: AttnRes (blue) starts higher due to initialization disruption and converges slower; Delta Bloc…
Figure 6: Learned routing weights (Qwen3-0.6B, from scratch). Left: AttnRes [Kimi, 2025] with cumulative states and replacement routing; attention becomes diffuse in deep layers due to source redundancy. Right: Delta Block (ours) with delta sources and additive routing; sharp cross-layer shortcuts, with deep layers selectively concentrating on specific early outputs. routing sharpness (max softmax weight) drops from ∼…
Figure 7: Original AttnRes [Kimi, 2025]: replacement routing over cumulative intra-block states, …
Figure 8: Delta Block schematic. Sublayer outputs within each block of B layers are summed into a single block delta Δ_b, which becomes one routing source. The current (in-progress) block contributes a partial delta. Compared to per-sublayer Delta AttnRes, Delta Block reduces the number of sources from 2L to ∼L/B, trading routing granularity for compute and memory efficiency. The residual stream h̃_l is preserved thr…
Original abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ≈0.2), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M–7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7–8.2% validation perplexity gains. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
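
The block-granularity variant mentioned in the abstract, and described in the Figure 8 caption as summing sublayer outputs within each block of B layers, can be sketched as below. This is an illustration, not the released code: it assumes two sublayers per layer (consistent with the 2L source count in the caption), ignores the embedding source and the partial in-progress block, and uses hypothetical names and toy values.

```python
def block_deltas(sublayer_deltas, layers_per_block, sublayers_per_layer=2):
    """Sum consecutive sublayer deltas into one delta per block of layers."""
    step = layers_per_block * sublayers_per_layer
    blocks = []
    for start in range(0, len(sublayer_deltas), step):
        group = sublayer_deltas[start:start + step]
        # Element-wise sum across the sublayer deltas in this block.
        blocks.append([sum(col) for col in zip(*group)])
    return blocks


num_layers, dim = 28, 4
# Toy per-sublayer deltas: 2 * num_layers vectors with integer-valued entries.
sublayer = [[float(i + d) for d in range(dim)] for i in range(2 * num_layers)]
blocks = block_deltas(sublayer, layers_per_block=4)
print(len(sublayer), "->", len(blocks))
```

With L=28 and B=4 the source count drops from 56 to 7, and the block deltas still sum to the same total as the sublayer deltas, so only routing granularity is lost.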

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Delta Attention Residuals, an architectural modification to Attention Residuals in which the attention mechanism operates over delta representations (v_i = h_{i+1} - h_i) rather than cumulative hidden states. This change is motivated by the claim that cumulative states are redundant, leading to low-contrast attention (max weight ≈0.2); deltas are argued to be structurally diverse, producing higher-contrast distributions (max weight ≈0.6) and thereby more effective cross-layer routing. Empirical results report consistent validation perplexity reductions of 1.7–8.2% relative to both standard residuals and Attention Residuals across model scales from 220M to 7.6B parameters, with an additional claim that pretrained checkpoints can be converted via fine-tuning.

Significance. If the performance gains prove robust and the mechanistic account is substantiated, the work offers a lightweight, parameter-free-in-principle change to residual routing that could improve information flow in deep transformers. The reported scale range and the practical conversion procedure are positive features; the public code release supports reproducibility.

major comments (2)
  1. [Experimental section] The central explanatory claim, that structural diversity of deltas produces higher-contrast attention which in turn drives the perplexity gains, rests on comparisons of complete architectures. No ablation holds the attention mechanism fixed while varying only the redundancy of the attended representations (e.g., deltas versus other decorrelated inputs). Without such isolation, it remains unclear whether the observed increase in max attention weight is causal or merely correlated with other properties of the delta inputs.
  2. [Results section] The reported 1.7–8.2% perplexity gains are presented without error bars, multiple random seeds, or statistical significance tests. Given the range of model scales and the modest size of some gains, this omission weakens confidence that the improvements are reliably attributable to the proposed change rather than optimization variance.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise citation or one-sentence definition of the baseline 'Attention Residuals' architecture for readers who have not encountered the prior work.
  2. [Method] Notation for the delta computation (v_i = h_{i+1} - h_i) is clear in the abstract but should be restated with explicit indexing when first introduced in the method description to avoid any ambiguity about sublayer versus block granularity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Experimental section] The central explanatory claim, that structural diversity of deltas produces higher-contrast attention which in turn drives the perplexity gains, rests on comparisons of complete architectures. No ablation holds the attention mechanism fixed while varying only the redundancy of the attended representations (e.g., deltas versus other decorrelated inputs). Without such isolation, it remains unclear whether the observed increase in max attention weight is causal or merely correlated with other properties of the delta inputs.

    Authors: We agree that a controlled ablation isolating the effect of input redundancy while holding the attention mechanism fixed would strengthen the causal interpretation. Our current results compare full architectures and show that delta inputs produce higher max attention weights together with consistent perplexity improvements. To address the concern directly, the revised manuscript will include a new ablation that fixes the attention module and varies only the attended representations: cumulative hidden states, delta representations, and additional decorrelated variants (e.g., layer-normalized states and random projections). This will clarify whether the contrast and performance differences arise specifically from the structural properties of deltas. revision: yes

  2. Referee: [Results section] The reported 1.7–8.2% perplexity gains are presented without error bars, multiple random seeds, or statistical significance tests. Given the range of model scales and the modest size of some gains, this omission weakens confidence that the improvements are reliably attributable to the proposed change rather than optimization variance.

    Authors: We acknowledge that reporting variability across seeds would increase confidence in the results. Training runs at the largest scales are computationally expensive, which is why we initially reported single-run outcomes. In the revision we will add multi-seed results (three independent seeds) with error bars for the 220M and 1B models. For the 7.6B scale we will retain the single-run numbers but note the directional consistency of gains across all five scales tested. A short discussion of this limitation will be added to the results section. revision: partial
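
The multi-seed reporting discussed above amounts to a mean and sample standard deviation per configuration. The sketch below uses Python's `statistics` module; the perplexity values are invented placeholders, not results from the paper, and the separation check is a crude heuristic rather than a significance test.

```python
import statistics


def summarize(perplexities):
    """Return (mean, sample standard deviation) over per-seed perplexities."""
    return statistics.mean(perplexities), statistics.stdev(perplexities)


# Hypothetical validation perplexities from three seeds per configuration.
baseline_runs = [18.9, 19.1, 19.0]
delta_runs = [18.0, 18.2, 18.4]

b_mean, b_std = summarize(baseline_runs)
d_mean, d_std = summarize(delta_runs)

gap = b_mean - d_mean
# Crude check: does the gap exceed the combined seed-to-seed spread?
separated = gap > (b_std + d_std)
print(f"baseline {b_mean:.2f} ± {b_std:.2f}, delta {d_mean:.2f} ± {d_std:.2f}, separated={separated}")
```

With only three seeds the standard deviation estimate is itself noisy, so a gap that barely clears this heuristic should not be read as conclusive.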

Circularity Check

0 steps flagged

No significant circularity detected; architectural change and empirical gains are independently defined and measured

full rationale

The paper defines Delta Attention Residuals via an explicit architectural substitution: attention is performed over per-sublayer deltas v_i = h_{i+1} - h_i instead of cumulative hidden states. This substitution is introduced as a design choice motivated by observed redundancy in cumulative states, not derived from any equation that presupposes the performance outcome. Reported improvements in attention contrast (max weight ≈0.6 versus ≈0.2) and validation perplexity (1.7–8.2% gains) are obtained from direct experimental comparison on held-out data across model scales. No fitted parameter is relabeled as a prediction, no self-citation chain supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in. The argument therefore stands on external benchmarks rather than on its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observation rather than derivation from first principles; no new mathematical axioms or invented entities are introduced beyond standard transformer assumptions.

axioms (1)
  • Domain assumption: standard transformer residual connections and attention mechanisms behave as described in prior literature.
    The paper builds directly on the Attention Residuals baseline without re-deriving its properties.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 7 internal anchors

  1. Attention Residuals. arXiv preprint arXiv:2603.15031.
  2. Deep residual learning for image recognition. CVPR.
  3. Attention is all you need. NeurIPS.
  4. Densely connected convolutional networks. CVPR.
  5. DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Average. arXiv preprint arXiv:2402.02622.
  6. Training very deep networks. NeurIPS.
  7. Hyper-Connections. arXiv preprint arXiv:2409.19606, 2024.
  8. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv preprint arXiv:2406.17557.
  9. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  10. Paperno, Denis; Kruszewski, Germ. The. ACL.
  11. HellaSwag: Can a Machine Really Finish Your Sentence? ACL.
  12. Clark, Peter; Cowhey, Isaac; Etzioni, Oren; Khot, Tushar; Sabharwal, Ashish; Schoenick, Carissa; Tafjord, Oyvind. Think you have Solved Question Answering? Try…
  13. Bisk, Yonatan; Zellers, Rowan; Gao, Jianfeng; Choi, Yejin; and others.
  14. Sakaguchi, Keisuke; Le Bras, Ronan; Bhagavatula, Chandra; Choi, Yejin.
  15. Clark, Christopher; Lee, Kenton; Chang, Ming-Wei; Kwiatkowski, Tom; Collins, Michael; Toutanova, Kristina.
  16. Measuring Massive Multitask Language Understanding. ICLR.
  17. Hu, Edward J; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu.
  18. ReZero is All You Need: Fast Convergence at Large Depth. UAI.
  19. MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections. ICML.
  20. On Layer Normalization in the Transformer Architecture. ICML.
  21. Shazeer, Noam.
  22. Decoupled Weight Decay Regularization. ICLR.
  23. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR.
  24. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR.
  25. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. NeurIPS.
  26. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
  27. Your Transformer is Secretly Linear. ACL.
  28. Realformer: Transformer Likes Residual Attention. ACL Findings.
  29. DeepNet: Scaling Transformers to 1,000 Layers. arXiv preprint arXiv:2203.00555.
  30. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.
  31. mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880.
  32. Residual Stream Duality in Modern Transformer Architectures. arXiv preprint arXiv:2603.16039.
  33. Deep Delta Learning. arXiv preprint arXiv:2601.00417.
  34. Contrastive Decoding: Open-ended Text Generation as Optimization. ACL.
  35. Chuang, Yung-Sung; Xie, Yujia; Luo, Hongyin; Kim, Yoon; Glass, James; He, Pengcheng.
  36. Tuning Language Models by Proxy. COLM.
  37. Gao, Leo; Tow, Jonathan; Abbasi, Baber; Biderman, Stella; Black, Sid; DiPofi, Anthony; Foster, Charles; Golding, Laurence; Hsu, Jeffrey; Le Noac'h, Alain; Li, Haonan; McDonell, Kyle; Muennighoff, Niklas; Ociepa, Chris; Phang, Jason; Reynolds, Laria; Schoelkopf, Hailey; Skowron, Aviya; Sutawika, Lintang... 2024.