Depth-Attention: Cross-Layer Value Mixing for Language Models
Pith reviewed 2026-06-28 05:49 UTC · model grok-4.3
The pith
Depth-Attention mixes values from earlier layers inside the attention module by reusing the existing key-value cache, improving perplexity and accuracy with no added parameters or state.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Depth-Attention performs cross-layer selection inside attention: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads, storing the depth-mixed values in the standard key-value cache slots.
What carries the argument
Cross-layer key attention at fixed token positions that mixes earlier-layer values into the current value vector before sequence self-attention.
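The mechanism can be sketched for a single head at one token position. This is an illustrative reading rather than the authors' implementation: the helper name `depth_mix`, the scaled dot-product scoring, and the choice to count the current layer's own key and value among the depth sources are all assumptions that the summary does not confirm.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_mix(query, source_keys, source_values):
    """Mix per-layer values at one token position (hypothetical helper).

    query:         current layer's query for this token, length d
    source_keys:   one key per depth source, each of length d
    source_values: one value per depth source, each of length d_v
    """
    scale = 1.0 / math.sqrt(len(query))  # assumed scaled dot-product scoring
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in source_keys]
    weights = softmax(scores)
    width = len(source_values[0])
    mixed = [sum(w * value[i] for w, value in zip(weights, source_values))
             for i in range(width)]
    return mixed, weights


# Toy inputs: two earlier layers plus the current layer (assumed to be a source).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mixed, weights = depth_mix(query, keys, values)
print(weights, mixed)
```

Under this reading, `mixed` is the vector that sequence self-attention would read in place of the layer's original value.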
If this is right
- Depth-Attention reaches the lowest perplexity among compared methods on the tested decoder scales.
- Average downstream accuracy rises by up to 2.3 points over the vanilla Transformer.
- The approach adds under 0.01 percent extra arithmetic FLOPs and zero additional persistent inference state.
- Improvements hold from 360M to 3B parameters and extend to looped Transformers.
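The "zero additional persistent inference state" point can be made concrete with a rough element count. The sketch below uses hypothetical sizes that are not taken from the paper, and it assumes, as a worst case for illustration only, that a hidden-state-based method keeps one hidden vector per token for every layer; actual baselines may store less.

```python
def kv_cache_elements(num_layers, seq_len, d_kv):
    """Elements held by a standard KV cache: one key and one value per layer and token."""
    return 2 * num_layers * seq_len * d_kv


def extra_hidden_state_elements(num_stored_layers, seq_len, d_model):
    """Extra elements if a method also keeps hidden states (assumed worst case)."""
    return num_stored_layers * seq_len * d_model


# Hypothetical sizes chosen for illustration only; they are not taken from the paper.
num_layers, seq_len, d_model, d_kv = 24, 2048, 1024, 256

vanilla = kv_cache_elements(num_layers, seq_len, d_kv)
# Depth-mixed values overwrite the existing value slots, so the count is unchanged.
depth_attention = vanilla
hidden_state_method = vanilla + extra_hidden_state_elements(num_layers, seq_len, d_model)

print(vanilla, depth_attention, hidden_state_method)
```

The gap between the last two numbers widens as `d_kv` shrinks relative to `d_model`, which is the situation grouped-query and multi-head latent attention create.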
Where Pith is reading between the lines
- The same cache-reuse pattern could be tested with grouped-query or multi-head latent attention to keep memory savings intact.
- Better selective reuse across depth might allow comparable performance with fewer total layers in some settings.
- The mixing step could be applied in encoder-only or encoder-decoder architectures to test whether the benefit generalizes beyond decoder-only language models.
Load-bearing premise
That depth-mixed values can replace the original values in the cache without disrupting training dynamics or requiring changes to the residual connections.
What would settle it
Training the same 1.5B and 3B Qwen3-style models with Depth-Attention and finding neither a reduction in perplexity nor a gain in average downstream accuracy relative to the vanilla baseline would falsify the performance claim.
Original abstract
Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference, a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache: the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Depth-Attention, an architectural change that performs cross-layer value mixing inside the attention module: the current layer's query attends exclusively over the keys of prior layers at the identical token position, and the resulting mixed value overwrites the layer's value slot in the KV cache for use by standard self-attention. The method reuses existing Q/K/V machinery with no added parameters or persistent inference state beyond the standard cache. On Qwen3-style decoders (360M–3B parameters) it reports the lowest perplexity and highest average downstream accuracy, with gains of up to 2.3 accuracy points over vanilla Transformers and superiority over other cross-layer baselines, at <0.01% extra arithmetic FLOPs; the gains are also stated to extend to looped Transformers.
Significance. If the reported empirical gains and efficiency properties hold under scrutiny, the work would be significant for its demonstration that selective depth-wise mixing can be realized with zero extra persistent state or parameters—an advantage that grows in importance for models that already compress the KV cache via GQA or MLA. The explicit reuse of the existing attention cache slots and the negligible depth-wise cost (depth ≪ sequence length) are cleanly argued strengths. The consistent scaling behavior from 360M to 3B and the looped-Transformer extension, if reproducible, would strengthen the case for the approach.
major comments (2)
- [Abstract, §4] Abstract and §4 (Experiments): the central empirical claim of superiority (lowest perplexity, +2.3 accuracy points, outperformance of strong baselines) is presented without error bars, number of random seeds, or statistical significance tests. This directly affects the load-bearing assertion that Depth-Attention is reliably better than the listed baselines.
- [§3, §4.1] §3 (Method) and §4.1 (Training details): the description states that depth-mixed values replace original values “in place” with no additional persistent state, yet the paper does not provide an explicit ablation or measurement confirming that the extra depth-wise attention (over ~24–32 layers) does not alter KV-cache eviction behavior or numerical stability under realistic batching and quantization.
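The reporting requested in the first major comment (seed count, spread, and a significance test) could look roughly like the sketch below. The numbers are toy placeholders, not results from the paper, and Welch's unequal-variance t statistic is one reasonable choice of test rather than something the manuscript specifies.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Placeholder numbers, NOT results from the paper: one score per random seed.
toy_method = [2.0, 3.0, 4.0]
toy_baseline = [1.0, 2.0, 3.0]

print(summarize(toy_method), summarize(toy_baseline))
t_stat, dof = welch_t(toy_method, toy_baseline)
print(t_stat, dof)
```

Turning the statistic into a p-value needs the t distribution's CDF, which a statistics library or a lookup table can supply.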
minor comments (2)
- [§3] Notation for the depth-attention operation (presumably Eq. (X) in §3) should be introduced with an explicit small example showing the shape of the depth-wise key matrix and how the mixed value is written back into the cache slot.
- [§4.1] The abstract refers to “Qwen3-style decoders” without clarifying whether any other modifications (e.g., RoPE scaling, GQA grouping) differ from the public Qwen2/Qwen3 checkpoints; this should be stated once in §4.1.
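The shape example requested in the first minor comment might look like the following toy layout. It is a sketch under assumptions: the nested-list cache, the helper names, and the inclusion of the current layer among the depth sources are hypothetical, and a plain average stands in for the attention-weighted mix.

```python
# Hypothetical toy sizes; a real cache would also carry batch and head axes.
NUM_LAYERS, SEQ_LEN, HEAD_DIM = 4, 3, 2


def make_cache():
    """cache[layer]["k"|"v"][position] is a vector of length HEAD_DIM."""
    return [
        {
            "k": [[0.0] * HEAD_DIM for _ in range(SEQ_LEN)],
            "v": [[float(layer)] * HEAD_DIM for _ in range(SEQ_LEN)],
        }
        for layer in range(NUM_LAYERS)
    ]


def depth_key_matrix(cache, layer, position):
    """Keys of layers 0..layer at one token position: shape [layer + 1, HEAD_DIM]."""
    return [cache[src]["k"][position] for src in range(layer + 1)]


def write_back(cache, layer, position, mixed_value):
    """Overwrite the layer's value slot in place; no new cache entry is created."""
    cache[layer]["v"][position] = list(mixed_value)


cache = make_cache()
layer, position = 2, 1
depth_keys = depth_key_matrix(cache, layer, position)
print(len(depth_keys), len(depth_keys[0]))

# Stand-in for the attention-weighted mix: a plain average of the source values.
sources = [cache[src]["v"][position] for src in range(layer + 1)]
mixed = [sum(column) / len(sources) for column in zip(*sources)]
write_back(cache, layer, position, mixed)
print(cache[layer]["v"][position])
```

The cache keeps the same number of layers and positions after the write-back; only the contents of one value slot change.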
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the empirical evaluation and implementation details. We agree that additional statistical rigor and verification ablations would improve the manuscript and will revise accordingly.
- Referee: [Abstract, §4] Abstract and §4 (Experiments): the central empirical claim of superiority (lowest perplexity, +2.3 accuracy points, outperformance of strong baselines) is presented without error bars, number of random seeds, or statistical significance tests. This directly affects the load-bearing assertion that Depth-Attention is reliably better than the listed baselines.
  Authors: We acknowledge this limitation in the current manuscript. In the revised version, we will rerun experiments with multiple random seeds, report means and standard deviations for key metrics, and include statistical significance tests (e.g., t-tests) comparing Depth-Attention to baselines to substantiate the superiority claims.
  Revision: yes
- Referee: [§3, §4.1] §3 (Method) and §4.1 (Training details): the description states that depth-mixed values replace original values “in place” with no additional persistent state, yet the paper does not provide an explicit ablation or measurement confirming that the extra depth-wise attention (over ~24–32 layers) does not alter KV-cache eviction behavior or numerical stability under realistic batching and quantization.
  Authors: The depth attention is computed per position over a fixed small set of prior layers and overwrites the value in the standard KV cache without increasing its size, so eviction behavior (which is sequence-length based) remains unchanged. That said, we did not include explicit measurements for quantization or batching effects. We will add such an ablation in the revision, evaluating perplexity and accuracy under different quantization schemes and batch configurations to confirm stability.
  Revision: yes
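One small piece of the promised quantization check can be sketched as a round-trip error measurement on a single cache slot. The symmetric per-vector int8 scheme and the vectors below are assumptions chosen for illustration; the rebuttal does not say which quantization schemes will be evaluated.

```python
def quantize_int8(vector):
    """Symmetric per-vector int8 quantization (assumed scheme, for illustration)."""
    largest = max(abs(x) for x in vector)
    scale = largest / 127.0 if largest > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to floats."""
    return [c * scale for c in codes]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical vectors: an original value and a depth-mixed value for one cache slot.
original_value = [0.50, -1.25, 0.75, 2.00]
mixed_value = [0.40, -0.90, 0.60, 1.10]

for name, vector in (("original", original_value), ("mixed", mixed_value)):
    codes, scale = quantize_int8(vector)
    error = max_abs_error(vector, dequantize(codes, scale))
    print(name, scale, error)
```

With this scheme the round-trip error is bounded by half the scale, so a full ablation would mainly need to show how the scale statistics of mixed values differ from those of original values, and whether that shifts perplexity.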
Circularity Check
No significant circularity
The paper introduces an architectural change (Depth-Attention) that reuses existing Q/K/V attention machinery to mix values across layers at the same token position, then reports empirical results on perplexity and downstream tasks for models from 360M to 3B parameters. All central claims are experimental comparisons against vanilla Transformers and other cross-layer baselines; no derivation chain, first-principles prediction, or fitted parameter is presented whose output is definitionally identical to its input. The efficiency statements (reuse of cache slots, <0.01% extra FLOPs, no added persistent state) follow directly from the stated implementation without circular reduction. No self-citation load-bearing steps or ansatz smuggling appear in the abstract or described mechanism.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard multi-head attention computation and residual stream addition are available as building blocks.
Reference graph
Works this paper leans on
- [1] GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901. 2023.
- [2] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- [3] The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027.
- [4] RealFormer: Transformer likes residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 929–943. 2021.
- [5] Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
- [6] Depth-recurrent attention mixtures: Giving latent reasoning the attention it deserves. arXiv preprint arXiv:2601.21582.
- [7] RACE: Large-scale ReAding Comprehension Dataset From Examinations. arXiv preprint arXiv:1704.04683.
- [8] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434.
- [9] Laurel: Learned augmented residual layer. arXiv preprint arXiv:2411.07501.
- [10] Deep contextualized word representations. arXiv preprint arXiv:1802.05365. 2018.
- [11] Attention residuals. arXiv preprint arXiv:2603.15031.
- [12] Crowdsourcing Multiple Choice Science Questions. arXiv preprint arXiv:1707.06209.
- [13] MUDDFormer: Breaking residual bottlenecks in transformers via multiway dynamic dense connections. arXiv preprint arXiv:2502.12170.
- [14] ResiDual: Transformer with dual residual connections. arXiv preprint arXiv:2304.14802.
- [15] mHC: Manifold-Constrained Hyper-Connections. arXiv preprint arXiv:2512.24880.
- [16] Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [17] HellaSwag: Can a Machine Really Finish Your Sentence? arXiv preprint arXiv:1905.07830.
- [18] Ancre: Adaptive neural connection reassignment for efficient depth scaling. arXiv preprint arXiv:2602.09009.
[19]
InInternational Conference on Learning Representations, volume 2025, pages 97183–97219
Hyper-connections. InInternational Conference on Learning Representations, volume 2025, pages 97183–97219. 13 Depth-Attention: Cross-Layer Value Mixing for Language Models A. Main Experiment Details For the main experiments in section 4, we first summarize the baseline-specific settings in Table
2025
-
[20]
mHC denotes manifold hyperconnec- tion. We choose the baseline-specific hyperparameters by following the recommended or commonly used practical configurations in the corresponding papers, while keeping the base architecture, data, optimizer, training budget, sequence length, and precision fixed across methods. For Dense- Former Pagliardini et al. (2024), ...
2024
-
[21]
All theoretical estimates report the additional cost beyond a vanilla Transformer decoder. B.1. Common Counting Assumptions We use the following notation. Let𝐿 be the number of Transformer layers,𝑑model be the hidden dimension, and𝑑kv be the total key-value dimension used by the grouped-query attention cache. For our 3B Qwen3-style configuration, we use 𝐿...
2048
-
[22]
Depth-Attention For a target layerℓ, Depth-Attention mixes value states from a depth source setDℓ
15 Depth-Attention: Cross-Layer Value Mixing for Language Models B.2. Depth-Attention For a target layerℓ, Depth-Attention mixes value states from a depth source setDℓ. Let 𝑀ℓ =|D ℓ | denote the number of depth sources at layerℓ. In our implementation, the depth source set is maintained with stride𝑠=24, giving the layer-wise asymptotic cost 𝑂((ℓ/𝑠+1)𝑑 kv)...
2048
-
[23]
Configuration / Hyperparameter 360M 500M 710M Model configuration Architecture Qwen-style decoder Qwen-style decoder Qwen-style decoder Hidden size 960 1152 1280 Intermediate size 2540 3072 3072 Number of hidden layers 24 24 32 Number of attention heads 15 18 20 Number of key-value heads 15 18 20 GQA group size 1 1 1 Activation function SiLU SiLU SiLU Voc...
2048