DeepCrossAttention: Supercharging transformer residual connections. arXiv preprint arXiv:2502.06785
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG

Representative citing papers:
- Gradient Boosting within a Single Attention Layer
  Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over standard attention (a sketch of the two-pass idea follows this list).
- Hyperloop Transformers
  Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop (a sketch of the looping idea follows this list).
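
The first entry describes a boosting-style correction inside a single attention layer. Below is a minimal PyTorch sketch of that idea, not the paper's implementation: the class name `BoostedAttention`, the choice to let the second pass attend from the first pass's output back to the input, and the learned shrinkage factor are all assumptions made for illustration. The only grounded part is that a second pass adds a correction to the first, the way each gradient-boosting stage fits the residual left by the previous one.

```python
# Hypothetical sketch of a two-pass "boosted" attention layer (PyTorch).
import torch
import torch.nn as nn

class BoostedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Stage 1: an ordinary self-attention pass.
        self.attn1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stage 2: a second pass that produces a correction term.
        self.attn2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learned step size, analogous to boosting's shrinkage
        # (assumed here, not taken from the paper).
        self.step = nn.Parameter(torch.tensor(0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First pass: standard self-attention over the input.
        y1, _ = self.attn1(x, x, x)
        # Second pass attends from the stage-1 output back to the
        # input, so it can model what the first pass missed.
        correction, _ = self.attn2(y1, x, x)
        # Boosting-style update: first pass plus a damped correction.
        return y1 + self.step * correction
```

Damping the correction with a small learned step mirrors the shrinkage rate in Friedman's gradient boosting, where each new stage is added at less than full strength.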
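
The second entry combines weight-tied layer looping with hyper-connections applied once per loop. The sketch below is likewise assumption-laden: the class name `HyperloopBlock`, the number of parallel streams, and the stand-in for hyper-connections (a learned mixing matrix over parallel residual streams) are invented for illustration. Only the structure, run a shared middle block and apply the connection step after each loop rather than after each layer, comes from the summary.

```python
# Hypothetical sketch of a looped middle block with per-loop stream
# mixing standing in for hyper-connections (PyTorch).
import torch
import torch.nn as nn

class HyperloopBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_layers: int,
                 n_loops: int, n_streams: int = 2):
        super().__init__()
        # Shared middle block, iterated n_loops times; reusing its
        # weights across loops is what saves parameters.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.middle = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.n_loops = n_loops
        # One learned stream-mixing matrix per loop, a simplified
        # stand-in for hyper-connections (assumed, not from the paper).
        self.mix = nn.Parameter(torch.eye(n_streams).repeat(n_loops, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expand the residual stream into parallel copies.
        streams = torch.stack([x] * self.mix.shape[1])
        for i in range(self.n_loops):
            # Run the shared middle block on the primary stream.
            updated = self.middle(streams[0])
            streams = torch.cat([updated.unsqueeze(0), streams[1:]], dim=0)
            # Connection step: mix the streams after the loop,
            # not after every layer inside it.
            streams = torch.einsum('ij,jbld->ibld', self.mix[i], streams)
        # Collapse the streams back into one output.
        return streams.mean(dim=0)
```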