pith. machine review for the scientific record.

DeepCrossAttention: Supercharging transformer residual connections. arXiv preprint arXiv:2502.06785

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

fields: cs.LG (2)

years: 2026 (2)

representative citing papers

Gradient Boosting within a Single Attention Layer

cs.LG · 2026-04-03 · conditional · novelty 7.0

Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping the construction to Friedman's gradient boosting and reducing perplexity by 5.6-6.0% over standard attention on WikiText-103 and OpenWebText subsets.
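The "corrective second attention pass" idea can be sketched as two attention stages inside one layer: the first pass produces an intermediate state, and the second pass adds a correction to it, playing the role of a boosting stage fitted to the previous stage's output. Below is a minimal, hedged PyTorch sketch; the module name, LayerNorm placement, and exact two-stage wiring are illustrative assumptions, not the cited paper's formulation.

    import torch
    import torch.nn as nn

    class TwoStageAttention(nn.Module):
        # Self-attention followed by a corrective second attention pass
        # (a loose analogue of one gradient-boosting stage).
        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            self.attn1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.attn2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Stage 1: ordinary self-attention with a residual connection.
            y1, _ = self.attn1(x, x, x)
            h = x + y1
            # Stage 2: a second attention pass whose output is added as a
            # correction to the stage-1 result.
            hn = self.norm(h)
            y2, _ = self.attn2(hn, hn, hn)
            return h + y2

    # Example: TwoStageAttention(512, 8)(torch.randn(2, 16, 512))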

Hyperloop Transformers

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
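The looping idea in that summary can be sketched as one shared middle block that is applied several times, with a lightweight connection step only between loop iterations; reusing the block's parameters across loops is what would yield the parameter savings. In the hedged PyTorch sketch below, a learned gated mix stands in for hyper-connections, since the summary does not specify the exact connection mechanism; all names and the gating form are assumptions.

    import torch
    import torch.nn as nn

    class LoopedMiddleBlock(nn.Module):
        # A shared middle block applied n_loops times, with a learned mixing
        # step (a stand-in for hyper-connections) only after each full loop.
        def __init__(self, d_model: int, n_heads: int, n_layers: int, n_loops: int):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            # One set of middle layers, reused on every loop iteration;
            # the reuse is where the parameter reduction would come from.
            self.block = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.n_loops = n_loops
            # One gate per loop, applied only after a full pass through the block.
            self.gates = nn.Parameter(torch.zeros(n_loops, d_model))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            for i in range(self.n_loops):
                y = self.block(x)
                g = torch.sigmoid(self.gates[i])   # gate in (0, 1)
                x = g * y + (1.0 - g) * x          # connection step between loops
            return x

    # Example: LoopedMiddleBlock(512, 8, n_layers=4, n_loops=3)(torch.randn(2, 16, 512))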

citing papers explorer

Showing 2 of 2 citing papers.

  • Gradient Boosting within a Single Attention Layer cs.LG · 2026-04-03 · conditional · none · ref 2

  • Hyperloop Transformers cs.LG · 2026-04-23 · unverdicted · none · ref 10
