A residual-aware theory of position bias in transformers

Hanna Herasimchyk, Robin Labryga, Tomislav Prusina, Sören Laue · 2026 · cs.LG · arXiv 2602.16837

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. This bias is closely connected to the Lost-in-the-Middle phenomenon, where models underutilize information placed in the middle of the context. We show that Lost-in-the-Middle-type behavior can arise from the architecture of causal Transformers itself. To do so, we develop a structural theory of position bias based on residual-aware cumulative attention rollout. At finite depth, causal masking and residual connections induce broad, often U-shaped, influence profiles. At infinite depth, our framework resolves a discrepancy between prior attention-only collapse theory and practical Transformer behavior: residual connections fundamentally change cumulative attention dynamics. Empirically, the predicted profiles closely match measured input-token influence in pretrained language models.
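As a rough illustration of the idea in the abstract, the sketch below computes a cumulative attention rollout with and without a residual term. It is not the paper's formulation: the uniform causal attention pattern, the identical layers, the 0.5 residual weight, and the names `causal_uniform_attention`, `rollout`, and `residual_weight` are all assumptions made for this toy example.

```python
import numpy as np


def causal_uniform_attention(n):
    """Row-stochastic causal attention (assumed): position i attends equally to 0..i."""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)


def rollout(attn, depth, residual_weight):
    """Cumulative rollout over `depth` identical layers (a simplifying assumption).

    Each layer is modeled as residual_weight * I + (1 - residual_weight) * attn,
    so residual_weight = 0 recovers an attention-only rollout.
    """
    n = attn.shape[0]
    layer = residual_weight * np.eye(n) + (1.0 - residual_weight) * attn
    return np.linalg.matrix_power(layer, depth)


n_tokens, depth = 8, 4
attn = causal_uniform_attention(n_tokens)

# Influence of each input position on the final position (last row of the rollout).
with_residual = rollout(attn, depth, residual_weight=0.5)[-1]
attention_only = rollout(attn, depth, residual_weight=0.0)[-1]

print("with residual: ", np.round(with_residual, 3))
print("attention only:", np.round(attention_only, 3))
```

In this toy setting the attention-only profile decreases monotonically toward the last position, while the residual-aware profile keeps extra weight on the last position as well as the first, leaving its minimum at an interior position. Real models have learned, non-uniform attention, so this shows only the qualitative effect.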

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Kinetic theory for Transformers and the lost-in-the-middle phenomenon

math.AP · 2026-05-09 · conditional · novelty 8.0

A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.

How Token Influence Decays with Distance: A Green-Function View of Trained Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 5.0

Empirical Jacobian analysis reveals that token influence in trained language models decays as a power law with distance (exponent ~0.8), a learned property not present in random models.
