
A residual-aware theory of position bias in transformers

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. This bias is closely connected to the Lost-in-the-Middle phenomenon, where models underutilize information placed in the middle of the context. We show that Lost-in-the-Middle-type behavior can arise from the architecture of causal Transformers itself. To do so, we develop a structural theory of position bias based on residual-aware cumulative attention rollout. At finite depth, causal masking and residual connections induce broad, often U-shaped, influence profiles. At infinite depth, our framework resolves a discrepancy between prior attention-only collapse theory and practical Transformer behavior: residual connections fundamentally change cumulative attention dynamics. Empirically, the predicted profiles closely match measured input-token influence in pretrained language models.
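The abstract does not spell out the rollout construction, so the following is only a toy sketch of the general idea, not the paper's formulation. It assumes uniform causal attention (each token attends equally to itself and all earlier tokens) and models each layer as a fixed mix of identity (the residual path) and attention, following the usual attention-rollout convention. The names `causal_uniform_attention`, `rollout`, and `residual_weight`, and the values `n = 16`, `depth = 2`, and `0.5`, are all illustrative assumptions.

```python
import numpy as np


def causal_uniform_attention(n):
    """Row-stochastic causal attention: token i attends uniformly to tokens 0..i.

    Uniform weights are an assumption made for illustration only.
    """
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)


def rollout(attn, depth, residual_weight=0.5):
    """Cumulative rollout over `depth` identical layers.

    Each layer is modelled as residual_weight * I + (1 - residual_weight) * A.
    Setting residual_weight=0.0 gives the attention-only product A^depth.
    """
    n = attn.shape[0]
    layer = residual_weight * np.eye(n) + (1.0 - residual_weight) * attn
    total = np.eye(n)
    for _ in range(depth):
        total = layer @ total
    return total


n = 16      # sequence length (hypothetical)
depth = 2   # number of layers (hypothetical)
A = causal_uniform_attention(n)

# Influence of every input position on the final token, read off the last row.
profile_res = rollout(A, depth, residual_weight=0.5)[-1]
profile_nores = rollout(A, depth, residual_weight=0.0)[-1]

print("with residual :", np.round(profile_res, 3))
print("attention only:", np.round(profile_nores, 3))
```

In this toy setting the attention-only profile decreases monotonically with position, while the residual-aware profile is elevated at both the first and the last position relative to the middle. Whether and how this carries over to trained models with non-uniform attention is the question the paper addresses; the sketch makes no claim about that.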

citation summary

years: 2026 (2)
roles: background (1)
polarities: background (1)
