Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show that attention sinks actively suppress information through a convex relationship between sink rate and output norm.
arXiv preprint arXiv:2408.10189 (2024)
6 Pith papers cite this work.
Citation roles: background (1)
Citation polarities: background (1)
Representative citing papers:
The standard Transformer block arises as a first-order approximation to a polar state estimator on the hypersphere, with a Polar Transformer retaining higher-order terms.
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.
PSCT-Net introduces a geometry-aware neural framework that uses differentiable back-projection and attention-guided 3D refinement to reconstruct pediatric skull CT from bi-planar X-rays.