Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
Efficient attention: Attention with linear complexities
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
unclear 1representative citing papers
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
Structured Recurrent Mixers provide a dual parallel-recurrent representation for sequence models, claiming superior training efficiency, information capacity, and inference throughput over linear complexity alternatives.
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
citing papers explorer
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.