RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
hub Mixed citations
Rethinking Attention with Performers
Mixed citation behavior. Most common role is background (67%).
abstract
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representat
co-cited works
representative citing papers
Functional Attention replaces pairwise softmax attention with structured linear operators inspired by geometric functional maps to produce compact, resolution-invariant representations for operator learning.
QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
PCT replaces softmax token competition with a smooth phase-preserving gate on normalized complex similarities, yielding stronger generalization on long-range and phase-sensitive benchmarks than both real and complex Transformers.
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
NeuroKalman mitigates state drift in vision-language UAV navigation by using memory-augmented Kalman filtering where attention retrieves historical anchors to correct predictions without gradient updates.
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
A hybrid attention mechanism with editable request-local memory slots and sparse fallback achieves high accuracy on synthetic overwrite, version, and anti-pollution tasks where pure fixed-state or sparse methods fail, while identifying open-domain selection as the remaining bottleneck.
Oryx hybridizes attention and linear recurrent mixers along the sequence axis with high parameter sharing, outperforming single-mixer baselines on language modeling and retrieval at up to 1.4B scale under mixed training.
Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
H²MT uses offline semantic hierarchy construction, bottom-up memory aggregation, and coarse-to-fine query routing to achieve competitive QA quality with lower memory and latency than flat or retrieval baselines on LongBench tasks.
COAgents introduces a cooperative multi-agent system with a partial search graph to guide intensification and diversification in vehicle routing problems, achieving new state-of-the-art results among learning-based methods on VRPTW benchmarks.
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or random allocation.
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.
citing papers explorer
-
MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.