Gated linear attention transformers with hardware-efficient training
20 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
Lighthouse Attention speeds up long-context pre-training through gradient-free, symmetrical hierarchical compression of QKV that preserves causality; a short full-attention recovery phase then yields lower loss than standard full-attention training.
Gated QKAN-FWP combines fast weight programming with quantum-inspired Kolmogorov-Arnold networks via single-qubit DARUAN activations and gated updates to deliver a 12.5k-parameter model that outperforms larger classical RNNs on long-horizon solar forecasting while running on NISQ devices.
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
Exploiting Test-Time Training's structural match to softmax attention, together with key normalization and locality modules, allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while speeding inference by 1.32-1.47x.
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.
PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupling-cap placement, macro placement, and IR-drop prediction.
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.
This paper surveys attention-based learning methods for spectrum cartography in LEO satellite networks, which enable adaptive fusion of heterogeneous measurements for inference and resource allocation.
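Several entries above describe linear-complexity attention variants. As one concrete illustration, the core-periphery interaction described for VECA can be sketched as two cross-attention hops through a small set of core embeddings. The skeleton below omits projections and multi-head structure, and its names are hypothetical; it shows only why the cost is O(N*M) rather than O(N^2).

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def core_periphery_attention(X, C):
    """Patches X (N, d) interact only through M core embeddings C (M, d).

    Two cross-attention hops (cores read patches, patches read cores) cost
    O(N*M) instead of O(N^2). Illustrative skeleton, not VECA's exact block:
    learned projections and normalization are omitted.
    """
    core_state = softmax(C @ X.T) @ X              # (M, d): cores aggregate patches
    return softmax(X @ core_state.T) @ core_state  # (N, d): patches read cores back
```

Because M is a fixed, resolution-invariant constant, doubling the number of patches only doubles the work.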
citing papers explorer
- WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- TIDES: Implicit Time-Awareness in Selective State Space Models
TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.
- VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
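The sum-of-exponentials idea in the VORT summary can be illustrated directly: a power-law kernel t^(-alpha) is fit by a handful of exponentials, each of which can then be carried as a constant-size recurrent state. The log-spaced decay rates and least-squares fit below are a common heuristic, not the paper's learnable per-token parameterization.

```python
import numpy as np

def sum_of_exponentials_fit(alpha, n_terms=8, t_max=1024):
    """Approximate the power-law kernel k(t) = t**(-alpha) on [1, t_max]
    by sum_i w_i * exp(-lam_i * t) with fixed log-spaced rates.
    """
    t = np.arange(1, t_max + 1, dtype=np.float64)
    target = t ** (-alpha)
    # decay rates covering time scales from ~1 up to ~t_max
    lam = np.logspace(np.log10(1.0 / t_max), 0.0, n_terms)
    basis = np.exp(-lam[None, :] * t[:, None])      # (t_max, n_terms)
    w, *_ = np.linalg.lstsq(basis, target, rcond=None)
    max_err = np.max(np.abs(basis @ w - target))
    return lam, w, max_err
```

The payoff: each exponential term updates as a scalar recurrence h_i <- exp(-lam_i) * h_i + x_t, so power-law retention costs O(n_terms) state per token instead of attending over the entire past.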
- Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning
Gated QKAN-FWP combines fast weight programming with quantum-inspired Kolmogorov-Arnold networks via single-qubit DARUAN activations and gated updates to deliver a 12.5k-parameter model that outperforms larger classical RNNs on long-horizon solar forecasting while running on NISQ devices.
- Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
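The checkpoint-placement dynamic program can be sketched under a simplified cost model: given a distribution over where a request's shared prefix ends, choose K checkpoint positions minimizing the expected tokens recomputed from the nearest earlier checkpoint. The cost model and names below are illustrative assumptions, not the paper's formulation, which handles richer serving constraints.

```python
from functools import lru_cache

def min_expected_recompute(p, K):
    """Place up to K checkpoints (position 0 is free) to minimize expected
    recompute. p[j] = probability the shared prefix ends at position j;
    if the nearest checkpoint at or before j is c, recompute cost is j - c.
    """
    n = len(p)
    # seg[c][d]: expected recompute for prefixes ending in [c, d)
    # when the nearest checkpoint sits at c
    seg = [[0.0] * (n + 1) for _ in range(n)]
    for c in range(n):
        acc = 0.0
        for d in range(c + 1, n + 1):
            acc += p[d - 1] * (d - 1 - c)
            seg[c][d] = acc

    @lru_cache(maxsize=None)
    def dp(c, k):
        # min cost covering [c, n) with a checkpoint at c and k more to place
        best = seg[c][n]
        if k > 0:
            for d in range(c + 1, n):
                best = min(best, seg[c][d] + dp(d, k - 1))
        return best

    return dp(0, K)
```

With all probability mass at the last position and one checkpoint available, the DP places it there and the expected recompute drops to zero.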
- RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
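The claim can be made concrete with a toy estimator: treat exp(q.k / tau) as per-observation precisions, take the precision-weighted mean of value directions, and re-normalize onto the sphere. This von Mises-Fisher-flavored sketch (names and temperature are assumptions, not the paper's derivation) reproduces the softmax-attention-plus-normalization pattern of a transformer block.

```python
import numpy as np

def spherical_estimate(q, K, V, tau=1.0):
    """Precision-weighted directional estimate on the unit hypersphere.

    exp(K @ q / tau) plays the role of per-observation precisions; the
    weighted mean of rows of V is projected back onto the sphere, mirroring
    softmax attention followed by normalization.
    """
    w = np.exp(K @ q / tau)
    w = w / w.sum()                   # softmax over keys
    m = w @ V                         # precision-weighted mean direction
    return m / np.linalg.norm(m)      # project back onto the unit sphere
```

When the query strongly matches one key, the estimate collapses onto that key's value direction, just as sharp attention does.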
- Learning to Adapt: In-Context Learning Beyond Stationarity
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
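The mechanism analyzed here, gated linear attention with a recency bias, reduces to a simple matrix recurrence. The sketch below uses scalar per-step gates for clarity; actual models learn richer (e.g. per-channel, input-dependent) gates.

```python
import numpy as np

def gated_linear_attention(Q, K, V, G):
    """Recurrent form of gated linear attention.

    S_t = g_t * S_{t-1} + k_t v_t^T ;  o_t = S_t^T q_t
    G holds per-step scalar gates; values near 0 forget the past quickly
    (strong recency bias), values near 1 accumulate it.
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = G[t] * S + np.outer(K[t], V[t])   # decay state, write new association
        out[t] = S.T @ Q[t]                   # read with the current query
    return out
```

With all gates at 1 this is plain causal linear attention; with gates at 0 each token sees only itself, which is the two extremes the learnable recency bias interpolates between.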
- In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- Beyond Similarity: Temporal Operator Attention for Time Series Analysis
Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.
- PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay
PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupling-cap placement, macro placement, and IR-drop prediction.
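The separability PhysEDA exploits is easy to demonstrate: exp(-gamma * (|dx| + |dy|)) factorizes into independent x and y kernels, so a 2-D decay-weighted aggregation splits into two 1-D passes. The function below (names illustrative) realizes the two passes as two small matmuls; with an exponential kernel each pass can also run as a linear-time scan, which is what brings the cost down to linear.

```python
import numpy as np

def manhattan_decay_aggregate(F, gamma):
    """Decay-weighted aggregation over a 2-D grid of features F (nx, ny):

        out[x, y] = sum_{x', y'} exp(-gamma * (|x-x'| + |y-y'|)) * F[x', y']

    The Manhattan kernel factorizes into exp(-g|dx|) * exp(-g|dy|), so the
    double sum becomes an x-pass followed by a y-pass.
    """
    nx, ny = F.shape
    Ax = np.exp(-gamma * np.abs(np.subtract.outer(np.arange(nx), np.arange(nx))))
    Ay = np.exp(-gamma * np.abs(np.subtract.outer(np.arange(ny), np.arange(ny))))
    return Ax @ F @ Ay.T
```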
- Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
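The input-dependent decay idea can be sketched as a tiny MLP mapping each token to a gate in (0, 1) that scales the recurrent state before the new key-value outer product is written in. Layer sizes and nonlinearities below are placeholder choices, not the paper's design.

```python
import numpy as np

def input_dependent_decay(X, W1, W2):
    """Per-token decay gates in (0, 1) from a lightweight two-layer MLP.
    X: (T, d) token inputs; W1: (d, h); W2: (h,). Illustrative shapes only.
    """
    h = np.tanh(X @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ W2)))    # sigmoid -> one scalar gate per token

# usage in a linear-attention-style recurrence (keys = values = x_t for brevity)
rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
W1 = rng.normal(size=(d, 8))
W2 = rng.normal(size=(8,))
gates = input_dependent_decay(X, W1, W2)
S = np.zeros((d, d))
for t in range(T):
    S = gates[t] * S + np.outer(X[t], X[t])   # token decides how much past to keep
```

The contrast with a fixed decay schedule is that tokens signaling "new context" can push their gate toward 0 and flush stale memory, which is what helps on long sequences.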