pith. machine review for the scientific record.

Gated Linear Attention Transformers with Hardware-Efficient Training

20 Pith papers cite this work. Polarity classification is still indexing.
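
For orientation, the cited work's core mechanism is a linear-attention recurrence whose matrix-valued state decays under data-dependent gates. A minimal numpy sketch of one recurrent step, following the general form in the GLA paper (dimensions and names here are illustrative, not the authors' code):

    import numpy as np

    d_k, d_v = 4, 8                       # illustrative head dimensions
    S = np.zeros((d_k, d_v))              # matrix-valued recurrent state

    def gla_step(S, q, k, v, alpha):
        # alpha in (0,1)^{d_k}: data-dependent per-channel decay gate
        S = alpha[:, None] * S + np.outer(k, v)   # gated decay + rank-1 write
        o = S.T @ q                               # read-out: o = S^T q, shape (d_v,)
        return S, o

    q, k = np.random.randn(d_k), np.random.randn(d_k)
    v = np.random.randn(d_v)
    alpha = 1 / (1 + np.exp(-np.random.randn(d_k)))  # sigmoid gate
    S, o = gla_step(S, q, k, v, alpha)

The hardware-efficient training of the title reorganizes this token-by-token scan into chunked matrix multiplications; the sketch shows only the per-token semantics.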

citation-role summary
    background (1) · method (1)

citation-polarity summary
    years: 2026 (19) · 2025 (1)
    polarities: unclear (1) · use method (1)

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0

WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k × d_v cache-write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
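
Read literally, that summary places the decoder dictionary in the state-write space rather than the residual stream: each atom is a d_k × d_v matrix, and a cache write is reconstructed as a sparse combination of atoms. A generic sketch of that factorization (our construction for illustration, not WriteSAE's code; the closed-form logit shift is omitted):

    import numpy as np

    d_k, d_v, n_atoms = 4, 8, 64
    A = np.random.randn(n_atoms, d_k, d_v)       # decoder atoms in native write shape
    W_enc = np.random.randn(n_atoms, d_k * d_v)  # encoder over flattened writes

    def sae(delta_S):
        # encode a cache write delta_S (d_k x d_v) into sparse codes, then decode
        z = np.maximum(W_enc @ delta_S.ravel(), 0.0)  # ReLU codes (sparse once trained)
        recon = np.tensordot(z, A, axes=1)            # sum_i z_i * A_i -> (d_k, d_v)
        return z, recon

    k, v = np.random.randn(d_k), np.random.randn(d_v)
    z, recon = sae(np.outer(k, v))               # a rank-1 write, as in linear attention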

TIDES: Implicit Time-Awareness in Selective State Space Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.
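
The mechanism named there, input dependence moved onto the state matrix with continuous-time discretization, is easiest to see for a diagonal SSM, where zero-order-hold discretization has a closed form even for irregular step sizes. A generic sketch (ours, not TIDES's parameterization):

    import numpy as np

    d = 16
    W_a = np.random.randn(d)                 # maps the input to per-channel decay rates
    B = np.random.randn(d)

    def ssm_step(x, u, dt):
        # diagonal SSM step: dx/dt = a(u) * x + B * u, held over an irregular dt
        a = -np.exp(W_a * u)                 # input-dependent and strictly negative
        Abar = np.exp(a * dt)                # exact ZOH state transition
        Bbar = (Abar - 1.0) / a * B          # exact ZOH input matrix
        return Abar * x + Bbar * u

    x = np.zeros(d)
    for u, dt in [(0.3, 0.5), (1.2, 0.05), (-0.7, 2.0)]:  # irregularly sampled inputs
        x = ssm_step(x, u, dt)

Because dt enters the transition exactly rather than as a fixed step index, irregular sampling needs no resampling or imputation.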

VORT: Adaptive Power-Law Memory for NLP Transformers

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
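
The sum-of-exponentials step is what makes this recurrent: a power-law kernel t^(-alpha) has no finite-state recurrence, but each exponential term does, so a few exponentials buy an approximate power law with O(1) state per term. A least-squares illustration of the approximation (generic, not VORT's fitting procedure):

    import numpy as np

    alpha = 0.5
    t = np.arange(1, 513, dtype=float)
    target = t ** (-alpha)                    # power-law retention kernel

    rates = np.logspace(-3, 0, 8)             # 8 fixed decay rates, log-spaced
    basis = np.exp(-np.outer(t, rates))       # (512, 8) exponential basis
    w, *_ = np.linalg.lstsq(basis, target, rcond=None)

    print(np.abs(basis @ w - target).max())   # approximation error over the window

Each term then updates recurrently as m_i <- exp(-rate_i) * m_i + x_t, so the whole kernel costs eight state slots per channel instead of the full history.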

Long Context Pre-Training with Lighthouse Attention

cs.CL · 2026-05-07 · conditional · novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.

Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Gated QKAN-FWP combines fast weight programming with quantum-inspired Kolmogorov-Arnold networks via single-qubit DARUAN activations and gated updates to deliver a 12.5k-parameter model that outperforms larger classical RNNs on long-horizon solar forecasting while running on NISQ devices.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching uses dynamic programming to choose optimal checkpoint placements under observed prefix-overlap distributions, improving the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
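
A toy version of that dynamic program: given how often each shared-prefix length recurs, place K state checkpoints so that the expected recomputation from the nearest checkpoint at or below each request's prefix is minimized. A sketch under those assumptions (the paper's actual cost model over overlap distributions is richer):

    def place_checkpoints(w, K):
        # w[p] = frequency of requests whose shared prefix has length p;
        # a request of length p recomputes p - c tokens from the nearest
        # checkpoint c <= p (recomputing from scratch when none fits)
        n, INF = len(w), float("inf")

        def seg(c, e):  # recompute cost of lengths in [c, e) served from c
            return sum(w[p] * (p - c) for p in range(c, e))

        g = [[INF] * (n + 1) for _ in range(K + 1)]
        g[0][0] = 0.0   # "checkpoint 0" = recompute from scratch
        for j in range(1, K + 1):
            for c in range(j, n + 1):
                g[j][c] = min(g[j - 1][cp] + seg(cp, c) for cp in range(j - 1, c))
        return min(g[K][c] + seg(c, n) for c in range(K, n + 1))

    w = [0] * 100
    w[30], w[80], w[95] = 50, 20, 5           # three popular prefix lengths
    print(place_checkpoints(w, K=2))          # 75: checkpoints at 30 and 80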

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
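
The pattern described, patches interacting only through a small set of learned cores, resembles inducing-point (Perceiver-style) attention: N patches cross-attend to M core embeddings and read back, costing O(N·M) instead of O(N²). A generic numpy sketch (projections and normalization omitted; not VECA's exact architecture):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attend(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

    N, M, d = 1024, 16, 64                 # patches, learned cores, width
    X = np.random.randn(N, d)              # patch embeddings (any resolution)
    C = np.random.randn(M, d)              # resolution-invariant core embeddings

    H = attend(C, X, X)                    # cores gather from all patches: O(N*M)
    H = attend(H, H, H)                    # core-core mixing: O(M^2)
    Y = X + attend(X, H, H)                # patches read back from cores: O(N*M)

Because M is fixed, the total cost is linear in the number of patches N, matching the O(N) claim.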

A Single-Layer Model Can Do Language Modeling

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
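
The quoted gaps follow from the perplexities given:

    (18.06 - 16.05) / 16.05 ≈ 0.125  -> within 13% of the 12-layer Transformer++
    (18.06 - 15.34) / 15.34 ≈ 0.177  -> within 18% of the 10-layer GDN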

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
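
One way to unpack that bound is a counting argument: efficiency plus compactness force the recall-relevant state down to B = poly(d) bits, B bits distinguish at most 2^B configurations, and storing F independent facts from a vocabulary of size V requires on the order of V^F distinguishable configurations. Hence (this is the generic information-theoretic reading, which may differ from the paper's formal argument):

    2^B >= V^F  =>  F <= B / log2(V) = O(poly(d) / log V)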

Linearizing Vision Transformer with Test-Time Training

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

Exploiting Test-Time Training's structural match to softmax attention, plus key normalization and locality modules, allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match its quality while speeding inference by 1.32-1.47x.

Learning to Adapt: In-Context Learning Beyond Stationarity

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
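
Mechanically, the recipe in that summary is a loop: for each chunk of context, take a gradient step on the adapted projection against a next-token objective, then continue, so the updated weights themselves carry long-range state. A toy numpy sketch with a linear stand-in for the MLP projection and squared error as the next-token loss (the paper's objective and update schedule will differ):

    import numpy as np

    d, chunk, lr = 32, 64, 1e-2
    W = np.random.randn(d, d) * 0.1          # stands in for an MLP projection matrix
    tokens = np.random.randn(4096, d)        # embeddings of one long context

    for s in range(0, len(tokens) - chunk, chunk):
        X = tokens[s : s + chunk]            # inputs in this chunk
        Y = tokens[s + 1 : s + chunk + 1]    # next-token targets, shifted by one
        E = X @ W - Y                        # prediction error under current weights
        W -= lr * (2.0 / len(X)) * X.T @ E   # in-place gradient step on the chunk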

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

MiniMax-M1 is a 456B-parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.

Adaptive Memory Decay for Log-Linear Attention

cs.LG · 2026-05-07 · conditional · novelty 5.0

Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
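
A minimal sketch of the gating head described: a small MLP maps the current input to a decay in (0, 1), which then scales the recurrent memory (a toy construction; the paper applies this inside log-linear attention's multi-scale structure):

    import numpy as np

    d, h = 32, 16
    W1 = np.random.randn(h, d)               # lightweight two-layer decay MLP
    W2 = np.random.randn(1, h)

    def decay(x):
        return 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ x, 0.0))))  # sigmoid -> (0,1)

    S = np.zeros((d, d))
    for x in np.random.randn(100, d):
        lam = decay(x).item()                # input-dependent, not a fixed constant
        S = lam * S + np.outer(x, x)         # forget adaptively, then write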

On The Application of Linear Attention in Multimodal Transformers

cs.CV · 2026-04-11 · unverdicted · novelty 4.0

Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.

citing papers explorer

Showing 11 of 11 citing papers after filters.

  • WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 33

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k × d_v cache-write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  • TIDES: Implicit Time-Awareness in Selective State Space Models cs.LG · 2026-05-10 · unverdicted · none · ref 25

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.

  • VORT: Adaptive Power-Law Memory for NLP Transformers cs.LG · 2026-05-09 · unverdicted · none · ref 48

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  • Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning cs.LG · 2026-05-07 · unverdicted · none · ref 98

    Gated QKAN-FWP combines fast weight programming with quantum-inspired Kolmogorov-Arnold networks via single-qubit DARUAN activations and gated updates to deliver a 12.5k-parameter model that outperforms larger classical RNNs on long-horizon solar forecasting while running on NISQ devices.

  • Sparse Prefix Caching for Hybrid and Recurrent LLM Serving cs.LG · 2026-04-17 · unverdicted · none · ref 32

    Sparse prefix caching uses dynamic programming to choose optimal checkpoint placements under observed prefix-overlap distributions, improving the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  • RT-Transformer: The Transformer Block as a Spherical State Estimator cs.LG · 2026-05-10 · unverdicted · none · ref 114

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  • Learning to Adapt: In-Context Learning Beyond Stationarity cs.LG · 2026-04-13 · unverdicted · none · ref 49

    Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.

  • In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 59

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  • Beyond Similarity: Temporal Operator Attention for Time Series Analysis cs.LG · 2026-05-11 · unverdicted · none · ref 29

    Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.

  • PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay cs.LG · 2026-05-11 · unverdicted · none · ref 13

    PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupling-cap placement, macro placement, and IR-drop prediction. (A quick separability check follows this list.)

  • Adaptive Memory Decay for Log-Linear Attention cs.LG · 2026-05-07 · conditional · none · ref 8

    Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
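
On the PhysEDA entry above: Manhattan-distance decay folds into linear attention because it is separable, exp(-g(|dx| + |dy|)) = exp(-g|dx|) · exp(-g|dy|), so the 2D decay is a product of per-axis factors, and along each sorted axis the factor admits a linear-time scan. A quick numerical check of the factorization (ours, not the paper's code):

    import numpy as np

    g = 0.3
    P = np.random.rand(50, 2) * 10.0          # 2D placement coordinates (x, y)
    dx = np.abs(P[:, None, 0] - P[None, :, 0])
    dy = np.abs(P[:, None, 1] - P[None, :, 1])

    joint = np.exp(-g * (dx + dy))            # Manhattan-distance decay
    separable = np.exp(-g * dx) * np.exp(-g * dy)
    assert np.allclose(joint, separable)      # exact, axis-by-axis factorization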