pith · machine review for the scientific record

citation dossier

Gated linear attention transformers with hardware-efficient training

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim · 2023 · arXiv 2312.06635

19 Pith papers citing it
19 reference links
cs.LG · top field · 10 papers
UNVERDICTED · top verdict bucket · 16 papers

This arXiv-backed work is queued for a full Pith review once it crosses the high-inbound sweep. That review runs the full pipeline: reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read the PDF on arXiv

why this work matters in Pith

Pith has found this work cited in 19 reviewed papers. Its strongest current cluster is cs.LG (10 papers). The largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

years

2026 · 18 papers, 2025 · 1 paper

representative citing papers

TIDES: Implicit Time-Awareness in Selective State Space Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.
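A minimal sketch of the mechanism the summary describes, assuming a diagonal state matrix whose decay rates are computed from the input and discretized exactly over irregular time gaps; all names and shapes are illustrative, not TIDES's actual interface:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_state = 8, 16
W_a = torch.randn(d_in, d_state) * 0.1   # maps input -> decay rates
W_b = torch.randn(d_in, d_state) * 0.1   # maps input -> state drive
C   = torch.randn(d_state) * 0.1         # readout

def step(h, x, dt):
    # Input-dependent *state matrix*: diagonal, a(x) < 0 for stability.
    a = -F.softplus(x @ W_a)
    # Zero-order-hold discretization over an irregular gap dt:
    #   h(dt) = e^{a dt} h(0) + (e^{a dt} - 1)/a * u
    decay = torch.exp(a * dt)
    u = x @ W_b
    h = decay * h + (decay - 1.0) / a * u
    return h, h @ C                       # new state, scalar output

h = torch.zeros(d_state)
for x, dt in [(torch.randn(d_in), 0.3), (torch.randn(d_in), 1.7)]:
    h, y = step(h, x, dt)                 # gaps 0.3 and 1.7 handled natively
```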

VORT: Adaptive Power-Law Memory for NLP Transformers

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
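The sum-of-exponentials device the summary mentions is a standard approximation and easy to sketch: fit a few exponentials to a power-law kernel so that the memory can be maintained with one constant-size recurrent accumulator per exponential. A hedged NumPy sketch (the fractional order alpha and the 8-term basis are our choices, not VORT's):

```python
import numpy as np

alpha = 0.6                       # fractional order (per-token learnable in VORT)
t = np.arange(0, 512)
target = (1.0 + t) ** (-alpha)    # power-law retention kernel

# Geometrically spaced decay rates; weights fit by least squares.
lam = np.geomspace(1e-3, 1.0, 8)                 # 8 exponentials
basis = np.exp(-np.outer(t, lam))                # (T, 8)
w, *_ = np.linalg.lstsq(basis, target, rcond=None)
approx = basis @ w

print(np.max(np.abs(approx - target)))  # small uniform error on [0, 511]
```

Each basis term e^{-lam_i t} corresponds to a recurrence m_i[t] = e^{-lam_i} m_i[t-1] + x[t], so power-law retention costs O(1) per step per exponential instead of growing with context length.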

Long Context Pre-Training with Lighthouse Attention

cs.CL · 2026-05-07 · conditional · novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.

Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Gated QKAN-FWP combines fast weight programming with quantum-inspired Kolmogorov-Arnold networks via single-qubit DARUAN activations and gated updates to deliver a 12.5k-parameter model that outperforms larger classical RNNs on long-horizon solar forecasting while running on NISQ devices.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
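The summary's dynamic program can be illustrated under one concrete, assumed cost model: p[l] is the probability that a future request shares exactly l prefix tokens, a request recomputes from the nearest checkpoint at or below its shared length, and we place K checkpoints to minimize expected recomputation. This formulation is ours, not necessarily the paper's:

```python
import numpy as np

def place_checkpoints(p, K):
    """Pick K prefix-checkpoint positions minimizing expected recompute.
    Illustrative DP over an overlap distribution p[l]."""
    T = len(p)  # shared lengths 0..T-1

    def seg(i, j):
        # Expected recompute for shared lengths in [i, j) when the nearest
        # checkpoint at or below them sits at position i.
        ls = np.arange(i, j)
        return float(np.sum(p[i:j] * (ls - i)))

    INF = float("inf")
    # dp[k][i]: min cost over lengths < i with k checkpoints, the last at i.
    dp = [[INF] * (T + 1) for _ in range(K + 1)]
    parent = [[-1] * (T + 1) for _ in range(K + 1)]
    for i in range(1, T + 1):
        dp[1][i] = seg(0, i)              # implicit "checkpoint" at 0
    for k in range(2, K + 1):
        for i in range(k, T + 1):
            for h in range(k - 1, i):
                c = dp[k - 1][h] + seg(h, i)
                if c < dp[k][i]:
                    dp[k][i], parent[k][i] = c, h
    # Close off the tail [i, T) and recover the best placement.
    best = min(range(K, T + 1), key=lambda i: dp[K][i] + seg(i, T))
    cps, k, i = [], K, best
    while k >= 1:
        cps.append(i)
        i, k = parent[k][i], k - 1
    return sorted(cps), dp[K][best] + seg(best, T)

p = np.ones(64) / 64                      # toy uniform overlap distribution
print(place_checkpoints(p, 3))
```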

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
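The core-periphery pattern is a Perceiver-style bottleneck and is easy to sketch: N patch tokens interact only through m learned core tokens, giving O(N·m) cost, and resolution invariance falls out because the core set has fixed size. A minimal PyTorch sketch with illustrative names, not VECA's actual layers:

```python
import torch
import torch.nn as nn

class CoreAttention(nn.Module):
    """Toy core-periphery attention: patches exchange information only
    through m learned core tokens, so cost is O(N*m) rather than O(N^2)."""
    def __init__(self, dim, n_cores=16, heads=4):
        super().__init__()
        self.cores = nn.Parameter(torch.randn(1, n_cores, dim) * 0.02)
        self.read  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, dim), any N
        c = self.cores.expand(x.size(0), -1, -1)
        c, _ = self.read(c, x, x)              # cores gather from patches
        y, _ = self.write(x, c, c)             # patches read back from cores
        return x + y

x = torch.randn(2, 196, 64)                    # 14x14 patches
print(CoreAttention(64)(x).shape)              # torch.Size([2, 196, 64])
```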

A Single-Layer Model Can Do Language Modeling

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
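One hedged way to read the quoted bound (our paraphrase, not the paper's proof) is as a counting argument over a compact state:

```latex
% A compact model carries a state of at most poly(d) bits, so it can occupy
% at most 2^{poly(d)} distinct configurations, while distinguishing F
% independent facts drawn from a vocabulary of size V needs F log2(V) bits:
\[
  F \cdot \log_2 V \;\le\; \mathrm{poly}(d)
  \quad\Longrightarrow\quad
  F \;\le\; \frac{\mathrm{poly}(d)}{\log_2 V},
\]
% so once the state is compact, recall capacity cannot also scale with
% sequence length, matching the O(poly(d)/log V) bound quoted above.
```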

Linearizing Vision Transformer with Test-Time Training

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while speeding inference 1.32-1.47x.
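The "structural match" here is the known fast-weight view: a linear fast weight trained at test time by gradient steps on a reconstruction loss reads out through the same bilinear form as causal linear attention. A small sketch (step size and initialization are our choices):

```python
import torch

torch.manual_seed(0)
T, d = 6, 4
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

# TTT view: a fast weight W takes one gradient step per token on the
# reconstruction loss 0.5*||W k_t - v_t||^2, and the layer outputs W_t q_t.
# Dropping the correction term (W k_t) leaves the pure Hebbian update,
# which is exactly unnormalized causal linear attention.
W_ttt, W_lin, eta = torch.zeros(d, d), torch.zeros(d, d), 1.0
for t in range(T):
    W_ttt = W_ttt - eta * torch.outer(W_ttt @ k[t] - v[t], k[t])  # delta rule
    W_lin = W_lin + eta * torch.outer(v[t], k[t])                  # Hebbian
    o_ttt, o_lin = W_ttt @ q[t], W_lin @ q[t]
# Both read out through the same bilinear form o_t = W_t q_t, which is why
# pretrained attention projections (q, k, v) can be inherited directly.
```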

Learning to Adapt: In-Context Learning Beyond Stationarity

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
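This is a natural place to pin down the recurrence of the dossier's subject paper. A minimal, non-chunked reference implementation of gated linear attention (the paper's contribution is the hardware-efficient chunked form; the per-channel sigmoid gate below stands in for the citing paper's learnable recency bias):

```python
import torch

def gated_linear_attention(q, k, v, alpha):
    """Sequential GLA recurrence:
        S_t = diag(alpha_t) S_{t-1} + k_t v_t^T,   o_t = S_t^T q_t
    with alpha_t in (0,1)^{d_k} a data-dependent decay gate."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    out = []
    for t in range(T):
        S = alpha[t].unsqueeze(1) * S + torch.outer(k[t], v[t])
        out.append(S.T @ q[t])               # o_t = q_t^T S_t, shape (d_v,)
    return torch.stack(out)

T, d_k, d_v = 10, 8, 8
q, k, v = (torch.randn(T, d) for d in (d_k, d_k, d_v))
alpha = torch.sigmoid(torch.randn(T, d_k))   # per-channel decay gates
print(gated_linear_attention(q, k, v, alpha).shape)  # torch.Size([10, 8])
```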

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
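A hedged sketch of the chunk-wise scheme the summary describes: read a chunk, take one gradient step on the next-token loss with respect to a small set of fast weights, continue. Function names, the toy model, and hyperparameters are ours, not the paper's API:

```python
import torch
import torch.nn.functional as F

def in_place_adapt(model, fast_params, tokens, chunk=256, lr=1e-3):
    """Chunk-wise test-time adaptation: one SGD step per chunk on the
    next-token loss, updating only the designated fast weights in place."""
    opt = torch.optim.SGD(fast_params, lr=lr)
    for s in range(0, tokens.size(1) - 1, chunk):
        x = tokens[:, s : s + chunk + 1]
        logits = model(x[:, :-1])            # (B, L, vocab) assumed
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

model = torch.nn.Sequential(
    torch.nn.Embedding(100, 32), torch.nn.Linear(32, 100))
fast = list(model[1].parameters())           # adapt only the projection
in_place_adapt(model, fast, torch.randint(0, 100, (1, 1024)))
```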

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.

Adaptive Memory Decay for Log-Linear Attention

cs.LG · 2026-05-07 · conditional · novelty 5.0

Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
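Reusing the recurrence sketched under "Learning to Adapt" above, the only moving part here is where the decay comes from: a lightweight MLP maps each input to per-channel decay gates (layer sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

# Input-dependent decay: a small MLP produces per-token, per-channel
# gates in (0, 1); these feed the alpha in the GLA sketch above.
d_model, d_k = 32, 8
decay_mlp = nn.Sequential(nn.Linear(d_model, 16), nn.SiLU(), nn.Linear(16, d_k))
x = torch.randn(10, d_model)
alpha = torch.sigmoid(decay_mlp(x))          # shape (10, 8)
```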

On The Application of Linear Attention in Multimodal Transformers

cs.CV · 2026-04-11 · unverdicted · novelty 4.0

Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.

citing papers explorer

Showing 19 of 19 citing papers.

  • TIDES: Implicit Time-Awareness in Selective State Space Models cs.LG · 2026-05-10 · unverdicted · none · ref 25

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and Physiome-ODE benchmarks.

  • VORT: Adaptive Power-Law Memory for NLP Transformers cs.LG · 2026-05-09 · unverdicted · none · ref 48

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  • Long Context Pre-Training with Lighthouse Attention cs.CL · 2026-05-07 · conditional · none · ref 34

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower loss than standard full-attention training.

  • Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning cs.LG · 2026-05-07 · unverdicted · none · ref 98

    Gated QKAN-FWP combines fast weight programming with quantum-inspired Kolmogorov-Arnold networks via single-qubit DARUAN activations and gated updates to deliver a 12.5k-parameter model that outperforms larger classical RNNs on long-horizon solar forecasting while running on NISQ devices.

  • Sparse Prefix Caching for Hybrid and Recurrent LLM Serving cs.LG · 2026-04-17 · unverdicted · none · ref 32

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  • Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 77

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

  • A Single-Layer Model Can Do Language Modeling cs.CL · 2026-05-11 · unverdicted · none · ref 11

    A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

  • RT-Transformer: The Transformer Block as a Spherical State Estimator cs.LG · 2026-05-10 · unverdicted · none · ref 114

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  • The Impossibility Triangle of Long-Context Modeling cs.CL · 2026-05-06 · unverdicted · none · ref 34

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  • Linearizing Vision Transformer with Test-Time Training cs.CV · 2026-05-04 · unverdicted · none · ref 14

    Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while speeding inference 1.32-1.47x.

  • Learning to Adapt: In-Context Learning Beyond Stationarity cs.LG · 2026-04-13 · unverdicted · none · ref 49

    Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.

  • In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 59

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  • Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space cs.CL · 2026-04-06 · unverdicted · none · ref 96

    PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

  • MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention cs.CL · 2025-06-16 · unverdicted · none · ref 47

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

  • Beyond Similarity: Temporal Operator Attention for Time Series Analysis cs.LG · 2026-05-11 · unverdicted · none · ref 29

    Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on time series benchmarks.

  • PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay cs.LG · 2026-05-11 · unverdicted · none · ref 13

    PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupling-cap placement, macro placement, and IR-drop prediction.

  • Adaptive Memory Decay for Log-Linear Attention cs.LG · 2026-05-07 · conditional · none · ref 8

    Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.

  • On The Application of Linear Attention in Multimodal Transformers cs.CV · 2026-04-11 · unverdicted · none · ref 38

    Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.

  • Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview cs.NI · 2026-05-11 · unverdicted · none · ref 137

The paper surveys attention-based learning methods for spectrum cartography in LEO satellite networks that enable adaptive fusion of heterogeneous measurements for inference and resource allocation.