arxiv: 1904.10509 · v1 · submitted 2019-04-23 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Generating Long Sequences with Sparse Transformers

Alec Radford, Ilya Sutskever, Rewon Child, Scott Gray

Pith reviewed 2026-05-10 19:45 UTC · model grok-4.3

keywords sparse transformers · attention mechanisms · long sequence modeling · density modeling · enwik8 · cifar-10 · imagenet-64 · sequence generation

The pith

Sparse factorizations of the attention matrix let transformers model sequences tens of thousands of timesteps long.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces changes to the transformer architecture that address the quadratic growth in time and memory with sequence length. Sparse factorizations reduce attention complexity to O(n sqrt n), while additional modifications allow deeper networks, save memory through recomputation, and speed up training with custom kernels. The resulting Sparse Transformers handle sequences of tens of thousands of steps across hundreds of layers and are applied to raw bytes of text, images, and audio. A reader would care because this makes it feasible to capture long-range structure in data that was previously too costly to model directly.

Core claim

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to O(n sqrt n). We also introduce a variation on architecture and initialization to train deeper networks, the recomputation of attention matrices to save memory, and fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling.

What carries the argument

Sparse factorizations of the attention matrix that lower complexity from quadratic to O(n sqrt n) while supporting long-range dependencies.
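A short worked estimate may help show where the square root comes from. This is a sketch under an assumption not spelled out above: attention is split across two heads, one looking at a local window of width l and one looking at every l-th earlier position, where the symbol l (the stride) is introduced here for illustration.

```latex
% Assumption: two attention heads, one local with window l and one strided with step l.
\begin{aligned}
C(l) &\approx n\left(l + \frac{n}{l}\right) && \text{(keys touched per query, summed over $n$ queries)} \\
\frac{dC}{dl} &= n\left(1 - \frac{n}{l^{2}}\right) = 0 \;\Longrightarrow\; l = \sqrt{n} \\
C(\sqrt{n}) &\approx 2\,n\sqrt{n} = O(n\sqrt{n})
\end{aligned}
```

Under this assumption, choosing the stride near sqrt(n) balances the two heads, which is why the total cost lands at O(n sqrt n) rather than O(n^2).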

If this is right

Hundreds of layers become practical on sequences of tens of thousands of timesteps.
State-of-the-art density modeling results are reached on Enwik8, CIFAR-10, and ImageNet-64 from raw bytes.
Unconditional generation produces samples with global coherence and diversity.
Self-attention in principle extends to sequences of length one million or more.
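To give a feel for the scale involved, the following back-of-the-envelope sketch counts query-key pairs for dense causal attention against a rough two-head factorized estimate. The function names and the 2 * n * ceil(sqrt(n)) estimate are illustrative assumptions, not figures taken from the paper, and the counts say nothing about actual runtime or memory on real hardware.

```python
import math


def dense_causal_entries(n: int) -> int:
    """Number of (query, key) pairs in a dense causal attention matrix."""
    return n * (n + 1) // 2


def factorized_entries_estimate(n: int) -> int:
    """Rough estimate 2 * n * ceil(sqrt(n)) for two heads with stride near sqrt(n).

    Constant offsets (such as the query attending to itself) are ignored.
    """
    stride = math.isqrt(n - 1) + 1  # ceil(sqrt(n)) for n >= 1
    return 2 * n * stride


if __name__ == "__main__":
    for n in (1_024, 16_384, 1_000_000):
        dense = dense_causal_entries(n)
        sparse = factorized_entries_estimate(n)
        print(f"n={n:>9,}  dense={dense:.3e}  factorized~{sparse:.3e}  ratio={dense / sparse:.1f}x")
```

The gap widens with length: the ratio grows roughly like sqrt(n) / 4, so the saving matters most at exactly the lengths where dense attention stops being practical.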

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Structured sparsity may suffice for many long-range dependencies instead of requiring full attention.
The approach could be tested on other high-dimensional sequence data such as video frames or full-length audio tracks.
Learned or data-adaptive sparsity patterns might further improve efficiency beyond the fixed factorizations used here.

Load-bearing premise

The chosen sparse factorizations of the attention matrix retain sufficient expressivity to capture the long-range dependencies needed for the reported density modeling tasks.

What would settle it

A head-to-head comparison on one of the long-sequence tasks where a full-attention transformer achieves clearly superior density estimates or sample coherence compared with the sparse version.

Original abstract

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Sparse factorizations cut transformer attention cost to O(n sqrt n) and let the models train on sequences of 16k+ timesteps with new SOTA density numbers, but the absence of direct sparse-vs-dense ablations leaves the expressivity claim untested.

The core advance here is a pair of hand-designed sparse attention patterns, strided and fixed, that factor the attention matrix so that cost drops from quadratic to O(n sqrt n). The authors pair this with a deeper-network initialization, attention recomputation to save memory, and fast kernels, then train models with hundreds of layers on sequences of up to tens of thousands of steps. The same architecture is applied to raw-byte modeling of text, images, and audio, beating prior numbers on Enwik8, CIFAR-10, and ImageNet-64 while producing samples that show global coherence over long ranges. They also sketch how the approach could reach a million steps in principle. That combination of scaling demonstration and concrete implementation tricks is what is actually new relative to the 2017-2018 transformer baseline.

The work is useful because it gives practitioners a workable recipe for longer contexts without needing to invent new hardware. The patterns are simple enough to implement and the memory tricks are immediately practical.

The main soft spot is the lack of a controlled comparison: there is no head-to-head run of the sparse version against a dense transformer at lengths like 1k or 2k, where both are still feasible. Without that, it is hard to separate the benefit of sparsity from the deeper training and initialization changes. The SOTA claims would also land more solidly with error bars and fuller ablation tables on the individual components. The patterns are task-specific and hand-crafted, so it remains open whether they preserve enough expressivity for every long-range dependency that dense attention would capture.

This paper is aimed at people who need transformers on long sequences in language, vision, or audio and who are willing to trade some theoretical generality for practical scaling. A reader already working on efficient attention or scaling experiments will find the concrete numbers and code-level details worth examining. It deserves a serious referee because the empirical results are substantial and the method is reproducible enough to test. I would send it to review, expecting the main questions to focus on the missing ablations and the precise contribution of each trick.
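The two patterns named in the letter can be sketched as boolean masks. This is a minimal illustration, assuming the commonly described form of the patterns: a local window plus every stride-th earlier position for "strided", and same-block positions plus the last c "summary" columns of each block for "fixed". The function names, the parameter c, and the toy sizes are hypothetical choices made here, and none of this is the authors' kernel code.

```python
import math


def strided_masks(n, stride):
    """Return (local, strided) masks; mask[i][j] is True if query i may attend to key j."""
    local = [[0 <= i - j <= stride for j in range(n)] for i in range(n)]
    strided = [[j <= i and (i - j) % stride == 0 for j in range(n)] for i in range(n)]
    return local, strided


def fixed_masks(n, stride, c=1):
    """Return (block, summary) masks: same-block keys, and the last c columns of every block."""
    block = [[j <= i and j // stride == i // stride for j in range(n)] for i in range(n)]
    summary = [[j <= i and j % stride >= stride - c for j in range(n)] for i in range(n)]
    return block, summary


def two_hop_reachable(first, second, i, j):
    """True if query i reaches key j by attending to some k via `second`, where k attends to j via `first`."""
    return any(second[i][k] and first[k][j] for k in range(j, i + 1))


n = 64
stride = math.isqrt(n)  # stride near sqrt(n)
local, strided = strided_masks(n, stride)
block, summary = fixed_masks(n, stride, c=1)

max_keys_per_query = max(sum(a) + sum(b) for a, b in zip(local, strided))
all_reachable = all(
    two_hop_reachable(local, strided, i, j) for i in range(n) for j in range(i + 1)
)
print("max keys per query (strided pair):", max_keys_per_query)
print("every causal pair reachable in two hops:", all_reachable)
```

The two-hop check is the point of the factorization in this sketch: each query touches only on the order of sqrt(n) keys per head, yet information from any earlier position can still arrive after two attention steps. Whether that indirect path is as expressive as direct dense attention is exactly the open question the letter raises.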

Referee Report

2 major / 2 minor

Summary. The paper introduces sparse factorizations of the self-attention matrix in Transformers that reduce complexity from quadratic to O(n √n). Combined with changes to architecture and initialization for training deeper networks, attention recomputation to reduce memory use, and optimized fast attention kernels, the resulting Sparse Transformers are shown to model sequences of tens of thousands of timesteps. The same architecture is applied to raw-byte modeling of text (Enwik8) and images (CIFAR-10 and ImageNet-64), achieving new state-of-the-art density modeling results and generating globally coherent unconditional samples; the work also indicates that self-attention can in principle handle sequences of length one million or more.

Significance. If the reported results hold under verification, the work is significant: it provides concrete, practical sparse attention patterns that enable self-attention to scale to sequence lengths far beyond the reach of dense Transformers, while retaining sufficient expressivity for high-quality density modeling on established benchmarks. The accompanying engineering contributions (recomputation, fast kernels) are immediately usable and lower the barrier to experimenting with longer contexts in language, vision, and audio.

major comments (2)

[§3] §3, Eq. (3) and Figure 2: The central claim that the chosen strided and fixed sparse patterns retain sufficient expressivity for long-range dependencies rests on the SOTA density-modeling results, yet the manuscript contains no direct ablation of sparse versus dense attention on sequence lengths where dense attention remains tractable (e.g., n ≤ 2048). Without this comparison it is impossible to isolate whether the reported gains derive from the sparsity itself or from the deeper training and initialization changes.
[§4–5] Experimental results (abstract and §4–5): The headline SOTA numbers on Enwik8, CIFAR-10, and ImageNet-64 are presented without error bars, without ablations that quantify the contribution of each proposed component (sparsity pattern, initialization, recomputation), and without an explicit statement of the exact training protocol and hyper-parameters. These omissions make the scaling claim difficult to reproduce or falsify.

minor comments (2)

[§3] Notation for the two sparse patterns (strided vs. fixed) is introduced in §3 but the precise definition of the attention mask for each is only shown graphically in Figure 2; an explicit matrix-level equation would improve clarity.
[abstract] The claim that sequences of length one million are feasible “in principle” is stated in the abstract but is not supported by any timing or memory measurements at that scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

Referee: [§3] §3, Eq. (3) and Figure 2: The central claim that the chosen strided and fixed sparse patterns retain sufficient expressivity for long-range dependencies rests on the SOTA density-modeling results, yet the manuscript contains no direct ablation of sparse versus dense attention on sequence lengths where dense attention remains tractable (e.g., n ≤ 2048). Without this comparison it is impossible to isolate whether the reported gains derive from the sparsity itself or from the deeper training and initialization changes.

Authors: We agree that a controlled ablation on shorter sequences would help isolate the contribution of the sparse patterns from the architectural and initialization changes. Although the primary motivation is scaling to lengths where dense attention is infeasible, we will add an ablation in the revised manuscript: we will train matched-depth dense and sparse models on sequences of length 512–2048 and report the resulting bits-per-byte (or bits-per-dim) to quantify any expressivity gap introduced by sparsity.
revision: yes

Referee: [§4–5] Experimental results (abstract and §4–5): The headline SOTA numbers on Enwik8, CIFAR-10, and ImageNet-64 are presented without error bars, without ablations that quantify the contribution of each proposed component (sparsity pattern, initialization, recomputation), and without an explicit statement of the exact training protocol and hyper-parameters. These omissions make the scaling claim difficult to reproduce or falsify.

Authors: We acknowledge that the current presentation lacks error bars, component-wise ablations, and a fully explicit training protocol, all of which are important for reproducibility. In the revised version we will (i) report standard deviations from at least three independent runs for the main Enwik8, CIFAR-10, and ImageNet-64 results, (ii) add ablation tables that isolate the effect of the sparse factorization, the deeper-network initialization, and attention recomputation, and (iii) include a detailed appendix listing all hyperparameters, optimizer settings, data preprocessing, and hardware used for each experiment.
revision: yes
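The reporting promised in the two responses above (bits-per-byte or bits-per-dim, with a standard deviation over independent runs) could be computed along the following lines. This is a hedged sketch: the helper names are made up for illustration, and the per-run loss values are placeholder numbers, not results from the paper or from any real experiment.

```python
import math
import statistics


def nats_to_bits_per_unit(total_nll_nats: float, num_units: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte (or per dimension)."""
    return total_nll_nats / (num_units * math.log(2))


def summarize_runs(values):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they are NOT results from the paper.
    units = 32 * 32 * 3  # dimensions in one 32x32 RGB image
    run_nll_nats = [5960.0, 5975.0, 5950.0]  # hypothetical per-image losses from three runs
    bpd = [nats_to_bits_per_unit(x, units) for x in run_nll_nats]
    mean, std = summarize_runs(bpd)
    print(f"bits per dim: {mean:.3f} +/- {std:.3f} over {len(bpd)} runs")
```

Reporting the sample standard deviation alongside the mean makes it possible to judge whether a gap between matched dense and sparse models is larger than run-to-run noise.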

Circularity Check

0 steps flagged

No circularity: architectural proposal and empirical benchmarks are independent of fitted inputs or self-referential definitions.

The paper defines sparse attention factorizations (strided and fixed patterns) explicitly in §3 as a hand-designed reduction from dense O(n²) to O(n√n) attention, then evaluates the resulting model on standard external density-modeling benchmarks (Enwik8, CIFAR-10, ImageNet-64) whose test sets are disjoint from any training or hyperparameter choices. No equation equates a reported performance gain to a quantity defined by fitting the same data; no uniqueness theorem or ansatz is imported via self-citation to force the factorization choice; and the central claim (long-sequence modeling with hundreds of layers) rests on measured perplexity/BPD numbers rather than a renaming or self-definition. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that sparse attention patterns can approximate full attention for the density modeling tasks; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Sparse factorizations of the attention matrix preserve enough long-range modeling capacity for the target tasks.
Invoked to justify the O(n sqrt(n)) reduction while still claiming SOTA performance.

pith-pipeline@v0.9.0 · 5457 in / 1285 out tokens · 64595 ms · 2026-05-10T19:45:41.453749+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean eight_tick_forces_D3; linking_requires_D3 unclear
We introduce sparse factorizations of the attention matrix which reduce this to O(n√n)... We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers... setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel; Jcost unclear
Sparse factorizations of the attention matrix... two 2d factorized attention schemes... strided... fixed

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling Limits of Long-Context Transformers
cs.LG 2026-05 unverdicted novelty 8.0

For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
Convergent Stochastic Training of Attention and Understanding LoRA
cs.LG 2026-05 unverdicted novelty 8.0

Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL 2026-04 unverdicted novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
RULER: What's the Real Context Size of Your Long-Context Language Models?
cs.CL 2024-04 accept novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
cs.LG 2023-12 unverdicted novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
cs.CL 2023-08 unverdicted novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
Efficiently Modeling Long Sequences with Structured State Spaces
cs.LG 2021-10 unverdicted novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
Denoising Diffusion Probabilistic Models
cs.LG 2020-06 accept novelty 8.0

Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
cs.LG 2026-05 unverdicted novelty 7.0

QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
End-to-End Population Inference from Gravitational-Wave Strain using Transformers
gr-qc 2026-05 unverdicted novelty 7.0

Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
VORT: Adaptive Power-Law Memory for NLP Transformers
cs.LG 2026-05 unverdicted novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
cs.LG 2026-05 conditional novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
cs.CV 2026-05 unverdicted novelty 7.0

SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
Adaptive Head Budgeting for Efficient Multi-Head Attention
cs.LG 2026-04 unverdicted novelty 7.0

BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.
Improving Sparse Autoencoder with Dynamic Attention
cs.LG 2026-04 unverdicted novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
cs.CL 2026-04 unverdicted novelty 7.0

Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 7.0

LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.
Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction
cs.CV 2026-04 unverdicted novelty 7.0

Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
cs.LG 2026-04 unverdicted novelty 7.0

Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
cs.LG 2024-02 unverdicted novelty 7.0

HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
cs.CV 2024-01 conditional novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
Scalable Diffusion Models with Transformers
cs.CV 2022-12 unverdicted novelty 7.0

DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
cs.LG 2022-05 accept novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
High-Resolution Image Synthesis with Latent Diffusion Models
cs.CV 2021-12 conditional novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Scaling Laws for Autoregressive Generative Modeling
cs.LG 2020-10 accept novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
Rethinking Attention with Performers
cs.LG 2020-09 unverdicted novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
cs.CL 2020-06 unverdicted novelty 7.0

DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
Longformer: The Long-Document Transformer
cs.CL 2020-04 accept novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
cs.LG 2026-05 unverdicted novelty 6.0

PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
cs.LG 2026-05 conditional novelty 6.0

KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
Search Your Block Floating Point Scales!
cs.LG 2026-05 unverdicted novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
Compute Where it Counts: Self Optimizing Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
cs.CL 2026-05 unverdicted novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series
cs.CV 2026-05 unverdicted novelty 6.0

A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.
The Impossibility Triangle of Long-Context Modeling
cs.CL 2026-05 unverdicted novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
cs.CR 2026-05 conditional novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse
cs.AI 2026-05 unverdicted novelty 6.0

AdapShot adaptively tunes shot count via entropy probes and reuses semantically-matched KV caches with position decoupling to deliver ~10% accuracy gains and 4.64x speedup over fixed-shot baselines.
Stochastic Sparse Attention for Memory-Bound Inference
cs.LG 2026-05 accept novelty 6.0

SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching bas...
Linear-Time Global Visual Modeling without Explicit Attention
cs.CV 2026-05 unverdicted novelty 6.0

Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.
UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces
cs.NE 2026-04 unverdicted novelty 6.0

UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and self-supervised masked reconstruction to achieve SOTA downstream performance with better generalizatio...
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
cs.LG 2026-04 unverdicted novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling
cs.NE 2026-04 unverdicted novelty 6.0

S4D state space models correspond exactly to wave propagation and nonlinear wave interactions in a one-dimensional ring oscillator network, with a closed-form operator describing the complete input-output map.
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
cs.LG 2026-04 unverdicted novelty 6.0

Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
cs.CL 2026-04 unverdicted novelty 6.0

DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
cs.AR 2026-04 unverdicted novelty 6.0

AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
cs.DC 2026-04 unverdicted novelty 6.0

PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...
FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
cs.CV 2026-04 unverdicted novelty 6.0

FreqFormer applies heterogeneous attention (dense global on low frequencies, block-sparse on mid, local on high) plus adaptive spectral routing to reduce attention cost in long-sequence video diffusion transformers.
In-Place Test-Time Training
cs.LG 2026-04 conditional novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
cs.CV 2026-04 unverdicted novelty 6.0

PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
cs.CV 2026-04 conditional novelty 6.0

LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
Gated Linear Attention Transformers with Hardware-Efficient Training
cs.LG 2023-12 unverdicted novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
MemGPT: Towards LLMs as Operating Systems
cs.AI 2023-10 unverdicted novelty 6.0

MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 85 Pith papers · 5 internal anchors

[1]

Character-level language modeling with deeper self-attention

Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444,

[2]

Layer Normalization

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Efficient attention using a fixed-size memory representation

Britz, D., Guan, M. Y., and Luong, M.-T. Efficient attention using a fixed-size memory representation. arXiv preprint arXiv:1707.00110,

[4]

Training Deep Nets with Sublinear Memory Cost

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174,

work page internal anchor Pith review arXiv
[5]

Pixelsnail: An improved autoregressive generative model

Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763,

[6]

Monotonic chunkwise attention

Chiu, C.-C. and Raffel, C. Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382, 2017

[7]

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122,

[8]

Identity mappings in deep residual networks

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027,

[9]

Gaussian Error Linear Units (GELUs)

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

An improved relative self-attention mechanism for transformer with application to music generation

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Hawthorne, C., Dai, A. M., Hoffman, M. D., and Eck, D. An improved relative self-attention mechanism for transformer with application to music generation. arXiv preprint arXiv:1809.04281,

[11]

Exploring the Limits of Language Modeling

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410,

[12]

A clockwork rnn

Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. A clockwork rnn. arXiv preprint arXiv:1402.3511,

[13]

Generating Wikipedia by summarizing long sequences

Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198,

[14]

Samplernn: An unconditional end-to-end neural audio generation model

Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A., and Bengio, Y. Samplernn: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837,

[15]

Generating high fidelity images with subscale pixel networks and multidimensional upscaling

Menick, J. and Kalchbrenner, N. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608,

[16]

Mixed Precision Training

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740,

work page internal anchor Pith review arXiv
[17]

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759,

[18]

Image transformer

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., and Ku, A. Image transformer. arXiv preprint arXiv:1802.05751,

[19]

Reed, S., Oord, A. v. d., Kalchbrenner, N., Colmenarejo, S. G., Wang, Z., Belov, D., and de Freitas, N. Parallel multiscale autoregressive density estimation. arXiv preprint arXiv:1703.03664,

[20]

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517,

[21]

WaveNet: A Generative Model for Raw Audio

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. CoRR abs/1609.03499,

work page internal anchor Pith review arXiv
[22]

Attention is all you need

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017
