StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Pith reviewed 2026-05-08 18:41 UTC · model grok-4.3
The pith
A chunked top-k driver lets the CSA lightning indexer run on sequences up to one million tokens with only 6 GB peak memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StreamIndex replaces the materializing top-k reduction step inside the CSA lightning indexer with a Triton chunked partition-merge driver that never allocates the full [B, S, H_I, T] FP32 score tensor. On V4-shaped synthetic inputs the driver runs the indexer to S = 1,048,576 using 6.21 GB peak HBM while the materialize path OOMs at S = 65,536, and set-overlap recall against the ground-truth top-k stays at or above 0.9980 across all tested design points.
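For reference, the recall metric is set overlap with the exact top-k: the fraction of ground-truth indices the chunked driver recovers per query. A minimal sketch under that reading (an assumption about the metric's definition, not the paper's code):

```python
import torch

def set_overlap_recall(pred_idx: torch.Tensor, true_idx: torch.Tensor) -> float:
    """Fraction of ground-truth top-k indices recovered by the chunked
    driver, averaged over queries. Both tensors are [..., k] integer index
    sets; this is one plausible reading of the metric, not the paper's code.
    """
    k = true_idx.shape[-1]
    # For each ground-truth index, check membership in the predicted set
    hits = (true_idx.unsqueeze(-1) == pred_idx.unsqueeze(-2)).any(dim=-1)
    return (hits.float().sum(dim=-1) / k).mean().item()
```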
What carries the argument
The chunked partition-merge top-k driver, which tiles the sequence, computes local top-k inside each tile, and merges the partial results without ever storing the complete score tensor.
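A minimal PyTorch sketch of that pattern may help; the function name, shapes, tile defaults, and the plain-sum reduction over indexer heads are illustrative assumptions, not the paper's Triton kernel:

```python
import torch

def chunked_topk(q, kc, k, q_chunk=1024, k_tile=4096):
    """Partition-merge top-k over compressed keys without materializing the
    full [S, H_I, T] score tensor: a minimal sketch of the pattern, not the
    paper's Triton kernel. Shapes, names, and defaults are assumed.

    q:  [S, H, D] indexer queries    kc: [T, H, D] compressed keys
    Returns per-query selected key indices, [S, k].
    """
    S, H, D = q.shape
    T = kc.shape[0]
    out = torch.empty(S, k, dtype=torch.long, device=q.device)
    for qs in range(0, S, q_chunk):
        qb = q[qs:qs + q_chunk]                      # [c, H, D]
        c = qb.shape[0]
        best_v = torch.full((c, k), float("-inf"), device=q.device)
        best_i = torch.zeros(c, k, dtype=torch.long, device=q.device)
        for ks in range(0, T, k_tile):
            kb = kc[ks:ks + k_tile]                  # [t, H, D]
            # Score one (query-chunk, key-tile) block, reducing over
            # indexer heads (a plain sum stands in for the indexer's
            # learned head weighting): [c, t]
            s = torch.einsum("chd,thd->ct", qb, kb)
            v, i = s.topk(min(k, kb.shape[0]), dim=-1)   # local top-k
            # Merge running winners with this tile's winners, keep top-k
            mv = torch.cat([best_v, v], dim=-1)
            mi = torch.cat([best_i, i + ks], dim=-1)
            best_v, pos = mv.topk(k, dim=-1)
            best_i = torch.gather(mi, -1, pos)
        out[qs:qs + q_chunk] = best_i
    return out
```

Peak score storage per step is one (query-chunk × key-tile) block rather than the full [S, H_I, T] tensor; causal masking and the indexer's exact learned scoring are omitted for brevity.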
If this is right
- CSA indexer step becomes feasible at sequence lengths 32 times longer than the previous single-GPU limit.
- Peak memory for the indexer stays roughly constant with sequence length rather than scaling linearly.
- The driver composes directly with existing pipelined sparse attention kernels, enabling longer contexts without kernel changes.
- Recall remains above 0.998 across wide ranges of chunk size, key-tile size, and k value on the target input shape.
Where Pith is reading between the lines
- The same chunked-merge pattern could be applied to any attention variant that first scores and then selects a sparse subset of keys.
- If the recall numbers hold on real data, training runs that previously required multi-GPU sharding of the indexer could run on single-GPU hardware.
- The approach leaves open the question of whether the same streaming logic can be fused inside the attention kernel itself to reduce kernel launch overhead.
Load-bearing premise
That high overlap with the exact top-k on synthetic indexer inputs is enough to keep downstream model quality intact when the selected keys are handed to the attention kernel.
What would settle it
Measure end-to-end perplexity or task accuracy of a real CSA checkpoint at S = 262,144 using the chunked indexer versus a materialize baseline that still fits in memory; any statistically significant drop would falsify the assumption.
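A sketch of what that test could look like, assuming a hypothetical checkpoint wrapper that exposes the indexer choice as a flag (`chunked_indexer` is invented for illustration; StreamIndex documents no such API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, tokens, use_chunked: bool) -> float:
    """Hypothetical harness for the falsification test above: score the
    same CSA checkpoint with the chunked vs. materialize indexer.
    `chunked_indexer` is an invented flag, not StreamIndex's actual API.
    tokens: [batch, seq] token ids at seq = 262,144.
    """
    logits = model(tokens[:, :-1], chunked_indexer=use_chunked)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # [B*(S-1), vocab]
        tokens[:, 1:].reshape(-1),             # next-token targets
    )
    return loss.exp().item()

# A statistically significant gap between
# perplexity(model, toks, True) and perplexity(model, toks, False)
# would falsify the premise above.
```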
Original abstract
DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex runs the same indexer to S=1,048,576 with 6.21 GB peak HBM, a 32x regime extension. Set-overlap recall against the materialize ground truth is bit-exact at small S where both fit; across three 5-point design-space sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 with min recall at least 0.9980 in every cell. The chunked driver composes with TileLang's pipelined attention kernel: at S=262,144 with V4-Flash dimensions, the materialize indexer paired with TileLang attention OOMs while the chunked indexer paired with the same attention runs in 1.97 s at 18.56 GB peak. Our contribution targets the indexer step; we make no claim of a faster attention kernel or of real-checkpoint end-to-end behavior. Code: https://github.com/RightNow-AI/StreamIndex.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StreamIndex, a Triton implementation of the CSA indexer in DeepSeek models that uses a chunked partition-merge top-k driver to avoid materializing the full score tensor. This enables running at S up to 1,048,576 with 6.21 GB peak HBM on an H200 GPU (a 32x extension over the materialize path, which OOMs at S = 65,536), while achieving high set-overlap recall (bit-exact at small S; mean 1.0000, min ≥ 0.9980 on synthetic V4 inputs across sweeps). It also shows composition with TileLang attention at S = 262,144 without OOM, limits its claims to the indexer step, and releases code.
Significance. If the results hold, this work significantly advances practical deployment of compressed sparse attention by removing a key memory barrier, allowing much longer sequences on single GPUs. Strengths include the reproducible code release, explicit OOM comparisons, bit-exact recall validation on synthetic inputs, and careful scoping that avoids overclaiming end-to-end model quality.
Minor comments (3)
- [Abstract] The 256 GB intermediate tensor size is stated but not derived; adding the explicit formula (B × S × H_I × T × sizeof(FP32)) would improve clarity for readers. A worked instance appears after this list.
- The three 5-point design sweeps are referenced without table or figure numbers in the provided abstract; ensure all experimental results are clearly linked to specific tables or figures in the full manuscript.
- [Experiments] While the recall metrics are impressive, the paper could briefly note the computational overhead of the chunked approach relative to the materialize path at small S where both fit, even if this is not central to the memory claim.
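For concreteness, the arithmetic behind the 256 GB figure, assuming B = 1 and T = S/m = 16,384 (both consistent with the abstract's numbers):

```latex
% Size of the materialized [B, S, H_I, T] FP32 score tensor at S = 65,536,
% H_I = 64, m = 4; B = 1 and T = S/m are assumptions consistent with the
% abstract's figures.
\[
  \underbrace{1}_{B} \times \underbrace{65{,}536}_{S} \times \underbrace{64}_{H_I}
  \times \underbrace{16{,}384}_{T = S/m} \times \underbrace{4~\mathrm{B}}_{\mathrm{FP32}}
  = 2^{38}~\mathrm{B} = 256~\mathrm{GiB}.
\]
```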
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation to accept. We appreciate the recognition of the reproducible code release, explicit OOM comparisons, bit-exact recall validation on synthetic inputs, and the careful scoping of claims to the indexer step.
Circularity Check
No significant circularity
Full rationale
The paper presents an engineering implementation (a Triton chunked partition-merge top-k driver) for memory-bounded CSA indexer execution. All central claims are direct empirical measurements against an explicit materialize baseline on the same synthetic V4-shaped inputs: peak HBM usage, OOM thresholds, and set-overlap recall (bit-exact at small S; mean 1.0000 with min ≥ 0.9980 at large S). There is no mathematical derivation chain, no fitted parameters, no predictions or ansatzes, and no load-bearing self-citations; the result is an implementation artifact with released code and an explicit scope limitation to the indexer step.
Reference graph
Works this paper leans on
- [1] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. ACL, 2024. arXiv:2308.14508
- [2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. arXiv:2004.05150, 2020
- [3] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding Frequent Items in Data Streams. ICALP, LNCS 2380, pp. 693-703, Springer, 2002
- [4] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating Long Sequences with Sparse Transformers. arXiv:1904.10509, 2019
- [5] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking Attention with Performers. ICLR, 2021. arXiv:2009.14794
- [6] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691, 2023
- [7] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS, 2022. arXiv:2205.14135
- [8] DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434, 2024
- [9] DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv:2412.19437, 2024
- [10] DeepSeek-AI. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556, 2025
- [11] Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM, 2024. arXiv:2312.00752
- [12] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the Real Context Size of Your Long-Context Language Models? COLM, 2024. arXiv:2404.06654
- [13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024. arXiv:2310.06770
- [14] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, et al. Mixtral of Experts. arXiv:2401.04088, 2024
- [15] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, et al. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. NeurIPS, 2024. arXiv:2407.02490
- [16] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML, 2020. arXiv:2006.16236
- [17] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. ICLR, 2020. arXiv:2001.04451
- [18] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Selective Attention Improves Transformer. arXiv:2410.02703, 2024
- [19] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP, 2023. arXiv:2309.06180
- [20] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889, 2023
- [21] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. FP8 Formats for Deep Learning. arXiv:2209.05433, 2022
- [22] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2nd edition, 2017
- [23] Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, et al. Microscaling Data Formats for Deep Learning. arXiv:2310.10537, 2023
- [24] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena Hierarchy: Towards Larger Convolutional Language Models. ICML, 2023. arXiv:2302.10866
- [25] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063, 2024. arXiv:2104.09864
- [26] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. NeurIPS, 2024. arXiv:2404.00456
- [27] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv:2407.08608, 2024
- [28] Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150, 2019
- [29] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. ICML, 2024. arXiv:2406.10774
- [30] Philippe Tillet, H. T. Kung, and David Cox. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. MAPL, ACM SIGPLAN, pp. 10-19, 2019. doi:10.1145/3315508.3329973
- [31] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768, 2020
- [32] Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, et al. TileLang: A Composable Tiled Programming Model for AI Systems. arXiv:2504.17577, 2025
- [33] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient Streaming Language Models with Attention Sinks. ICLR, 2024. arXiv:2309.17453
- [34] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv:2502.11089, 2025
- [35] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, et al. Big Bird: Transformers for Longer Sequences. NeurIPS, 2020. arXiv:2007.14062
- [36] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. NeurIPS, 2023. arXiv:2306.14048