Recognition: no theorem link
The Hyperscale Lottery: How State-Space Models Have Sacrificed Edge Efficiency
Pith reviewed 2026-05-10 17:40 UTC · model grok-4.3
The pith
Mamba-3's changes aimed at hyperscale GPUs increase edge-device latency by 28 percent at 880M parameters and 48 percent at 15M parameters relative to earlier versions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is the Hyperscale Lottery, a phenomenon in which state-space model architectures are optimized for saturating hyperscale GPUs at the cost of algorithmic efficiency on edge devices. From Mamba-1 to Mamba-3, changes designed for cloud throughput produce a 28% latency increase at 880M parameters and a 48% increase at 15M parameters on edge hardware. The paper claims this divergence from edge-native efficiency threatens the viability of real-time edge intelligence and calls for decoupling cloud-scale saturation strategies from core architectural design.
What carries the argument
The Hyperscale Lottery itself: the optimization of model architectures for cloud throughput at the expense of edge efficiency, which the paper holds responsible for the observed latency penalties in evolved state-space models.
If this is right
- Maintaining original state-space model designs would enable lower-latency inference on edge devices without sacrificing linear complexity.
- Cloud-optimized models like Mamba-3 would show reduced viability for single-user real-time applications compared to predecessors.
- Architectural decisions should prioritize edge metrics alongside cloud throughput to sustain broad deployment.
- The observed penalty worsens at smaller scales, making it particularly detrimental for resource-constrained environments.
Where Pith is reading between the lines
- Similar hyperscale biases may exist in other architecture families, suggesting a broader need to audit efficiency tradeoffs across AI models.
- Reverting or isolating the cloud-specific changes could yield models better suited for embedded systems, testable via targeted benchmarks.
- This dynamic might slow the proliferation of on-device AI, pushing reliance on cloud services instead.
Load-bearing premise
The latency increases observed are caused by the hyperscale-oriented architectural changes in Mamba-3 rather than by differences in implementation, measurement methods, or unrelated factors.
What would settle it
A direct comparison of latency between Mamba-3 and Mamba-1 implementations on the same edge hardware, using identical software frameworks and excluding any cloud-specific optimizations, that shows no significant difference would falsify the claim.
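The proposed settling experiment is straightforward to operationalize. The sketch below is a minimal, hypothetical harness, not the paper's protocol: the model step functions, warmup and run counts, and the use of the median are all assumptions about what "identical software frameworks" would look like in practice.

```python
import statistics
import time

def median_step_latency_ms(step_fn, n_warmup=10, n_runs=100):
    """Median latency of one decoding step, in milliseconds.

    `step_fn` stands in for a single autoregressive step of a model
    already loaded on the target edge device; both models under
    comparison must run through this identical harness.
    """
    for _ in range(n_warmup):  # warm caches / JIT before timing
        step_fn()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        step_fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

def relative_penalty(baseline_ms, candidate_ms):
    """Fractional latency penalty of a candidate vs. a baseline."""
    return candidate_ms / baseline_ms - 1.0

# The abstract's reported penalties correspond to ratios like:
assert abs(relative_penalty(100.0, 128.0) - 0.28) < 1e-9  # 28% at 880M
assert abs(relative_penalty(100.0, 148.0) - 0.48) < 1e-9  # 48% at 15M
```

Under the falsification criterion above, `relative_penalty` for Mamba-3 vs. Mamba-1 on the same edge hardware would need to be statistically indistinguishable from zero.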
Figures
Original abstract
The Hardware Lottery posits that research directions are dictated by available silicon compute platforms. We identify a derivative phenomenon, the Hyperscale Lottery, where model architectures are optimized for cloud throughput at the expense of algorithmic efficiency. While State-Space Models (SSMs) such as Mamba were lauded for their linear complexity, ideal for edge intelligence, their evolution from Mamba-1 to Mamba-3 reveals a systematic divergence from edge-native efficiency. We demonstrate that Mamba-3's architectural changes, designed to saturate hyperscale GPUs, impose a significant edge penalty: a 28% latency increase at 880M parameters, worsening to 48% for 15M-parameter models. We argue for decoupling cloud-scale saturation strategies from core architectural design to preserve the viability of single-user, real-time edge intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'Hyperscale Lottery' as a derivative of the Hardware Lottery, arguing that State-Space Models (SSMs) such as the Mamba series have evolved from Mamba-1 to Mamba-3 with architectural modifications that prioritize saturation of hyperscale GPUs over edge-device efficiency. It supports this with quantitative claims of measured latency penalties on edge hardware (a 28% increase at 880M parameters, worsening to 48% for 15M-parameter models) and calls for decoupling cloud-scale strategies from core architectural design.
Significance. If the empirical measurements hold after proper controls, the work identifies a potentially important trend in SSM evolution that could affect the viability of real-time edge intelligence applications. The conceptual framing provides a useful lens for discussing optimization trade-offs, though its broader impact hinges on the rigor of the supporting data.
major comments (2)
- [Abstract] The specific latency penalty figures (28% at 880M parameters and 48% at 15M parameters) are presented as demonstrated results, yet the abstract (and by extension the results) supplies no experimental methods, hardware details, baselines, measurement protocol, or error analysis. This is load-bearing for the central claim: the attribution of the increases to hyperscale-oriented architectural changes cannot be evaluated without these elements.
- [Results/Evaluation] The argument that latency increases stem from hyperscale-oriented modifications (rather than implementation details such as kernel efficiency, compilation, or unoptimized intermediate states) requires explicit controls and ablations; without them the causal link remains unverified and undermines the edge-penalty conclusion.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights key areas where greater methodological detail will strengthen the paper's claims. We address each major comment below and will revise the manuscript to improve transparency and rigor while preserving the core argument.
Point-by-point responses
-
Referee: [Abstract] The specific latency penalty figures (28% at 880M parameters and 48% at 15M parameters) are presented as demonstrated results, yet the abstract (and by extension the results) supplies no experimental methods, hardware details, baselines, measurement protocol, or error analysis. This is load-bearing for the central claim: the attribution of the increases to hyperscale-oriented architectural changes cannot be evaluated without these elements.
Authors: We agree that the abstract's brevity precludes experimental details and that the results section should enable full evaluation of the reported figures. The current manuscript provides hardware context in the evaluation section, but we acknowledge it lacks a dedicated, explicit description of the measurement protocol, baselines, and error analysis. In revision, we will add a new 'Experimental Methodology' subsection specifying the edge hardware, latency measurement procedure (including run count and aggregation), exact baselines at matched parameter counts, and statistical reporting. This will allow readers to assess the 28% and 48% penalties directly. revision: yes
-
Referee: [Results/Evaluation] The argument that latency increases stem from hyperscale-oriented modifications (rather than implementation details such as kernel efficiency, compilation, or unoptimized intermediate states) requires explicit controls and ablations; without them the causal link remains unverified and undermines the edge-penalty conclusion.
Authors: This concern is well-taken. Our existing comparisons hold hardware platform, software stack, and parameter count fixed across Mamba-1 and Mamba-3 to focus on architectural differences. To strengthen the causal attribution, the revised manuscript will include targeted ablations that isolate specific hyperscale-oriented changes (e.g., state dimension scaling and kernel modifications) while varying only one factor at a time and reporting results under controlled compilation settings. These additions will help rule out implementation artifacts and support the link to the observed edge latency penalties. revision: yes
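The one-factor-at-a-time ablation the authors promise can be sketched as follows. The configuration axes here (`state_dim`, `fused_kernel`, `conv_width`) are hypothetical placeholders standing in for hyperscale-oriented changes, not the paper's actual ablation factors.

```python
# Hypothetical Mamba-1-like and Mamba-3-like configurations; the real
# architectural axes ablated in the paper may differ.
MAMBA1_LIKE = {"state_dim": 16, "fused_kernel": False, "conv_width": 4}
MAMBA3_LIKE = {"state_dim": 128, "fused_kernel": True, "conv_width": 7}

def ofat_configs(baseline, target):
    """Yield (factor, config) pairs, each changing exactly one factor of
    `baseline` to its `target` value, so that any latency shift can be
    attributed to that single architectural change rather than to
    implementation artifacts."""
    for name in baseline:
        cfg = dict(baseline)
        cfg[name] = target[name]
        yield name, cfg

for factor, cfg in ofat_configs(MAMBA1_LIKE, MAMBA3_LIKE):
    # each cfg would be benchmarked under identical compilation settings
    print(factor, "->", cfg)
```

Benchmarking each yielded configuration under a fixed software stack is what would separate architectural causes from kernel or compilation effects.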
Circularity Check
No significant circularity; claims rest on independent empirical measurements
Full rationale
The paper's core argument compares Mamba-1/2/3 architectures via direct latency benchmarks on edge hardware, reporting observed penalties (28% at 880M params, 48% at 15M) without any derivation chain, fitted parameters renamed as predictions, or self-citation that bears the load of the result. Architectural descriptions are presented as inputs to the measurements rather than outputs derived from them, and no equations or uniqueness theorems reduce the findings to tautology. The attribution of penalties to hyperscale-oriented changes is an interpretive claim open to external verification or falsification via re-benchmarking, not a self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: linear-complexity state-space models are inherently ideal for edge intelligence.
invented entities (1)
- Hyperscale Lottery (no independent evidence)
Reference graph
Works this paper leans on
- [1]
- [2] Tri Dao and Albert Gu. 2024. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. arXiv:2405.21060 [cs.LG]. https://arxiv.org/abs/2405.21060
- [3] Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, and Pavlo Molchanov. 2024. Hymba: A Hybrid-head Architecture for Small Language Models. arXiv:2411.13676 [cs.CL]. https://arxiv.org/abs/2411.13676
- [4] Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van Keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models. arXiv:2511.18890 [cs.LG]. https://arxiv...
- [5] Robin Geens, Arne Symons, and Marian Verhelst. 2025. Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration. In 2025 34th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE Computer Society, Los Alamitos, CA, USA, 281–291. doi:10.1109/PACT65351.2025.00034
- [6]
- [7] Albert Gu and Tri Dao. 2024. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In First Conference on Language Modeling. https://openreview.net/forum?id=tEYskw1VY2
- [8]
- [9] Aakash Lahoti, Kevin Li, Berlin Chen, Caitlin Wang, Aviv Bick, J Zico Kolter, Tri Dao, and Albert Gu. 2026. Mamba-3: Improved Sequence Modeling using State Space Principles. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=HwCvaJOiCj
- [10] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. 2025. VideoMamba: State Space Model for Efficient Video Understanding. In Computer Vision – ECCV 2024, Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol (Eds.). Springer Nature Switzerland, Cham, 237–255.
- [11] Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. 2024. RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation. arXiv:2406.04339 [cs.CV]. https://arxiv.org/abs/2406.04339
- [12] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. 2024. VMamba: Visual State Space Model. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 103031–103063. doi:10....
- [13]
- [14] Badri N. Patro and Vijay S. Agneeswaran. 2024. SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv:2403.15360 [cs.CV]. https://arxiv.org/abs/2403.15360
- [15] Arne Symons, Linyan Mei, Steven Colleman, Pouya Houshmand, Sebastian Karl, and Marian Verhelst. 2025. Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators. IEEE Trans. Comput. 74, 1 (2025), 237–249. doi:10.1109/TC.2024.3477938
- [16] Weihao Yu and Xinchao Wang. 2025. MambaOut: Do We Really Need Mamba for Vision?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [17] Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen Li, Yuhang Han, Anthony Wong, Keng Peng Tee, and Marcelo H. Ang Jr. 2024. DRAMA: An Efficient End-to-end Motion Planner for Autonomous Driving with Mamba. arXiv:2408.03601 [cs.RO]. https://arxiv.org/abs/2408.03601
- [18] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. 2024. Vision Mamba: Efficient Visual Representation Learning With Bidirectional State Space Model. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML'24). JMLR.org, Article 2584, 14 pages.