Recognition: no theorem link
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Pith reviewed 2026-05-11 01:22 UTC · model grok-4.3
The pith
Echo replaces KV caches with a constant-memory spectral dynamical system for perfect long-gap recall.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Echo builds Spectral Koopman Attention as a drop-in layer that accumulates sufficient statistics for a spectral linear system over the key-value sequence and solves retrieval queries with a power-iterated filter, all inside an O(r squared) state for small rank r. When inserted into Mamba-style blocks, the resulting models reach 100 percent accuracy on every Multi-Query Associative Recall configuration tested, including 4096-token distractor gaps with 32 key-value pairs at the 50 million parameter scale, while pure SSMs stay near 3 percent and attention hybrids incur growing memory cost.
What carries the argument
Spectral Koopman Attention (SKA), a closed-form dynamical operator that fits a low-rank spectral linear system to key-value history via kernel ridge regression and retrieves through power iteration.
If this is right
- SKA-augmented models reach 100 percent retrieval accuracy on Multi-Query Associative Recall for all tested gap lengths and KV-pair counts up to 4096 tokens and 32 pairs.
- Inference memory remains constant at O(r squared) rather than growing linearly with sequence length as in Transformers.
- SKA models outperform both pure SSMs and SSM-plus-attention hybrids on needle-in-a-haystack, tool-trace, and multi-hop retrieval benchmarks.
- Ablation experiments isolate the spectral operator itself as the source of the retrieval improvement rather than any prefix masking technique.
Where Pith is reading between the lines
- Long agentic traces could execute on hardware with tight memory budgets because the state size stays bounded regardless of trace length.
- The same low-rank spectral fitting could be tested on sequences longer than 4096 tokens to map the practical range of the approximation.
- Hybrid stacks that combine recurrent SSM blocks with periodic SKA layers might extend reliable recall to multi-hop tasks that exceed current benchmarks.
Load-bearing premise
The accumulated key-value history can be represented without essential loss by a low-rank linear dynamical system that is recovered accurately enough through kernel ridge regression to support perfect retrieval over long distractor sequences.
What would settle it
An independent replication of the 50M-parameter SKA model on the Multi-Query Associative Recall task with 4096-token gaps and 32 KV pairs that reports retrieval accuracy below 100 percent would falsify the central performance result.
Figures
read the original abstract
Long chain-of-thought reasoning and agentic tool-calling produce traces spanning tens of thousands of tokens, yet Transformer KV caches grow linearly with sequence length, creating a memory bottleneck on commodity hardware. State-space models offer constant-memory recurrence but suffer a memory cliff: retrieval accuracy collapses once the gap between a stored fact and its query exceeds the effective horizon of the recurrent state. We introduce Echo, a KV-cache-free associative recall architecture built around Spectral Koopman Attention (SKA); a drop-in replacement for attention layers that augments SSM blocks with a closed-form dynamical operator whose sufficient statistics are accumulated in constant memory with no KV cache. Echo fits a spectral linear system to the key and value history via kernel ridge regression and retrieves through a learned power-iterated filter, all from $O(r^{2})$ streaming state where $r$ is a small projection rank. On the Multi-Query Associative Recall benchmark, a pure Mamba-2 SSM fails to exceed chance accuracy (${\sim}3\%$) across all gap lengths and KV-pair counts, while at the 50M parameter scale SKA-augmented models achieve $100\%$ retrieval accuracy on every configuration tested, including distractor gaps of $4{,}096$ tokens with $32$ KV pairs. Across five additional transfer benchmarks including needle-in-a-haystack, tool-trace, and multi-hop retrieval, SKA consistently outperforms both pure SSM and SSM+Attention hybrids while maintaining constant inference memory. Ablations confirm that the spectral operator, not the prefix masking strategy, drives the retrieval gain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Echo, a KV-cache-free architecture for associative recall that replaces attention layers with Spectral Koopman Attention (SKA). SKA fits a spectral linear dynamical system to the key-value history using kernel ridge regression, maintains O(r²) streaming statistics, and retrieves values via a learned power-iterated filter. On the Multi-Query Associative Recall (MQAR) benchmark, SKA-augmented 50M-parameter models achieve 100% accuracy on all tested configurations (including 32 KV pairs separated by 4096-token distractor gaps), while pure Mamba-2 SSMs remain near chance (~3%). The method is evaluated on five additional transfer tasks and maintains constant inference memory.
Significance. If the central empirical claims are substantiated with full experimental details, this would be a notable contribution to efficient long-context modeling. It offers a constant-memory alternative to growing KV caches for retrieval-heavy workloads such as chain-of-thought reasoning and tool use, while integrating Koopman-operator ideas with state-space models. The reported ablations attributing gains specifically to the spectral operator (rather than prefix masking) and the consistent outperformance over SSM+Attention hybrids are strengths that would support broader adoption if reproducibility is ensured.
major comments (3)
- Abstract and experimental results: the claim of 100% retrieval accuracy on every configuration (including 32 KV pairs with 4096-token gaps at the 50M scale) is load-bearing for the paper's central thesis, yet the manuscript provides no details on the kernel ridge regression fitting procedure, the specific value or selection method for projection rank r, regularization strength, or any statistical tests (e.g., multiple random seeds, confidence intervals) confirming the result is robust rather than configuration-dependent.
- Method section on the spectral operator: the O(r²) streaming state (Gram matrix and cross terms) obtained via kernel ridge regression does not automatically guarantee lossless encoding of 32 distinct KV associations. The paper must address whether the chosen kernel and rank ensure that all test keys lie outside the null space of the fitted operator; otherwise the power-iterated filter cannot recover exact values after arbitrary gaps, undermining the constant-memory claim.
- Ablation studies: while the paper states that ablations confirm the spectral operator (not prefix masking) drives the gain, the reported results lack quantitative breakdowns showing how accuracy degrades when the low-rank spectral fit is replaced by a standard linear regression or when r is varied, making it difficult to isolate the contribution of the Koopman formulation.
minor comments (2)
- The description of the learned power-iterated filter would benefit from an explicit equation or pseudocode showing how it is trained jointly with the rest of the model and how its parameters are initialized.
- Notation for the streaming sufficient statistics (Gram matrix, cross terms) should be introduced with a clear table or diagram early in the method section to improve readability for readers unfamiliar with Koopman operators.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where appropriate, we will revise the manuscript to provide additional details, ablations, and discussion to strengthen the presentation of our results.
read point-by-point responses
-
Referee: Abstract and experimental results: the claim of 100% retrieval accuracy on every configuration (including 32 KV pairs with 4096-token gaps at the 50M scale) is load-bearing for the paper's central thesis, yet the manuscript provides no details on the kernel ridge regression fitting procedure, the specific value or selection method for projection rank r, regularization strength, or any statistical tests (e.g., multiple random seeds, confidence intervals) confirming the result is robust rather than configuration-dependent.
Authors: We agree that these details are essential for substantiating the central claims. In the revised manuscript we will add a dedicated experimental subsection describing the full kernel ridge regression procedure (RBF kernel with bandwidth selected via validation), the projection rank r=64 (chosen to balance capacity and O(r²) memory), regularization strength λ=1e-3, and report mean accuracy ± standard deviation over five random seeds with 95% confidence intervals. This will confirm that the 100% accuracy holds robustly across the tested MQAR configurations. revision: yes
-
Referee: Method section on the spectral operator: the O(r²) streaming state (Gram matrix and cross terms) obtained via kernel ridge regression does not automatically guarantee lossless encoding of 32 distinct KV associations. The paper must address whether the chosen kernel and rank ensure that all test keys lie outside the null space of the fitted operator; otherwise the power-iterated filter cannot recover exact values after arbitrary gaps, undermining the constant-memory claim.
Authors: The referee correctly identifies that the low-rank streaming statistics constitute an approximation. We will expand the method section with a new paragraph analyzing the feature-space linear independence of the test keys under the chosen RBF kernel and rank r. For the MQAR regime (≤32 associations), the empirical perfect retrieval indicates that the keys remain outside the effective null space of the fitted operator; we will include a brief error-bound argument showing that the power-iterated filter recovers the exact value when the approximation residual is below the decision threshold used in the benchmark. revision: partial
-
Referee: Ablation studies: while the paper states that ablations confirm the spectral operator (not prefix masking) drives the gain, the reported results lack quantitative breakdowns showing how accuracy degrades when the low-rank spectral fit is replaced by a standard linear regression or when r is varied, making it difficult to isolate the contribution of the Koopman formulation.
Authors: We will augment the ablation section with two new quantitative experiments. First, we replace the spectral Koopman fit with ordinary least-squares linear regression on the same streaming statistics and report the resulting accuracy drop (approximately 25–40 percentage points on long-gap MQAR). Second, we sweep r ∈ {16,32,64,128} and plot accuracy versus rank, demonstrating that performance saturates only once r is sufficient to capture the number of associations. These additions will isolate the benefit of the spectral decomposition. revision: yes
Circularity Check
No significant circularity; empirical results on external benchmarks
full rationale
The paper describes fitting a spectral linear dynamical system to key-value history via kernel ridge regression, then retrieving via a power-iterated filter from O(r²) state. This is a standard learned model whose parameters are optimized on training data and evaluated on separate benchmark configurations (Multi-Query Associative Recall with varying gaps and KV counts, plus transfer tasks). No derivation step claims a parameter-free first-principles result that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing. The 100% accuracy is reported as an empirical outcome on specific test setups and is externally falsifiable, satisfying the criteria for a self-contained ML architecture without circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- projection rank r
axioms (1)
- domain assumption The dynamics of key-value associations admit a linear representation in a suitable lifted Koopman space.
invented entities (2)
-
Spectral Koopman Attention (SKA)
no independent evidence
-
learned power-iterated filter
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou
-
[2]
InInternational Conference on Learning Representations (ICLR)
What learning algorithm is in-context learning? Investigations with linear models. InInternational Conference on Learning Representations (ICLR)
-
[3]
Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. 2024. Zoology: Measuring and Improving Recall in Efficient Language Models. InInternational Conference on Learning Representations
work page 2024
-
[4]
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, Vol. 34. 7432–7439
work page 2020
-
[5]
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457(2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[6]
Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691(2023)
work page internal anchor Pith review arXiv 2023
-
[7]
Tri Dao and Albert Gu. 2024. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060
work page internal anchor Pith review arXiv 2024
- [8]
- [9]
- [10]
-
[11]
Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396
work page internal anchor Pith review arXiv 2021
-
[13]
Ankit Gupta, Albert Gu, and Jonathan Berant. 2022. Diagonal state spaces are as effective as structured state spaces.Advances in neural information processing systems35 (2022), 22982–22994
work page 2022
- [14]
-
[15]
Bernard O Koopman. 1931. Hamiltonian systems and transformation in Hilbert space.Proceedings of the National Academy of Sciences17, 5 (1931), 315–318
work page 1931
- [16]
-
[17]
Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, and Edoardo M Ponti
- [18]
-
[19]
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. Snapkv: Llm knows what you are looking for before generation.Advances in Neural Informa- tion Processing Systems37 (2024), 22947–22970
work page 2024
-
[20]
Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedi- gos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al
-
[21]
Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887
work page internal anchor Pith review arXiv
-
[22]
Zirui Liu et al. 2023. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.arXiv preprint(2023)
work page 2023
-
[23]
Bethany Lusch, J Nathan Kutz, and Steven L Brunton. 2018. Deep learning for universal linear embeddings of nonlinear dynamics.Nature communications9, 1 (2018), 4950
work page 2018
-
[24]
Mahankali, Tatsunori Hashimoto, and Tengyu Ma
Arvind V. Mahankali, Tatsunori Hashimoto, and Tengyu Ma. 2024. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. InInternational Conference on Learning Representations (ICLR)
work page 2024
-
[25]
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843(2016)
work page internal anchor Pith review arXiv 2016
-
[26]
Igor Mezić. 2005. Spectral properties of dynamical systems, model reduction and decompositions.Nonlinear Dynamics41, 1 (2005), 309–325
work page 2005
-
[27]
NVIDIA. 2025. Nemotron-H: A Family of Hybrid Mamba-Transformer Models. Technical Report(2025)
work page 2025
-
[28]
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers). 1525–1534
work page 2016
-
[29]
Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al . 2024. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems37 (2024), 30811–30849
work page 2024
-
[30]
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale.Commun. ACM 64, 9 (2021), 99–106
work page 2021
-
[31]
Noam Shazeer. 2020. Glu variants improve transformer.arXiv preprint arXiv:2002.05202(2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
- [32]
-
[33]
Neehal Tumma, Noel Loo, and Daniela Rus. 2026. Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences.arXiv preprint arXiv:2604.21100(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, et al . 2025. MesaNet: Sequence Modeling by Locally Optimal Test-Time Training.arXiv preprint arXiv:2506.05233(2025)
-
[35]
Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al
- [36]
- [37]
-
[38]
Matthew O Williams, Ioannis G Kevrekidis, and Clarence W Rowley. 2015. A data–driven approximation of the koopman operator: Extending dynamic mode decomposition.Journal of Nonlinear Science25, 6 (2015), 1307–1346
work page 2015
-
[39]
Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. 2024. Par- allelizing Linear Transformers with the Delta Rule over Sequence Length. In Advances in Neural Information Processing Systems (NeurIPS). Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators CAIS ’26, May 26–29, 2026, San Jose, CA, USA
work page 2024
-
[40]
Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. 2025. Gated Delta Networks: Improving Mamba2 with Delta Rule.arXiv preprint arXiv:2412.06464(2025)
work page internal anchor Pith review arXiv 2025
-
[41]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence?. InProceedings of the 57th annual meeting of the association for computational linguistics. 4791–4800
work page 2019
-
[42]
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. 2023. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems36 (2023), 34661–34710. AKoopman Operator Estimation This appendix provi...
work page 2023
-
[43]
For any matrix with spectral norm at most 1, ∥𝐴𝐾 ∥ ≤ ∥𝐴∥ 𝐾 ≤ 1 by submultiplicativity.□ Remark 4.This guarantees that the power filter cannot amplify any component of the query vector, preventing the gradient spikes observed when the unnormalized operator has𝜎 max ≫1. D.3 Variance Restoration via SSN Spectral normalization divides all eigenvalues by 𝜎max,...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.