Recognition: 2 theorem links
· Lean Theorem
Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall
Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3
The pith
FDM achieves fixed 867 MB decode memory for sequences up to 8192 tokens by splitting into a wave component and a 272-slot particle cache.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FDM separates sequence processing into two components: a wave component that compresses long-range patterns into a fixed-size complex hidden state through recurrent scans with phase-preserving Givens rotations, and a particle component that retrieves specific tokens via learned associative addressing over W+K=272 slots, independent of sequence length N. This architecture yields strictly O(1) decode memory: 867 MB fixed across prompt lengths of 128-8192 tokens, compared with the Transformer's growth from 853 to 4247 MB. Because joint training of the two components converges poorly, the authors propose Freeze-Scan training, which freezes the recurrent scan and optimizes the cache jointly with the embeddings, reaching PPL=64.9 on WikiText-103 in 44K steps.
What carries the argument
The wave-particle duality consisting of a fixed-size recurrent wave state updated by Givens rotations for pattern compression and a separate 272-slot associative particle cache for exact token retrieval.
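The update rule this machinery implies, a gated, phase-preserving rotation of a complex state, can be sketched in a few lines. The function name and gating shapes below are illustrative, not the paper's code; the only grounded structure is the gated form h_t = (1 − p_t) ⊙ R(θ_t) h_{t−1} + p_t ⊙ (projected input) quoted later in this review.

```python
import numpy as np

def wave_update(h_prev, theta, p, x_proj):
    """One hypothetical wave-component step (a sketch, not the paper's
    implementation): paired real dimensions are treated as complex
    numbers, so a phase-preserving Givens rotation becomes a
    unit-modulus phase factor. The rotated state is then gated toward
    the projected input with a convex gate p in [0, 1]."""
    rotated = h_prev * np.exp(1j * theta)      # |rotated| == |h_prev|
    return (1.0 - p) * rotated + p * x_proj    # gated update

# Tiny example: the rotation alone changes phases, never magnitudes.
h = np.array([1.0 + 1.0j, 2.0 - 0.5j])
theta = np.array([0.3, -1.2])
p = np.array([0.1, 0.4])
x = np.array([0.2 + 0.0j, -0.1 + 0.3j])
h_next = wave_update(h, theta, p, x)
```

The magnitude-preservation property is what "phase-preserving" buys: no matter how many steps are scanned, the rotation itself cannot blow up or wash out the state.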
If this is right
- Strictly constant 867 MB decode memory holds for all tested lengths up to 8192 tokens, a 4.9x reduction versus transformers at the longest length.
- MQAR accuracy reaches 0.966, exceeding the Transformer's 0.606 by 59.5 percent, while a pure scan without the cache scores only 0.011.
- Freeze-Scan training improves convergence to PPL=64.9 on WikiText-103 in 44K steps, a 7.5x gain over full fine-tuning.
- Holographic reference beam decoding using the current input to modulate the hidden state reduces PPL by up to 2.13 points with 1.3M extra parameters.
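For readers unfamiliar with the benchmark behind the MQAR numbers above: multi-query associative recall presents a sequence of key-value pairs followed by queries whose associated values must be reproduced. A toy instance generator, with an illustrative format rather than the benchmark's exact one:

```python
import random

def make_mqar_example(n_pairs, n_queries, vocab=1000, seed=0):
    """Toy multi-query associative recall (MQAR) instance: a context of
    interleaved key-value tokens, a list of queried keys, and the
    values the model must emit. Format is illustrative only."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), n_pairs)          # distinct keys
    kv = {k: rng.randrange(vocab) for k in keys}      # random values
    queries = rng.sample(keys, n_queries)             # keys to recall
    context = [tok for k in keys for tok in (k, kv[k])]
    return context, queries, [kv[q] for q in queries]

ctx, qs, answers = make_mqar_example(8, 3)
```

A pure fixed-state scan must squeeze all pairs into one vector before seeing the queries, which is why the cache-free ablation collapses to near-zero accuracy.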
Where Pith is reading between the lines
- The fixed-size cache approach could extend to streaming or edge-device sequence tasks where memory is strictly bounded.
- If the 272 slots prove robust, hybrid recurrent-associative designs might replace attention-based caches in other long-context applications.
- The holographic interpretation of the hidden state opens a route to test whether reference-beam modulation generalizes to other recurrent architectures.
- The separation of wave and particle components suggests testing whether similar duality can reduce memory in non-language sequence domains such as time-series forecasting.
Load-bearing premise
A fixed cache of only 272 slots can retrieve arbitrary specific tokens from sequences of arbitrary length without degradation.
What would settle it
Running multi-query associative recall on sequences longer than 8192 tokens or with more than 272 distinct items to retrieve would show accuracy drop or memory growth if the O(1) claim fails.
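A hard-slot caricature makes the stakes of that test concrete: a literal 272-slot store must lose recall by the pigeonhole principle once distinct items exceed the slot count. FDM's learned soft addressing may degrade differently, but this is exactly the failure mode the proposed experiment would probe. All names below are illustrative:

```python
def fixed_slot_cache(pairs, n_slots=272):
    """Toy fixed-capacity associative cache: keys hash into a fixed
    number of slots (O(1) memory in sequence length), so later writes
    overwrite earlier ones once distinct keys exceed the slot count."""
    slots = [None] * n_slots
    for k, v in pairs:
        slots[hash(k) % n_slots] = (k, v)   # collision = overwrite
    def recall(k):
        entry = slots[hash(k) % n_slots]
        return entry[1] if entry is not None and entry[0] == k else None
    return recall

# 1000 distinct keys into 272 slots: at most 272 can be recalled.
recall = fixed_slot_cache([(i, 2 * i) for i in range(1000)])
hits = sum(recall(i) == 2 * i for i in range(1000))
```

Memory stays constant, but recall over the full key set is capped by capacity; the open question is whether FDM's learned addressing pushes that cap gracefully or cliffs past it.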
read the original abstract
We present FDM (Fan Duality Model), a linear sequence architecture that resolves the fundamental tension between memory efficiency and associative recall in sequence modeling. FDM separates sequence processing into two components: a wave component (recurrent scan via phase-preserving Givens rotations) that compresses long-range patterns into a fixed-size complex hidden state, and a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192). Beyond the architecture, we discover that jointly training the wave and particle components leads to suboptimal convergence. We propose Freeze-Scan, a two-phase training strategy that freezes the recurrent scan and optimizes the cache jointly with embeddings, achieving PPL=64.9 on WikiText-103 in 44K steps -- a 7.5x improvement over full fine-tuning (PPL=487). On Multi-Query Associative Recall (MQAR), FDM achieves 0.966 accuracy, surpassing Transformer (0.606) by 59.5%, while pure scan without cache scores only 0.011, confirming the necessity of the particle component. Finally, we introduce Holographic Reference Beam Decoding, interpreting the complex hidden state h_t as a holographic plate encoding the entire temporal history. Using the current input x_t as a reference beam to modulate h_t reduces PPL by up to 2.13 points (PPL=62.79) with a 4-head orthogonal reference beam using only 1.3M additional parameters, providing empirical support for the holographic interpretation. Code and pretrained weights: https://github.com/YasongFan/FDM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Fan Duality Model (FDM), a linear sequence architecture that decomposes processing into a wave component (recurrent scan via phase-preserving Givens rotations compressing patterns into a fixed-size complex state) and a particle component (local-global cache using learned associative addressing over a fixed W+K=272 slots independent of sequence length N). It claims strictly O(1) decode memory (fixed 867 MB across N=128 to 8192), superior MQAR accuracy of 0.966 (vs. Transformer 0.606), improved WikiText-103 perplexity of 64.9 via a two-phase Freeze-Scan training strategy, and further PPL gains from Holographic Reference Beam Decoding that modulates the hidden state with the current input.
Significance. If the fixed 272-slot cache sustains high associative recall without degradation at longer contexts and the empirical gains are reproducible, the work would offer a meaningful advance toward memory-efficient sequence models that decouple inference cost from prompt length. The dual wave-particle design, Freeze-Scan procedure, and holographic decoding interpretation provide concrete architectural and training ideas that could be tested in other linear or hybrid architectures.
major comments (3)
- [Abstract] The central claim of strictly O(1) decode memory with superior associative recall rests on the particle component maintaining MQAR accuracy of 0.966 using only 272 fixed slots; however, the manuscript reports no slot-count ablation, no capacity bound, and no results for N>8192, leaving open whether addressing collisions or retrieval failures appear at scale and thereby falsify the joint efficiency-plus-recall guarantee.
- [Abstract] The Freeze-Scan strategy is stated to reach PPL=64.9 in 44K steps (7.5x better than full fine-tuning at PPL=487), yet the description supplies neither the precise freezing schedule for the recurrent scan, nor comparisons against standard optimizers, nor error bars across runs; these omissions make it impossible to isolate the contribution of the two-phase procedure from other training choices.
- [Abstract] Holographic Reference Beam Decoding is reported to lower PPL by up to 2.13 points with a 4-head orthogonal beam (1.3M extra parameters), but the manuscript provides no derivation showing why modulating the complex state h_t with x_t corresponds to a holographic reference beam, nor any ablation against simpler modulation or attention-based alternatives.
minor comments (2)
- The abstract cites concrete memory figures (867 MB fixed vs. Transformer 853-4247 MB) but does not specify the model dimension, precision, or hardware assumptions underlying the MB conversion; adding these details would improve reproducibility.
- The GitHub link for code and pretrained weights is provided, but the manuscript should include a brief reproducibility checklist (random seeds, exact hyper-parameters for the 272-slot cache, and evaluation scripts) to allow independent verification of the reported MQAR and PPL numbers.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and commit to revisions where appropriate to improve clarity and completeness.
read point-by-point responses
- Referee: [Abstract] The central claim of strictly O(1) decode memory with superior associative recall rests on the particle component maintaining MQAR accuracy of 0.966 using only 272 fixed slots; however, the manuscript reports no slot-count ablation, no capacity bound, and no results for N>8192, leaving open whether addressing collisions or retrieval failures appear at scale and thereby falsify the joint efficiency-plus-recall guarantee.
  Authors: We agree that slot-count ablations and results at larger N would provide stronger evidence. The 272 slots (W+K) were selected based on preliminary experiments balancing memory footprint and recall performance. In the revised version, we will include an ablation study varying the number of slots from 64 to 512 and report MQAR accuracy for each. For N>8192, we currently lack results due to resource limitations, but the fixed cache size ensures O(1) memory by design, and we will add a discussion of potential collision risks at scale. This does not falsify the claim for the tested regimes, but we acknowledge the need for further validation. revision: partial
- Referee: [Abstract] The Freeze-Scan strategy is stated to reach PPL=64.9 in 44K steps (7.5x better than full fine-tuning at PPL=487), yet the description supplies neither the precise freezing schedule for the recurrent scan, nor comparisons against standard optimizers, nor error bars across runs; these omissions make it impossible to isolate the contribution of the two-phase procedure from other training choices.
  Authors: We will revise the manuscript to include the precise details of the Freeze-Scan procedure. Specifically, the recurrent scan is trained for the initial 10,000 steps, after which its parameters are frozen, and the particle component along with the embeddings is optimized for the subsequent 34,000 steps. We will also add comparisons to standard optimizers such as Adam and SGD, as well as report standard deviations from multiple runs to demonstrate the robustness of the 7.5x improvement. revision: yes
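The two-phase schedule described in this response can be sketched as a trainable-parameter selector. The step counts (10K then 34K) come from the rebuttal itself; the parameter names are assumed for illustration:

```python
def freeze_scan_schedule(param_names, step, freeze_at=10_000):
    """Hypothetical Freeze-Scan selector: return which parameters are
    trainable at a given step. Phase 1 (step < freeze_at) trains
    everything; phase 2 freezes recurrent-scan parameters and keeps
    optimizing the particle cache and embeddings. The 'scan.' prefix
    is an assumed naming convention, not the paper's."""
    if step < freeze_at:
        return set(param_names)                                  # phase 1
    return {n for n in param_names if not n.startswith("scan.")} # phase 2

params = ["scan.rotation", "scan.gate", "cache.keys", "embed.tokens"]
phase1 = freeze_scan_schedule(params, 5_000)
phase2 = freeze_scan_schedule(params, 20_000)
```

In a real training loop, the selected set would drive `requires_grad` flags or the optimizer's parameter groups.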
- Referee: [Abstract] Holographic Reference Beam Decoding is reported to lower PPL by up to 2.13 points with a 4-head orthogonal beam (1.3M extra parameters), but the manuscript provides no derivation showing why modulating the complex state h_t with x_t corresponds to a holographic reference beam, nor any ablation against simpler modulation or attention-based alternatives.
  Authors: In the revision, we will provide a short derivation in the methods section explaining the holographic analogy: the complex hidden state h_t encodes the history in a manner analogous to a holographic plate, and modulating it with x_t serves as the reference beam to retrieve the encoded information. Furthermore, we will include ablations comparing the orthogonal beam modulation against simpler operations like element-wise addition and a basic cross-attention module, showing the proposed method's advantages in terms of PPL reduction and parameter efficiency. revision: yes
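One minimal reading of the promised derivation, assuming the "reference beam" is a unit-modulus phase pattern derived from the current input, is elementwise modulation of the complex state. This sketch is an interpretation of the analogy, not the paper's implementation:

```python
import numpy as np

def reference_beam_readout(h, x_beam):
    """Hypothetical holographic readout: treat the complex state h as a
    'plate' encoding history, and phases derived from the projected
    current input as a unit-modulus reference beam. Elementwise
    modulation re-illuminates the plate, shifting phases while leaving
    magnitudes untouched."""
    beam = np.exp(1j * x_beam)   # unit-modulus phases from the input
    return h * beam              # modulated state fed to the decoder

h = np.array([0.5 + 2.0j, -1.0 + 0.0j])
x = np.array([0.7, -0.2])
out = reference_beam_readout(h, x)
```

The magnitude-preserving property is what makes the "reference beam" framing more than renaming: the readout selects by phase alignment rather than rescaling the stored content.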
- Experimental results for sequence lengths exceeding 8192 tokens are not available in our current study.
Circularity Check
O(1) decode memory is self-definitional: it follows directly from the fixed 272-slot particle component.
specific steps
- self-definitional · [Abstract]
  "a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192)."
  The claim that the model 'yields strictly O(1) decode memory' follows immediately from the preceding clause, which defines the addressing mechanism with a constant slot count (272) that does not scale with N. No additional derivation or theorem is supplied; the memory-scaling property is a direct consequence of the fixed-size design choice.
full rationale
The paper's headline efficiency result reduces directly to an architectural definition rather than an independent derivation. The wave component is described as fixed-size and the particle component is explicitly given a constant slot count (W+K=272) independent of N; the O(1) memory statement is therefore tautological with that choice. Empirical results on MQAR accuracy, PPL, and Freeze-Scan training are reported separately and do not participate in the circularity. No self-citation chains, ansatz smuggling, or uniqueness theorems appear in the supplied text. The circularity is partial because the recall performance claim remains an empirical assertion rather than a definitional one.
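The definitional character of the claim is easy to see in back-of-envelope form: with a constant number of cache entries, decode memory is independent of sequence length, whereas a KV cache adds one entry per token. The dimensions below are illustrative and do not reproduce the paper's 867 MB figure:

```python
def decode_memory_mb(n_tokens, d_model=1024, n_layers=24,
                     bytes_per_value=2, fdm_slots=272, arch="fdm"):
    """Back-of-envelope decode-cache size in MB. Assumed (not the
    paper's) dimensions: each cache entry stores a key and a value of
    size d_model per layer, at 2 bytes per value. A fixed-slot cache
    has a constant entry count; a Transformer KV cache has one entry
    per token."""
    entries = fdm_slots if arch == "fdm" else n_tokens
    bytes_total = entries * 2 * d_model * n_layers * bytes_per_value
    return bytes_total / 2**20

fdm_short, fdm_long = decode_memory_mb(128), decode_memory_mb(8192)
kv_short = decode_memory_mb(128, arch="kv")
kv_long = decode_memory_mb(8192, arch="kv")
```

This is exactly why the review calls the O(1) statement tautological: it is arithmetic over a design constant, not a theorem. The empirical content lies in the recall and perplexity numbers.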
Axiom & Free-Parameter Ledger
free parameters (2)
- W+K cache slots
- Number of heads for reference beam
axioms (1)
- domain assumption: Phase-preserving Givens rotations can compress long-range sequence patterns into a fixed-size complex hidden state without information loss for downstream tasks.
invented entities (3)
- Wave component (recurrent scan): no independent evidence
- Particle component (local-global cache): no independent evidence
- Holographic reference beam: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "wave component (recurrent scan via phase-preserving Givens rotations) that compresses long-range patterns into a fixed-size complex hidden state, and a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: h_t = (1 − p_t) ⊙ R(θ_t) h_{t−1} + p_t ⊙ (W_r x_t + i W_i x_t)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Simran Arora et al. Simple linear attention language models balance the recall-throughput tradeoff. ICML, 2024.
- [2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv, 2020.
- [3] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms. ICML, 2024.
- [4] Yasong Fan. MIPT-SSM: Scaling language models with O(1) inference cache via phase transitions. arXiv, 2026.
- [5] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023.
- [6] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. ICLR, 2022.
- [7] Juan Maldacena. The large-N limit of superconformal field theories and supergravity. International Journal of Theoretical Physics, 38:1113–1133, 1999.
- [8] Stephen Merity et al. Pointer sentinel mixture models. ICLR, 2017.
- [9] Catherine Olsson et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.
- [10] Bo Peng et al. RWKV: Reinventing RNNs for the transformer era. EMNLP, 2023.
- [11] Yutao Sun et al. Retentive Network: A successor to Transformer for large language models. arXiv:2307.08621, 2023.
- [12] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. NeurIPS, 2017.
- [13] Guangxuan Xiao et al. Efficient streaming language models with attention sinks. ICLR, 2024.
- [14] Manzil Zaheer et al. Big Bird: Transformers for longer sequences. NeurIPS, 2020.
- [15] Zhenyu Zhang et al. H2O: Heavy-hitter oracle for efficient generative inference. NeurIPS, 2023.