Recognition: 2 theorem links
· Lean Theorem
Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall
Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3
The pith
FDM achieves fixed 867 MB decode memory for sequences up to 8192 tokens by splitting into a wave component and a 272-slot particle cache.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FDM separates sequence processing into two components: a wave component that compresses long-range patterns into a fixed-size complex hidden state through recurrent scans with phase-preserving Givens rotations, and a particle component that retrieves specific tokens via learned associative addressing over W+K=272 slots, independent of sequence length N. This architecture yields strictly O(1) decode memory: 867 MB fixed across prompt lengths of 128-8192 tokens, compared with the Transformer's growth from 853 to 4247 MB. Because joint training of the two components converges poorly, the authors propose Freeze-Scan training, which freezes the recurrent scan and optimizes the cache jointly with the embeddings, reaching PPL=64.9 on WikiText-103 in 44K steps.
What carries the argument
The wave-particle duality consisting of a fixed-size recurrent wave state updated by Givens rotations for pattern compression and a separate 272-slot associative particle cache for exact token retrieval.
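The update rule this machinery implies, a gated, phase-preserving rotation of a complex state, can be sketched in a few lines. The function name and gating shapes below are illustrative, not the paper's code; the only grounded structure is the gated form h_t = (1 − p_t) ⊙ R(θ_t) h_{t−1} + p_t ⊙ (projected input) quoted later in this review.

```python
import numpy as np

def wave_update(h_prev, theta, p, x_proj):
    """One hypothetical wave-component step (a sketch, not the paper's
    implementation): paired real dimensions are treated as complex
    numbers, so a phase-preserving Givens rotation becomes a
    unit-modulus phase factor. The rotated state is then gated toward
    the projected input with a convex gate p in [0, 1]."""
    rotated = h_prev * np.exp(1j * theta)      # |rotated| == |h_prev|
    return (1.0 - p) * rotated + p * x_proj    # gated update

# Tiny example: the rotation alone changes phases, never magnitudes.
h = np.array([1.0 + 1.0j, 2.0 - 0.5j])
theta = np.array([0.3, -1.2])
p = np.array([0.1, 0.4])
x = np.array([0.2 + 0.0j, -0.1 + 0.3j])
h_next = wave_update(h, theta, p, x)
```

The magnitude-preservation property is what "phase-preserving" buys: no matter how many steps are scanned, the rotation itself cannot blow up or wash out the state.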
If this is right
- Strictly constant 867 MB decode memory holds for all tested lengths up to 8192 tokens, a 4.9x reduction versus transformers at the longest length.
- MQAR accuracy reaches 0.966, exceeding the Transformer's 0.606 by 59.5 percent, while a pure scan without the cache scores only 0.011.
- Freeze-Scan training improves convergence to PPL=64.9 on WikiText-103 in 44K steps, a 7.5x gain over full fine-tuning.
- Holographic reference beam decoding using the current input to modulate the hidden state reduces PPL by up to 2.13 points with 1.3M extra parameters.
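For readers unfamiliar with the benchmark behind the MQAR numbers above: multi-query associative recall presents a sequence of key-value pairs followed by queries whose associated values must be reproduced. A toy instance generator, with an illustrative format rather than the benchmark's exact one:

```python
import random

def make_mqar_example(n_pairs, n_queries, vocab=1000, seed=0):
    """Toy multi-query associative recall (MQAR) instance: a context of
    interleaved key-value tokens, a list of queried keys, and the
    values the model must emit. Format is illustrative only."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), n_pairs)          # distinct keys
    kv = {k: rng.randrange(vocab) for k in keys}      # random values
    queries = rng.sample(keys, n_queries)             # keys to recall
    context = [tok for k in keys for tok in (k, kv[k])]
    return context, queries, [kv[q] for q in queries]

ctx, qs, answers = make_mqar_example(8, 3)
```

A pure fixed-state scan must squeeze all pairs into one vector before seeing the queries, which is why the cache-free ablation collapses to near-zero accuracy.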
Where Pith is reading between the lines
- The fixed-size cache approach could extend to streaming or edge-device sequence tasks where memory is strictly bounded.
- If the 272 slots prove robust, hybrid recurrent-associative designs might replace attention-based caches in other long-context applications.
- The holographic interpretation of the hidden state opens a route to test whether reference-beam modulation generalizes to other recurrent architectures.
- The separation of wave and particle components suggests testing whether similar duality can reduce memory in non-language sequence domains such as time-series forecasting.
Load-bearing premise
A fixed cache of only 272 slots can retrieve arbitrary specific tokens from sequences of arbitrary length without degradation.
What would settle it
Running multi-query associative recall on sequences longer than 8192 tokens or with more than 272 distinct items to retrieve would show accuracy drop or memory growth if the O(1) claim fails.
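A hard-slot caricature makes the stakes of that test concrete: a literal 272-slot store must lose recall by the pigeonhole principle once distinct items exceed the slot count. FDM's learned soft addressing may degrade differently, but this is exactly the failure mode the proposed experiment would probe. All names below are illustrative:

```python
def fixed_slot_cache(pairs, n_slots=272):
    """Toy fixed-capacity associative cache: keys hash into a fixed
    number of slots (O(1) memory in sequence length), so later writes
    overwrite earlier ones once distinct keys exceed the slot count."""
    slots = [None] * n_slots
    for k, v in pairs:
        slots[hash(k) % n_slots] = (k, v)   # collision = overwrite
    def recall(k):
        entry = slots[hash(k) % n_slots]
        return entry[1] if entry is not None and entry[0] == k else None
    return recall

# 1000 distinct keys into 272 slots: at most 272 can be recalled.
recall = fixed_slot_cache([(i, 2 * i) for i in range(1000)])
hits = sum(recall(i) == 2 * i for i in range(1000))
```

Memory stays constant, but recall over the full key set is capped by capacity; the open question is whether FDM's learned addressing pushes that cap gracefully or cliffs past it.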
read the original abstract
We present FDM (Fan Duality Model), a linear sequence architecture that resolves the fundamental tension between memory efficiency and associative recall in sequence modeling. FDM separates sequence processing into two components: a wave component (recurrent scan via phase-preserving Givens rotations) that compresses long-range patterns into a fixed-size complex hidden state, and a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192). Beyond the architecture, we discover that jointly training the wave and particle components leads to suboptimal convergence. We propose Freeze-Scan, a two-phase training strategy that freezes the recurrent scan and optimizes the cache jointly with embeddings, achieving PPL=64.9 on WikiText-103 in 44K steps -- a 7.5x improvement over full fine-tuning (PPL=487). On Multi-Query Associative Recall (MQAR), FDM achieves 0.966 accuracy, surpassing Transformer (0.606) by 59.5%, while pure scan without cache scores only 0.011, confirming the necessity of the particle component. Finally, we introduce Holographic Reference Beam Decoding, interpreting the complex hidden state h_t as a holographic plate encoding the entire temporal history. Using the current input x_t as a reference beam to modulate h_t reduces PPL by up to 2.13 points (PPL=62.79) with a 4-head orthogonal reference beam using only 1.3M additional parameters, providing empirical support for the holographic interpretation. Code and pretrained weights: https://github.com/YasongFan/FDM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Fan Duality Model (FDM), a linear sequence architecture that decomposes processing into a wave component (recurrent scan via phase-preserving Givens rotations compressing patterns into a fixed-size complex state) and a particle component (local-global cache using learned associative addressing over a fixed W+K=272 slots independent of sequence length N). It claims strictly O(1) decode memory (fixed 867 MB across N=128 to 8192), superior MQAR accuracy of 0.966 (vs. Transformer 0.606), improved WikiText-103 perplexity of 64.9 via a two-phase Freeze-Scan training strategy, and further PPL gains from Holographic Reference Beam Decoding that modulates the hidden state with the current input.
Significance. If the fixed 272-slot cache sustains high associative recall without degradation at longer contexts and the empirical gains are reproducible, the work would offer a meaningful advance toward memory-efficient sequence models that decouple inference cost from prompt length. The dual wave-particle design, Freeze-Scan procedure, and holographic decoding interpretation provide concrete architectural and training ideas that could be tested in other linear or hybrid architectures.
major comments (3)
- [Abstract] The central claim of strictly O(1) decode memory with superior associative recall rests on the particle component maintaining MQAR accuracy of 0.966 using only 272 fixed slots; however, the manuscript reports no slot-count ablation, no capacity bound, and no results for N>8192, leaving open whether addressing collisions or retrieval failures appear at scale and thereby falsify the joint efficiency-plus-recall guarantee.
- [Abstract] The Freeze-Scan strategy is stated to reach PPL=64.9 in 44K steps (7.5x better than full fine-tuning at PPL=487), yet the description supplies neither the precise freezing schedule for the recurrent scan, nor comparisons against standard optimizers, nor error bars across runs; these omissions make it impossible to isolate the contribution of the two-phase procedure from other training choices.
- [Abstract] Holographic Reference Beam Decoding is reported to lower PPL by up to 2.13 points with a 4-head orthogonal beam (1.3M extra parameters), but the manuscript provides no derivation showing why modulating the complex state h_t with x_t corresponds to a holographic reference beam, nor any ablation against simpler modulation or attention-based alternatives.
minor comments (2)
- The abstract cites concrete memory figures (867 MB fixed vs. Transformer 853-4247 MB) but does not specify the model dimension, precision, or hardware assumptions underlying the MB conversion; adding these details would improve reproducibility.
- The GitHub link for code and pretrained weights is provided, but the manuscript should include a brief reproducibility checklist (random seeds, exact hyper-parameters for the 272-slot cache, and evaluation scripts) to allow independent verification of the reported MQAR and PPL numbers.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and commit to revisions where appropriate to improve clarity and completeness.
read point-by-point responses
- Referee: [Abstract] The central claim of strictly O(1) decode memory with superior associative recall rests on the particle component maintaining MQAR accuracy of 0.966 using only 272 fixed slots; however, the manuscript reports no slot-count ablation, no capacity bound, and no results for N>8192, leaving open whether addressing collisions or retrieval failures appear at scale and thereby falsify the joint efficiency-plus-recall guarantee.
  Authors: We agree that slot-count ablations and results at larger N would provide stronger evidence. The 272 slots (W+K) were selected based on preliminary experiments balancing memory footprint and recall performance. In the revised version, we will include an ablation study varying the number of slots from 64 to 512 and report MQAR accuracy for each. For N>8192, we currently lack results due to resource limitations, but the fixed cache size ensures O(1) memory by design, and we will add a discussion of potential collision risks at scale. This does not falsify the claim for the tested regimes, but we acknowledge the need for further validation. revision: partial
- Referee: [Abstract] The Freeze-Scan strategy is stated to reach PPL=64.9 in 44K steps (7.5x better than full fine-tuning at PPL=487), yet the description supplies neither the precise freezing schedule for the recurrent scan, nor comparisons against standard optimizers, nor error bars across runs; these omissions make it impossible to isolate the contribution of the two-phase procedure from other training choices.
  Authors: We will revise the manuscript to include the precise details of the Freeze-Scan procedure. Specifically, the recurrent scan is trained for the initial 10,000 steps, after which its parameters are frozen, and the particle component along with the embeddings is optimized for the subsequent 34,000 steps. We will also add comparisons to standard optimizers such as Adam and SGD, as well as report standard deviations from multiple runs to demonstrate the robustness of the 7.5x improvement. revision: yes
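The two-phase schedule described in this response can be sketched as a trainable-parameter selector. The step counts (10K then 34K) come from the rebuttal itself; the parameter names are assumed for illustration:

```python
def freeze_scan_schedule(param_names, step, freeze_at=10_000):
    """Hypothetical Freeze-Scan selector: return which parameters are
    trainable at a given step. Phase 1 (step < freeze_at) trains
    everything; phase 2 freezes recurrent-scan parameters and keeps
    optimizing the particle cache and embeddings. The 'scan.' prefix
    is an assumed naming convention, not the paper's."""
    if step < freeze_at:
        return set(param_names)                                  # phase 1
    return {n for n in param_names if not n.startswith("scan.")} # phase 2

params = ["scan.rotation", "scan.gate", "cache.keys", "embed.tokens"]
phase1 = freeze_scan_schedule(params, 5_000)
phase2 = freeze_scan_schedule(params, 20_000)
```

In a real training loop, the selected set would drive `requires_grad` flags or the optimizer's parameter groups.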
- Referee: [Abstract] Holographic Reference Beam Decoding is reported to lower PPL by up to 2.13 points with a 4-head orthogonal beam (1.3M extra parameters), but the manuscript provides no derivation showing why modulating the complex state h_t with x_t corresponds to a holographic reference beam, nor any ablation against simpler modulation or attention-based alternatives.
  Authors: In the revision, we will provide a short derivation in the methods section explaining the holographic analogy: the complex hidden state h_t encodes the history in a manner analogous to a holographic plate, and modulating it with x_t serves as the reference beam to retrieve the encoded information. Furthermore, we will include ablations comparing the orthogonal beam modulation against simpler operations like element-wise addition and a basic cross-attention module, showing the proposed method's advantages in terms of PPL reduction and parameter efficiency. revision: yes
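One minimal reading of the promised derivation, assuming the "reference beam" is a unit-modulus phase pattern derived from the current input, is elementwise modulation of the complex state. This sketch is an interpretation of the analogy, not the paper's implementation:

```python
import numpy as np

def reference_beam_readout(h, x_beam):
    """Hypothetical holographic readout: treat the complex state h as a
    'plate' encoding history, and phases derived from the projected
    current input as a unit-modulus reference beam. Elementwise
    modulation re-illuminates the plate, shifting phases while leaving
    magnitudes untouched."""
    beam = np.exp(1j * x_beam)   # unit-modulus phases from the input
    return h * beam              # modulated state fed to the decoder

h = np.array([0.5 + 2.0j, -1.0 + 0.0j])
x = np.array([0.7, -0.2])
out = reference_beam_readout(h, x)
```

The magnitude-preserving property is what makes the "reference beam" framing more than renaming: the readout selects by phase alignment rather than rescaling the stored content.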
- Experimental results for sequence lengths exceeding 8192 tokens are not available in our current study.
Circularity Check
O(1) decode memory is self-definitional: it follows directly from the fixed 272-slot particle component.
specific steps
- self-definitional · [Abstract]
  "a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192)."
  The claim that the model 'yields strictly O(1) decode memory' follows immediately from the preceding clause, which defines the addressing mechanism with a constant slot count (272) that does not scale with N. No additional derivation or theorem is supplied; the memory-scaling property is a direct consequence of the fixed-size design choice.
full rationale
The paper's headline efficiency result reduces directly to an architectural definition rather than an independent derivation. The wave component is described as fixed-size and the particle component is explicitly given a constant slot count (W+K=272) independent of N; the O(1) memory statement is therefore tautological with that choice. Empirical results on MQAR accuracy, PPL, and Freeze-Scan training are reported separately and do not participate in the circularity. No self-citation chains, ansatz smuggling, or uniqueness theorems appear in the supplied text. The circularity is partial because the recall performance claim remains an empirical assertion rather than a definitional one.
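The definitional character of the claim is easy to see in back-of-envelope form: with a constant number of cache entries, decode memory is independent of sequence length, whereas a KV cache adds one entry per token. The dimensions below are illustrative and do not reproduce the paper's 867 MB figure:

```python
def decode_memory_mb(n_tokens, d_model=1024, n_layers=24,
                     bytes_per_value=2, fdm_slots=272, arch="fdm"):
    """Back-of-envelope decode-cache size in MB. Assumed (not the
    paper's) dimensions: each cache entry stores a key and a value of
    size d_model per layer, at 2 bytes per value. A fixed-slot cache
    has a constant entry count; a Transformer KV cache has one entry
    per token."""
    entries = fdm_slots if arch == "fdm" else n_tokens
    bytes_total = entries * 2 * d_model * n_layers * bytes_per_value
    return bytes_total / 2**20

fdm_short, fdm_long = decode_memory_mb(128), decode_memory_mb(8192)
kv_short = decode_memory_mb(128, arch="kv")
kv_long = decode_memory_mb(8192, arch="kv")
```

This is exactly why the review calls the O(1) statement tautological: it is arithmetic over a design constant, not a theorem. The empirical content lies in the recall and perplexity numbers.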
Axiom & Free-Parameter Ledger
free parameters (2)
- W+K cache slots
- Number of heads for reference beam
axioms (1)
- domain assumption: Phase-preserving Givens rotations can compress long-range sequence patterns into a fixed-size complex hidden state without information loss for downstream tasks.
invented entities (3)
- Wave component (recurrent scan): no independent evidence
- Particle component (local-global cache): no independent evidence
- Holographic reference beam: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "wave component (recurrent scan via phase-preserving Givens rotations) that compresses long-range patterns into a fixed-size complex hidden state, and a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: h_t = (1 − p_t) ⊙ R(θ_t) h_{t−1} + p_t ⊙ (W_r x_t + i W_i x_t)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Simran Arora et al. Simple linear attention language models balance the recall-throughput tradeoff. ICML, 2024.
- [2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv, 2020.
- [3] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms. ICML, 2024.
- [4] Yasong Fan. MIPT-SSM: Scaling language models with O(1) inference cache via phase transitions. arXiv, 2026.
- [5] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023.
- [6] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. ICLR, 2022.
- [7] Juan Maldacena. The large-N limit of superconformal field theories and supergravity. International Journal of Theoretical Physics, 38:1113–1133, 1999.
- [8] Stephen Merity et al. Pointer sentinel mixture models. ICLR, 2017.
- [9] Catherine Olsson et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.
- [10] Bo Peng et al. RWKV: Reinventing RNNs for the transformer era. EMNLP, 2023.
- [11] Yutao Sun et al. Retentive Network: A successor to Transformer for large language models. arXiv:2307.08621, 2023.
- [12] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. NeurIPS, 2017.
- [13] Guangxuan Xiao et al. Efficient streaming language models with attention sinks. ICLR, 2024.
- [14] Manzil Zaheer et al. Big Bird: Transformers for longer sequences. NeurIPS, 2020.
- [15] Zhenyu Zhang et al. H2O: Heavy-hitter oracle for efficient generative inference. NeurIPS, 2023.