pith. machine review for the scientific record.

arxiv: 2604.07716 · v2 · submitted 2026-04-09 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Breaking the KV Cache Bottleneck: Fan Duality Model Achieves O(1) Decode Memory with Superior Associative Recall

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3

classification 💻 cs.LG
keywords Fan Duality Model · linear sequence modeling · O(1) decode memory · associative recall · KV cache bottleneck · Givens rotations · recurrent scan · Freeze-Scan training

The pith

FDM achieves a fixed 867 MB decode memory for sequences up to 8192 tokens by splitting sequence processing into a recurrent wave component and a 272-slot particle cache.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The Fan Duality Model resolves the tradeoff between memory efficiency and associative recall in sequence modeling by using a recurrent wave component to compress long-range patterns into a fixed-size complex hidden state via phase-preserving Givens rotations. A separate particle component then retrieves specific tokens through learned associative addressing that relies on only 272 slots, regardless of sequence length. This separation produces strictly constant O(1) memory during decoding, in contrast to the linearly growing key-value cache of transformers. The paper also introduces Freeze-Scan training, which freezes the recurrent scan and optimizes the cache jointly with the embeddings, yielding improved perplexity and superior performance on multi-query associative recall benchmarks.

Core claim

FDM separates sequence processing into a wave component, which compresses long-range patterns into a fixed-size complex hidden state through recurrent scans with phase-preserving Givens rotations, and a particle component, which retrieves specific tokens via learned associative addressing over W+K=272 slots independent of sequence length N. This architecture yields strictly O(1) decode memory, a fixed 867 MB across prompt lengths of 128-8192 tokens, compared with the Transformer's 853-4247 MB that grows with length. Joint training of the two components converges poorly, so Freeze-Scan training, which freezes the recurrent scan and optimizes the cache together with the embeddings, achieves PPL=64.9 on WikiText-103 in 44K steps.
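
A minimal sketch of what a phase-preserving rotation scan of this kind could look like. The angle map, the input injection, and the absence of gating below are illustrative assumptions, not the paper's exact update rule; the only property the sketch is meant to show is that the state has fixed size and the per-step multiplier has unit modulus.

```python
# Hedged sketch: fixed-size complex state updated by input-dependent rotations.
# W_theta, W_in, and the lack of gating are assumptions for illustration.
import torch

def wave_scan(x, W_theta, W_in):
    """x: (T, d_in); W_theta: (d_in, d_state); W_in: (d_in, 2*d_state)."""
    d_state = W_theta.shape[1]
    h = torch.zeros(d_state, dtype=torch.cfloat)          # fixed size, independent of T
    for x_t in x:                                         # recurrent scan over time
        theta = x_t @ W_theta                             # input-dependent rotation angles
        rot = torch.polar(torch.ones_like(theta), theta)  # unit modulus: a Givens rotation per 2-D plane
        u = x_t @ W_in
        u = torch.complex(u[:d_state], u[d_state:])       # map the real input into the complex state
        h = rot * h + u                                   # phase-preserving update, |rot| = 1
    return h                                              # decode memory stays O(1) in sequence length
```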

What carries the argument

The wave-particle duality: a fixed-size recurrent wave state updated by Givens rotations for pattern compression, paired with a separate 272-slot associative particle cache for exact token retrieval.
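
The particle side can be pictured as a fixed bank of key-value slots read by soft associative addressing. The sketch below is hedged: the paper does not spell out the write or eviction policy or the W versus K split of the 272 slots, so the ring-buffer write here is an assumption made only to show that slot count, and hence memory, never grows with sequence length.

```python
# Hedged sketch of a fixed-slot associative cache (write policy is assumed).
import torch
import torch.nn.functional as F

class SlotCache:
    def __init__(self, n_slots=272, d_key=64, d_val=64):
        self.keys = torch.zeros(n_slots, d_key)
        self.vals = torch.zeros(n_slots, d_val)
        self.ptr = 0                                        # simple ring-buffer write head (assumption)

    def write(self, k, v):
        self.keys[self.ptr] = k
        self.vals[self.ptr] = v
        self.ptr = (self.ptr + 1) % self.keys.shape[0]      # slot count never grows with N

    def read(self, q):
        scores = F.softmax(self.keys @ q / self.keys.shape[1] ** 0.5, dim=0)
        return scores @ self.vals                           # soft associative retrieval
```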

If this is right

  • Strictly constant 867 MB decode memory holds for all tested lengths up to 8192 tokens, a 4.9x reduction versus transformers at the longest length (a back-of-envelope memory comparison follows this list).
  • MQAR accuracy reaches 0.966, exceeding the transformer's 0.606 by 59.5 percent, while a pure scan without the cache scores only 0.011.
  • Freeze-Scan training improves convergence to PPL=64.9 on WikiText-103 in 44K steps, a 7.5x gain over full fine-tuning.
  • Holographic reference beam decoding, which uses the current input to modulate the hidden state, reduces PPL by up to 2.13 points with 1.3M extra parameters.
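
A back-of-envelope comparison of the two memory curves. The configuration below (24 layers, 16 heads of dimension 64, d_model 1024, fp16) is hypothetical, since the paper does not state model size or precision, and it will not reproduce the 867 MB or 853-4247 MB figures, which also include weights and activations; it only illustrates why a per-token KV cache grows with N while a fixed slot cache does not.

```python
# Hedged arithmetic: per-token KV cache vs. a fixed-slot cache (assumed dims).
def kv_cache_mb(n_layers, n_heads, head_dim, seq_len, bytes_per=2):
    # Transformer decode caches keys and values per layer, per head, per token.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per / 1e6

def fixed_cache_mb(n_layers, n_slots, d_model, bytes_per=2):
    # FDM-style particle cache: the slot count (e.g. W+K = 272) does not scale with N.
    return n_layers * n_slots * d_model * bytes_per / 1e6

for n in (128, 1024, 8192):
    print(n, kv_cache_mb(24, 16, 64, n), fixed_cache_mb(24, 272, 1024))
```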

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-size cache approach could extend to streaming or edge-device sequence tasks where memory is strictly bounded.
  • If the 272 slots prove robust, hybrid recurrent-associative designs might replace attention-based caches in other long-context applications.
  • The holographic interpretation of the hidden state opens a route to test whether reference-beam modulation generalizes to other recurrent architectures.
  • The separation of wave and particle components suggests testing whether similar duality can reduce memory in non-language sequence domains such as time-series forecasting.

Load-bearing premise

A fixed cache of only 272 slots can retrieve arbitrary specific tokens from sequences of arbitrary length without degradation.

What would settle it

Running multi-query associative recall on sequences longer than 8192 tokens, or with more than 272 distinct items to retrieve, would reveal whether accuracy drops or memory grows, either of which would break the joint O(1)-memory and recall claim.
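
A hedged sketch of such a probe, using a generic multi-query associative recall layout (key-value pairs followed by queried keys); the exact MQAR format, vocabulary, and tokenization used in the paper may differ.

```python
# Hedged MQAR-style stress test: more distinct items than the 272 cache slots.
import random

def make_mqar_example(n_pairs=512, n_queries=64, vocab=4096, seed=0):
    rng = random.Random(seed)
    keys = rng.sample(range(vocab // 2), n_pairs)                  # distinct keys
    vals = [rng.randrange(vocab // 2, vocab) for _ in keys]        # values from a disjoint range
    kv = dict(zip(keys, vals))
    context = [tok for k, v in zip(keys, vals) for tok in (k, v)]  # "k1 v1 k2 v2 ..."
    queried = rng.sample(keys, n_queries)
    targets = [kv[k] for k in queried]                             # expected answers
    return context + queried, targets

# 512 distinct items to recall, well above the 272 slots; if the load-bearing
# premise holds, accuracy should not collapse as n_pairs or sequence length grows.
tokens, targets = make_mqar_example(n_pairs=512)
```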

read the original abstract

We present FDM (Fan Duality Model), a linear sequence architecture that resolves the fundamental tension between memory efficiency and associative recall in sequence modeling. FDM separates sequence processing into two components: a wave component (recurrent scan via phase-preserving Givens rotations) that compresses long-range patterns into a fixed-size complex hidden state, and a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192). Beyond the architecture, we discover that jointly training the wave and particle components leads to suboptimal convergence. We propose Freeze-Scan, a two-phase training strategy that freezes the recurrent scan and optimizes the cache jointly with embeddings, achieving PPL=64.9 on WikiText-103 in 44K steps -- a 7.5x improvement over full fine-tuning (PPL=487). On Multi-Query Associative Recall (MQAR), FDM achieves 0.966 accuracy, surpassing Transformer (0.606) by 59.5%, while pure scan without cache scores only 0.011, confirming the necessity of the particle component. Finally, we introduce Holographic Reference Beam Decoding, interpreting the complex hidden state h_t as a holographic plate encoding the entire temporal history. Using the current input x_t as a reference beam to modulate h_t reduces PPL by up to 2.13 points (PPL=62.79) with a 4-head orthogonal reference beam using only 1.3M additional parameters, providing empirical support for the holographic interpretation. Code and pretrained weights: https://github.com/YasongFan/FDM
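
For a concrete picture of the reference-beam idea, the following is a speculative sketch only: the abstract says the current input x_t modulates the complex state h_t, but it does not give the modulation form, so the per-head unit-modulus phase map, the conjugate multiplication, and the real readout below are assumptions, and the orthogonality of the 4 heads is not modeled.

```python
# Speculative sketch of reference-beam readout (all names and shapes assumed).
import torch

def reference_beam_readout(h, x_t, W_phase, W_out):
    """h: (d_state,) complex state; x_t: (d_in,); W_phase: (n_heads, d_in, d_state); W_out: real."""
    outs = []
    for W in W_phase:                                        # one "beam" per head
        beam = torch.polar(torch.ones(W.shape[1]), x_t @ W)  # unit-modulus reference beam from x_t
        mod = beam.conj() * h                                 # modulate the "holographic plate" h_t
        outs.append(torch.cat([mod.real, mod.imag]))          # back to real features
    return torch.cat(outs) @ W_out                            # readout feeding the LM head
```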

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Fan Duality Model (FDM), a linear sequence architecture that decomposes processing into a wave component (recurrent scan via phase-preserving Givens rotations compressing patterns into a fixed-size complex state) and a particle component (local-global cache using learned associative addressing over a fixed W+K=272 slots independent of sequence length N). It claims strictly O(1) decode memory (fixed 867 MB across N=128 to 8192), superior MQAR accuracy of 0.966 (vs. Transformer 0.606), improved WikiText-103 perplexity of 64.9 via a two-phase Freeze-Scan training strategy, and further PPL gains from Holographic Reference Beam Decoding that modulates the hidden state with the current input.

Significance. If the fixed 272-slot cache sustains high associative recall without degradation at longer contexts and the empirical gains are reproducible, the work would offer a meaningful advance toward memory-efficient sequence models that decouple inference cost from prompt length. The dual wave-particle design, Freeze-Scan procedure, and holographic decoding interpretation provide concrete architectural and training ideas that could be tested in other linear or hybrid architectures.

major comments (3)
  1. [Abstract] The central claim of strictly O(1) decode memory with superior associative recall rests on the particle component maintaining MQAR accuracy of 0.966 using only 272 fixed slots; however, the manuscript reports no slot-count ablation, no capacity bound, and no results for N>8192, leaving open whether addressing collisions or retrieval failures appear at scale, which would falsify the joint efficiency-plus-recall guarantee.
  2. [Abstract] The Freeze-Scan strategy is stated to reach PPL=64.9 in 44K steps (7.5x better than full fine-tuning at PPL=487), yet the description supplies neither the precise freezing schedule for the recurrent scan, nor comparisons against standard optimizers, nor error bars across runs; these omissions make it impossible to isolate the contribution of the two-phase procedure from other training choices.
  3. [Abstract] Holographic Reference Beam Decoding is reported to lower PPL by up to 2.13 points with a 4-head orthogonal beam (1.3M extra parameters), but the manuscript provides no derivation showing why modulating the complex state h_t with x_t corresponds to a holographic reference beam, nor any ablation against simpler modulation or attention-based alternatives.
minor comments (2)
  1. The abstract cites concrete memory figures (867 MB fixed vs. Transformer 853-4247 MB) but does not specify the model dimension, precision, or hardware assumptions underlying the MB conversion; adding these details would improve reproducibility.
  2. The GitHub link for code and pretrained weights is provided, but the manuscript should include a brief reproducibility checklist (random seeds, exact hyper-parameters for the 272-slot cache, and evaluation scripts) to allow independent verification of the reported MQAR and PPL numbers.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and commit to revisions where appropriate to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract] The central claim of strictly O(1) decode memory with superior associative recall rests on the particle component maintaining MQAR accuracy of 0.966 using only 272 fixed slots; however, the manuscript reports no slot-count ablation, no capacity bound, and no results for N>8192, leaving open whether addressing collisions or retrieval failures appear at scale, which would falsify the joint efficiency-plus-recall guarantee.

    Authors: We agree that slot-count ablations and results at larger N would provide stronger evidence. The 272 slots (W+K) were selected based on preliminary experiments balancing memory footprint and recall performance. In the revised version, we will include an ablation study varying the number of slots from 64 to 512 and report MQAR accuracy for each. For N>8192, we currently lack results due to resource limitations, but the fixed cache size ensures O(1) memory by design, and we will add a discussion on potential collision risks at scale. This does not falsify the claim for the tested regimes, but we acknowledge the need for further validation. revision: partial

  2. Referee: [Abstract] The Freeze-Scan strategy is stated to reach PPL=64.9 in 44K steps (7.5x better than full fine-tuning at PPL=487), yet the description supplies neither the precise freezing schedule for the recurrent scan, nor comparisons against standard optimizers, nor error bars across runs; these omissions make it impossible to isolate the contribution of the two-phase procedure from other training choices.

    Authors: We will revise the manuscript to include the precise details of the Freeze-Scan procedure. Specifically, the recurrent scan is trained for the initial 10,000 steps, after which its parameters are frozen, and the particle component along with the embeddings are optimized for the subsequent 34,000 steps (a minimal sketch of this two-phase schedule follows these responses). We will also add comparisons to standard optimizers such as Adam and SGD, as well as report standard deviations across multiple runs to demonstrate the robustness of the 7.5x improvement. revision: yes

  3. Referee: [Abstract] Holographic Reference Beam Decoding is reported to lower PPL by up to 2.13 points with a 4-head orthogonal beam (1.3M extra parameters), but the manuscript provides no derivation showing why modulating the complex state h_t with x_t corresponds to a holographic reference beam, nor any ablation against simpler modulation or attention-based alternatives.

    Authors: In the revision, we will provide a short derivation in the methods section explaining the holographic analogy: the complex hidden state h_t encodes the history in a manner analogous to a holographic plate, and modulating it with x_t serves as the reference beam to retrieve the encoded information. Furthermore, we will include ablations comparing the orthogonal beam modulation against simpler operations like element-wise addition and a basic cross-attention module, showing the proposed method's advantages in terms of PPL reduction and parameter efficiency. revision: yes
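
A minimal sketch of the two-phase schedule described in response 2, with phase boundaries taken from the rebuttal (10K steps, then freeze, then roughly 34K more). The optimizer, learning rate, the module name wave_scan, and the assumption that the model returns its loss directly are placeholders, not details from the paper.

```python
# Hedged sketch of Freeze-Scan: train everything, then freeze the recurrent scan.
import torch

def freeze_scan_train(model, batches, phase1_steps=10_000, total_steps=44_000):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)    # assumed optimizer and lr
    for step, batch in enumerate(batches):
        if step == total_steps:
            break
        if step == phase1_steps:
            # Phase 2: freeze the wave (recurrent scan) parameters; the particle
            # cache and embeddings keep training for the remaining ~34K steps.
            for p in model.wave_scan.parameters():           # hypothetical module name
                p.requires_grad_(False)
        loss = model(batch)                                  # assumed to return the LM loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```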

standing simulated objections not resolved
  • Experimental results for sequence lengths exceeding 8192 tokens are not available in our current study.

Circularity Check

1 step flagged

O(1) decode memory is self-definitional given the fixed 272-slot particle component

specific steps
  1. self-definitional [Abstract]
    "a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192)."

    The claim that the model 'yields strictly O(1) decode memory' follows immediately from the preceding clause that defines the addressing mechanism with a constant slot count (272) that does not scale with N. No additional derivation or theorem is supplied; the memory scaling property is the direct consequence of the fixed-size design choice.

full rationale

The paper's headline efficiency result reduces directly to an architectural definition rather than an independent derivation. The wave component is described as fixed-size and the particle component is explicitly given a constant slot count (W+K=272) independent of N; the O(1) memory statement is therefore tautological with that choice. Empirical results on MQAR accuracy, PPL, and Freeze-Scan training are reported separately and do not participate in the circularity. No self-citation chains, ansatz smuggling, or uniqueness theorems appear in the supplied text. The circularity is partial because the recall performance claim remains an empirical assertion rather than a definitional one.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 3 invented entities

The central O(1) memory claim rests on the assumption that 272 learned slots suffice for associative recall; the wave component assumes phase-preserving Givens rotations can compress arbitrary long-range patterns without loss.

free parameters (2)
  • W+K cache slots
    Fixed at 272 independent of N; chosen to achieve the reported memory footprint.
  • Number of heads for reference beam
    Set to 4 in the holographic decoding experiment.
axioms (1)
  • domain assumption: Phase-preserving Givens rotations can compress long-range sequence patterns into a fixed-size complex hidden state without information loss for downstream tasks.
    Invoked in the wave component description.
invented entities (3)
  • Wave component (recurrent scan) · no independent evidence
    purpose: Compresses long-range patterns into fixed complex state
    New architectural split introduced in the paper.
  • Particle component (local-global cache) · no independent evidence
    purpose: Learned associative addressing with fixed slots
    New architectural split introduced in the paper.
  • Holographic reference beam · no independent evidence
    purpose: Modulates hidden state using current input as reference
    New decoding interpretation introduced in the paper.

pith-pipeline@v0.9.0 · 5658 in / 1491 out tokens · 40867 ms · 2026-05-10T17:45:57.667162+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora et al. Simple linear attention language models balance the recall-throughput tradeoff. In ICML, 2024

  2. [2]

    Longformer: The long-document transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. 2020

  3. [3]

    Transformers are SSMs: Generalized models and efficient algorithms

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms. In ICML, 2024

  4. [4]

    MIPT-SSM: Scaling language models with O(1) inference cache via phase transitions

    Yasong Fan. MIPT-SSM: Scaling language models with O(1) inference cache via phase transitions. arXiv, 2026

  5. [5]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023

  6. [6]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In ICLR, 2022

  7. [7]

    The large-N limit of superconformal field theories and supergravity

    Juan Maldacena. The large-N limit of superconformal field theories and supergravity. International Journal of Theoretical Physics, 38:1113-1133, 1999

  8. [8]

    Pointer sentinel mixture models

    Stephen Merity et al. Pointer sentinel mixture models. In ICLR, 2017

  9. [9]

    In-context learning and induction heads

    Catherine Olsson et al. In-context learning and induction heads. Transformer Circuits Thread, 2022

  10. [10]

    RWKV: Reinventing RNNs for the transformer era

    Bo Peng et al. RWKV: Reinventing RNNs for the transformer era. In EMNLP, 2023

  11. [11]

    Retentive Network: A Successor to Transformer for Large Language Models

    Yutao Sun et al. Retentive network: A successor to transformer. arXiv:2307.08621, 2023

  12. [12]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. In NeurIPS, 2017

  13. [13]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao et al. Efficient streaming language models with attention sinks. In ICLR, 2024

  14. [14]

    Big Bird: Transformers for longer sequences

    Manzil Zaheer et al. Big Bird: Transformers for longer sequences. In NeurIPS, 2020

  15. [15]

    H2O: Heavy-hitter oracle for efficient generative inference

    Zhenyu Zhang et al. H2O: Heavy-hitter oracle for efficient generative inference. In NeurIPS, 2023